Axes of a revolution: challenges and promises of big data in healthcare
Abstract
Health data are increasingly being generated at a massive scale, at various levels of phenotyping and from different types of resources. Concurrent with recent technological advances in both data-generation infrastructure and data-analysis methodologies, there have been many claims that these events will revolutionize healthcare, but such claims are still a matter of debate. Addressing the potential and challenges of big data in healthcare requires an understanding of the characteristics of the data. Here we characterize various properties of medical data, which we refer to as ‘axes’ of data, describe the considerations and tradeoffs made when such data are generated, and outline the types of analyses that may achieve the tasks at hand. We then broadly describe the potential and challenges of using big data resources in healthcare, aiming to contribute to the ongoing discussion of their potential to advance the understanding of health and disease.
Main
Health has been defined as “a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity”. This definition may be expanded to view health not as a single state but rather as a dynamic process of different states at different points in time that together assemble a health trajectory. The ability to understand the health trajectories of different people, how they unfold along different pathways, how past events affect present and future health, and the complex interactions between different determinants of health over time is among the most challenging and important goals in medicine.
Following technological, organizational and methodological advances in recent years, a new and promising direction has emerged toward achieving those goals: the analysis of large medical and biological datasets. With the rapid increase in the amount of medical information available, the term ‘big data’ has become increasingly popular in medicine. This increase is anticipated to continue as data from electronic health records (EHRs) and other emerging sources, such as wearable devices and multinational efforts to collect and store data and biospecimens in designated biobanks, continue to expand.
Analyses of large-scale medical data have the potential to identify new and unknown associations, patterns and trends in the data that may pave the way to scientific discoveries in pathogenesis, classification, diagnosis, treatment and progression of disease. Such work includes using the data for constructing computational models to accurately predict clinical outcomes and disease progression, which have the potential to identify people at high risk and prioritize them for early intervention strategies, and to evaluate the influence of public health policies on ‘real-world’ data. However, many challenges remain for the fulfillment of these ambitious goals.
In this Review, we first define big data in medicine and the various axes of medical data, and describe data-generation processes, more specifically considerations for constructing longitudinal cohorts for obtaining data. We then discuss data-analysis methods, the potential goals of these analyses and the challenges for achieving them.
Big data in medicine
The definition of ‘big data’ is diverse, in part because ‘big’ is a relative term. Although some definitions are quantitative, focusing on the volume of data needed for a dataset to be considered big, other definitions are qualitative, focusing on the size or complexity of data that are too large to be properly analyzed by traditional data-analysis methods. In this Review, we refer to ‘big data’ as qualitatively defined.
Medical data have unique features compared with big data in other domains. The data may include administrative health data, biomarker data, biometric data (for example, from wearable technologies) and imaging, and may originate from many different sources, including EHRs, clinical registries, biobanks, the internet and patient self-reports. Medical data can also be characterized by, and vary along, dimensions such as (i) structured versus unstructured (for example, diagnosis codes versus free text in clinical notes); (ii) patient-care-oriented versus research-oriented (for example, hospital medical records versus biobanks); (iii) explicit versus implicit (for example, checkups versus social media); and (iv) raw versus ascertained (data without processing versus data after standardization and validation processes).
Defining axes of data
Health data are complex and have several different properties. As these properties are quantitative, we can view them as ‘axes’ of the data. Some properties may be easy to quantify, such as the number of participants, the duration of longitudinal follow up, and the depth, which may be calculated as the number of different types of data being measured. Other properties may be more challenging to quantify, such as heterogeneity, which may be computed using various diversity indices. In this context, healthcare data may be viewed as having the axes described below (Figs. 1 and 2b).
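As a toy illustration of quantifying one of the harder axes, the sketch below (in Python, with invented group labels) computes a Shannon diversity index over a categorical cohort attribute; this is only one of many possible diversity indices, and the attribute and numbers are hypothetical.

```python
import math
from collections import Counter

def shannon_diversity(labels):
    """Shannon diversity index of a categorical attribute.
    Higher values indicate a more heterogeneous cohort."""
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Two hypothetical cohorts of equal size but different mixes.
skewed = ["group_a"] * 95 + ["group_b"] * 5
balanced = ["group_a"] * 25 + ["group_b"] * 25 + ["group_c"] * 25 + ["group_d"] * 25

print(round(shannon_diversity(skewed), 2))    # ~0.20
print(round(shannon_diversity(balanced), 2))  # log(4) ~ 1.39
```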
Number of participants (axis N)
Sample size is an important consideration in every medical data source. In longitudinal cohorts, planning the desired cohort size—calculated on the basis of an estimate of the number of predefined clinical endpoints expected to occur during the follow-up period—is critical to reaching sufficient statistical power. As a result, a study of the trajectory of a rare disease before symptom onset would require a very large number of subjects and is often impractical. Retention rate is also important in determining the cohort size. The main limitations for increasing sample size are the recruitment rate, and financial and organizational constraints.
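The back-of-the-envelope sketch below illustrates, under assumed incidence and retention figures, how the expected number of endpoint events scales with cohort size and follow-up duration, and why rare outcomes push the required sample size up sharply; the numbers are hypothetical and do not replace a formal power calculation.

```python
# Rough planning sketch with assumed numbers: how many incident endpoint events
# can a cohort expect during follow-up, and how large must the cohort be to
# reach a target number of events?

def expected_events(n_participants, annual_incidence, years, retention=1.0):
    """Expected endpoint events, assuming a constant annual incidence and an
    average retention fraction over the follow-up period."""
    return n_participants * retention * annual_incidence * years

def min_cohort_size(target_events, annual_incidence, years, retention=1.0):
    """Cohort size needed to observe a target number of events."""
    return target_events / (retention * annual_incidence * years)

print(expected_events(10_000, 0.01, 10, retention=0.8))    # common outcome: ~800 events

# Aiming for ~200 events (e.g., ~10 events per predictor for a 20-predictor model):
for incidence in (0.01, 0.0001):                           # common versus rare outcome
    print(incidence, round(min_cohort_size(200, incidence, 10, retention=0.8)))
# The rare outcome requires a cohort ~100x larger for the same number of events.
```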
Depth of phenotyping (axis D)
Medical data may range from the molecular level up to the level of social interactions among subjects. They may focus on one specific organ or system in the body (such as the immune system) or may be more general and contain information about the entire body (as with total-body magnetic resonance imaging).
At the molecular level, data may be obtained by a variety of methods that analyze a diverse array of ‘omics’ data, which broadly represents the information contained within a person’s genome and its biological derivatives. Omics data may include transcriptional, epigenetic, proteomic and metabolomic data. Another rich source of omic-level information is the human microbiome, the collective genome of trillions of microbes that reside in the human body.
Additional phenotypes that may be obtained include demographics and socioeconomic factors (for example, ethnicity and marital status), anthropometrics (for example, weight and height measurements), lifestyle habits (for example, smoking, exercise and nutrition), physiome or continuous physiological measurements (for example, blood pressure, heart rhythm and glucose measurements, which can be measured by wearable devices), clinical phenotyping (for example, diagnoses, medication use, medical imaging and procedure results), psychological phenotyping and environmental phenotyping (for example, air pollution and radiation levels from environmental sensors that connect with smartphones). Diverse data types pose an analytical challenge, as their processing and integration require in-depth technical knowledge about how these data were generated, the relevant statistical analyses, and the quantitative and qualitative relationships between different data types.
In the construction of a prospective cohort, the choice of the type and depth of information to measure is challenging and depends on many considerations. Each test should be evaluated on the basis of its relevance, reliability and required resources. Relevance draws on other epidemiological studies that have found significant associations with the studied health outcomes. Reliability entails selecting methods that pass quality testing, including calibration, maintenance, ease of use, training, monitoring and data transfer. Resources include both capital and recurrent costs.
Additional considerations include finding the right balance between exploiting known data types (such as genomic information) and exploring new types of data (such as new molecular assays) that have not been previously studied for the scientific question and are therefore more risky but may lead to new and exciting discoveries (hence exploration versus exploitation). It is also important to consider that the rapid acceleration of newer and cheaper technologies for data processing, storage and analysis will hopefully enable measurements of more data types and for larger cohorts as time progresses. One striking example is the cost of DNA sequencing, which decreased over one-million-fold in the past two decades. Another consideration is the possibility that the mechanisms sought, and the answers to the scientific questions, depend on components that we cannot currently measure; therefore, considering which biospecimens to store for future research is also important.
Longitudinal follow-up (axis F)
Longitudinal follow-up includes the total duration of follow-up, the time intervals between data points (or follow-up meetings, in the case of longitudinal cohorts), and the availability of different data types at each point. Long-term follow-up allows observation of the temporal sequence of events.
It has been hypothesized that the set point of several physiological and metabolic responses in adulthood is affected by stimuli or insults during the critical period of embryonic and fetal development, a concept known as ‘fetal programming’. For example, associations between low birthweight and type 2 diabetes mellitus, coronary heart disease and elevated blood pressure have been demonstrated. Therefore, for full exploration of disease mechanisms, the follow-up period should ideally be initiated as early as possible, with data collection starting from the preconception stage, followed by the pregnancy period, delivery, early and late childhood, and adulthood (hence the ‘from pre-womb to tomb’ approach). Although such comprehensive information is rarely available in most data sources, large longitudinal studies that recruit women during pregnancy are emerging, such as the Born in Guangzhou Cohort Study and the Avon Longitudinal Study of Parents and Children.
Another important consideration in longitudinal cohorts is adherence of the participants to follow-ups. Selection bias owing to loss to follow-up may negatively affect the internal validity of the study. For example, the UKBiobank was criticized as having selection bias because of the low response rate by participants (5.5%). Disadvantaged socioeconomic groups, including ethnic minorities, are more likely to drop out and thus possibly bias the results. It is therefore important to consider the effect of various retention strategies on different subpopulations in longitudinal studies, specifically for studies with a long follow-up period. To increase adherence to follow-ups, incentives are sometimes used. For example, the Genes for Good study uses incentives such as interactive graphs and visualizations of survey responses, as well as personal estimates of genetic ancestry, for participant retention.
Interactions between subjects included in the data (axis I)
The ability to connect each subject in the data to other people who are related to him or her is fundamental to the ability to explore mechanisms of disease onset and progression, and gene-environment interactions. Such relations may be genetic, which would allow calculation of the genetic distance between different people, or environmental, such as identifying people who share the same household, workplace, neighborhood or city. Intentional recruitment of subjects with genetic or environmental interactions increases the power to answer these scientific questions. One example is twin cohorts, such as the Finnish Twin Cohort or recruitment of family triads of mothers, fathers and their offspring, such as The Norwegian Mother and Child Cohort Study. Of note, recruitment of genetically related people or people from the same environment may result in decreased heterogeneity and diversity of the cohort.
Heterogeneity and diversity of the cohort population (axis H)
Heterogeneity and diversity encompass factors such as age, sex, race, ethnicity, disability status, socioeconomic status, educational level and geographic location. The process of selecting a cohort that will fully represent the real-world population is challenging. Challenges arise from a variety of historical, cultural, scientific and logistical factors, as the inclusion process involves several steps: selection of a subject for inclusion in the study, consent of the subject, and selection of the subject's data to be analyzed by the study researchers. Sampling bias may arise at each of these steps, as different factors may affect them. One example is volunteer bias, as it has been shown that people who are willing to participate in studies may be systematically different from the general population.
However, high heterogeneity in the study population and inclusion of disadvantaged socioeconomic groups are important for generalization of the results to the entire population. Medical research of under-represented minorities and people of non-European ancestry is often lacking in many fields. One of the most prominent examples of this is in genetics, in which the vast majority of participants in genome-wide association studies are of European descent. Many other fundamental studies in medicine have included only a relatively homogenous population. For example, the original Framingham Heart Study, which included residents of the city of Framingham, Massachusetts, and the Nurses’ Health Study, which included registered American nurses, were relatively homogeneous in environmental exposures and education level, respectively. Thus, although many important studies were based on these cohorts, the question of whether their conclusions apply to the general population remains open. Current studies such as the All of Us Research Program define heterogeneity as one of their explicit goals, with more than 80% of the participants recruited so far being from historically under-represented groups.
Nonetheless, increasing the heterogeneity of the study population (for example, by including participants of a young age) may increase the variability in the phenotype tested and decrease the anticipated rate of clinical endpoints expected to occur during the study period, and therefore will require a larger sample size to reach significant results.
Standardization and harmonization of data (axis S)
Health data may come from many disparate sources. Using these sources to answer clinical research questions requires comparing and analyzing them concurrently. Thus, harmonizing data and maintaining a common vocabulary are important. Data can either be collected in a standardized way (for example, ICD-9 diagnoses and structured, validated questionnaires) or be categorized at a later stage by standard definitions.
Standardizing medical data into a universal format will enable collaborations across multiple countries and resources. For example, the Observational Health Data Sciences and Informatics initiative is an international collaborative effort to create open-source unified common data models from a transformed large network of health databases. This enables a significant increase in sample size and in heterogeneity of data, as shown in a recent study that examined the effectiveness of second-line treatment of type-2 diabetes, using data made available by the Observational Health Data Sciences and Informatics initiative from 246 million patients from multiple countries and cohorts.
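The toy sketch below illustrates the idea of harmonization: mapping source-specific diagnosis codes to a shared vocabulary so records from different systems can be pooled. It is not the actual Observational Health Data Sciences and Informatics tooling; the mapping table, source names and code values are hypothetical.

```python
# Toy illustration (not the OHDSI/OMOP tooling): harmonizing diagnosis codes
# from two hypothetical sources into one shared vocabulary so that records can
# be pooled and queried together.

# Hypothetical mapping from source-specific codes to a common concept.
SOURCE_TO_COMMON = {
    ("hospital_a", "250.00"): "type_2_diabetes",   # ICD-9-style code
    ("hospital_b", "E11.9"):  "type_2_diabetes",   # ICD-10-style code
    ("hospital_a", "401.9"):  "hypertension",
    ("hospital_b", "I10"):    "hypertension",
}

def harmonize(records):
    """Map (source, local_code) pairs to common concepts; flag unmapped codes."""
    harmonized, unmapped = [], []
    for source, code, patient_id in records:
        concept = SOURCE_TO_COMMON.get((source, code))
        (harmonized if concept else unmapped).append(
            (patient_id, concept or f"{source}:{code}")
        )
    return harmonized, unmapped

records = [("hospital_a", "250.00", "p1"), ("hospital_b", "E11.9", "p2"),
           ("hospital_b", "Z99.9", "p3")]
pooled, needs_review = harmonize(records)
print(pooled)        # both diabetes records are now comparable across sources
print(needs_review)  # unmapped codes go to manual curation
```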
Another interesting solution is to characterize and standardize descriptions of datasets in a short identification document that will accompany them, a concept described as ‘datasheets for datasets’. Such a document will include the characteristics, motivations and potential biases of the dataset.
Linkage between data sources (axis L)
The ability to link different data sources and thereby retrieve information on a specific person from several data sources is also of great value. For example, UKBiobank data are partially linked to existing health records, such as those from general practice, hospitals and central registries. Linking EHRs with genetic data collected in large cohorts enables the correlation of genetic information with hundreds to thousands of phenotypes identified by the EHR.
For this linkage to be possible, each person should be issued a unique patient identifier that applies across databases. However, mostly owing to privacy and security concerns, unique patient identifiers are currently not available. To tackle this, two main approaches have been suggested. The first is to create regulation and legislative standards to ensure the privacy of participants. The second is to give patients full ownership of their own information, thereby allowing them to choose whether to permit linkage to some or all of their medical information. For example, Estonia was the first country to give its citizens full access to their EHRs. The topic of data ownership is debatable and has been discussed elsewhere.
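One commonly proposed technical building block for such linkage, sketched below with hypothetical identifiers and a made-up shared key, is to replace direct identifiers with keyed hashes so that datasets can be joined on pseudonyms without exchanging the identifiers themselves; real deployments add identifier normalization, key governance and legal safeguards on top of this.

```python
# Minimal sketch of privacy-preserving record linkage: each data holder replaces
# the direct identifier with a keyed hash (pseudonym) using a secret shared key,
# so records can be joined without exchanging raw identifiers.
import hmac, hashlib

SHARED_KEY = b"agreed-upon-secret"   # hypothetical key managed by a trusted party

def pseudonymize(national_id: str) -> str:
    return hmac.new(SHARED_KEY, national_id.encode(), hashlib.sha256).hexdigest()

ehr_records = {pseudonymize("123456789"): {"hba1c": 6.8}}
biobank_records = {pseudonymize("123456789"): {"genotype": "rs7903146:TT"}}

# Linkage by pseudonym, without either side revealing the identifier itself.
linked = {pid: {**ehr_records[pid], **biobank_records[pid]}
          for pid in ehr_records.keys() & biobank_records.keys()}
print(linked)
```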
Additional aspects of medical data have been previously described as part of the FAIR principles for data management: findable, accessible, interoperable and reusable. The data should be (i) findable, specifically registered or indexed in a searchable resource, because knowing which data exist is not always easy; (ii) accessible, as access to data by the broad scientific community is important for the data to reach their full scientific potential; (iii) interoperable, using a formal, accessible and broadly applicable language for knowledge representation, which is also part of the standardization axis described above; and (iv) reusable, which includes developing tools for scalable and replicable science, a task that requires attention and resources.
How is big data generated?
Longitudinal population studies and biobanks represent two sources of big data. Whereas much of the medical data available for analysis is passively generated in healthcare systems, new forms of biobanks, which actively generate data for research purposes, have been emerging in recent years. Biobanks were traditionally defined as collections of various types of biospecimens. This definition has been expanded to “a collection of biological material and the associated data and information stored in an organized system, for a population or a large subset of a population”. Biobanks have increased in variety and capacity, combining different types of phenotyping data; this has created rich data resources for research. Unlike traditional, single-hypothesis-driven studies, these rich datasets try to address many different scientific questions. The prospective nature of these studies is especially important, because the effects of different factors on disease onset can be analyzed.
Although the concept of mega-biobanks is not well defined in the literature, it can be viewed qualitatively as describing biobanks that integrate many of the data axes mentioned above at a broad scale: data measured on large sample sizes (axis N) together with deep phenotyping of each subject (axis D) over a long follow-up period (axis F), collected and stored with standardization (axis S), and allowing interactions between participants (axis I) and linkage with external sources (axis L) to be studied. Prominent examples include UKBiobank, the All of Us Research Program, the Kadoorie biobank, the Million Veteran Program and the Mexico City study, among others. A comprehensive survey of existing biobanks is presented in the review in ref.
‘Deep cohorts’: a tradeoff between axes
In the construction of a biobank or a longitudinal cohort, each of the axes of data mentioned above has to be carefully assessed, as each has its costs and benefits. Limited research resources dictate an inherent tradeoff between different axes, and the ideal dataset that measures everything on everybody is unattainable. One necessary tradeoff is between the scale of the data gathered (axis N) and the depth of the data (axis D). For example, EHRs can contain medical information on millions of people but rarely include any molecular phenotypes or lifestyle assessments. Another example is the ‘N-of-1 trial’, which could be used as a principled way to design trials for personalized medicine or to run a deep multidimensional profile of carefully selected subjects.
Medium-sized cohorts of hundreds to tens of thousands of people represent an interesting operating point, as they allow collection of full molecular and phenotypic data on a large enough population and thus enable the study of a wide variety of scientific questions. We term such cohorts ‘deep cohorts’.
Since subtle disease patterns may be detected only when the data include deep enough phenotyping (axis D) of a sufficient sample size (axis N), deep cohorts that apply the most advanced technologies to phenotype, collect and analyze data from medium-sized populations may have immense scientific potential. For example, we previously collected data for a cohort of over 1,000 healthy people and deeply phenotyped it for genetics, oral and gut microbiome, immunological markers, serum metabolites, medical background, bodily physical measures, lifestyle, continuous glucose levels and dietary intake. This cohort allowed us to study many scientific questions, such as the inter-person variability in post-meal glucose responses, the ability to predict human traits from microbiome data, factors that shape the composition of the microbiome, and associations between microbial genomic structural variants and host disease risk factors. We are following this cohort longitudinally and expanding its number of participants tenfold, as well as adding new types of assays, with the goal of identifying molecular markers for disease with diagnostic, prognostic and therapeutic value. Other examples of medium-sized cohorts include the University College London–Edinburgh–Bristol Consortium, which performs large-scale, integrated genomics analyses and includes roughly 30,000 subjects, and the Lifelines cohort, which deeply phenotyped a subset of ~1,000 of its ~167,000 participants for microbiome, genetics and metabolomics.
The other axes of medical data mentioned above also require financial resources. Therefore, planning a prospective cohort warrants careful consideration of these tradeoffs and utilization of cost-effective strategies. For example, both the duration of longitudinal follow-up, and the number and types of tests that are performed during follow-up visits (axis F) have financial costs. Increasing the heterogeneity of the cohort (axis H) may also come at a cost: in the All of Us Research Program, US National Institutes of Health funding was provided to support recruitment of community organizations to increase the cohort’s racial, ethnic and geographic diversity. Additional tradeoffs are very likely to come up when collecting data, some of which we discussed above in the individual axes sections. The tradeoffs between different axes of medical data and specifically between scale (axis N) and depth (axis D) are presented in Fig. 2.
Numerous additional challenges exist in the construction of a large longitudinal cohort. Many of the challenges that arise from the collection, storage, processing and analysis of any medical data (as discussed in the ‘Potential and challenges’ subsection below) are amplified as the scale and the complexity of the data increase. In most cases, specialized infrastructure and expertise are needed to overcome these challenges, as the generation of new cost-effective high-throughput data requires expertise in different fields. In addition, many research applications emanating from these sources of data are interdisciplinary in nature. This presents an organizational challenge in creating collaborations between clinicians and data scientists, and in educating physicians to understand and apply tools for large-scale data sources.
Participant compliance with the study protocol is also essential for ensuring the scientific merit of the data; examples include fasting before blood tests and accurately logging daily nutrition and activities in a designated application. Compliance assessment can itself be challenging, as it often relies on self-reporting by participants. Finally, maintaining public trust and careful consideration of legal and ethical issues, especially those regarding privacy and de-identification of study participants, are crucial to the success of these studies.
Constructing a biobank requires considerable resources and, as a result, biobanks are much harder to establish in low- and middle-income countries. As a result, these populations remain under-represented and under-studied. The geographical distribution of the main biobanks worldwide is presented in Fig. 3.
How is big data analyzed?
How can utilization of these massive datasets achieve the potential of medical data analyses? How can we bridge the gap between the collected data and our understanding and knowledge of human health? The answer to these questions can be broadly described by the common term ‘data science’. Data science has been described as comprising three distinct analysis tasks: description, prediction and counterfactual prediction. This distinction holds for medical data of any type and scale, and helps resist the temptation to conflate the different types of questions that can be asked of the data. These tasks can be defined and used as described below.
Descriptive analysis
Descriptive analysis can be broadly defined as “using data to provide a quantitative summary of certain features of the world”. A few examples include retrospective analyses of the dynamics of body mass index (BMI) in children over time in order to define the age at which sustained obesity develops, and of the correlation between inter-individual differences in normal body temperature and mortality. Descriptive analysis approaches are useful for unbiased exploratory study of the data and for finding interesting patterns, which may lead to testable hypotheses.
Prediction analysis
Prediction analysis aims to learn a mapping from a set of inputs to some outcome of interest, such that the mapping can later be used to predict the outcome from the inputs in a different, unseen set. It is thus applied in settings in which there is a well-defined task. Prediction analysis holds the potential for improving disease diagnosis and prognosis (as discussed in the ‘Potential and challenges’ subsection below). Of note, the ability to construct accurate predictive models is heavily reliant on the availability of big data. Perhaps the most striking and famous examples are the recent advances in neural networks, which rely heavily on data of a large enough scale and on advances in computing infrastructure to enable the construction of prediction models.
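A minimal sketch of this supervised setting is shown below on synthetic data standing in for EHR-derived features, assuming scikit-learn is available: a model is fitted on one set of patients and evaluated on patients it has never seen.

```python
# Minimal sketch of prediction analysis on synthetic data standing in for EHR
# features: learn a mapping from inputs to an outcome on one set of patients and
# evaluate it on patients never seen during training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 20))                    # e.g., labs, vitals, age
risk = X[:, 0] + 0.5 * X[:, 1]                  # hypothetical true signal
y = rng.binomial(1, 1 / (1 + np.exp(-risk)))    # binary outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict_proba(X_test)[:, 1]
print("held-out AUC:", round(roc_auc_score(y_test, pred), 3))
```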
Algorithmic advances in image, sequence and text processing have been phenomenal in recent years, riding on the wave of big data and deep-learning methods. Taking the field of image recognition as an example, one of the most important factors in its recent success was the creation and curation of a massive image dataset known as ‘ImageNet’. One hope is that the accumulation of similarly large, ascertained datasets in the medical domain can advance healthcare tasks at a magnitude similar to that of the change in image-recognition tasks. Prominent examples are Physionet and the MIMIC dataset, which have been instrumental in advancing machine-learning efforts in health research. These data have been used for competitions and as benchmarks for several years, and are increasing in size and depth. Reviews on the potential of machine learning in health are provided in refs.
One particularly promising direction of deep learning combined with massive datasets is that of ‘representation learning’; that is, finding the appropriate data representation, especially when the data are high-dimensional and complex. Healthcare data are usually unstructured and sparse, and can be represented by techniques ranging from those based on domain knowledge to fully automated approaches. The representation of medical data with all of their derivatives (clinical narratives, examination reports, lab tests and others) should be in a form that enables machine-learning algorithms to learn the best-performing models from them. In addition, the data representation may transform the raw data into a form that allows human interpretability with the appropriate model design.
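As a simple stand-in for representation learning, the sketch below compresses a synthetic, sparse patient-by-diagnosis-code matrix into dense per-patient embeddings with truncated singular value decomposition; the deep models referred to here (for example, autoencoders or sequence models) learn richer, nonlinear representations, but the goal of producing a compact representation that downstream models can learn from is the same.

```python
# A simple stand-in for representation learning: compress a sparse patient-by-
# diagnosis-code matrix into a low-dimensional embedding per patient.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# 1,000 synthetic patients x 5,000 possible diagnosis codes, mostly zeros.
codes = sparse_random(1000, 5000, density=0.002, format="csr", random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
embeddings = svd.fit_transform(codes)   # one 50-dimensional vector per patient
print(embeddings.shape)                 # (1000, 50)
```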
Counterfactual prediction
One major limitation of any observational study is its inability to answer causal questions, as observational data may be heavily confounded and contain other limiting flaws. These confounders may lead to high predictive power of a model that is driven by a variety of health processes rather than a true physiological signal. Although proper study design and use of appropriate methods tailored to the use of observational data for causal analysis may alleviate some of these issues, this remains an important open problem. One promising direction that uses some of the data collected at large scale to tackle causal questions is Mendelian randomization. Studies involving large-scale genetic data and phenotypes combined with prior knowledge may have some ability to estimate causal effects. Counterfactual prediction thus aims to construct models that address limiting flaws inherent to observational data for inferring causality.
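The toy simulation below illustrates the intuition behind Mendelian randomization using a simple Wald ratio: a genetic variant serves as an instrument for an exposure, recovering an effect close to the true one even when an unmeasured confounder distorts the naive observational association. The data, effect sizes and estimator are illustrative only and far simpler than real analyses.

```python
# Toy illustration of the idea behind Mendelian randomization: a genetic variant
# is used as an instrument for an exposure, and the causal effect of the exposure
# on the outcome is estimated with a Wald ratio, even though an unmeasured
# confounder distorts the naive observational association. Simulated data only.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
g = rng.binomial(2, 0.3, size=n)                  # genotype: 0/1/2 risk alleles
confounder = rng.normal(size=n)                   # unmeasured confounder
exposure = 0.3 * g + confounder + rng.normal(size=n)
outcome = 0.5 * exposure + 2.0 * confounder + rng.normal(size=n)  # true effect = 0.5

beta_gx = np.polyfit(g, exposure, 1)[0]           # variant -> exposure association
beta_gy = np.polyfit(g, outcome, 1)[0]            # variant -> outcome association
naive = np.polyfit(exposure, outcome, 1)[0]       # confounded observational slope

print("confounded estimate:", round(naive, 2))               # ~1.5, far from 0.5
print("Wald ratio estimate:", round(beta_gy / beta_gx, 2))   # close to 0.5
```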
Potential and challenges
The promise of medical big data depends on the ability to extract meaningful information from large-scale resources in order to improve the understanding of human health. We discussed some of the potential and challenges of medical data analysis above. Additional broad categories that can be transformed by medical data are discussed below.
Disease diagnosis, prevention and prognosis
The use of computational approaches to accurately predict the future onset of clinical outcomes has the potential to enable early diagnosis and to prevent or decrease the occurrence of disease in both community and hospital settings. As some clinical outcomes, such as cardiovascular disease, have well-established modifiable risk factors, prediction of these outcomes may enable early, cost-effective and focused preventive strategies for high-risk populations in the community setting. In the hospital setting, and specifically in intensive care units, early recognition of life-threatening conditions enables an earlier response from the medical team, which may lead to better clinical outcomes. Numerous prediction models have been developed in recent years. One recent example is the prediction of inpatient episodes of acute kidney injury. Another example is the prediction of sepsis, as the early administration of antibiotics and intravenous fluids is considered crucial for its management. Several machine-learning-based sepsis-prediction algorithms have been published, and a randomized clinical trial demonstrated the beneficial real-life potential of this approach, decreasing patient length of stay in the hospital and in-hospital mortality.
Similarly, the same approach can be used to predict the prognosis of a patient with a given clinical diagnosis. Identifying subgroups of patients who are most likely to deteriorate or to develop certain complications of the disease can enable targeting of these patients with strategies such as a more frequent follow-up schedule, changes in the medication regimen or a shift from traditional care to palliative care.
Devising a clinically useful prediction model is challenging for several reasons. The predictive model should be continuously updated, accurate, well calibrated and delivered at the individual level with adequate time for early and effective intervention by clinicians. It should help identify the population in which an early diagnosis or prognosis will benefit the patient. Therefore, prediction of unpreventable or incurable disease is of less immediate use, although such models may become clinically relevant in the future as new therapeutics and prevention strategies emerge. Another important consideration is model interpretability, which includes understanding the mechanism by which the model works; that is, model transparency or post hoc explanations of the model. Defining the very notion of interpretability is not straightforward, and it may mean different things in different contexts. Finally, a predictive model should strive to be cost-effective and broadly applicable. A model based on existing information in EHR data is much more economical than a model based on costly molecular measurements.
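The short sketch below, on synthetic predictions and assuming scikit-learn, illustrates what checking calibration (as distinct from discrimination) can look like: within bins of predicted risk, the observed event rate is compared with the predicted probability.

```python
# Sketch of a calibration check: within bins of predicted risk, does the observed
# event rate match the predicted probability? Synthetic risks and outcomes are
# used purely for illustration.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
true_risk = rng.uniform(0.0, 0.5, size=20_000)
y = rng.binomial(1, true_risk)                       # outcomes drawn from true risk

well_calibrated = true_risk
overconfident = np.clip(true_risk * 2.0, 0.0, 1.0)   # systematically overestimates risk

for name, pred in [("well calibrated", well_calibrated),
                   ("overconfident", overconfident)]:
    observed, predicted = calibration_curve(y, pred, n_bins=5)
    print(name, np.round(observed, 2), np.round(predicted, 2))
```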
The real-life success of a predictive model depends both on its performance and on the efficacy of the prevention strategies that physicians can apply when they receive the information output by the model. One concern about the real-life implementation of prediction models is that it will eventually result in over-diagnosis. Through the use of highly sensitive technologies, it is possible to detect abnormalities that would either disappear spontaneously or have a very slow and clinically unimportant progression. As a result, it is possible that more people will be unnecessarily labeled as being at high risk. Another concern is algorithmic bias, which may be introduced in many ways. For example, it has been shown that an algorithm that is widely used by health systems exhibits racial bias. Thus far, very few predictive models have been assessed in a real-life setting, and more studies are needed to validate the clinical utility of these tools for each specific clinical application.
Modeling disease progression
Chronic diseases often progress slowly over a long period of time. Whereas some medical diagnoses are currently based on predefined thresholds, such as a hemoglobin A1C percentage of 6.5% or above for the diagnosis of diabetes mellitus, or a BMI of 30 kg/m2 or above for the diagnosis of obesity (https://www.who.int/topics/obesity/en/), these diseases may be viewed as a continuum rather than as a dichotomous state. Modeling the continuous nature of chronic diseases and their progression over time is often challenging for many reasons, such as incompleteness and irregularity of the data, and heterogeneity of patient comorbidities and medication usage. Large-scale deep phenotyping of subjects can help overcome these challenges and allow a better understanding of disease progression. Notably, this view of disease as a continuum may allow the study of early stages of disease in healthy cohorts, without confounders such as medications and treatments, provided that the disease markers are well defined, measured and span enough variation in the studied population. Diabetes (diagnosed via hemoglobin A1C percentage), obesity (diagnosed via BMI) and hyperlipidemia (diagnosed via cholesterol values) are good examples in which this can be done, and may lead to the definition of disease risk scores for various diseases.
Genetic and environmental influence on phenotypes
The information on genetic and environmental exposures collected in biobanks, combined with data on health outcomes, can also lead to many discoveries about the effects of genetic and environmental determinants on disease onset and progression—that is, nature versus nurture—and to quantification of the magnitude of each of these determinants. Despite many advances in genetic research over the past decades, major challenges such as small sample sizes and low population heterogeneity remain. This has led to the emergence of EHR-driven genomic research, which combines genomic data with the phenotypic characterizations available in the EHR and enables the effect size of a genetic variant to be estimated not for one disease or trait but for many diseases simultaneously, in what is called a ‘phenome-wide association study’. However, the use of large-scale data sources also raises challenges in standards for defining disease and in efforts to extract characteristics of patients from EHRs, which is not always a straightforward task. To do so, one needs to incorporate medical knowledge of the data-generation process and validate the algorithms for extraction from raw data (https://www.phekb.org/).
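The outline below sketches the logic of a phenome-wide association study on simulated data: one genetic variant is tested against many EHR-derived case/control phenotypes, with a simple multiple-testing correction; real analyses additionally adjust for covariates such as age, sex and ancestry, and use curated phenotype definitions.

```python
# Toy outline of a phenome-wide association study (PheWAS): test one genetic
# variant against many EHR-derived case/control phenotypes and correct for the
# number of tests. Simulated data; covariate adjustment is omitted.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_phenotypes = 20_000, 200
genotype = rng.binomial(2, 0.25, size=n)

pvals = []
for k in range(n_phenotypes):
    effect = 0.3 if k == 0 else 0.0           # only phenotype 0 is truly associated
    logit = -2.0 + effect * genotype
    cases = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    # 2x3 contingency table: case status x genotype dosage
    table = np.array([[np.sum((cases == c) & (genotype == g)) for g in (0, 1, 2)]
                      for c in (0, 1)])
    pvals.append(stats.chi2_contingency(table)[1])

threshold = 0.05 / n_phenotypes               # Bonferroni correction
hits = [k for k, p in enumerate(pvals) if p < threshold]
print("phenotypes passing correction:", hits)  # expected: [0]
```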
Target identification
The development of new drugs is a very complex process, with over 90% of the chemical entities tested never making it to market. This process starts with the identification of disease-relevant phenotypes and includes basic research, target identification and validation, lead generation and optimization, preclinical testing, phased clinical trials in humans, and regulatory approval (Fig. 4). Target identification, defined as ‘identifying drug targets for a disease’, and target validation, defined as ‘demonstrating an effect of perturbation of the target on disease outcomes and related biomarkers’, are essential parts of drug discovery and development.
The traditional pharmaceutical industry’s screening process for the identification of new drug targets is costly and long, and includes activity assays, in which the compounds are tested through the use of high-throughput methods, based on interaction with the relevant target proteins or selected cell lines, and low-throughput methods, run on tissues, organs or animal models. This traditional screening method is characterized by a high dropout rate, with thousands of failures per one successful drug candidate. Animal models are often used for these tasks, but they have a substantial disadvantage in the development of new drugs because their limited congruence with many human diseases severely affects their translational reliability.
There is thus a great need for new approaches to drug development. Human multi-omics data and physiological measurements at scale from deeply phenotyped cohorts are one such direction, and this is considered one of the most promising applications of big-data analysis in medicine, as humans themselves will serve as the future model organisms. First, analysis of large-scale health data may identify new, unknown associations and therefore may allow the discovery of new biomarkers and novel drug targets, such as by mapping existing genetic-association findings to drug targets and compounds. Second, analysis of biological and medical data may be used to evaluate the chances of success of drugs discovered and tested in animal models before the costly and time-consuming stages of preclinical and clinical trials. Third, potential therapeutic interventions discovered via human data analysis with an established safety profile, such as nutritional modifications or supplements and drugs already approved by the US Food and Drug Administration, may be considered for direct evaluation in human clinical trials (Fig. 4). Finally, human data can be used to investigate differences in drug response and potential side effects. Since some drugs affect only a subset of the treated target patient population, using human data to distinguish responders from non-responders, and to prioritize responders for clinical trials, can have great utility. Analysis of large-scale human omics data therefore has the potential to accelerate drug development and reduce its cost. Indeed, it has been estimated that selecting targets with supporting evidence from human genetics data may double the success rate of clinical drug development.
Systematic analysis of large-scale data by various computational approaches can also be used to obtain meaningful interpretations for the repurposing of existing drugs. For example, clinical information from over 13 years of EHRs that originated from a tertiary hospital has led to the identification of over 17,000 known drug–disease associations and to the identification of terbutaline sulfate, an anti-asthmatic drug, as a candidate drug for the treatment of amyotrophic lateral sclerosis. Another example is the use of publicly available molecular data for the discovery of new candidate therapies for inflammatory bowel disease.
Improvement of health processes
Big-data analysis can allow the investigation of health-policy changes and optimization of health processes. It has the potential to reduce diagnostic and treatment errors, eliminate redundant tests and provide guidance for better distribution of health resources. Realizing the potential of this direction requires close interaction with medical organizations in order to map the existing processes, understand the clinical implications, and decide on the desired operating points, tradeoffs and costs of mis- and over-diagnoses.
Disease phenotyping
Phenotyping of disease and health and the study of variation between people represent another potential of studying rich and novel types of data. For example, we previously characterized the variation between healthy people in response to food, based on deep phenotyping of a 1,000-person cohort that included, to our knowledge, the first large-scale continuous glucose monitoring and gut microbiota profiling of healthy people.
Another potential is to refine the current phenotyping of disease. For example, there have been attempts to refine the classification of type 2 diabetes and find subgroups from available data. Another example is Parkinson’s disease, for which recent advances in genetics, imaging and pathological findings, coupled with the observed clinical variability, have profoundly changed the understanding of the disease. Parkinson’s disease is now considered to be a syndrome rather than a single entity, and the International Parkinson and Movement Disorders Society has commissioned a task force to redefine the disease.
Precision medicine
Analysis of big data in health that takes into account individual variability in omics data, environment and lifestyle factors may facilitate the development of precision medicine and novel prevention and treatment strategies. However, caution should be taken, with careful assessments of how much of the change observed in the phenotype tested is due to variability within people. It is not obvious that many of the medical questions of interest will be answered through big datasets. Historically, small and well-designed experiments were the primary drivers of medical knowledge, and the burden of showing a change in this paradigm is now put on new methodologies.
Conclusion
Big data in medicine may provide the opportunity to view human health holistically, through a variety of lenses, each presenting an opportunity to study different scientific questions. Here we characterized health data by several axes that represent different properties of the data. The potential scientific value of collecting large amounts of health data on human cohorts has recently been recognized, with a rapid rise in the creation of large-scale cohorts aiming to maximize these axes. However, since maximizing each axis requires both resources and effort, it is inevitable that some axes come at the expense of others. Analysis of big data in health has many challenges and is in some sense a double-edged sword. On one hand, it provides a much wider perspective on states of health and disease; on the other hand, it presents the temptation to delve into the details of molecular descriptions and thereby miss the big picture (as in the ‘seeing the whole elephant’ analogy). In addition, real-world evidence that big-data analysis will translate into improved quality of care is currently lacking. However, the potential to improve healthcare is still immense, especially as patients’ conditions and medical technologies become more and more complex over time. With the collection of more deeply phenotyped large-scale data, many scientific questions about disease pathogenesis, classification, diagnosis, prevention, treatment and prognosis can be studied, potentially leading to new discoveries that may eventually revolutionize medical practice.
https://www.nature.com/articles/s41591-019-0727-5