Data as a Surrogate for Clinical Phenomena

Using Electronic Health Record Data in Pragmatic Clinical Trials

Section 2


Rachel Richesson, MS, PhD, MPH

Richard Platt, MD, MSc

Gregory Simon, MD, MPH

Lesley Curtis, PhD

Reesa Laws, BS

Adrian Hernandez, MD, MSH

Jon Puro, MPA-HA

Doug Zatzick, MD

Erik van Eaton, MD, FACS

Vincent Mor, PhD


Contributing Editor

Karen Staman, MS

When designing pragmatic trials that use EHR systems and data, it is important to remain aware at every step, from the conception of the study through its design, conduct, and analysis, that data obtained from the EHR and administrative systems of health care organizations are by definition potentially incomplete and may be influenced or biased by the conditions and incentives of clinical practice. Other limiting factors include the technical constraints of the given EHR system, the amount of time available in a clinical encounter, the focus of the encounter on other matters (such as acute care), and the ability of clinic staff and providers to adequately document all aspects of the encounter that might be important for future research or investigation.

As shown in the figure below, data from EHR and administrative systems can serve as a surrogate for some real event, but they do not indicate the presence of a condition or clinical phenomenon with certainty. Different forces (e.g., organizational, sociopolitical, psychological, and technical) influence how clinical observations are made, documented, and interpreted, and each of the steps shown in the figure is a possible source of information loss or error. Investigators designing pragmatic trials should be aware of all of these possible sources of error and bias in the data sources they use, and should also identify proactive strategies to reduce the error and its impact on a trial.


Error Impact on Trials (figure adapted from Hripcsak et al. 2009; used with permission)


Because data recorded in an EHR provide only a partial, and likely incomplete, representation of the conditions and events they describe, these data may be appropriate for some uses yet inadequate for others. Even research-quality measurement instruments rarely offer the “truth” about patients’ health status; rather, they offer measurements that should be reliably reproducible. An understanding of the nature, limitations, and structure of EHR data is important for evaluating their fitness for different possible uses. Consequently, any consideration of using EHR data in a pragmatic clinical trial (PCT) should begin with several key questions:

  • What exactly is the phenomenon you are trying to identify or measure?
  • In what type of health care activity, event, documentation, or data value could a “signal” of that phenomenon be detected? (e.g., Ordering of a test? Documenting a diagnosis? Prescribing a medication? Noting an abnormal laboratory test value? Referring a patient to another provider?)
  • What are the sources of error for each of those health care activities, events, documentation, or data collection? (Who makes the clinical judgment recorded in the EHR? Who orders the clinical test entered into the EHR?)
  • How can an investigator assess and reduce that error?
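One way to make the answers to these questions explicit and reviewable is to write the candidate “signals” down as a computable definition. The sketch below is purely illustrative: the diagnosis codes, medication names, field names, and laboratory threshold are assumptions for the sake of the example, not a validated phenotype definition. It shows how several surrogate signals (a recorded diagnosis, a medication order, an abnormal laboratory value) might be combined to flag a possible case of type 2 diabetes.

```python
# Hypothetical sketch of a computable phenotype built from several EHR
# "signals." Codes, field names, and the threshold are illustrative
# assumptions, not a validated definition.

DIABETES_ICD10 = {"E11.9", "E11.65"}        # assumed diagnosis codes of interest
DIABETES_MEDS = {"metformin", "glipizide"}  # assumed medication names
HBA1C_THRESHOLD = 6.5                       # percent; assumed laboratory cutoff

def possible_diabetes(record):
    """Return True if any surrogate signal for diabetes is present.

    Each signal is only a surrogate: a recorded diagnosis may be a
    rule-out, a medication may treat another condition, and a lab value
    may be absent for patients tested elsewhere.
    """
    has_dx = bool(DIABETES_ICD10 & set(record.get("diagnoses", [])))
    has_med = bool(DIABETES_MEDS & {m.lower() for m in record.get("medications", [])})
    has_lab = any(v >= HBA1C_THRESHOLD for v in record.get("hba1c_results", []))
    return has_dx or has_med or has_lab

patient = {"diagnoses": ["E11.9"], "medications": [], "hba1c_results": [5.4]}
print(possible_diabetes(patient))  # True: flagged via the diagnosis-code signal
```

Writing the definition this way also makes the error sources concrete: each line of the function corresponds to one of the questions above and can be audited separately.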

Understanding and Controlling Variation

In traditional clinical trials, the deliberate and prospective collection of data is meant to limit and control variation in observation and documentation. When using EHR data for research, the researcher loses much of this control; it therefore becomes the researcher’s obligation to identify and understand the variability in the observation and documentation of the data and to control it to the greatest extent possible. Variation in how providers treat different patients and conditions is well known, as is the fact that treatment patterns can vary by geographic region for a number of reasons (patient population, regulations, costs, and provider training and incentives). When EHR data are used for research, new sources of potential variation emerge, for example from the EHR system used, its interfaces, and the ordering or filtering of entry terms. This challenges the fundamental premise of research, which is to understand and control variation. Fortunately, data quality assessment and reporting recommendations (Zozus et al. 2014) and data quality assessment methods (Weiskopf and Weng 2013) give researchers ways to compare metrics across providers and detect variation between different data sources or study sites.
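A simple check in this spirit, following the completeness dimension described by Weiskopf and Weng (2013), is to compute the fraction of non-missing values for a key field at each site and flag sites that deviate markedly from the pooled rate. The sketch below uses made-up records, an assumed field name, and an arbitrary deviation threshold; it is a minimal illustration, not a substitute for a full data quality assessment plan.

```python
# Minimal sketch: per-site completeness for one field, with a simple flag
# for sites that deviate from the pooled completeness rate. The records,
# field name, and 10-percentage-point tolerance are illustrative.

from collections import defaultdict

def completeness_by_site(records, field):
    """Fraction of records at each site with a non-missing value for `field`."""
    totals, present = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["site"]] += 1
        if r.get(field) is not None:
            present[r["site"]] += 1
    return {s: present[s] / totals[s] for s in totals}

def flag_outlier_sites(records, field, tolerance=0.10):
    """Sites whose completeness differs from the pooled rate by more than `tolerance`."""
    rates = completeness_by_site(records, field)
    pooled = sum(1 for r in records if r.get(field) is not None) / len(records)
    return sorted(s for s, rate in rates.items() if abs(rate - pooled) > tolerance)

records = (
    [{"site": "A", "bmi": 27.1}] * 27 + [{"site": "A", "bmi": None}] * 3 +
    [{"site": "B", "bmi": 24.3}] * 5 + [{"site": "B", "bmi": None}] * 5
)
print(completeness_by_site(records, "bmi"))  # {'A': 0.9, 'B': 0.5}
print(flag_outlier_sites(records, "bmi"))    # ['B']
```

Run across all shared fields and sites, this kind of report makes site-to-site documentation differences visible before they are mistaken for clinical differences.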

Variability of Data Documentation and Clinical Phenomena Across Providers and Sites

Because the structure and representation of clinical data are imposed at each facility according to its standards for clinical documentation and its business needs, the data are subject to variation across sites. Local and regional variation in health care activities, events, documentation, treatment patterns, and data collection is to be expected. In addition to regional variation in health care delivery and documentation patterns, there are natural variations across populations from one region to another. This variation, in turn, can be seen in the clusters that are a common design feature of pragmatic trials. For example, one Collaboratory trial involves randomization of clinics located in Hawaii, the Pacific Northwest, and Georgia, and these populations would be expected to differ on a number of variables, such as age, race, and risk exposures. Consequently, differences between clinics could reflect pre-existing regional differences in patient populations, regional differences in practice patterns, or “true” differences attributable to study interventions. If the investigator believes that the demographic differences across providers are real and can be adjusted for, facility- or provider-level fixed effects can be used to reduce the effect of the provider on measurement. The real problem arises when the measurement error is systematic and biased in one provider versus another, because this makes it difficult to distinguish the effect of the intervention from the provider effect.
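The intuition behind a provider or facility fixed effect can be shown with a within-site “demeaning” transformation: subtracting each site’s mean outcome removes any constant site-level shift, whether it comes from patient mix or from documentation habits. The numbers below are made up, and a real analysis would instead include site indicator variables in a regression model; note also that demeaning cannot repair the harder problem described above, systematic measurement error that differs by site.

```python
# Minimal sketch of a site fixed-effects adjustment by within-site
# demeaning. Numbers are made up; real analyses would fit a regression
# model with site indicators, and demeaning cannot correct systematic,
# site-specific measurement error.

from collections import defaultdict

def demean_by_site(observations):
    """Subtract each site's mean outcome, removing constant site-level shifts."""
    by_site = defaultdict(list)
    for site, y in observations:
        by_site[site].append(y)
    means = {s: sum(vals) / len(vals) for s, vals in by_site.items()}
    return [(site, y - means[site]) for site, y in observations]

# Site A records outcomes about 2 units higher than site B overall.
obs = [("A", 12.0), ("A", 14.0), ("B", 10.0), ("B", 12.0)]
print(demean_by_site(obs))  # [('A', -1.0), ('A', 1.0), ('B', -1.0), ('B', 1.0)]
```

After the transformation, the within-site spread is identical at both sites, so comparisons no longer reflect the constant offset between them.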

The quality and completeness of EHR data will vary by EHR, by institution and its workflows, by features of patients and their conditions, and over time. (For example, providers may struggle to assess and document all visit data for complex patients with many chronic conditions.) In multisite trials, different organizational and business approaches (e.g., reimbursement policies, EHR systems) may introduce new sources of variation beyond provider preferences or treatment variation across providers. Thus, a given set of EHR data may prove more or less useful to a study depending on the objectives and clinical topic of the study, the design of the trial, and whether the organization collected relevant data during the study time period. In designing PCTs, researchers need to identify these types of variation and evaluate their potential impact on the trial.

In the sections that follow, we present the process of designing pragmatic trials that include the use of EHR data. The error and bias inherent in EHR data limit their use for many research purposes, and the variation inherent across sites needs to be assessed and reported. Prospective investigators must understand the nature, limitations, and structure of the EHR data they plan (or are considering) to use for their research, and this must be done in the context of a specific research question. Therefore, we start with developing and refining the research question and then explore specific uses of EHR data based on the major activities of a trial.





The Collaboratory Electronic Health Records Core developed a white paper, Assessing Data Quality for Healthcare Systems Data Used in Clinical Research (V. 1.0), that provides guidance, based on the best available evidence and practice, for assessing data quality in pragmatic clinical trials (PCTs) conducted through the Collaboratory. Topics covered include an overview of data quality issues in clinical research settings, data quality assessment dimensions (completeness, accuracy, and consistency), and a series of recommendations for assessing data quality. Also included as appendices are a set of data quality definitions and review criteria, as well as a data quality assessment plan inventory. An abbreviated version of the white paper, Assessing Data Quality of Clinical Data for PCTs, describes data quality dimensions and recommendations for assessments.

Grand Rounds: OHDSI: Drawing Reproducible Conclusions from Observational Clinical Data

Podcast: OHDSI: Drawing Reproducible Conclusions From Observational Clinical Data (George Hripcsak, MD, MS)



Hripcsak G, Elhadad N, Chen Y-H, Zhou L, Morrison FP. 2009. Using Empiric Semantic Correlation to Interpret Temporal Assertions in Clinical Texts. J Am Med Inform Assoc. 16:220–227. doi:10.1197/jamia.M3007. PMID: 19074297.

Weiskopf NG, Weng C. 2013. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 20:144–151. doi:10.1136/amiajnl-2011-000681. PMID: 22733976.

Zozus MN, Hammond WE, Green BB, et al. 2014. Assessing Data Quality for Healthcare Systems Data Used in Clinical Research.

Version History

November 30, 2018: Updated text as part of annual update and added a resource (changes made by K. Staman).

Published August 25, 2017


Richesson R, Platt R, Simon G, et al. Using Electronic Health Record Data in Pragmatic Clinical Trials: Data as a Surrogate for Clinical Phenomena. In: Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials. Bethesda, MD: NIH Health Care Systems Research Collaboratory. Available at: Updated December 3, 2018. DOI: 10.28929/031.