Using Electronic Health Record Data in Pragmatic Clinical Trials

Section 5

Identifying the Study Population and Assessing Baseline Prognostic Characteristics


Rachel Richesson, MS, PhD, MPH

Richard Platt, MD, MSc

Gregory Simon, MD, MPH

Lesley Curtis, PhD

Reesa Laws, BS

Adrian Hernandez, MD, MSH

Jon Puro, MPA-HA

Doug Zatzick, MD

Erik van Eaton, MD, FACS

Vincent Mor, PhD


Contributing Editor

Karen Staman, MS

For both of these activities, defining the study population and assessing baseline characteristics, investigators will need to know what fields are used in the EHR, what the sources of the data are, and why and how the data were collected. This information is also necessary for outcome data, and we discuss it in greater detail below. In particular, it is important to identify the biases inherent in the data based on their source. For example, health care data sets include only people who seek and receive medical care. Many people with a given condition may be missing from the data set because they lack insurance or access to the health center (e.g., cancer center data are limited to patients who have a diagnosis and care plan with the center; people who are undiagnosed or have early diagnoses might not be included), or simply because they did not seek care during the study timeframe.

Health systems collect data for a number of different reasons: documentation of clinical events, payment/reimbursement, quality improvement, and so on. The data may or may not be structured, and data collection workflows may differ across sites. Many people assume that EHR data remain consistent over time, but this is rarely, if ever, the case. EHR system upgrades, changing workflows and interfaces, clinicians' autonomy in implementing different processes, the availability of charting and abstraction support, and organizational changes can all affect data over time, and all must be accounted for in any PCT design. Coding support tools implemented for business purposes may influence the recording of diagnoses and procedures, and these influences often change over time.

Case Example: Unstructured Data and Varying Sources of Data

The Trauma Survivors Outcomes and Support (TSOS) study (NCT02655354) was developed to coordinate care and improve outcomes for trauma survivors with posttraumatic stress disorder (PTSD) and comorbidity, and is being conducted at 24 US trauma centers. The study used EHR data collected during the routine delivery of trauma care to identify injury patterns of enrolled participants. Clinical sites generally describe traumatic injuries in free text on admission, or pick one injury and use it to select an ICD code. This information recorded at admission may not accurately reflect the true burden of injury or psychiatric comorbidity that will eventually be diagnosed over the course of the hospitalization. To improve completeness of data for TSOS, data from local trauma registries (using data definitions published by the National Trauma Data Bank) can be used. These registry data are often manually abstracted from the EHR and entered into the trauma registry; for example, handwritten results that appear as scanned forms in the EHR are manually entered as structured text in the trauma registry. The TSOS study team collects the trauma registry reports for enrolled patients and must manually link and reconcile these data with the data already collected at the time of admission. Patient identifiers permit positive linkage, but, as is common in multisite IT projects, variations in site IT configurations and resources are best addressed by manual work by the research team. Baseline data are collected in real time upon patient recruitment by the TSOS study team, whereas trauma registry data are collected months later from each of the participating sites.
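The linkage step described above can be sketched as a merge on patient identifiers, here using Python's pandas library. This is a minimal illustration, not the TSOS implementation; the column names and values are hypothetical:

```python
import pandas as pd

# Baseline data captured by the study team at recruitment (hypothetical columns):
# a single ICD code picked at admission, sometimes missing.
baseline = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "admit_icd10": ["S06.0X0A", None, "S72.001A"],
})

# Trauma registry export, manually abstracted months later (hypothetical columns).
registry = pd.DataFrame({
    "patient_id": [101, 102, 104],
    "n_injury_codes": [2, 1, 1],
})

# Positive linkage on the shared patient identifier; the indicator column flags
# records present in only one source so they can be reconciled manually.
linked = baseline.merge(registry, on="patient_id", how="outer", indicator=True)

unmatched = linked[linked["_merge"] != "both"]
print(unmatched[["patient_id", "_merge"]])
```

In practice, the reconciliation of unmatched or conflicting records is exactly the manual step the case example describes; the merge only makes the discrepancies visible.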


When considering using EHR data for a trial, researchers should ask a series of questions about the patients and health system features that may affect the completeness or quality of the documentation and data collection for each EHR data source under consideration.

  • What patients are included and excluded from the data source?
  • Is bias introduced by the study of specific populations (e.g., insured vs uninsured)?
  • Are any of the needed data derived or calculated from other data? If so, who does this and at what point in the data lifecycle does this happen?
  • Are there standardized data collection and documentation practices or are clinicians or clinical sites allowed the freedom to implement their own processes?
  • Are the data structured or unstructured?
  • Are the data of interest captured or generated in multiple places in the EHR? If so, which source is best to address the study question?

Health system processes or financial incentives can influence how care events are represented in EHRs or claims. These incentives may affect data completeness and introduce confounding, because documentation practices reflect health care system priorities and context. Some examples include:

  • Payors that require certain data be collected for reimbursement: Federally Qualified Health Center (FQHC) clinics are required by the Health Services and Resources Administration (HRSA) to collect various elements such as Federal Poverty Level and the patient’s primary language.
  • Some research or quality improvement projects may encourage collection of certain data while funding is available, yielding high-quality data; once the funding expires, data quality may decline.
  • In capitated financing arrangements, such as Medicare Advantage, risk-adjustment policies may create financial incentives to identify and record particular diagnoses.
  • In fee-for-service financing arrangements, financial incentives may increase recording of particular diagnoses, procedures or services.
  • Provider productivity incentives or performance improvement initiatives might incentivize use of some procedure codes over others.

These processes or incentives can vary across health systems or across time within a health system. For PCTs, it is particularly important to identify incentives or business processes that might influence the actual care (e.g., formulary policies encourage prescription of one drug vs another) and those that might influence the recording or representation of care (e.g., risk-adjusted reimbursement favoring one depression diagnosis over another).

Integrating Data from Heterogeneous Systems

If the planned study is a multi-site trial, then the investigator must consider whether clinically "equivalent" populations can be identified from multiple sites. What assessments or validation plan can be used to ensure that sample populations at each site are clinically equivalent? How much heterogeneity is there between sites? To answer these questions, a researcher can compare summary data (e.g., counts, distributions) and examine clinical workflows across sites. Information about workflows can help explain issues with data quality and completeness, or even proactively alert a researcher to impending unexpected data issues. Ideally, sources of heterogeneity across study sites (clinics, health plans, etc.) should be explored both quantitatively (by comparing relevant indicators across sites during the planning phase) and qualitatively (by interviewing clinicians and managers regarding workflows and incentives).
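The quantitative comparison described above can be sketched with per-site summary statistics. This is a simplified illustration with fabricated data; the variables, the 10-year threshold, and the "outlier" rule are all hypothetical choices, not a validated equivalence test:

```python
import pandas as pd

# Hypothetical site-level extract: one row per eligible patient.
patients = pd.DataFrame({
    "site": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "age":  [64, 71, 58, 69,   66, 73, 61, 70,   22, 25, 30, 28],
    "dx_recorded": [1, 1, 1, 0,  1, 1, 0, 1,  0, 0, 1, 0],
})

# Per-site summaries: sample size, age distribution, completeness of a key field.
summary = patients.groupby("site").agg(
    n=("age", "size"),
    median_age=("age", "median"),
    dx_complete=("dx_recorded", "mean"),
)
print(summary)

# Flag sites whose median age deviates substantially from the pooled median,
# a crude signal that populations may not be clinically equivalent.
pooled = patients["age"].median()
outliers = summary[(summary["median_age"] - pooled).abs() > 10]
```

Such tables are a starting point for the qualitative follow-up (interviews about workflows) rather than a substitute for it.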

Investigators will also need to communicate and coordinate effectively with the IT and business units of the participating healthcare organizations in multi-site trials. With embedded research and PCTs, researchers must engage stakeholders to work within systems that were built to optimize clinical operations, not research. There are socio-political challenges to obtaining, evaluating, and interpreting clinical data in PCTs; in fact, the governance issues around linking multiple data sources often prove far more problematic than the technical approaches to the linkage itself.

Case Example: Integrating Data From Heterogeneous Systems

The Pragmatic Trial of Video Education in Nursing Homes (PROVEN; NCT02612688) is being conducted to determine whether showing advance care planning videos in nursing homes affects rates of hospitalization. PROVEN has two health system partners. While investigators benefit from both partners' use of the federally mandated Minimum Data Set (MDS) assessments for nursing home residents on Medicare and Medicaid, the sites use different EHR vendors and therefore have different non-mandatory assessment modules covering physician orders, the electronic Medication Administration Record (eMAR), nursing notes, transfer notes, social service records, and so on. This would present a problem if investigators sought to use data from one of these non-systematic sources as an outcome or independent variable, but there is also the problem of inconsistent or incomplete implementation of the various modules within a health care system. For example, one nursing home partner had implemented the physician order module, where advance directives are located, in its EHR several years earlier. However, not all facilities that were supposed to have adopted this module had actually implemented it, a fact determined only after a deliberate facility-by-facility review of data completeness. Thus, before using the data from any facility, investigators first had to determine, based on the degree of use, whether a facility was using the record at all and then, based on the dates on which completed records were indicated, when it started to use the record. In the end, only a minority of the sites had what appeared to be usable data on advance directives.
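A facility-by-facility completeness review of this kind can be sketched as follows. The facility names, column names, and the usability rule (any completed record counts as evidence of use) are hypothetical, not taken from the PROVEN protocol:

```python
import pandas as pd

# Hypothetical per-facility physician-order records with completion dates.
orders = pd.DataFrame({
    "facility": ["F1", "F1", "F1", "F2"],
    "completed_date": pd.to_datetime(
        ["2014-03-01", "2014-05-10", "2015-01-20", "2016-07-04"]
    ),
})
roster = ["F1", "F2", "F3"]  # facilities that were supposed to adopt the module

# Degree of use: count completed records per facility and find the earliest
# completion date; reindexing against the roster surfaces facilities with no
# records at all (never implemented).
use = orders.groupby("facility").agg(
    n_records=("completed_date", "size"),
    first_use=("completed_date", "min"),
).reindex(roster)
use["n_records"] = use["n_records"].fillna(0).astype(int)

# Keep only facilities with evidence of real use before relying on their data.
usable = use[use["n_records"] > 0]
print(usable)
```

The `first_use` date answers the second question in the case example: when each facility started using the record, so data before that date can be treated as structurally missing rather than clinically absent.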

To address the missing data problem and include the other facilities in the trial, investigators used an earlier version of the mandatory nursing home resident assessment that included data fields in which residents' "code status" was noted. The current MDS 3.0 no longer includes this information because it was not reliably updated as a clinically meaningful field. One of the health care system partners had incorporated physician order sets into its EHR, but only about one third of its facilities had instituted this feature. From the point of view of the PROVEN trial, this yields about 30 intervention facilities with advance directives in the physician order set and some 60 control facilities. Because the number of facilities with this data feature was insufficient for the overall study, investigators could not restrict the study to these facilities alone. Nonetheless, they will be able to compare the effect of the intervention on the adoption of advance directives such as do-not-resuscitate (DNR) or do-not-hospitalize (DNH) orders in this subset of facilities.

Another example of heterogeneous data sources for PROVEN relates to the use of the CMS Virtual Research Data Center (VRDC) for monitoring mortality and hospital transfers. Data on hospital transfers, the primary outcome, are available only for fee-for-service beneficiaries because Medicare Advantage encounter data are not available in the VRDC. This is another example of how only a subset of the population (75% to 80%) has the relevant outcome data in the primary data source.



Version History

November 30, 2018: Updated text as part of annual update (changes made by K. Staman).

Published August 25, 2017


Richesson R, Platt R, Simon G, et al. Using Electronic Health Record Data in Pragmatic Clinical Trials: Identifying the Study Population and Assessing Baseline Prognostic Characteristics. In: Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials. Bethesda, MD: NIH Health Care Systems Research Collaboratory. Available at: Updated December 3, 2018. DOI: 10.28929/034.