Acquiring Real-World Data
Section 4
Acquiring Electronic Health Record Data
One problem with relying on electronic health record (EHR) data as part of a longitudinal data collection strategy is obtaining data from all the sites where a patient is treated. How does the study team obtain EHR data from a hospital that is outside the participating healthcare system? If a patient has a heart attack while on vacation, how will the study team capture that information? It is generally recommended to use multiple mechanisms to obtain secondary use data on longitudinal outcomes, such as directly from patients via a mobile app or call center, or through sources like administrative claims.
An important factor with EHR data, compared with sources like administrative claims, is that there is considerable variation in the ways data are captured in the EHR, as well as the terminologies used to represent those data. This challenge makes harmonization a key step in the use of EHR data. Although the Office of the National Coordinator for Health Information Technology (ONC) is creating a framework for greater standardization and research capacity for EHR systems, the use of these standards is variable across the healthcare industry (ONC 2022).
The “Data Formats” section of this chapter describes some of the common data formats that are used to represent real-world data. This section describes some of the challenges that arise in obtaining EHR data from the enrolling healthcare system and other healthcare systems for 3 of the most frequently used approaches: (1) database extraction from individual sites, (2) use of a common data model (CDM), and (3) use of an application programming interface (API).
Database Extract
One of the more straightforward approaches for obtaining EHR data is to work with a site’s IT staff or an EHR vendor–based IT resource to generate a database extract at a given site. These extracts, which often are delivered as flat files or database tables, can be large in scale and automated, but complex queries rely on the skills of the local analyst, and extracts are more difficult to do at smaller sites that may not have the resources (Marsolo 2019). Hence, while this method requires fewer site resources than something like a CDM, it still may have limited applicability in multicenter pragmatic trials.
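As a minimal sketch of what such an extract involves (the table, column names, and file name below are hypothetical, standing in for a site's EHR reporting database), a local analyst's script might query encounters and deliver them as a flat file:

```python
import csv
import sqlite3

# Hypothetical schema standing in for a site's EHR reporting database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE encounters (
        patient_id TEXT, admit_date TEXT, discharge_date TEXT, enc_type TEXT)"""
)
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?, ?)",
    [("P001", "2022-01-05", "2022-01-07", "IP"),
     ("P002", "2022-01-06", "2022-01-06", "AV")],
)

# Pull the requested window of encounters for the study team.
rows = conn.execute(
    "SELECT patient_id, admit_date, discharge_date, enc_type "
    "FROM encounters WHERE admit_date >= ?",
    ("2022-01-01",),
).fetchall()

# Deliver the extract as a flat file (CSV).
with open("encounter_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["patient_id", "admit_date", "discharge_date", "enc_type"])
    writer.writerows(rows)
```

The simplicity is deceptive: the SELECT above encodes study-specific decisions (date windows, which encounter types to include) that depend entirely on the local analyst's understanding of the source system.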
Common Data Model
Information can be recorded in different ways across sites (and across diagnoses). Sites that participate in distributed research networks (such as PCORnet and Sentinel) and national registries (such as the American College of Cardiology’s National Cardiovascular Data Registry) have agreed to harmonize their source data in a prespecified way by using a CDM. CDMs that are populated from sources like EHRs and administrative claims typically contain data that are captured in those sources in a structured format (such as diagnoses, procedures, laboratory results, medication orders, and medication administrations). Information captured in an unstructured format or with varying reliability may not be present in a CDM. For example, if a cause for an event is needed, the level of specificity required may or may not be available in the CDM because of variability in how that information is documented in the source system. In such cases, manual chart review may be a more efficient way of obtaining the cause of an event.
CDMs often use standard controlled terminologies to represent data, such as ICD-10 codes for diagnoses or RxNorm or the National Drug Code (NDC) for medications. However, these terminologies alone do not guarantee that data from different facilities are comparable. For example, differences in site coding practices can determine whether patients are counted as having one condition vs another. Moreover, CDMs require mapping data from an EHR repository or institutional data warehouse to the target definition. In many cases, the latter is a more abstract model, and useful context and other detail can be lost in mapping the data to the CDM (Garza et al 2016). As an example, the encounter type definitions for the Sentinel and PCORnet CDMs, including those that capture hospitalization, are shown in the table below.
Sentinel is a US medical product surveillance system designed to monitor medical products regulated by the US Food and Drug Administration. The Patient-Centered Outcomes Research Institute (PCORI) funded the National Patient-Centered Clinical Research Network (PCORnet) to build a network-of-networks to support clinical research. The PCORnet CDM was built based on the Sentinel model but was extended to support many of the data elements found in EHRs. The table below shows the "type of encounter" definitions for the Sentinel CDM and PCORnet.
"Type of Encounter" Definitions for the Sentinel and PCORnet CDMs

| Name | Definition | Sentinel | PCORnet |
|---|---|---|---|
| Ambulatory Visit (AV) | Includes visits at outpatient clinics, same day surgeries, urgent care visits, and other same-day ambulatory hospital encounters, but excludes emergency department encounters. | x | x |
| Emergency Department (ED) | Includes ED encounters that become inpatient stays (in which case the inpatient stay is recorded as a separate encounter); excludes urgent care visits. ED claims should be pulled before hospitalization claims to ensure that an ED visit with subsequent admission is not rolled up into the hospital event. | x | x |
| ED Admit to Inpatient (EI) | Emergency department admit to inpatient hospital stay: permissible substitution for the preferred state of separate ED and IP records. Only for use with data sources where the individual ED and IP records cannot be distinguished. | | x |
| Inpatient Hospital (IP) | Includes all inpatient stays, same-day hospital discharges, hospital transfers, and acute hospital care where the discharge is after the admission date. (PCORnet only: Does not include observation stays, where known.) | x | x |
| Observation Stay (OS) | "Hospital outpatient services given to help the doctor decide if the patient needs to be admitted as an inpatient or can be discharged. Observation services may be given in the emergency department or another area of the hospital." (Definition from Medicare, CMS Product No. 11435, https://www.medicare.gov/Pubs/pdf/11435.pdf) | | x |
| Institutional Professional Consult (IC) | Permissible substitution when services provided by a medical professional cannot be combined with the given encounter record, such as a specialist consult in an inpatient setting; this situation can be common with claims data sources. Includes physician consults for patients during inpatient encounters that are not directly related to the cause of the admission (e.g., an ophthalmologist consult for a patient with diabetic ketoacidosis); guidance updated in CDM v4.0. | | x |
| Non-Acute Institutional Stay (IS) | Includes hospice, skilled nursing facility (SNF), rehab center, nursing home, residential, overnight non-hospital dialysis, and other non-hospital stays. | x | x |
| Other Ambulatory (OA) | Includes other non-overnight ambulatory encounters such as hospice visits, home health visits, skilled nursing facility visits, other non-hospital visits, as well as telemedicine, telephone, and email consultations. (PCORnet only: May also include "lab only" visits [when a lab is ordered outside of a patient visit], "pharmacy only" visits [e.g., when a patient has a refill ordered without a face-to-face visit], "imaging only" visits, etc.) | x | x |
| Other (OT) | | | x |
| Unknown (UN) | | | x |
| No Information (NI) | | | x |
There are many possible encounter types, and it is important for investigators to understand encounter type definitions and to harmonize them across sites if possible. As an example, in 2014-2015, the University of California Research Exchange (UCReX) harmonized EHR data from the 5 medical campuses of the University of California, establishing common definitions so that a single query could return patient counts across a geographically distributed, federated system architecture (Gabriel et al 2014). The data harmonization team discovered 60 unique encounter types across the sites contributing EHR data extracts (Gabriel et al 2014). Many EHRs have 100 to 200 encounter types, which leads to 2 important considerations: (1) whether sites can correctly map their encounter types to the specified CDM value set; and (2) whether those value sets contain enough granularity for the research question. In some cases, there may be benefit in using the "raw" encounter types instead of the harmonized encounter types. If this occurs frequently, however, it would be better to expand the value set of the underlying CDM to better support such research questions.
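The first consideration can be sketched as a simple lookup from a site's raw encounter types to the CDM value set, with anything unmappable falling back to OT (Other). The raw labels below are hypothetical examples; a real site may carry 100 to 200 of them, and the fallback line is exactly where granularity is lost:

```python
# PCORnet ENC_TYPE value set, as listed in the table above.
CDM_VALUES = {"AV", "ED", "EI", "IP", "OS", "IC", "IS", "OA", "OT", "UN", "NI"}

# Hypothetical site-specific mapping from raw encounter types to CDM values.
SITE_MAP = {
    "Office Visit": "AV",
    "Emergency": "ED",
    "Hospital Encounter": "IP",
    "Telephone": "OA",
    "Hospice": "IS",
}

def to_cdm(raw_type: str) -> str:
    """Map a raw encounter type to the CDM value set; fall back to OT (Other)."""
    mapped = SITE_MAP.get(raw_type, "OT")
    assert mapped in CDM_VALUES  # every output must be a valid CDM value
    return mapped

print(to_cdm("Office Visit"))      # AV
print(to_cdm("Anesthesia Event"))  # OT -- the raw distinction is lost
```

A query that needs to distinguish, say, anesthesia events from other "Other" encounters cannot be answered from the harmonized data alone, which is the motivation for either retaining the raw values or expanding the CDM value set.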
Application Programming Interfaces
As noted previously, Fast Healthcare Interoperability Resources (FHIR) is emerging as a standard to obtain data from EHRs (Garza et al 2019; Duda et al 2022). Many data collection tools, including REDCap, have developed or are developing middleware services that allow data to be pulled from FHIR resources to populate a study database or case report form. Some of these solutions are not compliant with 21 CFR Part 11 and may not be appropriate for all trials (Campion et al 2017).
While FHIR-based methods of data acquisition hold great promise, there are several caveats. First, because EHR data are not collected in a standard way, there is potential for mapping discrepancies (Marsolo 2019). Two sites with the same EHR may map FHIR resources in slightly different ways, and the same FHIR request could return 2 slightly different data sets. Second, many sites have limited experience delivering data in this way, and the skill set to develop, maintain, and deliver data through data exchange is highly specialized (Marsolo 2019). Some of these issues are being mitigated through consensus mappings such as the US Core and Argonaut implementation guides, and by EHR vendors implementing these mappings as part of their products. Nonetheless, facility-specific implementation decisions will affect the EHR vendors' standard mappings. A final caveat is that all ONC-certified EHRs are supposed to allow patients to request copies of their data via FHIR APIs (see the "Participant-Reported Data" section of this chapter). These are nominally the same APIs as those on the "clinical" side, but there may be restrictions on the data that are available via participant-facing APIs. For example, a query for real-time laboratory results would yield different results if a healthcare system delays information for clinician review before releasing it to a patient.
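As a rough sketch of what a FHIR-based pull looks like on the study side, the snippet below flattens a hand-built, truncated FHIR R4 Bundle of laboratory Observation resources into tabular rows. A real client would first issue an authenticated request such as `GET [base]/Observation?patient=...&category=laboratory` and handle paging; the bundle contents here are illustrative only:

```python
# Minimal, hand-built FHIR R4 Bundle standing in for a server response.
bundle = {
    "resourceType": "Bundle",
    "entry": [
        {"resource": {
            "resourceType": "Observation",
            "status": "final",
            "code": {"coding": [{
                "system": "http://loinc.org",
                "code": "2345-7",
                "display": "Glucose [Mass/volume] in Serum or Plasma",
            }]},
            "effectiveDateTime": "2022-03-01T08:30:00Z",
            "valueQuantity": {"value": 95, "unit": "mg/dL"},
        }},
    ],
}

def flatten(bundle: dict) -> list:
    """Extract LOINC code, timestamp, value, and unit from each Observation."""
    rows = []
    for entry in bundle.get("entry", []):
        obs = entry["resource"]
        if obs.get("resourceType") != "Observation":
            continue
        coding = obs["code"]["coding"][0]
        qty = obs.get("valueQuantity", {})
        rows.append({
            "loinc": coding["code"],
            "time": obs.get("effectiveDateTime"),
            "value": qty.get("value"),
            "unit": qty.get("unit"),
        })
    return rows

print(flatten(bundle))
```

Note that the element names (`code.coding`, `effectiveDateTime`, `valueQuantity`) are standardized by FHIR, but which LOINC codes a site populates, and how consistently, reflects the local mapping decisions described above.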
See the Using Electronic Health Record Data chapter of the Living Textbook for more information about interoperability, data as a surrogate for clinical phenomena, and uses of EHR data in pragmatic clinical trials.
Resources

PCORnet COVID-19 Common Data Model Design and Results
NIH Pragmatic Trials Collaboratory PCT Grand Rounds; June 5, 2020
EHR-Based Outcome Measurement in the LIRE Trial
NIH Pragmatic Trials Collaboratory EHR Workshop video module. Dr. Jerry Jarvik of the University of Washington summarizes the technical and cultural challenges of embedding a radiology reporting intervention into the EHRs of multiple healthcare systems in the LIRE NIH Collaboratory Trial.
EHR Pragmatic Innovation Beyond Follow-up in the TSOS Study
NIH Pragmatic Trials Collaboratory EHR Workshop video module. Dr. Doug Zatzick of the University of Washington describes the unique challenges of using electronic health records in the TSOS NIH Collaboratory Trial.
REFERENCES
Campion TR Jr, Sholle ET, Davila MA Jr. 2017. Generalizable middleware to support use of REDCap dynamic data pull for integrating clinical and research data. AMIA Jt Summits Transl Sci Proc. 2017:76-81. PMID: 28815111.
Duda SN, Kennedy N, Conway D, et al. 2022. HL7 FHIR-based tools and initiatives to support clinical research: a scoping review. J Am Med Inform Assoc. 29:1642–1653. doi: 10.1093/jamia/ocac105. PMID: 35818340.
Gabriel D, Meeker D, Bell D, Matheny M. 2014. Data Harmonization and Synergies: OMOP, PCORnet CDM and the CTSA cohort identification models. Presented at: Data Integration, Analysis & Sharing Symposium; La Jolla, California; September 16, 2014.
Garza M, Del Fiol G, Tenenbaum J, Walden A, Zozus MN. 2016. Evaluating common data models for use with a longitudinal community registry. J Biomed Inform. 64:333–341. doi:10.1016/j.jbi.2016.10.016. PMID: 27989817.
Garza M, Myneni S, Nordo A, Eisenstein EL, Hammond WE, Walden A, Zozus M. 2019. eSource for standardized health information exchange in clinical research: a systematic review. Stud Health Technol Inform. 257:115–124. PMID: 30741183.
Marsolo K. 2019. Approaches to Patient Follow-Up for Clinical Trials: What’s the Right Choice for Your Study? NIH Pragmatic Trials Collaboratory PCT Grand Rounds; March 1, 2019. Available at: https://rethinkingclinicaltrials.org/news/approaches-to-patient-follow-up-for-clinical-trials-whats-the-right-choice-for-your-study-keith-marsolo-phd/. Accessed October 14, 2022.
Office of the National Coordinator for Health Information Technology (ONC). Interoperability Standards Advisory (ISA). https://www.healthit.gov/isa/. Accessed July 21, 2022.