Acquiring Real-World Data

Section 3 Data Formats

Real-world data are often stored in a variety of systems, so project teams should be prepared to receive and process data in multiple formats. This section describes several of the most common formats.

Flat Files

The most common data format used when transferring information to study teams is a simple flat file (for example, Microsoft Excel spreadsheet, comma-delimited text file, database extract). Flat files are a least-common-denominator approach to obtaining data from sites, in that they are easy for sites to generate, but there may be little standardization across data fields. For example, in extracts of demographic data from electronic health records (EHRs), one site may represent the gender values male and female as "M" and "F" while another may represent them as "1" and "0." The study team must harmonize these data before using them for analysis. When obtaining flat files, the study team should request that the files contain headers to indicate the content of each column, as well as a data dictionary that provides a description of each field and the possible values.
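
The harmonization step described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the site names, the column names (`patient_id`, `gender`), the "UNK" fallback, and the assumption that "1" means male and "0" means female at the second site are all hypothetical and would need to be confirmed against each site's data dictionary.

```python
import csv
import io

# Hypothetical code lists, one per site. In practice these come from each
# site's data dictionary; the meaning of "1" and "0" here is an assumption.
SITE_GENDER_CODES = {
    "site_a": {"M": "M", "F": "F"},
    "site_b": {"1": "M", "0": "F"},
}


def harmonize_gender(site, raw_value):
    """Map a site's local code to the study value, or "UNK" if unmapped."""
    return SITE_GENDER_CODES[site].get(raw_value.strip().upper(), "UNK")


def harmonize_extract(site, csv_text):
    """Read a flat file that has a header row and harmonize its gender column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"patient_id": row["patient_id"],
         "gender": harmonize_gender(site, row["gender"])}
        for row in reader
    ]


# A made-up comma-delimited extract from the second site.
SITE_B_EXTRACT = "patient_id,gender\n101,1\n102,0\n103,9\n"
rows = harmonize_extract("site_b", SITE_B_EXTRACT)
```

Values that do not appear in the site's code list (such as "9" here) are flagged rather than silently dropped, so the study team can follow up with the site. The header row is what allows the columns to be read by name.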

Common Data Models

Data from different sites can be extracted and transformed into a common representation using a common data model (CDM). CDMs are typically developed by distributed research networks or multicenter consortia to ensure that participating sites represent their data in a consistent manner (for example, everyone represents male and female as "M" and "F"). Each site maps its data to a target data element or value set within the CDM. This allows an analyst to execute a single query across the network with little to no modification because all the data have a consistent representation (Holmes et al 2008; Brown et al 2009).
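
To illustrate why a consistent representation allows one query to run everywhere, the sketch below uses two in-memory SQLite databases to stand in for two sites. The table name `demographic`, its columns, and the sample rows are hypothetical and are not taken from any particular CDM specification.

```python
import sqlite3

# One query, written once against the shared (hypothetical) table layout.
QUERY = "SELECT gender, COUNT(*) FROM demographic GROUP BY gender ORDER BY gender"


def make_site_db(rows):
    """Create an in-memory database that follows the shared table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demographic (patient_id TEXT, gender TEXT)")
    conn.executemany("INSERT INTO demographic VALUES (?, ?)", rows)
    return conn


def run_everywhere(sites, query):
    """Run the same query, unmodified, against every site's database."""
    return {name: conn.execute(query).fetchall() for name, conn in sites.items()}


sites = {
    "site_a": make_site_db([("a1", "F"), ("a2", "M"), ("a3", "F")]),
    "site_b": make_site_db([("b1", "M"), ("b2", "M")]),
}
results = run_everywhere(sites, QUERY)
```

Because both sites store the same values in the same columns, the analyst does not need site-specific logic. In a real network, the query would typically be distributed to each site and run locally rather than against a central connection.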

CDMs can differ in the type and specificity of the data they include and in the mapping rules used to transform local data to the CDM representation. Depending on the data needs and design philosophy of the distributed research network, a CDM may prioritize representational efficiency over interpretability. It may also require that data within a given domain be transformed to a specific vocabulary or terminology (for example, converting all diagnosis codes from International Classification of Diseases [ICD] to SNOMED) rather than leaving those records coded in the terminology used in the source system. CDMs are updated iteratively, with new versions released to add fields or tables or to accommodate evolution of the network's research priorities. Many CDM specifications also include guidance for sites that wish to extend the model to support their own studies. The table below describes some of the CDMs used in healthcare research and the primary research networks that use them.

Table. Common Data Models Used for Healthcare Research in the United States and Their Design Characteristics
CDM	Primary Network	Focus	Design Characteristics
Virtual Data Warehouse (Ross et al 2014)	Health Care Systems Research Network (HCSRN) (Vogt et al 2004; Steiner et al 2014)	Population-based research	One of the original CDMs from a distributed research network. Standardized table structure for common EHR/claims domains—diagnoses, procedures, dispensing, etc. Curated value sets for most data elements.
Sentinel (Curtis et al 2012; Curtis et al 2014)	Sentinel	Medical product safety surveillance	Adapted from the Virtual Data Warehouse to support public health surveillance, primarily through claims data partners. Data are represented as they exist in source systems (eg, diagnoses as ICD-10).
PCORnet (Califf 2014; Curtis et al 2014; Fleurence et al 2014)	PCORnet	Comparative effectiveness research, patient-centered outcomes research	Based on the Sentinel CDM with additional tables to support data from EHRs. Data represented as they exist in source systems. Value sets aligned with national standards where possible.
Observational Medical Outcomes Partnership (OMOP) (FitzHenry et al 2015)	Observational Health Data Sciences & Informatics (OHDSI)* (Hripcsak et al 2015)	Comparative effectiveness research, surveillance, risk prediction	Data for a given domain are mapped to specified terminology (eg, diagnoses as SNOMED); records in one domain may result in data generated in a second (eg, history of transplant diagnosis generates a transplant procedure record). Maintains centralized vocabulary tables with customized mappings between terminologies.
Informatics for Integrating Biology & the Bedside (i2b2) (Murphy et al 2010)	Accrual for Clinical Trials (ACT)	Cohort identification	Fact table structure that stores most data in entity–attribute–value format; queries constructed based on ontologies/hierarchies stored in concept table.
* OHDSI does not operate as a governed network but instead functions as a community of interested researchers.
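
The entity–attribute–value layout mentioned in the table can be unfamiliar to analysts used to one-row-per-patient files. The sketch below shows the general idea with made-up rows and made-up concept codes; it is a simplified illustration and does not reproduce the actual i2b2 table structure.

```python
from collections import defaultdict

# Each fact is one (entity, attribute, value) row. All values are hypothetical.
facts = [
    ("p1", "DEM:GENDER", "F"),
    ("p1", "LAB:HBA1C", "7.2"),
    ("p2", "DEM:GENDER", "M"),
]


def pivot(fact_rows):
    """Collect the attribute/value pairs for each entity into one record."""
    wide = defaultdict(dict)
    for entity, attribute, value in fact_rows:
        wide[entity][attribute] = value
    return dict(wide)


patients = pivot(facts)
```

A new kind of observation is added as new rows rather than as a new column, which is what makes the format flexible. The trade-off is that analysis usually requires a pivot step like this one.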

There are some downsides to CDMs. The effort required to populate them is nontrivial, requiring dedicated infrastructure funding in many cases. Even then, it may be outside the purview of many nonacademic medical centers. Another drawback is that converting local data to a harmonized value set can result in a loss of information (for example, by mapping several dozen internal encounter types to a constrained list of 5 to 10 categories). This drawback can be mitigated somewhat by allowing sites to store the unmapped "raw" values alongside the harmonized values, but the raw values are more problematic to use in a distributed analysis. Despite these limitations, at present a CDM-based approach is often the best option for pulling together heterogeneous data across multiple centers, particularly from those that have prior experience with research networks or other multicenter consortia.
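
The information loss described above, and the mitigation of keeping raw values, can be sketched as follows. The local encounter types, the harmonized categories, and the field names `raw_enc_type` and `enc_type` are hypothetical; a real mapping would come from the CDM specification.

```python
# Hypothetical map from local encounter types to a constrained category list.
ENCOUNTER_TYPE_MAP = {
    "OFFICE VISIT": "ambulatory",
    "TELEHEALTH VISIT": "ambulatory",
    "NURSE VISIT": "ambulatory",
    "ED ARRIVAL": "emergency",
    "INPATIENT ADMISSION": "inpatient",
}


def harmonize_encounter(raw_type):
    """Return the harmonized category while keeping the raw value alongside it."""
    harmonized = ENCOUNTER_TYPE_MAP.get(raw_type.strip().upper(), "other")
    return {"raw_enc_type": raw_type, "enc_type": harmonized}


local_types = [
    "Office Visit", "Telehealth Visit", "Nurse Visit",
    "ED Arrival", "Inpatient Admission", "Research Visit",
]
records = [harmonize_encounter(t) for t in local_types]

# Count distinct values before and after harmonization.
distinct_raw = len({r["raw_enc_type"] for r in records})
distinct_harmonized = len({r["enc_type"] for r in records})
```

In this toy example, six local types collapse into four categories, so an analysis that needs to distinguish telehealth from office visits can do so only through the raw field. That field will differ from site to site, which is why it is harder to use in a distributed analysis.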

Structured Documents

Other formats for transferring data include documents that use extensible markup language (XML) or HL7 messaging standards, though these are becoming less commonly used for research. XML is a markup language that defines rules for encoding data, with the resulting document following a hierarchical tree-like structure. The rules provide a way to determine whether the data in the document conform to specifications. XML is used as the standard for the HL7 Clinical Document Architecture (CDA) (Dolin et al 2001), which supports a number of defined document templates. The CDA was promoted as a standard for interoperable data exchange between healthcare organizations as part of the "meaningful use" legislation (Blumenthal and Tavenner 2010; D’Amore et al 2014), and the CDA templates were implemented by many EHR vendors as a way of providing care summary documents (Continuity of Care Documents) directly to patients. Many of the personal health records that arose in the 2010s used these summary documents as input.
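
The hierarchical structure of an XML document can be seen in a small example. The document below is deliberately simplified and hypothetical; it is not a conformant CDA document, and its element and attribute names are invented for illustration. Real CDA documents use XML namespaces and much deeper nesting.

```python
import xml.etree.ElementTree as ET

# A made-up summary document with a tree-like structure.
DOC = """<summary>
  <patient id="p1">
    <name>Example Patient</name>
    <medications>
      <medication code="123">Drug A</medication>
      <medication code="456">Drug B</medication>
    </medications>
  </patient>
</summary>"""


def list_medications(root):
    """Walk the tree and return (patient id, medication code, name) tuples."""
    found = []
    for patient in root.findall("patient"):
        for med in patient.findall("./medications/medication"):
            found.append((patient.get("id"), med.get("code"), med.text))
    return found


medications = list_medications(ET.fromstring(DOC))
```

Parsing succeeds only if the document is well formed. Checking that it also conforms to a specification, such as a CDA template, requires a separate validation step against the relevant schema.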

HL7 messages are used to transmit information between disparate clinical information systems, either within or across organizations. These messages are based on version 2.x of HL7 and are line-based, with content embedded in delimited segments. One common use case is for the transmission of laboratory results from a laboratory information system to an EHR. Another example is a health information exchange that uses HL7 messages to transmit results between members (Maloney et al 2014). Version 2.x is widely adopted across the healthcare industry, though the standard permits local variation, which can result in implementation differences across organizations. However, if the relevant study information is available from an organization via an HL7 feed, that feed may be a viable mechanism of transmission. Receiving and processing HL7 messages in real time typically requires specialized "listener" software; however, if received in bulk, such as daily or weekly digests of messages, they can be processed through standard approaches for text parsing.
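
As a rough sketch of the text-parsing approach for messages received in bulk, the example below splits a version 2.x style message into segments and fields. The message content is fabricated for illustration. The parser assumes the default delimiters (carriage return between segments, "|" between fields, "^" between components) and ignores escape sequences and repeating fields, so it is not a substitute for a full HL7 library.

```python
# A fabricated laboratory result message. Segments are separated by carriage returns.
MESSAGE = "\r".join([
    "MSH|^~\\&|LAB|HOSP|EHR|HOSP|202001011200||ORU^R01|MSG0001|P|2.5.1",
    "PID|1||12345^^^HOSP^MR||DOE^JANE",
    "OBX|1|NM|4548-4^Hemoglobin A1c^LN||7.2|%|4.0-5.6|H|||F",
])


def parse_obx(message):
    """Extract the test code, name, value, and units from each OBX segment."""
    results = []
    for segment in message.split("\r"):
        fields = segment.split("|")
        if fields[0] != "OBX":
            continue
        # OBX-3 holds the observation identifier as code^name^coding system.
        code, name = (fields[3].split("^") + ["", ""])[:2]
        results.append({
            "code": code,
            "name": name,
            "value": fields[5],   # OBX-5, observation value
            "units": fields[6],   # OBX-6, units
        })
    return results


lab_results = parse_obx(MESSAGE)
```

Because the standard permits local variation, field positions and coding systems should be confirmed against sample messages from each sending organization before relying on a parser like this.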

Both the XML and HL7 formats have had somewhat less uptake in the research community than flat files or CDMs, because the interfaces used to transmit these data are primarily for clinical operations and require more expertise to receive and process. More importantly, they are beginning to be superseded by application programming interfaces (APIs), a more "modern" and web-friendly approach to data exchange. However, XML and HL7 are still widely used for data exchange for operational purposes, and historical data that have already been encoded in this manner may not be coded in newer formats, meaning some studies may need to retain the ability to process such data.

Application Programming Interfaces (APIs)

There is considerable enthusiasm for APIs, which provide a standardized interface that allows data to be requested and returned via a series of function calls, often in JavaScript Object Notation (JSON). Commonly used with mobile technologies such as smartphones and tablets, APIs allow for discrete amounts of information to be queried securely. This is a particular advantage over prior attempts to support data interoperability, which relied on exchange of summary documents, as noted above. APIs are particularly well suited for use in decision support or precision medicine algorithms, in which a small number of inputs are used to make a prediction or recommendation. They can also be used to prepopulate specific fields in case report forms (such as demographic intake forms), as described in more detail on the HL7 Structured Data Capture webpage and by Rocca et al (2019). An added benefit is that queries that rely on APIs can be reused across all sites that support a given API.
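
The idea of prepopulating a demographic form from a JSON response can be sketched as follows. The JSON below is a hand-written, abbreviated example shaped like a Patient resource in the FHIR standard discussed later in this section; the form field names and the function `prepopulate_demographics` are hypothetical, and no network call is made.

```python
import json

# A hand-written, abbreviated example of a JSON response for one patient.
PATIENT_JSON = """{
  "resourceType": "Patient",
  "id": "example",
  "gender": "female",
  "birthDate": "1980-04-12",
  "name": [{"family": "Doe", "given": ["Jane"]}]
}"""


def prepopulate_demographics(resource):
    """Copy a few discrete elements into hypothetical case report form fields."""
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a Patient resource")
    names = resource.get("name") or [{}]
    first_name_entry = names[0]
    return {
        "last_name": first_name_entry.get("family", ""),
        "first_name": " ".join(first_name_entry.get("given", [])),
        "gender": resource.get("gender", ""),
        "date_of_birth": resource.get("birthDate", ""),
    }


form_fields = prepopulate_demographics(json.loads(PATIENT_JSON))
```

Only the elements the form needs are read, which reflects the advantage over exchanging whole summary documents. Missing elements are left blank so that study staff can complete them manually.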

The overall adoption of APIs in healthcare has been relatively slow. However, as part of its mandate to implement provisions of the 21st Century Cures Act, the Office of the National Coordinator for Health Information Technology issued a rule requiring that certified EHRs support APIs using the Fast Healthcare Interoperability Resources (FHIR) standard and be able to exchange the information specified in the United States Core Data for Interoperability (USCDI; Federal Register 2020), which should drive uptake across the industry. Outside the United States, implementation of the International Patient Summary will allow exchanges of basic patient data via standards like FHIR. Specialty groups are also defining core data elements to be captured and exchanged using FHIR. For example, the Minimal Common Oncology Data Elements (mCODE) is an open-source, common data language for cancer that facilitates transmission of data. The FHIR implementation guide for mCODE was presented at the American Society of Clinical Oncology meeting in June 2019, and pilot interventions are underway to test this infrastructure for sharing patient data for cancer research (Osterman et al 2020).

With regard to health insurance plans and administrative claims data, the Centers for Medicare & Medicaid Services (CMS) announced that CMS-regulated payers will also need to provide patient access via FHIR APIs (CMS 2020). Furthermore, a number of "accelerators" are developing extensions to FHIR to better support activities related to research, value-based care, social determinants of health, and more, including HL7’s Gravity, Argonaut, and Da Vinci Projects, and the Vulcan Accelerator.

Although APIs simplify data interchange across sites, there are a few potential downsides. If the underlying records in the source system are not captured in the same format as the API (for example, lab records coded to a local terminology instead of LOINC), some level of mapping is needed to convert them to the appropriate standard. This activity can result in mapping errors, particularly as APIs are deployed for the first time. Completeness may also be a concern, if a healthcare system maps only a portion of its historical results or limits what is available via the interface. Therefore, it will be important to validate any information received via an API and compare the results against other reference datasets, such as CDM extracts, if available. Another potential drawback of APIs is the volume of data that can be retrieved as part of a given request. APIs are typically designed to send or receive data at an observation level (for example, to retrieve the results of a single laboratory test or provide a list of a single patient’s current medications). As a result, trying to retrieve all data for a study population could result in millions of individual API calls. In the FHIR development community, efforts are underway to address this shortcoming with bulk API approaches. Until these solutions are widely implemented, care must be taken to tailor API requests so they do not overwhelm available resources.
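
The scale problem described above can be made concrete with simple arithmetic and a pacing helper. This is a sketch under stated assumptions: the population size, the number of requests per patient, and the rate limit are invented, and `fetch_one` is a hypothetical stand-in for whatever function actually calls a site's API.

```python
import time


def estimate_calls(n_patients, requests_per_patient):
    """Estimate how many observation-level requests a study would need."""
    return n_patients * requests_per_patient


def paced_fetch(patient_ids, fetch_one, max_calls_per_second=5.0, sleep=time.sleep):
    """Yield (patient_id, result) pairs without exceeding a chosen request rate.

    fetch_one is supplied by the caller; sleep is injectable so the pacing
    can be tested without waiting.
    """
    min_interval = 1.0 / max_calls_per_second
    for patient_id in patient_ids:
        started = time.monotonic()
        yield patient_id, fetch_one(patient_id)
        elapsed = time.monotonic() - started
        if elapsed < min_interval:
            sleep(min_interval - elapsed)


# Hypothetical study: 50,000 patients and 40 requests per patient.
total_calls = estimate_calls(50_000, 40)
```

At an assumed limit of 5 requests per second, the 2,000,000 calls in this example would take more than 4 days to complete, which illustrates why request volume needs to be planned with each site before data collection begins.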

Resources

In this video, Dr. Lesley Curtis explores common data models (CDMs). Topics include how CDMs work, customizing CDMs for specific research questions, challenges and benefits of CDMs, and the future of CDMs.

REFERENCES

Blumenthal D, Tavenner M. 2010. The "meaningful use" regulation for electronic health records. N Engl J Med. 363:501-504. doi:10.1056/NEJMp1006114. PMID: 20647183.

Brown J, Holmes J, Maro J, et al. 2009. Design Specifications for Network Prototype and Cooperative To Conduct Population-Based Studies and Safety Surveillance. Effective Health Care Program, Agency for Healthcare Research and Quality. https://effectivehealthcare.ahrq.gov/products/distributed-network-safety/research. Accessed July 21, 2022.

Califf RM. 2014. The Patient-Centered Outcomes Research Network: a national infrastructure for comparative effectiveness research. N C Med J. 75:204-210. doi:10.18043/ncm.75.3.204. PMID: 24830497.

Centers for Medicare & Medicaid Services. 2022. Interoperability and Patient Access final rule. https://www.cms.gov/Regulations-and-Guidance/Guidance/Interoperability/index. Accessed July 21, 2022.

Curtis LH, Brown J, Platt R. 2014. Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Aff (Millwood). 33:1178-1186. doi:10.1377/hlthaff.2014.0121. PMID: 25006144.

Curtis LH, Weiner MG, Boudreau DM, et al. 2012. Design considerations, architecture, and use of the Mini-Sentinel distributed data system: use of the Mini-Sentinel distributed database. Pharmacoepidemiol Drug Saf. 21:23-31. doi:10.1002/pds.2336. PMID: 22262590.

D'Amore JD, Mandel JC, Kreda DA, et al. 2014. Are meaningful use stage 2 certified EHRs ready for interoperability? Findings from the SMART C-CDA Collaborative. J Am Med Inform Assoc. 21:1060-1068. doi:10.1136/amiajnl-2014-002883. PMID: 24970839.

Dolin RH, Alschuler L, Beebe C, et al. 2001. The HL7 Clinical Document Architecture. J Am Med Inform Assoc. 8:552-569. doi:10.1136/jamia.2001.0080552. PMID: 11687563.

Federal Register. 2020. 21st Century Cures Act: Interoperability, Information Blocking, and the ONC Health IT Certification Program. https://www.federalregister.gov/documents/2020/05/01/2020-07419/21st-century-cures-act-interoperability-information-blocking-and-the-onc-health-it-certification. Accessed July 21, 2022.

FitzHenry F, Resnic FS, Robbins SL, et al. 2015. Creating a common data model for comparative effectiveness with the Observational Medical Outcomes Partnership. Appl Clin Inform. 06:536-547. doi:10.4338/ACI-2014-12-CR-0121. PMID: 26448797.

Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. 2014. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 21:578-582. doi:10.1136/amiajnl-2014-002747. PMID: 25006148.

Holmes JH, Brown J, Hennessy S, et al. 2008. Developing a distributed research network to conduct population-based studies and safety surveillance. AMIA Annu Symp Proc. Nov 6:973. PMID: 18999251.

Hripcsak G, Duke JD, Shah NH, et al. 2015. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform. 216:574-578. PMID: 26262116.

Interoperability Standards Advisory (ISA). https://www.healthit.gov/isa/. Accessed July 21, 2022.

Maloney N, Heider AR, Rockwood A, Singh R. 2014. Creating a connected community: lessons learned from the Western New York Beacon Community. EGEMS (Wash DC). 2:1091. PMID: 25848618.

Murphy SN, Weber G, Mendis M, et al. 2010. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 17:124-130. doi:10.1136/jamia.2009.000893. PMID: 20190053.

Osterman TJ, Terry M, Miller RS. 2020. Improving Cancer Data Interoperability: The Promise of the Minimal Common Oncology Data Elements (mCODE) Initiative. JCO Clinical Cancer Informatics. 993–1001. doi:10.1200/CCI.20.00059. PMID: 33136433.

Rocca M, Asare A, Esserman L, Dubman S, Gordon G. 2019. Source Data Capture From EHRs: Using Standardized Clinical Research Data. US Food and Drug Administration. https://www.fda.gov/media/132130/download. Accessed July 21, 2022.

Ross TR, Ng D, Brown JS, et al. 2014. The HMO Research Network Virtual Data Warehouse: a public data model to support collaboration. eGEMs. 2:2. doi:10.13063/2327-9214.1049. PMID: 25848584.

Steiner JF, Paolino AR, Thompson EE, Larson EB. 2014. Sustaining research networks: the twenty-year experience of the HMO Research Network. EGEMS (Wash DC). 2(2):1067. PMID: 25848605.

Vogt TM, Lafata JE, Tolsma DD, Greene SM. 2004. The role of research in integrated health care systems: the HMO Research Network. Perm J. 8:10-17. PMID: 26705313.

Version History

December 3, 2025: Updated hyperlinks (changes made by G. Uhlenbrauck).

October 14, 2022: Made minor updates to the subsection “Application Programming Interfaces (APIs),” made nonsubstantive changes to the text, added Seils as a contributing editor, and reordered the section within the chapter as part of the annual content update (changes made by D. Seils).

January 18, 2021: Added two sentences on mCODE (changes made by K. Staman).

Published August 25, 2020

Rethinking Clinical Trials

A Living Textbook of Pragmatic Clinical Trials
