Assessing Fitness for Use of Real-World Data Sources


Section 1

Introduction

Many of the real-world data sources used in clinical research are considered "secondary" sources, because the data were collected for a purpose other than the research project for which they are being used (eg, billing or clinical care). This contrasts with primary data sources, where the data are captured specifically for the research study in question. (See the Acquiring Real-World Data chapter of the Living Textbook for more information about the different types of real-world data.) Consequently, before a real-world data source can be used in an analysis, one must understand its characteristics and limitations to determine whether it can appropriately answer the question at hand. A given dataset may be well suited to answer a research question for a specific patient population over a certain time period, but not suitable for a different population or time frame. For this reason, data must be assessed to determine whether they are fit for their intended use or purpose prior to their use in research settings. This chapter describes several approaches that can be used to facilitate such assessments.



Version History

August 26, 2022: Minor revisions as part of annual update (changes made by K. Staman).

Published August 25, 2020

Acquiring Real-World Data


Section 8

Methods of Access

There are several approaches to obtaining real-world data. Real-world data may be obtained directly from a site (such as a healthcare organization) or data holder, via a distributed research network, or directly from patients. Depending on the data needed, real-world data may be provisioned into a protected computing environment, often referred to as an enclave. We detail the trade-offs between the different approaches below.

Direct From Sites or Data Holders

Healthcare organizations, particularly those that participate in research, can often provide data in a variety of formats, which need to be aligned with the requirements of the project. Many other data holders, such as those that maintain disease or device registries, have similar capabilities. Examples include:

  • Clinician-generated reports: Most electronic health records (EHRs) provide functionality that allows clinicians to generate on-demand reports geared toward answering care management questions (for example, who received a flu shot in the last 30 days, who was in the emergency department last night). Creating these reports has relatively low cost, and the reports typically take seconds to run, with real-time results. The drawback is that they have limited ability to include longitudinal results; they are geared toward "most recent" values (most recent lab result, date of last test, etc.). Adoption and uptake also vary. Clinicians may not realize that they have the capability to generate such reports, as training and support differ by healthcare system.
  • Database reports: Almost every EHR includes a reporting database and/or data warehouse. Extracts can be programmed against these repositories, and it may be possible to reuse the same query, or a large portion of it, across sites that use the same vendor. Once a query is developed, it is usually possible to automate the production and delivery of the data. This approach may not be feasible for smaller sites or sites without local information technology support, and complex queries will rely heavily on the skill set and knowledge of the local analyst responsible for programming. This can lead to variation in quality across sites.
  • Common data model (CDM) extracts: Many academic medical centers that participate in distributed research networks may have their local source data transformed into a CDM. After the cohort or study population has been defined, local analysts can generate extracts of the relevant tables and fields.
  • Application programming interface (API): Healthcare organizations (both healthcare systems and payers) are working to make their data available via API standards like Fast Healthcare Interoperability Resources (FHIR). Most sites have limited experience delivering data in this format, which means they may not yet have robust processes for allowing access from external parties.
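To make the API option above concrete, the sketch below parses a heavily abbreviated FHIR Patient resource. The JSON structure follows the FHIR R4 Patient definition, but the values and the helper function are hypothetical, and a real resource carries many more elements:

```python
import json

# A minimal, hypothetical FHIR R4 Patient resource, abbreviated for illustration.
patient_json = """
{
  "resourceType": "Patient",
  "id": "example-123",
  "name": [{"family": "Smith", "given": ["Jane"]}],
  "gender": "female",
  "birthDate": "1970-04-01"
}
"""

def summarize_patient(resource: dict) -> dict:
    """Extract a few commonly used demographic fields from a FHIR Patient."""
    name = resource.get("name", [{}])[0]
    return {
        "id": resource.get("id"),
        "family_name": name.get("family"),
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }

patient = json.loads(patient_json)
summary = summarize_patient(patient)
print(summary)
```

In practice these resources would be retrieved over HTTPS from an EHR's FHIR endpoint after an authorization step; the parsing logic is the same either way.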

Distributed Research Networks

Study teams can partner directly with distributed research networks to obtain data to support their trials. The process to develop and distribute a query within these networks is usually straightforward, though there are often governance processes that must be followed. One query can be distributed in order to retrieve results from the whole network (or participating sites). An added benefit is that most distributed research networks perform some level of curation or quality assessment on data within their networks. The major drawbacks to this approach are that the data elements of interest for the study may not be in the CDM of the network, which means they must be obtained through other means (for example, added to the CDM or abstracted through chart review), and that large studies will likely need to go beyond a single distributed research network, meaning study teams will need to deal with data in multiple formats.
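The distributed-query pattern can be sketched in a few lines. Everything here is illustrative (invented site names, diagnosis codes, and table layout); the point is that one query function runs locally at each site against CDM-formatted data, and only aggregate counts are returned to the coordinating center:

```python
# Hypothetical CDM-style diagnosis records at three network sites.
# In a real distributed research network, the query code travels to each
# site and only aggregate results leave the site.
site_data = {
    "site_a": [{"patient_id": 1, "dx": "E11.9"}, {"patient_id": 2, "dx": "I10"}],
    "site_b": [{"patient_id": 7, "dx": "E11.9"}, {"patient_id": 8, "dx": "E11.9"}],
    "site_c": [{"patient_id": 3, "dx": "I10"}],
}

def count_patients_with_dx(records, dx_code):
    """The 'query' each site runs locally; returns an aggregate count only."""
    return len({r["patient_id"] for r in records if r["dx"] == dx_code})

# The coordinating center distributes one query and pools the site-level counts.
results = {site: count_patients_with_dx(rows, "E11.9") for site, rows in site_data.items()}
total = sum(results.values())
print(results, total)
```

Because every site conforms to the same data model, the identical query runs unmodified everywhere, which is the core efficiency of the approach.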

Direct From Patient

While patients in the United States have long had the right to receive their health records from providers under the Health Insurance Portability and Accountability Act (HIPAA), it was not always possible to receive them in a machine-readable, electronic format (for example, not a scanned PDF). Spurred by US federal government efforts over the past decade to promote interoperability and patients' access to their own healthcare data, obtaining data directly from patients has become increasingly viable. Certified EHRs historically have provided patients with the ability to download structured documents, which contain information about the most recent visit and some longitudinal values.

More recent regulations will require that EHRs provide data via FHIR APIs, which should streamline the process somewhat, especially given that technology companies such as Apple have made it easy for users to download their EHR data into their local Health app. Once the records have been downloaded, users can decide whether to share them with other applications, including those designed for research. The Centers for Medicare & Medicaid Services (CMS) has enabled similar workflows through its Blue Button 2.0 initiative for Medicare beneficiaries, as well as for CMS-regulated payers, including those that support Medicare Advantage, Medicaid, CHIP, and Qualified Health Plans (QHPs) on the federally facilitated exchanges.

There are some drawbacks to this approach: (1) the "completeness" of the implementation of the standard varies by site and/or EHR vendor (D’Amore et al 2014); (2) study teams must broker access through a secondary app such as Apple Health, Hugo, or 1upHealth; and (3) if a patient receives care in multiple healthcare systems, they must make multiple requests to receive all of their records. Despite this, there may be studies that can benefit from such an approach, for instance, a study on a rare disease with a small number of patients who receive care across multiple healthcare systems. Negotiating agreements with multiple systems is time-consuming, so it may be faster to engage directly with patients.

The research community has much to learn about best practices in engaging with patients to obtain data in this manner (including how to encourage high response rates and how to ensure access is provided for the life of the study), but it remains an encouraging possibility.

Protected Computing Environments

Most academic medical centers and many healthcare organizations struggle with the need to provide access to clinical data for research while protecting sensitive data from EHRs and other systems. For this reason, many organizations set up protected computing environments, or limited-access platforms, where only individuals with the proper permissions can access data in a protected, secure environment that is separated from the rest of the network used for clinical and/or research purposes. (Another term for such a platform is a data or computing enclave.) In such an environment, users generally do not have the ability to download data to their local machines, access is provided via a remote or virtual computer, and analyses are contained within the protected space. In order to remove data from such an environment, users must go through an honest broker process, where the content is reviewed to ensure that it can be removed from the secure environment. For example, users may be able to transfer aggregate counts without additional approvals but may need special permission to remove patient-level records.
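Part of the honest broker review can be automated. The sketch below applies small-cell suppression to aggregate counts before they leave the enclave; the threshold of 11 is a common convention but actual policies vary by organization, and the function name is invented:

```python
MIN_CELL_SIZE = 11  # example threshold; actual policies vary by organization

def suppress_small_cells(counts: dict) -> dict:
    """Mask counts below the threshold so they cannot leave the enclave."""
    return {k: (v if v >= MIN_CELL_SIZE else "<11") for k, v in counts.items()}

# Hypothetical cohort counts a user wants to export from the enclave.
cohort_counts = {"diabetes": 240, "rare_condition": 4, "hypertension": 512}
print(suppress_small_cells(cohort_counts))
```

Checks like this complement, rather than replace, human review: an honest broker would still examine the output for indirect identification risks before release.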

Examples of organizations that use protected computing environments include CMS and the Department of Veterans Affairs, and such environments exist within many academic medical centers as well, such as PACE (Protected Analytics Computing Environment). When used as part of a pragmatic clinical trial, prospective study data can be uploaded into the computing enclave, linked with the relevant records stored there, and used in the resulting analysis. Summary or analysis datasets can then be downloaded by the study team. Despite these extra steps, there can be benefits to using a protected computing environment. In some cases, it is the only way to gain access to a particular data source; in others, the data may be refreshed or updated more frequently, since it is not necessary to generate stand-alone flat files. As concerns grow about the organizational risk of sharing sensitive patient information, these protected environments are likely to increase in prevalence.

Real-world data have particular relevance to pragmatic clinical trials, as they generally represent data collected or generated in the course of routine operations. There are many different types of real-world data and different approaches by which to obtain them. However, real-world data sources are not interchangeable, and any given source may not be applicable for a specific study. As a result, care must be taken to ensure that the real-world data source aligns with the study in question and that the data are obtained in a format that supports the proposed analysis.


REFERENCES


D'Amore JD, Mandel JC, Kreda DA, et al. 2014. Are meaningful use stage 2 certified EHRs ready for interoperability? Findings from the SMART C-CDA Collaborative. J Am Med Inform Assoc. 21(6):1060-1068. doi: 10.1136/amiajnl-2014-002883. PMID: 24970839.


Version History

December 3, 2025: Updated hyperlinks (changes made by G. Uhlenbrauck).

October 14, 2022: Made nonsubstantive changes to the text, added images to the Resources sidebar, added Seils as a contributing editor, and reordered the section within the chapter as part of the annual content update (changes made by D. Seils).

Published August 25, 2020

Acquiring Real-World Data


Section 3

Data Formats

Real-world data are often stored in a variety of systems, so project teams should be prepared to receive and process data in multiple formats. This section describes several of the most common formats.

Flat Files

The most common data format used when transferring information to study teams is a simple flat file (for example, Microsoft Excel spreadsheet, comma-delimited text file, database extract). Flat files are a least-common-denominator approach to obtaining data from sites, in that they are easy for sites to generate, but there may be little standardization across data fields. For example, in extracts of demographic data from electronic health records (EHRs), one site may represent the gender values male and female as "M" and "F" while another may represent them as "1" and "0." The study team must harmonize these data before using them for analysis. When obtaining flat files, the study team should request that the files contain headers to indicate the content of each column, as well as a data dictionary that provides a description of each field and the possible values.
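A minimal sketch of this harmonization step, using the gender example above (site names, file contents, and mappings are all hypothetical and would come from each site's data dictionary in practice):

```python
import csv
import io

# Hypothetical extracts from two sites that code gender differently,
# exactly as described above: one uses "M"/"F", the other "1"/"0".
site_a_csv = "patient_id,gender\n101,M\n102,F\n"
site_b_csv = "patient_id,gender\n201,1\n202,0\n"

# Per-site mappings to a harmonized value set.
gender_maps = {
    "site_a": {"M": "male", "F": "female"},
    "site_b": {"1": "male", "0": "female"},
}

def harmonize(raw_csv: str, site: str) -> list:
    """Read a site's flat file and map its codes to the harmonized values."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    for row in rows:
        row["gender"] = gender_maps[site][row["gender"]]
        row["source_site"] = site  # provenance, useful for later debugging
    return rows

combined = harmonize(site_a_csv, "site_a") + harmonize(site_b_csv, "site_b")
print(combined)
```

Keeping a `source_site` provenance column, as sketched here, makes it much easier to trace harmonization errors back to the originating extract.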

Common Data Models

Data from different sites can be extracted and transformed into a common representation using a common data model (CDM). CDMs are typically developed by distributed research networks or multicenter consortia to ensure that participating sites represent their data in a consistent manner (for example, everyone represents male and female as "M" and "F"). Each site maps its data to a target data element or value set within the CDM. This allows an analyst to execute a single query across the network with little to no modification because all the data have a consistent representation (Holmes et al 2008; Brown et al 2009).

CDMs can differ based on the type and specificity of data included and on the mapping rules for transforming local data to the CDM representation. Depending on the data needs and design philosophies of the distributed research network, a CDM may prioritize representational efficiency over interpretability, or it may require that data within a given domain be transformed to a specific vocabulary or terminology rather than left in the terminology used in the source system (such as converting all diagnosis codes from the International Classification of Diseases [ICD] to SNOMED). CDMs are updated iteratively, with new versions released to add fields or tables or to accommodate the evolution of the network's research priorities. Many CDM specifications also include guidance for sites that wish to extend the model to support their own studies. The table below describes some of the CDMs used in healthcare research and the primary research networks that use them.

Table. Common Data Models Used for Healthcare Research in the United States and Their Design Characteristics

Virtual Data Warehouse (Ross et al 2014)
  Primary network: Health Care Systems Research Network (HCSRN) (Vogt et al 2004; Steiner et al 2014)
  Focus: Population-based research
  Design characteristics: One of the original CDMs from a distributed research network. Standardized table structure for common EHR/claims domains (diagnoses, procedures, dispensing, etc.). Curated value sets for most data elements.

Sentinel (Curtis et al 2012; Curtis et al 2014)
  Primary network: Sentinel
  Focus: Medical product safety surveillance
  Design characteristics: Adapted from the Virtual Data Warehouse to support public health surveillance, primarily through claims data partners. Data are represented as they exist in source systems (eg, diagnoses as ICD-10).

PCORnet (Califf 2014; Curtis et al 2014; Fleurence et al 2014)
  Primary network: PCORnet
  Focus: Comparative effectiveness research, patient-centered outcomes research
  Design characteristics: Based on the Sentinel CDM with additional tables to support data from EHRs. Data are represented as they exist in source systems. Value sets are aligned with national standards where possible.

Observational Medical Outcomes Partnership (OMOP) (FitzHenry et al 2015)
  Primary network: Observational Health Data Sciences & Informatics (OHDSI)[a] (Hripcsak et al 2015)
  Focus: Comparative effectiveness research, surveillance, risk prediction
  Design characteristics: Data for a given domain are mapped to a specified terminology (eg, diagnoses as SNOMED); records in one domain may result in data generated in a second (eg, a history-of-transplant diagnosis generates a transplant procedure record). Maintains centralized vocabulary tables with customized mappings between terminologies.

Informatics for Integrating Biology & the Bedside (i2b2) (Murphy et al 2010)
  Primary network: Accrual for Clinical Trials (ACT)
  Focus: Cohort identification
  Design characteristics: Fact table structure that stores most data in entity–attribute–value format; queries are constructed based on ontologies/hierarchies stored in a concept table.

[a] OHDSI does not operate as a governed network but instead functions as a community of interested researchers.

There are some downsides to CDMs. The effort required to populate them is nontrivial, requiring dedicated infrastructure funding in many cases. Even then, it may be outside the purview of many nonacademic medical centers. Another drawback is that converting local data to a harmonized value set can result in a loss of information (for example, by mapping several dozen internal encounter types to a constrained list of 5 to 10 categories). This drawback can be mitigated somewhat by allowing sites to store the unmapped "raw" values alongside the harmonized values, but the raw values are more problematic to use in a distributed analysis. Despite these limitations, at present a CDM-based approach is often the best option for pulling together heterogeneous data across multiple centers, particularly from those that have prior experience with research networks or other multicenter consortia.
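The raw-plus-harmonized approach can be sketched as follows. The local encounter type strings and two-letter target codes here are invented for illustration (they loosely echo the style of CDMs such as PCORnet but should not be read as any actual specification):

```python
# Hypothetical mapping of a site's many local encounter types to a
# constrained CDM value set, keeping the unmapped "raw" value alongside
# the harmonized one, as described above.
local_to_cdm = {
    "OFFICE VISIT - NEW": "AV",          # AV = ambulatory visit
    "OFFICE VISIT - ESTABLISHED": "AV",
    "ER VISIT": "ED",                    # ED = emergency department
    "TELEPHONE FOLLOW-UP": "OT",         # OT = other
}

def to_cdm_encounter(local_type: str) -> dict:
    """Harmonize one encounter type, preserving the site's raw value."""
    return {
        "enc_type": local_to_cdm.get(local_type, "UN"),  # UN = unknown
        "raw_enc_type": local_type,  # retained for site-local use
    }

print(to_cdm_encounter("ER VISIT"))
```

Distributed queries would use only the harmonized `enc_type` column; the `raw_enc_type` column stays available for local validation and debugging.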

Structured Documents

Other formats for transferring data include documents that use extensible markup language (XML) or HL7 messaging standards, though these are becoming less commonly used for research. XML is a markup language that defines rules for encoding data, with the resulting document following a hierarchical tree-like structure. The rules provide a way to determine whether the data in the document conform to specifications. XML is used as the standard for the HL7 Clinical Document Architecture (CDA) (Dolin et al 2001), which supports a number of defined document templates. The CDA was promoted as a standard for interoperable data exchange between healthcare organizations as part of the "meaningful use" legislation (Blumenthal and Tavenner 2010; D’Amore et al 2014), and the CDA templates were implemented by many EHR vendors as a way of providing care summary documents (Continuity of Care Documents) directly to patients. Many of the personal health records that arose in the 2010s used these summary documents as input.
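For a sense of how such documents are processed, the sketch below pulls a patient name and birth date out of a drastically simplified, hypothetical CDA-style fragment using Python's standard XML parser. Real CDA documents declare the HL7 v3 namespace and are far more deeply nested:

```python
import xml.etree.ElementTree as ET

# A drastically simplified, hypothetical CDA-style fragment for illustration.
doc = """
<ClinicalDocument>
  <recordTarget>
    <patient>
      <name><given>Jane</given><family>Smith</family></name>
      <birthTime value="19700401"/>
    </patient>
  </recordTarget>
</ClinicalDocument>
"""

root = ET.fromstring(doc)
patient = root.find("./recordTarget/patient")
family = patient.findtext("name/family")
birth = patient.find("birthTime").get("value")
print(family, birth)
```

Because XML documents are validated against a schema, a processing pipeline can reject malformed documents up front rather than discovering problems downstream during analysis.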

HL7 messages are used to transmit information between disparate clinical information systems, either within or across organizations. These messages are based on version 2.x of HL7 and are line-based, with content embedded in delimited segments. One common use case is for the transmission of laboratory results from a laboratory information system to an EHR. Another example is a health information exchange that uses HL7 messages to transmit results between members (Maloney et al 2014). Version 2.x is widely adopted across the healthcare industry, though the standard permits local variation, which can result in implementation differences across organizations. However, if the relevant study information is available from an organization via an HL7 feed, that feed may be a viable mechanism of transmission. Receiving and processing HL7 messages in real time typically requires specialized "listener" software; however, if received in bulk, such as daily or weekly digests of messages, they can be processed through standard approaches for text parsing.
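Bulk HL7 v2 processing reduces to line-based text parsing, as the sketch below shows for a shortened, hypothetical lab-result (ORU) message. Field positions follow the v2 OBX segment layout, but the message content is invented:

```python
# A shortened, hypothetical HL7 v2 lab-result message. Segments are
# line-based; fields within a segment are pipe-delimited.
message = (
    "MSH|^~\\&|LAB|HOSP|EHR|HOSP|202001011200||ORU^R01|123|P|2.3\r"
    "PID|1||12345||SMITH^JANE\r"
    "OBX|1|NM|2345-7^GLUCOSE^LN||105|mg/dL|70-99|H\r"
)

def parse_obx_segments(msg: str) -> list:
    """Pull the observation code, value, and units out of each OBX segment."""
    results = []
    for segment in msg.strip().split("\r"):
        fields = segment.split("|")
        if fields[0] == "OBX":
            results.append({
                "code": fields[3].split("^")[0],  # eg, a LOINC code
                "value": fields[5],
                "units": fields[6],
            })
    return results

print(parse_obx_segments(message))
```

Because the standard permits local variation, field contents and optional segments differ across organizations, so a parser like this typically needs per-site configuration.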

Both the XML and HL7 formats have had somewhat less uptake in the research community than flat files or CDMs, because the interfaces used to transmit these data are primarily for clinical operations and require more expertise to receive and process. More importantly, they are beginning to be superseded by application programming interfaces (APIs), a more "modern" and web-friendly approach to data exchange. However, XML and HL7 are still widely used for data exchange for operational purposes, and historical data that have already been encoded in this manner may not be coded in newer formats, meaning some studies may need to retain the ability to process such data.

Application Programming Interfaces (APIs)

There is considerable enthusiasm for APIs, which provide a standardized interface that allows data to be requested and returned via a series of function calls, often in JavaScript Object Notation (JSON). Commonly used with mobile technologies such as smartphones and tablets, APIs allow discrete amounts of information to be queried securely. This is a particular advantage over prior attempts to support data interoperability, which relied on the exchange of summary documents, as noted above. APIs are particularly well suited for use in decision support or precision medicine algorithms, in which a small number of inputs are used to make a prediction or recommendation. They can also be used to prepopulate specific fields in case report forms (such as demographic intake forms), as described in more detail on the HL7 Structured Data Capture webpage and by Rocca et al (2019). An added benefit is that queries that rely on APIs can be reused across all sites that support a given API.

The overall adoption of APIs in healthcare has been relatively slow, but as part of its mandate to implement provisions of the 21st Century Cures Act, the Office of the National Coordinator for Health Information Technology issued a final rule requiring that all certified EHRs support APIs using the Fast Healthcare Interoperability Resources (FHIR) standard and be able to exchange the information specified in the United States Core Data for Interoperability (USCDI; Federal Register 2020), which should drive uptake across the industry. Outside the United States, implementation of the International Patient Summary will allow exchanges of basic patient data via standards like FHIR. Other specialty groups are defining core data elements that are captured and exchanged using FHIR. For example, the Minimal Common Oncology Data Elements (mCODE) is an open-source, common data language for cancer that facilitates transmission of data. The FHIR implementation guide for mCODE was presented at the American Society of Clinical Oncology meeting in June 2019, and pilot interventions are underway to test this infrastructure for sharing patient data for cancer research (Osterman et al 2020).

With regard to health insurance plans and administrative claims data, the Centers for Medicare & Medicaid Services (CMS) announced that CMS-regulated payers will also need to provide patient access via FHIR APIs (CMS 2020). Furthermore, a number of "accelerators" are developing extensions to FHIR to better support activities related to research, value-based care, social determinants of health, and more, including HL7's Gravity, Argonaut, and Da Vinci Projects, and the Vulcan Accelerator.

Although APIs simplify data interchange across sites, there are a few potential downsides. If the underlying records in the source system are not captured in the same format as the API (for example, lab records coded to a local terminology instead of LOINC), some level of mapping is needed to convert them to the appropriate standard. This activity can result in mapping errors, particularly as APIs are deployed for the first time. Completeness may also be a concern, if a healthcare system maps only a portion of its historical results or limits what is available via the interface. Therefore, it will be important to validate any information received via an API and compare the results against other reference datasets, such as CDM extracts, if available. Another potential drawback of APIs is the volume of data that can be retrieved as part of a given request. APIs are typically designed to send or receive data at an observation level (for example, to retrieve the results of a single laboratory test or provide a list of a single patient’s current medications). As a result, trying to retrieve all data for a study population could result in millions of individual API calls. In the FHIR development community, efforts are underway to address this shortcoming with bulk API approaches. Until these solutions are widely implemented, care must be taken to tailor API requests so they do not overwhelm available resources.
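The suggested validation against a reference dataset can be as simple as comparing per-patient record counts. In the sketch below, the counts, patient identifiers, and tolerance parameter are all hypothetical:

```python
# Hypothetical completeness check: compare per-patient lab counts
# retrieved via an API against a reference CDM extract, as suggested above.
api_results = {"pt1": 24, "pt2": 0, "pt3": 18}   # labs returned via the API
cdm_extract = {"pt1": 24, "pt2": 31, "pt3": 18}  # labs in the CDM extract

def completeness_gaps(api_counts, reference_counts, tolerance=0.0):
    """Flag patients whose API record count falls short of the reference."""
    gaps = {}
    for pt, ref in reference_counts.items():
        got = api_counts.get(pt, 0)
        if got < ref * (1 - tolerance):
            gaps[pt] = {"api": got, "reference": ref}
    return gaps

print(completeness_gaps(api_results, cdm_extract))
```

A nonzero `tolerance` allows for benign discrepancies (eg, refresh-timing differences between the two sources) while still catching patients whose history was only partially mapped.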


Resources

In this video, Dr. Lesley Curtis explores common data models (CDMs). Topics include how CDMs work, customizing CDMs for specific research questions, challenges and benefits of CDMs, and the future of CDMs.

REFERENCES


Blumenthal D, Tavenner M. 2010. The "meaningful use" regulation for electronic health records. N Engl J Med. 363:501-504. doi:10.1056/NEJMp1006114. PMID: 20647183.

Brown J, Holmes J, Maro J, et al. 2009. Design Specifications for Network Prototype and Cooperative To Conduct Population-Based Studies and Safety Surveillance. Effective Health Care Program, Agency for Healthcare Research and Quality. https://effectivehealthcare.ahrq.gov/products/distributed-network-safety/research. Accessed July 21, 2022.

Califf RM. 2014. The Patient-Centered Outcomes Research Network: a national infrastructure for comparative effectiveness research. N C Med J. 75:204-210. doi:10.18043/ncm.75.3.204. PMID: 24830497.

Centers for Medicare & Medicaid Services. 2022. Interoperability and Patient Access final rule. https://www.cms.gov/Regulations-and-Guidance/Guidance/Interoperability/index. Accessed July 21, 2022.

Curtis LH, Brown J, Platt R. 2014. Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Aff (Millwood). 33:1178-1186. doi:10.1377/hlthaff.2014.0121. PMID: 25006144.

Curtis LH, Weiner MG, Boudreau DM, et al. 2012. Design considerations, architecture, and use of the Mini-Sentinel distributed data system: use of the Mini-Sentinel distributed database. Pharmacoepidemiol Drug Saf. 21:23-31. doi:10.1002/pds.2336. PMID: 22262590.

D'Amore JD, Mandel JC, Kreda DA, et al. 2014. Are meaningful use stage 2 certified EHRs ready for interoperability? Findings from the SMART C-CDA Collaborative. J Am Med Inform Assoc. 21:1060-1068. doi:10.1136/amiajnl-2014-002883. PMID: 24970839.

Dolin RH, Alschuler L, Beebe C, et al. 2001. The HL7 Clinical Document Architecture. J Am Med Inform Assoc. 8:552-569. doi:10.1136/jamia.2001.0080552. PMID: 11687563.

Federal Register : 21st Century Cures Act: Interoperability, Information Blocking, and the ONC Health IT Certification Program. https://www.federalregister.gov/documents/2020/05/01/2020-07419/21st-century-cures-act-interoperability-information-blocking-and-the-onc-health-it-certification. Accessed July 21, 2022.

FitzHenry F, Resnic FS, Robbins SL, et al. 2015. Creating a common data model for comparative effectiveness with the Observational Medical Outcomes Partnership. Appl Clin Inform. 6:536-547. doi:10.4338/ACI-2014-12-CR-0121. PMID: 26448797.

Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. 2014. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 21:578-582. doi:10.1136/amiajnl-2014-002747. PMID: 25006148.

Holmes JH, Brown J, Hennessy S, et al. 2008. Developing a distributed research network to conduct population-based studies and safety surveillance. AMIA Annu Symp Proc. Nov 6:973. PMID: 18999251.

Hripcsak G, Duke JD, Shah NH, et al. 2015. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform. 216:574-578. PMID: 26262116.

Interoperability Standards Advisory (ISA). https://www.healthit.gov/isa/. Accessed July 21, 2022.

Maloney N, Heider AR, Rockwood A, Singh R. 2014. Creating a connected community: lessons learned from the Western New York Beacon Community. EGEMS (Wash DC). 2:1091. PMID: 25848618.

Murphy SN, Weber G, Mendis M, et al. 2010. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 17:124-130. doi:10.1136/jamia.2009.000893. PMID: 20190053.

Osterman TJ, Terry M, Miller RS. 2020. Improving Cancer Data Interoperability: The Promise of the Minimal Common Oncology Data Elements (mCODE) Initiative. JCO Clinical Cancer Informatics. 993–1001. doi:10.1200/CCI.20.00059. PMID: 33136433.

Rocca M, Asare A, Esserman L, Dubman S, Gordon G. 2019. Source Data Capture From EHRs: Using Standardized Clinical Research Data. US Food and Drug Administration. https://www.fda.gov/media/132130/download. Accessed July 21, 2022.

Ross TR, Ng D, Brown JS, et al. 2014. The HMO Research Network Virtual Data Warehouse: a public data model to support collaboration. eGEMs. 2:2. doi:10.13063/2327-9214.1049. PMID: 25848584.

Steiner JF, Paolino AR, Thompson EE, Larson EB. 2014. Sustaining research networks: the twenty-year experience of the HMO Research Network. EGEMS (Wash DC). 2(2):1067. PMID: 25848605.

Vogt TM, Lafata JE, Tolsma DD, Greene SM. 2004. The role of research in integrated health care systems: the HMO Research Network. Perm J. 8:10-17. PMID: 26705313.


Version History

December 3, 2025: Updated hyperlinks (changes made by G. Uhlenbrauck).

October 14, 2022: Made minor updates to the subsection “Application Programming Interfaces (APIs),” made nonsubstantive changes to the text, added Seils as a contributing editor, and reordered the section within the chapter as part of the annual content update (changes made by D. Seils).

January 18, 2021: Added two sentences on mCODE (changes made by K. Staman).

Published August 25, 2020

Archived on October 6, 2025.

Acquiring Real-World Data

Section 7

Gaining Permission to Use Real-World Data – ARCHIVED

In the United States, using patient data for research generally requires institutional review board (IRB) approval if the study is a clinical investigation that supports applications for research or marketing permits for products regulated by the US Food and Drug Administration (21 CFR Parts 50 and 56) or, more broadly, research involving human subjects conducted, supported, or otherwise subject to regulation by any federal department or agency (45 CFR part 46 [the Common Rule]). Recent changes to the Common Rule have made it easier to reuse data collected as part of routine healthcare operations for research purposes, including electronic health record (EHR) data, with additional categories of studies now exempt from IRB oversight. However, to publish research involving data from human subjects, virtually all peer-reviewed journals require that the study must have been reviewed and approved by an IRB or ethics board or have received a determination that the research is exempt from oversight or is not human subjects research (Zozus et al 2015). In the latter case, investigators are typically not able to make the determination on their own, so investigators are advised to locate an appropriate IRB or ethics board before embarking on research using patient data, even if their institution does not require it.

Data From Healthcare Organizations

Most healthcare organizations have procedures in place that define the permissible internal uses of the data they collect and store. Routine uses typically fall into the categories of treatment, payment, or operations. Examples include data access for members of the care team, information exchange for care transitions, data use in quality improvement projects, and administrative reporting for organizational management. Facilities that conduct research also have procedures in place for secondary use of these health data. Secondary use of data for research is governed by federal regulations and by procedures established by the facility's IRB or research compliance office.

Additional contractual agreements and regulatory compliance are required when investigators want to use data from institutions with which they are not directly associated (for example, a university researcher who wants to use data from local community hospitals). This will almost certainly be the case when using healthcare data as part of a multicenter pragmatic clinical trial. The Health Insurance Portability and Accountability Act (HIPAA) requires that covered entities and their business associates release protected health information (PHI) only in certain controlled situations, including release to healthcare reimbursement departments or operations, to individual patients, to regulatory authorities, for national priority purposes, with authorization from the individual, and as a limited dataset. The last 2 mechanisms are the ones primarily used when creating datasets for research that contain PHI.

Covered entities include health plans, healthcare clearinghouses, and healthcare providers who electronically transmit any health information in connection with transactions for which the US Department of Health and Human Services (DHHS) has adopted standards.

Under HIPAA, research datasets are considered either deidentified, a limited dataset, or a dataset containing more PHI than is allowed in a limited dataset. Of the 3 types, deidentified datasets come with the fewest restrictions. Research using deidentified data is not considered to be human subjects research, and these datasets can often be obtained without additional usage agreements. DHHS has outlined 2 approaches for covered entities to follow in creating deidentified datasets in accordance with HIPAA: the Safe Harbor method and the Expert Determination Method.

The Safe Harbor method involves removing all 18 types of HIPAA identifiers before the dataset is shared with an outside party; the Expert Determination Method involves verifying that the resulting dataset is statistically deidentified. Because of its simplicity, most deidentified datasets have historically been produced using the Safe Harbor approach. Given the growth in the use of real-world data for research and analytics in the life science and healthcare industries, however, use of the Expert Determination Method has increased, and several companies now provide privacy frameworks and other services that can be used to attest that a dataset is deidentified. The Expert Determination Method is often used in projects that involve privacy-preserving record linkage: a dataset containing a series of hashed, encrypted PHI elements along with other deidentified clinical variables is demonstrated to be statistically deidentified, and multiple centers can then generate similar datasets with the same privacy-preserving methods and link matching records to generate a longitudinal view of a patient's history.
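
As a concrete illustration of the tokenization step in privacy-preserving record linkage, the sketch below derives a linkage token by applying a keyed hash to normalized identifiers. This is a simplified, assumption-laden example, not a production method: real implementations rely on vendor-specific tokenization, salting, and key management, and the secret key, identifier fields, and normalization rules here are all invented for illustration.

```python
import hashlib
import hmac

# Hypothetical shared secret agreed upon by all participating sites;
# in practice this key is managed by a vendor or honest broker.
SECRET_KEY = b"shared-project-key"

def normalize(value):
    """Reduce trivial formatting differences before hashing."""
    return " ".join(value.strip().lower().split())

def make_token(first_name, last_name, birth_date):
    """Derive an irreversible linkage token from identifiers.

    The clear-text identifiers never leave the site; only the keyed
    hash (token) is included in the shared dataset.
    """
    message = "|".join(normalize(v) for v in (first_name, last_name, birth_date))
    return hmac.new(SECRET_KEY, message.encode("utf-8"), hashlib.sha256).hexdigest()

# Two sites hashing the same patient produce the same token,
# so their records can be linked without exchanging PHI.
site_a = make_token("Mary", "Smith", "1960-04-02")
site_b = make_token("  MARY ", "smith", "1960-04-02")
assert site_a == site_b
```

Because the hash is keyed and one-way, holders of the shared dataset cannot recover the underlying identifiers from the token, which is what an expert determination would need to verify statistically.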

A limited dataset contains more identifiers than a deidentified dataset, including dates and some elements of a patient's address (such as zip code and state), but the other HIPAA identifiers must still be removed or masked. To receive a dataset that contains more PHI than is allowable in a limited dataset, investigators are almost always required to obtain patient consent. For both limited datasets and datasets containing more PHI than allowed in a limited dataset, the recipient must agree to a data use agreement (DUA) that describes the purpose of the research and the proposed uses for the data. The DUA will also include language about securing the data and, in most cases, will prohibit re-identification of patients or linkage to other data from the patients (if patient consent is obtained and a project explicitly involves linkage, this language would not apply). Thus, use of healthcare data from organizations requires both a contractual agreement with the organization and compliance with HIPAA with respect to use and disclosure of the data. Recipients of deidentified datasets may also need to agree to a DUA, though with fewer restrictions (for example, language limited to securing the data and not attempting to reidentify patients).

Safe Harbor: A method of deidentifying health information that involves removing 18 types of identifiers from the data before sharing them with an outside party. The identifiers include names, addresses, Social Security numbers, phone and fax numbers, email addresses, biometric information, and other individually unique information (45 CFR 164.514(b)(2); Guidance on the Safe Harbor Method).

Data use agreement: A DUA is a contractual document for the transfer of PHI that describes the purposes for which the data can be used and prohibits re-identification (45 CFR 164.514).

Deidentified datasets are most appropriate for retrospective, observational studies, though the lack of identifiers like dates of service and date of death can prove problematic for certain analyses. In order to limit risk, many healthcare organizations will try to ensure that the data released are the “minimum necessary” to support a project. As a result, even in prospective studies where investigators have obtained patient consent, organizations may not wish to release more PHI than necessary.

When approaching a healthcare organization for a DUA, a prospective researcher should be prepared to provide a detailed, precise statement of what data elements are required, from what sources, and over what time period. In addition, the investigator must describe how the data will be used and transferred securely to the investigator and provide a list of all personnel who will be permitted to use the information. Timelines for working out DUAs between stakeholders at healthcare facilities and external investigators vary greatly; in our experience, intervals range from 6 months to more than 2 years.

Data From Patients

Obtaining healthcare data directly from patients is more straightforward from a regulatory perspective because the HIPAA regulations that pertain to covered entities do not apply. While it is necessary to obtain patient permission, additional DUAs are typically not required. The logistics of obtaining the data may be more complicated, as data must be brokered through each patient (see the Methods of Access section of this chapter), but the regulatory process is much simpler.

Data From Other Sources

Datasets managed by government agencies (such as the US Census Bureau or the Environmental Protection Agency) will often have restrictions similar to those of healthcare organizations. IRB approval and DUAs may be required, and there may be requirements that any future publication cite the organization that provided the data. Similar processes may also be in place for groups that manage product, device, or disease registries.


Resources

Using Electronic Health Record Data in Pragmatic Clinical Trials
Living Textbook chapter from the NIH Pragmatic Trials Collaboratory's Electronic Health Records Core

REFERENCES

Zozus MN, Richesson RL, Hammond WE, Simon GE. 2015. Acquiring and Using Electronic Health Record Data. NIH Collaboratory Electronic Health Records Core. https://dcricollab.dcri.duke.edu/sites/NIHKR/KR/Acquiring%20and%20Using%20Electronic%20Health%20Record%20Data.pdf. Accessed August 21, 2020.


Version History

July 14, 2025: Updated resources (changes made by G. Uhlenbrauck).

October 14, 2022: Made nonsubstantive changes to the text, added Seils as a contributing editor, and reordered the section within the chapter as part of the annual content update (changes made by D. Seils).

Published August 25, 2020

Gaining Permission to Use Real-World Data – ARCHIVED

Common Real-World Data Sources

Acquiring Real-World Data


Section 2

Common Real-World Data Sources

As part of its framework for using real-world evidence derived from real-world data to support regulatory decision making, the US Food and Drug Administration (FDA) has identified several potential sources of real-world data and information (FDA 2017):

Electronic health records: Electronic health records (EHRs) contain information collected during the course of clinical care. They may include multiple care settings—outpatient visits, inpatient stays, emergency and urgent care visits, home health, etc. EHRs can include a variety of data from structured domains, including diagnoses, procedures, laboratory results, vital signs, medication orders, and medication administrations. They may also include less standardized data, such as information captured in inpatient flowsheets, questionnaires and surveys completed directly by patients, signs and symptoms recorded by clinicians, data on surgical care and anesthesia, and provider and nursing documentation.

Administrative claims: Administrative claims are insurance claims related to services from healthcare providers. In the United States, federal insurance programs include Medicare and Medicaid. The Medicare population includes adults 65 years and older, patients with certain disabilities, and patients with end-stage renal disease. Medicaid is an insurance program for people with low income. Private insurance claims include those in employer-sponsored health plans, insurance claims for those who are self-employed, and claims for insurance plans administered on behalf of the federal government. These administrative data can include information about physician services, institutional costs, demographic characteristics, deaths, dispensed medications, home health services, and skilled nursing facilities.

Patient-reported outcomes: Patient-reported outcomes (PROs) are defined by the FDA as any report of the status of a patient's health that comes directly from the patient, without interpretation of the patient's response by a clinician or anyone else (FDA Guidance for Industry 2009). PROs might include information about symptoms, functioning, satisfaction with care or symptoms, adherence to prescribed medications or other therapy, and perceived value of treatment. Typically captured in the form of surveys or questionnaires, PROs may be obtained via paper forms, online portals, or mobile apps. See the Patient-Reported Outcomes chapter of the Living Textbook for more information.

Patient-generated health data: Patient-generated health data are data generated from devices that provide information on a patient’s status (for example, internet-connected scales, pedometers, home blood pressure monitors). These data may be obtained directly from the device via a mobile application or through some other type of instrument. Patient-generated health data can include the raw sensor values and summary statistics calculated from the underlying data.

Medical product/device registries: Registries are typically created after a product or device has been approved in order to support postmarketing surveillance. These registries often contain rich information about the product or device but limited data on the characteristics or health status of patients, generally far less data than what is available in EHRs.

Condition-specific or disease registries: These registries contain information from patients who have a specific condition or disease. These patient-focused registries often include information about disease onset, symptoms, changing phenotypes, treatments, and outcomes. Because they are designed for research or targeted care, condition- or disease-specific registries often have more condition-specific data than is collected in EHRs.

Environmental factors and social determinants of health: Environmental factors and social determinants of health (for example, food insecurity, access to transportation) are increasingly being captured in EHRs as healthcare systems focus more on population health. The data may be collected directly from patients through surveys or derived from community or geographically organized resources (for example, the American Community Survey) based on a patient’s current or historical address. Environmental sources can also be used in this manner, such as to estimate exposure to pollution based on distance to a freeway, power plant, or other industrial source.

In most cases, PROs and patient-generated health data are obtained directly from patients as part of a specific trial or study. In this context, PROs and patient-generated health data are collected prospectively using the procedures that govern prospective data collection, such as patient consent. PROs and patient-generated health data collected in the EHR or as part of a registry for other purposes (such as follow-up to a surgical procedure or monitoring of patients with a chronic disease) will typically be treated like the rest of the data that are contained in that source.

Identifying the Appropriate Source

Since data in secondary real-world data sources were collected or generated for purposes other than research, they include gaps and biases that reflect the nature of the underlying activity (Kahn and Ranade 2010; Hersh et al 2013; Weiskopf et al 2013; Raebel et al 2014; Rusanov et al 2014). Therefore, given a specific research question or study, it is important to assess whether the real-world data source is relevant and can reliably fulfill its intended purpose (FDA 2017; FDA 2021), whether it is for patient identification or recruitment, monitoring outcomes, or assessing endpoints. (See the Assessing Fitness-for-Use of Real-World Data chapter of the Living Textbook.)

In many cases, the same study concept can be present in multiple real-world data sources. For example, a disease diagnosis can be identified through a query of the EHR, administrative claims, a patient-reported medical history, or a disease registry. When designing a study, investigators should understand the trade-offs between different sources in terms of completeness across a potential study population, length of follow-up, etc. Depending on how data are captured, multiple sources may be needed to adequately support a study. In this case, an adjudication process is often necessary to decide what to do if there is discordance between sources (Rockhold et al 2020). Investigators should prepare to implement such a plan.
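
One simple form such an adjudication process can take is a prespecified agreement threshold across sources. The sketch below is purely illustrative: the sources, threshold, and example evidence are invented, and an actual study would define its rule in a prespecified adjudication plan.

```python
# Hypothetical adjudication rule: classify a patient as having the study
# diagnosis when at least 2 of 3 sources agree. The sources, threshold,
# and evidence below are illustrative assumptions.

def adjudicate_diagnosis(sources, threshold=2):
    """Return True when enough sources independently support the diagnosis."""
    return sum(sources.values()) >= threshold

patient_evidence = {
    "ehr_problem_list": True,   # diagnosis code on the EHR problem list
    "claims": True,             # diagnosis billed on an administrative claim
    "patient_reported": False,  # not reported on the enrollment questionnaire
}

assert adjudicate_diagnosis(patient_evidence)  # 2 of 3 sources agree
```

More elaborate rules (for example, weighting sources by known accuracy, or routing disagreements to manual review) follow the same pattern of deciding discordance up front rather than ad hoc during analysis.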

In addition, when combining study data with real-world data sources, some form of record linkage is usually required to match patients across sources. There are a number of techniques that can be used. Some rely on deterministic matches of clear-text identifiers, while others rely on probabilistic weighting of encrypted tokens generated from combinations of identifiers (such as first and last names, date of birth, and current zip code) (Grannis et al 2002; Durham et al 2010; Kum et al 2014; Setoguchi et al 2014; Durojaiye et al 2018; Karr et al 2019). Not all data holders are able to support all of these methods, so it is important to understand their capabilities, as well as the identifiers to which they have access. Study teams may then need to collect these same identifiers to allow for linkage to occur.
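
A toy version of probabilistic weighting can make the idea concrete. In the sketch below, agreement on each identifier contributes a weight, and a pair of records is accepted as a match when the total clears a threshold. The weights, threshold, and records are invented for illustration; real systems estimate these parameters from the data (eg, with Fellegi-Sunter methods).

```python
# Toy probabilistic matching: weight agreement on each identifier and
# accept a candidate pair when the total score clears a threshold.
# Weights and threshold are invented for illustration only.

FIELD_WEIGHTS = {"last_name": 4.0, "first_name": 2.5, "birth_date": 5.0, "zip": 1.5}
MATCH_THRESHOLD = 8.0

def match_score(record_a, record_b):
    """Sum the weights of the fields on which the two records agree."""
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if record_a.get(field) and record_a.get(field) == record_b.get(field)
    )

registry = {"first_name": "mary", "last_name": "smith", "birth_date": "1960-04-02", "zip": "27701"}
claims   = {"first_name": "marie", "last_name": "smith", "birth_date": "1960-04-02", "zip": "27701"}

score = match_score(registry, claims)   # 4.0 + 5.0 + 1.5 = 10.5
assert score >= MATCH_THRESHOLD         # accept despite the first-name mismatch
```

The appeal of the probabilistic approach is visible even in this toy: a spelling variant in one field does not prevent a link when the remaining identifiers agree strongly, whereas a purely deterministic match on all fields would reject the pair.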


Resources

Real-World Data and Real-World Evidence in Regulatory Decisions
NIH Pragmatic Trials Collaboratory EHR Workshop video module. Jacqueline Corrigan-Curay of the US Food and Drug Administration discusses recent trends in incorporating real-world data and real-world evidence in regulatory decisions.


Using Real-World Data to Plan Eligibility Criteria and Enhance Recruitment
NIH Pragmatic Trials Collaboratory PCT Grand Rounds; July 31, 2020

REFERENCES

Durham E, Xue Y, Kantarcioglu M, Malin B. 2010. Private medical record linkage with approximate matching. AMIA Annu Symp Proc. 2010:182-186. PMID: 21346965.

Durojaiye AB, Puett LL, Levin S, et al. 2018. Linking electronic health record and trauma registry data: assessing the value of probabilistic linkage. Methods Inf Med. 57:261-269. doi:10.1055/s-0039-1681087. PMID: 30875705.

Grannis SJ, Overhage JM, McDonald CJ. 2002. Analysis of identifier performance using a deterministic linkage algorithm. Proc AMIA Symp. 2002:305-309. PMID: 12463836.

Hersh WR, Weiner MG, Embi PJ, et al. 2013. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care. 51:S30-S37. doi:10.1097/MLR.0b013e31829b1dbd. PMID: 23774517.

Kahn MG, Ranade D. 2010. The impact of electronic medical records data sources on an adverse drug event quality measure. J Am Med Inform Assoc. 17:185-191. doi:10.1136/jamia.2009.002451. PMID: 20190062.

Karr AF, Taylor MT, West SL, et al. 2019. Comparing record linkage software programs and algorithms using real-world data. PLoS One. 14:e0221459. doi:10.1371/journal.pone.0221459. PMID: 32352389.

Kum H-C, Krishnamurthy A, Machanavajjhala A, Reiter MK, Ahalt S. 2014. Privacy preserving interactive record linkage (PPIRL). J Am Med Inform Assoc. 21:212-220. doi:10.1136/amiajnl-2013-002165. PMID: 24201028.

Raebel MA, Haynes K, Woodworth TS, et al. 2014. Electronic clinical laboratory test results data tables: lessons from Mini-Sentinel. Pharmacoepidemiol Drug Saf. 23:609-618. doi:10.1002/pds.3580. PMID: 24677577.

Rockhold FW, Tenenbaum JD, Richesson R, Marsolo KA, O'Brien EC. 2020. Design and analytic considerations for using patient-reported health data in pragmatic clinical trials: report from an NIH Collaboratory roundtable. J Am Med Inform Assoc. 27:634-638. doi:10.1093/jamia/ocz226. PMID: 32027359.

Rusanov A, Weiskopf NG, Wang S, Weng C. 2014. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Med Inform Decis Mak. 14:51. doi:10.1186/1472-6947-14-51. PMID: 24916006.

Setoguchi S, Zhu Y, Jalbert JJ, Williams LA, Chen C-Y. 2014. Validity of deterministic record linkage using multiple indirect personal identifiers: linking a large registry to claims data. Circ Cardiovasc Qual Outcomes. 7:475-480. doi:10.1161/CIRCOUTCOMES.113.000294. PMID: 24755909.

US Food and Drug Administration. Guidance for Industry. 2009. Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/patient-reported-outcome-measures-use-medical-product-development-support-labeling-claims. Accessed August 21, 2020.

US Food and Drug Administration. 2017. Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices Guidance for Industry and Food and Drug Administration Staff. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence. Accessed August 20, 2020.

US Food and Drug Administration. 2021. Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products—Draft Guidance for Industry. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory. Accessed July 21, 2022.

Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. 2013. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform. 46:830-836. doi:10.1016/j.jbi.2013.06.010. PMID: 23820016.


Version History

October 14, 2022: Made nonsubstantive changes to the text, added an image and updated links in the Resources sidebar, and added Seils as a contributing editor as part of the annual content update (changes made by D. Seils).

January 18, 2021: Added EHR Workshop video module to the resource bar (changes made by K. Staman).

Published August 25, 2020

Introduction

Acquiring Real-World Data


Section 1

Introduction

"Real-world data," as defined by the US Food and Drug Administration, are data relating to the health status of a patient or the delivery of healthcare services. Common sources of real-world data include electronic health records, administrative claims, patient-reported outcomes, patient-generated health data, and medical product and device registries, as well as databases relating to environmental factors or social determinants of health. Real-world data can be used to support a number of activities in pragmatic clinical trials, such as patient identification and recruitment, monitoring of outcomes, and ascertainment of endpoints.

Most real-world data are considered secondary data sources when used for research, because they were originally generated for another purpose and thus reflect the context of that activity (hence their "real-world" nature). Therefore, it is necessary to ensure that real-world data are fit for use before including them in a study. (See the Assessing Fitness for Use of Real-World Data Sources chapter of the Living Textbook.) This chapter outlines strategies for obtaining real-world data for use in research.



Version History

December 3, 2025: Updated hyperlinks (changes made by G. Uhlenbrauck).

October 14, 2022: Made nonsubstantive changes to the text, added an image to the Resources sidebar, and added Seils as a contributing editor as part of the annual content update (changes made by D. Seils).

January 18, 2021: Added EHR Video module to the resource bar (changes made by K. Staman).

Published August 25, 2020.

Pragmatic Implementation Process Assessments – ARCHIVED

ARCHIVED PAGE

Archived on August 7, 2025.

Dissemination and Implementation


Section 9


Pragmatic Implementation Process Assessments – ARCHIVED

Case Example: The Trauma Survivors Outcomes & Support (TSOS) Pragmatic Trial Research Team: Rapid Assessment Procedure Informed Clinical Ethnography (RAPICE)

The use of qualitative/mixed methods to assess the process of implementing embedded pragmatic clinical trials (ePCTs) can potentially be time consuming and expensive. Furthermore, implementation process assessments are often not integrated into routine workflow and can make a trial less pragmatic (Palinkas and Zatzick 2019; Zatzick 2019). Implementation assessments may involve lengthy semi-structured interviews, central adjudication, and time-intensive qualitative coding procedures. To make implementation process assessments more pragmatic (and less resource-intensive), data could be captured reliably as part of routine pragmatic trial rollout.

One option for pragmatic implementation process assessment was developed by the Trauma Survivors Outcomes & Support (TSOS) pragmatic trial research team: Rapid Assessment Procedure Informed Clinical Ethnography (RAPICE) (Palinkas and Zatzick 2019). The RAPICE approach rolled out in the TSOS trial derives from clinical field experiences in disaster mental health contexts, including the response efforts to school shootings and the 2010 Haiti earthquake (Palinkas et al. 2004; Zatzick et al. 2010). RAPICE also derives from prior single-site comparative effectiveness trial clinical ethnographic methods that incorporate research team participant observation coupled with external mixed-method consultation (Zatzick et al. 2011). It is informed by other clinical ethnographic perspectives that advocate for researcher immersion in daily activities (Green et al. 2015), and by rapid assessment procedures, which use participant observation, non-directed interviewing, and other data collection methods, such as informal interviews, newspaper accounts, agency reports, and statistics, to verify information (Harris et al. 1997).

RAPICE and TSOS

As part of site visits and training activities, the TSOS research team spent hundreds of hours annually immersed in trauma care system pragmatic trial rollout (Zatzick 2019). During these activities, the principal investigator and other team members logged field notes/jottings of their clinical research experiences. These data were reviewed regularly (e.g., monthly) with a mixed-method expert consultant. Themes related to intervention delivery, sustainable implementation, barriers, and facilitators were iteratively discussed and documented. When appropriate, observations were fed back to front-line providers rolling out the TSOS intervention; other implementation process observations were collected by the study team and are anticipated to be presented at the study team policy summit or in other formats (e.g., peer-reviewed publications). Notably, because these procedures were embedded as part of the trial, they did not substantially increase trial time investments or costs.

“RAPICE applied in a pragmatic clinical trial is distinguished by the following: (1) formation of a multidisciplinary research team including a member or members with clinical and/or administrative expertise and ethnographic and mixed methods training, enabling efficiency in data collection and analysis through division of labor; (2) development of materials to train team members in ethnographic methods and rapid assessment procedures that minimize the burden placed on any single study participant; (3) use of several data collection methods (e.g., participant observation, informal and semi-structured interviews, field jottings and logs, quantitative surveys) to verify information through triangulation; (4) iterative data collection and analysis in real time to facilitate continuous adjustment of the research question and methods to answer that question; and (5) rapid completion of the mixed method component of the project, which may vary depending on project aims and mixed method design.” (Palinkas and Zatzick 2019)

Since its development within the NIH Health Care Systems Research Collaboratory, the RAPICE method has been extended to additional investigative contexts. The study team has extended the RAPICE approach to target the RE-AIM evaluation framework, honed to policy activities (Scheuer et al. 2020). Also, the RAPICE method was recently used during the Washington State COVID-19 outbreak to rapidly identify primary and secondary COVID-19 prevention strategies that could be readily delivered within the TSOS care management platform. The nimble RAPICE mixed method allowed initial observations of study team intervention activities to naturalistically evolve to incorporate the larger pandemic context (Moloney et al. 2020).

For more information, see the April 19, 2019 Grand Rounds: Trauma Survivors Outcomes & Support (TSOS) Pragmatic Trial: Revisiting Effectiveness & Implementation Aims (Doug Zatzick, MD)


REFERENCES

American College of Surgeons. 2014. Resources for Optimal Care of the Injured Patient. https://www.facs.org/~/media/files/quality%20programs/trauma/vrc%20resources/resources%20for%20optimal%20care.ashx. Accessed August 1, 2017.

Green CA, Duan N, Gibbons RD, Hoagwood KE, Palinkas LA, Wisdom JP. 2015. Approaches to mixed methods dissemination and implementation research: methods, strengths, caveats, and opportunities. Adm Policy Ment Health. 42(5):508–523. doi:10.1007/s10488-014-0552-6. PMID: 24722814.

Harris KJ, Jerome NW, Fawcett SB. 1997. Rapid assessment procedures: a review and critique. Hum Organizat. 56(3):375–378. https://www.jstor.org/stable/44127200.

Moloney K, Scheuer H, Engstrom A, et al. 2020. Experiences and insights from the early US COVID-19 epicenter: a rapid assessment procedure informed clinical ethnography case series. Psychiatry. 1–13. doi:10.1080/00332747.2020.1750214. PMID: 32338566.

Palinkas LA, Zatzick D. 2019. Rapid assessment procedure informed clinical ethnography (RAPICE) in pragmatic clinical trials of mental health services implementation: methods and applied case study. Adm Policy Ment Health. 46(2):255–270. doi:10.1007/s10488-018-0909-3. PMID: 30488143.

Palinkas LA, Prussing E, Reznik VM, Landsverk JA. 2004. The San Diego East County school shootings: a qualitative study of community-level post-traumatic stress. Prehospital and Disaster Med. 19(1):113-121. PMID: 15453168.

Scheuer H, Engstrom A, Thomas P, et al. 2020. A comparative effectiveness trial of an information technology enhanced peer-integrated collaborative care intervention versus enhanced usual care for US trauma care systems: Clinical study protocol. Contemp Clin Trials. 91:105970. doi:10.1016/j.cct.2020.105970. PMID: 32119926.

Zatzick DF. 2019. Trauma Survivors Outcomes & Support (TSOS) Pragmatic Trial: Revisiting Effectiveness & Implementation Aims. https://rethinkingclinicaltrials.org/news/april-19-2019-trauma-survivors-outcomes-support-tsos-pragmatic-trial-revisiting-effectiveness-implementation-aims-doug-zatzick-md/.

Zatzick DF, Russo J, Darnell D, et al. 2016. An effectiveness-implementation hybrid trial study protocol targeting posttraumatic stress disorder and comorbidity. Implement Sci. 11:58. doi:10.1186/s13012-016-0424-4. PMID: 27130272.

Zatzick D, Coq N, Frederic J, et al. 2010. Psychosocial support training for HIV health care providers in response to the Haitian earthquake. Consortium of Universities for Global Health Annual Meeting, University of Washington.

Zatzick D, Rivara F, Jurkovich G, et al. 2011. Enhancing the population impact of collaborative care interventions: mixed method development and implementation of stepped care targeting posttraumatic stress disorder and related comorbidities after acute trauma. Gen Hosp Psychiatry. 33(2):123-134. doi:10.1016/j.genhosppsych.2011.01.001. PMID: 21596205.


Version History

Published August 4, 2020

Interoperability

Using Electronic Health Record Data in Pragmatic Clinical Trials


Section 2

Interoperability

Interoperability between electronic health record systems has the potential to benefit patients, payers, health systems, researchers, and clinicians, because it enables data gathered as part of care to be reused to support care across systems, quality improvement, and research. In actual practice, however, interoperability has been hard to achieve.

Historically, the focus of interoperability has been on the ability to share information between settings, and the strategy has been to identify various standards for data (eg, data formatting). Over the years, these standards have become more mature, and health organizations have growing incentives (due to regulation and to changes in healthcare reimbursement and payment that reward quality care) to share patient data across care teams and organizations.

A final rule to implement provisions of the 21st Century Cures Act was intended to advance interoperability and support the access, exchange, and use of electronic health information by patients and their caregivers. The rule establishes the United States Core Data for Interoperability (USCDI) standard, which sets forth data classes and elements that support nationwide interoperability; it also includes a broad range of data elements, such as clinical notes, test results, and medications. Specific standards that are mandated as part of the final rule include:

  • Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT)
    • SNOMED CT is a standard set of clinical terminology that includes clinical findings, disorders, and other health-related concepts, designed to support electronic health information exchange.
  • Logical Observation Identifiers Names and Codes (LOINC)
    • LOINC is a standard that allows users to map health data, such as laboratory tests, vital signs, and clinical documents, to common codes that support data exchange across systems.
  • RxNorm
    • RxNorm is a free, publicly available resource from the National Library of Medicine that provides “normalized” names and unique identifiers that make it possible to clearly identify a given drug. RxNorm coding allows information about medications to be exchanged across EHRs.
    • Concept Unique Identifier (RxCUI)
      • The RxCUI is a unique, unambiguous identifier that is assigned to an individual drug entity in RxNorm and used to relate to all things associated with that drug. The RxCUI is used to link one entity in RxNorm to every other entity it’s related to, such as name to ingredient to class.
  • International Classification of Diseases, Tenth Revision (ICD-10)
    • ICD-10 provides standard codes for medical conditions, diagnoses, and institutional procedures.

While having a defined interface standard solves one part of the interoperability problem (ie, syntactic interoperability), as long as data are natively captured in nonstandard formats, such as local codes used for greater specificity, they will need to be mapped to standard terminologies before they are truly interoperable (ie, semantic interoperability). Previous federal attempts to foster interoperability resulted in highly variable implementations (D'Amore et al. 2014), so study teams should be prepared to validate the data that they receive via these new interfaces. Efforts such as the USCDI to define more standardized content for exchange may improve the situation over time, but they have limited impact on data that were collected previously, so teams conducting retrospective studies should also be prepared to budget for staff to map internal codes to standard terminologies.
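
In practice, this mapping work often amounts to maintaining a site-specific crosswalk from internal codes to a standard terminology. A minimal sketch, assuming invented local codes (the LOINC codes shown are real):

```python
# Hypothetical site-specific crosswalk from internal lab codes to LOINC.
# The local codes are invented; each site must build and validate its
# own map, often with help from laboratory or informatics staff.

LOCAL_TO_LOINC = {
    "LAB1234": "718-7",   # Hemoglobin [Mass/volume] in Blood
    "LAB5678": "2345-7",  # Glucose [Mass/volume] in Serum or Plasma
}

def to_loinc(local_code):
    """Translate a local code; None flags results needing manual review."""
    return LOCAL_TO_LOINC.get(local_code)

results = [("LAB1234", 13.5), ("LAB9999", 7.2)]
mapped = [(to_loinc(code), value) for code, value in results]
unmapped = [code for code, _ in results if to_loinc(code) is None]
assert unmapped == ["LAB9999"]  # unmapped codes surface for review
```

Explicitly surfacing unmapped codes, rather than silently dropping them, is what allows a study team to quantify how much of a site's data is actually reachable through standard terminologies.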

Historically, the exchange of data within the healthcare industry for operational purposes has occurred via messages that often resembled structured documents, while researchers would typically receive population- or cohort-level extracts from reporting databases or data warehouses. More recent efforts have established the use of application programming interfaces (APIs), which tend to send and receive data in more piecemeal fashion (eg, a record or an observation at a time). APIs provide a standard interface and make it easier to develop new applications that interact with healthcare data, but because data are requested an observation or a record at a time, it can be more difficult to understand what information is NOT being delivered. For instance, an investigator who wanted to receive hemoglobin lab results for a set of patients via an API would request results by LOINC code. The query would return the lab results that have been mapped to LOINC, but the investigator would have no knowledge of any hemoglobin tests that were not mapped to LOINC by the institution, for instance, historical results that were only mapped to an internal code or reference. Ideally, all results will be captured, but it is possible that the data begin on the day the interface was turned on (ie, only prospective results), or represent only results from the most recent lab system upgrade. Because it is not possible to request all data for an entire population via an API (the way data extracted from a database would be validated by examining trends and distributions), users view results one patient at a time, making it very difficult to identify problems, since any given patient may have significant gaps in their data due to normal practice patterns.
While the use of APIs and the adoption of standards like the USCDI have been generally positive for the healthcare industry in terms of standardization, we are still in the early stages, somewhere between “buyer beware” and “trust, but verify.”
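As a concrete sketch of the LOINC-scoped API request described above, the snippet below builds a FHIR-style Observation search URL. The server address is hypothetical, and a real deployment would also require authentication, paging, and handling of the returned FHIR Bundle:

```python
# Build a FHIR Observation search scoped to a LOINC code. Only results
# the institution has mapped to that code will be returned; results
# mapped only to internal codes are invisible to this query.
from urllib.parse import urlencode

FHIR_BASE = "https://fhir.example.org"  # hypothetical endpoint

def observation_query(patient_id, loinc_code):
    params = {
        "patient": patient_id,
        "code": f"http://loinc.org|{loinc_code}",  # system|code token
    }
    return f"{FHIR_BASE}/Observation?{urlencode(params)}"

url = observation_query("12345", "718-7")  # 718-7 = hemoglobin in blood
```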

Additionally, care must be taken to make sure that information collected at one site actually matches information at another site. The process of extracting, transforming, and loading data from one system to another can require the “janitorial” work of mapping and data cleaning (Lohr 2014).

To advance interoperability among EHRs and registries, the Pew Data Interoperability Project was developed as a collaboration between the Duke Clinical Research Institute and the Pew Charitable Trusts. In a Grand Rounds presentation describing the project (Envisioning Data Liquidity – The Pew Data Interoperability Project), Dr. James E. Tcheng suggests that standardization is best accomplished through “native” data standardization, as opposed to standardizing data after they have been collected. For registries, the keys to success are well-defined clinical concepts, specified representation of the concepts and data in database systems, and integration of data capture into the workflow.

Patient Access to Data

New developments in regulation and technology—particularly related to patient- or consumer-mediated data exchange—are changing the nature of how researchers might access data (Cimino et al 2014; Bracha et al 2019). To implement provisions of the 21st Century Cures Act, on May 1, 2020, the Office of the National Coordinator for Health Information Technology (ONC), now the Assistant Secretary for Technology Policy / Office of the National Coordinator for Health IT, and the Centers for Medicare and Medicaid Services (CMS) announced a final rule; one aspect of the rule is intended to support the access, exchange, and use of electronic health information by patients and their caregivers.

“Patients should be able to access their electronic medical record at no extra cost. Providers should be able to choose their IT tools that allow them to provide the best care for patients, without excessive costs or technical barriers.” —ONC Cures Act Final Rule Fact Sheet

To enable the access and exchange of healthcare data, the rule requires standardized, open application programming interfaces (APIs) built using HL7’s FHIR (Fast Healthcare Interoperability Resources) standard. The rule is also intended to promote competition, support provider and patient independence, and prevent information blocking.

 


Resources

The Big Picture: Healthcare Data and Interoperability

In this video, Dr. Lesley Curtis explores how data flow into EHRs and move between systems, the role of data standards, and the barriers to building a more streamlined and connected healthcare system.

Grand Rounds

May 10, 2019: Treating Data as an Asset: Data Entrepreneurship in the Service of Patients (Eric Perakslis, PhD)

November 9, 2018: Data Linkage Within, Across, and Beyond PCORnet (Thomas W. Carton, PhD, MS, Keith Marsolo, PhD)

October 19, 2018: A New Path Forward for Using Decentralized Clinical Trials (Jeffry Florian, PhD, Annemarie Forrest, Penny Randall, MD, MBA)

Podcast

November 20, 2018: Data Linkage Within, Across, and Beyond PCORnet (Thomas W. Carton, PhD, MS, Keith Marsolo, PhD)

REFERENCES


Bracha Y, Bagwell J, Furberg R, Wald JS. 2019. Consumer-Mediated Data Exchange for Research: Current State of US Law, Technology, and Trust. JMIR Med Inform. (2):e12348. doi: 10.2196/12348. PMID: 30946692.

Cimino JJ, Frisse ME, Halamka J, et al. 2014. Consumer-mediated health information exchanges: the 2012 ACMI debate. J Biomed Inform. 48:5-15. doi: 10.1016/j.jbi.2014.02.009. PMID: 24561078.

 

D’Amore JD, Mandel JC, Kreda DA, et al. 2014. Are Meaningful Use Stage 2 Certified EHRs ready for interoperability? Findings from the SMART C-CDA Collaborative. J Am Med Inform Assoc. 21:1060-1068. https://doi.org/10.1136/amiajnl-2014-002883.

Lohr S. 2014. For big-data scientists, ‘janitor work’ is key hurdle to insights. New York Times. August 17, 2014. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html.


Version History

October 7, 2025: Updated text as part of annual review (changes made by K. Staman).

August 26, 2022: Minor corrections as part of annual update (changes made by K. Staman).

Update January 18, 2021: Added EHR Workshop video module to resource bar (changes made by K. Staman).

Published July 3, 2020.

Using Phenotypes in PCTs—How Do I Get Started?

Electronic Health Records–Based Phenotyping


Section 6

Using Phenotypes in PCTs—How Do I Get Started?

Before beginning development of any phenotype definition, researchers should search for existing phenotype definitions and consider their performance in validation testing. They should then assess the candidate phenotype definitions for feasibility in particular settings (for example, determining whether available domains match the authoritative source phenotype definition). If a suitable phenotype definition cannot be found from authoritative sources, then a definition must be developed and validated. In any case, once a candidate phenotype definition is identified, it must be validated against a gold standard in clinical populations, as shown in the figure below.

Figure. Phenotype Evaluation Process

 

The Figure is a flow diagram of the phenotype evaluation process.
Abbreviations: AHRQ, Agency for Healthcare Research and Quality; CMS, Centers for Medicare and Medicaid Services. Adapted with permission from Shelley Rusincovitch, Center for Predictive Medicine, Duke Clinical Research Institute.

If a new phenotype definition is needed, the researchers must first operationalize a disease concept against electronic health record (EHR) data. The researchers must explicitly define how a concept should be measured, observed, or manipulated within a particular study and available data sources. A theoretical or conceptual variable of interest (such as a disease) must be translated into a set of specific diagnoses or procedures paired with implementation specifications that define the variable's meaning in a specific study. In the context of healthcare data, this means explicitly defining diagnoses, treatments, and clinical and patient characteristics that are indicative or suggestive of the condition. The researchers must specify the clinical condition they are looking for and how the condition would be represented in various EHRs.

For example, to identify obesity, the researchers would first identify diagnostic and procedure codes for the condition and investigate whether the codes are reliable and are applied consistently. If the researchers cannot reasonably assume that all patients with obesity would be coded with a given diagnosis or procedure code, they must use other data sources.

The next step is to review the available data sources (such as EHR data, claims data, registry data, and patient-reported outcomes data). If a phenotype definition is to be applied in multiple organizations, the researchers must consider the data sources that are available in other organizations. Possible data sources for obesity might include patient height and weight, the ordering or dispensing of medications associated with weight management, or patient-reported data on weight or a previous diagnosis of obesity. It is also important to consider other factors that may affect these measurements (such as the effect of pregnancy on weight, or the effect of amputation on height). Within each data type, the researchers should identify which data are available to them (for example, some EHR data include medication orders but not administration data, or billing diagnoses rather than problem lists). Knowing the types of data available can support an early feasibility assessment of existing phenotype definitions.
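To make the steps above concrete, here is a minimal, illustrative sketch of how an obesity phenotype definition might be operationalized in code. It is not a validated definition: the ICD-10-CM codes are a deliberately incomplete example set, and the BMI threshold of 30 is the conventional adult cutoff:

```python
# Illustrative obesity phenotype (NOT validated): a patient qualifies if
# they carry one of the example diagnosis codes OR a computed BMI >= 30.
OBESITY_ICD10 = {"E66.01", "E66.09", "E66.9"}  # incomplete example set

def bmi(weight_kg, height_m):
    return weight_kg / (height_m ** 2)

def meets_phenotype(diagnosis_codes, weight_kg=None, height_m=None):
    # Code-based criterion: any obesity diagnosis code present.
    if OBESITY_ICD10 & set(diagnosis_codes):
        return True
    # Measurement-based criterion, used when height/weight are recorded.
    # A real definition would also account for pregnancy, amputation, etc.
    if weight_kg is not None and height_m is not None:
        return bmi(weight_kg, height_m) >= 30
    return False

meets_phenotype(["E66.9"])        # True via diagnosis code
meets_phenotype([], 95.0, 1.70)   # True via BMI (~32.9)
```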


Resources

A User’s Guide to Computable Phenotypes
Master’s thesis providing a practical framework to help physicians, clinical researchers, and informaticians evaluate published phenotype algorithms for reuse for various purposes. The framework is divided into 3 phases aligned with expected user roles: overall assessment, clinical validation, and technical review.

ACKNOWLEDGMENTS


Key contributors to previous versions of this chapter included Michelle Smerek, Shelley Rusincovitch, Meredith Nahm Zozus, Paramita Saha Chaudhuri, Ed Hammond, Robert Califf, Greg Simon, Beverly Green, Michael Kahn, and Reesa Laws.

The Electronic Health Records Core Working Group (formerly the Phenotypes, Data Standards, and Data Quality Core Working Group) of the NIH Collaboratory influenced much of this content through monthly meetings. These additional contributors included Monique Anderson, Nick Anderson, Alan Bauck, Denise Cifelli, Lesley Curtis, John Dickerson, Chris Helker, Michael Kahn, Cindy Kluchar, Melissa Leventhal, Rosemary Madigan, Renee Pridgen, Jon Puro, Jennifer Robinson, Jerry Sheehan, and Kari Stephens. We are also grateful to the Duke Center for Predictive Medicine for development and clarification of the scientific validity and evaluation of phenotype definitions.


Version History

June 23, 2022: Updated the name of the NIH Collaboratory in the contributors list, added a Resources sidebar, and made nonsubstantive changes to the text as part of the annual content update (changes made by D. Seils).

July 22, 2020: Added the alt text attribute and corrected the caption for the Figure (changes made by D. Seils).

July 8, 2020: Updated links in the list of contributors; and made minor corrections to layout and formatting (changes made by D. Seils).

July 1, 2020: Minor corrections to layout and formatting (changes made by D. Seils).

Published June 30, 2020

Data Quality

Electronic Health Records–Based Phenotyping


Section 5

Data Quality

The quality of the data in healthcare information systems has the potential to affect the results of phenotype-based queries in such a way that the resulting data may not be useful. Secondary use of healthcare data is defined as use of the data for a purpose other than that for which the data were originally collected (Safran et al 2007). Because the data were not collected with research in mind, secondary users cannot assume the data will meet a study’s needs. For these reasons, data quality assessment should accompany phenotype validation.

Using healthcare data in the absence of an understanding of their accuracy, consistency, missingness, and possible biases can lead to misleading answers. The capacity of the data to support research conclusions is so important that requests for applications for NIH Pragmatic Trials Collaboratory trials require that data validation be addressed. A methodology report from the Patient-Centered Outcomes Research Institute (PCORI) (Kahn et al 2018) recommends reporting data quality along with study results for observational and comparative effectiveness research, and it provides a data quality assessment model and framework. Other guidelines from research networks provide practical advice for data quality checks and reporting (Brown, Kahn, and Toh 2013; Kahn et al 2015).

The NIH Pragmatic Trials Collaboratory has developed a data quality assessment framework to help investigators and research teams identify and implement necessary assessments. (See “Assessing Data Quality for Healthcare Systems Data Used in Clinical Research.”) There are few validated electronic methods for data quality assessment that can be executed on a dataset. Instead, current methods for data quality assessment are comparison-based, involving comparison of chart review to data returned from a phenotype-based query, or comparison of 2 datasets to quantify the number and type of discrepancies and understand how they might be distributed in a dataset.
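A minimal sketch of the comparison-based approach described above, assuming two data sources keyed by patient identifier (the variable, field names, and sample values are invented for illustration):

```python
# Compare one variable across two data sources and tally discrepancies:
# patients present in only one source, and value mismatches among shared
# patients. A real assessment would also examine trends and distributions.
def compare_sources(source_a, source_b, tolerance=0.0):
    """Each source maps patient_id -> numeric value."""
    shared = source_a.keys() & source_b.keys()
    mismatched = [pid for pid in shared
                  if abs(source_a[pid] - source_b[pid]) > tolerance]
    return {
        "shared": len(shared),
        "only_in_a": len(source_a.keys() - source_b.keys()),
        "only_in_b": len(source_b.keys() - source_a.keys()),
        "mismatched": len(mismatched),
    }

ehr = {"p1": 13.5, "p2": 12.1, "p3": 14.0}            # eg, values from EHR
second_source = {"p1": 13.5, "p2": 11.0, "p4": 12.9}  # eg, a linked dataset
print(compare_sources(ehr, second_source))
# → {'shared': 2, 'only_in_a': 1, 'only_in_b': 1, 'mismatched': 1}
```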


REFERENCES


Brown JS, Kahn M, Toh S. 2013. Data quality assessment for comparative effectiveness research in distributed data networks. Med Care. 51(8 Suppl 3):S22-S29. doi:10.1097/MLR.0b013e31829b1e2c. PMID: 23793049.

Kahn MG, Brown JS, Chun AT, et al. 2015. Transparent reporting of data quality in distributed data networks. EGEMS (Wash DC). 3(1):1052. doi:10.13063/2327-9214.1052. PMID: 25992385.

Kahn M, Ong T, Barnard J, Maertens J. 2018. Developing Standards for Improving Measurement and Reporting of Data Quality in Health Research. Washington, DC: Patient-Centered Outcomes Research Institute. https://doi.org/10.25302/3.2018.ME.13035581. Accessed June 30, 2020.

Safran C, Bloomrosen M, Hammond WE, et al. 2007. Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. J Am Med Inform Assoc. 14:1-9. doi:10.1197/jamia.M2273. PMID: 17077452.

ACKNOWLEDGMENTS


Key contributors to previous versions of this chapter included Michelle Smerek, Shelley Rusincovitch, Meredith Nahm Zozus, Paramita Saha Chaudhuri, Ed Hammond, Robert Califf, Greg Simon, Beverly Green, Michael Kahn, and Reesa Laws.

The Electronic Health Records Core Working Group (formerly the Phenotypes, Data Standards, and Data Quality Core Working Group) of the NIH Collaboratory influenced much of this content through monthly meetings. These additional contributors included Monique Anderson, Nick Anderson, Alan Bauck, Denise Cifelli, Lesley Curtis, John Dickerson, Chris Helker, Michael Kahn, Cindy Kluchar, Melissa Leventhal, Rosemary Madigan, Renee Pridgen, Jon Puro, Jennifer Robinson, Jerry Sheehan, and Kari Stephens. We are also grateful to the Duke Center for Predictive Medicine for development and clarification of the scientific validity and evaluation of phenotype definitions.


Version History

June 23, 2022: Updated the name of the NIH Collaboratory in the contributors list and throughout the text as part of the annual content update (changes made by D. Seils).

July 8, 2020: Updated links in the list of contributors; and made nonsubstantive corrections to the text (changes made by D. Seils).

July 1, 2020: Addition of Resources sidebar; and minor corrections to layout and formatting (changes made by D. Seils).

Published June 30, 2020