Acquiring Real-World Data

Section 2 Common Real-World Data Sources

As part of its framework for using real-world evidence derived from real-world data to support regulatory decision making, the US Food and Drug Administration (FDA) has identified several potential sources of real-world data and information (FDA 2017):

Electronic health records: Electronic health records (EHRs) contain information collected during the course of clinical care. They may span multiple care settings, including outpatient visits, inpatient stays, emergency and urgent care visits, and home health. EHRs can include a variety of structured data, including diagnoses, procedures, laboratory results, vital signs, medication orders, and medication administrations. They may also include less standardized data, such as information captured in inpatient flowsheets, questionnaires and surveys completed directly by patients, signs and symptoms recorded by clinicians, data on surgical care and anesthesia, and provider and nursing documentation.

Administrative claims: Administrative claims are insurance claims related to services from healthcare providers. In the United States, federal insurance programs include Medicare and Medicaid. The Medicare population includes adults 65 years and older, patients with certain disabilities, and patients with end-stage renal disease. Medicaid is an insurance program for people with low income. Private insurance claims include those in employer-sponsored health plans, insurance claims for those who are self-employed, and claims for insurance plans administered on behalf of the federal government. These administrative data can include information about physician services, institutional costs, demographic characteristics, deaths, dispensed medications, home health services, and skilled nursing facilities.

Patient-reported outcomes: Patient-reported outcomes (PROs) are defined by the FDA as any report of the status of a patient's health that comes directly from the patient, without interpretation of the patient's response by a clinician or anyone else (FDA Guidance for Industry 2009). PROs might include information about symptoms, functioning, satisfaction with care or symptoms, adherence to prescribed medications or other therapy, and perceived value of treatment. Typically captured in the form of surveys or questionnaires, PROs may be obtained via paper forms, online portals, or mobile apps. See the Patient-Reported Outcomes chapter of the Living Textbook for more information.

Patient-generated health data: Patient-generated health data are data generated from devices that provide information on a patient’s status (for example, internet-connected scales, pedometers, home blood pressure monitors). These data may be obtained directly from the device via a mobile application or through some other type of instrument. Patient-generated health data can include the raw sensor values and summary statistics calculated from the underlying data.

Medical product/device registries: These registries are typically created after a product or device has been approved in order to support postmarketing surveillance. They often contain rich information about the product or device but limited data on the characteristics or health status of patients, generally far less than is available in EHRs.

Condition-specific or disease registries: These registries contain information from patients who have a specific condition or disease. They often include information about disease onset, symptoms, changing phenotypes, treatments, and outcomes. Because they are designed for research or targeted care, condition-specific or disease registries often hold more condition-specific data than are collected in EHRs.

Environmental factors and social determinants of health: Environmental factors and social determinants of health (for example, food insecurity, access to transportation) are increasingly being captured in EHRs as healthcare systems focus more on population health. The data may be collected directly from patients through surveys or derived from community or geographically organized resources (for example, the American Community Survey) based on a patient’s current or historical address. Environmental data sources can be used in the same way, for example, to estimate exposure to pollution based on the distance from a patient’s address to a freeway, power plant, or other industrial source.
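As a rough illustration of the distance-based approach mentioned for environmental sources, the sketch below computes the great-circle (haversine) distance from a geocoded address to the nearest of several pollution sources and flags whether it falls within a buffer. The coordinates, the 0.5 km buffer, and the function names are all hypothetical; a real study would choose its own geocoding method, source inventory, and exposure definition.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_source_km(home, sources):
    """Distance from a geocoded home address to the closest source in the list."""
    return min(haversine_km(home[0], home[1], s[0], s[1]) for s in sources)


# Hypothetical coordinates (latitude, longitude), for illustration only.
home_address = (36.000, -78.900)
pollution_sources = [(36.003, -78.900), (36.200, -79.100)]

BUFFER_KM = 0.5  # assumed exposure buffer, not a recommended value
distance_km = nearest_source_km(home_address, pollution_sources)
exposed = distance_km <= BUFFER_KM
```

Straight-line distance is only a proxy for exposure. Because the flag is tied to one address, a historical address would need its own calculation.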

In most cases, PROs and patient-generated health data are obtained directly from patients as part of a specific trial or study. In this context, they are collected prospectively under the procedures that govern prospective data collection, such as obtaining patient consent. PROs and patient-generated health data collected in the EHR or as part of a registry for other purposes (such as follow-up to a surgical procedure or monitoring of patients with a chronic disease) are typically treated like the rest of the data contained in that source.

Identifying the Appropriate Source

Because data in secondary real-world data sources were collected or generated for purposes other than research, they include gaps and biases that reflect the nature of the underlying activity (Kahn and Ranade 2010; Hersh et al 2013; Weiskopf et al 2013; Raebel et al 2014; Rusanov et al 2014). Therefore, given a specific research question or study, it is important to assess whether the real-world data source is relevant and can reliably fulfill its intended purpose (FDA 2017; FDA 2021), whether that purpose is patient identification or recruitment, monitoring outcomes, or assessing endpoints. (See the Assessing Fitness-for-Use of Real-World Data chapter of the Living Textbook.)

In many cases, the same study concept can be present in multiple real-world data sources. For example, a disease diagnosis can be identified through a query of the EHR, administrative claims, a patient-reported medical history, or a disease registry. When designing a study, investigators should understand the trade-offs between sources in terms of factors such as completeness across the potential study population and length of follow-up. Depending on how data are captured, multiple sources may be needed to adequately support a study. In this case, an adjudication process is often necessary to decide what to do if there is discordance between sources (Rockhold et al 2020). Investigators should plan for such a process in advance.
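One way to picture the discordance check that precedes adjudication is sketched below. It assumes each patient has a diagnosis flag per source (True, False, or None when the source is unavailable). The source names, patient identifiers, and status labels are hypothetical, and the adjudication rules themselves would come from the study protocol rather than from code like this.

```python
from typing import Dict, Optional


def compare_sources(flags: Dict[str, Optional[bool]]) -> str:
    """Classify one patient's diagnosis status across sources.

    flags maps a source name to True (diagnosis present), False (absent),
    or None (source not available for this patient).
    """
    observed = [value for value in flags.values() if value is not None]
    if not observed:
        return "no_data"
    if all(observed):
        return "concordant_present"
    if not any(observed):
        return "concordant_absent"
    return "discordant"  # sources disagree: route to adjudication


# Hypothetical patients and source names, for illustration only.
patients = {
    "P001": {"ehr": True, "claims": True, "patient_report": True, "registry": None},
    "P002": {"ehr": True, "claims": False, "patient_report": True, "registry": None},
    "P003": {"ehr": None, "claims": None, "patient_report": None, "registry": None},
}

status = {pid: compare_sources(flags) for pid, flags in patients.items()}
needs_adjudication = [pid for pid, s in status.items() if s == "discordant"]
```

Keeping "no_data" separate from "concordant_absent" matters here, because a source that was never queried for a patient is not evidence that the diagnosis is absent.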

In addition, when combining study data with real-world data sources, some form of record linkage is usually required to match patients across sources. Several techniques are available. Some rely on deterministic matches of clear-text identifiers, while others rely on probabilistic weighting of encrypted tokens generated from combinations of identifiers (such as first and last names, date of birth, and current zip code) (Grannis et al 2002; Durham et al 2010; Kum et al 2014; Setoguchi et al 2014; Durojaiye et al 2018; Karr et al 2019). Not all data holders are able to support all of these methods, so it is important to understand their capabilities, as well as the identifiers to which they have access. Study teams may then need to collect these same identifiers to allow linkage to occur.
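The contrast between the two families of linkage techniques can be sketched as follows. This is a minimal illustration, not any cited method: the deterministic rule requires all clear-text identifiers to agree, while the token approach hashes combinations of identifiers with a keyed hash (HMAC-SHA256) and adds up weights for agreeing tokens. The secret key, token definitions, weights, and threshold are all assumptions made for the example. In particular, the fixed weights are a simplified stand-in for weights that a probabilistic linkage model would estimate from data.

```python
import hashlib
import hmac

# Hypothetical shared secret; real projects need agreed key management.
SECRET_KEY = b"example-key-not-for-production"

# Illustrative weights and threshold, not estimated from any data.
WEIGHTS = {"name_dob": 0.5, "last_dob_zip": 0.3, "first_last_zip": 0.2}
THRESHOLD = 0.5


def normalize(value: str) -> str:
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def token(*fields: str) -> str:
    """Keyed hash of a combination of normalized identifiers."""
    message = "|".join(normalize(f) for f in fields).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_tokens(rec: dict) -> dict:
    """Build one token per identifier combination for a record."""
    return {
        "name_dob": token(rec["first"], rec["last"], rec["dob"]),
        "last_dob_zip": token(rec["last"], rec["dob"], rec["zip"]),
        "first_last_zip": token(rec["first"], rec["last"], rec["zip"]),
    }


def deterministic_match(a: dict, b: dict) -> bool:
    """Exact agreement on every clear-text identifier after normalization."""
    return all(normalize(a[k]) == normalize(b[k]) for k in ("first", "last", "dob", "zip"))


def token_score(a_tokens: dict, b_tokens: dict) -> float:
    """Sum the weights of the tokens on which the two records agree."""
    return sum(w for name, w in WEIGHTS.items() if a_tokens[name] == b_tokens[name])


# Hypothetical records: same person, but the zip code changed after a move.
study_rec = {"first": "Ana", "last": "O'Neil", "dob": "1950-03-07", "zip": "27701"}
source_rec = {"first": "ana", "last": "ONeil", "dob": "1950-03-07", "zip": "27705"}

exact = deterministic_match(study_rec, source_rec)
score = token_score(make_tokens(study_rec), make_tokens(source_rec))
is_link = score >= THRESHOLD
```

In this example the deterministic rule rejects the pair because the zip codes differ, while the token score still reaches the threshold through the name and date-of-birth token. Both parties must apply the same normalization and key before hashing, which is one reason study teams need to know which identifiers a data holder can supply.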

Resources

Real-World Data and Real-World Evidence in Regulatory Decisions
NIH Pragmatic Trials Collaboratory EHR Workshop video module. Jacqueline Corrigan-Curay of the US Food and Drug Administration discusses recent trends in incorporating real-world data and real-world evidence in regulatory decisions.

Using Real-World Data to Plan Eligibility Criteria and Enhance Recruitment
NIH Pragmatic Trials Collaboratory PCT Grand Rounds; July 31, 2020

REFERENCES

Durham E, Xue Y, Kantarcioglu M, Malin B. 2010. Private medical record linkage with approximate matching. AMIA Annu Symp Proc. 2010:182-186. PMID: 21346965.

Durojaiye AB, Puett LL, Levin S, et al. 2018. Linking electronic health record and trauma registry data: assessing the value of probabilistic linkage. Methods Inf Med. 57:261-269. doi:10.1055/s-0039-1681087. PMID: 30875705.

Grannis SJ, Overhage JM, McDonald CJ. 2002. Analysis of identifier performance using a deterministic linkage algorithm. Proc AMIA Symp. 2002:305-309. PMID: 12463836.

Hersh WR, Weiner MG, Embi PJ, et al. 2013. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care. 51:S30-S37. doi:10.1097/MLR.0b013e31829b1dbd. PMID: 23774517.

Kahn MG, Ranade D. 2010. The impact of electronic medical records data sources on an adverse drug event quality measure. J Am Med Inform Assoc. 17:185-191. doi:10.1136/jamia.2009.002451. PMID: 20190062.

Karr AF, Taylor MT, West SL, et al. 2019. Comparing record linkage software programs and algorithms using real-world data. PLoS One. 14:e0221459. doi:10.1371/journal.pone.0221459. PMID: 32352389.

Kum H-C, Krishnamurthy A, Machanavajjhala A, Reiter MK, Ahalt S. 2014. Privacy preserving interactive record linkage (PPIRL). J Am Med Inform Assoc. 21:212-220. doi:10.1136/amiajnl-2013-002165. PMID: 24201028.

Raebel MA, Haynes K, Woodworth TS, et al. 2014. Electronic clinical laboratory test results data tables: lessons from Mini-Sentinel. Pharmacoepidemiol Drug Saf. 23:609-618. doi:10.1002/pds.3580. PMID: 24677577.

Rockhold FW, Tenenbaum JD, Richesson R, Marsolo KA, O'Brien EC. 2020. Design and analytic considerations for using patient-reported health data in pragmatic clinical trials: report from an NIH Collaboratory roundtable. J Am Med Inform Assoc. 27:634-638. doi:10.1093/jamia/ocz226. PMID: 32027359.

Rusanov A, Weiskopf NG, Wang S, Weng C. 2014. Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Med Inform Decis Mak. 14:51. doi:10.1186/1472-6947-14-51. PMID: 24916006.

Setoguchi S, Zhu Y, Jalbert JJ, Williams LA, Chen C-Y. 2014. Validity of deterministic record linkage using multiple indirect personal identifiers: linking a large registry to claims data. Circ Cardiovasc Qual Outcomes. 7:475-480. doi:10.1161/CIRCOUTCOMES.113.000294. PMID: 24755909.

US Food and Drug Administration. Guidance for Industry. 2009. Patient-Reported Outcome Measures: Use in Medical Product Development to Support Labeling Claims. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/patient-reported-outcome-measures-use-medical-product-development-support-labeling-claims. Accessed August 21, 2020.

US Food and Drug Administration. 2017. Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices Guidance for Industry and Food and Drug Administration Staff. https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence. Accessed August 20, 2020.

US Food and Drug Administration. 2021. Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products—Draft Guidance for Industry. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/real-world-data-assessing-electronic-health-records-and-medical-claims-data-support-regulatory. Accessed July 21, 2022.

Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. 2013. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform. 46:830-836. doi:10.1016/j.jbi.2013.06.010. PMID: 23820016.

Version History

October 14, 2022: Made nonsubstantive changes to the text, added an image and updated links in the Resources sidebar, and added Seils as a contributing editor as part of the annual content update (changes made by D. Seils).

January 18, 2021: Added EHR Workshop video module to the resource bar (changes made by K. Staman).

Published August 25, 2020

Rethinking Clinical Trials

A Living Textbook of Pragmatic Clinical Trials
