Assessing Fitness for Use of Real-World Data Sources
Section 2
Defining Fitness for Use
Given the widespread adoption of electronic health records (EHRs) and the general availability of administrative claims in electronic formats, there is a corresponding interest in leveraging these data sources for research (Bayley et al 2013; Botsis et al 2010; Coorevits et al 2013; Etheredge 2007; Friedman et al 2010; Hersh et al 2013; Jensen et al 2012; Weiner and Embi 2009). This interest has spurred the development of approaches to assess the underlying data quality, often by defining data checks that can be executed against a dataset (Kahn and Todd 2008; Kahn et al 2010; Kahn et al 2012; Khare et al 2017; Qualls et al 2018; Rogers et al 2019).
Data checks, or metrics, can describe characteristics of a dataset, including missing values, outliers, and frequency distributions. However, determining whether the result or value of a particular metric is good or bad depends on the needs of the research project and the intended use of the data. For example, in determining eligibility for a study, it may be sufficient to simply know whether a patient has an available laboratory result, as study coordinators will need to complete a screening form that involves chart review. In this case, using the presence of a result (regardless of unit or value) is an adequate filter. However, if a lab result were going to serve as the biomarker endpoint for a trial, more rigorous thresholds might be needed. For example, each result may need an actual value, a unit of measure, and a measure of confidence in the accuracy of the result. In other words, when it comes to using real-world data in clinical research, datasets must be considered in the context of a specific project or analysis to determine whether they are suitable, or fit, for use.
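The contrast between descriptive metrics and use-specific thresholds can be sketched in code. The following is an illustrative example only; the field names, record structure, and cutoffs are hypothetical and do not come from any standard data-check library.

```python
# Illustrative sketch of simple data-quality metrics (missingness, outliers,
# frequency distributions) and two fitness-for-use style filters. All field
# names and thresholds here are hypothetical examples.
from collections import Counter
from statistics import mean, stdev

records = [  # toy lab-result records; None marks a missing value
    {"patient": "A", "loinc": "2160-0", "value": 1.1, "unit": "mg/dL"},
    {"patient": "B", "loinc": "2160-0", "value": None, "unit": None},
    {"patient": "C", "loinc": "2160-0", "value": 9.8, "unit": "mg/dL"},
    {"patient": "D", "loinc": "2160-0", "value": 1.0, "unit": None},
]

def missingness(recs, field):
    """Fraction of records with no value for `field` (a descriptive metric)."""
    return sum(1 for r in recs if r[field] is None) / len(recs)

def outliers(recs, field, z=1.0):
    """Values more than `z` sample standard deviations from the mean."""
    vals = [r[field] for r in recs if r[field] is not None]
    m, s = mean(vals), stdev(vals)
    return [v for v in vals if abs(v - m) > z * s]

def frequencies(recs, field):
    """Frequency distribution of a categorical field."""
    return Counter(r[field] for r in recs)

# The same records pass a loose screening filter (any result present) but
# some fail a stricter endpoint filter (value AND unit both required).
screening_ok = [r for r in records if r["value"] is not None]
endpoint_ok = [r for r in records if r["value"] is not None and r["unit"] is not None]
```

Note that neither filter is "correct" in isolation: the screening filter retains three of the four records, the endpoint filter only two, and which one is appropriate depends entirely on the intended use.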
“Fitness-for-use” is a nebulous concept, and defining it is more art than science, with few hard and fast rules established thus far. When it comes to the use of real-world evidence derived from real-world data for regulatory decision-making, the US Food and Drug Administration (FDA) has provided guidance through the recommendations contained in the Framework for FDA's Real-World Evidence Program (FDA 2018) and further highlighted in Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products (FDA 2021). The FDA defines fitness for use in terms of relevance and reliability. Relevance “includes the availability of key data elements (exposures, outcomes, covariates) and sufficient number of representative patients for the study,” while reliability is focused on “data accuracy, completeness, provenance and traceability” (FDA 2021). These terms are described in more detail below. The FDA has stated that it can respond to study teams regarding whether a specific set of assessments is sufficient to determine the fitness for use of a real-world data source for a given study or analysis. However, the FDA has not endorsed any single assessment package as sufficient for all studies that use real-world data. As the field gains more experience and confidence with the use of real-world data and real-world evidence, we expect more refinement in this area.
Key Point: On a study-by-study basis, FDA will work with stakeholders to evaluate whether a given assessment is suitable for a particular research question.
Relevance
A real-world data source is said to be relevant if:
- the data apply to the question at hand;
- For example, the data contain sufficient detail to capture the use or exposure of the product or device and/or the outcome of interest.
- the data are amenable to sound clinical and statistical analysis; and
- For example, the data can be used to answer the specified question using the proposed statistical plan.
- the data and the evidence the source provides are interpretable using informed clinical and statistical judgment.
- For example, the use of a device or product in a real-world population is representative of what is captured in the data source, is generalizable to the relevant population under study, etc (FDA 2018).
The "sufficient detail" needed to capture use or exposure may depend on the intended use case. For instance, medication prescription data may be sufficient if an investigator is planning to screen patients for a trial, while dispensing or medication administration data would be a more reliable indicator for exposure as part of an outcome or endpoint. Along the same lines, "amenable to analysis" means that the data are specific enough to support the study question. For example, having death data may be sufficient for one study but not another if it is important to distinguish between all-cause mortality and mortality due to a specific cause.
Investigators will often have a general sense of the relevance of a data source before attempting to use it as part of a study, but it may be necessary to include additional analyses to better demonstrate applicability. This is particularly true for real-world data sources like EHRs. While administrative claims tend to have complete capture of all medically attended events during a given enrollment period, the same concept does not exist within EHRs. Encounters may occur outside a given health system; even within a health system, data collection within the EHR can be variable, particularly for workflows that are not tied to reimbursement. Practices vary by hospital, clinic, and/or provider, and the availability of data for longitudinal analysis may be affected by when the EHR or other clinical information system was deployed across the health system. All of these factors should be taken into account when assessing fitness for use, particularly for studies that rely on EHR data from different healthcare systems.
Reliability: Data Accrual
Data accrual relates to aspects of how the data in the source are collected or captured. Reliable documentation of data accrual methods for a real-world data source includes:
- an operational manual that pre-specifies the data elements to be collected;
- the definitions of those data elements;
- methods of data aggregation, transformation, and documentation; and
- a relevant time window, etc (FDA 2018).
This information is expected for real-world data sources like patient or device registries (Gliklich et al 2010; International Medical Device Regulators Forum Group 2015; Krucoff et al 2015; Patient-Centered Outcomes Research Institute 2012), as well as those that collect data directly as part of a study (eg, patient-reported outcomes or patient-generated data). Secondary real-world data sources like EHRs and administrative claims lack many of these characteristics. It is possible, however, to approximate some aspects through items like data dictionaries or data model specifications; provenance surveys that detail the source of certain data elements; and workflow descriptions that document how data elements were captured over time, including any changes or modifications (eg, patient-reported outcomes initially captured at an in-clinic kiosk in the waiting room, and later via questionnaires completed at home through the healthcare system's patient portal). Documentation of the procedures and specifications used to translate EHR data from the source system to the target database (eg, a common data model or database extract) can provide further insight into the practices of data accrual.
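One way to approximate accrual documentation for a secondary data source is to keep it in a machine-readable form. The structure below is a hypothetical sketch; the field names and workflow entries are illustrative and not drawn from any standard or common data model.

```python
# Hypothetical data-dictionary entry approximating accrual documentation
# (element definition, provenance, capture workflow over time, time window).
# All field names and values are illustrative assumptions.
data_dictionary_entry = {
    "element": "systolic_bp",
    "definition": "Systolic blood pressure in mmHg, seated, first reading",
    "source_system": "EHR flowsheet",  # provenance
    "capture_workflow": [  # how capture changed over time (YYYY-MM bounds)
        {"from": "2015-01", "to": "2018-06", "method": "in-clinic kiosk"},
        {"from": "2018-07", "to": None, "method": "patient portal questionnaire"},
    ],
    "transformations": ["unit check (mmHg)", "mapped to common data model"],
    "time_window": {"start": "2015-01-01", "end": "2020-12-31"},
}

def capture_method(entry, month):
    """Return how the element was captured in a given YYYY-MM month."""
    for w in entry["capture_workflow"]:
        if w["from"] <= month and (w["to"] is None or month <= w["to"]):
            return w["method"]
    return None
```

Recording workflow changes with explicit date bounds, as sketched here, makes it possible to flag longitudinal analyses that span a change in how a data element was captured.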
Reliability
For secondary data sources like EHRs and administrative claims, data reliability concerns aspects of data quality and provenance over the “life cycle” of the data, or the steps that occur as data are curated and transformed from initial capture in the source system(s) to data repositories/common data models to a final analytic dataset. Activities to ensure data reliability include the execution of data checks that can describe the completeness, conformance, and plausibility of the data (see Section 4), and documentation of the data quality processes for the various transformation steps along the data life cycle to ensure the overall validity and integrity of the data (see Section 7).
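The three check categories named above can be illustrated with a minimal sketch. The field, value formats, and plausibility range below are hypothetical assumptions for illustration, not a published rule set.

```python
# Minimal sketch of completeness, conformance, and plausibility checks
# applied to a toy heart-rate field. The expected format (integer string)
# and the 20-250 bpm range are illustrative assumptions.
import re

rows = [
    {"heart_rate": "72"},
    {"heart_rate": None},    # completeness failure
    {"heart_rate": "fast"},  # conformance failure (not numeric)
    {"heart_rate": "400"},   # plausibility failure (outside 20-250 bpm)
]

def check_completeness(rows, field):
    """Fraction of rows where the field is populated."""
    return sum(1 for r in rows if r[field] is not None) / len(rows)

def check_conformance(value):
    """Does the value match the expected format (an integer string)?"""
    return value is not None and re.fullmatch(r"\d+", value) is not None

def check_plausibility(value, low=20, high=250):
    """Is a conformant value within a clinically plausible range?"""
    return check_conformance(value) and low <= int(value) <= high

results = {
    "completeness": check_completeness(rows, "heart_rate"),
    "conformant": sum(check_conformance(r["heart_rate"]) for r in rows),
    "plausible": sum(check_plausibility(r["heart_rate"]) for r in rows),
}
```

The three checks are deliberately layered: a value must be present before its format can be checked, and it must conform before a range check is meaningful.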
Real-world data sources like patient or device registries (Gliklich et al 2010; International Medical Device Regulators Forum Group 2015; Krucoff et al 2015; Patient-Centered Outcomes Research Institute 2012), as well as data sources collected directly as part of a study (eg, patient-reported outcomes or patient-generated data), can provide a template for the types of information that should be documented to demonstrate reliability. Although secondary real-world data sources like EHRs and administrative claims lack some of the characteristics of these data sources, the same items that approximate accrual documentation (data dictionaries or data model specifications, provenance surveys, workflow descriptions, and documentation of the procedures used to translate data from the source system to the target database) also serve to demonstrate the reliability of the data across its life cycle.
Resources
Leveraging RWE to Support Regulatory Decisions–An Update on Efforts to Inform Policy; NIH Collaboratory Grand Rounds; March 15, 2019
Expanding Use of Real-World Evidence: A National Academies Workshop Series; NIH Collaboratory Grand Rounds; April 27, 2018
References
Gliklich RE, Dreyer NA, eds. 2010. Registries for Evaluating Patient Outcomes: A User's Guide. Rockville, Maryland: Agency for Healthcare Research and Quality. https://effectivehealthcare.ahrq.gov/products/registries-guide-4th-edition/. Accessed August 24, 2020.
Bayley KB, Belnap T, Savitz L, Masica AL, Shah N, Fleming NS. 2013. Challenges in using electronic health record data for CER experience of 4 learning organizations and solutions applied. Med Care. 51:S80-S86. doi:10.1097/MLR.0b013e31829b1d48. PMID: 23774512.
Botsis T, Hartvigsen G, Chen F, Weng C. 2010. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinform. 2010:1-5. PMID: 21347133.
Coorevits P, Sundgren M, Klein GO, et al. 2013. Electronic health records: new opportunities for clinical research. J Intern Med. 274:547-560. doi:10.1111/joim.12119. PMID: 23952476.
Etheredge LM. 2007. A rapid-learning health system. Health Aff (Millwood). 26(2):w107-w118. doi:10.1377/hlthaff.26.2.w107. PMID: 17259191.
FDA. 2018. Framework for FDA's Real-World Evidence Program. https://www.fda.gov/media/120060/download. Accessed August 25, 2020.
FDA. 2021. Real-World Data: Assessing Electronic Health Records and Medical Claims Data To Support Regulatory Decision-Making for Drug and Biological Products. https://www.fda.gov/media/152503/download. Accessed August 26, 2021.
Friedman CP, Wong AK, Blumenthal D. 2010. Achieving a nationwide learning health system. Sci Transl Med. 2:57cm29. doi:10.1126/scitranslmed.3001456. PMID: 21068440.
Hersh WR, Cimino J, Payne PRO, et al. 2013. Recommendations for the use of operational electronic health record data in comparative effectiveness research. EGEMS (Washington, DC). 1(1):1018. doi:10.13063/2327-9214.1018. PMID: 25848563.
International Medical Device Regulators Forum Group. 2015. Patient Registry: Essential Principles. https://www.imdrf.org/sites/default/files/2021-09/imdrf-cons-essential-principles-151124.pdf. Accessed December 3, 2025.
Jensen PB, Jensen LJ, Brunak S. 2012. Mining electronic health records: towards better research applications and clinical care. Nat Rev Genet. 13:395-405. doi:10.1038/nrg3208. PMID: 22549152.
Kahn MG, Todd J. 2008. Comparative quality measures: putting evidence above expediency. Pediatrics. 122(1):182-183. doi:10.1542/peds.2008-1042. PMID: 18596002.
Kahn MG, Eliason BB, Bathurst J. 2010. Quantifying clinical data quality using relative gold standards. AMIA Annu Symp Proc. 2010 Nov 13:356-360. PMID: 21347000.
Kahn MG, Raebel MA, Glanz JM, Riedlinger K, Steiner JF. 2012. A pragmatic framework for single-site and multisite data quality assessment in electronic health record-based clinical research. Med Care. 50 Suppl(0):S21-S29. doi:10.1097/MLR.0b013e318257dd67. PMID: 22692254.
Khare R, Utidjian L, Ruth BJ, et al. 2017. A longitudinal analysis of data quality in a large pediatric data research network. J Am Med Inform Assoc. 24(6):1072-1079. doi:10.1093/jamia/ocx033. PMID: 28398525.
Krucoff M, Normand S, Edwards F, et al. 2015. Recommendations for a National Medical Device Evaluation System: Strategically Coordinated Registry Networks to Bridge Clinical Care and Research. https://www.fda.gov/media/93140/download. Accessed August 24, 2020.
Patient-Centered Outcomes Research Institute. 2012. Standards in the Conduct of Registry Studies for Patient-Centered Outcomes Research.
PCORnet. PCORnet Common Data Model (CDM). https://pcornet.org/data-driven-common-model/. Accessed January 26, 2017.
Qualls LG, Phillips TA, Topping J, et al. 2018. Evaluating foundational data quality in the national Patient-Centered Clinical Research Network (PCORnet). EGEMS (Wash DC). 6(1):3. doi:10.5334/egems.199. PMID: 29881761.
Weiner MG, Embi PJ. 2009. Toward reuse of clinical data for research and quality improvement: the end of the beginning? Ann Intern Med. 151(5):359-360. doi:10.7326/0003-4819-151-5-200909010-00141. PMID: 19638404.