Assessing Fitness for Use of Real-World Data Sources
Section 4
Data Quality Measures
Data quality is a multifaceted and context-dependent concept. Various groups that use secondary real-world data sources for research have extensive experience in assessing the quality of these data and have developed metrics and best practices for their use. Distributed research networks like the Health Care Systems Research Network (HCSRN; formerly the HMO Research Network) (Newton and Larson 2012; Ross et al 2014; Steiner et al 2014; Vogt et al 2004), the Center for Effectiveness & Safety Research (CESR), and Sentinel (Curtis et al 2012; Curtis et al 2014) developed processes to characterize administrative claims, with Sentinel informing much of the FDA's early work with real-world data and real-world evidence, particularly around pharmacoepidemiology (Carnahan and Moores 2012). The growing adoption of EHRs and the potential to use EHR data for research spurred a number of efforts to develop procedures that could assess EHR data quality, with 2 large-scale examples being the National Patient-Centered Clinical Research Network (PCORnet) and Observational Health Data Sciences and Informatics (OHDSI) (Califf 2014; Fleurence et al 2014; Hripcsak et al 2015).
Across these different approaches, there was varying use of terms to describe the purpose of each data quality measure (data check)—consistency, completeness, accuracy, precision, concordance, correctness, etc. This variation made it difficult to identify commonalities across initiatives or to compare the results of one check to another. As a result, Kahn et al (2016) developed a harmonized data quality framework that groups checks into 3 major categories: conformance (Does the format of the data adhere to the underlying model?), completeness (Are there values where we expect to see data populated?), and plausibility (Do the values that appear make sense?). Each check is further classified by its purpose, whether it is for verification or validation. Verification checks are used to determine whether data match or are consistent with internal expectations, and validation checks are used to compare against external "gold standard" benchmarks. The vast majority of checks that have been developed are verification checks. This is partly due to the difficulty of identifying gold standards to use as comparators, and also because validation is more straightforward against smaller cohorts (eg, patients with heart failure) as opposed to the entire population of a dataset, which may represent all patients who receive care within a healthcare system or all those enrolled in a health plan.
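The 3 categories of the Kahn et al (2016) framework can be illustrated with a minimal sketch. The records, field names, permitted codes, and thresholds below are all hypothetical, chosen only to show one verification check per category against a toy patient table.

```python
from datetime import date

# Toy patient records standing in for one CDM table (hypothetical data).
records = [
    {"patient_id": "P1", "sex": "F", "birth_date": date(1980, 5, 1), "weight_kg": 68.0},
    {"patient_id": "P2", "sex": "X", "birth_date": date(1975, 3, 2), "weight_kg": 81.5},
    {"patient_id": "P3", "sex": "M", "birth_date": None,             "weight_kg": 650.0},
]

# Conformance (verification): do values adhere to the model's permitted codes?
ALLOWED_SEX_CODES = {"F", "M", "U"}
conformance_failures = [r["patient_id"] for r in records
                        if r["sex"] not in ALLOWED_SEX_CODES]

# Completeness (verification): are expected fields populated?
completeness_failures = [r["patient_id"] for r in records
                         if r["birth_date"] is None]

# Plausibility (verification): do values fall within a believable range?
plausibility_failures = [r["patient_id"] for r in records
                         if not (0.5 <= r["weight_kg"] <= 400)]

print(conformance_failures)   # ['P2']
print(completeness_failures)  # ['P3']
print(plausibility_failures)  # ['P3']
```

A validation version of the same checks would compare these results against an external gold standard (eg, chart review) rather than against internal expectations.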
A given data check may apply to multiple fields or tables within a CDM (eg, all required fields are populated), so networks often refer to data check instances, or data check measures, which represent a specific instantiation of the check against a given table/field. In this manner, a handful of checks can lead to hundreds or thousands of data check measures. For instance, within PCORnet, 38 data checks translate into more than 1400 data check measures (PCORnet Distributed Research Network Operations Center), while OHDSI has 20 checks that resolve into more than 3300 data check measures. The vast majority of data checks tend to be related to conformance, so CDMs with a higher number of tables and fields will end up with more data check measures.
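This expansion of a few checks into many measures can be sketched as follows; the table and field names below are illustrative stand-ins, not those of any real CDM.

```python
# One generic check ("required field is populated") instantiated across
# every required field of a toy CDM yields one data check measure per
# table/field pair. Table and field names here are hypothetical.
REQUIRED_FIELDS = {
    "DEMOGRAPHIC": ["patient_id", "birth_date", "sex"],
    "ENCOUNTER":   ["encounter_id", "patient_id", "admit_date"],
    "DIAGNOSIS":   ["diagnosis_id", "encounter_id", "dx_code"],
}

def instantiate(check_name, required_fields):
    """Expand one check definition into concrete data check measures."""
    return [f"{check_name}:{table}.{field}"
            for table, fields in required_fields.items()
            for field in fields]

measures = instantiate("not_null", REQUIRED_FIELDS)
print(len(measures))  # 9 measures from a single check definition
```

With dozens of checks and a CDM containing hundreds of fields, this same pattern produces the thousands of measures reported by PCORnet and OHDSI.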
Example verification checks from several existing distributed research networks are provided in the table below. The source (network) of each check is listed, unless a similar version of the check is used by multiple networks. These examples were taken from publicly available material. While many networks and organizations that use real-world data have quality assurance processes, those that make the content and programs behind these processes freely available remain the exception rather than the norm. (Some groups will provide material only upon request.) For the field to advance and arrive at a consensus on a set of minimum necessary checks, greater sharing and transparency of assessment methods are needed. This has become increasingly relevant during the COVID-19 pandemic, because of the potential misuse or misrepresentation of data—as in the case of Surgisphere, which analyzed and published information on hydroxychloroquine from EHRs but was unable to verify its data, resulting in the retraction of a peer-reviewed article (Ledford and Van Noorden 2020). There is also the need to rapidly run parallel analyses across a series of real-world data sources in a way that generates comparable results.
Table 1. Categories of Data Quality Checks and Examples From Distributed Research Networks
| Category | Subcategory | Description | Data Check Example |
|---|---|---|---|
| Conformance | Value | Determines whether the data conform to the formats of the data model used to store them | Required fields do not contain values outside of the CDM specifications (multiple networks) |
| Conformance | Relational | Determines whether the data agree with the constraints imposed by the database used to store them (eg, primary or foreign key relationships) | All fields are present in each CDM table (multiple networks) |
| Conformance | Calculation | Evaluates whether variables derived computationally yield valid results | Enrollment periods do not overlap, and are not duplicates or subsets of one another (Sentinel) |
| Completeness | — | Examines whether expected values are present (single time point or longitudinally) | Fewer than 50% of patients with an encounter have diagnosis data in the CDM (PCORnet) |
| Plausibility | Uniqueness | Determines whether multiple values exist when only one value is expected | Patient does not have multiple inpatient admissions to the same facility on the same day (CESR) |
| Plausibility | Atemporal | Measures whether data agree with expected values | More than X% of records fall into the lowest or highest categories of age, height, weight, diastolic blood pressure, etc (multiple networks) |
| Plausibility | Temporal | Examines whether variables change as expected over a specified time period | More than X% of records have illogical date relationships (eg, events before date of birth, events after date of death) (multiple networks) |
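The temporal plausibility check in the table (illogical date relationships) can be sketched as a simple comparison of event dates against each patient's birth and death dates. The records and date values below are hypothetical.

```python
from datetime import date

# Hypothetical patient reference dates and event records.
patients = {
    "P1": {"birth": date(1950, 1, 1), "death": date(2020, 6, 1)},
    "P2": {"birth": date(1990, 7, 15), "death": None},
}
events = [
    {"patient_id": "P1", "event_date": date(2021, 3, 1)},  # after death
    {"patient_id": "P2", "event_date": date(1989, 1, 1)},  # before birth
    {"patient_id": "P2", "event_date": date(2015, 4, 2)},  # plausible
]

def illogical(ev):
    """Flag events that occur before birth or after death."""
    p = patients[ev["patient_id"]]
    before_birth = ev["event_date"] < p["birth"]
    after_death = p["death"] is not None and ev["event_date"] > p["death"]
    return before_birth or after_death

flagged = [e for e in events if illogical(e)]
pct_illogical = 100 * len(flagged) / len(events)
print(round(pct_illogical, 1))  # 66.7
```

In practice the "More than X%" threshold would be set by the network; the check passes or fails by comparing `pct_illogical` against that threshold.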
The harmonized terminology proposed by Kahn et al (2016) provides a useful organizing framework for considering what to assess within a dataset. Yet there are a few gaps. For example, the concept of timeliness or data latency (ie, Given a dataset, how recent are the records?) does not exist in the framework. So, to avoid a new entry, those checks would need to be framed in terms of completeness or plausibility (eg, Does the number of records for a given month look "complete" given expectations? Is the monthly volume of lab records an outlier compared to recent trends?). In addition, persistence checks—those that assess changes in the dataset over time (eg, Is there a large drop in the number of records or number of patients from one refresh to the next?)—do not have a ready home in the framework. They can be considered a type of completeness check, but instead of record-level completeness, the comparison is between 2 versions of the same dataset from different time points. Persistence checks may be less relevant to studies that rely on a single data extract than to those that use multiple data pulls over the course of several years. However, because persistence issues can appear in any dataset, study teams should have a sense of a data source's performance on this measure, even if they cannot measure it directly with the data they receive.
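Both gaps described above lend themselves to simple metrics. Below is a minimal sketch of a latency measure and a refresh-to-refresh persistence check; the record dates, counts, and the 10% drop threshold are all hypothetical.

```python
from datetime import date

def latency_days(record_dates, as_of):
    """Timeliness/data latency: days between the most recent record
    in the extract and a reference date."""
    return (as_of - max(record_dates)).days

def persistence_drop(prev_count, curr_count, threshold=0.10):
    """Persistence: flag a refresh whose record count fell by more
    than `threshold` relative to the prior refresh."""
    drop = (prev_count - curr_count) / prev_count
    return drop > threshold

# Hypothetical extract: most recent record is 30 days before the cut date.
dates = [date(2023, 1, 5), date(2023, 3, 20), date(2023, 4, 1)]
print(latency_days(dates, as_of=date(2023, 5, 1)))  # 30

# Hypothetical refreshes: a 15% drop in records exceeds the 10% threshold.
print(persistence_drop(prev_count=1_000_000, curr_count=850_000))  # True
```

A study team receiving a single extract could not compute the persistence check directly, which is why the data provider's own monitoring of this measure matters.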
The NIH Collaboratory's Electronic Health Records Core Working Group developed a white paper and recommendations for assessing data quality in pragmatic clinical trials. The recommendations include a formal assessment against 3 domains—accuracy, completeness, and consistency—for key data elements, such as endpoints. The report offers examples of metrics that can be used for each dimension; the optimal metric depends on the nature of the data and the research study. (Note that the white paper predates the harmonized framework of Kahn et al [2016], so consistency would now be referred to as conformance.)
Data Quality Checks
The differences between the terminologies describing EHR and claims-based data checks and the language the FDA uses to describe whether a dataset is fit for use can lead to confusion when comparing or mapping between them. Using data checks to assess the quality of a dataset can be considered a type of data assurance. Such a process is a necessary component of demonstrating data assurance, but it is likely insufficient to satisfy all requirements in that area or to demonstrate overall fitness for use. As noted above, most of the data checks developed thus far are for verification purposes, which speaks to the reliability of the data. However, a dataset can satisfy all of these data checks and still be inappropriate for a specific research question because of relevance concerns. Validation checks could be one way to assess the relevance of a dataset, since those checks are more population-focused and targeted to specific outcomes or variables (eg, the distribution of HbA1c values in a population of patients with diabetes), but many of these checks would need to be developed and evaluated in a study-specific context. The concepts of provenance and traceability in the FDA's description of fitness for use are more difficult to address directly with data checks. For EHR and claims-based data sources, these concepts may be best handled through process documentation or otherwise describing the steps of data transformation from the original source system to the final analytic dataset.
Resources
Assessing Fitness-for-Use of Clinical Data for PCTs; 2-page handout for assessing data quality and fitness for use (2023)
Assessing Data Quality for Healthcare Systems Data Used in Clinical Research; NIH Collaboratory white paper (2015)
REFERENCES
Califf RM. 2014. The Patient-Centered Outcomes Research Network: a national infrastructure for comparative effectiveness research. N C Med J. 75:204-210. doi:10.18043/ncm.75.3.204. PMID: 24830497.
Carnahan RM, Moores KG. 2012. Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned. Pharmacoepidemiol Drug Saf. 21 Suppl 1:82-89. doi:10.1002/pds.2321. PMID: 22262596.
Curtis LH, Weiner MG, Boudreau DM, et al. 2012. Design considerations, architecture, and use of the Mini-Sentinel distributed data system. Pharmacoepidemiol Drug Saf. 21 Suppl 1:23-31. doi:10.1002/pds.2336. PMID: 22262590.
Curtis LH, Brown J, Platt R. 2014. Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Aff (Millwood). 33(7):1178-1186. doi:10.1377/hlthaff.2014.0121. PMID: 25006144.
Fleurence RL, Curtis LH, Califf RM, Platt R, Selby JV, Brown JS. 2014. Launching PCORnet, a national patient-centered clinical research network. J Am Med Inform Assoc. 21(4):578-582. doi:10.1136/amiajnl-2014-002747. PMID: 24821743.
Hripcsak G, Duke JD, Shah NH, et al. 2015. Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers. Stud Health Technol Inform. 216:574-578. PMID: 26262116.
Kahn MG, Callahan TJ, Barnard J, et al. 2016. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC). 4(1):1244. doi:10.13063/2327-9214.1244. PMID: 27713905.
Ledford H, Van Noorden R. 2020. High-profile coronavirus retractions raise concerns about data oversight. Nature. 582(7811):160. doi:10.1038/d41586-020-01695-w. PMID: 32504025.
Newton KM, Larson EB. 2012. Learning health care systems: leading through research: the 18th Annual HMO Research Network Conference, April 29-May 2, 2012, Seattle, Washington. Clin Med Res. 10(3):140-142. doi:10.3121/cmr.2012.1099. PMID: 22904375.
Ross TR, Ng D, Brown JS, et al. 2014. The HMO Research Network Virtual Data Warehouse: a public data model to support collaboration. EGEMS (Wash DC). 2(1):1049. doi:10.13063/2327-9214.1049. PMID: 25848584.
Steiner JF, Paolino AR, Thompson EE, Larson EB. 2014. Sustaining research networks: the twenty-year experience of the HMO Research Network. EGEMS (Wash DC). 2(2):1067. PMID: 25848605.
Vogt TM, Lafata JE, Tolsma DD, Greene SM. 2004. The role of research in integrated health care systems: the HMO Research Network. Perm J. 8(4):10-17. PMID: 26705313.