Electronic Health Records–Based Phenotyping
Section 5
Data Quality
The quality of the data in healthcare information systems can affect the results of phenotype-based queries to the point that the resulting data are not useful. Secondary use of healthcare data is defined as use of the data for a purpose other than that for which the data were originally collected (Safran et al 2007). Because the data were not collected with the secondary purpose in mind, secondary users should not assume that the data will meet their needs. For these reasons, data quality assessment should accompany phenotype validation.
Using healthcare data without understanding their accuracy, consistency, missingness, and possible biases can lead to misleading answers. The capacity of the data to support research conclusions is so important that requests for applications for NIH Pragmatic Trials Collaboratory trials require that data validation be addressed. A methodology report from the Patient-Centered Outcomes Research Institute (PCORI) (Kahn et al 2018) recommends reporting data quality along with study results for observational and comparative effectiveness research. The report also provides a data quality assessment model and framework. Other guidelines from research networks provide practical advice for data quality checks and reporting (Brown, Kahn, and Toh 2013; Kahn et al 2015).
The NIH Pragmatic Trials Collaboratory has developed a data quality assessment framework to help investigators and research teams identify and implement necessary assessments. (See “Assessing Data Quality for Healthcare Systems Data Used in Clinical Research.”) There are few validated electronic methods for data quality assessment that can be executed on a dataset. Instead, current methods are comparison-based: they compare chart review with the data returned from a phenotype-based query, or compare 2 datasets, to quantify the number and type of discrepancies and to understand how those discrepancies are distributed in the dataset.
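As an illustration of the comparison-based approach, the following minimal Python sketch counts discrepancies between 2 datasets by field and by type. It is not a method from the Collaboratory framework. The function name `compare_datasets`, the discrepancy labels, the field names, and the sample records are all hypothetical, and the sketch assumes that both datasets share a common record identifier.

```python
from collections import Counter


def compare_datasets(source, comparison, fields):
    """Count discrepancies between two datasets keyed by record ID.

    source, comparison: dict mapping record ID -> dict of field values.
    Returns a Counter keyed by (field, discrepancy_type).
    Discrepancy labels are illustrative, not a standard taxonomy.
    """
    counts = Counter()
    for record_id in source.keys() | comparison.keys():
        # A record present in only one dataset is counted once, at record level.
        if record_id not in comparison:
            counts[("<record>", "missing_in_comparison")] += 1
            continue
        if record_id not in source:
            counts[("<record>", "missing_in_source")] += 1
            continue
        # The record exists in both datasets, so compare field by field.
        for field in fields:
            a = source[record_id].get(field)
            b = comparison[record_id].get(field)
            if a == b:
                continue
            if a is None:
                kind = "missing_in_source"
            elif b is None:
                kind = "missing_in_comparison"
            else:
                kind = "value_mismatch"
            counts[(field, kind)] += 1
    return counts


# Hypothetical data: chart review findings vs. phenotype query output.
chart_review = {
    "p1": {"diabetes": True, "hba1c": 7.2},
    "p2": {"diabetes": False, "hba1c": None},
    "p3": {"diabetes": True, "hba1c": 8.1},
}
query_result = {
    "p1": {"diabetes": True, "hba1c": 7.2},
    "p2": {"diabetes": True, "hba1c": 5.9},
    "p4": {"diabetes": True, "hba1c": 6.5},
}

discrepancies = compare_datasets(chart_review, query_result, ["diabetes", "hba1c"])
for (field, kind), n in sorted(discrepancies.items()):
    print(f"{field}: {kind} = {n}")
```

Tallying by field and type, rather than reporting a single error rate, makes it easier to see whether discrepancies cluster in particular data elements. A real assessment would also need rules for matching records and for deciding when 2 values count as equivalent, for example after unit conversion or rounding.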
Resources
Assessing Data Quality for Healthcare Systems Data Used in Clinical Research
Guidance document from the NIH Collaboratory's Electronic Health Records Core Working Group.
REFERENCES
Brown JS, Kahn M, Toh S. 2013. Data quality assessment for comparative effectiveness research in distributed data networks. Med Care. 51(8 Suppl 3):S22-S29. doi:10.1097/MLR.0b013e31829b1e2c. PMID: 23793049.
Kahn MG, Brown JS, Chun AT, et al. 2015. Transparent reporting of data quality in distributed data networks. EGEMS (Wash DC). 3(1):1052. doi:10.13063/2327-9214.1052. PMID: 25992385.
Kahn M, Ong T, Barnard J, Maertens J. 2018. Developing Standards for Improving Measurement and Reporting of Data Quality in Health Research. Washington, DC: Patient-Centered Outcomes Research Institute. https://doi.org/10.25302/3.2018.ME.13035581. Accessed June 30, 2020.
Safran C, Bloomrosen M, Hammond WE, et al. 2007. Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. J Am Med Inform Assoc. 14(1):1-9. doi:10.1197/jamia.M2273. PMID: 17077452.
ACKNOWLEDGMENTS
Key contributors to previous versions of this chapter included Michelle Smerek, Shelley Rusincovitch, Meredith Nahm Zozus, Paramita Saha Chaudhuri, Ed Hammond, Robert Califf, Greg Simon, Beverly Green, Michael Kahn, and Reesa Laws.
The Electronic Health Records Core Working Group (formerly the Phenotypes, Data Standards, and Data Quality Core Working Group) of the NIH Collaboratory influenced much of this content through monthly meetings. These additional contributors included Monique Anderson, Nick Anderson, Alan Bauck, Denise Cifelli, Lesley Curtis, John Dickerson, Chris Helker, Michael Kahn, Cindy Kluchar, Melissa Leventhal, Rosemary Madigan, Renee Pridgen, Jon Puro, Jennifer Robinson, Jerry Sheehan, and Kari Stephens. We are also grateful to the Duke Center for Predictive Medicine for development and clarification of the scientific validity and evaluation of phenotype definitions.