Electronic Health Records–Based Phenotyping
Section 4
Evaluating Phenotype Definitions
What makes a "good" phenotype definition?
Computable phenotype definitions should be explicit, reproducible, reliable, and valid. Details of the components of a phenotype definition (such as data elements and value sets) should be provided and should be sufficient to allow the query to be reproduced in another system or by another data operator. For a phenotype definition to be reliable, it must produce a similar result with the same dataset every time it is applied. For a phenotype definition to be valid, it must identify the condition for which it was developed and meet the desired degrees of sensitivity and specificity.
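To illustrate what "explicit" means in practice, a computable phenotype definition can be expressed as structured data rather than free text, so that every data element, value set, and threshold is machine-readable. The minimal sketch below (in Python) is hypothetical: the codes, threshold, and combining logic are illustrative, not a validated value set.

```python
# A hypothetical computable phenotype definition for type 2 diabetes,
# expressed as structured data so that data elements, value sets, and
# thresholds are explicit and reproducible. Codes are illustrative only.
T2DM_PHENOTYPE = {
    "name": "type_2_diabetes_example",
    "version": "0.1",
    "criteria": {
        "diagnosis_codes": {             # value set with explicit code systems
            "ICD-9-CM": ["250.00"],
            "ICD-10-CM": ["E11.9"],
        },
        "min_code_occurrences": 2,       # require 2 or more coded encounters
        "labs": [
            {"loinc": "4548-4",          # hemoglobin A1c
             "operator": ">=", "value": 6.5, "units": "%"},
        ],
    },
    "logic": "diagnosis_codes OR labs",  # how the criteria combine
}
```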
Several metrics are used to measure the performance of a phenotype definition in different data sources or populations, analogous to measuring the performance of a case definition or diagnostic test. These metrics include sensitivity, specificity, positive predictive value, and negative predictive value.
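As a concrete illustration, all 4 metrics can be computed from a 2x2 comparison of the phenotype definition's output against a reference standard. The minimal sketch below (Python; the counts are invented for illustration) shows the calculations:

```python
def phenotype_performance(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard performance metrics for a phenotype definition, computed
    from counts against a reference (gold) standard."""
    return {
        "sensitivity": tp / (tp + fn),  # true cases correctly identified
        "specificity": tn / (tn + fp),  # true non-cases correctly identified
        "ppv": tp / (tp + fp),          # flagged patients who are true cases
        "npv": tn / (tn + fn),          # non-flagged patients who are true non-cases
    }

# Example: 90 true positives, 10 false positives, 20 false negatives,
# and 880 true negatives from a chart-reviewed sample of 1,000 patients.
print(phenotype_performance(tp=90, fp=10, fn=20, tn=880))
```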
Moreover, to become consistently used, computable phenotype definitions must leverage data that are routinely collected in most, if not all, electronic health records (EHRs) and/or ancillary data sources.
How can the validity of a phenotype definition be determined?
The validity of a phenotype definition is its ability to correctly identify which individuals exhibit the true phenotype and which do not; that is, its ability to correctly detect both people with the condition of interest and people without it.
Estimation of validity requires a gold standard, defined as the best classification available for assessing the true or actual phenotype status. Establishing a gold standard is a resource-intensive process that requires careful manual review of current and historical individual patient data. For logistical and efficiency reasons, multiple clinical reviewers are usually involved, and initial training of the reviewers is crucial to ensure that conclusions drawn from patient records are consistent. Most studies use expert clinicians to review identified cases but do not specify how the reviewers were trained or how true disease or case status was determined.
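One common way to check consistency between reviewers (not specified in this chapter, but widely used in chart-review studies) is a chance-corrected agreement statistic such as Cohen's kappa. A minimal sketch, assuming 2 reviewers assign case/non-case labels to the same charts:

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: chance-corrected agreement between two reviewers
    assigning case (1) / non-case (0) labels to the same charts."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Two reviewers classify 10 charts (1 = case, 0 = non-case).
reviewer_1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
reviewer_2 = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")  # kappa = 0.80
```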
Many phenotype developers have conducted validation studies (Newton et al 2013; Peissig et al 2012; Rosenman et al 2014), but none appear to have used a controlled approach. Some investigators attempt to characterize the validity of a phenotype definition by using agreement rates between the phenotype definition and a known standard, while other investigators report the sensitivity or specificity of the phenotype definition compared with a known or gold standard. In this context, sensitivity is the ability to correctly identify individuals who have the phenotype, and specificity is the ability to correctly identify those who do not have the phenotype.
Positive predictive value is the proportion of individuals classified by the phenotype definition as having the condition who truly have it. Negative predictive value is the proportion of individuals classified as not having the condition who truly do not. Both values indicate how well a phenotype definition will perform when it is used in practice. Like sensitivity and specificity, positive and negative predictive values require knowledge of the true phenotype. They can also be estimated from the sensitivity, specificity, and prevalence of the condition in the population being examined.
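Because predictive values depend on prevalence, the same phenotype definition can perform very differently in different populations. A minimal sketch of this estimation via Bayes' theorem (Python; the sensitivity, specificity, and prevalence values are illustrative):

```python
def predictive_values(sens: float, spec: float, prev: float) -> tuple:
    """Estimate PPV and NPV from sensitivity, specificity, and the
    prevalence of the condition in the target population (Bayes' theorem)."""
    ppv = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
    npv = (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# The same definition (sensitivity 0.90, specificity 0.95) yields a much
# lower PPV when the condition is rare (1%) than when it is common (20%).
for prev in (0.20, 0.01):
    ppv, npv = predictive_values(0.90, 0.95, prev)
    print(f"prevalence {prev:.0%}: PPV = {ppv:.2f}, NPV = {npv:.3f}")
```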
Researchers at Duke University’s Center for Predictive Medicine are developing and testing methods to quantify the validity and reliability of certain computable phenotype definitions. (See “Practical Development and Implementation of EHR Phenotypes”; NIH Collaboratory Grand Rounds; November 15, 2013.)
Identifying a gold standard, and ultimately the "source of truth," is a critical complicating factor in assessing data quality in EHRs. For conditions in which laboratory values are diagnostic, a lab value can serve as the gold standard, although in many cases the clinical context is critical. For behavioral or mental health conditions, the gold standard or best source of data to approximate "truth" is often the patient or an observation by an expert clinician. For many diseases with complex etiologies, subjective diagnoses, or broad ranges of clinical presentations, the best source of data (or "truth") is unclear, and a variety of data sources will likely be needed to determine a patient's true state of disease or to identify the condition. Because of these challenges, recent efforts have explored "silver standard" definitions that produce more cost-effective validation sets without extensive record review (Swerdel, Hripcsak, and Ryan 2019; Wagholikar et al 2020).
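The cited methods each have their own specific procedures; the sketch below shows only the general idea in a simplified form: train a probabilistic classifier on cheap, noisy labels and retain only high-confidence predictions as a "silver standard." All data, features, and thresholds here are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical feature matrix (e.g., counts of relevant codes and labs)
# and noisy labels from a cheap heuristic (e.g., >=1 diagnosis code).
X = rng.normal(size=(1000, 5))
noisy_labels = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Train a probabilistic classifier on the noisy labels, then keep only
# patients scored with high confidence as silver-standard cases/non-cases.
model = LogisticRegression().fit(X, noisy_labels)
probs = model.predict_proba(X)[:, 1]
silver_cases = np.where(probs >= 0.9)[0]     # confident cases
silver_noncases = np.where(probs <= 0.1)[0]  # confident non-cases
print(len(silver_cases), "cases,", len(silver_noncases), "non-cases")
```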
How can the reliability and reproducibility of a phenotype definition be determined?
"Reliability" is defined as the extent to which an experiment, test, or measuring procedure—or phenotype definition—yields the same results in repeated trials. Reliability is an attribute of any computer-related component (such as software, hardware, or a network) that consistently performs according to its specifications. One way to assess reliability is to implement the phenotype definition algorithm multiple times and observe whether the results on the same patients are the same over repeated implementations.
In contrast, "reproducibility" is the consistency of results or implementation of the algorithm multiple times under similar conditions. To determine reliability, the analyst repeatedly implements the algorithm on the same set of patients and checks whether the phenotype results for the same patients match. For reproducibility, the algorithm can be implemented on either different or the same patient populations by different "coders."
Ultimately, what is required is an unequivocal algorithm that is implemented without room for confusion. For most diseases (especially those with a subjective diagnosis or broad range of clinical presentations), a variety of data sources must be included in the phenotype definition. The more complex the phenotype definition, the more difficult it can be to reproduce and the more likely errors will influence the reliability of the algorithm (Richesson et al 2013).
Several well-known issues can affect reliability, including changes in coding terminology over time and variations in coding practices at the provider, healthcare system, and regional levels. An active area of research involves studying data quality and testing various phenotype definitions in different settings or time periods to represent variations in data quality.
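For example, a cohort that spans the United States' October 1, 2015, transition from ICD-9-CM to ICD-10-CM must query different code systems depending on the encounter date. A minimal, hypothetical sketch (the codes are illustrative, not a complete value set):

```python
from datetime import date

# Hypothetical date-aware value set spanning the US transition from
# ICD-9-CM to ICD-10-CM on October 1, 2015.
ICD10_START = date(2015, 10, 1)
DIABETES_CODES = {"ICD-9-CM": {"250.00"}, "ICD-10-CM": {"E11.9"}}

def code_matches(code: str, encounter_date: date) -> bool:
    """Check a coded diagnosis against the vocabulary in use on that date."""
    system = "ICD-10-CM" if encounter_date >= ICD10_START else "ICD-9-CM"
    return code in DIABETES_CODES[system]

print(code_matches("250.00", date(2014, 5, 1)))  # True (ICD-9 era)
print(code_matches("250.00", date(2016, 5, 1)))  # False (ICD-10 era)
```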
How can the reproducibility of a phenotype definition be optimized?
Careful attention to 2 features of phenotype definition development can enhance the likelihood that a phenotype definition will be applied consistently: clearly articulated specifications for the definition, and guidance for implementers. Development of meaningful specifications and documentation is complicated by variation in healthcare information systems and lack of data standards for EHR data.
Ideally, a phenotype definition should be reproducible across institutions, but many factors can affect reproducibility, including regional differences in patient populations, differences in EHR systems, variations in the work flows that generate the data, and variations in coding practices. The process of implementing a phenotype definition at multiple institutions can result in a more robust definition that accounts or adjusts for localization of the data.
What are potential limitations of EHR data and computable phenotypes?
Data contained in EHRs and ancillary data sources are generated through the provision of clinical care. The data are not optimized for secondary uses, and using the data for research purposes has multiple limitations (Bayley et al 2013).
Missing Data
Because EHR data are derived from patient encounters with providers or healthcare systems, data are recorded only during healthcare episodes. This can introduce bias, because healthier individuals are missing from the dataset. "Missingness" is a common problem and is often nonrandom, a challenge known as "informative censoring" (National Research Council 2010; Shih 2002). Patients are also lost to follow-up if they move out of the area or obtain care from a provider in a different healthcare system. Therefore, in pragmatic clinical trials, it is important to distinguish between "not present" (assessed, and the finding was absent) and "did not assess" (no information was recorded).
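A minimal sketch of making that distinction explicit in a data model, rather than collapsing both states into a blank field (Python; the structure and field names are hypothetical):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    ASSESSED = "assessed"          # measurement was performed
    NOT_ASSESSED = "not_assessed"  # no encounter, or test never ordered

@dataclass
class Observation:
    """Distinguishes 'absent because never assessed' from
    'assessed, with a recorded (possibly normal) result.'"""
    status: Status
    value: Optional[float] = None

a1c_visit = Observation(Status.ASSESSED, value=5.4)  # assessed, normal
a1c_gap = Observation(Status.NOT_ASSESSED)           # not in care; unknown
print(a1c_visit, a1c_gap, sep="\n")
```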
Inaccurate or Uninterpretable Data
Errors are common in data from EHRs and ancillary data sources, because most data are entered by busy healthcare providers during a patient visit or from recall after the visit. Phenotype definitions based on coding that is influenced by billing are susceptible to systematic biases. In addition, data may be uninterpretable if, for example, units of measurement are missing or if analyzable information cannot be gleaned from qualitative assessments.
Complex and Inconsistent Data
Clinical definitions, coding rules, and data collection systems vary over time. Data collection practices can also vary among providers at different locations. Finally, much information is still captured as unstructured data and stored in narrative notes. Though extracting unstructured data poses many challenges, these data are increasingly being used to support various types of clinical decision making and research with an evolving set of tools (Nadkarni, Ohno-Machado, and Chapman 2011).
Resources

Practical Development and Implementation of EHR Phenotypes
NIH Collaboratory Grand Rounds; November 15, 2013
A User’s Guide to Computable Phenotypes
Master’s thesis providing a practical framework to help physicians, clinical researchers, and informaticians evaluate published phenotype algorithms for reuse for various purposes. The framework is divided into 3 phases aligned with expected user roles: overall assessment, clinical validation, and technical review.
REFERENCES
Bayley KB, Belnap T, Savitz L, et al. 2013. Challenges in using electronic health record data for CER: experience of 4 learning organizations and solutions applied. Med Care. 51:S80-86. doi:10.1097/MLR.0b013e31829b1d48. PMID: 23774512.
Nadkarni PM, Ohno-Machado L, Chapman WW. 2011. Natural language processing: an introduction. J Am Med Inform Assoc. 18:544-551. doi:10.1136/amiajnl-2011-000464. PMID: 21846786.
National Research Council. 2010. The Prevention and Treatment of Missing Data in Clinical Trials. Washington, DC: National Academies Press.
Newton KM, Peissig PL, Kho AN, et al. 2013. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc. 20:e147-154. doi:10.1136/amiajnl-2012-000896. PMID: 23531748.
Peissig PL, Rasmussen LV, Berg RL, et al. 2012. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc. 19:225-234. doi:10.1136/amiajnl-2011-000456. PMID: 22319176.
Richesson RL, Rusincovitch SA, Wixted D, et al. 2013. A comparison of phenotype definitions for diabetes mellitus. J Am Med Inform Assoc. 20:e319-e326. doi:10.1136/amiajnl-2013-001952. PMID: 24026307.
Rosenman M, He J, Martin J, et al. 2014. Database queries for hospitalizations for acute congestive heart failure: flexible methods and validation based on set theory. J Am Med Inform Assoc. 21:345-352. doi:10.1136/amiajnl-2013-001942. PMID: 24113802.
Shih WJ. 2002. Problems in dealing with missing data and informative censoring in clinical trials. Curr Control Trials Cardiovasc Med. 3:4. doi:10.1186/1468-6708-3-4.
Swerdel JN, Hripcsak G, Ryan PB. 2019. PheValuator: development and evaluation of a phenotype algorithm evaluator. J Biomed Inform. 97:103258. doi:10.1016/j.jbi.2019.103258. PMID: 31369862.
Wagholikar KB, Estiri H, Murphy M, Murphy SN. 2020. Polar labeling: silver standard algorithm for training disease classifiers. Bioinformatics. 36(10):3200-3206. doi:10.1093/bioinformatics/btaa088. PMID: 32049335.
ACKNOWLEDGMENTS
Key contributors to previous versions of this chapter included Michelle Smerek, Shelley Rusincovitch, Meredith Nahm Zozus, Paramita Saha Chaudhuri, Ed Hammond, Robert Califf, Greg Simon, Beverly Green, Michael Kahn, and Reesa Laws.
The Electronic Health Records Core Working Group (formerly the Phenotypes, Data Standards, and Data Quality Core Working Group) of the NIH Collaboratory influenced much of this content through monthly meetings. These additional contributors included Monique Anderson, Nick Anderson, Alan Bauck, Denise Cifelli, Lesley Curtis, John Dickerson, Chris Helker, Michael Kahn, Cindy Kluchar, Melissa Leventhal, Rosemary Madigan, Renee Pridgen, Jon Puro, Jennifer Robinson, Jerry Sheehan, and Kari Stephens. We are also grateful to the Duke Center for Predictive Medicine for development and clarification of the scientific validity and evaluation of phenotype definitions.