Electronic Health Records–Based Phenotyping
Section 2
Definitions
What is a phenotype?
A phenotype is the observable physical or biochemical expression of a specific trait in an organism, such as a disease, stature, or blood type, based on genetic information and environmental influences. The phenotype of an organism includes physical appearance, biochemical processes, and behaviors. In short, an organism's phenotype is the appearance the organism presents to observers.
In contemporary biomedical research contexts, a phenotype is understood as a measurable biological marker (such as a physiological, biochemical, or anatomical feature), a behavioral marker (such as a psychometric pattern), or a cognitive marker that is found more often in individuals with a disease or condition than in the general population.
What is a computable phenotype?
A computable phenotype is a clinical condition, characteristic, or set of clinical features that can be determined solely from data in electronic health records (EHRs) and ancillary data sources and does not require chart review or interpretation by a clinician. We use the term "EHR" broadly to refer to data generated through healthcare delivery and reimbursement practices. These functions may be covered in multiple systems and can contain both practice management data and data that are strictly limited to the clinical domain. We use the term "ancillary data sources" to refer to disease registries, claims data, supplemental data collection, and other sources that are related to healthcare delivery but may not be directly integrated into the EHR system. Computable phenotypes are also sometimes referred to as "EHR condition definitions," "EHR-based phenotype definitions," or simply "phenotypes."
What is a computable phenotype definition?
A computable phenotype definition is a specification for identifying patients or populations with a given characteristic or condition of interest using data that are routinely collected in EHRs or ancillary data sources. A computable phenotype definition consists of data elements and logical expressions (such as AND, OR, and NOT) that a computer can interpret and execute without human intervention. Computable phenotype definitions often rely on value sets, which are lists of codes from standardized medical vocabularies that indicate a condition, drug exposure, or other clinical phenomenon of interest. Data elements, and how they differ from phenotypes, are described under "How are data elements and phenotypes different?" later in this section.
Why are computable phenotype definitions important?
Computable phenotype definitions can support reproducible queries of EHR data from multiple systems (such as clinical and ancillary health information systems, research networks, and aggregated databases). These queries can then be replicated at multiple sites in a consistent fashion, enabling efficiencies and ensuring that populations identified from different healthcare organizations have similar features, or were at least identified in the same way.
The ability to identify people with particular conditions across healthcare organizations by using common definitions has value for clinical quality measurement, health improvement, and research. Standard phenotype definitions can enable direct identification of cohorts based on population characteristics, risk factors, and complications, allowing decision makers to identify and target patients for screening tests and interventions that have been demonstrated to be effective in similar populations. This identification process can be integrated with the EHR for real-time clinical decision support. (See the Clinical Decision Support chapter of the Living Textbook.)
Computable phenotype definitions are essential to the conduct of pragmatic clinical trials and comparative effectiveness research. These studies, which may involve multiple hospitals or healthcare systems, rely on standard phenotype definitions for EHR-based inclusion and exclusion of participants and consistent data analysis and reporting across data sources. Computable phenotype definitions have applications in interventional, observational, prospective, and retrospective studies.
How do computable phenotypes relate to the true presence of a condition?
Although computable phenotypes can be used to identify patients for whom the data are suggestive of a particular condition, the presence of a computable phenotype does not guarantee that the patients have the condition. As shown in the figure below, EHR-based computable phenotypes make use of the data constructs and coding systems that are available to providers when they record patient data in the EHR system. These EHR data may reflect a patient's state or disease status, but the data are generated from perception, interpretation, and recording by the clinicians who observe the patient. Thus, EHR data represent a limited view of a patient's condition and are by definition incomplete—and often biased.
EHR data are available only for patients who are motivated (often by a disease or illness) and able to visit a clinician. Other attributes related to the clinician and the healthcare organization influence the nature of the data in the EHR, including the experience of the clinician, the availability and use of diagnostic equipment and therapeutic procedures, interactions with clinical specialists, insurance coverage and limitations, and the coding and reimbursement practices of the healthcare organization (Hsia et al 1988). The quantitative impact of each of these factors on the performance of computable phenotypes is largely unknown. Measurement and estimation of these factors, and the development of strategies to mitigate their impact on data quality, are active areas of methodological work in health services research and informatics.
How are data elements and phenotypes different?
Every data element has a set of possible values, called a "value set." A value set might include a limited set of categorical values, a range of numeric values, or a more extensive list of codes from standardized coding systems, such as the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) or RxNorm. For example, the data element for "sex" includes a single variable with that name, along with a set of discrete values, and perhaps with a definition and associated descriptive metadata. To query the sex of a person, a single data element is assessed. Table 1 shows examples of data elements for sex, birth date, and race and their associated value sets.
| Data Element | Value Set |
| sex | male, female, unknown/not reported |
| birth date | a date no later than the present date and no earlier than 150 years before it |
| race | American Indian or Alaska Native, Asian, black or African American, Native Hawaiian or other Pacific Islander, white, unknown/not reported |
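As a minimal sketch of how the data elements in Table 1 could be checked against their value sets, the Python below tests a recorded value for membership. The function and constant names are hypothetical, and the birth date rule is read here as a plausibility window covering the 150 years up to the present date; neither comes from a particular EHR system or standard.

```python
from datetime import date

# Value sets transcribed from Table 1. The names are illustrative only.
SEX_VALUES = {"male", "female", "unknown/not reported"}
RACE_VALUES = {
    "American Indian or Alaska Native",
    "Asian",
    "black or African American",
    "Native Hawaiian or other Pacific Islander",
    "white",
    "unknown/not reported",
}
MAX_AGE_YEARS = 150  # assumed plausibility window for birth dates


def is_valid_sex(value: str) -> bool:
    """True if the value belongs to the enumerated value set for sex."""
    return value in SEX_VALUES


def is_valid_race(value: str) -> bool:
    """True if the value belongs to the enumerated value set for race."""
    return value in RACE_VALUES


def is_valid_birth_date(value: date, today: date) -> bool:
    """True if the date is not in the future and is at most 150 years before today."""
    try:
        earliest = today.replace(year=today.year - MAX_AGE_YEARS)
    except ValueError:
        # today is February 29 and the target year is not a leap year
        earliest = today.replace(year=today.year - MAX_AGE_YEARS, day=28)
    return earliest <= value <= today
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible when the same validation is rerun later or at another site.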
The value set for a given data element may reference an entire coding system or a smaller enumerated list, as shown in Table 2.
| Data Element | Value Set |
| Final diagnosis | ICD-10-CM codes (all) |
| Final diagnosis of diabetes | E08.9, E09.9, E13.9, E08.65 (from ICD-10-CM); 249.xx, 250.xx, 357.2, 362.01-06, 366.41 (from ICD-9-CM) |
| Medications ordered | local medication list; clinical drugs coded in RxNorm |
| Diabetes-related medications ordered | acarbose, Precose, acetohexamide, Dymelor, etc. |
| Abbreviations: ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; ICD-10-CM, International Classification of Diseases, Tenth Revision, Clinical Modification. | |
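Enumerated value sets such as the ICD-9-CM list in Table 2 often use shorthand: `xx` for any two digits and `01-06` for a range of final digits. The sketch below shows one way such entries might be expanded into patterns and matched against recorded codes. The helper names are hypothetical, and the reading of the shorthand (each `x` is exactly one digit, ranges are inclusive) is an assumption rather than a documented convention.

```python
import re

# ICD-9-CM entries as written in Table 2.
ICD9_DIABETES_ENTRIES = ["249.xx", "250.xx", "357.2", "362.01-06", "366.41"]


def compile_entry(entry: str) -> re.Pattern:
    """Turn one value-set entry into a regular expression for exact matching."""
    entry = entry.strip()
    # Range shorthand such as '362.01-06' (assumed inclusive).
    range_match = re.fullmatch(r"(\d{3}\.)(\d{2})-(\d{2})", entry)
    if range_match:
        prefix = range_match.group(1)
        low, high = int(range_match.group(2)), int(range_match.group(3))
        alternatives = "|".join(f"{n:02d}" for n in range(low, high + 1))
        return re.compile(re.escape(prefix) + f"(?:{alternatives})")
    # Otherwise, 'x' stands for any single digit and everything else is literal.
    return re.compile(
        "".join(r"\d" if ch == "x" else re.escape(ch) for ch in entry)
    )


PATTERNS = [compile_entry(e) for e in ICD9_DIABETES_ENTRIES]


def in_value_set(code: str) -> bool:
    """True if the recorded code matches any entry in the value set."""
    return any(p.fullmatch(code) for p in PATTERNS)
```

For example, `in_value_set("250.00")` and `in_value_set("362.04")` are true, while `in_value_set("362.07")` is false because it falls outside the listed range.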
Phenotype definitions are represented as logical query criteria using 1 or more data elements with a defined value set. For example, to infer that a patient has a clinical characteristic, such as type 2 diabetes mellitus, evidence can come from a single data element or many data elements. Table 3 shows possible data elements and their associated value sets for identifying the presence of type 2 diabetes mellitus.
| Data Element | Value Set |
| ICD-10-CM codes for type 2 diabetes mellitus | E11.xx |
| Diabetes-related medications | acarbose, glipizide, metformin, etc. |
| Hemoglobin A1c values suggestive of diabetes | ≥ 6.5% |
| Abbreviation: ICD-10-CM, International Classification of Diseases, Tenth Revision, Clinical Modification. | |
Any single data element in Table 3, all of the data elements collectively, or any combination of the data elements could be used to create a phenotype definition for type 2 diabetes mellitus. Such a definition could specify that any or all of the data elements must contain at least 1 appropriate code. The definition might also specify that the patient must be older than a certain age at the first diagnosis of diabetes, or that the patient must have received diabetes medication but have no history of type 1 diabetes.
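To make the combination of data elements concrete, the sketch below expresses one possible rule over the Table 3 elements using AND, OR, and NOT. It is an illustration only, not a validated definition: the record structure and function names are hypothetical, the medication list is the partial list from Table 3, `E11.xx` is treated as any code beginning with `E11`, and the use of ICD-10-CM category `E10` to represent a history of type 1 diabetes is an assumption added for the exclusion.

```python
from dataclasses import dataclass, field
from typing import List

DIABETES_MEDICATIONS = {"acarbose", "glipizide", "metformin"}  # partial, from Table 3
A1C_THRESHOLD_PERCENT = 6.5  # from Table 3


@dataclass
class PatientRecord:
    """Hypothetical minimal record holding the three data elements."""
    diagnosis_codes: List[str] = field(default_factory=list)  # ICD-10-CM
    medications: List[str] = field(default_factory=list)
    a1c_values: List[float] = field(default_factory=list)     # percent


def has_code_with_prefix(record: PatientRecord, prefix: str) -> bool:
    return any(code.upper().startswith(prefix) for code in record.diagnosis_codes)


def has_diabetes_medication(record: PatientRecord) -> bool:
    return any(m.strip().lower() in DIABETES_MEDICATIONS for m in record.medications)


def has_elevated_a1c(record: PatientRecord) -> bool:
    return any(v >= A1C_THRESHOLD_PERCENT for v in record.a1c_values)


def meets_t2dm_definition(record: PatientRecord) -> bool:
    """(T2DM code OR (medication AND elevated A1c)) AND NOT type 1 code."""
    has_t2dm_code = has_code_with_prefix(record, "E11")
    has_t1dm_code = has_code_with_prefix(record, "E10")  # assumed exclusion code
    supporting_evidence = has_diabetes_medication(record) and has_elevated_a1c(record)
    return (has_t2dm_code or supporting_evidence) and not has_t1dm_code
```

Changing the operators changes the population identified. Requiring all three elements (AND throughout) would select fewer patients than accepting any one of them (OR throughout), which is why the choice depends on the intended use.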
What data sources are used?
The number of data fields that are truly standardized and routinely collected across EHR systems is small. Therefore, most phenotype definitions use a combination of International Classification of Diseases, Tenth Revision (ICD-10) codes, medication names, and/or laboratory values. ICD-10 Clinical Modification (ICD-10-CM) diagnosis codes (or ICD-9-CM diagnosis codes before October 1, 2015) can be found in technical billing, professional billing, and/or problem lists. EHRs may use or link ICD-10-CM diagnosis codes or Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT) codes for problem lists and other sections of EHRs. EHRs also contain a significant volume of unstructured narrative data. Use of natural language processing techniques in the biomedical domain is evolving and has allowed researchers to use computable phenotypes to leverage clinically rich narrative data within EHRs (Ludvigsson et al 2013). There are many opportunities to validate and improve these algorithms (Vanderbilt University 2017).
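Because diagnosis codes recorded before October 1, 2015, are expected to be ICD-9-CM and later codes ICD-10-CM, a phenotype definition that spans the transition generally needs a value set for each system. The sketch below shows one way to route records by date. The function names and the `(service_date, code)` record shape are hypothetical, and the choice of which date field to compare (for example, date of service or date of discharge) is left as an assumption.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

ICD10_CM_START = date(2015, 10, 1)  # transition date stated in the text


def expected_coding_system(service_date: date) -> str:
    """Return the diagnosis coding system expected for a given date."""
    return "ICD-10-CM" if service_date >= ICD10_CM_START else "ICD-9-CM"


def group_by_coding_system(
    records: Iterable[Tuple[date, str]]
) -> Dict[str, List[str]]:
    """Group (service_date, code) pairs by the coding system expected for the date."""
    grouped: Dict[str, List[str]] = {"ICD-9-CM": [], "ICD-10-CM": []}
    for service_date, code in records:
        grouped[expected_coding_system(service_date)].append(code)
    return grouped
```

Each group can then be compared with the value set written for that coding system, rather than testing every code against both lists.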
The Office of the National Coordinator for Health Information Technology (ONC) in the US Department of Health and Human Services maintains standards and implementation specifications for EHR systems to ensure that certified systems support "meaningful use" criteria (ONC 2012). Accordingly, data elements required by the ONC can be collected in all certified EHR systems in the United States in a manner that is consistent with ONC specifications.
Because EHR data may be available from different types of encounters, including inpatient, outpatient, and emergency department visits, phenotype definitions should take into consideration which sources are relevant to answering the question at hand. In some cases, multiple sources will be needed for complete data capture. For example, medication data can be obtained from reconciliation of various data sources, such as records from inpatient administration, provider ordering, or outpatient dispensing. It is also important to consider the applicability of a captured data element during certain encounters. For example, a lab value for a patient may be abnormal during an emergency department visit but not reflect the typical range of lab values for that patient.
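The two considerations above, combining medication sources and accounting for encounter context, can be sketched as follows. The source labels, encounter labels, and function names are hypothetical, and real reconciliation would also need to map brand and generic names to a common vocabulary such as RxNorm, which this example does not attempt.

```python
from typing import Dict, Iterable, List, Set, Tuple


def medication_exposure(sources: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each normalized drug name to the set of sources that mention it."""
    exposure: Dict[str, Set[str]] = {}
    for source_name, drugs in sources.items():
        for drug in drugs:
            exposure.setdefault(drug.strip().lower(), set()).add(source_name)
    return exposure


def values_outside_encounters(
    labs: List[Tuple[str, float]], excluded: Set[str]
) -> List[float]:
    """Keep lab values whose encounter type is not in the excluded set."""
    return [value for encounter_type, value in labs if encounter_type not in excluded]


# Usage with made-up data: three medication sources and three lab results.
exposure = medication_exposure({
    "inpatient_administration": ["Metformin"],
    "provider_order": ["metformin", "glipizide"],
    "outpatient_dispensing": [],
})
labs = [("emergency", 11.2), ("outpatient", 6.8), ("inpatient", 7.1)]
typical_values = values_outside_encounters(labs, {"emergency"})
```

Recording which sources support each drug, instead of merging them into a single flag, lets a definition require agreement between sources (for example, both an order and a dispensing record) when the research question calls for it.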
What are the benefits of "standard" phenotypes or condition definitions and phenotype definition libraries?
Explicit documentation of computable phenotype definitions can support their use in multiple organizations and settings for consistent identification of patient populations for various purposes. Standardized or explicitly defined phenotype definitions can also streamline the development of registries and applications using healthcare data and can enable development of consistent inclusion criteria to support regional surveillance in the identification of infectious diseases and rare disease complications.
Differences across phenotype definitions can affect their application in healthcare organizations and subsequent interpretation of data. It is unlikely that a single phenotype definition—for example, in type 2 diabetes mellitus or heart failure—will be sufficient for all intended uses. Rather, the ideal phenotype definition depends on the purpose and analytical requirements.
Research networks and collaborations are increasingly seeking to share phenotype definitions for a given characteristic or condition and intended use. For example, Observational Health Data Sciences and Informatics (OHDSI) is "researching and developing strategies for establishing a standardized, evidence-based approach to constructing algorithms to define disease phenotypes that can be used in observational analytics" (OHDSI 2020). The Agency for Healthcare Research and Quality has developed Clinical Classifications Software and related tools as part of the Healthcare Cost and Utilization Project for use with ICD-9-CM, ICD-10-CM, and other classification systems. Finally, phenotype libraries such as the Phenotype KnowledgeBase (PheKB) have emerged to assist researchers in using standard phenotype definitions appropriate for a given characteristic or condition and intended use.
See the Finding Existing Phenotype Definitions section in this chapter of the Living Textbook for more information about standardized definitions from authoritative sources.
Resources
OHDSI Phenotype Library
Landing page for cohort and phenotype definitions and discussions for the Observational Health Data Sciences and Informatics (OHDSI) community. (Under development.)
Phenotype Phebruary Daily Threads & What We Learned
Condition-specific phenotype definitions developed during a “28 phenotypes for 28 days” initiative held within the OHDSI forums during February 2022.
HDR UK Phenotype Library
A comprehensive, open-access resource providing information, tools, and phenotyping algorithms for electronic health records in the United Kingdom.
REFERENCES
Hripcsak G, Albers DJ. 2013. Next-generation phenotyping of electronic health records. J Am Med Inform Assoc. 20:117-121. doi:10.1136/amiajnl-2012-001145. PMID: 22955496.
Hsia DC, Krushat WM, Fagan AB, et al. 1988. Accuracy of diagnostic coding for Medicare patients under the prospective-payment system. N Engl J Med. 318:352-355. doi:10.1056/NEJM198802113180604. PMID: 3123929.
Ludvigsson JF, Pathak J, Murphy S, et al. 2013. Use of computerized algorithm to identify individuals in need of testing for celiac disease. J Am Med Inform Assoc. 20:e306-310. doi:10.1136/amiajnl-2013-001924. PMID: 23956016.
Observational Health Data Sciences and Informatics (OHDSI). 2020. Phenotype library. https://www.ohdsi.org/resources/libraries/phenotype-library/. Accessed June 23, 2022.
Office of the National Coordinator for Health Information Technology (ONC), Department of Health and Human Services. 2012. Health Information Technology: Standards, Implementation Specifications, and Certification Criteria for Electronic Health Record Technology, 2014 Edition; Revisions to the Permanent Certification Program for Health Information Technology. Final Rule. Fed Regist. 77(171):54163-54292. PMID: 22946139.
Vanderbilt University. Collaboration phenotypes. PheKB. https://phekb.org/phenotypes/collaboration. Accessed June 23, 2022.
ACKNOWLEDGMENTS
Key contributors to previous versions of this chapter included Michelle Smerek, Shelley Rusincovitch, Meredith Nahm Zozus, Paramita Saha Chaudhuri, Ed Hammond, Robert Califf, Greg Simon, Beverly Green, Michael Kahn, and Reesa Laws.
The Electronic Health Records Core Working Group (formerly the Phenotypes, Data Standards, and Data Quality Core Working Group) of the NIH Collaboratory influenced much of this content through monthly meetings. These additional contributors included Monique Anderson, Nick Anderson, Alan Bauck, Denise Cifelli, Lesley Curtis, John Dickerson, Chris Helker, Michael Kahn, Cindy Kluchar, Melissa Leventhal, Rosemary Madigan, Renee Pridgen, Jon Puro, Jennifer Robinson, Jerry Sheehan, and Kari Stephens. We are also grateful to the Duke Center for Predictive Medicine for development and clarification of the scientific validity and evaluation of phenotype definitions.
