Electronic Health Records–Based Phenotyping
Section 6
Using Phenotypes in PCTs—How Do I Get Started?
Before beginning development of any phenotype definition, researchers should search for existing phenotype definitions and consider how they performed in validation testing. They should then assess whether each candidate phenotype definition is feasible in their particular settings (for example, by determining whether the data domains available locally match those required by the authoritative source phenotype definition). If no suitable phenotype definition can be found from authoritative sources, a new definition must be developed. In either case, once a candidate phenotype definition is identified, it must be validated against a gold standard in clinical populations, as shown in the figure below.
Figure. Phenotype Evaluation Process
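The validation step described above can be illustrated with a small sketch. The chapter does not prescribe specific metrics here, so the choice of sensitivity, specificity, and predictive values is an assumption, as are the names `validate_phenotype` and `ValidationResult`. The gold standard is represented as a list of true/false labels (for example, from chart review) aligned with the phenotype output for the same patients.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def validate_phenotype(predicted, gold):
    """Compare phenotype output with gold-standard labels (parallel lists of bool)."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    # Build the 2x2 confusion table.
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(not p and g for p, g in zip(predicted, gold))
    tn = sum(not p and not g for p, g in zip(predicted, gold))

    def ratio(num, den):
        # Return NaN rather than fail when a denominator is empty.
        return num / den if den else float("nan")

    return ValidationResult(
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
        npv=ratio(tn, tn + fn),
    )


# Hypothetical example: five patients, phenotype output vs. chart review.
predicted = [True, True, False, False, True]
gold = [True, False, False, True, True]
result = validate_phenotype(predicted, gold)
print(result)
```

Predictive values depend on how common the condition is in the validation sample, so results from one clinical population may not carry over to another.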

If a new phenotype definition is needed, the researchers must first operationalize a disease concept against electronic health record (EHR) data. That is, they must explicitly define how the concept should be measured, observed, or manipulated within a particular study and its available data sources. A theoretical or conceptual variable of interest (such as a disease) must be translated into a set of specific diagnoses or procedures, paired with implementation specifications that define the variable's meaning in that study. In the context of healthcare data, this means explicitly defining the diagnoses, treatments, and clinical and patient characteristics that are indicative or suggestive of the condition. The researchers must specify the clinical condition they are looking for and how the condition would be represented in various EHRs.
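One way to make an operationalized definition explicit is to record it as a small structured object. The following is a minimal sketch, not a standard format: the class `PhenotypeDefinition`, its fields, and the domain names are all hypothetical, and the ICD-10-CM category shown is an illustrative placeholder that would need local review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenotypeDefinition:
    """Hypothetical record of how a disease concept is operationalized."""
    name: str
    diagnosis_code_prefixes: tuple = ()
    procedure_codes: tuple = ()
    required_domains: frozenset = frozenset()
    notes: str = ""

    def missing_domains(self, available_domains):
        """Return the required data domains that a site cannot supply."""
        return sorted(self.required_domains - set(available_domains))


obesity = PhenotypeDefinition(
    name="obesity (illustrative)",
    diagnosis_code_prefixes=("E66",),  # assumed placeholder; confirm codes locally
    required_domains=frozenset({"diagnoses", "vitals"}),
    notes="Diagnosis code or measured BMI; exclusions handled separately.",
)

# A site with diagnoses and medication orders, but no vitals, lacks one domain.
missing = obesity.missing_domains({"diagnoses", "medication_orders"})
print(missing)
```

Writing the definition down in this form makes it easier to see which data each site must supply before the definition is applied there.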
For example, to identify obesity, the researchers would first identify diagnosis and procedure codes for the condition and investigate whether the codes are reliable and applied consistently. If the researchers cannot reasonably assume that all patients with obesity would be coded with a given diagnosis or procedure code, they must use other data sources.
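The consistency check described above can be sketched as follows. This is a hedged illustration only: it assumes ICD-10-CM coding and uses the category prefix E66 (overweight and obesity) as a placeholder, which is broader than obesity alone, so a real definition would enumerate specific codes after clinical review. The function names, record layout, and the `reference_obese` flag (standing in for an independent reference such as chart review) are hypothetical.

```python
def has_obesity_code(diagnosis_codes, prefixes=("E66",)):
    """True if any diagnosis code starts with one of the given prefixes."""
    normalized = [code.strip().upper() for code in diagnosis_codes]
    return any(code.startswith(prefixes) for code in normalized)


def code_capture_rate(patients):
    """Share of reference-positive patients who also carry an obesity code."""
    positives = [p for p in patients if p["reference_obese"]]
    if not positives:
        return None  # nothing to evaluate
    coded = sum(has_obesity_code(p["diagnosis_codes"]) for p in positives)
    return coded / len(positives)


# Hypothetical records; "reference_obese" comes from an independent source.
patients = [
    {"id": "A", "diagnosis_codes": ["E66.9", "I10"], "reference_obese": True},
    {"id": "B", "diagnosis_codes": ["I10"], "reference_obese": True},
    {"id": "C", "diagnosis_codes": ["e66.01"], "reference_obese": True},
    {"id": "D", "diagnosis_codes": [], "reference_obese": False},
]

rate = code_capture_rate(patients)
print(f"Code capture rate: {rate:.2f}")
```

A low capture rate would suggest that diagnosis codes alone miss many patients with the condition and that other data sources are needed.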
The next step is to review the available data sources (such as EHR data, claims data, registry data, and patient-reported outcomes data). If a phenotype definition is to be applied in multiple organizations, the researchers must consider the data sources that are available in other organizations. Possible data sources for obesity might include patient height and weight, the ordering or dispensing of medications associated with weight management, or patient-reported data on weight or a previous diagnosis of obesity. It is also important to consider other factors that may affect these measurements (such as the effect of pregnancy on weight, or the effect of amputation on height). Within each data type, the researchers should identify which data are available to them (for example, some EHR data include medication orders but not administration data, or billing diagnoses rather than problem lists). Knowing the types of data available can support an early feasibility assessment of existing phenotype definitions.
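As a rough illustration of using height and weight as a data source while accounting for the factors mentioned above, the sketch below computes body mass index (BMI) and declines to classify when the measurement is missing or hard to interpret. The cutoff of 30 kg/m² is the commonly used adult threshold and is an assumption here, not something the chapter specifies. The `Measurement` fields and the `classify_obesity` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    height_cm: Optional[float]
    weight_kg: Optional[float]
    pregnant: bool = False
    amputation: bool = False


def bmi(height_cm, weight_kg):
    """Body mass index in kg/m^2."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def classify_obesity(m, threshold=30.0):
    """Return True or False, or None when the measurement cannot be interpreted."""
    if m.height_cm is None or m.weight_kg is None:
        return None  # data not available at this site or visit
    if m.height_cm <= 0 or m.weight_kg <= 0:
        return None  # implausible value; likely a data quality problem
    if m.pregnant or m.amputation:
        return None  # weight or height is affected; defer to other sources
    return bmi(m.height_cm, m.weight_kg) >= threshold


print(classify_obesity(Measurement(170, 95)))                  # BMI about 32.9
print(classify_obesity(Measurement(170, 70)))                  # BMI about 24.2
print(classify_obesity(Measurement(160, 90, pregnant=True)))   # not classified
```

Returning `None` instead of a forced yes or no keeps uninterpretable cases visible, so they can be resolved with medication, patient-reported, or other data.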
Resources
A User’s Guide to Computable Phenotypes
Master’s thesis providing a practical framework to help physicians, clinical researchers, and informaticians evaluate published phenotype algorithms for reuse for various purposes. The framework is divided into 3 phases aligned with expected user roles: overall assessment, clinical validation, and technical review.
ACKNOWLEDGMENTS
Key contributors to previous versions of this chapter included Michelle Smerek, Shelley Rusincovitch, Meredith Nahm Zozus, Paramita Saha Chaudhuri, Ed Hammond, Robert Califf, Greg Simon, Beverly Green, Michael Kahn, and Reesa Laws.
The Electronic Health Records Core Working Group (formerly the Phenotypes, Data Standards, and Data Quality Core Working Group) of the NIH Collaboratory influenced much of this content through monthly meetings. These additional contributors included Monique Anderson, Nick Anderson, Alan Bauck, Denise Cifelli, Lesley Curtis, John Dickerson, Chris Helker, Michael Kahn, Cindy Kluchar, Melissa Leventhal, Rosemary Madigan, Renee Pridgen, Jon Puro, Jennifer Robinson, Jerry Sheehan, and Kari Stephens. We are also grateful to the Duke Center for Predictive Medicine for development and clarification of the scientific validity and evaluation of phenotype definitions.