June 7, 2018: NIH Releases First Strategic Plan for Data Science

On June 4, the National Institutes of Health (NIH) released its first Strategic Plan for Data Science. The plan outlines steps the agency will take to modernize research data infrastructure and resources and to maximize the value of data generated by NIH-supported research.

Data science challenges for NIH have evolved and grown rapidly since the launch of the Big Data to Knowledge (BD2K) program in 2014. The most pressing challenges include the growing costs of data management, limited interconnectivity and interoperability among data resources, and a lack of generalizable tools to transform, analyze, and otherwise support the usability of data for researchers, institutions, industry, and the public.

The goals of the NIH Strategic Plan for Data Science are to:

  • support an efficient, effective data infrastructure by optimizing data storage, security, and interoperability;
  • modernize data resources by improving data repositories, supporting storage and sharing of individual data sets, and integrating clinical and observational data;
  • develop and disseminate both generalizable and specialized tools for data management, analytics, and visualization;
  • enhance workforce development for data science by expanding NIH’s internal data science workforce and supporting expansion of the national research workforce, and by engaging a broader community of experts and the general public in developing best practices; and
  • enact policies that promote stewardship and sustainability of data science resources.

As part of the implementation of the strategic plan, the NIH will hire a chief data strategist. For information about the position, see the job announcement.

New Living Textbook Chapter on Acquiring and Using Electronic Health Record Data for Research

Meredith Nahm Zozus and colleagues from the NIH Collaboratory’s Phenotypes, Data Standards, and Data Quality Core have published a new Living Textbook chapter about key considerations for secondary use of electronic health record (EHR) data for clinical research.

In contrast to traditional randomized controlled clinical trials where data are prospectively collected, many pragmatic clinical trials use data that were primarily collected for clinical purposes and are secondarily used for research. The chapter describes the steps a prospective researcher will take to acquire and use EHR data:

  • Gain permission to use the data. When a prospective researcher wishes to use data, a data use agreement (DUA) is usually required that describes the purpose of the research and the proposed use of the data. This section also describes use of de-identified data and limited data sets.
  • Understand fundamental differences in context. Data collected in routine care settings reflect standard procedures at an individual’s healthcare facility, and are not collected in a standard, structured manner.
  • Assess the availability of health record data. Few assumptions can be made about what is available from an organization’s healthcare records; up-front, detailed discussions about data element collection over time at each facility are required.
  • Understand the available data. A secondary data user must understand both the data meaning and the data quality; both can vary greatly across organizations and affect a study’s ability to support research conclusions.
  • Identify populations and outcomes of interest. Because healthcare facilities are obligated to provide only the minimum necessary data to answer a research question, investigators must identify the needed patients and data elements with specificity and sensitivity to answer the research question given the available data.
  • Consider record linkage. Studies using data from multiple records and sources will require matching data to ensure they refer to the correct patient.
  • Manage the data. The investigator is responsible for receiving, managing, and processing data and must demonstrate that the data are reproducible and support research conclusions.
  • Archive and share the data after the study. Data may be archived and shared to ensure reproducibility, to enable auditing for quality assurance and regulatory compliance, or to answer other questions about the research.
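
As an illustration of the "specificity and sensitivity" mentioned in the step on identifying populations, the sketch below compares a hypothetical EHR-based patient selection rule against a reference standard such as manual chart review. It is not taken from the chapter; the function name and the example labels are assumptions made purely for illustration.

```python
def sensitivity_specificity(predicted, reference):
    """Return (sensitivity, specificity) for paired boolean labels.

    predicted: flags produced by a hypothetical EHR selection rule
    reference: flags from a reference standard (e.g., chart review)
    """
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)          # true positives
    tn = sum(1 for p, r in pairs if not p and not r)  # true negatives
    fp = sum(1 for p, r in pairs if p and not r)      # false positives
    fn = sum(1 for p, r in pairs if not p and r)      # false negatives

    # Guard against empty denominators (no reference positives or negatives).
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up labels for six patients (illustration only).
rule_flags = [True, True, False, False, True, False]
chart_review = [True, False, False, False, True, True]

sens, spec = sensitivity_specificity(rule_flags, chart_review)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A rule with low sensitivity misses eligible patients, while low specificity pulls in patients who do not belong in the study population; both matter when only the minimum necessary data can be requested.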

FDA Releases Action Plan to Encourage Greater Patient Diversification in Trials


In August 2014, the Food and Drug Administration (FDA) released an action plan (link opens as a PDF) aimed at encouraging more diverse patient participation in drug and medical device clinical trials. The Action Plan to Enhance the Collection and Availability of Demographic Subgroup Data includes 27 responsive and pragmatic actions, divided into 3 overarching priorities:

  • Data quality: improving the completeness and quality of demographic subgroup data collection, reporting, and analysis
  • Participation: identifying barriers to subgroup enrollment in clinical trials and employing strategies to encourage greater participation
  • Transparency: making demographic subgroup data more available and transparent

The plan follows an August 2013 report to Congress on these concerns and reflects the agency’s commitment to encouraging the inclusion of a diverse patient population (with reference to sex, age, race, and ethnicity) in biomedical research that supports applications for FDA-regulated medical products. Increasing representation is a multifaceted challenge that requires a multifaceted approach and collaboration of federal partners, industry, healthcare providers, patients and patient advocacy groups, academicians, and community groups.

A message from the Commissioner of the FDA contains background and details.


Collaboratory Phenotypes, Data Standards, and Data Quality Core Releases Data Quality Assessment White Paper


The NIH Collaboratory’s Phenotypes, Data Standards, and Data Quality Core has released a new white paper on data quality assessment in the setting of pragmatic research. The white paper, titled Assessing Data Quality for Healthcare Systems Data Used in Clinical Research (V1.0), provides guidance, based on the best available evidence and practice, for assessing data quality in pragmatic clinical trials (PCTs) conducted through the Collaboratory. Topics covered include an overview of data quality issues in clinical research settings, data quality assessment dimensions (completeness, accuracy, and consistency), and a series of recommendations for assessing data quality. Also included as appendices are a set of data quality definitions and review criteria, as well as a data quality assessment plan inventory.

The full text of the document can be accessed through the “Tools for Research” tab on the Living Textbook or can be downloaded directly here (PDF).
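
To make two of the assessment dimensions named above more concrete, here is a minimal sketch of a completeness check and a simple consistency check over a handful of records. It is not drawn from the white paper; the field names, the example records, and the specific rule (a discharge date should not precede the admission date) are all assumptions chosen for illustration. Accuracy is omitted because it generally requires comparison against an external reference source.

```python
def completeness(records, fields):
    """Fraction of records with a non-missing value for each field."""
    total = len(records)
    result = {}
    for field in fields:
        present = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = present / total if total else float("nan")
    return result


def inconsistent_stays(records):
    """IDs of records whose discharge date precedes the admission date.

    Dates are assumed to be ISO 8601 strings (YYYY-MM-DD), which sort
    correctly as plain text. Records missing either date are skipped.
    """
    return [
        r["patient_id"]
        for r in records
        if r.get("admit_date") and r.get("discharge_date")
        and r["discharge_date"] < r["admit_date"]
    ]


# Hypothetical records (illustration only).
records = [
    {"patient_id": "A1", "birth_date": "1950-02-01",
     "admit_date": "2014-03-01", "discharge_date": "2014-03-04"},
    {"patient_id": "A2", "birth_date": None,
     "admit_date": "2014-05-10", "discharge_date": "2014-05-08"},
    {"patient_id": "A3", "birth_date": "1972-11-30",
     "admit_date": "2014-06-01", "discharge_date": ""},
]

scores = completeness(records, ["patient_id", "birth_date", "discharge_date"])
flagged = inconsistent_stays(records)
print(scores)
print(flagged)
```

In practice, checks like these would be run per site and per time period, since the availability and meaning of data elements can differ across organizations.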