August 13, 2018: JAMA Commentary Highlights the Value of Data Enclaves and Distributed Data Networks

In a JAMA Viewpoint published online last week, NIH Collaboratory investigator Dr. Richard Platt and colleague Dr. Tracy Lieu discuss the value of “data enclaves” to facilitate information sharing in support of research, quality improvement, and public health reporting.

Creating data enclaves allows health systems to share useful information from their clinical data without releasing the actual data. Data enclaves can be linked with each other in distributed data networks to create powerful resources for researchers and other analysts. The authors note that efforts to realize this vision must address concerns about protecting patients’ personal information, the costs and work required to make the data usable for analysis, and incentives for health systems to participate.

Dr. Platt is a cochair of the NIH Collaboratory’s Distributed Research Network, which uses a common data model that enables investigators to collaborate with each other in the use of electronic health data while safeguarding protected health information and proprietary data.
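As a rough illustration of the distributed approach described above, the following Python sketch assumes each enclave stores records in a shared common data model and answers a query with aggregate counts only. The record fields, site data, and the function names `run_local_query` and `combine_results` are hypothetical; they are not drawn from the Distributed Research Network's actual software.

```python
from collections import Counter

# Hypothetical records in a shared common data model; each site keeps its own.
SITE_A = [
    {"patient_id": "a1", "age": 67, "diagnosis": "diabetes"},
    {"patient_id": "a2", "age": 54, "diagnosis": "hypertension"},
    {"patient_id": "a3", "age": 71, "diagnosis": "diabetes"},
]
SITE_B = [
    {"patient_id": "b1", "age": 80, "diagnosis": "diabetes"},
    {"patient_id": "b2", "age": 45, "diagnosis": "asthma"},
]


def run_local_query(records, min_age):
    """Run inside one enclave: return counts per diagnosis, never row-level data."""
    return Counter(r["diagnosis"] for r in records if r["age"] >= min_age)


def combine_results(site_counts):
    """Run by the coordinating center: sum the aggregate counts from each site."""
    total = Counter()
    for counts in site_counts:
        total.update(counts)
    return total


# Each site evaluates the same query locally; only the counts leave the site.
local_results = [run_local_query(site, min_age=50) for site in (SITE_A, SITE_B)]
network_total = combine_results(local_results)
```

Because every site uses the same field names, one query definition can run unchanged at each enclave. A production network would likely need further safeguards, such as suppressing very small counts, which this sketch omits.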

July 23, 2018: New Report Summarizes Patient-Reported Health Data and Metadata Standards in the ADAPTABLE Trial

A new report in the Living Textbook describes results of a literature review of data standards and metadata standards for variables of interest to the ADAPTABLE trial. Based on the review, the authors recommend standards for ADAPTABLE, also known as the Aspirin Study, which is the first major randomized comparative effectiveness trial to be conducted by the National Patient-Centered Clinical Research Network (PCORnet). The trial aims to identify the optimal dose of aspirin therapy for secondary prevention in atherosclerotic cardiovascular disease.

Because the ADAPTABLE trial relies on patients to report key information at baseline and throughout follow-up, it represents a unique opportunity to develop, pilot, and evaluate methods to validate and integrate patient-reported information with data obtained from electronic health records (EHRs). In 2016, the National Institutes of Health implemented a project with the goal of using the ADAPTABLE study to develop methods to (1) assess the quality of patient-reported data and (2) integrate the data with existing EHR data. It is hoped that this project will inform future efforts to synthesize potentially inconsistent data from patient-reported and EHR sources and identify opportunities to streamline data.

Download the report.

June 7, 2018: NIH Releases First Strategic Plan for Data Science

On June 4, the National Institutes of Health (NIH) released its first Strategic Plan for Data Science. The plan outlines steps the agency will take to modernize research data infrastructure and resources and to maximize the value of data generated by NIH-supported research.

Data science challenges for NIH have evolved and grown rapidly since the launch of the Big Data to Knowledge (BD2K) program in 2014. The most pressing challenges include the growing costs of data management, limited interconnectivity and interoperability among data resources, and a lack of generalizable tools to transform, analyze, and otherwise support the usability of data for researchers, institutions, industry, and the public.

The goals of the NIH Strategic Plan for Data Science are to:

  • support an efficient, effective data infrastructure by optimizing data storage, security, and interoperability;
  • modernize data resources by improving data repositories, supporting storage and sharing of individual data sets, and integrating clinical and observational data;
  • develop and disseminate both generalizable and specialized tools for data management, analytics, and visualization;
  • enhance workforce development for data science by expanding NIH’s internal data science workforce and supporting expansion of the national research workforce, and by engaging a broader community of experts and the general public in developing best practices; and
  • enact policies that promote stewardship and sustainability of data science resources.

As part of the implementation of the strategic plan, the NIH will hire a chief data strategist. For information about the position, see the job announcement.

March 21, 2018: Dr. Rob Califf to Speak on Data Science at March 23 Grand Rounds

Robert Califf, MD, former FDA Commissioner and current Vice Chancellor for Health Data Science at Duke University School of Medicine, will present at NIH Collaboratory Grand Rounds on Friday, March 23 at 1 pm ET. The webinar will be broadcast live and is open to the public. Following the presentation, Dr. Califf will answer questions from the Grand Rounds audience.

As Director of Duke Forge, Duke’s interdisciplinary center for actionable health data science, Dr. Califf is currently working on initiatives designed to harness biostatistics, machine learning, and sophisticated informatics approaches to improve health and healthcare. Dr. Califf is also an adjunct professor of medicine at Stanford University and is employed by Verily Life Sciences as a scientific advisor. Verily, part of the Alphabet (Google) family of companies, is aimed at transforming the growth of health-related data into practical applications.

Dr. Califf has been a pioneer in the fields of clinical, translational, and outcomes research, and the NIH Collaboratory looks forward to hearing his thoughts on the pragmatic applications of data that will advance health and healthcare strategies and practice.

Topic: Data Science in the Era of Data Ubiquity

Date: Friday, March 23, 2018, 1:00-2:00 p.m. ET

Meeting Info: To check whether you have the appropriate players installed for UCF (Universal Communications Format) rich media files, go to https://dukemed.webex.com/dukemed/systemdiagnosis.php.

To join the online meeting:
Go to https://dukemed.webex.com/dukemed/j.php?MTID=m1a4a0665a615ae0382440edecedbdd33

October 10, 2017: NIH Collaboratory Core Working Group Interviews: Reflections from the Phenotypes, Data Standards, and Data Quality Core

At the NIH Collaboratory Steering Committee meeting in May 2017, we asked Drs. Rachel Richesson and W. Ed Hammond, Co-chairs of the Phenotypes, Data Standards, and Data Quality Core, to reflect on the first 5 years of their Core’s work and the challenges ahead.

Both were pleased with how the Core was able to provide guidelines for assessing data quality and the reporting of pragmatic trials, especially around issues with phenotypes and the use of electronic health record data. Future work in this area needs to advance the development of regulations and standards for the collection of clinical data to support learning healthcare systems.

“We’ve built a community in our Core that represents a diverse group of scientists and clinicians showing the many ways to look at data challenges.”
– Dr. Rachel Richesson

In Fall 2017, the Phenotypes, Data Standards, and Data Quality Core merged with the Electronic Health Records Core. The combined Core will continue to work on data standards and quality, and approaches to define clinical phenotypes and endpoints, extract information, and discover errors in data from healthcare systems.

Download the interview (PDF).


New Living Textbook Chapter on Acquiring and Using Electronic Health Record Data for Research

Meredith Nahm Zozus and colleagues from the NIH Collaboratory’s Phenotypes, Data Standards, and Data Quality Core have published a new Living Textbook chapter about key considerations for secondary use of electronic health record (EHR) data for clinical research.

In contrast to traditional randomized controlled clinical trials where data are prospectively collected, many pragmatic clinical trials use data that were primarily collected for clinical purposes and are secondarily used for research. The chapter describes the steps a prospective researcher will take to acquire and use EHR data:

  • Gain permission to use the data. When a prospective researcher wishes to use data, a data use agreement (DUA) is usually required that describes the purpose of the research and the proposed use of the data. This section also describes use of de-identified data and limited data sets.
  • Understand fundamental differences in context. Data collected in routine care settings reflect standard procedures at an individual’s healthcare facility, and are not collected in a standard, structured manner.
  • Assess the availability of health record data. Few assumptions can be made about what is available from an organization’s healthcare records; up-front, detailed discussions about data element collection over time at each facility are required.
  • Understand the available data. A secondary data user must understand both the data meaning and the data quality; both can vary greatly across organizations and affect a study’s ability to support research conclusions.
  • Identify populations and outcomes of interest. Because healthcare facilities are obligated to provide only the minimum necessary data to answer a research question, investigators must identify the needed patients and data elements with specificity and sensitivity to answer the research question given the available data.
  • Consider record linkage. Studies using data from multiple records and sources will require matching data to ensure they refer to the correct patient.
  • Manage the data. The investigator is responsible for receiving, managing, and processing data and must demonstrate that the data are reproducible and support research conclusions.
  • Archive and share the data after the study. Data may be archived and shared to ensure reproducibility, to enable auditing for quality assurance and regulatory compliance, or to answer other questions about the research.
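The record linkage step in the list above can be sketched as a simple deterministic match. This is a minimal illustration, not the chapter's method: the field names (`last_name`, `first_name`, `birth_date`, `ehr_id`, `claim_id`), the sample records, and the functions `normalize`, `link_key`, and `link_records` are all hypothetical, and real studies often need probabilistic matching as well.

```python
def normalize(value):
    """Lowercase and drop spaces and punctuation so trivial differences do not block a match."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def link_key(record):
    """Build the comparison key from name and date of birth (an assumed choice of identifiers)."""
    return (
        normalize(record["last_name"]),
        normalize(record["first_name"]),
        record["birth_date"],
    )


def link_records(ehr_records, claims_records):
    """Pair each claim with an EHR record only when exactly one candidate shares its key."""
    index = {}
    for rec in ehr_records:
        index.setdefault(link_key(rec), []).append(rec)

    links = []
    for claim in claims_records:
        candidates = index.get(link_key(claim), [])
        if len(candidates) == 1:  # skip missing or ambiguous matches
            links.append((candidates[0]["ehr_id"], claim["claim_id"]))
    return links


# Hypothetical sample data from two sources.
ehr = [
    {"ehr_id": "E1", "last_name": "O'Neil", "first_name": "Mary ", "birth_date": "1950-02-01"},
    {"ehr_id": "E2", "last_name": "Smith", "first_name": "John", "birth_date": "1962-07-15"},
]
claims = [
    {"claim_id": "C1", "last_name": "ONEIL", "first_name": "mary", "birth_date": "1950-02-01"},
    {"claim_id": "C2", "last_name": "Smith", "first_name": "Jon", "birth_date": "1962-07-15"},
]
links = link_records(ehr, claims)
```

Here the second claim stays unlinked because "Jon" and "John" differ after normalization, which shows why exact matching alone can miss true matches.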

ClinicalTrials.gov Analysis Dataset Available from CTTI

As part of a project that examined the degree to which sponsors of clinical research are complying with federal requirements for the reporting of clinical trial results, the Clinical Trials Transformation Initiative (CTTI) and the authors of the study are making the primary dataset used in the analysis available to the public. The full analysis dataset, study variables, and data definitions are available as Excel worksheets from the CTTI website and on the Living Textbook’s Tools for Research page.


Collaboratory Phenotypes, Data Standards, and Data Quality Core Releases Data Quality Assessment White Paper


The NIH Collaboratory’s Phenotypes, Data Standards, and Data Quality Core has released a new white paper on data quality assessment in the setting of pragmatic research. The white paper, titled Assessing Data Quality for Healthcare Systems Data Used in Clinical Research (V1.0), provides guidance, based on the best available evidence and practice, for assessing data quality in pragmatic clinical trials (PCTs) conducted through the Collaboratory. Topics covered include an overview of data quality issues in clinical research settings, data quality assessment dimensions (completeness, accuracy, and consistency), and a series of recommendations for assessing data quality. Also included as appendices are a set of data quality definitions and review criteria, as well as a data quality assessment plan inventory.

The full text of the document can be accessed through the “Tools for Research” tab on the Living Textbook or can be downloaded directly here (PDF).
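To make two of the assessment dimensions named above more concrete, the following Python sketch computes a completeness rate and a consistency rate on a handful of made-up encounter records. The field names, the sample data, and the specific consistency rule (discharge not before admission) are assumptions chosen for illustration and are not taken from the white paper. Accuracy is left out because it would require comparison against a trusted reference source.

```python
from datetime import date

# Hypothetical encounter records; None marks a missing value.
records = [
    {"admit": date(2018, 1, 3), "discharge": date(2018, 1, 5), "systolic_bp": 128},
    {"admit": date(2018, 2, 10), "discharge": date(2018, 2, 9), "systolic_bp": None},
    {"admit": date(2018, 3, 1), "discharge": None, "systolic_bp": 141},
    {"admit": date(2018, 4, 7), "discharge": date(2018, 4, 7), "systolic_bp": 117},
]


def completeness(rows, field):
    """Share of records with a non-missing value for `field` (None if there are no records)."""
    if not rows:
        return None
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)


def date_consistency(rows):
    """Among records with both dates, share where discharge is not before admission."""
    checkable = [r for r in rows if r["admit"] is not None and r["discharge"] is not None]
    if not checkable:
        return None
    consistent = sum(1 for r in checkable if r["discharge"] >= r["admit"])
    return consistent / len(checkable)


bp_completeness = completeness(records, "systolic_bp")
discharge_completeness = completeness(records, "discharge")
dates_consistent = date_consistency(records)
```

Reporting such rates per field and per site, rather than as a single overall score, makes it easier to see where a data source falls short for a particular study question.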