Data Sharing and Embedded Research

Section 3 Data Sharing Solutions for Embedded Research

The last few years have seen real progress in increasing openness (Ebrahim et al. 2014), and several methods with varying degrees of restriction, transparency, and cost have been deployed (see table below), ranging from public release of datasets to private data enclaves with distributed research networks. These methods afford different levels of protection for health systems but also require different levels of support for implementation. We discuss these solutions below using the NIH Collaboratory Trials as examples.

Technical Structures for Data Sharing From Least Restrictive (and Least Expensive) to Most Restrictive (and Most Expensive)

Structure	Description	Additional elements	Resource needs	Example
Public archive	Analyzable data can be obtained by any user for any use. No restriction on the kinds of research questions new users can address.	May impose restrictions such as prohibitions against re-identification or access to small cell counts. May de-identify certain elements, such as study site or demographics, or present sensitive data as an aggregate summary variable.	Initial development and annotation. Maintenance and access costs.	Agency for Healthcare Research and Quality (AHRQ) Healthcare Cost and Utilization Project (HCUP)
Private archive	Analyzable data can be obtained by authorized users. An honest broker or the original owner of the data decides which uses to authorize. Requires a binding agreement by the recipient regarding protection and use of transferred data.	As noted for public archive.	As noted for public archive. Evaluation of requests. Execution of data sharing, data use, data transfer, and other agreements, including agreements covering data with full identifiers. Monitoring of compliance with agreements and response to breaches of agreements.	Yale University Open Data Access (YODA) Project; Centers for Medicare & Medicaid Services (CMS) Limited Data Sets; National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository
Public enclave	Any user may query the data but not take possession of them. Only aggregate results may be removed from the enclave. No restriction on the kinds of questions users can address.	May impose restrictions such as prohibitions against re-identification, passing the data to other users, or access to small cell counts. May de-identify certain elements, such as study site or demographics.	Initial development and annotation. Ongoing curation and governance. Creation and maintenance of informatics support for analyses, including software licenses, computational capabilities, and file storage. Personnel needed to ensure data quality, etc.	Centers for Medicare & Medicaid Services (CMS) Virtual Research Data Center (VRDC)
Private enclave	Similar to public enclave with regard to provisions for analyzing data without taking possession of them. An honest broker or the original owner of the data decides which uses to authorize.	Moderated by an honest broker or by representatives of the study and/or site (either queries or results).	As noted for public enclave. Additional resources to evaluate requests and supervise the conduct of approved studies.	Food and Drug Administration (FDA) Sentinel Distributed Data Set

Public and Private Data Archives

With a data archive, data are annotated and de-identified as deemed necessary and stored for later analyses by interested users. A publicly available archive is the least restrictive and least expensive option for sharing data, and a number of Collaboratory trials have used this method (see table below). In most cases, some modification to or restriction of the full analytic dataset was necessary to protect the privacy of health systems or providers. For example, the Suicide Prevention Outreach Trial (SPOT) was developed to compare suicide attempt rates in patients who received one of two suicide prevention strategies versus usual care. The investigators did not plan to include study site (health system) in the publicly available dataset, given concerns by participating health systems that such data could be used for inappropriate comparisons of suicide attempt rates across health systems. A naïve analysis of these data could compare rates of suicide attempt across health systems without considering well-established variation by geographic region and race/ethnicity. In this context, a health system making extra efforts to engage higher-risk populations could paradoxically be shown to have high suicide attempt rates. To facilitate the examination of variation in intervention effects across health systems, datasets including health system identifiers are available on request under a supervised data archive model, subject to specific agreements regarding use and re-disclosure. Because SPOT was randomized at the patient level, failure to account for study site in the released dataset may lead to a mis-estimation of variance, but the data will still be of scientific and public health value.

As another example, the Collaboratory’s Pain Program for Active Coping and Training (PPACT) trial was developed to coordinate and integrate services for helping patients adopt self-management skills for chronic pain, limit use of opioid medications, and identify factors amenable to treatment in the primary care setting (Debar et al. 2012). The study was conducted at Kaiser Permanente in the Northwest, Georgia, and Hawaii regions, thereby representing a diversity of patients and healthcare systems. Because the trial was conducted in three distinct regions with different racial and ethnic distributions, release of demographic information would readily identify the regions and potentially the participating PCPs. Because participating health plans were concerned that naïve analyses of region-specific data could be used to conduct inappropriate or invalid comparisons of pain treatment and outcomes across various health systems, the data-sharing plan attempted to assure regional anonymity. Similarly, there were sensitivities about examination of individual clinician opioid prescribing patterns, so such data were included only in an aggregated format. Accordingly, the PPACT investigators created a public-release data archive that can be shared and that enables individuals to replicate, or at least closely replicate, the primary analysis. The public-release dataset was expected to include anonymous patient and cluster identifiers, but no information on region or clinic facility.
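The kind of modification described above can be illustrated with a minimal sketch. It is not the PPACT investigators' actual procedure: the field names (`region`, `clinic`, `race`, `ethnicity`, `patient_id`, `cluster_id`) and the function `make_public_release` are hypothetical, chosen only to show how sensitive fields might be dropped while anonymous but internally consistent patient and cluster identifiers are retained.

```python
import secrets

# Hypothetical field names; the trial's real data dictionary is not given here.
SENSITIVE_FIELDS = {"region", "clinic", "race", "ethnicity"}


def make_public_release(rows, id_fields=("patient_id", "cluster_id")):
    """Return copies of `rows` without sensitive fields and with each
    identifier replaced by a random label that is consistent within
    this release."""
    id_maps = {field: {} for field in id_fields}
    released = []
    for row in rows:
        # Drop fields that could reveal the region, facility, or demographics.
        out = {k: v for k, v in row.items() if k not in SENSITIVE_FIELDS}
        for field in id_fields:
            mapping = id_maps[field]
            original = row[field]
            if original not in mapping:
                # A random token carries no information about the original
                # identifier or about record ordering.
                mapping[original] = secrets.token_hex(8)
            out[field] = mapping[original]
        released.append(out)
    return released
```

Because the same cluster always maps to the same label, an outside analyst could still account for clustering when replicating the primary analysis, without learning which region or clinic a cluster represents. A real release would also need a formal re-identification risk review, which this sketch does not attempt.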

The TiME and ICD-Pieces trials both used the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository—a private archive—to share data. Use of this archive transfers the administrative, financial, and oversight responsibilities to NIDDK, substantially decreasing the burden of the investigators. The availability of more repositories for data sharing will help future investigators more effectively and efficiently share data.

Among national trauma care systems, there are incentives supporting data sharing in multisite pragmatic trials. For example, the TSOS trial has a private archive and shares data with researchers whose aim is to impact future policy or affect clinical care in trauma centers nationwide. Although the data are housed in an archive, the team will consider an enclave approach (analyzing the data themselves and returning results), depending on the research question and potential ethical obligations not to reveal or receive sensitive data. Perhaps most importantly, the TSOS team was incentivized to share data and publish with teams at other U.S. trauma sites as part of the larger study goal of disseminating knowledge that will further American College of Surgeons practice guidelines for PTSD and comorbidity screening and intervention (Zatzick et al. 2016).

Public and Private Data Enclaves

A data enclave allows investigators to perform analyses without taking possession of the data. A public enclave allows any user to conduct research on any topic; for a private enclave, an honest broker or the original owners of the data determine appropriate use. Private enclaves may establish their own rules regarding users and uses of their data. The NIH Collaboratory’s ABATE trial used a private enclave for all primary and secondary analyses of trial data. All analyses were conducted behind Hospital Corporation of America's firewall using a supervised data enclave model to prevent misuse of data for comparative purposes. This model requires a data use agreement (DUA), and all data are de-identified. Other investigators, with approval, could reproduce the results if ever needed. This solution allows investigators to perform analyses without downloading the data themselves, but it is costly and is in effect only for a finite period of time.
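As a rough illustration of the enclave idea, the sketch below keeps row-level records private and answers only aggregate count queries, suppressing small cells. It is a toy model, not the ABATE or any production enclave: the class name `Enclave`, the method `count_by`, and the default threshold of 11 are assumptions made for the example.

```python
from collections import Counter


class Enclave:
    """Toy enclave: holds row-level records privately and answers only
    aggregate count queries."""

    def __init__(self, records, min_cell_size=11):
        self._records = list(records)        # never returned to callers
        self._min_cell_size = min_cell_size  # illustrative threshold (assumption)

    def count_by(self, field):
        """Return the count for each value of `field`; counts below the
        minimum cell size are reported as None (suppressed)."""
        counts = Counter(record[field] for record in self._records)
        return {
            value: (n if n >= self._min_cell_size else None)
            for value, n in counts.items()
        }
```

For example, an enclave holding 12 intervention records and 3 control records would report the intervention count and suppress the control count. A real enclave would add authentication, review of proposed queries, and audit logging, which correspond to the governance and personnel costs listed in the table above.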

A distributed research network approach is a variation of a private data enclave and has been an important factor in obtaining the voluntary participation of health care organizations in public health activities and research for the public good. It allows the organizations to maintain physical and operational control over both their patients’ and their own confidential data. They thus can opt in to participation in a wide array of societally beneficial programs without concern that they are putting their data at risk of other uses. The Patient-Centered Outcomes Research Network (PCORnet) and FDA’s Sentinel System are examples of networks in which each participating site holds its data behind its own firewall but can make them available (opt-in) through a distributed research network for approved queries.
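The distributed pattern can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how PCORnet or Sentinel is implemented: the functions `run_local_query` and `pooled_rate` and the `opted_in` and `records` keys are hypothetical. Each site evaluates the query against its own records and returns only counts, and a coordinating center pools the counts from sites that opted in.

```python
def run_local_query(site_records, predicate):
    """Runs behind the site's firewall and returns only aggregate counts."""
    events = sum(1 for record in site_records if predicate(record))
    return {"events": events, "total": len(site_records)}


def pooled_rate(sites, predicate):
    """Pool counts from opted-in sites; row-level data never leave a site.

    `sites` is a list of dicts with hypothetical keys 'opted_in' and 'records'.
    Returns None when no opted-in site contributes any records.
    """
    events = total = 0
    for site in sites:
        if not site["opted_in"]:
            continue  # the site declined this particular query
        result = run_local_query(site["records"], predicate)
        events += result["events"]
        total += result["total"]
    return events / total if total else None
```

In this sketch the coordinating function calls each site directly for brevity; in practice the query would be sent to each site and executed there, so that the coordinator sees only the returned counts.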

Limitations of These Solutions

All of these data sharing mechanisms have drawbacks. Greater control inevitably involves greater expense because of the added leadership, legal, statistical, and information technology resources required. Further, when sensitive health system characteristics are important potential confounders, the least restrictive and least expensive methods are also often the least useful, because the data that can be shared with no restriction will lack information needed to replicate the primary analysis, or to address some additional questions. The most restrictive—an enclave controlled by the original data owner—does not guarantee access. All of these solutions incur meaningful costs for annotation of the data, and all but public archives require ongoing support for oversight. Enclaves also incur substantial ongoing costs for oversight and for maintaining a computing environment that can support analyses.

In this JAMA Viewpoint article, NIH Collaboratory investigator Dr. Richard Platt and colleague Dr. Tracy Lieu discuss the value of data enclaves to facilitate information sharing in support of research, quality improvement, and public health reporting (Platt and Lieu 2018).

“Data enclaves address 2 major barriers to data sharing. First, they allow health systems to protect patients’ interests and their own by maintaining physical and operational control, permitting the systems to opt in or out of proposed analyses. Second, they obviate the need to build new secure systems.” (Platt and Lieu 2018)

Collaboratory Data Sharing Plans (Assumes HIPAA-Compliant Patient De-identification for All Patients and a Data Use Agreement Where Appropriate)

Study name	Risks to providers or health systems	Data sharing structure	Steps to mitigate risks to providers or health systems
PPACT Pain Program for Active Coping and Training	Data on opioid prescribing patterns could be misused for inappropriate comparisons of providers or facilities.	Public archive of a modified dataset	Public-use dataset does not include facility or health system identifiers, characteristics or prescribing/referral practices of individual providers, or patient-level data on race or ethnicity.
STOP CRC Strategies and Opportunities to Stop Colon Cancer in Priority Populations	Data on screening rates could be misused for inappropriate or biased comparisons of performance across clinics or inaccurate comparisons with public quality measures.	Private archive managed by study team	De-identified patient-level data are available, with permissions and data use agreements in place. Data use agreements are limited to specific research uses and require destruction after authorized analyses are completed.
SPOT Suicide Prevention Outreach Trial	Data on suicide attempt rates could be used for biased or inappropriate comparisons of suicide attempts or suicide mortality across health systems.	Public archive of a modified dataset	Public-use dataset does not include indicator for health system.
TiME Time to Reduce Mortality in End-Stage Renal Disease	Data regarding mortality could be misused for inappropriate or biased comparisons of facilities or healthcare systems. Detailed data regarding patterns of care could reveal proprietary business information.	Private archive managed by NIDDK	De-identified patient-level data were aggregated across provider organizations and stored at the NIDDK Central Repository. Facility identifiers, dialysis provider organization identifiers, and data elements that were unique to one of the dialysis providers were removed. Data are made available through formal request and a data use agreement between the requestor and the NIDDK.
PROVEN Pragmatic Trial of Video Education in Nursing Homes	Data regarding mortality could be misused for inappropriate or biased comparisons of participating facilities or systems. Data regarding admissions and discharges could reveal proprietary business information.	Public archive of aggregate-level dataset	Public-use dataset includes facility-level aggregate data, with restrictions to prevent re-identification of participating facilities.
LIRE Lumbar Image Reporting with Epidemiology	Data regarding treatment patterns and resource use could be used for inappropriate or biased comparisons across health systems and could reveal proprietary health system business information.	Private archive managed by study team	Patient-level datasets were de-identified by health systems, clinics, providers, and patients. Investigators authorize release to specific users for specific purposes.
ABATE Active Bathing to Eliminate Infection	Data regarding infection rates could be used for inappropriate comparisons of facilities or with public reports. Detailed information regarding facilities and utilization patterns could reveal proprietary business information.	Private enclave managed by study team	Potential users may propose specific queries. Only query results (not individual data) will be shared.
ICD-Pieces Improving Chronic Disease management with Pieces	Data regarding patterns of care could be used for biased or inappropriate comparisons across facilities or health systems. Given different specifications, comparison to publicly reported quality measures would be misleading.	Private archive managed by NIDDK	Patient-level data were de-identified and stored in an aggregate database. Identifiers for healthcare system, primary practice, and patients were removed. Use of the aggregate dataset is governed by authorized agreements with NIDDK.
TSOS Trauma Survivors Outcomes and Support	Data regarding baseline patient characteristics and study outcomes could be used for biased or inappropriate comparisons of care in participating facilities.	Private archive managed by study team	De-identified patient-level data are provided, with priority given to research that affects trauma care systems nationwide and Collaboratory investigators.

Resources

Introduction to the NIMH Data Archive
Presentation from the BackInAction trial with an overview of the basics and the steps involved in using the NIMH Data Archive.

REFERENCES

Debar LL, Kindler L, Keefe FJ, et al. 2012. A primary care-based interdisciplinary team approach to the treatment of chronic pain utilizing a pragmatic clinical trials framework. Transl Behav Med. 2:523–530. doi:10.1007/s13142-012-0163-2. PMID:23440672.

Ebrahim S, Sohani ZN, Montoya L, et al. 2014. Reanalyses of randomized clinical trial data. JAMA. 312:1024–1032. doi:10.1001/jama.2014.9646. PMID: 25203082.

Platt R, Lieu T. 2018. Data enclaves for sharing information derived from clinical and administrative data. JAMA. 320:753. doi:10.1001/jama.2018.9342. PMID:30083726.

Zatzick DF, Russo J, Darnell D, et al. 2016. An effectiveness-implementation hybrid trial study protocol targeting posttraumatic stress disorder and comorbidity. Implement Sci. 11:58. doi:10.1186/s13012-016-0424-4. PMID:27130272.

Version History

September 25, 2025: Added resources box with NIMH Data Archive presentation (changes made by G. Uhlenbrauck).

March 9: Updated to make descriptions of the trials past tense (changes made by K. Staman).

December 5, 2018: Updated and added reference as part of annual review (change made by K. Staman).

Published August 25, 2017

Rethinking Clinical Trials

A Living Textbook of Pragmatic Clinical Trials
