Data Sharing Solutions for Embedded Research

The last few years have seen real progress in increasing openness (Ebrahim et al. 2014). Several methods, with varying degrees of restriction, transparency, and cost, have been deployed (see table below), ranging from public release of datasets to private data enclaves with distributed research networks. These methods afford different levels of protection for health systems but also require different levels of support for implementation. We discuss these solutions below using the NIH Collaboratory Demonstration Projects as examples.


Technical Structures for Data Sharing From Least Restrictive (and Least Expensive) to Most Restrictive (and Most Expensive)

Public archive
Description:
  • Analyzable data can be obtained by any user for any use
  • No restriction on the kinds of research questions new users can address
Additional elements:
  • May impose restrictions like prohibitions against re-identification or access to small cell counts
  • May de-identify certain elements, such as study site or demographics, or present sensitive data as an aggregate summary variable
Resource needs:
  • Initial development and annotation
  • Maintenance and access costs
Example:
  • Agency for Healthcare Research and Quality (AHRQ) Healthcare Cost and Utilization Project (HCUP)

Private archive
Description:
  • Analyzable data can be obtained by authorized users
  • Honest broker or the original owner of the data decides which uses to authorize
Additional elements:
  • Requires binding agreement by recipient regarding protection and use of transferred data
  • As noted for public archive
Resource needs:
  • As noted for public archive
  • Evaluation of requests
  • Execution of data sharing, data use, data transfer, and other agreements, including agreements covering data with full identifiers
  • Monitoring of compliance with agreements, and response to breach of agreements
Examples:
  • Yale University Open Data Access (YODA) Project
  • Centers for Medicare & Medicaid Services (CMS) Limited Data Sets
  • National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository

Public enclave
Description:
  • Any user may query the data, but not take possession of it. Only aggregate results may be removed from the enclave
  • No restriction on the kinds of questions users can address
Additional elements:
  • May impose restrictions like prohibitions against re-identification, passing the data to other users, or access to small cell counts
  • May de-identify certain elements, such as study site or demographics
Resource needs:
  • Initial development and annotation
  • Ongoing curation and governance
  • Creation and maintenance of informatics support for analyses, including software licenses, computational capabilities, and file storage
  • Personnel needed to ensure data quality, etc.
Example:
  • Centers for Medicare & Medicaid Services (CMS) Virtual Research Data Center (VRDC)

Private enclave
Description:
  • Similar to public enclave with regard to provisions for analyzing data without taking possession of it
  • Honest broker or the original owner of the data decides which uses to authorize
Additional elements:
  • Moderated by an honest broker or by representatives of the study and/or site (either queries or results)
Resource needs:
  • As noted for public enclave
  • Additional resources to evaluate requests and supervise the conduct of approved studies
Example:
  • Food and Drug Administration (FDA) Sentinel Distributed Data Set



Public and Private Data Archives

With a data archive, data are annotated, de-identified as deemed necessary, and stored for later analyses by interested users. A publicly available archive is the least restrictive and least expensive option for sharing data, and a number of Collaboratory trials are using this method (see table below). In most cases, some modification to or restriction of the full analytic dataset is necessary to protect the privacy of health systems or providers. For example, the Suicide Prevention Outreach Trial (SPOT) was developed to compare suicide attempt rates in patients who receive one of two suicide prevention strategies versus usual care. The investigators do not plan to include study site (health system) in the publicly available dataset, given concerns by participating health systems that such data could be used for inappropriate comparisons of suicide attempt rates across health systems. A naïve analysis of these data could compare rates of suicide attempt across health systems without considering well-established variation by geographic region and race/ethnicity. In this context, a health system making extra efforts to engage higher-risk populations could paradoxically be shown to have high suicide rates. To facilitate the examination of variation in intervention effects across health systems, datasets that include health system identifiers will be available on request under a supervised data archive model, subject to specific agreements regarding use and re-disclosure. Because SPOT is randomized at the patient level, failure to account for study site in the released dataset may lead to mis-estimation of variance, but the data will still be of scientific and public health value.

As another example, the Collaboratory’s Pain Program for Active Coping and Training (PPACT) trial was developed to coordinate and integrate services for helping patients adopt self-management skills for chronic pain, limit use of opioid medications, and identify factors amenable to treatment in the primary care setting (Debar et al. 2012). The study is being conducted at Kaiser Permanente in the Northwest, Georgia, and Hawaii regions, thereby representing a diversity of patients and healthcare systems. Because the trial is conducted in three distinct regions with different racial and ethnic distributions, release of demographic information would readily identify the regions and potentially the participating primary care providers (PCPs). Because participating health plans are concerned that naïve analyses of region-specific data could be used to conduct inappropriate or invalid comparisons of pain treatment and outcomes across health systems, the data-sharing plan will attempt to ensure regional anonymity. Similarly, there are sensitivities about examination of individual clinicians’ opioid prescribing patterns, so such data will be included only in an aggregated format. The PPACT investigators are therefore creating a public-release data archive that can be shared and that will enable others to replicate, or at least closely replicate, the primary analysis. The public-release dataset is expected to include anonymous patient and cluster identifiers, but no information on region or clinic facility.
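The preparation of a public-release dataset of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PPACT team's procedure: the field names (`patient_id`, `cluster_id`, `region`, `clinic_name`, `race_ethnicity`) and the function `make_public_release` are hypothetical. The sketch replaces real identifiers with random opaque ones, so patients in the same cluster remain linked, and drops the fields that could reveal region or facility.

```python
import secrets

# Hypothetical field names, chosen for illustration only.
DROPPED_FIELDS = {"region", "clinic_name", "race_ethnicity"}
ID_FIELDS = ("patient_id", "cluster_id")


def make_public_release(records):
    """Return copies of the records with opaque IDs and without sensitive fields."""
    # Real ID -> random opaque ID. These mappings stay with the study team
    # and are never released with the public dataset.
    mappings = {name: {} for name in ID_FIELDS}

    def opaque(name, real_id):
        mapping = mappings[name]
        if real_id not in mapping:
            mapping[real_id] = secrets.token_hex(8)
        return mapping[real_id]

    public = []
    for rec in records:
        # Keep only fields that are neither sensitive nor identifiers.
        row = {k: v for k, v in rec.items()
               if k not in DROPPED_FIELDS and k not in ID_FIELDS}
        # Re-attach identifiers in anonymous form.
        for name in ID_FIELDS:
            row[name] = opaque(name, rec[name])
        public.append(row)
    return public
```

Because the opaque cluster identifier is preserved, an analysis that accounts for clustering remains possible even though the region and facility cannot be read from the released file.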

The TiME and ICD-Pieces trials both use the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) Central Repository, a private archive, to share data. Use of this archive transfers the administrative, financial, and oversight responsibilities to NIDDK, substantially decreasing the burden on the investigators. The availability of more repositories for data sharing will help future investigators share data more effectively and efficiently.

Among national trauma care systems, there are incentives supporting data sharing in multisite pragmatic trials. For example, the TSOS trial has a private archive and will share data with researchers whose aim is to influence future policy or affect clinical care in trauma centers nationwide. Although the data will be housed in an archive, the team will consider an enclave approach (analyzing the data themselves and returning results), depending on the research question and potential ethical obligations not to reveal or receive sensitive data. Perhaps most importantly, the TSOS team is incentivized to share data and publish with teams at other US trauma sites as part of the larger study goal of disseminating knowledge that will further American College of Surgeons practice guidelines for PTSD and comorbidity screening and intervention (Zatzick et al. 2016).

Public and Private Data Enclaves

A data enclave allows investigators to perform analyses without taking possession of the data. A public enclave allows any user to conduct research on any topic; for a private enclave, an honest broker or the original owners of the data determine appropriate use. Private enclaves may establish their own rules regarding users and uses of their data. The NIH Collaboratory’s ABATE trial uses a private enclave for all primary and secondary analyses of trial data. All analyses are conducted behind Hospital Corporation of America’s firewall using a supervised data enclave model to prevent misuse of data for comparative purposes. This model requires a data use agreement (DUA), and all data will be de-identified. Other investigators, with approval, could reproduce the results if ever needed. This solution allows investigators to perform analyses without downloading the data themselves, but it is costly and will be in effect only for a finite period of time.
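The idea that only aggregate results leave an enclave, with small cell counts withheld, can be illustrated with a short Python sketch. It is a simplified illustration and not the mechanism used by ABATE or any particular enclave: the function `enclave_count`, the record layout, and the threshold `MIN_CELL_SIZE` are all assumptions, and a real enclave would set its own disclosure policy.

```python
from collections import Counter

# Hypothetical threshold; each enclave defines its own small-cell policy.
MIN_CELL_SIZE = 11


def enclave_count(records, group_by):
    """Count records per group, suppressing counts below MIN_CELL_SIZE.

    The patient-level records stay inside the enclave. Only this
    aggregate table is returned to the requesting investigator, and
    suppressed cells are reported as None.
    """
    counts = Counter(rec[group_by] for rec in records)
    return {
        group: (n if n >= MIN_CELL_SIZE else None)
        for group, n in counts.items()
    }
```

For example, a query grouping by a hypothetical `unit_type` field would return exact counts for large groups and `None` for any group with fewer than 11 patients.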

A distributed research network approach is a variation of a private data enclave and has been an important factor in obtaining the voluntary participation of health care organizations in public health activities and research for the public good. It allows the organizations to maintain physical and operational control over both their patients’ and their own confidential data. They thus can opt in to participation in a wide array of societally beneficial programs without concern that they are putting their data at risk of other uses. The Patient-Centered Outcomes Research Network (PCORnet) and FDA’s Sentinel System are examples of networks in which each participating site holds its data behind its own firewall but can make them available (opt-in) through a distributed research network for approved queries.
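A minimal Python sketch of the distributed pattern described above follows. It is an illustration under stated assumptions, not the PCORnet or Sentinel implementation: the `Site` class, the `distributed_count` function, and the query identifiers are hypothetical. Each site runs an approved query against its own records and returns only aggregate counts; a site that has not opted in to a query returns nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A participating organization holding data behind its own firewall."""
    name: str
    records: list  # patient-level rows; these never leave the site
    approved: set = field(default_factory=set)  # query IDs the site opted in to

    def run(self, query_id, predicate):
        """Run a query locally and return aggregate counts, or None if not opted in."""
        if query_id not in self.approved:
            return None
        n_match = sum(1 for rec in self.records if predicate(rec))
        return {"n_total": len(self.records), "n_match": n_match}


def distributed_count(sites, query_id, predicate):
    """Send one query to every site and combine the aggregate responses."""
    responses = []
    for site in sites:
        result = site.run(query_id, predicate)
        if result is not None:
            responses.append(result)
    return {
        "sites_responding": len(responses),
        "n_total": sum(r["n_total"] for r in responses),
        "n_match": sum(r["n_match"] for r in responses),
    }
```

The coordinating function sees only per-site totals, which reflects the design choice that makes voluntary participation acceptable: control over patient-level data, and over which queries are answered, stays with each site.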

Limitations of These Solutions

All of these data sharing mechanisms have drawbacks. Greater control inevitably involves greater expense because of the added leadership, legal, statistical, and information technology resources required. Further, when sensitive health system characteristics are important potential confounders, the least restrictive and least expensive methods are also often the least useful, because the data that can be shared with no restriction will lack information needed to replicate the primary analysis or to address some additional questions. The most restrictive option, an enclave controlled by the original data owner, does not guarantee access. All of these solutions incur meaningful costs for annotation of the data, and all but public archives require ongoing support for oversight. Enclaves also incur substantial ongoing costs for oversight and for maintaining a computing environment that can support analyses.

In a JAMA Viewpoint article, NIH Collaboratory investigator Dr. Richard Platt and colleague Dr. Tracy Lieu discuss the value of data enclaves to facilitate information sharing in support of research, quality improvement, and public health reporting (Platt and Lieu 2018).

“Data enclaves address 2 major barriers to data sharing. First, they allow health systems to protect patients’ interests and their own by maintaining physical and operational control, permitting the systems to opt in or out of proposed analyses. Second, they obviate the need to build new secure systems.” (Platt and Lieu 2018)


Collaboratory Data Sharing Plans (Assumes HIPAA-Compliant Patient De-identification for All Patients and a Data Use Agreement Where Appropriate)

PPACT (Pain Program for Active Coping and Training)
  • Risks to providers or health systems: Data on opioid prescribing patterns could be misused for inappropriate comparisons of providers or facilities.
  • Data sharing structure: Public archive of a modified dataset
  • Steps to mitigate risks to providers or health systems: Public-use dataset will not include facility or health system identifiers, characteristics or prescribing/referral practices of individual providers, or patient-level data on race or ethnicity.

STOP CRC (Strategies and Opportunities to Stop Colon Cancer in Priority Populations)
  • Risks to providers or health systems: Data on screening rates could be misused for inappropriate or biased comparisons of performance across clinics or inaccurate comparisons with public quality measures.
  • Data sharing structure: Private archive managed by study team
  • Steps to mitigate risks to providers or health systems: De-identified patient-level data will be available, with permissions and data use agreements in place. Data use agreements will limit use to specific research purposes and require destruction of the data after authorized analyses are completed.

SPOT (Suicide Prevention Outreach Trial)
  • Risks to providers or health systems: Data on suicide attempt rates could be used for biased or inappropriate comparisons of suicide attempts or suicide mortality across health systems.
  • Data sharing structure: Public archive of a modified dataset
  • Steps to mitigate risks to providers or health systems: Public-use dataset will not include indicator for health system.

TiME (Time to Reduce Mortality in End-Stage Renal Disease)
  • Risks to providers or health systems: Data regarding mortality could be misused for inappropriate or biased comparisons of facilities or healthcare systems. Detailed data regarding patterns of care could reveal proprietary business information.
  • Data sharing structure: Private archive managed by NIDDK
  • Steps to mitigate risks to providers or health systems: De-identified patient-level data that are aggregated across provider organizations will be stored at the NIDDK Central Repository. Facility identifiers, dialysis provider organization identifiers, and data elements that are unique to one of the dialysis providers will be removed. Data will be made available through formal request and a data use agreement between the requestor and the NIDDK.

PROVEN (Pragmatic Trial of Video Education in Nursing Homes)
  • Risks to providers or health systems: Data regarding mortality could be misused for inappropriate or biased comparisons of participating facilities or systems. Data regarding admissions and discharges could reveal proprietary business information.
  • Data sharing structure: Public archive of aggregate-level dataset
  • Steps to mitigate risks to providers or health systems: Public-use dataset will include facility-level aggregate data, with restrictions to prevent re-identification of participating facilities.

LIRE (Lumbar Image Reporting with Epidemiology)
  • Risks to providers or health systems: Data regarding treatment patterns and resource use could be used for inappropriate or biased comparisons across health systems and could reveal proprietary health system business information.
  • Data sharing structure: Private archive managed by study team
  • Steps to mitigate risks to providers or health systems: Patient-level datasets will be de-identified with respect to health systems, clinics, providers, and patients. Investigators will authorize release to specific users for specific purposes.

ABATE (Active Bathing to Eliminate Infection)
  • Risks to providers or health systems: Data regarding infection rates could be used for inappropriate comparisons of facilities or with public reports. Detailed information regarding facilities and utilization patterns could reveal proprietary business information.
  • Data sharing structure: Private enclave managed by study team
  • Steps to mitigate risks to providers or health systems: Potential users may propose specific queries. Only query results (not individual data) will be shared.

ICD-Pieces (Improving Chronic Disease management with Pieces)
  • Risks to providers or health systems: Data regarding patterns of care could be used for biased or inappropriate comparisons across facilities or health systems. Given different specifications, comparison to publicly reported quality measures would be misleading.
  • Data sharing structure: Private archive managed by NIDDK
  • Steps to mitigate risks to providers or health systems: Patient-level data will be de-identified and stored in an aggregate database. Identifiers for healthcare system, primary practice, and patients will be removed. Use of the aggregate dataset will be governed by authorized agreements with NIDDK.

TSOS (Trauma Survivors Outcomes and Support)
  • Risks to providers or health systems: Data regarding baseline patient characteristics and study outcomes could be used for biased or inappropriate comparisons of care in participating facilities.
  • Data sharing structure: Private archive managed by study team
  • Steps to mitigate risks to providers or health systems: De-identified patient-level data will be provided, with priority given to research that will affect trauma care systems nationwide and to Collaboratory investigators.








Debar LL, Kindler L, Keefe FJ, et al. 2012. A primary care-based interdisciplinary team approach to the treatment of chronic pain utilizing a pragmatic clinical trials framework. Transl Behav Med. 2:523–530. doi:10.1007/s13142-012-0163-2. PMID:23440672.

Ebrahim S, Sohani ZN, Montoya L, et al. 2014. Reanalyses of randomized clinical trial data. JAMA. 312:1024–1032. doi:10.1001/jama.2014.9646. PMID:25203082.


Platt R, Lieu T. 2018. Data enclaves for sharing information derived from clinical and administrative data. JAMA. 320:753. doi:10.1001/jama.2018.9342. PMID:30083726.

Zatzick DF, Russo J, Darnell D, et al. 2016. An effectiveness-implementation hybrid trial study protocol targeting posttraumatic stress disorder and comorbidity. Implement Sci. 11:58. doi:10.1186/s13012-016-0424-4. PMID:27130272.


Version History

December 5, 2018: Updated and added reference as part of annual review (change made by K. Staman).

Published August 25, 2017


Simon G, Coronado G, DeBar L, et al. Data Sharing and Embedded Research: Data Sharing Solutions for Embedded Research. In: Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials. Bethesda, MD: NIH Health Care Systems Research Collaboratory. Available at: Updated March 19, 2020. DOI: 10.28929/070.