Data Sharing Concerns

Type of Information Disclosed

Traditional RCTs, such as tests of the efficacy of a new drug, device, behavioral treatment, or process, typically create research data sets that are readily de-identified and contain a limited number of data elements pertaining to the research question. With these data sets, it is generally feasible for researchers who did not participate in the original trial both to reproduce the primary results and to perform additional analyses addressing different questions. However, the range of these additional questions is typically limited by the design of the trial data set. Pragmatic trials and other embedded research typically compare alternative treatments, treatment strategies, or policies. In those comparisons, variation in practice patterns among providers or facilities is a potentially important confounder, especially in trials that randomize providers or facilities rather than individual patients.

Embedded research data sets may contain rich information extracted from health system records. Those practice-based data often contain more specific information about the providers and the systems themselves than do data from conventional clinical trials. Examples include the number, size, or location of facilities and practices; practice volume; the number, size, and census of primary, specialty, and inpatient care units; the number or type of personnel they employ; the structure of their formularies; and information about their vendors and supply chain. For example, the NIH Collaboratory's Active Bathing to Eliminate (ABATE) Infection trial uses data from over 600,000 admissions in 53 hospitals (for a complete description of all the trials, see the table in Definition of a Pragmatic Trial). The data set includes information on every hospital’s census and length of stay on most wards, plus individuals’ procedure and comorbidity data that could reveal sensitive business information regarding patient volume, size of individual services, length of stay on individual wards, and case mix. The size and richness of the data set effectively preclude protection against re-identification of the hospitals by comparison with external data sources. These facilities vary in size and could readily be identified by the simple release of numerators and denominators. In addition, because the ABATE study evaluates changes in the number of multidrug-resistant organisms in clinical cultures in these facilities, the potential for misuse and misinterpretation of the data for purposes unrelated to the original research question (e.g., using the data to make biased comparisons of the quality of care at these facilities) would be unacceptable to the healthcare system.

Similarly, the PROVEN trial is working with two nursing home systems operating in over 20 states and is obtaining a wide array of clinical data downloaded monthly on over 200,000 admissions and 60,000 long-stay residents treated in 360 skilled nursing facilities (Mor et al. 2017). Detailed clinical and demographic data from standardized patient assessments on all these patients are automatically merged with longitudinal information about staffing, treatments, and hospitalizations from the facility. These data are in turn merged with Medicare claims data to track hospital use and vital status over the entire study period, regardless of whether the patient switches facilities. Although facilities participating in the intervention group represent less than 1% of all US facilities, they could be identified without much difficulty if study results are presented in sufficient detail.
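The re-identification concern described above can be illustrated with a toy sketch. It assumes a hypothetical release that masks facility names but reports each site's admission count (the denominator), and a hypothetical public directory listing facilities with their approximate annual admissions. All names, counts, and the 5% matching tolerance are invented for illustration and are not drawn from ABATE, PROVEN, or any real data source.

```python
# Toy illustration only: every facility name and count below is invented.

# Hypothetical "de-identified" study release: masked site ID -> admissions (denominator)
released = {"site_A": 41250, "site_B": 9870, "site_C": 18430}

# Hypothetical public directory: facility name -> reported annual admissions
public = {
    "General Hospital": 41000,
    "Lakeside Medical Center": 18500,
    "Riverside Community Hospital": 9900,
    "Hilltop Hospital": 27300,
}


def candidate_matches(count, directory, tolerance=0.05):
    """Return facility names whose public count is within a relative tolerance of count."""
    return [
        name
        for name, n in directory.items()
        if abs(n - count) <= tolerance * count
    ]


def reidentify(released_counts, directory, tolerance=0.05):
    """Map each masked site to its unique candidate, or None if ambiguous or unmatched."""
    result = {}
    for site, count in released_counts.items():
        matches = candidate_matches(count, directory, tolerance)
        result[site] = matches[0] if len(matches) == 1 else None
    return result


links = reidentify(released, public)
for site, name in links.items():
    print(site, "->", name)
```

In this invented example each masked site matches exactly one public facility, because the facilities differ enough in size that a denominator alone acts as a fingerprint. Real linkage attempts would depend on which external sources exist and how precisely counts are reported.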

Provider and Institutional Confidentiality

There are two types of confidentiality risks to provider groups and healthcare systems. The first involves revealing business information (e.g., which drugs are purchased or what price is paid for specific services), which has a clear right of privacy. The second is revealing information that could be used for naïve and potentially biased comparisons of quality of care or performance, especially if that information differs from what is publicly reported by all systems or facilities (e.g., it is more detailed, covers more subsets, is limited to vulnerable populations, or lacks case-mix adjustment). In a perfect world there would be no right of privacy about quality of healthcare delivery, but the current world is not perfect because the scope of disclosures for those participating in embedded research could be far greater than that required for assessing quality parameters. Health systems volunteer to participate in research to improve public health, and bearing an additional risk of misuse of sensitive information may be unacceptable (Platt and Ramsberg 2016). Moreover, healthcare entities that are incentivized for their performance on quality metrics may be especially concerned about research that may produce data inconsistent with public reports because of differences between the definitions or methods used by a study and those used for public quality measures (e.g., Healthcare Effectiveness Data and Information Set (HEDIS) measures based on claims data versus HEDIS measures that rely on medical record abstraction) (Simon et al. 2017). Also, the information may be extremely sensitive and may involve vulnerable populations. For example, for PPACT, part of the reason that healthcare systems and individual providers are partnering for the research is the tremendous concern about overprescribing opioids and the dangers it presents. Yet there are substantial sensitivities about individual primary care provider prescribing patterns, which in turn influence what data can be made available in the shared data sets. Other medical domains with complex and sensitive issues that might be the focus of an embedded trial could present similar concerns.


Current and proposed disclosure policies are particularly problematic for observational studies and cluster-randomized trials because providers and delivery systems, like their patients, have some of the attributes of research subjects. The difficulty is greatest for individual providers: care systems may authorize use of their data, but individual providers typically are not given this opportunity.

Many embedded research studies are granted a waiver of consent from patients, with the requirement that personal health information be protected from disclosure. For providers, practices, and health systems that participate in research studies, there are no similar regulatory protections, but there is a reasonable corollary, especially for individuals whose involvement is determined by their inclusion in a randomized cluster without their explicit consent. Some have argued that health systems, providers, and/or individual practitioners are participants in embedded research, much like patients, and that we therefore have an ethical obligation to provide suitable assurances regarding legitimate privacy and confidentiality concerns about use and re-use of proprietary data collected during clinical care. However, this ethical argument has proved contentious; the scientific community is encouraging a shift to a more transparent clinical trials enterprise, and this type of data sharing is required in other industries, including the pharmaceutical and device industries. The crux of the argument boils down to a very practical matter, which is ensuring voluntary participation.

Risk of Breach in Data Security

Embedded research studies are typically orders of magnitude larger than conventional clinical trials, making delivery systems especially sensitive to the potential for breaches of data security. For example, the median sample size for the NIH Collaboratory trials is 19,500 individuals; the largest involves 600,000 individuals. Thus, the potential for harm from a single security breach is substantial.





NIH Collaboratory Healthcare Systems Interactions Core. Lessons Learned from the NIH Health Care Systems Research Collaboratory Demonstration Projects. 2016. Accessed September 1, 2016.

Mor V, Volandes AE, Gutman R, Gatsonis C, Mitchell SL. 2017. PRagmatic trial Of Video Education in Nursing homes: The design and rationale for a pragmatic cluster randomized trial in the nursing home setting. Clin Trials. 14(2):140–151. doi:10.1177/1740774516685298. PMID: 28068789



Platt R, Ramsberg J. 2016. Challenges for Sharing Data from Embedded Research. N Engl J Med. 374(19):1897. doi:10.1056/NEJMc1602016. PMID: 27096325

Simon GE, Coronado G, DeBar LL, Dember LM, Green BB, Huang SS, Jarvik JG, Mor V, Ramsberg J, Septimus EJ, et al. 2017 Oct 3. Data Sharing and Embedded Research. Ann Intern Med. doi:10.7326/M17-0863. PMID: 28973353

Version History

December 5, 2018: Added references as part of the annual update (changes made by K. Staman).

Published August 25, 2017


Simon G, Coronado G, DeBar L, et al. Data Sharing and Embedded Research: Data Sharing Concerns. In: Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials. Bethesda, MD: NIH Health Care Systems Research Collaboratory. Available at: Updated December 5, 2018. DOI: 10.28929/069.