Data Sharing and Embedded Research
Section 2
Data Sharing Concerns
There is a strong assumption that data and metadata should be broadly shared with the scientific community. NIH's data management and sharing policy, which took effect in 2023, requires researchers who accept NIH funding to produce a data management and sharing plan (NIH 2023). There are many reasons for these requirements. Appropriate data management is crucial for maintaining the integrity and rigor of the research, and data sharing increases the need for strong data management. Data sharing can improve the reproducibility of research findings and the reuse of existing data in new research. This allows research to progress more rapidly and ensures that the benefits of federal funding are optimized. Finally, the transparency that results from data sharing enhances confidence in research findings and allows others to examine and validate results and analyses. In general, data should be managed and shared in a way that is consistent with the FAIR principles. That is, data and other digital assets should be Findable, Accessible, Interoperable, and Reusable.
Not all data are appropriate for sharing. There are exceptions to the regulatory requirements for data sharing due to ethical considerations discussed in this section. These considerations are especially relevant for pragmatic clinical trials.
Type of Information Disclosed
Traditional clinical trials, such as tests of efficacy of a new drug, device, behavioral treatment, or process, typically create research data sets that are readily deidentified and contain a limited number of data elements pertaining to the research question. With these data sets it is generally feasible for researchers who did not participate in the original trial both to reproduce the primary results and to perform additional analyses addressing different questions. However, the range of these additional questions is typically limited by the design of the trial data set. Pragmatic trials and other embedded research typically compare alternative treatments, treatment strategies, or policies. In those comparisons, variation in practice patterns among providers or facilities is a potentially important confounder, especially in trials that randomize providers or facilities rather than individual patients.
Embedded research data sets may contain rich information extracted from health system records. Those practice-based data often contain more specific information about the providers and the systems themselves than do data sets from conventional clinical trials. Examples include the number, size, or location of facilities and practices; practice volume; the number, size, and census of primary, specialty, and inpatient care units; the number or type of personnel they employ; the structure of their formularies; and information about their vendors and supply chain.
For example, ABATE Infection, an NIH Collaboratory Trial, used data from over 500,000 admissions in 53 hospitals (Huang et al 2019). The dataset included information on every hospital’s census and length of stay on most wards, plus individuals’ procedure and comorbidity data that could reveal sensitive business information regarding patient volume, size of individual services, length of stay on individual wards, and case mix. The size and richness of the dataset effectively precluded protection against reidentification of the hospitals by comparison with external data sources. These facilities varied in size and could readily be identified by the simple release of numerators and denominators. In addition, because the ABATE Infection trial evaluated changes in the number of multidrug-resistant organisms in clinical cultures in these facilities, the potential for misuse and misinterpretation of the data for purposes unrelated to the original research question (such as using the data to make biased comparisons of the quality of care at these facilities) would be unacceptable to the healthcare system.
Similarly, the PROVEN study, an NIH Collaboratory Trial, worked with 2 nursing home systems operating in over 20 states and obtained a wide array of clinical data downloaded monthly on over 200,000 admissions and 60,000 long-stay residents treated in 360 skilled nursing facilities (Mor et al 2017). Detailed clinical and demographic data from standardized patient assessments on all these patients were automatically merged with longitudinal information about staffing, treatments, and hospitalizations from the facility, which in turn was merged with Medicare claims data to track hospital use and vital status over the entire study period, regardless of whether the patient switched facilities. While facilities participating in the intervention group represented less than 1% of all US facilities, it would not be difficult to identify the facilities, depending upon the level of detail in which study results were presented.
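The facility reidentification risk described in the ABATE Infection and PROVEN examples can be illustrated with a small sketch. It is not drawn from either trial: the facility names, counts, the 5% tolerance, and the function name candidate_matches are all invented for illustration. The sketch assumes that an outside party holds publicly available volume figures for a set of facilities and compares them with the denominators released in a study table.

```python
# Hypothetical illustration only: every name and number below is invented.

def candidate_matches(released_denominator, public_counts, tolerance=0.05):
    """Return the public facilities whose known volume lies within
    `tolerance` (a fraction) of a denominator released by a study."""
    return [
        name
        for name, count in public_counts.items()
        if abs(count - released_denominator) <= tolerance * released_denominator
    ]

# Annual admission counts assumed to be available from external sources.
public_counts = {
    "Hospital A": 4200,
    "Hospital B": 11850,
    "Hospital C": 12100,
    "Hospital D": 30500,
}

# Denominators reported for anonymized study sites.
released = {"Site 1": 4150, "Site 2": 12000, "Site 3": 30900}

matches = {site: candidate_matches(n, public_counts) for site, n in released.items()}
for site, names in matches.items():
    status = "uniquely identifiable" if len(names) == 1 else f"{len(names)} candidates"
    print(site, names, status)
```

In this invented example, two of the three sites match exactly one public facility on size alone. Real data sets with ward-level census and length-of-stay information would give an outside party many more attributes to match on.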
Patient Participant Privacy
Because pragmatic clinical trials often involve both large amounts of data from large numbers of patients and real-world data from embedded interventions, adequately deidentifying patient data may be difficult if the data are shared. There is greater risk that patient data will inadvertently include identifiable information or that linkage of different data types will allow reidentification of patient information. This means that efforts to adequately deidentify data for data sharing may be logistically difficult and prohibitively expensive (Morain et al 2023).
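One way to make the linkage concern concrete is a k-anonymity check, a common measure from the privacy literature that the sources cited here do not specifically prescribe. The sketch below assumes a handful of invented patient records and hypothetical field names (age_band, sex, zip3, facility). It reports the size of the smallest group of records that share the same values on a chosen set of quasi-identifiers; a value of 1 means that at least one record is unique on those fields.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return k, the size of the smallest group of records that share
    the same values for every listed quasi-identifier."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Invented records; field names and values are assumptions for illustration.
records = [
    {"age_band": "70-79", "sex": "F", "zip3": "021", "facility": "Site 1"},
    {"age_band": "70-79", "sex": "F", "zip3": "021", "facility": "Site 1"},
    {"age_band": "70-79", "sex": "F", "zip3": "021", "facility": "Site 2"},
    {"age_band": "80-89", "sex": "M", "zip3": "021", "facility": "Site 1"},
    {"age_band": "80-89", "sex": "M", "zip3": "021", "facility": "Site 2"},
]

# Demographics alone versus demographics linked with facility information.
k_demographics = smallest_group_size(records, ["age_band", "sex", "zip3"])
k_linked = smallest_group_size(records, ["age_band", "sex", "zip3", "facility"])

print("k using demographics only:", k_demographics)
print("k after linking facility:", k_linked)
```

In this toy data set, adding a single linked field reduces the smallest group from 2 records to 1, which is the pattern the paragraph above describes: each additional data type that is linked can make individual records easier to single out.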
Provider and Institutional Confidentiality
In addition to concerns about potential privacy violations for patients who are participants in pragmatic trials, there are 2 types of confidentiality risks to provider groups and healthcare systems. The first involves revealing business information (for example, which drugs are purchased or what price is paid for specific services), for which there is a clear right of privacy. The second is revealing information that could be used for naïve and potentially biased comparisons of quality of care or performance, especially if that information differs from what is publicly reported by all systems or facilities (for example, if it is more detailed, covers more subsets, is limited to vulnerable populations, or lacks case-mix adjustment). In a perfect world there would be no right of privacy about the quality of healthcare delivery. In practice, however, the scope of disclosures for those participating in embedded research could be far greater than that required for assessing quality parameters. Health systems volunteer to participate in research to improve public health, and bearing an additional risk of misuse of sensitive information may be unacceptable (Platt et al 2016). Moreover, healthcare entities that are incentivized for their performance on quality metrics may be especially concerned about research that may produce data inconsistent with public reports because of differences between the definitions or methods used by a study and those used for public quality measures (for example, Healthcare Effectiveness Data and Information Set [HEDIS] measures based on claims data versus HEDIS measures that rely on medical record abstraction) (Simon et al 2017).
Also, the information may be extremely sensitive and may involve vulnerable populations. For example, in the PPACT study, an NIH Collaboratory Trial, part of the reason that healthcare systems and individual providers partnered for the research was the tremendous concern about overprescribing opioids and the dangers it presents. Yet there were substantial sensitivities about individual primary care providers' prescribing patterns, which in turn influenced what data could be made available in the shared data sets. Other sensitive medical domains that might be the focus of an embedded trial could present similar concerns.
Together, the concerns about the potential for reidentification of patient data and the risks to participants (both patients and healthcare providers) could justify limiting the data sharing that would normally be expected. NIH policy allows for limiting data sharing if sharing would pose risks to participants and the data cannot be adequately deidentified.
Informed Consent
Many pragmatic clinical trials are granted waivers or alterations of consent, particularly when they use cluster randomized designs. Waivers or alterations may also be granted for embedded studies when extant data from clinical care are the only data being used. This is a challenge because informed consent is generally used to justify data sharing. Informed consent is also the primary vehicle for the demonstration of respect for participants. Therefore, if a waiver of consent is granted for a pragmatic trial, the assumption that data sharing is consistent with the participant’s wishes is not necessarily valid, and the obligation to demonstrate respect to participants must be met in a different way. There are several ways of addressing this. Disclosure (if not consent), greater visibility of ongoing research efforts to improve care, and education of patients about those efforts could all play a role in demonstrating respect (Morain et al 2025; O’Rourke et al 2025; Propes et al 2024).
Current and proposed disclosure policies are particularly challenging for observational studies and cluster randomized trials because providers and delivery systems, like their patients, have some of the attributes of research subjects. This is especially problematic for providers since, while care systems may authorize use of their data, individual providers typically are not provided this opportunity.
For providers, practices, and health systems that participate in research studies, there are no similar regulatory protections. There is, however, a reasonable corollary to a waiver of consent, especially for individuals whose involvement is determined by their inclusion in a randomized cluster without their explicit consent. Some have argued that health systems, providers, and/or individual practitioners are participants in embedded research, much like patients, and that there is therefore an ethical obligation to provide suitable assurances regarding legitimate privacy and confidentiality concerns about the use and reuse of proprietary data collected during clinical care. However, this ethical argument has proved contentious: the scientific community is encouraging a shift to a more transparent clinical trials enterprise, and this type of data sharing is required in other industries, including the pharmaceutical and device industries. The crux of the argument is a very practical matter: ensuring voluntary participation.
Risk of Breach in Data Security
Embedded research studies are typically orders of magnitude larger than conventional clinical trials, making delivery systems especially sensitive to the potential for breaches of data security. For example, the median sample size for the NIH Collaboratory Trials is 19,500 individuals; the largest involves more than 500,000 individuals. Thus, the potential for harm from a single security breach is substantial. NIH is considering a new research security policy that would require institutions to have a research security plan if they accept more than $50 million in federal funding and to implement new training requirements for all research personnel (NIH 2025). Researchers have a duty to protect data from unauthorized access. Research security plans are an opportunity to think through meeting these obligations and demonstrating respect for participants.
Resources
Data Sharing and Pragmatic Clinical Trials: Law and Ethics Amidst a Changing Policy Landscape; Rethinking Clinical Trials Grand Rounds; November 11, 2022
REFERENCES
Huang SS, Septimus E, Kleinman K, et al. 2019. Chlorhexidine versus routine bathing to prevent multidrug-resistant organisms and all-cause bloodstream infections in general medical and surgical units (ABATE Infection trial): A cluster-randomised trial. Lancet. 393(10177):1205-1215. doi: 10.1016/S0140-6736(18)32593-5. PMID: 30850112.
Mor V, Volandes AE, Gutman R, Gatsonis C, Mitchell SL. 2017. PRagmatic trial Of Video Education in Nursing homes: The design and rationale for a pragmatic cluster randomized trial in the nursing home setting. Clin Trials. 14(2):140-151. doi: 10.1177/1740774516685298. PMID: 28068789.
Morain SR, Bollinger J, Weinfurt K, Sugarman J. 2023. Stakeholder perspectives on data sharing from pragmatic clinical trials: Unanticipated challenges for meeting emerging requirements. Learn Health Syst. 8(1):e10366. doi: 10.1002/lrh2.10366. PMID: 38249837.
Morain SR, Brickler A, Ali J, et al. 2025. Ethical considerations for sharing aggregate results from pragmatic clinical trials. Clin Trials. 22(2):248-254. doi: 10.1177/17407745241290782. PMID: 39587730.
NIH. 2023. Final NIH Policy for Data Management and Sharing. NOT-OD-21-013. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html. Accessed October 7, 2025.
NIH. 2025. Implementation of NIH Research Security Policies. NOT-OD-25-154. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-25-154.html. Accessed October 7, 2025.
O'Rourke PP, Ali J, Carrithers J, et al. 2025. Disentangling informing participants from obtaining their consent. Learn Health Syst. 2025 Apr 21. doi: 10.1002/lrh2.70014. Epub ahead of print.
Platt R, Ramsberg J. 2016. Challenges for sharing data from embedded research. N Engl J Med. 374(19):1897. doi: 10.1056/NEJMc1602016. PMID: 27096325.
Propes C, O'Rourke PP, Morain SR. 2024. Recurring and emerging ethical issues in pragmatic clinical trials. Circ Cardiovasc Qual Outcomes. 17(7):e010847. doi: 10.1161/CIRCOUTCOMES.124.010847. PMID: 39012931.
Simon GE, Coronado G, DeBar LL, et al. 2017. Data sharing and embedded research. Ann Intern Med. 167(9):668-670. doi: 10.7326/M17-0863. PMID: 28973353.
