Acquiring Real-World Data
Section 7
Gaining Permission to Use Real-World Data
In the United States, using patient data for research generally requires institutional review board (IRB) approval if the study is a clinical investigation that supports applications for research or marketing permits for products regulated by the US Food and Drug Administration (21 CFR Parts 50 and 56) or, more broadly, research involving human subjects conducted, supported, or otherwise subject to regulation by any federal department or agency (45 CFR Part 46 [the Common Rule]).
In addition, use of patient data for research is generally subject to federal and state privacy laws, notably the Health Insurance Portability and Accountability Act (HIPAA), which limits access to identifiable or protected health information (PHI) outside the context of treatment, payment, or healthcare operations. Use or disclosure of PHI for research generally requires either patient authorization or a waiver of the requirement for authorization approved by a privacy board.
Recent changes to the Common Rule have made it easier to reuse data collected as part of routine healthcare operations, including EHR data, for research purposes, with additional categories of studies now exempt from IRB oversight. However, to publish research involving data from human subjects, virtually all peer-reviewed journals require that the study have been reviewed and approved by an IRB or ethics board, or have received a determination that the research is exempt from oversight or is not human subjects research (Zozus et al 2015). In all of these cases, investigators are advised to locate an appropriate IRB or ethics board before embarking on research using patient data, even if their institution does not require it.
Data From Healthcare Organizations
Most healthcare organizations have procedures in place that define the permissible internal uses of the data they collect and store. These routine uses typically fall into the categories of treatment, payment, or operations, which are consistent with HIPAA regulations. Examples include data access for members of the care team, information exchange for care transitions, data use in quality improvement projects, and administrative reporting for organizational management. Facilities that conduct research also have procedures in place for secondary use of these health data. Secondary use of data for research is governed by federal regulations, privacy laws, and procedures established by the facility's IRB, privacy board, or research compliance office. Research uses of internal PHI typically fall under HIPAA privacy protections, requiring either patient authorization or a partial or full waiver of HIPAA protections, which may be granted by an IRB or privacy board under certain conditions.
Additional contractual agreements and regulatory compliance are required when investigators want to use data from institutions with which they are not directly associated (for example, a university researcher who wants to use data from local community hospitals). This will almost certainly be the case when using healthcare data as part of a multicenter pragmatic clinical trial. It is important to understand the requirements of HIPAA, which is the relevant regulation for such data disclosures. HIPAA applies to all covered entities, defined as health plans, healthcare clearinghouses, and healthcare providers who electronically transmit any health information in connection with transactions for which the US Department of Health and Human Services (DHHS) has adopted standards.
HIPAA addresses both the internal use and the external disclosure of PHI. Disclosure of PHI, defined as sharing outside of the covered entity, is allowed without patient authorization only in certain controlled situations, including release to healthcare reimbursement or operations departments, individual patients, or regulatory authorities and for national priority purposes. Disclosure of PHI for research and some other purposes requires one of the following: an authorization from each individual, an approved waiver of authorization, or the creation of a limited dataset (discussed below). To limit risk, many healthcare organizations will try to ensure that the data released are the “minimum necessary” to support a project. As a result, even in prospective studies for which investigators obtain patient consent and authorization, organizations may not wish to release more PHI than necessary.
A limited dataset, which is considered PHI, may contain identifiers including certain dates (for example, dates of hospital admission or discharge), gender, age, and elements of a patient's address (such as zip code, city, and state), but may not contain other more “direct” HIPAA identifiers, such as name, telephone number, or street address (National Institutes of Health 2003). For both limited datasets and datasets containing more PHI than allowed in a limited dataset, the recipient of the data must execute a data use agreement (DUA), a contractual arrangement for the transfer of PHI that describes the purposes for which the data can be used (45 CFR 164.514). The DUA will also include language securing the data by specifying limits on its use and sharing, and in most cases will prohibit reidentification of patients or linking to other data from the patients. (If patient consent is obtained and a project explicitly involves linkage, this language would not apply.)
When approaching a healthcare organization for a DUA, a prospective researcher should be prepared to provide a detailed, precise statement of what data elements are required, from what sources, and over what period. In addition, the investigator must describe how the data will be used and transferred securely and provide a list of all external personnel who will be permitted to use the information. Timelines for working out DUAs between stakeholders at healthcare facilities and external investigators vary greatly; in our experience, intervals range from 6 months to more than 2 years.
Of note, research with deidentified data is not considered to be research with human subjects and is not covered by HIPAA limitations.
HIPAA presents 2 approaches for covered entities to follow in creating deidentified datasets: Safe Harbor or Expert Determination.
The Safe Harbor method requires the removal of all 18 HIPAA identifiers and a determination by the entity sharing the data that they do “not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.” The identifiers include name, dates, address, Social Security number, phone and fax numbers, email addresses, biometric information, and other individually unique information (45 CFR 164.514(b)(2); Guidance on the Safe Harbor Method).
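As a rough illustration of the removal step, the following Python sketch drops identifier fields from a single patient record. All field names are hypothetical, the set covers only some of the identifier categories listed above, and reducing dates to a year reflects the Safe Harbor provision that permits the year to be retained. It is a sketch under those assumptions, not a complete Safe Harbor implementation.

```python
from datetime import date

# Hypothetical field names; a real schema will differ. This set covers only
# some of the identifier categories listed above, not all 18.
DROP_FIELDS = {"name", "street_address", "ssn", "phone", "fax", "email"}

# Hypothetical date fields. Each is reduced to a year rather than kept as a
# full date.
DATE_FIELDS = {"admit_date", "discharge_date"}


def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record without the listed identifier fields."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # remove the identifier entirely
        if field in DATE_FIELDS:
            # Keep only the year, stored under a renamed key.
            cleaned[field.replace("_date", "_year")] = value.year
            continue
        cleaned[field] = value
    return cleaned


# Made-up example record for illustration only.
record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "phone": "555-0100",
    "admit_date": date(2021, 3, 14),
    "discharge_date": date(2021, 3, 18),
    "diagnosis_code": "E11.9",
}
print(strip_identifiers(record))
```

Removing fields addresses only the first part of the method. The entity sharing the data must still determine that it has no actual knowledge that the remaining information could identify an individual.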
The Expert Determination method, by contrast, involves a determination by a qualified individual that the risk of reidentifying individuals from the dataset, either alone or in combination with other “reasonably available information,” is not more than “very small.” This approach can allow identifiers such as dates to be retained in the dataset, though certain variables may need to be transformed to minimize the risk of reidentification. Examples of such transformations include date shifting (shifting all dates by an offset, with a different random offset used for each patient), generalizing zip codes, and suppressing patient ages (DHHS 2012). Because the Safe Harbor approach is simpler, most deidentified datasets have historically been produced with it. The Expert Determination method has gained favor in recent years as health systems and others within the healthcare industry have begun to license EHR datasets, aggregate and link them, and make deidentified versions of record-level data available for use by researchers and others (Noel and Bartelt 2023; Truveta 2025).
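The transformations named above can be sketched in Python as follows. The class and function names, the 180-day shift window, the 3-digit zip prefix, and the age cap of 90 are all assumptions made for illustration. In practice the qualified expert would choose the parameters, and top-coding is only one way to read "suppressing patient ages."

```python
import random
from datetime import date, timedelta


class DateShifter:
    """Shift every date for a patient by one fixed, randomly drawn offset."""

    def __init__(self, max_days: int = 180, seed=None):
        # max_days is an assumed window, not a value taken from the guidance.
        # random.Random keeps the sketch reproducible; it is not a
        # cryptographically secure source of randomness.
        self._rng = random.Random(seed)
        self._max_days = max_days
        self._offsets = {}

    def offset_for(self, patient_id: str) -> timedelta:
        # Draw an offset the first time a patient is seen, then reuse it.
        if patient_id not in self._offsets:
            days = 0
            while days == 0:  # redraw so that no patient keeps true dates
                days = self._rng.randint(-self._max_days, self._max_days)
            self._offsets[patient_id] = timedelta(days=days)
        return self._offsets[patient_id]

    def shift(self, patient_id: str, d: date) -> date:
        return d + self.offset_for(patient_id)


def generalize_zip(zip_code: str) -> str:
    """Keep only the first 3 digits of a 5-digit zip code."""
    return zip_code[:3]


def top_code_age(age: int, cap: int = 90) -> str:
    """Report ages at or above the cap as a single category."""
    return f"{cap}+" if age >= cap else str(age)


shifter = DateShifter(seed=0)
admit, discharge = date(2021, 3, 14), date(2021, 3, 18)
new_admit = shifter.shift("patient-1", admit)
new_discharge = shifter.shift("patient-1", discharge)
# The length of stay is unchanged because both dates share one offset.
print(new_discharge - new_admit == discharge - admit)
```

Because each patient's dates move by the same offset, intervals within a patient (such as length of stay) are preserved, while calendar dates and comparisons across patients are not.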
Deidentified datasets are most appropriate for retrospective, observational studies, though the lack of identifiers such as dates of service and dates of death can prove problematic for certain analyses.
Data From Other Sources
Datasets managed by government agencies (such as the Centers for Medicare & Medicaid Services, the US Census Bureau, or the Environmental Protection Agency) often carry restrictions similar to those of healthcare organizations. IRB approval and DUAs may be required, and there may be requirements that any future publication cite the organization that provided the data. Similar processes may also be in place for groups that manage product, device, or disease registries.
Resources
Living Textbook chapter from the NIH Pragmatic Trials Collaboratory's Electronic Health Records Core
REFERENCES
US Department of Health and Human Services (DHHS). 2012. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html. Accessed September 20, 2025.
National Institutes of Health. 2003. How can covered entities use and disclose protected health information for research and comply with the Privacy Rule? In: Protecting Personal Health Information in Research: Understanding the HIPAA Privacy Rule. NIH Publication Number 03-5388. https://privacyruleandresearch.nih.gov/pdf/HIPAA_Privacy_Rule_Booklet.pdf. Accessed September 24, 2025.
Noel A, Bartelt K. 2023. Cosmos: Real-world data powered by the healthcare community. J Soc Clin Data Manag. 3(S1):1-4. doi: 10.47912/jscdm.246.
Truveta. 2025. Truveta Data. https://www.truveta.com/truveta-data. Accessed September 20, 2025.
Zozus MN, Richesson RL, Hammond WE, Simon GE. 2015. Acquiring and Using Electronic Health Record Data. NIH Collaboratory Electronic Health Records Core. https://dcricollab.dcri.duke.edu/sites/NIHKR/KR/Acquiring%20and%20Using%20Electronic%20Health%20Record%20Data.pdf. Accessed September 24, 2025.