Acquiring Real-World Data
Section 8
Methods of Access
There are several approaches to obtaining real-world data. Real-world data may be obtained directly from a site (such as a healthcare organization) or data holder, via a distributed research network, or directly from patients. Depending on the data needed, real-world data may be provisioned into a protected computing environment, often referred to as an enclave. We detail the trade-offs between the different approaches below.
Direct From Sites or Data Holders
Healthcare organizations, particularly those that participate in research, can often provide data in a variety of formats, which need to be aligned with the requirements of the project. Many other data holders, such as those that maintain disease or device registries, have similar capabilities. Examples include:
- Clinician-generated reports: Most electronic health records (EHRs) provide functionality that allows clinicians to generate on-demand reports geared toward answering care management questions (for example, who received a flu shot in the last 30 days, who was in the emergency department last night). Creating these reports has relatively low cost, and the reports typically take seconds to run, with real-time results. The drawback is that they have limited ability to include longitudinal results. They are geared toward "most recent" values: most recent lab result, date of last test, etc. Adoption and uptake also vary. Clinicians may not realize that they have the capability to generate such reports, as training and support vary by healthcare system.
- Database reports: Almost every EHR includes a reporting database and/or data warehouse. Extracts can be programmed against these repositories, and it may be possible to reuse the same query, or a large portion of it, across sites that use the same vendor. Once a query is developed, it is usually possible to automate the production and delivery of the data. This approach may not be feasible for smaller sites or sites without local information technology support, and complex queries will rely heavily on the skill set and knowledge of the local analyst responsible for programming. This can lead to variation in quality across sites.
- Common data model (CDM) extracts: Many academic medical centers that participate in distributed research networks may have their local source data transformed into a CDM. After the cohort or study population has been defined, local analysts can generate extracts of the relevant tables and fields.
- Application programming interface (API): Healthcare organizations (both healthcare systems and payers) are working to make their data available via API standards such as Fast Healthcare Interoperability Resources (FHIR). Most sites have limited experience delivering data in this format, which means they may not yet have robust processes for allowing access from external parties.
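To make the CDM extract option more concrete, the following is a minimal sketch of pulling one table's rows for an already-defined cohort. It uses an in-memory SQLite database as a stand-in for a site's CDM. The table and column names (`demographic`, `lab_result`, `patid`, `lab_name`) and the helper `extract_labs` are hypothetical and chosen for illustration only; an actual CDM will have its own specification.

```python
import sqlite3

# Hypothetical, simplified CDM tables; real CDMs define their own tables and columns.
def build_demo_cdm() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE demographic (patid TEXT PRIMARY KEY, birth_date TEXT);
        CREATE TABLE lab_result (patid TEXT, lab_name TEXT, result_num REAL, result_date TEXT);
        INSERT INTO demographic VALUES
            ('P1', '1950-02-01'), ('P2', '1984-07-15'), ('P3', '1971-11-30');
        INSERT INTO lab_result VALUES
            ('P1', 'HBA1C', 7.9, '2018-03-01'),
            ('P1', 'HBA1C', 7.1, '2018-09-01'),
            ('P2', 'HBA1C', 5.4, '2018-05-10'),
            ('P3', 'LDL', 130.0, '2018-06-20');
    """)
    return conn

def extract_labs(conn, cohort_ids, lab_name):
    """Return all results of one lab for the patients in the cohort."""
    if not cohort_ids:
        return []
    # One "?" placeholder per cohort member keeps the query parameterized.
    placeholders = ",".join("?" for _ in cohort_ids)
    sql = (
        "SELECT patid, lab_name, result_num, result_date "
        "FROM lab_result "
        f"WHERE lab_name = ? AND patid IN ({placeholders}) "
        "ORDER BY patid, result_date"
    )
    return conn.execute(sql, [lab_name, *cohort_ids]).fetchall()

conn = build_demo_cdm()
rows = extract_labs(conn, ["P1", "P3"], "HBA1C")
for row in rows:
    print(row)
```

Because the query is written against the CDM rather than a vendor-specific schema, the same extract logic could in principle be reused at any site that maintains the same model, unlike a query written against one EHR's reporting database.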
Distributed Research Networks
Study teams can partner directly with distributed research networks to obtain data to support their trials. The process to develop and distribute a query within these networks is usually straightforward, though there are often governance processes that must be followed. One query can be distributed in order to retrieve results from the whole network (or participating sites). An added benefit is that most distributed research networks perform some level of curation or quality assessment on data within their networks. This approach has two major drawbacks. First, the data elements of interest for the study may not be in the CDM of the network, which means they must be obtained through other means (for example, added to the CDM or abstracted through chart review). Second, large studies will likely need to go beyond a single distributed research network, meaning study teams will need to deal with data in multiple formats.
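The idea of distributing one query across a network can be sketched as follows. This is an illustrative toy, not how any particular network implements it: each site is represented by its own in-memory SQLite database, the `diagnosis` table and its columns are hypothetical, and each site returns only an aggregate count.

```python
import sqlite3

# The single query text sent to every site (hypothetical table and columns).
QUERY = "SELECT COUNT(DISTINCT patid) FROM diagnosis WHERE dx_code = ?"

def make_site(rows):
    """Create a stand-in for one site's local database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE diagnosis (patid TEXT, dx_code TEXT)")
    conn.executemany("INSERT INTO diagnosis VALUES (?, ?)", rows)
    return conn

def run_at_site(conn, dx_code):
    """Run the shared query locally and return only the aggregate count."""
    (count,) = conn.execute(QUERY, (dx_code,)).fetchone()
    return count

sites = {
    "site_a": make_site([("A1", "E11.9"), ("A1", "E11.9"), ("A2", "I10")]),
    "site_b": make_site([("B1", "E11.9"), ("B2", "E11.9"), ("B3", "E11.9")]),
}

# Distribute the same query to every site and combine the returned counts.
site_counts = {name: run_at_site(conn, "E11.9") for name, conn in sites.items()}
network_total = sum(site_counts.values())
print(site_counts, network_total)
```

Note that simply summing site-level counts would count a patient twice if that patient received care at more than one site; this sketch does not attempt to address that.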
Direct From Patient
While patients in the United States have always had the right to receive their health records from providers under the Health Insurance Portability and Accountability Act (HIPAA), it was not always possible to receive them in a machine-readable, electronic format (for example, not a scanned PDF). Spurred by efforts of the US federal government over the past decade to promote interoperability and patients' access to their own healthcare data, it has become increasingly viable to obtain data directly from patients. Certified EHRs historically have provided patients with the ability to download structured documents, which contain information about the most recent visit and some longitudinal values.
More recent regulations will require that EHRs provide data via FHIR APIs, which should streamline the process somewhat, especially given that technology companies such as Apple have made it easy for users to download their health records into the Health app on their own devices. Once the records have been downloaded, users can decide whether to share them with other applications, including those designed for research. The Centers for Medicare & Medicaid Services (CMS) has enabled similar workflows through its Blue Button 2.0 initiative, for Medicare beneficiaries as well as for CMS-regulated payers, including those that support Medicare Advantage, Medicaid, CHIP, and Qualified Health Plans (QHPs) on the federally facilitated exchanges.
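As a rough illustration of what working with FHIR-formatted data can look like once a patient has shared it, the sketch below flattens the Observation resources in a small FHIR Bundle into rows suitable for analysis. The Bundle is a hand-written sample, not output from any real system, and the helper `flatten_observations` is hypothetical. No network calls are made.

```python
import json

# Hand-written sample Bundle for illustration; real responses carry many more fields.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "searchset",
  "entry": [
    {"resource": {"resourceType": "Observation", "status": "final",
      "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4"}]},
      "effectiveDateTime": "2018-03-01",
      "valueQuantity": {"value": 7.9, "unit": "%"}}},
    {"resource": {"resourceType": "Patient", "id": "example"}},
    {"resource": {"resourceType": "Observation", "status": "final",
      "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4"}]},
      "effectiveDateTime": "2018-09-01"}}
  ]
}
"""

def flatten_observations(bundle):
    """Turn the Observation resources in a Bundle into flat dictionaries."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Observation":
            continue  # skip Patient and any other resource types
        codings = resource.get("code", {}).get("coding", [])
        quantity = resource.get("valueQuantity", {})
        rows.append({
            "code": codings[0].get("code") if codings else None,
            "date": resource.get("effectiveDateTime"),
            # Missing elements become None rather than raising an error.
            "value": quantity.get("value"),
            "unit": quantity.get("unit"),
        })
    return rows

observations = flatten_observations(json.loads(bundle_json))
for obs in observations:
    print(obs)
```

The second Observation in the sample deliberately lacks a value. Tolerating absent elements in this way is a design choice for the sketch, since not every source will populate every field.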
There are some drawbacks to this approach: (1) the "completeness" of the implementation of the standard varies by site and/or EHR vendor (D’Amore et al 2014); (2) study teams must broker access through a secondary app such as Apple Health, Hugo, or 1upHealth; and (3) if a patient receives care in multiple healthcare systems, they must make multiple requests to receive all of their records. Despite this, there may be studies that can benefit from such an approach, for instance, a study on a rare disease with a small number of patients who receive care across multiple healthcare systems. Negotiating agreements with multiple systems is time-consuming, so it may be faster to engage directly with patients.
The research community has much to learn about best practices in engaging with patients to obtain data in this manner (including how to encourage high response rates and how to ensure access is provided for the life of the study), but it remains an encouraging possibility.
Protected Computing Environments
Most academic medical centers and many healthcare organizations struggle with the need to provide access to clinical data for research while protecting sensitive data from EHRs and other systems. For this reason, many organizations set up protected computing environments, or limited-access platforms, where only individuals with the proper permissions can access data in a protected, secure environment that is separated from the rest of the network used for clinical and/or research purposes. (Another term for such a platform is a data or computing enclave.) In such an environment, users generally do not have the ability to download data to their local machines, access is provided via a remote or virtual computer, and analyses are contained within the protected space. In order to remove data from such an environment, users must go through an honest broker process, where the content is reviewed to ensure that it can be removed from the secure environment. For example, users may be able to transfer aggregate counts without additional approvals but may need special permission to remove patient-level records.
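The export rule described above (aggregate results may leave freely, patient-level records need approval) can be sketched as a simple check. This is only a toy model under stated assumptions: the column names treated as patient-level, the `ExportRequest` type, and the `review_export` function are all hypothetical, and an actual honest broker review involves human judgment about the content itself, not just column names.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical column names treated as patient-level in this sketch.
PATIENT_LEVEL_COLUMNS = {"patid", "mrn", "birth_date"}

@dataclass
class ExportRequest:
    requester: str
    columns: List[str]
    approved_by_broker: bool = False

def review_export(request: ExportRequest) -> str:
    """Return 'release' or 'hold_for_review' for a requested export."""
    requested = {column.lower() for column in request.columns}
    if not requested & PATIENT_LEVEL_COLUMNS:
        return "release"  # aggregate output: no patient-level columns present
    if request.approved_by_broker:
        return "release"  # patient-level output with explicit permission
    return "hold_for_review"

aggregate = ExportRequest("analyst", ["site", "n_patients"])
patient_level = ExportRequest("analyst", ["PATID", "result_num"])
approved = ExportRequest("analyst", ["patid", "result_num"], approved_by_broker=True)

print(review_export(aggregate), review_export(patient_level), review_export(approved))
```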
Examples of organizations that use protected computing environments include CMS and the Department of Veterans Affairs. Such environments also exist within many academic medical centers; one example is PACE (Protected Analytics Computing Environment). When used as part of a pragmatic clinical trial, prospective study data can be uploaded into the computing enclave, linked with the relevant records stored there, and used in the resulting analysis. Summary or analysis datasets can then be downloaded by the study team. Despite these extra steps, there can be benefits to using a protected computing environment. In some cases, it is the only way to gain access to a particular data source, while in others, the data may be refreshed or updated more frequently, since it is not necessary to generate stand-alone flat files. As concerns grow about the organizational risk of sharing sensitive patient information, these protected environments are likely to increase in prevalence.
Real-world data have particular relevance to pragmatic clinical trials, as they generally represent data collected or generated in the course of routine operations. There are many different types of real-world data and different approaches by which to obtain them. However, real-world data sources are not interchangeable, and any given source may not be applicable for a specific study. As a result, care must be taken to ensure that the real-world data source aligns with the study in question and that the data are obtained in a format that supports the proposed analysis.
Resources
Approaches to Patient Follow-Up for Clinical Trials: What’s the Right Choice for Your Study?
NIH Pragmatic Trials Collaboratory PCT Grand Rounds; March 1, 2019
Data Linkage Within, Across, and Beyond PCORnet
NIH Pragmatic Trials Collaboratory PCT Grand Rounds; November 9, 2018
REFERENCES
D'Amore JD, Mandel JC, Kreda DA, et al. 2014. Are meaningful use stage 2 certified EHRs ready for interoperability? Findings from the SMART C-CDA Collaborative. J Am Med Inform Assoc. 21(6):1060-1068. doi: 10.1136/amiajnl-2014-002883. PMID: 24970839.