Rethinking Clinical Trials

A Living Textbook of Pragmatic Clinical Trials

Assessing Fitness for Use of Real-World Data Sources

Section 6

Data Source Accuracy: Case Study from TRANSLATE-ACS

Contributors
Eric L. Eisenstein, DBA
Kevin J. Anstrom, PhD
Meredith Zozus, PhD
Davera Gabriel, RN
Keith A. Marsolo, PhD
Bradley G. Hammill, PhD
Miguel Vazquez, MD
Lesley H. Curtis, PhD

Contributing Editor
Karen Staman, MS

Data Source Accuracy

Although numerous methods have been proposed for collecting inpatient event data, there is little comparative research regarding the relative accuracy of these methods and the resulting implications for pragmatic clinical trial design. Comparative accuracy research is needed to allow clinical trial planners to understand the limitations associated with different data collection methods, properly estimate inpatient event rates, and inform their sample size estimates.

As an example, the Treatment With Adenosine Diphosphate Receptor Inhibitors: Longitudinal Assessment of Treatment Patterns and Events After Acute Coronary Syndrome (TRANSLATE-ACS; ClinicalTrials.gov Identifier: NCT01088503) study was designed to evaluate the use of prasugrel and other ADP receptor inhibitor therapies among participants with myocardial infarction (MI) treated with percutaneous coronary intervention (PCI) (Chin et al. 2011). It is one of the few multi-center studies to compare inpatient event data collection methods (Krishnamoorthy et al. 2016; Guimarães et al. 2017). These investigators compared (a) patient-reported versus physician-adjudicated MI events and (b) hospital bill-derived versus physician-adjudicated MI events. We use this as a case study because it provides insights into the limitations inherent in using patient-reported and medical claims data as the sole sources for determining inpatient endpoints.

Case Study: TRANSLATE ACS

TRANSLATE-ACS was an observational study that enrolled 12,365 patients with acute myocardial infarction. After discharge, a centralized call center conducted telephone interviews with patients at 6 weeks and at 6, 12, and 15 months of follow-up. During these interviews, patients were asked about re-hospitalizations; in a subset of interviews, they were also asked for the reason for each re-hospitalization. Patient-reported re-hospitalizations, together with automatic queries to the enrolling hospital at 12 months of follow-up, triggered a source document collection process to obtain objective evidence, such as a claim or bill for the hospitalization. In a second stage, the investigators requested the patient’s medical records when a hospital bill indicated the patient may have experienced a major adverse cardiovascular event (MACE). An independent physician committee (a clinical events committee, or CEC) then adjudicated the medical records to validate MACE events. In this study’s design, hospital bill collection therefore depended on prior patient interviews, and physician adjudication depended on prior triggering of events, collection of source documents for those events, and expert adjudication based on those documents. Subsequent analyses compared results from these different inpatient event data collection methods through 12 months of follow-up. At one year after MI, event rates for MI, stroke, and bleeding were lower when medical claims were used to identify events than when events were adjudicated by physicians (Guimarães et al. 2017).

Validating an inpatient event data collection method is conceptually similar to validating a computable phenotype. The data collection method must identify the endpoint event it purports to identify and meet a desired accuracy level when compared with the best available method for assessing that event (i.e., the gold standard, here physician-adjudicated medical records) (Richesson and Smerek 2014). Without such validation, the researcher cannot confidently state that the data support the conclusions; with it, the researcher can provide evidence and will have quantified the uncertainty. There is a trade-off between a definition that can be applied consistently (across multiple reviewers in the case of a CEC, or across EHR data from multiple facilities in the case of computational definitions over EHR data) and clinical accuracy (agreement with the truth). The existence of this trade-off underscores the need to measure the inaccuracy.

We define “data collection ascertainment accuracy” using three metrics: sensitivity, specificity, and positive predictive value (PPV). We also define the Type I and Type II errors associated with data collection accuracy. While these data collection errors may influence traditional hypothesis-testing Type I and Type II error rates, they are distinct and represent another factor to be considered in pragmatic trial sample size estimation. Essentially, traditional sample size estimation methods make implicit assumptions regarding data accuracy; here, we make those assumptions explicit.

 

For TRANSLATE-ACS, data collection ascertainment accuracy was defined as follows:

  • True positive: both physician adjudication and patient report (or hospital bill) indicate that an MI event occurred.
  • False negative: physician adjudication indicates that an MI event occurred, but according to patient report (or hospital bill), it did not.
  • False positive: patient report (or hospital bill) indicates that an MI event occurred, but according to physician review of the medical record, it did not.
  • True negative: both physician adjudication and patient report (or hospital bill) indicate that no event occurred.

 

                                                Gold Standard Condition
                                     (physician adjudication indicates an event occurred)
                                                Yes                      No
Comparator Condition          Yes               True positive (TP)       False positive (FP)
(patient report or hospital
bill indicates an event
occurred)                     No                False negative (FN)      True negative (TN)
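As a minimal sketch (in Python; the function names are ours for illustration, not from the TRANSLATE-ACS analysis), the three accuracy metrics defined formally in the paragraphs that follow are computed from the four cells of this table:

```python
# Minimal sketch: ascertainment accuracy metrics from the 2x2 table above.
# Function names are illustrative, not from the TRANSLATE-ACS analysis.

def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of gold-standard events the comparator also finds."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of gold-standard non-events the comparator rules out."""
    return tn / (tn + fp)

def positive_predictive_value(tp: int, fp: int) -> float:
    """Share of comparator-identified events that are true events."""
    return tp / (tp + fp)
```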

 

Sensitivity is the true positive (TP) rate: in this case, the proportion of actual inpatient MI events that are identified by a given inpatient event data collection method. Sensitivity is measured using the number of true positives (TP) and false negatives (FN) with this formula: TP/(TP + FN).

Sensitivity helps gauge the possibility of a Type II error (failing to reject the null hypothesis when it is false) (Sharma et al. 2009). Sharma et al. (2009) define the Type II error rate with respect to measurement as 1 − sensitivity. As shown in the table below, the sensitivity of patient-reported MI events in TRANSLATE-ACS is quite low, meaning that patients tended to understate their MI events, while the sensitivity of hospital bills is higher and improves with the number of ICD diagnosis codes used to identify the MI event.

 

Myocardial Infarction Event Data Sources

Standard                 Comparator          TP    FP    FN    TN     Sensitivity  Specificity  PPV
                                                                      TP/(TP+FN)   TN/(TN+FP)   TP/(TP+FP)
Physician adjudicated    Patient report      103   257   254   --     0.289        --           0.286
Physician adjudicated    Hospital bill
                           1st dx code       482    66   264   1145   0.646        0.945        0.880
                           2nd dx code       588    90   158   1121   0.788        0.926        0.867
                           All dx codes      625   103   121   1108   0.838        0.915        0.859
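To make the derivation concrete, the following sketch recomputes the table’s metric columns from the reported raw counts (the TRANSLATE-ACS authors did not report true negatives for patient report, so its specificity is left undefined):

```python
# Sketch: recompute the table's metrics from its raw counts (TRANSLATE-ACS, as
# reported). TN was not reported for patient report, so specificity is skipped.

rows = {
    "Patient report":         (103, 257, 254, None),
    "Hospital bill, 1st dx":  (482,  66, 264, 1145),
    "Hospital bill, 2nd dx":  (588,  90, 158, 1121),
    "Hospital bill, all dx":  (625, 103, 121, 1108),
}

for source, (tp, fp, fn, tn) in rows.items():
    sens = tp / (tp + fn)                                        # TP/(TP+FN)
    ppv = tp / (tp + fp)                                         # TP/(TP+FP)
    spec = f"{tn / (tn + fp):.3f}" if tn is not None else "--"   # TN/(TN+FP)
    print(f"{source:<23} sens={sens:.3f}  spec={spec}  ppv={ppv:.3f}")

# Patient report          sens=0.289  spec=--     ppv=0.286
# Hospital bill, 1st dx   sens=0.646  spec=0.945  ppv=0.880
# Hospital bill, 2nd dx   sens=0.788  spec=0.926  ppv=0.867
# Hospital bill, all dx   sens=0.838  spec=0.915  ppv=0.859
```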

 

Specificity is the true negative (TN) rate: in this case, the proportion of hospitalizations without an MI that a given inpatient event data collection method correctly identifies as non-events. Specificity is measured using the number of true negatives (TN) and false positives (FP) with this formula: TN/(TN+FP).

Specificity helps gauge the possibility of a Type I error (rejecting the null hypothesis when it is true) (Sharma et al. 2009). Sharma et al. (2009) define the Type I error rate with respect to measurement as 1 − specificity. In the table above, there is no specificity for patient-reported MI events because the TRANSLATE-ACS authors did not report the associated true negatives. In contrast with sensitivity, the specificity of hospital bill inpatient event data collection is high but decreases as the number of ICD diagnosis codes used to identify the MI event increases.

Positive Predictive Value (PPV) is defined as the proportion of actual inpatient events among those identified by a given inpatient event data collection method (TP/(TP + FP)) and helps gauge the precision associated with an inpatient event data collection method. As with sensitivity, the PPV was low for patient report and higher for hospital bills, although, unlike sensitivity, it decreases slightly as the number of ICD diagnosis codes used to identify the MI event increases.

We can extend these metrics to compute the measurement analogues of the Type I (1 − specificity) and Type II (1 − sensitivity) error rates (Sharma et al. 2009), as shown in the table below. For hospital bills using all diagnosis codes, the Type I error rate is 0.085 (1 − 0.915) and the Type II error rate is 0.162 (1 − 0.838).

 

Myocardial Infarction Event Data Sources: Type I and Type II Error Rates

Standard                 Comparator              Type I               Type II
                                                 (false positive;     (false negative;
                                                 1 - specificity)     1 - sensitivity)
Physician adjudicated    Patient report          --                   0.711
Physician adjudicated    Hospital bill
                           1st diagnosis code    0.055                0.354
                           2nd diagnosis code    0.074                0.212
                           All diagnosis codes   0.085                0.162
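A short sketch of the arithmetic behind this table, using the sensitivity and specificity values reported above (the dictionary layout is ours; Type I cannot be computed for patient report because specificity was not reported):

```python
# Sketch of the error-rate arithmetic (Sharma et al. 2009):
# Type I = 1 - specificity, Type II = 1 - sensitivity.
# Sensitivity/specificity values are taken from the accuracy table above.

metrics = {
    "Patient report":         (0.289, None),   # specificity not reported
    "Hospital bill, 1st dx":  (0.646, 0.945),
    "Hospital bill, 2nd dx":  (0.788, 0.926),
    "Hospital bill, all dx":  (0.838, 0.915),
}

for source, (sens, spec) in metrics.items():
    type_i = f"{1 - spec:.3f}" if spec is not None else "--"  # false positive rate
    type_ii = f"{1 - sens:.3f}"                               # false negative rate
    print(f"{source:<23} Type I = {type_i}  Type II = {type_ii}")
```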

 

Our Type I and Type II error metrics tell us that the combined patient report and hospital bill data collection methods will both miss a number of true MIs (false negatives) and include events that are not true MIs (false positives). One reason for false positives may be the stringent definitions used when adjudicating endpoints in traditional trials. Nonetheless, this information helps us understand how the accuracy of data collection methods may affect sample size estimation and subsequent hypothesis testing.

Metrology (the science of measurement) evaluates the reliability of measurements in terms of their objectivity and intersubjectivity (Mari et al. 2012). Objectivity guarantees that measurement results are independent of their context (e.g., properties of the object being measured, the measurement system, and the person who is measuring). To gauge objectivity, we can compare the TRANSLATE-ACS hospital billing data collection results with those from the Women’s Health Initiative (WHI) (Hlatky et al. 2014). In the WHI study, physician adjudicators used standardized forms and definitions to review patient medical records and identify MI events. Inpatient hospital bill MI events were then identified by linking study patients with their CMS Medicare Part A claims. The resulting analysis compared physician-adjudicated MIs only with CMS claims-identified MIs in the first or second diagnosis codes. The resulting sensitivity (0.790) and specificity (0.988) are close to their corresponding TRANSLATE-ACS values; however, the PPV (0.708) is lower. This difference illustrates how data from different facilities, units, and providers (and from different patients or patient subgroups) may yield different and biased answers. A researcher’s options are to (1) measure the impact, or (2) measure the range and location of the errors, simulate the impact, and show that the error is inconsequential (or not) (Richesson et al. 2013).
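Part of the PPV difference can be seen arithmetically: PPV depends on the prevalence of true events among the records being classified, not just on sensitivity and specificity. As an illustration (this inversion is ours, not part of either study’s analysis), solving PPV = Se·p / (Se·p + (1 − Sp)(1 − p)) for p recovers the prevalence implied by each study’s reported values:

```python
# Hedged sketch: PPV as a function of prevalence, sensitivity, and specificity.
# The helper is illustrative; the inputs are the studies' reported values.

def implied_prevalence(se: float, sp: float, ppv: float) -> float:
    """Prevalence of true events among classified records, solved from
    PPV = se*p / (se*p + (1 - sp)*(1 - p))."""
    return ppv * (1 - sp) / (se * (1 - ppv) + ppv * (1 - sp))

# TRANSLATE-ACS, hospital bills with all diagnosis codes
print(implied_prevalence(se=0.838, sp=0.915, ppv=0.859))  # ~0.38
# WHI, Medicare claims, 1st or 2nd diagnosis code
print(implied_prevalence(se=0.790, sp=0.988, ppv=0.708))  # ~0.04
```

Under this reading, WHI’s lower PPV is consistent with a much lower prevalence of true MI events among the records being classified, despite its higher specificity.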

Metrology’s intersubjectivity standard requires that measurement results convey both the measurement values and the degree of trust that should be attributed to those values (Mari et al. 2012). This information can be presented as a target measurement uncertainty that defines the minimal quality needed to support a specific decision. The implication is that if the actual uncertainty is greater than the target, the measurement value is not considered valid because it cannot support the intended use. This is similar to the use of Type I and Type II error rates in sample size estimation and hypothesis testing. In the example above, the measurement values are the MI event rates obtained by different data collection methods, and the degree of trust attributed to those values is expressed by their associated Type I and Type II error rates. The unanswered question is whether these rates meet minimal data quality accuracy standards. While it could be argued that data quality accuracy errors may have minimal effect upon clinical trial outcomes, the burden of proof lies with the investigator. Calls have been made to report data quality assessment results alongside published research results (Kahn et al. 2015). Those who use these data and rely upon the associated study results in decision-making should be aware of potential data quality accuracy limitations that may influence their use and interpretation of study results. For this reason, it is essential that minimum inpatient event data quality accuracy standards be developed.

One way to compensate for measurement error is to increase the sample size, but this increases financial costs and risks. Alternatives are to (1) improve the reliability of raters through training or (2) use multiple methods for ascertaining information (Perkins et al. 2000). Meaningful reductions in sample size have been achieved by using the mean of multiple sources and other improvements in reliability (Perkins et al. 2000). In any case, investigators will need to determine the data collection methods most appropriate for their study before designing it and making their sample size estimates.
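As a hedged illustration of the stakes (the event rates and design below are hypothetical, not from TRANSLATE-ACS or Perkins et al.), the following sketch applies the standard two-proportion sample size formula before and after nondifferential outcome misclassification, under which the observed event rate is p_obs = Se·p + (1 − Sp)(1 − p):

```python
# Hedged sketch: effect of outcome misclassification on a two-arm sample size.
# Assumes nondifferential misclassification; all event rates are hypothetical.
from math import ceil
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-proportion sample size per arm (unpooled normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil((z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2)

def observed_rate(p_true: float, se: float, sp: float) -> float:
    """Event rate actually recorded when the outcome is misclassified."""
    return se * p_true + (1 - sp) * (1 - p_true)

p_ctl, p_trt = 0.10, 0.07        # hypothetical true one-year event rates
se, sp = 0.838, 0.915            # hospital bills, all diagnosis codes (table above)

print(n_per_arm(p_ctl, p_trt))                           # perfect measurement: ~1353
print(n_per_arm(observed_rate(p_ctl, se, sp),
                observed_rate(p_trt, se, sp)))           # misclassified: ~3897
```

Under these illustrative assumptions, misclassification nearly triples the required enrollment, which is the sense in which data collection accuracy enters sample size estimation.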

The example above uses CEC-adjudicated endpoints as the gold standard. However, data collection methods will vary both in their accuracy against this gold standard and in their applicability to clinical practice. Previous research has demonstrated that the accuracy of claims data is high for cardiovascular procedures (e.g., CABG surgery and PCI) and much lower for bleeding events. Perhaps this is because the bleeding event definitions used in explanatory trials (Mehran et al. 2011) are not relevant for clinical practice and could be replaced by the number of transfusions. Similarly, the explanatory trial CEC myocardial infarction definition may differ from that commonly used in clinical practice.

All data collection methods are associated with type I and II errors. These error rates will vary by endpoint and data collection method. These errors typically are not accounted for in sample size estimates. Research is needed to determine these error rates and how they may influence sample size estimates. It is also the case that certain endpoints used in explanatory trials may have little relevance in actual practice and may not be recorded in EHRs. Because of these issues, the pragmatic trial community needs to collectively determine which endpoints are relevant for pragmatic trials, how they can be measured and validated, and how the accuracy of these measurement methods may impact hypothesis testing sample size estimates.

 

CHAPTER SECTIONS

  1. Introduction
  2. Defining Fitness for Use
  3. Evaluating Fitness for Use
  4. Data Quality Measures
  5. Use of Medicare Data in PCTs
  6. Data Source Accuracy: Case Study from TRANSLATE-ACS
  7. Data Provenance
  8. Operationalizing Fitness-for-Use Assessments

REFERENCES


Chin CT, Wang TY, Anstrom KJ, et al. 2011. Treatment with adenosine diphosphate receptor inhibitors-longitudinal assessment of treatment patterns and events after acute coronary syndrome (TRANSLATE-ACS) study design: expanding the paradigm of longitudinal observational research. Am Heart J. 162:844–851. doi:10.1016/j.ahj.2011.08.021.

Guimarães PO, Krishnamoorthy A, Kaltenbach LA, et al. 2017. Accuracy of Medical Claims for Identifying Cardiovascular and Bleeding Events After Myocardial Infarction: A Secondary Analysis of the TRANSLATE-ACS Study. JAMA Cardiol. 2:750–757. doi:10.1001/jamacardio.2017.1460.

Hlatky MA, Ray RM, Burwen DR, et al. 2014. Use of Medicare Data to Identify Coronary Heart Disease Outcomes in the Women’s Health Initiative. Circulation: Cardiovascular Quality and Outcomes. 7:157–162. doi:10.1161/CIRCOUTCOMES.113.000373.

Kahn MG, Brown JS, Chun AT, et al. 2015. Transparent reporting of data quality in distributed data networks. EGEMS (Wash DC). 3:1052. doi:10.13063/2327-9214.1052.

Krishnamoorthy A, Peterson ED, Knight JD, et al. 2016. How Reliable are Patient-Reported Rehospitalizations? Implications for the Design of Future Practical Clinical Studies. J Am Heart Assoc. 5. doi:10.1161/JAHA.115.002695.

 


Mari L, Carbone P, Petri D. 2012. Measurement Fundamentals: A Pragmatic View. IEEE Trans Instrum Meas. 61:2107–2115. doi:10.1109/TIM.2012.2193693.

Mehran R, Rao SV, Bhatt DL, et al. 2011. Standardized bleeding definitions for cardiovascular clinical trials: a consensus report from the Bleeding Academic Research Consortium. Circulation. 123:2736–2747. doi:10.1161/CIRCULATIONAHA.110.009449.

Perkins DO, Wyatt RJ, Bartko JJ. 2000. Penny-wise and pound-foolish: the impact of measurement error on sample size requirements in clinical trials. Biol Psychiatry. 47:762–766.

Richesson RL, Smerek M. 2014. Electronic Health Records-Based Phenotyping. In: Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials. NIH Health Care Systems Research Collaboratory. Published June 27, 2014.

Sharma D, Yadav UB, Sharma P. 2009. The concept of sensitivity and specificity in relation to two types of errors and its application in medical research. J Reliability Stat Stud. 2:53–58.


Version History

Update January 17, 2021: Moved from “Inpatient Outcomes” chapter to “Assessing Fitness-for-Use” chapter (changes made by K. Staman).

Published June 19, 2019.

Citation:

Eisenstein E, Anstrom K, Zozus M, et al. Assessing Fitness for Use of Real-World Data Sources: Data Source Accuracy: Case Study from TRANSLATE-ACS. In: Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials. Bethesda, MD: NIH Pragmatic Trials Collaboratory. Available at: https://rethinkingclinicaltrials.org/chapters/conduct/assessing-fitness-for-use-of-real-world-data-sources/data-source-accuracy-case-study-from-translate-acs/. Updated October 9, 2024. DOI: 10.28929/187.
