Grand Rounds September 26, 2025: Significance in ePCTs: P Values vs Decision-Maker Perspectives (Gregory E. Simon, MD, MPH; Susan Huang, MD, MPH; Elizabeth Turner, PhD)

Speakers

Gregory E. Simon, MD, MPH
Kaiser Permanente Washington Health Research Institute

Susan Huang, MD, MPH
University of California Irvine

Elizabeth Turner, PhD
Duke University

Keywords

P-Values; Significance; Statistical Analysis; Pragmatic Trials; Decision-Makers

Key Points

  • P-values are part of the statistical process of hypothesis and significance testing. They quantify the degree of “surprise” in a finding. The result is treated dichotomously: a P-value of less than 0.05 is considered statistically significant, while a P-value greater than or equal to 0.05 is not.
  • 0.05 is a useful but somewhat arbitrary cutoff. It was probably first described in Statistical Methods for Research Workers by R. A. Fisher: “It is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not.” According to an anecdote shared by Fisher’s daughter, he identified the cutoff as “convincing enough” based on an informal experiment with a colleague.
  • Using a single threshold to determine significance can be problematic in real-world settings. Healthcare decision-makers are seeking solutions to multi-dimensional problems, and they care about subgroups. Dr. Huang illustrated this point with an overview of the ABATE Infection trial and her team’s subsequent collaboration with decision-makers.
  • ABATE Infection was a pragmatic, cluster-randomized trial assessing universal decolonization in non-ICUs. While decolonization did not show a significant benefit across all non-ICU patients, a post-hoc analysis found that the intervention was highly effective in patients with medical devices. This finding was practically significant and was incorporated into national guidance on decolonization.
  • In a cost-effectiveness analysis of universal, targeted, or no decolonization for patients with medical devices, the ABATE team found that the optimal strategy depended on site-specific circumstances: the prevalence of device use, adherence to targeted decolonization, and financial penalties for bloodstream infections.
  • For years, experts have questioned the reliance on P-values. On the other hand, there are concerns that abandoning formal null-hypothesis significance testing could prove a slippery slope to data dredging and “post-hoc chicanery.”
  • The dogma of the P-value may be more applicable to a clinical trial setting than to a pragmatic setting. Establishing the standard of care requires a high level of certainty. Scientific rigor demands rules and a threshold that isn’t affected by cost.
  • In hospitals, clinical decisions are rarely based on certainty; safe interventions that are low-cost and have a possible benefit are given more consideration. Decision-makers should understand the probability of benefit at a given P-value; circumstances may warrant adoption.
  • In pragmatic trials, valuable information may include the intervention effect size, the effect for various outcomes and on various subgroups, and information pertinent to implementation: fidelity, reach, cost, etc.
  • Decision-making is complex and multidimensional. What is important may depend on context, audience, or other situational factors. While P-values can be useful in decision-making, they aren’t the only piece of the puzzle.
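The fragility of a dichotomous significance verdict can be shown with a small sketch (illustrative, not from the talk): two hypothetical trials with nearly identical observed effects can land on opposite sides of the 0.05 line. The event counts and the normal-approximation z-test below are assumed for illustration only.

```python
import math

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value for a difference in proportions (normal approximation)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail probability of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))

# Two hypothetical trials, 1000 patients per arm, nearly identical observed effects:
p1 = two_proportion_p_value(60, 1000, 85, 1000)  # 6.0% vs 8.5% event rates
p2 = two_proportion_p_value(63, 1000, 85, 1000)  # 6.3% vs 8.5% event rates
print(f"Trial 1: p = {p1:.3f}, 'significant': {p1 < 0.05}")
print(f"Trial 2: p = {p2:.3f}, 'significant': {p2 < 0.05}")
```

Three extra events out of 1000 flip the verdict from “significant” to “not significant,” even though the two trials tell a decision-maker nearly the same thing.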

Discussion Themes

Changing the reliance on P-values would require a multi-pronged, multi-dimensional approach; sponsors, journals, and other stakeholders each uphold the use of P-values for their own reasons. Perhaps the best way to start integrating this perspective shift into the clinical trials ecosystem is for individual researchers to hold the line: routinely seeking and providing information about a variety of outcomes and confidence levels.

If we hold that the underlying but unknown truth is fixed, then our process for arriving at conclusions regarding a treatment’s effectiveness (or whether the treatment has a favorable benefit-risk profile) inherently has important operating characteristics, such as the Type I error rate. If we move away from P-values, we will need to define a design approach that considers these operating characteristics.
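As a concrete illustration of what “operating characteristics” means here (a sketch under assumed trial sizes and distributions, not from the talk): under a true null, a p < 0.05 rule rejects about 5% of the time, and the Type I error rate of any alternative decision rule could be checked by the same kind of simulation.

```python
import math
import random

def two_sample_p_value(x, y):
    """Two-sided p-value for a difference in means (large-sample z-test)."""
    n, m = len(x), len(y)
    mx, my = sum(x) / n, sum(y) / m
    vx = sum((v - mx) ** 2 for v in x) / (n - 1)
    vy = sum((v - my) ** 2 for v in y) / (m - 1)
    z = (mx - my) / math.sqrt(vx / n + vy / m)
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
n_trials, rejections = 2000, 0
for _ in range(n_trials):
    # Both arms drawn from the same distribution: the null hypothesis is true.
    control = [random.gauss(0, 1) for _ in range(100)]
    treated = [random.gauss(0, 1) for _ in range(100)]
    if two_sample_p_value(control, treated) < 0.05:
        rejections += 1

# The empirical rejection rate under the null approximates the Type I error rate.
print(f"Empirical Type I error rate: {rejections / n_trials:.3f}")
```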

Maybe it’s more practical to think of homing in on a standard of care as an iterative process, in the way that human learning is iterative: we state that we know something to some degree of certainty, then modify, refine, and get closer to defining these truths.

Grand Rounds July 12, 2024: Causal Estimands: Should We Ask Different Causal Questions in Randomized Trials and in the Observational Studies That Emulate Them? (Miguel Hernán, MD)

Speaker

Miguel Hernán, MD
Professor of Biostatistics and Epidemiology
CAUSALab
Harvard T.H. Chan School of Public Health

Keywords

Statistical Analysis; Causal Estimands; Research Questions; Randomized Clinical Trials; Observational Studies; Intention-to-Treat Effect; Per-protocol Effect

Key Points

  • Researchers often look at the intention-to-treat (ITT) effect – i.e., the effect of being assigned to an intervention – for randomized trials, and at the per-protocol effect – i.e., the effect of receiving an intervention – for observational studies.
  • Why are we giving estimates for 2 different effects? If a causal question is important enough to be asked in an observational study, Dr. Hernán posed, then we should also ask it in a randomized trial, and vice versa.
  • He walked through some of the field’s justifications for asking different causal questions. If adherence patterns vary, the ITT effect may differ across trials that study the same treatment. Common arguments in favor of the ITT effect are that it preserves the null, is conservative, and measures real-world effectiveness.
  • But null preservation is not guaranteed; most pragmatic trials are not blinded, for instance, so assignment may make patients and doctors alter their behavior. Through hypotheticals and case examples, Dr. Hernán made the case that ITT effects are not necessarily a measure of effectiveness, nor are they of primary interest for doctors and patients.
  • The per-protocol effect represents another causal question that randomized trials could answer. The issue with per-protocol analysis is that a valid estimate of the effect requires adjustment for confounding. Historically, trialists have been suspicious of these analyses as potentially biased.
  • In the past few decades, however, statistical methods have been developed that allow researchers to adjust for confounders at baseline and after baseline. Provided a few criteria are met, there is hope for the estimation of per-protocol effects as a way to contrast generic treatment strategies.
  • For a long time, regulators were reluctant to consider causal effects other than the ITT effect. Under increasing pressure from industry, a group of methodologists from pharma and regulatory agencies worked together to generate the “estimand framework.” This manifested as an addendum to the ICH E9 guidelines.
  • The addendum was a great step forward for more rigorous causal inference, Dr. Hernán noted. But it has room for improvement, specifically in its choices of standard terminology and of estimands, and in its lack of emphasis on treatment strategies and trial design.
  • Researchers do not compare treatments but treatment strategies. Therefore, the causal contrast needs to specify the treatment strategies of interest; typically, these will be the treatment strategies specified in the protocol. Observational studies for causal inference already try to estimate well-defined per-protocol effects when emulating target trials. It’s time to turn to randomized trials.
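A toy simulation (illustrative assumptions, not from the talk) shows why an ITT estimate can be attenuated relative to the per-protocol effect under non-adherence. Note that adherence here is assigned at random, so no confounding arises; in real trials, adherence is typically confounded with prognosis, which is why valid per-protocol estimation requires the adjustment methods discussed above.

```python
import random

random.seed(1)
N = 100_000
P_EVENT_UNTREATED, P_EVENT_TREATED = 0.30, 0.20  # assumed true risks under each strategy
NONADHERENCE = 0.30  # assumed fraction assigned to treatment who never take it

# Simulate the treatment arm: some assigned patients never receive treatment.
events_assigned = 0
for _ in range(N):
    took_treatment = random.random() >= NONADHERENCE  # adherence is random here
    p = P_EVENT_TREATED if took_treatment else P_EVENT_UNTREATED
    events_assigned += random.random() < p

itt_risk = events_assigned / N
print(f"True per-protocol risk difference: {P_EVENT_TREATED - P_EVENT_UNTREATED:+.3f}")
print(f"ITT risk difference (vs. control): {itt_risk - P_EVENT_UNTREATED:+.3f}")
```

With 30% non-adherence, the ITT contrast recovers only about 70% of the per-protocol risk difference: being assigned to treatment is a different causal question than receiving it.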

Discussion Themes

The intention-to-treat effect is not always the primary interest. When there is a safety outcome or in a non-inferiority trial, the per-protocol effect is necessarily the estimand of interest.

When we say the ITT effect is biased towards the null, we mean it’s biased for the per-protocol effect. I.e., “I’m interested in the per-protocol effect because it’s going to be harder to find differences between treatments using ITT analysis.”

One issue with the estimand framework is that it gives the per-protocol effect little consideration: it is not named explicitly and is folded into a set of estimands that are all given equal importance. Dr. Hernán argued that the per-protocol effect should be made more explicit, as it is the main question we can ask of a trial besides the ITT effect.

The person who designs the trial is responsible for pre-specifying the procedure when a participant switches treatments.