Grand Rounds Biostatistics Series January 5, 2024: Methods for Handling Missing Data in Cluster Randomized Trials (Rui Wang, PhD; Moderator: Fan Li, PhD)

Speaker

Speaker: Rui Wang, PhD
Associate Professor of Population Medicine and Associate Professor in the Department of Biostatistics, Harvard Pilgrim Health Care Institute and Harvard Medical School

Moderator: Fan Li, PhD
Assistant Professor of Biostatistics, Yale School of Public Health

Keywords

Cluster-randomized; Intervention; Missingness; Outcomes

Key Points

  • Cluster randomized trials are trials in which clusters of individuals rather than independent individuals are randomly allocated to intervention groups, and outcomes are measured for each individual in the cluster.
  • Outcomes are sometimes missing. Significant work has been done on individual level missingness, where each cluster contributes observed outcomes but the outcome is missing for some individuals. Less work has focused on cluster level missingness, where the outcome is missing for an entire cluster of individuals, but researchers have made some progress in analyzing these cases.
  • Under the missing completely at random (MCAR) assumption, the missingness process is independent of all covariates, treatments, and outcomes, so the observed data still represent the underlying population about which you are making inference, and the analysis can proceed using complete cases. Unfortunately, MCAR rarely holds in practice, so people instead consider the missing at random (MAR) assumption, under which the missingness process is modeled using the observed data in cluster randomized trials.
  • This research addresses two challenges: outcomes can be missing, and data for individuals within the same cluster are likely to be correlated. The missing data framework focuses on making inferences about aspects of the distribution of the full data based on the observed data. Two approaches are commonly used: 1) mixed effect models fitted via maximum likelihood estimation; and 2) population average models fitted with generalized estimating equations (GEE); see the first sketch after this list.
  • The GEE estimator targets population average effects rather than cluster specific effects, requires fewer parametric assumptions, and is robust to misspecification of the working correlation structure. In the presence of missing data, if data are MCAR, the standard GEE estimator based on complete cases is consistent; when data are only MAR, it may yield biased estimates. Under MAR, possible solutions include multiple imputation, inverse probability weighting (see the second sketch after this list), augmented inverse probability weighting, and a “multiply robust” version of the augmented inverse probability weighting estimator. The multiply robust version also handles missingness of an entire cluster.
  • When fitting GEEs, you might run into the following computational challenges: second-order GEEs include an extra set of estimating equations involving all possible pairs of observations; the computational complexity increases quadratically as cluster sizes increase; and solving GEEs with large cluster sizes becomes difficult due to both convergence and memory allocation issues. Stochastic algorithms, which use only a subsample of the estimating-equation terms at each iteration, can circumvent these computational issues; see the third sketch after this list.
  • In summary, in the multilevel case, missingness can occur at either the cluster level or the individual level. If the cluster level missingness is MCAR, examine the individual level missingness mechanism; if that is also MCAR, proceed with complete case analysis, and if not, the GEE estimator in the form that best fits the situation is a plausible solution. When outcomes of entire clusters are MAR, use the “multiply robust” version of the GEE estimator to proceed with the analysis.
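The sketches below illustrate some of these ideas in Python; they are hypothetical illustrations built on simulated data, not the speaker's code. First, a minimal complete-case analysis with a population average model fitted by GEE, using the statsmodels package; the variable names and model terms are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulate a small CRT: 30 clusters of 20 individuals. Treatment is assigned
# at the cluster level, and a shared cluster effect induces within-cluster
# correlation.
n_clusters, m = 30, 20
cluster = np.repeat(np.arange(n_clusters), m)
treat = np.repeat(rng.integers(0, 2, size=n_clusters), m)
b = np.repeat(rng.normal(0.0, 0.5, size=n_clusters), m)  # cluster effect
x = rng.normal(size=n_clusters * m)                       # covariate
lin = -0.5 + 0.8 * treat + 0.3 * x + b
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
df = pd.DataFrame({"y": y, "treat": treat, "x": x, "cluster": cluster})

# Population average (marginal) model. The exchangeable working correlation
# only needs to be approximately right: the sandwich variance estimator is
# robust to its misspecification.
fit = smf.gee("y ~ treat + x", groups="cluster", data=df,
              family=sm.families.Binomial(),
              cov_struct=sm.cov_struct.Exchangeable()).fit()
print(fit.summary())
```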
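Second, a sketch of the inverse probability weighting idea under MAR: model the probability that an outcome is observed, then reweight complete cases by its inverse. Under a working-independence structure, the IPW estimating equations reduce to a weighted logistic regression, with cluster-robust standard errors accounting for within-cluster correlation. The two-step structure is the point; the missingness model shown is one assumed choice among many.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

def ipw_complete_case_fit(df):
    """Two-step IPW estimator for a binary outcome under MAR.

    Step 1 models Pr(outcome observed); Step 2 fits a weighted outcome
    model on complete cases. With working independence this solves the
    IPW estimating equations, and cluster-robust (sandwich) standard
    errors account for within-cluster correlation.
    """
    df = df.copy()
    df["observed"] = df["y"].notna().astype(int)

    # Step 1: missingness model on everyone (logistic here, but any
    # sensible model for Pr(observed) could be used).
    pi_fit = smf.logit("observed ~ treat + x", data=df).fit(disp=False)
    df["pi"] = pi_fit.predict(df)

    # Step 2: weight complete cases by 1 / Pr(observed).
    cc = df[df["observed"] == 1]
    out = smf.glm("y ~ treat + x", data=cc,
                  family=sm.families.Binomial(),
                  freq_weights=np.asarray(1.0 / cc["pi"]))
    return out.fit(cov_type="cluster", cov_kwds={"groups": cc["cluster"]})

# Hypothetical usage, continuing from the simulated df and rng above:
# impose MAR missingness, with Pr(observed) depending on treat and x.
df["y"] = df["y"].astype(float)
p_obs = 1.0 / (1.0 + np.exp(-(1.0 + 0.5 * df["treat"] - 0.7 * df["x"])))
df.loc[rng.uniform(size=len(df)) > p_obs, "y"] = np.nan
print(ipw_complete_case_fit(df).summary())
```

Augmented IPW adds an outcome-regression term to this estimator, giving consistency if either the missingness model or the outcome model is correct; the multiply robust version extends that protection to several candidate models at once.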
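Third, a toy Robbins-Monro style illustration of the stochastic idea: each iteration evaluates the estimating function on a random subsample rather than on all terms, so the per-iteration cost does not grow with cluster size. This sketches the general principle only, not the specific algorithm from the talk; for second-order GEEs the subsampling would be over pairs of observations.

```python
import numpy as np

def stochastic_ee_logistic(X, y, n_iter=20000, batch=50, seed=0):
    """Stochastic solver for the logistic estimating equations
    sum_i x_i * (y_i - expit(x_i' beta)) = 0, using only a random
    subsample of terms per iteration.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for t in range(1, n_iter + 1):
        idx = rng.choice(n, size=batch, replace=False)   # subsample of terms
        mu = 1.0 / (1.0 + np.exp(-X[idx] @ beta))
        g = X[idx].T @ (y[idx] - mu) / batch             # subsampled estimating function
        # Robbins-Monro step sizes: their sum diverges, squares converge.
        beta += g / t ** 0.6
    return beta
```

With step sizes satisfying the usual Robbins-Monro conditions, the iterates converge to the estimating-equation root, and the per-iteration cost stays at O(batch) terms regardless of how large the clusters are.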


Discussion Themes

-If the cluster size is small, we may have a propensity score estimated at zero or one. How does the EM algorithm help in this context? The EM algorithm doesn't help if the number of clusters is small. When you fit the model for the probability of missingness, that by itself doesn't help with the extreme weights problem. When we fit different models, we would look at different parameters (for example, the gamma parameter, or another model-specific parameter), depending on the model and whether the individual outcome is missing or not. Across these situations, estimating certain probabilities at zero would cause the resulting estimators to no longer target the quantities we intend. The EM algorithm helps with that problem. However, if you don't have enough data, the estimation is not going to be stable.

-One cautionary note when creating the propensity weights using logistic regression is that there are issues with non-collapsibility, and it's difficult to gauge what kind of bias this introduces. It might be better to use a marginal, binary indicator model. That's a great point. When you have a non-identity link, you get the non-collapsibility issue, so other models can be better for estimation. In the end, we just need a model for predicting the probability of being observed. It's important to note that these approaches are not wedded to logistic regression, as long as we have a sensible model for the missingness.
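To make this concrete: the weighting step only needs predicted probabilities of being observed, so the logistic model in the IPW sketch above could be swapped for any sensible alternative. A hypothetical variant of that sketch's missingness step using a probit link:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_missingness_probit(df):
    """Predicted Pr(outcome observed), using a probit rather than logit link.

    Drop-in replacement for Step 1 of the IPW sketch; df is assumed to
    carry the same hypothetical columns (observed, treat, x).
    """
    model = smf.glm(
        "observed ~ treat + x", data=df,
        family=sm.families.Binomial(link=sm.families.links.Probit()),
    )
    return model.fit().predict(df)
```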

Tags

#pctGR, @Collaboratory1