
Beyond the Standard Model: Multilevel and Multivariate Meta-Analysis for Complex Clinical Evidence

Dr. Marc Harms


This is Part 2 of a series on meta-analysis in clinical evidence synthesis. Part 1: From Effect Sizes to Regulatory Decision-Making covers the standard meta-analytic toolkit.

Introduction: When Standard Models Fall Short

In Part 1 of this series, we covered the standard meta-analytic toolkit: equal- and random-effects models, meta-regression, heterogeneity quantification, and publication bias. These methods work well when each study contributes exactly one independent effect size estimate.

In practice, clinical evidence is rarely that tidy. A single study may report outcomes at multiple time points. A systematic review of surgical techniques may extract complication rates for several distinct endpoints from the same trial. Multiple publications from the same research group may share subjects or methodological features that induce correlation between their results. A multi-arm trial comparing three treatments against a common control generates effect sizes that are inherently dependent because they share the same control group data.

Ignoring these dependencies — by pretending all estimates are independent — leads to incorrect standard errors, inflated Type I error rates, and confidence intervals that are too narrow (Hedges et al., 2010). The standard random-effects model simply cannot accommodate this kind of structure. What we need are multilevel and multivariate extensions that explicitly model the dependencies in the data.

This post covers these advanced methods, drawing on the framework implemented in the rma.mv() function of the metafor package (Viechtbauer, 2010) and following the approach taught in Wolfgang Viechtbauer's Advanced Meta-Analysis Course.

The Independence Assumption and Why It Breaks

The standard random-effects model assumes that all observed outcomes are independent:

y_i = \mu + u_i + e_i

where u_i \sim N(0, \tau^2) captures between-study heterogeneity and e_i \sim N(0, v_i) captures sampling error. The implied variance-covariance matrix is diagonal — no off-diagonal elements, no covariances between studies.
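To make this concrete, here is a small numeric sketch in Python (with made-up effect sizes and an assumed \tau^2) of the diagonal variance-covariance matrix this model implies, and the inverse-variance pooled estimate it produces:

```python
import numpy as np

# Hypothetical effect sizes and sampling variances for five studies
yi = np.array([0.30, 0.10, 0.45, 0.25, 0.05])
vi = np.array([0.010, 0.020, 0.015, 0.008, 0.025])
tau2 = 0.02  # assumed between-study variance

# Under the model, Var(y_i) = v_i + tau^2 and all covariances are zero,
# so the implied variance-covariance matrix is diagonal:
V = np.diag(vi + tau2)

# Inverse-variance weighting gives the pooled estimate and its SE
w = 1.0 / (vi + tau2)
mu_hat = np.sum(w * yi) / np.sum(w)
se = np.sqrt(1.0 / np.sum(w))
```

Everything that follows in this post amounts to replacing that diagonal V with something richer.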

This assumption is violated whenever a subject (or a study, or a research group) contributes data to more than one effect size estimate. Common scenarios include multiple effect sizes from the same study (e.g., different outcome measures, subgroups, or time points), multiple studies from the same research group or clinical site, multi-arm trials where each active arm is compared against a shared control, and results reported in multiple publications from overlapping datasets.

Traditional workarounds — selecting one outcome per study, averaging dependent estimates, or running separate analyses on independent subsets — are straightforward but wasteful. They discard information, reduce statistical power, and can introduce selection bias in which estimate gets chosen (Moeyaert et al., 2017).

Multilevel Meta-Analytic Models

The Three-Level Model

The multilevel (or three-level) random-effects model extends the standard model by adding a clustering level. Consider the example from Konstantopoulos (2011): 56 studies of educational interventions nested within 11 school districts. The standard model treats all 56 studies as independent, but studies from the same district may share features that make their true effects more similar to each other than to effects from other districts.

The three-level model makes this structure explicit:

y_{ij} = \mu + w_i + u_{ij} + e_{ij}

Here, w_i \sim N(0, \sigma^2_B) is a cluster-level random effect capturing between-cluster heterogeneity (e.g., differences between districts), u_{ij} \sim N(0, \sigma^2_W) is an estimate-level random effect capturing within-cluster heterogeneity (differences between studies within the same district), and e_{ij} \sim N(0, v_{ij}) is the sampling error as before.

The key consequence is that the marginal variance-covariance matrix is no longer diagonal. Estimates from the same cluster share the \sigma^2_B component, inducing a block-diagonal structure where within-cluster observations are correlated. The implied within-cluster correlation of true outcomes is \rho = \sigma^2_B / (\sigma^2_B + \sigma^2_W). In the Konstantopoulos example, the estimated between-district variance was \hat{\sigma}^2_B = 0.065 and the within-district variance was \hat{\sigma}^2_W = 0.033, yielding an estimated within-cluster correlation of approximately 0.67 — substantial, and certainly not something you would want to ignore.
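The implied correlation and the block structure can be computed directly from the two variance components. A Python sketch using the rounded estimates quoted above (which give roughly 0.66; the unrounded estimates round to 0.67):

```python
import numpy as np

# Rounded variance components from the Konstantopoulos (2011) example
sigma2_B = 0.065  # between-district heterogeneity
sigma2_W = 0.033  # within-district heterogeneity

# Correlation of true effects for two studies in the same district
rho = sigma2_B / (sigma2_B + sigma2_W)

# Marginal covariance matrix of the true effects for a district that
# contributes three studies: sigma2_B off the diagonal,
# sigma2_B + sigma2_W on it
k = 3
G = np.full((k, k), sigma2_B) + sigma2_W * np.eye(k)
```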

Practical Implications

The pooled effect estimate changes when clustering is properly accounted for. In the Konstantopoulos example, the standard random-effects model yielded \hat{\mu} = 0.128 (SE = 0.044), while the multilevel model gave \hat{\mu} = 0.185 (SE = 0.085). The estimate shifted, and — critically — the standard error nearly doubled. This is the typical pattern: ignoring clustering tends to underestimate uncertainty.

The total heterogeneity can be decomposed across levels. In this example, \hat{\sigma}^2_B + \hat{\sigma}^2_W = 0.098 is the total heterogeneity, split roughly two-thirds between districts and one-third within districts. This decomposition is itself informative — it tells us where the variability in treatment effects primarily resides.

Implementation

In R's metafor package, the multilevel model is fitted using rma.mv() with nested random effects:

res <- rma.mv(yi, vi, random = ~ 1 | district/school, data = dat)

The syntax ~ 1 | district/school adds a random effect for each district and a random effect for each school-within-district combination. One can also use the equivalent multivariate parameterisation random = ~ school | district with struct = "CS" (compound symmetry), which directly estimates \tau^2 and \rho rather than \sigma^2_B and \sigma^2_W — mathematically identical but sometimes more interpretable.

A Common Error

A frequent mistake is replacing the estimate-level random effect with the cluster-level one rather than adding the cluster-level effect to the existing model. Specifying only random = ~ 1 | district (without the nested school term) assumes zero within-cluster heterogeneity — that is, it forces the within-cluster correlation to 1, which means all studies within the same cluster are assumed to share exactly the same true effect. This is almost never a reasonable assumption and can severely distort inference.

Model Checking

For any model with complex random effects, it is essential to verify that the variance components are identifiable. The profile() function in metafor plots the (restricted) log-likelihood as a function of each variance component. A well-identified component shows a clear peak at its estimated value. A flat profile indicates that the parameter cannot be reliably estimated from the data — the model may be overspecified for the available information.

Likelihood ratio tests (via anova()) can formally test whether specific variance components are significantly different from zero. Profile likelihood confidence intervals (via confint()) provide interval estimates for the variance components that are generally more reliable than Wald-type intervals, especially for parameters near boundary values.

Multivariate Meta-Analysis: Multiple Correlated Outcomes

The Problem

When studies measure multiple outcomes in the same subjects — say, probing depth (PD) and attachment level (AL) in a periodontal trial — the sampling errors of the two estimates are correlated because they come from the same patients. Additionally, the true effects for these outcomes may themselves be correlated across studies.

The multivariate random-effects model handles both sources of correlation. For two outcomes measured in each study, the model is:

y_{ij} = \mu_j + u_{ij} + e_{ij}

where \mu_j is the average true outcome for the j-th measure, the random effects u_{i1} and u_{i2} have an unstructured variance-covariance matrix (allowing different \tau^2 values for each outcome and a correlation \rho between them), and the sampling errors e_{i1} and e_{i2} have a known variance-covariance matrix (the V matrix) that includes both the sampling variances and the covariance between the two estimates within each study.

Constructing the V Matrix

The V matrix — the variance-covariance matrix of the sampling errors — is where multivariate meta-analysis gets both powerful and practically challenging. When two standardized mean differences are computed from the same study (e.g., for depression and anxiety), their covariance depends on the correlation between the two outcome variables in the primary study. Specifically, the sampling correlation between two SMDs is approximately equal to the correlation between the underlying variables (Gleser & Olkin, 2009).

The problem is that this correlation is rarely reported in primary studies. In practice, one must make an educated guess based on domain knowledge. The vcalc() function in metafor facilitates constructing the V matrix given an assumed correlation. Sensitivity analyses across a range of plausible correlations are strongly recommended.
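The construction that vcalc() automates can be sketched by hand. The Python snippet below (hypothetical sampling variances, an assumed correlation of 0.6) builds the block-diagonal V matrix for three studies that each report two SMDs, using the approximation that the sampling covariance of two SMDs from the same subjects is roughly r * sqrt(v1 * v2):

```python
import numpy as np

# Hypothetical sampling variances for two SMDs per study
# (e.g., depression and anxiety measured in the same patients)
vi = [(0.08, 0.10), (0.05, 0.06), (0.12, 0.11)]
r = 0.6  # assumed correlation between the two outcome variables

# Within a study, cov(SMD1, SMD2) ~ r * sqrt(v1 * v2);
# estimates from different studies are independent.
blocks = []
for v1, v2 in vi:
    cov = r * np.sqrt(v1 * v2)
    blocks.append(np.array([[v1, cov], [cov, v2]]))

# Assemble the block-diagonal V matrix
n = 2 * len(vi)
V = np.zeros((n, n))
for i, B in enumerate(blocks):
    V[2 * i:2 * i + 2, 2 * i:2 * i + 2] = B
```

Rerunning the analysis with r set to, say, 0.4 and 0.8 is the kind of sensitivity check recommended above.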

For some data types, the V matrix is exactly known. In the periodontal data from Berkey et al. (1998), for example, the full variance-covariance matrix of the sampling errors was reported for each study — the ideal scenario, but unfortunately the exception rather than the rule.

Variance-Covariance Structures for Random Effects

The rma.mv() function supports several structures for the random effects variance-covariance matrix. For two or more outcomes, the main options are:

Unstructured (UN): Each outcome has its own \tau^2 and each pair of outcomes has its own correlation. This is the most flexible but requires the most parameters — for p outcomes, there are p(p+1)/2 parameters.

Compound symmetry (CS): All outcomes share the same \tau^2 and the same pairwise correlation \rho. Parsimonious but restrictive.

Heteroscedastic compound symmetry (HCS): Each outcome has its own \tau^2 but all pairwise correlations are the same \rho.

Diagonal (DIAG): Each outcome has its own \tau^2 but all correlations are zero (no correlation between the true effects of different outcomes).

For two outcomes, UN and HCS are equivalent. Model selection between structures can be guided by likelihood ratio tests via anova().
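The four structures differ only in how the random-effects matrix is parameterised. A small Python sketch (hypothetical \tau^2 values, p = 3 outcomes) makes the parameter counts and shapes explicit:

```python
import numpy as np

p = 3  # number of outcomes

# Unstructured: p variances + p*(p-1)/2 correlations = p*(p+1)/2 parameters
n_un = p * (p + 1) // 2  # 6 parameters for p = 3

# Compound symmetry: one tau^2, one rho (2 parameters)
tau2, rho = 0.04, 0.5
CS = tau2 * (rho * np.ones((p, p)) + (1 - rho) * np.eye(p))

# Heteroscedastic CS: p variances, one shared rho (p + 1 parameters)
tau2_j = np.array([0.03, 0.05, 0.02])
sd = np.sqrt(tau2_j)
HCS = rho * np.outer(sd, sd)
np.fill_diagonal(HCS, tau2_j)

# Diagonal: p variances, correlations fixed at zero (p parameters)
DIAG = np.diag(tau2_j)
```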

Distinguishable vs. Non-Distinguishable Outcomes

A subtle but important distinction: when multiple effect sizes from the same study pertain to clearly different constructs (e.g., PD versus AL, or mortality versus morbidity), they are "distinguishable" and it makes sense to model separate average effects μ_j for each outcome. When multiple effect sizes pertain to the same construct and are not meaningfully distinct (e.g., results from different measurement instruments assessing quality of life), they are "non-distinguishable" and one typically pools them under a common mean.

Cluster-Robust Inference: A Safety Net

Even with careful model specification, there is always a risk that the V matrix or the random effects structure is misspecified. Cluster-robust inference methods (also called robust variance estimation or the sandwich method) provide a safety net.

The key insight: even if the working model for the variance-covariance structure is wrong, the estimated fixed effects (\hat{\mu} or the regression coefficients) remain unbiased. However, the standard errors may be incorrect. The sandwich estimator replaces the model-based variance estimate with one derived from the observed residuals, yielding standard errors that are asymptotically correct regardless of model misspecification (Hedges et al., 2010).

The recommended workflow, as described by Pustejovsky and Tipton (2022), is to construct the best V matrix you can, fit an appropriate multilevel or multivariate model, and then apply cluster-robust inference as a protective layer:

# Step 1: Construct V matrix
V <- vcalc(vi, cluster = study, type = outcome, rho = 0.6, data = dat)

# Step 2: Fit the model
res <- rma.mv(yi, V, mods = ~ outcome - 1,
              random = ~ outcome | study, struct = "UN", data = dat)

# Step 3: Apply cluster-robust inference
sav <- robust(res, cluster = study, clubSandwich = TRUE)

The clubSandwich = TRUE option invokes the bias-reduced linearisation (CR2) estimator with Satterthwaite degrees of freedom, which performs better in small samples than the standard Eicker-Huber-White sandwich estimator.

Note that robust variance estimation does not affect the variance component estimates. If the random effects structure is misspecified, the prediction intervals will still be off. Using a good working model therefore remains important — RVE is a supplement, not a substitute.
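The mechanics of the sandwich can be illustrated outside metafor. The Python sketch below uses toy numbers and the basic CR0 estimator (not the small-sample-corrected CR2 that clubSandwich applies) to show the bread-meat-bread construction for an intercept-only working model with inverse-variance weights:

```python
import numpy as np

# Toy data: 6 estimates from 3 clusters (hypothetical numbers)
yi = np.array([0.20, 0.30, 0.10, 0.40, 0.25, 0.15])
vi = np.array([0.02, 0.03, 0.02, 0.04, 0.03, 0.02])
cluster = np.array([1, 1, 2, 2, 3, 3])

X = np.ones((6, 1))    # intercept only: estimates the pooled mean
W = np.diag(1.0 / vi)  # working weights from the (possibly wrong) model

# GLS point estimate under the working model ("bread")
bread = np.linalg.inv(X.T @ W @ X)
beta = bread @ X.T @ W @ yi
resid = yi - X @ beta

# CR0 "meat": sum over clusters of X_j' W_j e_j e_j' W_j X_j
meat = np.zeros((1, 1))
for c in np.unique(cluster):
    m = cluster == c
    u = X[m].T @ W[np.ix_(m, m)] @ resid[m]
    meat += np.outer(u, u)

# Sandwich: bread * meat * bread
se_robust = float(np.sqrt((bread @ meat @ bread)[0, 0]))
```

The point estimate beta is the same as under the working model; only the standard error is recomputed from the clustered residuals.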

Meta-Analysis of Longitudinal Data

When studies report outcomes at multiple follow-up time points, the data have a natural longitudinal structure. The sampling errors within a study are correlated (the same patients are measured repeatedly), and the true effects may change systematically over time.

An autoregressive AR(1) structure is often a sensible model for both the sampling error correlations and the random effects correlation. Under this structure, the correlation between two time points decays as a function of the temporal distance between them: \text{corr}(t_1, t_2) = \rho^{|t_1 - t_2|}.
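For four equally spaced time points and \rho = 0.97, the implied correlation matrix follows directly from this formula (a Python sketch; the unit spacing is an assumption, and actual follow-up times could be substituted):

```python
import numpy as np

times = np.array([1, 2, 3, 4])  # time-point indices (assumed spacing)
rho = 0.97

# AR(1): correlation decays with the absolute distance between time points
R = rho ** np.abs(np.subtract.outer(times, times))
```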

In a meta-analysis of deep brain stimulation for Parkinson's disease (Ishak et al., 2007), outcomes were measured at up to four time points (3, 6, 12, and >12 months) across 46 studies. The V matrix was constructed with an assumed autocorrelation of \varphi = 0.97 for the sampling errors, and the true effects were modelled with a heteroscedastic AR(1) structure (different \tau^2 at each time point, common autocorrelation for the true effects). This approach allows one to formally test whether outcomes change over time while properly accounting for the correlation between estimates.

The metafor implementation uses:

V <- vcalc(vi, cluster = study, time1 = time, phi = 0.97, data = dat)
res <- rma.mv(yi, V, mods = ~ factor(time) - 1,
              random = ~ time | study, struct = "HAR", data = dat)

Arm-Based Models: An Alternative Perspective

Most clinical meta-analyses work with contrast-based measures (mean differences, log risk ratios) that summarise the comparison between two groups within each study. An alternative approach, the arm-based model, works directly with the arm-level data — the outcome in each treatment group separately.

For example, rather than meta-analysing log risk ratios (RR = risk_trt / risk_ctrl), one can model the log-transformed risks in each arm separately using a bivariate random-effects model. The log risk ratio then emerges as a contrast between the two arm-level pooled estimates.
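The equivalence of the two views rests on the log identity log(r_trt / r_ctrl) = log(r_trt) - log(r_ctrl). A minimal Python sketch with hypothetical arm-level counts:

```python
import numpy as np

# Hypothetical arm-level data: events / total per arm
events_trt, n_trt = 12, 200
events_ctrl, n_ctrl = 30, 210

risk_trt = events_trt / n_trt
risk_ctrl = events_ctrl / n_ctrl

# Contrast-based: log risk ratio computed within the study
log_rr = np.log(risk_trt / risk_ctrl)

# Arm-based: model log risks per arm; the log RR is their difference
log_rr_arm = np.log(risk_trt) - np.log(risk_ctrl)
```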

The advantage is flexibility. Arm-based models can simultaneously estimate absolute risks in each group (not just relative effects), accommodate studies that only report one arm, naturally handle multi-arm trials, and separate between-study heterogeneity in the treatment effect from heterogeneity in baseline risk.

In the BCG vaccine example, the arm-based model yields an estimated log risk ratio of −0.71 — essentially identical to the standard contrast-based model (−0.71). The arm-based approach also reveals that the estimated between-arm correlation is very high (\hat{\rho} = 0.94), meaning that studies with higher baseline TB risk also tend to show higher treatment-group risk — the overall treatment contrast is relatively stable despite large variation in absolute risk levels across studies.

Adding fixed study effects to the arm-based model recovers the standard random-effects model exactly, demonstrating that the two approaches are fundamentally related (van Houwelingen et al., 1993).

Practical Recommendations for Clinical Evidence Synthesis

Always check for dependencies. Before fitting any model, map out the data structure. Are there multiple estimates per study? Per research group? Per clinical site? Dependencies that are ignored do not disappear — they bias your inference.

Use the right level of complexity. A three-level model with ~ 1 | cluster/study handles most hierarchical structures. For multiple distinguishable outcomes, move to a full multivariate model with ~ outcome | study. Do not add complexity that the data cannot support — profile the variance components to check identifiability.

Construct the V matrix carefully. The effort invested in approximating the sampling error covariances pays off in model accuracy. Use vcalc(), make reasonable assumptions, and run sensitivity analyses across plausible values of the assumed correlations.

Apply robust inference routinely. The robust() function with clubSandwich = TRUE provides a safety net against model misspecification with minimal cost. It should be part of the standard workflow for any multilevel or multivariate meta-analysis.

Report the variance decomposition. For multilevel models, reporting \sigma^2_B and \sigma^2_W (and the implied intraclass correlation) is as informative as reporting the pooled effect. For multivariate models, reporting the between-outcome correlation of true effects provides insight into whether different clinical endpoints tend to move together across studies.

Use likelihood ratio tests for model selection. When comparing nested variance structures (e.g., UN vs. CS, or testing whether a specific variance component is zero), LRTs via anova() are the appropriate tool.

References

  • Berkey, C. S., Hoaglin, D. C., Antczak-Bouckoms, A., Mosteller, F., & Colditz, G. A. (1998). Meta-analysis of multiple outcomes by regression with random effects. Statistics in Medicine, 17(22), 2537–2550.
  • Gleser, L. J., & Olkin, I. (2009). Stochastically dependent effect sizes. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 357–376). Russell Sage Foundation.
  • Hedges, L. V., Tipton, E., & Johnson, M. C. (2010). Robust variance estimation in meta-regression with dependent effect size estimates. Research Synthesis Methods, 1(1), 39–65.
  • Ishak, K. J., Platt, R. W., Joseph, L., Hanley, J. A., & Caro, J. J. (2007). Meta-analysis of longitudinal studies. Clinical Trials, 4(5), 525–539.
  • Konstantopoulos, S. (2011). Fixed effects and variance components estimation in three-level meta-analysis. Research Synthesis Methods, 2(1), 61–76.
  • Moeyaert, M., Ugille, M., Beretvas, S. N., Ferron, J., Bunuan, R., & Van den Noortgate, W. (2017). Methods for dealing with multiple outcomes in meta-analysis: A comparison between averaging effect sizes, robust variance estimation and multilevel meta-analysis. International Journal of Social Research Methodology, 20(6), 559–572.
  • Pustejovsky, J. E., & Tipton, E. (2022). Meta-analysis with robust variance estimation: Expanding the range of working models. Prevention Science, 23, 425–438.
  • van Houwelingen, H. C., Zwinderman, K. H., & Stijnen, T. (1993). A bivariate approach to meta-analysis. Statistics in Medicine, 12(24), 2273–2284.
  • Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.
  • Wei, Y., & Higgins, J. P. (2013). Estimating within-study covariances in multivariate meta-analysis with multiple outcomes. Statistics in Medicine, 32(7), 1191–1205.