
Meta-Analysis in Clinical Evidence Synthesis: From Effect Sizes to Regulatory Decision-Making

Dr. Marc Harms


This is Part 1 of a series on meta-analysis in clinical evidence synthesis. Part 2: Multilevel and Multivariate Meta-Analysis covers advanced methods for handling dependent effect sizes, correlated outcomes, and complex data structures.

Why Meta-Analysis Matters for Clinical Evidence

In an era where regulatory bodies, HTA agencies, and payers demand ever more rigorous evidence, meta-analysis has become indispensable. Whether you are building a clinical evaluation report under the EU MDR, preparing a regulatory submission to the FDA, or developing a value dossier for market access — the ability to systematically aggregate and critically appraise quantitative evidence from multiple studies is a core competency.

Meta-analysis is not simply "averaging results." It is a set of statistical methods for combining, comparing, and drawing inferences from collections of related studies, where the key idea is to quantify the size, direction, and strength of an effect in each study and use this as the primary data for further analysis (Cooper, 2017). The distinction between a narrative review and a proper meta-analysis lies in the transparency, reproducibility, and statistical rigour of the synthesis process.

This post provides an overview of the fundamental methods, practical considerations, and pitfalls relevant to anyone working in clinical evidence synthesis — whether in medical devices, pharmaceuticals, or academic research.

The Meta-Analytic Framework

A systematic review and meta-analysis follows a structured process (Cooper, 2017):

  1. Problem formulation — defining the clinical question (often using the PICO framework)
  2. Literature search — comprehensive, reproducible searches across multiple databases
  3. Information gathering — data extraction from included studies
  4. Quality / risk of bias evaluation — using tools such as the Cochrane Risk of Bias tool or ROBINS-I
  5. Analysis — the quantitative synthesis (this is the meta-analysis proper)
  6. Interpretation — what does the pooled estimate mean, and how certain are we?
  7. Presentation of results — forest plots, summary tables, and clear reporting per PRISMA guidelines

The analysis step addresses several fundamental questions: What is the overall average effect? Is the effect constant across studies or does it vary? If it varies, by how much — and can we explain the variation through study-level characteristics?

Outcome Measures: Choosing the Right Effect Size

A critical first step is selecting an outcome measure that quantifies the phenomenon of interest in a way that is comparable across studies. The choice depends on the type of data:

For two-group comparisons with continuous outcomes, the most common measures are the raw mean difference (MD), the standardized mean difference (SMD, often called Hedges' g after bias correction), and the ratio of means (ROM). The raw mean difference preserves the original measurement scale and is directly interpretable when all studies use the same instrument. When studies use different scales, standardization by the pooled standard deviation becomes necessary — but interpretation then requires anchoring back to a familiar metric. For instance, an SMD of −0.20 for the effect of lead exposure on cognitive function translates to roughly a 3-point IQ deficit when the standard deviation of IQ scores is approximately 15 (Borenstein & Hedges, 2019).

For two-group comparisons with dichotomous outcomes (the bread and butter of clinical trials reporting adverse events, responder rates, or mortality), the standard measures are the log risk ratio (RR), the log odds ratio (OR), and the risk difference (RD). The logarithmic transformation is used because it produces a symmetrical measure centred at zero under the null hypothesis and because it yields approximate normality in the sampling distribution — both important properties for valid inference (Borenstein & Hedges, 2019).

For association data, the Pearson correlation coefficient (or its Fisher r-to-z transformation) is standard. The transformation stabilises the variance and improves the normal approximation, which is why it is generally recommended for meta-analytic pooling.

Each of these measures comes with a known (or estimable) sampling variance, which reflects the precision of each study's estimate and directly determines how much weight the study receives in the analysis.
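
As a brief illustration, these effect sizes and their sampling variances can be computed with the escalc() function from the metafor package (discussed further in the Software section below). The summary data here are invented purely for illustration:

```r
library(metafor)

# Continuous outcomes: standardized mean difference (Hedges' g).
# m1i/sd1i/n1i describe the treatment group, m2i/sd2i/n2i the control group.
dat_smd <- escalc(measure = "SMD",
                  m1i = c(23.5, 24.1), sd1i = c(4.2, 5.0), n1i = c(50, 80),
                  m2i = c(21.9, 23.0), sd2i = c(4.5, 4.8), n2i = c(50, 85))

# Dichotomous outcomes: log risk ratio from 2x2 tables.
# ai/bi = events/non-events (treatment), ci/di = events/non-events (control).
dat_rr <- escalc(measure = "RR",
                 ai = c(12, 8), bi = c(88, 92),
                 ci = c(20, 15), di = c(80, 85))

# Association data: Fisher r-to-z transformed correlations.
dat_cor <- escalc(measure = "ZCOR", ri = c(0.30, 0.45), ni = c(120, 200))

dat_smd  # each row now has yi (effect size) and vi (sampling variance)
```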

The Core Models: Equal-Effects vs. Random-Effects

The two standard meta-analytic models differ in a fundamental assumption about the nature of heterogeneity:

The equal-effects (fixed-effect) model assumes that all studies share one common true effect $\theta$, and that the only source of variability in observed outcomes is sampling error. This model is appropriate when you genuinely believe that all studies are functionally identical — a strong assumption that is rarely tenable in practice.

The random-effects model assumes that the true effects $\theta_i$ themselves vary across studies, following a normal distribution with mean $\mu$ and between-study variance $\tau^2$. This "normal-normal" model acknowledges that differences in study populations, interventions, measurement instruments, and contexts lead to genuine variation in the underlying effects (Riley et al., 2011). When $\tau^2 = 0$, the random-effects model reduces to the equal-effects model.

The random-effects model is almost always more appropriate in clinical research, where perfect homogeneity across studies is the exception rather than the rule. The pooled estimate $\hat{\mu}$ then represents the estimated average true effect, and the prediction interval (in its simplest form $\hat{\mu} \pm 1.96\sqrt{\hat{\tau}^2}$; more refined versions also add the variance of $\hat{\mu}$ and use a t-distribution) gives a range within which approximately 95% of the true study-specific effects are expected to fall — a far more informative quantity than the confidence interval alone (Riley et al., 2011).
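
As a minimal sketch, the random-effects model can be fitted with metafor's rma() function. The BCG vaccine trial data (Colditz et al., 1994), which reappear in the meta-regression section below, ship with the package as dat.bcg:

```r
library(metafor)

# Compute log risk ratios and sampling variances from the 2x2 tables
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg)

# Random-effects model with REML estimation of tau^2
res <- rma(yi, vi, data = dat, method = "REML")
summary(res)  # pooled log risk ratio, tau^2, I^2, Q test

# Confidence interval for mu and 95% prediction interval for the true
# effects, back-transformed to the risk ratio scale
predict(res, transf = exp)
```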

Quantifying Heterogeneity

Understanding heterogeneity is arguably as important as estimating the average effect. Several statistics are commonly reported:

$\tau^2$ (tau-squared) is the estimated between-study variance. Several estimators exist — DerSimonian-Laird (DL), restricted maximum likelihood (REML), and the Paule-Mandel (PM) estimator among others. REML is generally recommended due to its good statistical properties and generalisability to more complex models (Viechtbauer, 2005).
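
In metafor, the estimator is chosen via the method argument; refitting under several estimators (continuing with the BCG model above) is a quick sensitivity check:

```r
# Compare tau^2 estimates across estimators on the same data
sapply(c("DL", "REML", "PM"),
       function(m) rma(yi, vi, data = dat, method = m)$tau2)
```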

$I^2$ expresses the proportion of total variability attributable to true heterogeneity rather than sampling error. While widely reported, $I^2$ is a relative measure — for the same degree of variability in true effects, $I^2$ can be small or large depending on study sizes (Borenstein et al., 2017). It should therefore not be interpreted as an absolute measure of how much the effects differ.

The prediction interval is arguably the most informative measure for clinical decision-making, as it directly shows, in the units of the effect size, the range of effects one might expect in a new study. A meta-analysis showing a statistically significant average effect with a prediction interval spanning the null tells a very different story from one where the prediction interval excludes the null entirely.

Meta-Regression: Explaining Heterogeneity

When substantial heterogeneity is present, meta-regression can be used to examine whether study-level characteristics (moderators) account for some of the variability. The mixed-effects meta-regression model extends the random-effects model by including one or more predictor variables.

A classic example comes from the BCG vaccine meta-analysis (Colditz et al., 1994): the 13 included trials showed substantial heterogeneity in BCG vaccine efficacy against tuberculosis (I² ≈ 92%). However, when absolute latitude of the study location was included as a moderator, it accounted for approximately 80% of the heterogeneity (R² ≈ 0.80). This made biological sense — environmental mycobacteria, which may confer natural immunity against TB, are more prevalent near the equator, making the BCG vaccine appear less effective in those regions.

The pseudo-$R^2$ statistic indicates the proportion of heterogeneity explained by the moderator(s), calculated as $(\tau^2_{RE} - \tau^2_{ME}) / \tau^2_{RE}$, where $\tau^2_{RE}$ comes from the random-effects model without moderators and $\tau^2_{ME}$ from the mixed-effects model. Categorical moderators (e.g., randomisation method, blinding status, geographical region) are handled through dummy coding, and contrasts between subgroups can be tested via Wald-type tests.
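
Because the Colditz data ship with metafor, the BCG meta-regression is directly reproducible; the model output includes the pseudo-$R^2$ and the omnibus moderator test:

```r
# Mixed-effects meta-regression with absolute latitude as moderator
res_mr <- rma(yi, vi, mods = ~ ablat, data = dat, method = "REML")
summary(res_mr)  # reports residual tau^2, R^2, and the QM moderator test
```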

Refined Inference Methods

Standard Wald-type confidence intervals and tests assume that the between-study variance $\tau^2$ is known. In reality, $\tau^2$ is estimated — and especially with small numbers of studies, this introduces additional uncertainty that standard methods fail to capture, leading to confidence intervals that are too narrow and Type I error rates that are inflated.

The Knapp-Hartung method addresses this by using a t-distribution (with $k - 1$ degrees of freedom in the random-effects model) instead of a normal distribution and by incorporating an additional variance component. This should be the default approach for random-effects and mixed-effects models (IntHout et al., 2014). With very small $k$, the resulting confidence intervals may be wide — but this correctly reflects the difficulty of making inferences from few studies.
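
In metafor, the adjustment is requested with the test argument:

```r
# Random-effects model with Knapp-Hartung adjusted tests and intervals
res_kh <- rma(yi, vi, data = dat, method = "REML", test = "knha")
summary(res_kh)  # t-based inference with k - 1 degrees of freedom
```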

Permutation testing provides a complementary, distribution-free approach: under the null hypothesis that the true average effect is zero, the signs of the observed outcomes are arbitrary. By randomly permuting the signs and re-computing the test statistic many times, one obtains an exact (or approximate) null distribution against which to evaluate the observed statistic.
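
A sketch of this in metafor, again for the BCG model (for an intercept-only model, permutest() permutes the signs of the observed outcomes):

```r
# Permutation test of the pooled effect (can be slow for large iter)
permutest(res, exact = FALSE, iter = 1000)
```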

Publication Bias: The Elephant in the Room

Publication bias — the tendency for studies with statistically significant or "positive" results to be more likely published — is perhaps the single greatest threat to the validity of a meta-analysis. The magnesium treatment example (Egger et al., 1997) provides a cautionary tale: a meta-analysis of 14 small trials suggested that intravenous magnesium significantly reduced mortality after myocardial infarction (RR ≈ 0.50). However, the subsequent large-scale ISIS-4 trial (n > 58,000) found no benefit whatsoever (RR ≈ 1.0). The earlier positive finding was almost certainly driven by publication bias favouring small studies with positive results.

Detection Methods

Funnel plots are the most common visual diagnostic. Under the assumption of no publication bias, a plot of effect sizes against their standard errors (or other precision measures) should resemble a symmetric, inverted funnel. Asymmetry suggests that small, imprecise studies with non-significant results may be missing. However, funnel plots are difficult to interpret, especially with small k and large τ² (Terrin et al., 2005).

Egger's regression test formalises funnel plot assessment by regressing effect sizes on their standard errors. A significant slope indicates asymmetry — though this is a test for funnel plot asymmetry, not publication bias per se, as other factors (e.g., genuine effect-size differences between small and large studies) can also produce asymmetry (Sterne et al., 2011).

The rank correlation test (Begg & Mazumdar, 1994) and the test of excess significance (Ioannidis & Trikalinos, 2007) provide additional diagnostics.
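
These diagnostics are all available in metafor; tes() requires a reasonably recent version of the package:

```r
funnel(res)    # funnel plot of the random-effects model

regtest(res)   # Egger-type regression test (effects regressed on standard errors)

ranktest(res)  # rank correlation test (Begg & Mazumdar, 1994)

tes(yi, vi, data = dat)  # test of excess significance (Ioannidis & Trikalinos, 2007)
```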

Adjustment Methods

When publication bias is suspected, several correction methods are available:

The trim and fill method (Duval & Tweedie, 2000) imputes "missing" studies to restore funnel plot symmetry and re-estimates the pooled effect. While intuitive, it relies on strong assumptions about the suppression mechanism.

PET-PEESE (Stanley & Doucouliagos, 2014) uses the intercept from a meta-regression of effect sizes on their standard errors (PET) or sampling variances (PEESE) as a bias-corrected estimate — conceptually, the projected effect for a study with infinite precision.

Selection models (Hedges, 1984; Iyengar & Greenhouse, 1988) directly model the relationship between a study's p-value and its probability of inclusion, providing a principled statistical framework for bias correction. These are the most theoretically appealing methods but require larger k for stable estimation.

The fail-safe N ("file drawer analysis") assesses robustness by calculating how many null-result studies would need to exist to overturn the meta-analytic conclusion (Rosenthal, 1979). While often criticised, it remains useful as a quick sensitivity check when properly interpreted.

No single method is definitive — best practice is to apply multiple approaches and evaluate whether conclusions are robust across methods.
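
A sketch applying several of these adjustments to the BCG model fitted earlier; each relies on assumptions (suppression mechanism, direction of selection) that must be matched to the clinical context, so the choices below — notably the PET specification and the selection-model cutpoints — are illustrative, not prescriptive:

```r
# Trim and fill (Duval & Tweedie, 2000)
trimfill(res)

# PET: fixed-effects meta-regression on the standard error; the intercept
# approximates the effect of a hypothetical study with SE = 0.
# For PEESE, replace sqrt(vi) with vi.
rma(yi, vi, mods = ~ sqrt(vi), data = dat, method = "FE")

# Step-function selection model with a cutpoint at a one-sided p of .025
# (two-sided .05); "less" assumes selection favours significant protective
# (negative) effects, as plausible for the BCG data
selmodel(res, type = "stepfun", alternative = "less", steps = c(0.025, 1))

# Rosenthal's fail-safe N
fsn(yi, vi, data = dat)
```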

Implications for Regulatory and Industry Practice

For professionals working in clinical evidence synthesis for medical devices or pharmaceuticals, several practical takeaways emerge:

Model choice matters. Random-effects models with the Knapp-Hartung adjustment should be the default. Reporting only an equal-effects model when heterogeneity is present can substantially overstate the precision of the pooled estimate.

Report prediction intervals. For regulatory reviewers and clinicians, the prediction interval is often more decision-relevant than the confidence interval. It answers the question: "What effect might we see in a new study or clinical setting?"

Address publication bias explicitly. Clinical evaluation reports, HTA submissions, and systematic reviews should include a dedicated publication bias assessment. Funnel plots, regression tests, and at least one adjustment method should be reported.

Heterogeneity is information, not noise. Rather than viewing heterogeneity as a nuisance, investigate it through meta-regression. Understanding why effects vary across studies is often more valuable than a single pooled number.

Pre-registration helps. The gold standard remains prospective meta-analysis of registered studies — effectively eliminating publication bias by design (Simes, 1995). For post-hoc meta-analyses, a registered protocol (e.g., on PROSPERO) and transparent reporting remain essential.

Software

The R package metafor (Viechtbauer, 2010) provides a comprehensive, open-source toolkit for conducting meta-analyses and is widely used in both academic research and industry. It supports all the methods discussed here — from effect size computation via escalc(), to random-effects and mixed-effects models via rma(), to publication bias methods including funnel plots, regression tests, trim-and-fill, selection models, and permutation tests.
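
A minimal end-to-end run with the bundled BCG data ties these pieces together:

```r
# install.packages("metafor")  # once
library(metafor)

dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg)
res <- rma(yi, vi, data = dat, method = "REML", test = "knha")
forest(res, atransf = exp)  # forest plot, axis on the risk ratio scale
funnel(res)
```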

For those less comfortable with R, alternatives include the commercial Comprehensive Meta-Analysis (CMA) software and Cochrane's free RevMan. However, the flexibility and transparency of R-based analyses are increasingly expected in regulatory and HTA contexts.

References

  • Begg, C. B., & Mazumdar, M. (1994). Operating characteristics of a rank correlation test for publication bias. Biometrics, 50(4), 1088–1101.
  • Borenstein, M., & Hedges, L. (2019). Effect sizes for meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (3rd ed., pp. 207–243). Russell Sage Foundation.
  • Borenstein, M., Higgins, J. P. T., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: I² is not an absolute measure of heterogeneity. Research Synthesis Methods, 8(1), 5–18.
  • Colditz, G. A., Brewer, T. F., Berkey, C. S., Wilson, M. E., Burdick, E., Fineberg, H. V., et al. (1994). Efficacy of BCG vaccine in the prevention of tuberculosis: Meta-analysis of the published literature. Journal of the American Medical Association, 271(9), 698–702.
  • Cooper, H. M. (2017). Research synthesis and meta-analysis: A step-by-step approach (5th ed.). Sage.
  • Duval, S. J., & Tweedie, R. L. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56(2), 455–463.
  • Egger, M., Davey Smith, G., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. British Medical Journal, 315(7109), 629–634.
  • Hedges, L. V. (1984). Estimation of effect size under nonrandom sampling: The effects of censoring studies yielding statistically insignificant mean differences. Journal of Educational Statistics, 9(1), 61–85.
  • IntHout, J., Ioannidis, J. P. A., & Borm, G. F. (2014). The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Medical Research Methodology, 14, 25.
  • Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4(3), 245–253.
  • Iyengar, S., & Greenhouse, J. B. (1988). Selection models and the file drawer problem. Statistical Science, 3(1), 109–135.
  • Riley, R. D., Higgins, J. P., & Deeks, J. J. (2011). Interpretation of random effects meta-analyses. British Medical Journal, 342, d549.
  • Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86(3), 638–641.
  • Siddaway, A. P., Wood, A. M., & Hedges, L. V. (2019). How to do a systematic review: A best practice guide for conducting and reporting narrative reviews, meta-analyses, and meta-syntheses. Annual Review of Psychology, 70, 747–770.
  • Simes, R. J. (1995). Prospective meta-analysis of cholesterol-lowering studies: The Prospective Pravastatin Pooling (PPP) Project and the Cholesterol Treatment Trialists (CTT) Collaboration. American Journal of Cardiology, 76(9), 122C–126C.
  • Stanley, T. D., & Doucouliagos, H. (2014). Meta-regression approximations to reduce publication selection bias. Research Synthesis Methods, 5(1), 60–78.
  • Sterne, J. A., Sutton, A. J., Ioannidis, J. P., et al. (2011). Recommendations for examining and interpreting funnel plot asymmetry in meta-analyses of randomised controlled trials. British Medical Journal, 343, d4002.
  • Terrin, N., Schmid, C. H., & Lau, J. (2005). In an empirical evaluation of the funnel plot, researchers could not visually identify publication bias. Journal of Clinical Epidemiology, 58(9), 894–901.
  • Viechtbauer, W. (2005). Bias and efficiency of meta-analytic variance estimators in the random-effects model. Journal of Educational and Behavioral Statistics, 30(3), 261–293.
  • Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.