The Confidence Interval Answers the Wrong Question: Why Every Meta-Analysis Needs a Prediction Interval

This is Part 4 of a series on meta-analysis in clinical evidence synthesis. Part 1 covers the standard toolkit, Part 2 extends it to multilevel and multivariate models, and Part 3 shows how to draw the prediction interval on a forest plot in R.

Summary

The random-effects meta-analysis is the workhorse of clinical evidence synthesis, but the number most readers carry away, the pooled estimate and its 95% confidence interval, describes the average true effect, not the effect a new study or a new clinical setting would see.
A confidence interval narrows toward zero width as you add studies. It quantifies how well we have pinned down the mean. It says nothing about how widely the true effects scatter around that mean.
The prediction interval answers the question clinicians and regulators actually ask: if we deploy this device in the next centre, what range of true effect should we expect? It adds the between-study heterogeneity ( $\tau^2$ ) back into the picture.
A meta-analysis can show a "statistically significant" benefit (confidence interval excluding the null) while its prediction interval comfortably spans both benefit and harm. The two intervals routinely tell opposite stories, and the prediction interval is usually the honest one.
This is a modifiable problem. PRISMA, the Cochrane Handbook, and a decade of methodological pleading all say the same thing. Report the prediction interval, on the forest plot, every time, and know the few situations where it should not be trusted.

A triumph, and a quiet failure

The random-effects meta-analysis is one of the genuine triumphs of evidence-based medicine. It lets us pool a scattered, underpowered, multi-centre literature into a single defensible estimate. It forces transparency about which studies were included and how they were weighted. It turns a stack of conflicting papers into a number a guideline committee can act on. For medical-device evidence under the EU MDR, where the literature is almost always a handful of small, heterogeneous studies, it is often the only way to say anything quantitative at all.

However, the single number readers extract from that analysis is almost always the wrong one. They read the pooled effect, glance at its 95% confidence interval, check whether it crosses the null, and stop. The confidence interval feels like the answer to "how big is the effect, and how sure are we?" It is not. It answers a question about the average that almost nobody is actually asking. The question a clinician, a payer, or a Notified Body reviewer cares about is forward-looking. In the next patient population, with the next operator, at the next centre, what effect should we actually expect? That question has a different answer, a different interval, and usually a sobering width.

This post builds that second interval from the ground up. We start with what a 95% confidence interval is and how it is calculated, show where heterogeneity enters, expose the connection that motivates the prediction interval, and define the prediction interval. We then compare the two head to head, map when it helps and when it misleads, give the calculation explicitly, and end with limitations and a short list of what to do. Throughout, one worked example does the heavy lifting.

The running example

For illustration, suppose we pool 10 randomised trials of a medical device against standard care, with a binary safety outcome expressed as an odds ratio (OR < 1 favours the device). A random-effects model (REML, Knapp–Hartung) returns:

Pooled OR = 0.75, 95% CI 0.62 to 0.91 (p ≈ 0.004), a "significant" 25% reduction in the odds of the event.
Heterogeneity: $I^2 = 71\%$ , $\tau^2 = 0.10$ (on the log-odds scale), $\hat\tau \approx 0.32$ .
95% prediction interval: OR 0.35 to 1.61.

Hold those two intervals side by side. The confidence interval says benefit, and we are sure. The prediction interval says the true effect in a new setting could plausibly be anything from a 65% reduction to a 61% increase in the odds of harm. Same data, same model. Opposite clinical message. Everything below explains why both are correct, and which one you should put in front of a decision-maker.

What a 95% confidence interval is, and how it is built

Start with the definition, because the conventional gloss ("we are 95% sure the true value lies inside") is wrong and the error matters here. A 95% confidence interval is a frequentist coverage statement about a procedure: if we repeated the entire study-and-analysis process many times, 95% of the intervals constructed this way would contain the true parameter. The parameter is fixed. The interval is random. What the interval quantifies is sampling uncertainty in our estimate of one specific quantity, and in a random-effects meta-analysis that quantity is $\mu$ , the mean of the distribution of true effects.

Mechanically, a confidence interval is a point estimate plus or minus a multiple of its standard error:

$\hat\mu \pm z_{0.975}\,\mathrm{SE}(\hat\mu), \qquad z_{0.975} = 1.96$

In meta-analysis the point estimate $\hat\mu$ is a weighted average of the study effects, and the weights are the inverse of each study's total variance. Under the random-effects model each study $i$ has sampling variance $v_i$ and shares the between-study variance $\tau^2$ , so its weight is

$w_i^{*} = \frac{1}{v_i + \hat\tau^2}, \qquad \hat\mu = \frac{\sum_i w_i^{*}\,y_i}{\sum_i w_i^{*}}, \qquad \mathrm{Var}(\hat\mu) = \frac{1}{\sum_i w_i^{*}}.$

The standard error is $\mathrm{SE}(\hat\mu) = \sqrt{\mathrm{Var}(\hat\mu)}$ , and the confidence interval follows. (With few studies, the better-calibrated Knapp–Hartung approach replaces $z_{0.975}$ with a $t$ -quantile on $k-1$ degrees of freedom and rescales the standard error by the observed scatter of the studies, which usually, though not always, widens the interval. This fixes the well-documented under-coverage of the naive interval and should be the default, per IntHout et al., 2014.)

The decisive feature is in the denominator $\sum_i w_i^{*}$ . Every study you add increases that sum, so $\mathrm{Var}(\hat\mu)$ shrinks, and the confidence interval narrows. As $k \to \infty$ the confidence interval collapses toward a point. That is exactly what you want from an estimate of an average, and exactly why it cannot be the whole story.
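As a rough illustration of that mechanism, the sketch below pools a set of made-up log odds ratios in base R. The data, the helper name pool_re, and the choice to treat $\tau^2$ as known are all assumptions for the example. A real analysis would estimate $\tau^2$ (for instance by REML) and would normally prefer the Knapp–Hartung interval over the normal-quantile one used here.

```r
# Hypothetical log odds ratios and sampling variances (illustrative only)
yi <- c(-0.60, -0.10, -0.45, 0.15, -0.35, -0.70, 0.05, -0.25)
vi <- c(0.040, 0.060, 0.050, 0.080, 0.030, 0.090, 0.070, 0.045)
tau2 <- 0.10   # treated as known here; in practice it is estimated

# Random-effects pooling with inverse total-variance weights
pool_re <- function(yi, vi, tau2, level = 0.95) {
  w  <- 1 / (vi + tau2)              # w_i* = 1 / (v_i + tau^2)
  mu <- sum(w * yi) / sum(w)         # weighted mean of the study effects
  se <- sqrt(1 / sum(w))             # SE(mu-hat) = sqrt(1 / sum of weights)
  z  <- qnorm(1 - (1 - level) / 2)   # normal quantile (not Knapp-Hartung)
  c(mu = mu, se = se, ci.lb = mu - z * se, ci.ub = mu + z * se)
}

fit8  <- pool_re(yi, vi, tau2)
# Artificially double the evidence base by repeating the same studies
fit16 <- pool_re(rep(yi, 2), rep(vi, 2), tau2)

round(exp(fit8[c("mu", "ci.lb", "ci.ub")]), 2)   # pooled OR and CI, 8 studies
unname(fit16["se"] / fit8["se"])                 # equals 1 / sqrt(2)
```

Repeating the same studies is artificial, but it isolates the point: the sum of weights doubles, the standard error shrinks by a factor of $1/\sqrt{2}$ , and $\tau^2$ has not moved at all.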

Where heterogeneity enters

The equal-effects (fixed-effect) model assumes a single shared truth $\theta$ : every study estimates the same effect, and the only reason observed results differ is sampling error. In clinical and especially device research that assumption is rarely tenable. Operators climb learning curves, device generations evolve, populations and follow-up windows differ, centres differ. The effects genuinely vary.

The random-effects model encodes this. It assumes each study has its own true effect $\theta_i$ , drawn from a distribution:

$\theta_i \sim N(\mu, \tau^2).$

Here $\mu$ is the mean of that distribution and $\tau^2$ is its variance, the between-study heterogeneity, measured in the squared units of the effect size. When $\tau^2 = 0$ the model collapses back to the equal-effects case. When $\tau^2 > 0$ , there is no single "true effect" to estimate at all. There is a spread of true effects, and $\mu$ is merely its centre of gravity.

Three statistics describe that spread, and the field's habit of reporting only one of them, $I^2$ , is the root of the problem this post addresses:

$\tau^2$ (and $\hat\tau$ ): the absolute heterogeneity, in effect-size units. Most interpretable, least reported.
$I^2$ : the proportion of total variability due to heterogeneity rather than sampling error. Crucially, $I^2$ is a relative measure: for the same $\tau^2$ , it grows simply because the studies are larger. An $I^2$ of 71% does not tell you, in clinical units, how much the effects differ (Borenstein et al., 2017). It is routinely over-interpreted as if it did.
$Q$ : Cochran's test of the null hypothesis $\tau^2 = 0$ . Underpowered when $k$ is small, hypersensitive when $k$ is large. A non-significant $Q$ is not evidence of homogeneity.

Notice what is missing from that list: a number, in the original effect-size units, that tells a reader how far the true effects actually scatter. That number is the prediction interval, and the reason heterogeneity matters here is that the confidence interval throws all of this away. $\tau^2$ enters the weights, and thereby shifts $\hat\mu$ and enlarges $\mathrm{Var}(\hat\mu)$ , but the confidence interval is still an interval about the mean. It does not report the spread. It averages over it.
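To make the point about $I^2$ being relative concrete, here is a minimal sketch in base R. It assumes the simplest possible case, in which every study has the same sampling variance $v$ , so that $I^2 = \tau^2 / (\tau^2 + v)$ . The values of $v$ are hypothetical and chosen only to span small to large studies.

```r
tau2 <- 0.10                       # same absolute heterogeneity throughout
v    <- c(0.40, 0.10, 0.04, 0.01)  # hypothetical sampling variances: small -> large studies

# With equal sampling variances, I^2 reduces to tau^2 / (tau^2 + v)
I2 <- tau2 / (tau2 + v)

data.frame(sampling_var = v,
           tau          = round(sqrt(tau2), 2),
           I2_percent   = round(100 * I2))
# I2_percent: 20, 50, 71, 91, while tau stays at 0.32 in every row
```

The spread of true effects is identical in all four rows. Only the precision of the individual studies changes, and $I^2$ moves from 20% to 91% on that basis alone.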

The missing link: from the average to the next study

Here is the connection that motivates everything that follows. The confidence interval contains exactly one source of uncertainty, the sampling uncertainty about where the mean $\mu$ sits, namely $\mathrm{Var}(\hat\mu)$ . But a decision-maker facing the next study, the next centre, the next patient population, faces two sources of uncertainty stacked on top of each other:

We do not know the mean $\mu$ exactly. That is $\mathrm{Var}(\hat\mu)$ , the confidence-interval part.
Even if we knew $\mu$ perfectly, the next study's true effect $\theta_{\text{new}}$ is a fresh draw from $N(\mu, \tau^2)$ , scattering around $\mu$ with variance $\tau^2$ .

A forecast for a future true effect must carry both. The relevant variance is therefore $\tau^2 + \mathrm{Var}(\hat\mu)$ , not $\mathrm{Var}(\hat\mu)$ alone. That single added term, $\tau^2$ , is the entire difference between estimating the past and predicting the future, and it is precisely the term the confidence interval discards.
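One way to see where the sum comes from, sketched under the assumptions that the new study is independent of the studies already pooled and that $\tau^2$ is treated as known: split the forecast error into two independent pieces and add their variances.

$\theta_{\text{new}} - \hat\mu = (\theta_{\text{new}} - \mu) - (\hat\mu - \mu), \qquad \mathrm{Var}(\theta_{\text{new}} - \hat\mu) = \underbrace{\tau^2}_{\text{scatter of true effects}} + \underbrace{\mathrm{Var}(\hat\mu)}_{\text{uncertainty about the mean}}.$

The additional uncertainty from having to estimate $\tau^2$ itself is not captured by this expression.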

This reframes the inversion that catches so many readers off guard. A narrow confidence interval is not, by itself, reassurance. With enough studies it narrows no matter how wildly the true effects scatter, because $\mathrm{Var}(\hat\mu)\to 0$ while $\tau^2$ stays exactly where it is. A tight confidence interval around a heterogeneous body of evidence means only that we have precisely located the centre of a wide cloud, not that the cloud is small. To describe the cloud, we need a different interval.
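A small numerical sketch of that inversion, in base R. It assumes equal sampling variances across studies (so that $\mathrm{Var}(\hat\mu) = (v + \tau^2)/k$ ), uses the running example's $\tau^2 = 0.10$ , and picks a hypothetical $v = 0.04$ . It compares standard deviations only, leaving quantile multipliers aside.

```r
tau2 <- 0.10   # between-study variance, as in the running example
v    <- 0.04   # hypothetical sampling variance, assumed equal across studies
k    <- c(5, 10, 50, 500, 5000)

var_mu  <- (v + tau2) / k        # Var(mu-hat) = 1 / sum of weights when all v_i are equal
se_mean <- sqrt(var_mu)          # drives the confidence interval
sd_new  <- sqrt(tau2 + var_mu)   # drives the forecast for a new true effect

data.frame(k       = k,
           se_mean = round(se_mean, 3),
           sd_new  = round(sd_new, 3),
           tau     = round(sqrt(tau2), 3))
```

As $k$ grows, se_mean heads to zero while sd_new settles at $\tau \approx 0.316$ and goes no lower.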

What a prediction interval is

The prediction interval is the range within which the true effect of a future study is expected to fall, with a stated probability (conventionally 95%). It is not a statement about $\mu$ . It is a statement about $\theta_{\text{new}}$ , an as-yet-unobserved true effect drawn from the same distribution of effects the meta-analysis is modelling (Higgins, Thompson & Spiegelhalter, 2009, and Riley, Higgins & Deeks, 2011).

Read plainly, the prediction interval is where you should expect the truth to land next time. In our running example it is OR 0.35 to 1.61. That is the sentence to put in a clinical evaluation report, because it matches the decision being made: in a new setting, the device's true effect on this outcome is expected to lie between a 65% reduction and a 61% increase in the odds of the event. A clinician is never going to treat "the average of past trials." They are going to treat the patient in front of them, in a setting that is, statistically, the next draw.

Prediction interval versus confidence interval

The cleanest way to hold the two apart is by what they are about and how they behave as evidence accumulates.

	95% Confidence interval	95% Prediction interval
Estimates	the mean true effect $\mu$	a future true effect $\theta_{\text{new}}$
Variance used	$\mathrm{Var}(\hat\mu)$	$\tau^2 + \mathrm{Var}(\hat\mu)$
As $k \to \infty$	shrinks toward a point	converges to $\mu \pm 1.96\,\tau$ , does not vanish
When $\tau^2 = 0$	(essentially) coincides with the prediction interval	(essentially) coincides with the confidence interval
Answers	"How precisely do we know the average?"	"What should we expect next time?"
Width	always narrower	always at least as wide, usually much wider

Two consequences deserve emphasis. First, the prediction interval is always at least as wide as the confidence interval, because it adds a non-negative term ( $\tau^2$ ) under the square root, and with real heterogeneity it is dramatically wider. Second, and most importantly for interpretation: the two can disagree about the null, with one excluding it while the other spans it. Our example is exactly this case. The CI of 0.62 to 0.91 excludes 1 and looks "significant," while the PI of 0.35 to 1.61 includes 1 and reaches well into harm. When that happens, the meta-analysis is telling you that the average effect is beneficial but the next result is genuinely uncertain in direction. Reporting only the confidence interval in that situation is not a simplification. It is a misrepresentation.

When the prediction interval is useful

The prediction interval earns its place whenever the decision is about application, not just description, which in clinical work is almost always.

Whenever heterogeneity is real. With non-trivial $\tau^2$ , the prediction interval is the single most clinically informative summary of a meta-analysis (Riley et al., 2011, and IntHout et al., 2016). It converts an abstract $I^2$ into a concrete range a clinician can reason about.
Medical-device evidence under the MDR. Device heterogeneity is structural, not noise. It comes from generations, operators, centres, and indications. The prediction interval is the honest way to express it in a clinical evaluation report, and it directly supports the Post-Market Clinical Follow-up logic: it states, quantitatively, what a future PMCF study might find.
Translating evidence to a new population or site. "Will this work in our hospital?" is a prediction question. The prediction interval is its native answer.
Guarding against false confidence. When a confidence interval excludes the null but the prediction interval does not, the prediction interval is the warning label. It stops a guideline panel from over-reading a precise-looking average.
On the forest plot, always. The prediction interval should be drawn as an extended bar beneath the pooled diamond (in metafor, addpred = TRUE). PRISMA, MOOSE, and Cochrane all call for it.

When the prediction interval misleads

The prediction interval is not a free good, and a few situations make it untrustworthy or meaningless. Knowing them is part of using it responsibly.

Too few studies. The interval depends on a variance of a distribution ( $\tau^2$ ) estimated from $k$ points. With small $k$ , $\hat\tau^2$ is wildly uncertain, and the standard interval's actual coverage can stray far from 95% (Partlett & Riley, 2017). The interval requires at least $k = 3$ to exist at all (the $t$ -distribution needs $k-2 \ge 1$ degrees of freedom), and many methodologists would not trust it below roughly $k = 5$ , treating it cautiously up to about $k = 10$ . Improved constructions exist for the small- $k$ case (Nagashima, Noma & Furukawa, 2019), and they should be used when the decision is high-stakes and the studies are few.
When the normal-distribution assumption fails. The prediction interval assumes the true effects are approximately normally distributed. With a few outlying studies, distinct subpopulations, or a genuinely bimodal effect, that assumption breaks and the interval describes a distribution that does not exist. The fix is not a wider interval. It is to investigate the heterogeneity (subgroups, meta-regression) before predicting through it.
When the heterogeneity is artefactual. If the spread is driven by risk-of-bias differences, data-extraction error, or pooling clinically incompatible studies, then $\tau^2$ , and therefore the prediction interval, is quantifying noise and mistakes, not real clinical variability. A prediction interval built on a bad synthesis inherits its sins.
When the model itself is wrong. Under an equal-effects model the prediction interval is not defined in any useful sense (there is no distribution to predict from). And the choice of random-effects versus equal-effects must be a modelling decision about the estimand, never a reflex driven by the $Q$ -test p-value.

In short, the prediction interval is only as good as the random-effects model and the studies underneath it. It is a magnifying glass, not a microscope. It makes real heterogeneity legible, but it cannot manufacture trustworthy structure out of a handful of incompatible trials.
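Part of the small- $k$ penalty is visible in the multiplier alone. The sketch below tabulates the $t$ -quantile on $k-2$ degrees of freedom used by the standard construction, for a few illustrative values of $k$ . It shows only the multiplier. It does not capture the instability of $\hat\tau^2$ itself, which is the larger problem when studies are few.

```r
k    <- c(3, 4, 5, 10, 20, 50)
mult <- qt(0.975, df = k - 2)   # two-sided 95% multiplier on k - 2 degrees of freedom

data.frame(k = k, df = k - 2, t_multiplier = round(mult, 2))
# t_multiplier: 12.71, 4.30, 3.18, 2.31, 2.10, 2.01 (versus 1.96 for the normal)
```

At $k = 3$ the multiplier is more than six times the familiar 1.96, so the interval is close to uninformative whatever the data say.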

How the prediction interval is calculated

The simplest textbook form ignores the uncertainty in $\hat\mu$ and uses the heterogeneity alone:

$\hat\mu \pm 1.96\,\hat\tau.$

This is the limiting form (large $k$ ) and is fine for intuition, but it understates the interval because it pretends we know $\mu$ exactly. The recommended construction adds the estimation uncertainty back in and, because $\tau^2$ is itself estimated from $k$ studies, replaces the normal quantile with a $t$ -quantile on $k-2$ degrees of freedom (Higgins, Thompson & Spiegelhalter, 2009, and IntHout et al., 2016):

$\hat\mu \;\pm\; t_{k-2,\,0.975}\,\sqrt{\hat\tau^2 + \mathrm{Var}(\hat\mu)}.$

Both extra ingredients versus the confidence interval trace back to $\tau^2$ : the $\tau^2$ term inside the square root (the spread of true effects identified earlier) and the heavier-tailed $t$ multiplier on $k-2$ degrees of freedom (the penalty for having estimated that spread from few studies).

Worked through, on the log-odds scale, for our example. We have $\hat\mu = \ln(0.75) = -0.288$ , $\hat\tau^2 = 0.10$ , and $\mathrm{Var}(\hat\mu) \approx 0.0096$ (so $\mathrm{SE}(\hat\mu) \approx 0.098$ , consistent with the CI 0.62 to 0.91 when the normal quantile 1.96 is used). With $k = 10$ , $t_{8,\,0.975} = 2.306$ :

$-0.288 \;\pm\; 2.306\,\sqrt{0.10 + 0.0096} \;=\; -0.288 \pm 2.306 \times 0.331 \;=\; -0.288 \pm 0.763.$

That gives a log-odds interval of $(-1.05,\ 0.48)$ . Exponentiating back to the odds-ratio scale yields OR 0.35 to 1.61, the prediction interval quoted at the top. (Using the naive $1.96\hat\tau$ form instead gives roughly 0.40 to 1.39, which is narrower, and falsely so.)
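The same arithmetic can be checked in a few lines of base R. This is only a sketch that plugs in the summary numbers quoted above (pooled OR, $\hat\tau^2$ , $\mathrm{Var}(\hat\mu)$ , $k$ ). It is not a model fit, and the variable names are illustrative.

```r
or_hat <- 0.75     # pooled odds ratio from the running example
tau2   <- 0.10     # between-study variance (log-odds scale)
var_mu <- 0.0096   # Var(mu-hat) (log-odds scale)
k      <- 10       # number of studies

mu <- log(or_hat)

# 95% confidence interval for the mean, using the normal quantile
ci_or <- exp(mu + c(-1, 1) * qnorm(0.975) * sqrt(var_mu))

# 95% prediction interval: t quantile on k - 2 df, variance tau^2 + Var(mu-hat)
pi_or <- exp(mu + c(-1, 1) * qt(0.975, df = k - 2) * sqrt(tau2 + var_mu))

round(ci_or, 2)   # 0.62 0.91
round(pi_or, 2)   # 0.35 1.61

# Do the two intervals agree about the null (OR = 1)?
ci_excludes_null <- ci_or[2] < 1 || ci_or[1] > 1    # TRUE
pi_excludes_null <- pi_or[2] < 1 || pi_or[1] > 1    # FALSE
```

The two logical flags at the end are the whole argument in miniature: the confidence interval excludes the null and the prediction interval does not.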

In R with metafor, none of this is done by hand. The model fit carries everything. The predict() function returns the prediction limits and forest() draws them:

library(metafor)

# dat already contains log odds ratios (yi) and variances (vi) from escalc()
res <- rma(yi, vi, data = dat, method = "REML", test = "knha")

# Confidence interval is in the model summary. Prediction interval below:
predict(res, transf = exp)     # pi.lb / pi.ub on the OR scale

# Forest plot WITH the prediction interval drawn under the diamond
forest(res, atransf = exp, at = log(c(0.25, 0.5, 1, 2, 4)),
       addpred = TRUE,                      # the prediction interval bar
       header = c("Study", "OR [95% CI]"))

The single argument addpred = TRUE is the difference between a forest plot that shows only the average and one that also shows the spread. It costs nothing and changes the conclusion a reader draws.

Limitations

Even when used correctly, the prediction interval carries caveats worth stating plainly, so it is neither dismissed nor over-sold:

It inherits all the small- $k$ fragility above. Coverage of the standard interval is approximate and can be poor below roughly 5 to 10 studies. Report it with that honesty, or use a small-sample-corrected method (Nagashima et al., 2019).
It is sensitive to the $\tau^2$ estimator. Different estimators (REML, DerSimonian–Laird, Paule–Mandel) yield different $\hat\tau^2$ and therefore different widths, so a sensitivity analysis across estimators is good practice.
It assumes normality of true effects, an assumption that is hard to check with few studies and easy to violate with subpopulations.
It does not fix bias. Publication bias, selective outcome reporting, and confounding in non-randomised device studies shift the whole distribution. A prediction interval around a biased centre is a precise statement about the wrong place.
It is wider, and that is the point, not a flaw. A common objection is that the prediction interval "looks bad" because it so often crosses the null. That width is information. Suppressing it does not make the underlying uncertainty go away. It only hides it from the decision-maker.
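The estimator sensitivity mentioned above can be sketched in base R without any packages. The example below uses hypothetical data and hand-written versions of two estimators, DerSimonian–Laird (method of moments) and Paule–Mandel (solving the generalised $Q$ equation). REML is omitted only to keep the sketch short, and the helper names q_gen and pred_int are illustrative. For real work a maintained implementation such as metafor is the safer choice.

```r
# Hypothetical log odds ratios and sampling variances (illustrative only)
yi <- c(-0.60, -0.10, -0.45, 0.15, -0.35, -0.70, 0.05, -0.25)
vi <- c(0.040, 0.060, 0.050, 0.080, 0.030, 0.090, 0.070, 0.045)
k  <- length(yi)

# Generalised Q statistic evaluated at a candidate tau^2
q_gen <- function(tau2) {
  w  <- 1 / (vi + tau2)
  mu <- sum(w * yi) / sum(w)
  sum(w * (yi - mu)^2)
}

# DerSimonian-Laird: method-of-moments estimate, truncated at zero
w0      <- 1 / vi
c_dl    <- sum(w0) - sum(w0^2) / sum(w0)
tau2_dl <- max(0, (q_gen(0) - (k - 1)) / c_dl)

# Paule-Mandel: the tau^2 at which the generalised Q equals k - 1
tau2_pm <- if (q_gen(0) <= k - 1) 0 else
  uniroot(function(t2) q_gen(t2) - (k - 1),
          interval = c(0, 10), tol = 1e-12)$root

# 95% prediction interval on the OR scale for a given tau^2
pred_int <- function(tau2) {
  w      <- 1 / (vi + tau2)
  mu     <- sum(w * yi) / sum(w)
  var_mu <- 1 / sum(w)
  half   <- qt(0.975, df = k - 2) * sqrt(tau2 + var_mu)
  c(pi.lb = exp(mu - half), pi.ub = exp(mu + half))
}

round(rbind(DL = c(tau2 = tau2_dl, pred_int(tau2_dl)),
            PM = c(tau2 = tau2_pm, pred_int(tau2_pm))), 3)
```

Each row reports one estimator's $\hat\tau^2$ and the prediction limits that follow from it. Any gap between the rows is due to the estimator alone, since the data are identical.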

What to do: report the spread, not just the centre

The fix here is unusually cheap and unusually clear, which is why a decade of methodological literature keeps repeating it. The problem is not a deficiency of the data or of the software. It is a habit of reporting. Habits are modifiable.

Fit a random-effects model with Knapp–Hartung by default, and treat the equal-effects model as a special-case assumption to be justified, never a default.
Report all of $\tau^2$ , $I^2$ (with its confidence interval), and the 95% prediction interval, not $I^2$ alone. Give heterogeneity in clinical units, not just as a percentage.
Draw the prediction interval on every forest plot (addpred = TRUE). If a figure shows a pooled diamond without a prediction bar, it is incomplete.
Interpret the two intervals together. State explicitly when they disagree, as in "the average effect favours the device (CI 0.62 to 0.91), but a future study's true effect could plausibly range from benefit to harm (PI 0.35 to 1.61)." That sentence is more honest, and more useful, than either interval alone.
When $k$ is small, say so and use a corrected method. Do not let an unstable $\hat\tau^2$ masquerade as precision in either direction.
In a CER, frame the prediction interval as the bridge to PMCF, the quantitative expectation against which post-market data will be judged.

Conclusion

The confidence interval is not wrong. It is answering a different question from the one most readers think they are asking. It tells you how precisely you have located the average of a distribution of true effects. The prediction interval tells you how wide that distribution is, and therefore what the next study, the next centre, the next patient population should be expected to show. In a heterogeneous evidence base, the gap between the two is not a technicality. It is the difference between "we are confident about the past" and "we can forecast the future," and only the second is a clinical decision.

Worse, the two intervals routinely point in opposite directions: a "significant" pooled benefit sitting on top of a prediction interval that spans harm. When that happens, reporting only the confidence interval does not simplify the evidence. It overstates it. And the cost of doing better is a single argument in a single function call.

Report the spread, not just the centre. Draw the prediction interval on every forest plot. The reform is overdue, and it is free.

References

Borenstein, M., Higgins, J. P. T., Hedges, L. V., & Rothstein, H. R. (2017). Basics of meta-analysis: I² is not an absolute measure of heterogeneity. Research Synthesis Methods, 8(1), 5–18.
Higgins, J. P. T., Thompson, S. G., & Spiegelhalter, D. J. (2009). A re-evaluation of random-effects meta-analysis. Journal of the Royal Statistical Society: Series A, 172(1), 137–159.
IntHout, J., Ioannidis, J. P. A., & Borm, G. F. (2014). The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Medical Research Methodology, 14, 25.
IntHout, J., Ioannidis, J. P. A., Rovers, M. M., & Goeman, J. J. (2016). Plea for routinely presenting prediction intervals in meta-analysis. BMJ Open, 6(7), e010247.
Nagashima, K., Noma, H., & Furukawa, T. A. (2019). Prediction intervals for random-effects meta-analysis: A confidence distribution approach. Statistical Methods in Medical Research, 28(6), 1689–1702.
Partlett, C., & Riley, R. D. (2017). Random effects meta-analysis: Coverage performance of 95% confidence and prediction intervals following REML estimation. Statistics in Medicine, 36(2), 301–317.
Riley, R. D., Higgins, J. P. T., & Deeks, J. J. (2011). Interpretation of random effects meta-analyses. British Medical Journal, 342, d549.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.