Statistical Methods

Time Series Forecasting for Critical Systems: Building Reliable Predictions with ARIMAX

Dr. Marc Harms


Why This Matters

Infrastructure systems that support public health and safety require forecasting methods that balance predictive accuracy with methodological transparency. Gas storage management, pharmaceutical supply chains, hospital capacity planning, and regulatory compliance monitoring all share a common challenge: making forward-looking decisions under uncertainty while maintaining accountability to stakeholders.

This post examines the development of a forecasting system for German gas storage levels using ARIMAX (AutoRegressive Integrated Moving Average with eXogenous predictors) time series methodology. While the application context is energy infrastructure, the statistical framework and implementation principles transfer directly to healthcare capacity modeling, clinical trial planning, pharmaceutical inventory optimization, and post-market surveillance systems. The emphasis throughout is on transparent specification of modeling assumptions, rigorous out-of-sample validation, and honest quantification of forecast uncertainty.

The Forecasting Problem

Germany's gas storage infrastructure functions as a strategic buffer between supply and demand. Storage operators inject gas during periods of low consumption (typically April through October) and withdraw during peak heating season (October through March). This seasonal arbitrage stabilizes the system, but requires careful monitoring to prevent depletion during extended cold periods or supply disruptions.

The German Federal Network Agency (Bundesnetzagentur) defines two threshold levels in the gas emergency plan that trigger escalating regulatory and market interventions:

  • Alarmstufe (Alert Level): 15% aggregate storage fill
  • Notfallstufe (Emergency Level): 10% aggregate storage fill

The forecasting objective is to estimate daily aggregate fill levels over a 14-day horizon, with probabilistically calibrated prediction intervals and threshold breach probabilities that support risk assessment by policymakers and infrastructure operators.

Methodological Challenges

Several characteristics of this forecasting problem are representative of broader infrastructure and healthcare capacity applications:

First, the outcome depends on multiple drivers operating at different timescales. Heating demand responds to temperature on hourly timescales but exhibits weekly business cycle patterns and annual seasonal patterns. Market prices reflect forward-looking expectations and adjust in milliseconds, while physical import capacity changes over quarters and years. A credible forecasting model must integrate these heterogeneous inputs coherently.

Second, critical forecast inputs are themselves uncertain. Next week's temperature is unknown. Import levels and market prices evolve stochastically. The model must propagate this input uncertainty through to forecast uncertainty, rather than treating future conditions as deterministic.

Third, historical data provides limited information about tail scenarios. Aggregate storage levels below 20% are rare in the training sample, making it difficult to estimate model behavior in precisely the scenarios where accurate forecasts matter most. This requires careful treatment of extrapolation uncertainty.

Fourth, stakeholder decision-making requires probability statements, not just point forecasts. A regulator needs to know "What is the probability of breaching the alert threshold by day 10?" not merely "What is the expected fill level on day 10?" This necessitates proper calibration of forecast intervals through out-of-sample validation.

These challenges are not unique to energy systems. Hospital capacity planners face analogous problems with seasonal respiratory illness patterns, uncertain admission rates, and rare surge scenarios. Pharmaceutical inventory managers must forecast demand driven by prescription patterns, demographic shifts, and stochastic health shocks. The ARIMAX framework provides a principled approach to integrating autoregressive temporal structure with exogenous predictors while maintaining tractability and interpretability.

The ARIMAX Framework

ARIMAX stands for AutoRegressive Integrated Moving Average with eXogenous predictors. Understanding this acronym provides insight into how the model integrates different sources of information to generate forecasts.

Conceptual Foundation

Time series forecasting methods exploit temporal dependence: the value observed today contains information about likely values tomorrow. However, purely autoregressive models that rely solely on past observations of the target variable often fail to capture important external influences. The ARIMAX framework extends basic time series models by explicitly incorporating exogenous predictors while maintaining the autoregressive error structure that captures residual temporal patterns.

The four components work together as follows:

AutoRegressive (AR) structure models persistence in the target variable. If gas storage typically decreases by 0.3 percentage points per day during winter, this systematic pattern appears in the autoregressive terms. The AR component captures the inertia in the system: tomorrow's change will likely resemble today's change, adjusted for known external drivers.

Integrated (I) refers to the differencing transformation applied to achieve stationarity. Raw storage levels exhibit strong seasonal trends: they rise systematically from April through October and fall from October through March. This non-stationarity complicates statistical modeling because the distributional properties change over time. First-differencing transforms the raw level series into a series of daily changes, which fluctuate around a seasonal mean but do not exhibit persistent trends. The term "integrated" reflects that the level series is the integral (cumulative sum) of the differenced series.

Moving Average (MA) components model the propagation of forecast errors. If yesterday's unexpected weather shock caused larger-than-expected withdrawal, today's forecast adjusts based on that recent error. While conceptually important, the MA component is set to zero in the current specification (hence ARIMA(7,0,0) rather than ARIMA(7,0,q) for q>0), relying entirely on the AR structure to capture temporal dependence. This design choice reflects empirical model selection based on information criteria and out-of-sample forecast performance.

eXogenous predictors (X) incorporate external information that influences the target variable but is not directly predicted by the model. These include temperature (affecting heating demand), import flows (affecting supply), and market prices (reflecting economic incentives and forward-looking expectations). The exogenous variables shift the conditional mean of the AR process based on current conditions, while the AR terms capture deviations from this conditional mean due to system-specific dynamics.

Formal Specification

The model for the daily change in storage fill level takes the form:

$$\Delta y_t = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \cdots + \beta_{10} X_{10,t} + \phi_1 \Delta y_{t-1} + \phi_2 \Delta y_{t-2} + \cdots + \phi_7 \Delta y_{t-7} + \varepsilon_t$$

where $\Delta y_t$ represents the daily change in fill level ($y_t$ minus $y_{t-1}$), the $X$ variables are the exogenous predictors measured at time $t$, the $\phi$ coefficients capture autoregressive dependencies extending seven days into the past, and $\varepsilon_t$ represents random variation not explained by the systematic components.

The seven-lag autoregressive structure reflects the weekly operational cycle in gas storage management. Natural gas markets and industrial consumption patterns exhibit strong day-of-week effects: industrial demand falls on weekends, trading activity concentrates on business days, and operational schedules at storage facilities follow weekly rhythms. An AR(7) specification allows the model to capture these weekly patterns implicitly through the temporal correlation structure.

To generate forecasts of the storage level rather than just the daily change, the differenced predictions are accumulated:

$$\hat{y}_{t+h} = y_t + \sum_{j=1}^{h} \Delta\hat{y}_{t+j}$$

where $\hat{y}_{t+h}$ denotes the $h$-step-ahead forecast of the level, $y_t$ is the last observed level, and the summation aggregates the $h$ forecasted daily changes. The ARIMAX forecasting algorithm properly propagates the autoregressive error structure by using the most recent seven observed changes to initialize the AR dynamics.

Building the Model: Six Key Steps

Step 1: Choose What to Forecast

We forecast the daily change in fill level (in percentage points), not the level itself. Why?

Raw storage levels trend upward in summer and downward in winter—non-stationary. Daily changes are stationary: they fluctuate around a seasonal pattern but don't persistently trend up or down. Stationary data is easier to model reliably.

Mathematical translation:

$$\Delta y_t = y_t - y_{t-1}$$

where $y_t$ is the fill level on day $t$.

To get the 14-day-ahead level forecast:

$$\hat{y}_{t+14} = y_t + \sum_{j=1}^{14} \Delta\hat{y}_{t+j}$$
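In code, the differencing and re-accumulation are one line each. A minimal pandas/numpy sketch with made-up fill levels:

```python
import numpy as np
import pandas as pd

# Hypothetical observed fill levels (percent of capacity)
levels = pd.Series([62.0, 61.7, 61.3, 61.0],
                   index=pd.date_range("2024-01-01", periods=4))

# The modeling target: daily changes (first differences)
delta = levels.diff().dropna()              # -0.3, -0.4, -0.3

# Given forecasted daily changes, accumulate back to level forecasts
delta_hat = np.array([-0.35, -0.30, -0.25])
level_hat = levels.iloc[-1] + np.cumsum(delta_hat)
# level_hat: 60.65, 60.35, 60.10 (1-, 2- and 3-day-ahead levels)
```

The cumulative sum is why the component is called "integrated": the level series is recovered by integrating the forecasted changes.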

Step 2: Gather Data from Multiple Sources

Good forecasting requires comprehensive data. We integrate five public data streams:

| Data Source | What It Tells Us | Update Frequency |
| --- | --- | --- |
| AGSI+ (Gas Storage) | Current fill levels, daily injection/withdrawal | Daily (official storage registry) |
| Open-Meteo (Weather) | Temperature, heating degree days, 16-day forecast | Daily, with forecast |
| ENTSO-G (Gas Flows) | Cross-border gas imports into Germany | Daily (transparency platform) |
| ENTSO-E (Power) | Gas-fired power generation (demand signal) | Daily |
| TTF Futures (Market) | Gas price expectations (economic signal) | Daily (trading days) |

Critical design choice: All data fetching happens server-side. The public dashboard never makes external API calls from user browsers (GDPR compliance, no tracking).

Step 3: Engineer Features from Raw Data

Raw data rarely tells the full story. We engineer 10 features that capture different aspects of the system:

Seasonal Patterns

  • sin(day_of_year), cos(day_of_year): Mathematical way to capture the annual injection/withdrawal cycle
  • is_weekend: Industrial gas demand drops on weekends

Weather-Driven Demand

  • HDD (Heating Degree Days): max(0, 18°C - temperature). Cold days drive heating demand.
  • HDD_rolling7: 7-day average HDD. Sustained cold spells matter more than single cold days (buildings have thermal inertia).
  • HDD²: Extreme cold has disproportionate impact: a -10°C day drives more than twice the withdrawal of a -5°C day.

Supply Dynamics

  • total_daily_imports: Physical gas flowing into Germany from pipelines
  • import_shock: Binary flag when imports drop > 2 standard deviations below the 90-day average (supply disruption signal)

Market Signals

  • TTF_price: Front-month gas futures price. Higher prices incentivize storage withdrawal (sell high); lower prices incentivize injection (buy low).

Demand Signals

  • gas_power_generation: Gas-fired electricity production. High generation = high gas consumption.

Why these features? Each passed statistical tests:

  • Low multicollinearity (VIF < 3.0 for all features)
  • TTF price Granger-causes storage changes (p = 0.020), meaning prices provide forward-looking information
  • Quadratic HDD term is highly significant (p < 0.001)
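A sketch of how features like these could be derived with pandas. The column names (`date`, `temp_c`) and the frame layout are illustrative, not the production schema:

```python
import numpy as np
import pandas as pd

def engineer_features(df):
    """Derive seasonal, calendar and weather features from a frame with
    columns 'date' (datetime) and 'temp_c' (daily mean temperature)."""
    out = pd.DataFrame(index=df.index)
    doy = df["date"].dt.dayofyear
    # Fourier terms for the annual injection/withdrawal cycle
    out["sin_doy"] = np.sin(2 * np.pi * doy / 365.25)
    out["cos_doy"] = np.cos(2 * np.pi * doy / 365.25)
    out["is_weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)
    # Heating degree days: max(0, 18°C - temperature)
    out["HDD"] = np.maximum(0.0, 18.0 - df["temp_c"])
    out["HDD_rolling7"] = out["HDD"].rolling(7, min_periods=1).mean()
    out["HDD_sq"] = out["HDD"] ** 2     # extreme-cold nonlinearity
    return out

# Sat, Sun, Mon with a cold snap followed by a mild day
df = pd.DataFrame({"date": pd.date_range("2024-01-06", periods=3),
                   "temp_c": [-5.0, 2.0, 20.0]})
feats = engineer_features(df)
```

Note how the 20°C day produces HDD = 0: warm days contribute nothing to heating demand, which is exactly the asymmetry the max(0, ·) transform encodes.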

Step 4: Specify the ARIMAX Model

The model takes this form:

$$\begin{aligned}
\Delta y_t = \;& \beta_0 \\
&+ \beta_1 \sin(\text{day\_of\_year})_t + \beta_2 \cos(\text{day\_of\_year})_t \\
&+ \beta_3 \text{HDD}_t + \beta_4 \text{HDD\_rolling7}_t + \beta_5 \text{HDD}^2_t \\
&+ \beta_6 \text{is\_weekend}_t \\
&+ \beta_7 \text{TTF\_price}_t \\
&+ \beta_8 \text{imports}_t + \beta_9 \text{gas\_generation}_t + \beta_{10} \text{import\_shock}_t \\
&+ \phi_1 \Delta y_{t-1} + \phi_2 \Delta y_{t-2} + \cdots + \phi_7 \Delta y_{t-7} \\
&+ \varepsilon_t
\end{aligned}$$

The AR(7) structure (using 7 past days) captures weekly patterns: 5 business days + 2 weekend days create a 7-day cycle in storage operations.

Model estimation: Maximum likelihood estimation using Python's statsmodels library, fitting on ~2,960 daily observations (2018-present).

Step 5: Validate Model Specification and Forecast Performance

Model development involves two distinct validation exercises: diagnostic assessment of the fitted model to verify that statistical assumptions hold, and out-of-sample evaluation to assess genuine predictive performance. Both are essential; a model that satisfies all diagnostic checks may still forecast poorly if fitted on an unrepresentative sample or if structural change has occurred.

Diagnostic Assessment of Model Specification

After maximum likelihood estimation on the training sample, we examine residuals to verify that model assumptions are approximately satisfied. The standardized residuals should behave as independent, identically distributed observations with mean zero and constant variance if the model is correctly specified.

The Ljung-Box portmanteau test evaluates residual autocorrelation across multiple lags simultaneously. The null hypothesis states that the first k autocorrelations are jointly zero, which would be expected if the AR(7) structure has adequately captured temporal dependence. For this application, the test statistic Q(20) equals 19.83 with p-value 0.47, providing no evidence against the null hypothesis. This indicates that the seven autoregressive lags successfully model the temporal correlation structure without leaving systematic patterns in the residuals.

The Jarque-Bera test assesses whether residuals exhibit skewness and kurtosis consistent with a normal distribution. While ARIMAX estimation via maximum likelihood remains consistent under non-normality, the validity of standard errors and prediction intervals depends on approximate normality. The test yields p-value 0.14, indicating no strong departure from normality. Some deviation is expected in finite samples, particularly if extreme events (supply disruptions) generate occasional large residuals.

Heteroscedasticity testing via the Breusch-Pagan approach regresses squared residuals on the exogenous predictors to detect systematic patterns in the conditional variance. The test produces p-value 0.03, suggesting mild heteroscedasticity. This is not surprising in a system where volatility itself varies with the season (winter exhibits higher variability than summer). While problematic for standard errors in regression contexts, the impact on ARIMAX forecast intervals is partially addressed through the conditional variance adjustments described in Section 6 below.

Multicollinearity among exogenous predictors is assessed via variance inflation factors. All VIF values remain below 3.0, well within the conventional threshold (VIF < 10) for acceptable collinearity. This confirms that the predictors provide complementary rather than redundant information. Early model iterations included a seven-day rolling average of temperature, which exhibited VIF exceeding 25 due to high correlation with the HDD rolling average; this prompted its removal in favor of the HDD-based specification.

Out-of-Sample Validation via Rolling-Origin Forecasting

In-sample diagnostics confirm model coherence but provide no information about genuine forecast accuracy. A model can fit historical data arbitrarily well through overfitting while failing to generalize to new observations. Proper validation requires evaluating forecasts that were not used in model estimation, simulating the true forecasting task where the model must predict unknown future values.

We implement rolling-origin validation as follows. Starting from a cutoff date t₀, we fit the model using all available data up to that date. We then generate forecasts for days t₀+1 through t₀+14 using only information available at t₀. We advance the cutoff by seven days to t₁=t₀+7, refit the model on data through t₁, and generate new 14-day-ahead forecasts. This process repeats across the validation period, generating more than 100 non-overlapping forecast windows.

For each forecast window, we compute prediction errors once actual values are observed. The mean error across all windows measures systematic bias: a positive mean indicates that forecasts systematically underpredict, while a negative mean indicates overprediction. The obtained mean error of -0.006 percentage points is effectively zero, indicating unbiased forecasts. This is essential for regulatory applications where systematic bias could lead to inappropriate policy responses.

More importantly, we evaluate prediction interval coverage: the proportion of realized values that fall within the stated confidence intervals. A well-calibrated 80% interval should contain the actual value approximately 80% of the time across many forecast windows. Systematic under-coverage indicates that stated uncertainty is too narrow, leading stakeholders to underestimate forecast risk. Systematic over-coverage is less problematic but represents wasted information.

The 80% prediction interval (P10 to P90 quantiles) achieves 79.8% empirical coverage across validation windows, nearly exactly matching the nominal level. The 90% interval (P5 to P95) achieves 88.9% coverage, slightly below the target but acceptably close given finite-sample variability. These results confirm that the uncertainty quantification framework (described in Step 6) successfully calibrates forecast intervals to reflect genuine forecast uncertainty, not merely model-based standard errors.
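A simplified sketch of the rolling-origin loop. To stay self-contained it substitutes a trailing-mean forecaster with a Gaussian 80% interval for the full ARIMAX refit, but the window mechanics and the bias/coverage metrics are the same:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 1000)   # stand-in for observed daily changes

horizon, step = 14, 7
errors, covered, total = [], 0, 0

# Rolling origin: fit on data up to t0, score the next `horizon` days,
# then advance the origin by `step` days and repeat
for t0 in range(200, len(y) - horizon, step):
    train, test = y[:t0], y[t0:t0 + horizon]
    mu, sigma = train.mean(), train.std()              # naive "model fit"
    lo, hi = mu - 1.2816 * sigma, mu + 1.2816 * sigma  # nominal 80% interval
    errors.extend(test - mu)
    covered += np.sum((test >= lo) & (test <= hi))
    total += horizon

mean_error = np.mean(errors)   # systematic bias: should be near zero
coverage = covered / total     # empirical coverage: should be near 0.80
```

The same loop structure applies with the real model: replace the trailing-mean fit with a SARIMAX refit on `train` and the Gaussian interval with the model's forecast intervals.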

Step 6: Quantify Uncertainty Properly

Standard forecasting models underestimate uncertainty. Why? They assume:

  • Future temperatures/imports/prices are known (they're not)
  • Model parameters are perfectly estimated (they're not)
  • No rare supply disruptions occur (they do)

We implement three layers of uncertainty:

Layer 1: Model-Based Uncertainty

Standard ARIMAX forecast intervals, accounting for autoregressive error propagation.

Layer 2: Conditional Variance Adjustments

Reality is messier than models assume. We inflate forecast uncertainty based on:

  • Current fill level: Lower storage → less buffer → higher uncertainty
  • Season: Winter (Oct-Mar) is more volatile than summer
  • Forecast horizon: Day 14 is much more uncertain than day 1

Calibration multiplier: 2.90× (empirically derived from historical validation)

This might seem like cheating, but it's intellectual honesty: acknowledging that our model doesn't capture everything.
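One way such a multiplier can be derived: scale the model's interval half-widths until empirical coverage on validation data reaches the nominal level. The numbers below are synthetic, chosen so the exercise lands near the reported 2.90:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in validation set: model-based half-widths that are systematically
# too narrow relative to the realized forecast errors
n = 2000
model_halfwidth = np.full(n, 0.5)            # model claims ±0.5 at 80%
realized_error = rng.normal(0.0, 1.13, n)    # true error scale is larger

def coverage(multiplier):
    """Fraction of errors inside the inflated interval."""
    return np.mean(np.abs(realized_error) <= multiplier * model_halfwidth)

# Smallest multiplier whose empirical coverage reaches the nominal 80%
grid = np.arange(1.0, 5.0, 0.01)
calibration = next(m for m in grid if coverage(m) >= 0.80)
```

In production the same search runs over the rolling-validation errors and half-widths, per horizon if desired, rather than over synthetic draws.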

Layer 3: Tail Event Injection (Monte Carlo Simulation)

Supply disruptions happen (pipeline shutdowns, geopolitical events). Historical data from 2018-present may not include events as severe as future risks.

We simulate 10,000 possible futures, injecting low-probability supply shocks:

  • Probability: 4% per forecast path
  • Magnitude: -0.3 to -0.7 percentage points per day (sustained high withdrawal)
  • Duration: 3-7 days

From these 10,000 simulations, we extract percentiles:

  • P50 (median): Most likely outcome
  • P25-P75: Likely range
  • P10-P90: Broader uncertainty
  • P5-P95: Extreme scenarios

This gives us threshold breach probabilities: "35% chance of dropping below 15% on day 12."
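A sketch of the tail-injection simulation. The shock parameters follow the bullets above, while the baseline drift and starting fill level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, horizon = 10_000, 14
last_level = 22.0      # hypothetical current fill (%)

# Baseline daily changes: winter drawdown plus noise (illustrative values)
base = rng.normal(-0.4, 0.25, size=(n_paths, horizon))

# Tail event injection: 4% of paths receive a sustained supply shock
shocked = rng.random(n_paths) < 0.04
for i in np.where(shocked)[0]:
    start = rng.integers(0, horizon - 3)
    duration = rng.integers(3, 8)        # 3-7 days
    extra = rng.uniform(-0.7, -0.3)      # extra drawdown per day
    base[i, start:start + duration] += extra

# Accumulate changes into level trajectories, then summarize
levels = last_level + np.cumsum(base, axis=1)
p = {q: np.percentile(levels, q, axis=0) for q in (5, 10, 25, 50, 75, 90, 95)}

# Threshold breach probability: fraction of paths ever below 15%
breach_alert = (levels < 15.0).any(axis=1).mean()
```

The percentile arrays in `p` feed the fan chart directly, and `breach_alert` is exactly the kind of statement regulators ask for.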

Forecasting Future Variables: The Known-Unknown Problem

To forecast storage 14 days ahead, we need values for all 10 predictors on all future days. But we don't know next week's imports or prices.

Our approach:

  • Temperature/HDD: Use Open-Meteo's 16-day weather forecast (meteorologists are pretty good at this)
  • Day-of-year, weekend: Deterministic—we know future calendar dates
  • Imports, power generation: Assume they continue at their trailing 30-day average (mean reversion)
  • TTF price: Assume trailing 30-day average (market efficiency)
  • Import shock probability: Use trailing 180-day empirical rate

Trade-off acknowledged: Mean reversion assumptions lag turning points. We address this with scenario analysis (see below).
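A sketch of assembling the future exogenous frame under these assumptions; the function and column names are illustrative, not the production schema:

```python
import numpy as np
import pandas as pd

def build_future_exog(hist, weather_forecast, horizon=14):
    """Assemble the 14-day exogenous frame: weather from the provider's
    forecast, calendar features deterministically, everything else by
    mean reversion to trailing averages."""
    future = pd.DataFrame(index=range(horizon))
    future["HDD"] = weather_forecast[:horizon]                # forecast-driven
    future["imports"] = hist["imports"].tail(30).mean()       # mean reversion
    future["TTF_price"] = hist["TTF_price"].tail(30).mean()   # mean reversion
    dates = pd.date_range(hist["date"].iloc[-1], periods=horizon + 1)[1:]
    future["is_weekend"] = (dates.dayofweek >= 5).astype(int) # deterministic
    return future

hist = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60),
    "imports": np.linspace(100, 90, 60),
    "TTF_price": np.full(60, 30.0),
})
fut = build_future_exog(hist, weather_forecast=np.full(16, 12.0))
```

The frame `fut` is what gets passed as the `exog` argument when generating the 14-step forecast.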

Communicating Results: From Statistics to Decisions

Technical rigor means nothing if stakeholders can't understand the forecast.

Fan Charts

Visual representation showing:

  • Median forecast (central line)
  • 50% confidence band (P25-P75, darker shading)
  • 80% confidence band (P10-P90, medium shading)
  • 90% confidence band (P5-P95, light shading)

The widening fan visually communicates increasing uncertainty over time.
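A minimal matplotlib sketch of such a fan chart, using illustrative percentile bands derived from Gaussian z-scores rather than the Monte Carlo output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")   # headless backend; no display needed
import matplotlib.pyplot as plt

days = np.arange(1, 15)
# Illustrative bands; in practice these come from the Monte Carlo percentiles
p50 = 22.0 - 0.4 * days                    # median path drifting down
spread = 0.3 * np.sqrt(days)               # uncertainty widening with horizon
bands = {90: 1.645, 80: 1.282, 50: 0.674}  # nominal level -> Gaussian z-score

fig, ax = plt.subplots()
# Draw the widest band first (lightest), narrower bands on top (darker)
for (nominal, z), alpha in zip(sorted(bands.items(), reverse=True),
                               (0.15, 0.30, 0.50)):
    ax.fill_between(days, p50 - z * spread, p50 + z * spread,
                    color="tab:blue", alpha=alpha, label=f"{nominal}% band")
ax.plot(days, p50, color="tab:blue", label="median forecast")
ax.axhline(15.0, color="red", linestyle="--", label="Alarmstufe (15%)")
ax.set_xlabel("forecast day")
ax.set_ylabel("fill level (%)")
ax.legend()
```

Drawing order matters: painting the widest band first and layering narrower, darker bands on top is what produces the characteristic fan.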

Threshold Breach Probabilities

For each forecast day:

  • "P(Fill < 15%)": Probability of alert level breach
  • "P(Fill < 10%)": Probability of emergency level breach

Calculated directly from the 10,000 Monte Carlo trajectories: count how many paths breach the threshold, divide by 10,000.

Scenario Analysis

Stakeholders can explore "what-if" scenarios:

  • "What if TTF prices spike 30%?" → Accelerated withdrawal (high prices incentivize selling stored gas)
  • "What if imports drop 20%?" → Depleted storage (less new supply)
  • "What if next week is 3°C colder than forecasted?" → Higher heating demand

Each scenario regenerates forecasts on the fly, showing sensitivity to key drivers.

What This Doesn't Do (Limitations)

Every model has boundaries. Ours include:

Structural Limitations

  1. Linear relationships assumed: We include HDD² for nonlinearity, but complex interactions (e.g., "cold weather + supply disruption") aren't explicitly modeled
  2. No regime switching: The model doesn't distinguish winter from summer with different parameter sets (seasonality is captured through Fourier terms)
  3. Germany-specific: Cross-country dynamics and EU-wide balancing aren't modeled

Forecast Assumption Limitations

  1. Mean reversion for prices/imports: May miss turning points during market shocks
  2. Weather forecast dependency: Beyond 16 days, temperature assumptions degrade
  3. No ensemble weather forecasts: We use a single weather forecast, not probability distributions

Data Limitations

  1. Reporting lag: Official storage data arrives 1-2 days delayed
  2. Historical coverage: Import data only starts in 2019
  3. Weekend gaps: Market prices unavailable on weekends (forward-filled)

Scope Limitations

  1. Not investment advice: Informational/educational purposes only
  2. No political modeling: Government interventions (demand curtailment, strategic reserves) not captured
  3. No LNG terminal details: New import facilities captured only implicitly through flow data

Why list limitations explicitly? Transparency builds trust. Stakeholders need to know what the model can and cannot do.

Methodological Principles for Applied Time Series Forecasting

The development of this forecasting system reinforces several principles that generalize beyond the specific application domain. These insights emerged through iterative model refinement and are relevant to forecasting problems in healthcare capacity planning, pharmaceutical supply chain management, and regulatory compliance monitoring.

Domain Knowledge as Foundation for Feature Engineering

The inclusion of exogenous predictors in the ARIMAX framework creates an opportunity to incorporate substantive understanding of system behavior. A purely data-driven approach that regresses outcomes on lagged values alone would miss important predictive information. For example, the relationship between temperature and heating demand is nonlinear: extreme cold drives disproportionate consumption relative to moderate cold. This relationship, known from physics and engineering, motivates the inclusion of the quadratic HDD term. Similarly, understanding that industrial gas consumption follows a weekly business cycle justifies the seven-lag autoregressive structure rather than selecting lag length purely through statistical criteria.

The TTF price variable illustrates a more subtle point about forward-looking information. Physical import flows and generation levels represent realized supply and demand, which are determined simultaneously with storage changes. Market prices, however, incorporate expectations about future conditions: an anticipated supply disruption appears in forward prices before it manifests in physical flows. The Granger causality test confirms that TTF prices at lag 4 significantly predict storage changes after controlling for contemporaneous physical variables, validating the hypothesis that market prices contain independent predictive content. This would not be discoverable without understanding the economic structure of gas markets.

The Necessity of Out-of-Sample Validation

Statistical software routinely reports in-sample fit metrics: R-squared, Akaike information criteria, log-likelihood values. These measures quantify how well the model explains historical data used in estimation. However, the forecasting task requires predicting new data not used in estimation, which is fundamentally different from explaining known data. A model can achieve perfect in-sample fit through arbitrary complexity while failing catastrophically at genuine prediction.

Rolling-origin validation addresses this by enforcing temporal separation between training and test data. At each validation window, the model is estimated using only past information and evaluated on truly future data. This simulates the actual forecasting task and reveals problems that in-sample diagnostics cannot detect. For instance, if a structural relationship changed midway through the sample (as occurred with gas markets following the 2022 supply disruption), rolling validation would show degrading forecast performance even if full-sample fit remains excellent.

The distinction between validation and testing is also important. The calibration multiplier that inflates prediction intervals (described in the next section) was itself tuned using validation data. A final held-out test set, never examined during model development or calibration, would provide a more conservative assessment of true forecasting performance. In production systems where model parameters update automatically, this train-validate-test partition becomes essential to prevent adaptation to recent anomalies.

Uncertainty Calibration Beyond Model-Based Standard Errors

Standard time series forecasting theory provides formulas for prediction interval width as a function of horizon, estimated parameters, and residual variance. These formulas assume that the model is correctly specified, that parameters are estimated without error (asymptotically), and that future conditions follow the same distribution as historical data. None of these assumptions holds exactly in practice.

The empirically derived calibration multiplier of 2.90 acknowledges these departures from ideal conditions. This multiplier was determined by comparing model-based prediction interval coverage to actual coverage in rolling validation, then adjusting until nominal and empirical coverage aligned. While pragmatic, this approach raises a philosophical question: have we constructed a model or merely fit a black-box forecaster to historical patterns?

The critical distinction lies in decomposition of uncertainty sources. The baseline ARIMAX model captures parameter uncertainty and autoregressive error propagation. The conditional variance adjustments account for regime-dependent volatility (winter versus summer, low fill versus high fill). The Monte Carlo tail event injection represents uncertainty about rare supply disruptions not well-represented in recent data. Each layer addresses a specific limitation of the baseline model with a defensible modeling choice. The overall calibration multiplier then corrects for any residual underestimation of uncertainty due to factors not explicitly modeled.

This hierarchical approach to uncertainty quantification differs from purely empirical methods (such as conformal prediction) that make minimal structural assumptions. The advantage of the structured approach is interpretability: stakeholders can understand what drives forecast uncertainty and how different scenarios affect predictions. The disadvantage is the need for domain-specific modeling choices at each layer. Neither approach is universally superior; the choice depends on whether stakeholder decision-making requires mechanistic insight or merely calibrated probabilities.

Applications Beyond Energy

The ARIMAX framework—classical time series modeling with domain-informed features, rigorous validation, and transparent uncertainty—generalizes to many business problems:

Healthcare

  • Hospital bed occupancy: Seasonal patterns (flu season), day-of-week effects, external drivers (COVID-19 cases, vaccination rates)
  • ICU capacity: Lead time for patient deterioration, demographic factors, staffing constraints
  • Pharmaceutical demand: Prescription trends, demographic shifts, policy changes

Supply Chain

  • Inventory levels: Demand forecasts, lead times, supplier reliability
  • Delivery times: Weather disruptions, holiday patterns, carrier performance
  • Stockout probabilities: Safety stock optimization under uncertainty

Regulatory Compliance

  • Adverse event rates: Product usage patterns, demographic factors, regulatory reporting lags
  • Quality metrics: Process control, batch-to-batch variation, environmental factors
  • Vigilance monitoring: Post-market surveillance for medical devices

Clinical Trials

  • Enrollment forecasting: Site activation timelines, seasonal variation, competing trials
  • Endpoint event rates: Background disease progression, treatment effects, loss-to-follow-up
  • Resource planning: Monitoring visit schedules, laboratory capacity

The principles remain constant:

  1. Understand system drivers (domain knowledge)
  2. Engineer informative features
  3. Model temporal dependencies (AR structure)
  4. Validate out-of-sample
  5. Calibrate uncertainty
  6. Communicate transparently

Technical Implementation Notes

For technical readers, key implementation details:

Software Stack

  • Python 3.11: Core language
  • statsmodels: ARIMA estimation and forecasting
  • pandas: Time series data manipulation
  • numpy: Numerical computation, Monte Carlo simulation
  • pytest: Comprehensive testing (70+ unit tests)

Model Estimation

  • Method: Maximum Likelihood Estimation
  • Optimizer: Default statsmodels optimizer (BFGS)
  • Sample size: ~2,960 daily observations (2018-present, after differencing)
  • Estimation time: ~2 seconds on standard server

Production Pipeline

  1. Data fetch (5 API calls, parallelized, ~10 seconds)
  2. Feature engineering (~0.5 seconds)
  3. Model re-estimation (weekly, ~2 seconds)
  4. Forecast generation with Monte Carlo (~3 seconds)
  5. JSON serialization for dashboard (<0.1 seconds)

Total end-to-end runtime: ~15 seconds for fresh forecast.

Testing Strategy

  • Unit tests: Feature engineering, risk calculations, data transformations
  • Integration tests: Multi-source data merge, missing data handling
  • Contract tests: Dashboard JSON schema validation
  • Coverage tests: Interval calibration validation

All tests run automatically on each code commit (CI/CD via GitHub Actions).

When to Use ARIMAX (and When Not To)

ARIMAX Works Well When:

✅ You have regular time series data (daily, weekly, monthly observations)
✅ Historical patterns are informative about the future
✅ You can identify relevant external predictors
✅ You need transparent, explainable forecasts
✅ Stakeholders need uncertainty quantification
✅ Regulatory scrutiny requires documented methodology

Consider Alternatives When:

❌ Data is highly irregular or sparse (events happen randomly)
❌ Structural breaks make history irrelevant (war, pandemic, major policy shift)
❌ You have massive datasets where flexible machine-learning models outperform classical approaches
❌ Stakeholders only need point forecasts, not uncertainty
❌ Real-time streaming predictions required (millisecond latency)

Concluding Observations

The forecasting system described here represents an application of classical statistical methodology to a contemporary infrastructure monitoring problem. Several aspects of the implementation warrant emphasis because they generalize to other forecasting contexts in healthcare, pharmaceuticals, and regulatory compliance.

First, the integration of autoregressive temporal structure with theory-informed exogenous predictors provides a principled framework for incorporating both statistical patterns and substantive knowledge. The autoregressive component captures system-specific dynamics that persist across time, while the exogenous predictors translate external conditions into shifts in the conditional distribution of outcomes. This hybrid approach exploits the complementary strengths of statistical learning from data and mechanistic reasoning about causal drivers.

Second, the multi-layered approach to uncertainty quantification acknowledges that forecast uncertainty has multiple sources. Model-based prediction intervals capture parameter uncertainty and autoregressive error propagation under the assumption of correct specification. Conditional variance adjustments recognize that volatility varies systematically with observables like season and current system state. Tail event injection through Monte Carlo simulation addresses the fundamental problem that rare but consequential events are by definition poorly represented in historical training data. Each layer targets a specific limitation with a defensible modeling strategy.
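The tail-injection idea can be illustrated in a few lines of Monte Carlo. All numbers below (starting level, drift, shock size, cold-snap probability) are assumptions for demonstration, not the production calibration: on each simulated day a rare event draws an extra withdrawal, fattening the lower tail beyond what Gaussian model errors alone would produce.

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, horizon = 10_000, 30
start_level = 25.0      # % fill, hypothetical starting point
daily_drift = -0.35     # mean daily withdrawal in percentage points (assumed)
sigma = 0.20            # model-error standard deviation (assumed)
p_cold_snap = 0.02      # daily probability of a tail event (assumed)
snap_draw = -0.8        # extra withdrawal during a cold snap (assumed)

# Gaussian model errors plus rare injected cold-snap shocks
shocks = rng.normal(daily_drift, sigma, size=(n_paths, horizon))
snaps = rng.random((n_paths, horizon)) < p_cold_snap
shocks = shocks + snaps * snap_draw

# Simulated storage paths and probability of breaching the 15% Alarmstufe
levels = start_level + np.cumsum(shocks, axis=1)
p_alarm = (levels.min(axis=1) <= 15.0).mean()
```

The threshold-breach probability then comes directly from the simulated paths rather than from a closed-form interval, which is what allows asymmetric, fat-tailed risk to enter the forecast.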

Third, the emphasis on out-of-sample validation reflects the distinction between explaining historical data and predicting future data. In-sample diagnostics confirm model coherence and identify misspecification, but cannot evaluate genuine predictive performance. Rolling-origin validation enforces temporal separation between training and evaluation, revealing how the model performs in its intended use case. The close alignment between nominal and empirical prediction interval coverage demonstrates successful calibration, which is essential for applications where stakeholders make decisions based on stated uncertainty levels.
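Rolling-origin validation reduces to a short loop. The sketch below uses a naive last-value forecaster as a stand-in for the ARIMAX model, purely to keep the example self-contained; the structure (expanding training window, strictly out-of-sample scoring at each origin) is the point.

```python
import numpy as np

def rolling_origin_mae(series, initial_window, horizon):
    """Mean absolute error of h-step forecasts across rolling origins."""
    errors = []
    for origin in range(initial_window, len(series) - horizon + 1):
        train = series[:origin]                    # only data available at the origin
        forecast = np.repeat(train[-1], horizon)   # naive stand-in for the fitted model
        actual = series[origin:origin + horizon]
        errors.append(np.abs(forecast - actual).mean())
    return float(np.mean(errors))

# Random-walk test series, for which the naive forecaster is well suited
rng = np.random.default_rng(7)
series = np.cumsum(rng.normal(0, 1, 200))
mae = rolling_origin_mae(series, initial_window=100, horizon=7)
```

Replacing the naive line with a model re-fit at each origin gives the honest out-of-sample error that in-sample diagnostics cannot provide.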

Fourth, transparent documentation of data sources, feature engineering rationale, model specification, validation procedures, and limitations serves multiple purposes. It enables technical review by statisticians and domain experts. It allows replication and extension by other researchers. It builds stakeholder confidence by demonstrating that modeling choices are defensible rather than arbitrary. And it prevents misuse by clearly delineating the scope of applicability and known weaknesses.

The forecasting framework developed here for German gas storage monitoring has direct analogues in clinical and regulatory contexts. Hospital capacity models must integrate autoregressive bed occupancy patterns with exogenous predictors like seasonal disease prevalence and demographic factors. Pharmaceutical inventory forecasting requires modeling prescription patterns while accounting for formulary changes and demographic shifts. Post-market surveillance systems must detect adverse event rate changes while controlling for confounding trends in product usage patterns. In each case, the methodological principles demonstrated here—theory-informed feature engineering, proper uncertainty quantification, rigorous validation, and transparent communication—provide a template for credible applied forecasting.


Need help with time series forecasting, clinical trial planning, or evidence synthesis for your organization?

I provide independent consulting on:

  • Statistical modeling and forecasting
  • Clinical evaluation and evidence synthesis for medical devices (EU MDR)
  • Post-market surveillance and vigilance analytics
  • Bayesian methods and uncertainty quantification
  • R&D portfolio optimization

Contact: marc@mh-analytics.eu
Web: www.mh-analytics.eu


Further Reading

Textbooks:

  • Box, G.E.P., Jenkins, G.M., Reinsel, G.C., & Ljung, G.M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley.
  • Hyndman, R.J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). Free online: https://otexts.com/fpp3/

Technical References:

  • Ljung, G.M. & Box, G.E.P. (1978). On a measure of lack of fit in time series models. Biometrika, 65(2), 297–303.
  • Breusch, T.S. & Pagan, A.R. (1979). A simple test for heteroscedasticity. Econometrica, 47(5), 1287–1294.

Software Documentation:

Live Implementation: