Backtesting a Mean‑Reversion System Using Z‑Scores and Statistical Filters
A practical guide to backtesting mean‑reversion systems: compute z‑scores, apply cointegration and statistical filters, simulate costs, and validate with walk‑forward tests.
Introduction — Why z‑scores and statistical filters?
Mean‑reversion systems (commonly implemented as pairs or spread strategies) aim to profit when a constructed spread deviates from, and then reverts to, its historical equilibrium. The standard measure of how extreme the current spread is the z‑score: the number of standard deviations the spread sits from its rolling mean. The z‑score drives entry/exit rules and the sizing of multi‑level bands (e.g., ±1, ±2 σ).
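As a minimal sketch of this definition (assuming a numpy array `spread` and a lookback window `lookback`; the function name is illustrative):

```python
import numpy as np

def rolling_zscore(spread: np.ndarray, lookback: int) -> np.ndarray:
    """z_t = (s_t - rolling mean) / rolling std over the trailing `lookback` bars."""
    z = np.full(len(spread), np.nan)
    for t in range(lookback - 1, len(spread)):
        window = spread[t - lookback + 1 : t + 1]
        mu, sigma = window.mean(), window.std(ddof=1)
        if sigma > 0:
            z[t] = (spread[t] - mu) / sigma
    return z
```

The first `lookback − 1` entries are NaN by construction, which keeps the series aligned with the price index and prevents look-ahead into an incomplete window.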
Academic work shows the historical viability of relative‑value rules, but also cautions that simple implementations lost edge as markets and microstructure changed — highlighting the need for robust selection, filtering and realistic backtesting. Use the original literature as a baseline, then add modern statistical controls and execution realism.
Design & implementation: data, spread construction and z‑scores
Follow a disciplined pipeline when you build a mean‑reversion backtest:
- Data selection: use continuous (adjusted) price series or mid‑quotes; include delisted symbols and corporate actions when testing equities/ETFs. Intraday or tick data is required for execution‑sensitive intraday rules.
- Pair / spread construction: choose candidates with strong economic similarity (same sector, comparable market cap) and test statistical linkage. For linear spreads use Spread_t = Y_t − β·X_t where β can be estimated by OLS or by a cointegrating regression. For multiple assets use VECM/Johansen frameworks if appropriate. Recent studies emphasise cointegration stability as a live‑market requirement — transient relationships will erode profitability.
- Z‑score (rolling): compute s_t (log or level spread), rolling mean μ_t and rolling std σ_t over a lookback window L, then z_t = (s_t − μ_t) / σ_t. Choose L based on half‑life estimates or formation‑period experiments; shorter L reacts to regime changes but increases noise.
- Signal rules:
  - Enter long spread when z_t < −entry_sigma; short when z_t > +entry_sigma.
  - Exit when z_t crosses zero or hits stop limits, or use a tighter target band (e.g., exit at ±0.5σ).
  - Optional multi‑level entries (pyramiding) can be used but require stricter cost modeling.
- Pre‑trade filters: require minimum co‑movement (correlation) and a passed stationarity/cointegration test (ADF, Engle–Granger or Johansen) to avoid non‑stationary spreads. Filter out pairs with long estimated half‑life (slow mean reversion) because they tie capital and raise risk.
- Hedge ratio maintenance: re‑estimate β on a rolling schedule (formation window) and freeze it for the trading window; avoid using future information when recalibrating.
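The spread-construction and signal steps above can be sketched as follows. This is a simplified illustration, assuming aligned numpy price arrays `y` and `x` and a precomputed z-score series; the function names are hypothetical:

```python
import numpy as np

def ols_hedge_ratio(y: np.ndarray, x: np.ndarray) -> float:
    """Estimate beta in Spread_t = Y_t - beta * X_t by OLS with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(coeffs[1])

def signals(z: np.ndarray, entry_sigma: float = 2.0) -> np.ndarray:
    """+1 = long spread, -1 = short spread, 0 = flat; exit when z crosses zero."""
    pos = np.zeros_like(z)
    for t in range(1, len(z)):
        if np.isnan(z[t]):
            continue
        if pos[t - 1] == 0:
            if z[t] < -entry_sigma:
                pos[t] = 1      # spread unusually cheap: buy Y, sell beta*X
            elif z[t] > entry_sigma:
                pos[t] = -1     # spread unusually rich: sell Y, buy beta*X
        else:
            # hold until the z-score reverts through zero, then go flat
            pos[t] = 0 if z[t] * pos[t - 1] >= 0 else pos[t - 1]
    return pos
```

In a real backtest, `ols_hedge_ratio` would be re-estimated only on the formation window and then frozen for the subsequent trading window, exactly as the hedge-ratio maintenance bullet prescribes.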
Document every assumption (lookback, thresholds, fees, slippage model, rebalancing cadence) so the backtest is reproducible and audit‑ready.
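One way to keep those assumptions explicit and reproducible is to collect them in a single config object that is persisted alongside the results. All field names and default values below are illustrative, not recommendations:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class BacktestConfig:
    # field names and values are illustrative; adapt to your own pipeline
    lookback: int = 60          # rolling window L for mean/std
    entry_sigma: float = 2.0    # enter when |z| exceeds this
    exit_sigma: float = 0.0     # exit when z reverts through this level
    formation_days: int = 252   # window used to estimate the hedge ratio
    rebalance_days: int = 21    # how often beta is re-estimated
    fee_bps: float = 1.0        # per-side commission, basis points
    slippage_bps: float = 2.0   # assumed per-fill slippage, basis points

config = BacktestConfig()
print(json.dumps(asdict(config), indent=2))  # store next to the run's output
```

Freezing the dataclass guards against parameters being mutated mid-run, and the JSON dump makes every run auditable after the fact.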
Testing, validation and robustness checks
Design backtests to mirror real trading. Key validation layers to include:
- Transaction costs & slippage: always subtract realistic spreads, commissions and slippage from each simulated fill. Model slippage as a function of volatility, order size and liquidity (or use historical bid/ask fills if available). Conservative cost assumptions are safer than optimistic ones. Quantitative backtesting guides recommend integrating variable spreads or tick‑level fills when possible.
- Walk‑forward analysis: avoid single in‑sample calibration. Use rolling windows that re‑optimise parameters on in‑sample data and then test on the next out‑of‑sample window. Aggregate out‑of‑sample results to estimate realistic live performance and parameter stability. Walk‑forward is computationally heavier but dramatically reduces overfitting risk.
- Stress testing & Monte Carlo: run Monte Carlo resamplings, sign‑shuffles, and regime resamplings (volatile vs calm markets) to measure the distribution of drawdowns, tail risk, and Probabilistic or Deflated Sharpe ratio metrics.
- Statistical filters & selection bias control: limit multiple‑testing bias by predefining parameter grids, using penalised model selection, and reporting adjustment metrics (e.g., Probabilistic Sharpe, p‑values adjusted for data snooping). Prefer strategies whose parameters are stable across many windows rather than those that peak in a single interval.
- Practical execution tests: simulate order types (market vs limit), partial fills and minimum tick/size constraints. For larger notional trades, include market‑impact models or cap position sizes to realistic liquidity buckets.
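The walk-forward loop described above can be sketched as a short driver. The `optimise` and `evaluate` callables are hypothetical placeholders for your own calibration and scoring routines:

```python
import numpy as np
from typing import Callable

def walk_forward(
    data: np.ndarray,
    in_sample: int,
    out_sample: int,
    optimise: Callable[[np.ndarray], dict],
    evaluate: Callable[[np.ndarray, dict], float],
) -> list:
    """Re-optimise on each in-sample window, then score the next out-of-sample slice."""
    results = []
    start = 0
    while start + in_sample + out_sample <= len(data):
        train = data[start : start + in_sample]
        test = data[start + in_sample : start + in_sample + out_sample]
        params = optimise(train)            # calibrated only on past data
        results.append(evaluate(test, params))
        start += out_sample                 # roll the window forward
    return results
```

Aggregating `results` across all out-of-sample slices, rather than reporting a single train/test split, is what gives the realistic performance estimate the text calls for.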
Checklist before live deployment:
- Out‑of‑sample equity curve is consistent, with gains spread across periods rather than concentrated in a few windows
- Sensible win rate, avg trade duration and drawdown profile
- Parameter stability across walk‑forward windows
- Stress tests show acceptable tail risk
- Execution costs modelled conservatively and tested with worst‑case fills
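For the stress-test item, one common sketch is to bootstrap the realised trade returns and examine the resulting distribution of maximum drawdowns. This assumes a `trade_returns` array produced by your backtest; the function names are illustrative:

```python
import numpy as np

def max_drawdown(equity: np.ndarray) -> float:
    """Largest peak-to-trough decline of an equity curve, as a fraction."""
    peaks = np.maximum.accumulate(equity)
    return float(np.max((peaks - equity) / peaks))

def bootstrap_drawdowns(trade_returns: np.ndarray, n_paths: int = 1000,
                        seed: int = 0) -> np.ndarray:
    """Resample trade returns with replacement; record each path's max drawdown."""
    rng = np.random.default_rng(seed)
    dds = np.empty(n_paths)
    for i in range(n_paths):
        sample = rng.choice(trade_returns, size=len(trade_returns), replace=True)
        equity = np.cumprod(1.0 + sample)
        dds[i] = max_drawdown(np.concatenate([[1.0], equity]))
    return dds
```

A high percentile of `dds` (for example the 95th) gives a hedged tail-risk estimate to compare against the "acceptable tail risk" criterion in the checklist; note that plain bootstrapping ignores serial dependence, so block or regime resampling is preferable when autocorrelation matters.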
Walk‑forward optimisation (WFO) frameworks and vendor tools (e.g., platform WFO tools or custom WFO pipelines) automate repeated re‑optimisation and testing; use them to produce realistic rolling out‑of‑sample results rather than a single static split.
Final notes: historical academic results (for example, the original pairs trading studies) provide useful benchmarks but do not guarantee future profits — modern algorithms must add better pair selection (cointegration checks), robust filtering and realistic execution assumptions to remain viable. Always treat backtesting as a rigorous engineering process: repeatable, conservative, and well documented.