Backtesting a Mean‑Reversion System Using Z‑Scores and Statistical Filters
A practical guide to backtesting mean‑reversion systems: compute z‑scores, apply cointegration and statistical filters, simulate costs, and validate with walk‑forward tests.
Introduction — Why z‑scores and statistical filters?
Mean‑reversion systems (commonly implemented as pairs or spread strategies) aim to profit when a constructed spread deviates from, and then reverts to, its historical equilibrium. The standard measure of how extreme the current spread is the z‑score: the number of standard deviations the spread sits from its rolling mean. The z‑score drives entry/exit rules and the sizing of multi‑level bands (e.g., ±1, ±2 σ).
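As a minimal sketch of this definition (assuming a numpy array `spread` and a lookback window `lookback`; the function name is illustrative):

```python
import numpy as np

def rolling_zscore(spread: np.ndarray, lookback: int) -> np.ndarray:
    """z_t = (s_t - rolling mean) / rolling std over the trailing `lookback` bars."""
    z = np.full(len(spread), np.nan)
    for t in range(lookback - 1, len(spread)):
        window = spread[t - lookback + 1 : t + 1]
        mu, sigma = window.mean(), window.std(ddof=1)
        if sigma > 0:
            z[t] = (spread[t] - mu) / sigma
    return z
```

The first `lookback − 1` entries are NaN by construction, which keeps the series aligned with the price index and prevents look-ahead into an incomplete window.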
Academic work shows the historical viability of relative‑value rules, but also cautions that simple implementations lost edge as markets and microstructure changed — highlighting the need for robust selection, filtering and realistic backtesting. Use the original literature as a baseline, then add modern statistical controls and execution realism.
Design & implementation: data, spread construction and z‑scores
Follow a disciplined pipeline when you build a mean‑reversion backtest:
- Data selection: use continuous (adjusted) price series or mid‑quotes; include delisted symbols and corporate actions when testing equities/ETFs. Intraday or tick data is required for execution‑sensitive intraday rules.
- Pair / spread construction: choose candidates with strong economic similarity (same sector, comparable market cap) and test statistical linkage. For linear spreads use Spread_t = Y_t − β·X_t where β can be estimated by OLS or by a cointegrating regression. For multiple assets use VECM/Johansen frameworks if appropriate. Recent studies emphasise cointegration stability as a live‑market requirement — transient relationships will erode profitability.
- Z‑score (rolling): compute s_t (log or level spread), rolling mean μ_t and rolling std σ_t over a lookback window L, then z_t = (s_t − μ_t) / σ_t. Choose L based on half‑life estimates or formation‑period experiments; shorter L reacts to regime changes but increases noise.
- Signal rules:
  - Enter long spread when z_t < −entry_sigma; short when z_t > +entry_sigma.
  - Exit when z_t crosses zero or hits stop limits, or use a tighter target band (e.g., exit at ±0.5σ).
  - Optional multi‑level entries (pyramiding) can be used but require stricter cost modeling.
- Pre‑trade filters: require minimum co‑movement (correlation) and a passed stationarity/cointegration test (ADF, Engle–Granger or Johansen) to avoid non‑stationary spreads. Filter out pairs with long estimated half‑life (slow mean reversion) because they tie capital and raise risk.
- Hedge ratio maintenance: re‑estimate β on a rolling schedule (formation window) and freeze it for the trading window; avoid using future information when recalibrating.
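The spread-construction and signal steps above can be sketched as follows. This is a simplified illustration, assuming aligned numpy price arrays `y` and `x` and a precomputed z-score series; the function names are hypothetical:

```python
import numpy as np

def ols_hedge_ratio(y: np.ndarray, x: np.ndarray) -> float:
    """Estimate beta in Spread_t = Y_t - beta * X_t by OLS with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(coeffs[1])

def signals(z: np.ndarray, entry_sigma: float = 2.0) -> np.ndarray:
    """+1 = long spread, -1 = short spread, 0 = flat; exit when z crosses zero."""
    pos = np.zeros_like(z)
    for t in range(1, len(z)):
        if np.isnan(z[t]):
            continue
        if pos[t - 1] == 0:
            if z[t] < -entry_sigma:
                pos[t] = 1      # spread unusually cheap: buy Y, sell beta*X
            elif z[t] > entry_sigma:
                pos[t] = -1     # spread unusually rich: sell Y, buy beta*X
        else:
            # hold until the z-score reverts through zero, then go flat
            pos[t] = 0 if z[t] * pos[t - 1] >= 0 else pos[t - 1]
    return pos
```

In a real backtest, `ols_hedge_ratio` would be re-estimated only on the formation window and then frozen for the subsequent trading window, exactly as the hedge-ratio maintenance bullet prescribes.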
Document every assumption (lookback, thresholds, fees, slippage model, rebalancing cadence) so the backtest is reproducible and audit‑ready.
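One way to keep those assumptions explicit and reproducible is to collect them in a single config object that is persisted alongside the results. All field names and default values below are illustrative, not recommendations:

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class BacktestConfig:
    # field names and values are illustrative; adapt to your own pipeline
    lookback: int = 60          # rolling window L for mean/std
    entry_sigma: float = 2.0    # enter when |z| exceeds this
    exit_sigma: float = 0.0     # exit when z reverts through this level
    formation_days: int = 252   # window used to estimate the hedge ratio
    rebalance_days: int = 21    # how often beta is re-estimated
    fee_bps: float = 1.0        # per-side commission, basis points
    slippage_bps: float = 2.0   # assumed per-fill slippage, basis points

config = BacktestConfig()
print(json.dumps(asdict(config), indent=2))  # store next to the run's output
```

Freezing the dataclass guards against parameters being mutated mid-run, and the JSON dump makes every run auditable after the fact.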
Testing, validation and robustness checks
Design backtests to mirror real trading. Key validation layers to include:
- Transaction costs & slippage: always subtract realistic spreads, commissions and slippage from each simulated fill. Model slippage as a function of volatility, order size and liquidity (or use historical bid/ask fills if available). Conservative cost assumptions are safer than optimistic ones. Quantitative backtesting guides recommend integrating variable spreads or tick‑level fills when possible.
- Walk‑forward analysis: avoid single in‑sample calibration. Use rolling windows that re‑optimise parameters on in‑sample data and then test on the next out‑of‑sample window. Aggregate out‑of‑sample results to estimate realistic live performance and parameter stability. Walk‑forward is computationally heavier but dramatically reduces overfitting risk.
- Stress testing & Monte Carlo: run Monte Carlo resamplings, sign‑shuffles, and regime resamplings (volatile vs calm markets) to measure the distribution of drawdowns, tail risk, and Probabilistic or Deflated Sharpe ratio metrics.
- Statistical filters & selection bias control: limit multiple‑testing bias by predefining parameter grids, using penalised model selection, and reporting adjustment metrics (e.g., Probabilistic Sharpe, p‑values adjusted for data snooping). Prefer strategies whose parameters are stable across many windows rather than those that peak in a single interval.
- Practical execution tests: simulate order types (market vs limit), partial fills and minimum tick/size constraints. For larger notional trades, include market‑impact models or cap position sizes to realistic liquidity buckets.
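The walk-forward loop described above can be sketched as a short driver. The `optimise` and `evaluate` callables are hypothetical placeholders for your own calibration and scoring routines:

```python
import numpy as np
from typing import Callable

def walk_forward(
    data: np.ndarray,
    in_sample: int,
    out_sample: int,
    optimise: Callable[[np.ndarray], dict],
    evaluate: Callable[[np.ndarray, dict], float],
) -> list:
    """Re-optimise on each in-sample window, then score the next out-of-sample slice."""
    results = []
    start = 0
    while start + in_sample + out_sample <= len(data):
        train = data[start : start + in_sample]
        test = data[start + in_sample : start + in_sample + out_sample]
        params = optimise(train)            # calibrated only on past data
        results.append(evaluate(test, params))
        start += out_sample                 # roll the window forward
    return results
```

Aggregating `results` across all out-of-sample slices, rather than reporting a single train/test split, is what gives the realistic performance estimate the text calls for.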
Checklist before live deployment:
- Out‑of‑sample equity curve is consistent, with gains spread across periods rather than concentrated in a few windows
- Sensible win rate, avg trade duration and drawdown profile
- Parameter stability across walk‑forward windows
- Stress tests show acceptable tail risk
- Execution costs modelled conservatively and tested with worst‑case fills
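For the stress-test item, one common sketch is to bootstrap the realised trade returns and examine the resulting distribution of maximum drawdowns. This assumes a `trade_returns` array produced by your backtest; the function names are illustrative:

```python
import numpy as np

def max_drawdown(equity: np.ndarray) -> float:
    """Largest peak-to-trough decline of an equity curve, as a fraction."""
    peaks = np.maximum.accumulate(equity)
    return float(np.max((peaks - equity) / peaks))

def bootstrap_drawdowns(trade_returns: np.ndarray, n_paths: int = 1000,
                        seed: int = 0) -> np.ndarray:
    """Resample trade returns with replacement; record each path's max drawdown."""
    rng = np.random.default_rng(seed)
    dds = np.empty(n_paths)
    for i in range(n_paths):
        sample = rng.choice(trade_returns, size=len(trade_returns), replace=True)
        equity = np.cumprod(1.0 + sample)
        dds[i] = max_drawdown(np.concatenate([[1.0], equity]))
    return dds
```

A high percentile of `dds` (for example the 95th) gives a hedged tail-risk estimate to compare against the "acceptable tail risk" criterion in the checklist; note that plain bootstrapping ignores serial dependence, so block or regime resampling is preferable when autocorrelation matters.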
Walk‑forward optimisation (WFO) frameworks and vendor tools (e.g., platform WFO tools or custom WFO pipelines) automate repeated re‑optimisation and testing; use them to produce realistic rolling out‑of‑sample results rather than a single static split.
Final notes: historical academic results (for example, the original pairs trading studies) provide useful benchmarks but do not guarantee future profits — modern algorithms must add better pair selection (cointegration checks), robust filtering and realistic execution assumptions to remain viable. Always treat backtesting as a rigorous engineering process: repeatable, conservative, and well documented.