Implied Volatility Prediction from Options Data

M2 Master's Project – Predicting implied volatility using advanced regression techniques and machine learning models on financial options data.

This project explores the prediction of implied volatility from options market data, combining classical statistical methods with modern machine learning approaches. The analysis covers data preprocessing, feature engineering, model benchmarking, and interpretability analysis using real-world financial panel data.

GitHub Repository: Implied-Volatility-from-Options-Data

Project Overview

Problem Statement

Implied volatility represents the market's forward-looking expectation of an asset's future volatility. Accurate prediction is crucial for:

Option pricing and valuation
Risk management and hedging strategies
Trading strategies based on volatility arbitrage

Dataset

The project uses a comprehensive panel dataset tracking 3,887 assets across 544 observation dates (2019-2022):

File	Description	Shape
`Train_ISF.csv`	Training data with target variable	1,909,465 rows × 21 columns
`Test_ISF.csv`	Test data for prediction	1,251,308 rows × 18 columns
`hat_y.csv`	Final predictions from both models	1,251,308 rows × 2 columns

Key Variables

Target Variable:

implied_vol_ref – The implied volatility to predict

Feature Categories:

Identifiers: asset_id, obs_date
Market Activity: call_volume, put_volume, call_oi, put_oi, total_contracts
Volatility Metrics: realized_vol_short, realized_vol_mid1-3, realized_vol_long1-4, market_vol_index
Option Structure: strike_dispersion, maturity_count

Methodology

Data Pipeline

Raw Data
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Splitting (Chronological 80/20)                   │
│  - Training: 2019-10 to 2021-07                         │
│  - Validation: 2021-07 to 2022-03                       │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Feature Engineering                                    │
│  - Aggregation of volatility horizons                   │
│  - Creation of financial indicators                     │
└─────────────────────────────────────────────────────────┘
    ↓
┌─────────────────────────────────────────────────────────┐
│  Data Preprocessing (tidymodels)                        │
│  - Winsorization (99.5th percentile)                    │
│  - Log/Yeo-Johnson transformations                      │
│  - Z-score normalization                                │
│  - PCA (95% variance retention)                         │
└─────────────────────────────────────────────────────────┘
    ↓
Three Datasets Generated:
├── Tree-based (raw, scale-invariant)
├── Linear (normalized, winsorized)
└── PCA (dimensionality-reduced)
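
The preprocessing stage in the diagram above can be approximated in base R. This is only a rough sketch on simulated data: the project itself relies on tidymodels, `log1p` stands in for the Log/Yeo-Johnson step, and the helper `winsorize_upper` and the toy values are assumptions made for illustration.

```r
# Illustrative preprocessing on simulated data (not the project's actual recipe).
set.seed(2025)
toy <- data.frame(
  call_volume        = rlnorm(500, meanlog = 5, sdlog = 1.5),
  put_volume         = rlnorm(500, meanlog = 5, sdlog = 1.5),
  realized_vol_short = rlnorm(500, meanlog = 3, sdlog = 0.5)
)

# Hypothetical helper: cap values above the chosen upper quantile.
winsorize_upper <- function(x, prob = 0.995) {
  cap <- quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}

# 1. Winsorize at the 99.5th percentile.
capped <- as.data.frame(lapply(toy, winsorize_upper))
# 2. Log-transform the skewed, non-negative columns.
logged <- log1p(capped)
# 3. Z-score normalize each column.
scaled <- scale(logged)

# 4. PCA, keeping the fewest components that reach 95% cumulative variance.
pca <- prcomp(scaled, center = FALSE, scale. = FALSE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cum_var >= 0.95)[1]
pca_scores <- pca$x[, seq_len(n_comp), drop = FALSE]
```

In a real pipeline the caps, means, standard deviations, and PCA loadings would be estimated on the training split only and then applied unchanged to the validation split, so that no validation information leaks into preprocessing.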

Feature Engineering

New financial indicators created to capture market dynamics:

Feature	Description	Formula
`pulse_ratio`	Volatility trend direction	RV_short / RV_long
`stress_spread`	Asset vs market stress	RV_short - Market_VIX
`put_call_ratio_volume`	Immediate market stress	Put_Volume / Call_Volume
`put_call_ratio_oi`	Long-term risk structure	Put_OI / Call_OI
`liquidity_ratio`	Market depth	Total_Volume / Total_OI
`option_dispersion`	Market uncertainty	Strike_Dispersion / Total_Contracts
`put_low_strike`	Downside protection density	Strike_Dispersion / Put_OI
`put_proportion`	Hedging vs speculation	Put_Volume / Total_Volume
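
As a minimal sketch, the formulas in the table can be computed in base R as follows. Several things here are assumptions rather than the project's code: the toy values, the `safe_div` helper (which returns `NA` when a denominator is zero), the single `realized_vol_long` column standing in for an aggregate of `realized_vol_long1-4`, and the definitions of total volume and total open interest as call plus put.

```r
# Toy rows using the column names listed under "Key Variables".
# realized_vol_long is an assumed aggregate of realized_vol_long1-4.
toy <- data.frame(
  call_volume        = c(120, 300, 0),
  put_volume         = c(80, 450, 10),
  call_oi            = c(1000, 2500, 40),
  put_oi             = c(900, 4000, 0),
  total_contracts    = c(15, 40, 3),
  realized_vol_short = c(22.5, 35.0, 18.0),
  realized_vol_long  = c(20.0, 28.0, 24.0),
  market_vol_index   = c(18.0, 30.0, 18.0),
  strike_dispersion  = c(0.12, 0.30, 0.05)
)

# Hypothetical helper: return NA instead of Inf/NaN when the denominator is zero.
safe_div <- function(num, den) ifelse(den == 0, NA_real_, num / den)

# Assumed definitions of the totals used in the table.
total_volume <- toy$call_volume + toy$put_volume
total_oi     <- toy$call_oi + toy$put_oi

toy$pulse_ratio           <- safe_div(toy$realized_vol_short, toy$realized_vol_long)
toy$stress_spread         <- toy$realized_vol_short - toy$market_vol_index
toy$put_call_ratio_volume <- safe_div(toy$put_volume, toy$call_volume)
toy$put_call_ratio_oi     <- safe_div(toy$put_oi, toy$call_oi)
toy$liquidity_ratio       <- safe_div(total_volume, total_oi)
toy$option_dispersion     <- safe_div(toy$strike_dispersion, toy$total_contracts)
toy$put_low_strike        <- safe_div(toy$strike_dispersion, toy$put_oi)
toy$put_proportion        <- safe_div(toy$put_volume, total_volume)
```

Ratio features are undefined whenever the denominator is zero (for example, an asset with no call volume on a given date), so some explicit policy such as `NA` followed by imputation is needed before modeling.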

Models Implemented

Linear Models

Model	Description	Best RMSE
OLS	Ordinary Least Squares	11.26
Ridge	L2 regularization	12.48
Lasso	L1 regularization (variable selection)	12.03
Elastic Net	L1 + L2 combined	~12.03
PLS	Partial Least Squares (fit on the PCA dataset)	12.79

Linear Mixed-Effects Models (LMM)

Advanced panel data models accounting for asset-specific effects:

Model	Features	RMSE
LMM Baseline	All variables + Random Intercept	8.77
LMM Reduced	Collinearity removal	~8.77
LMM Interactions	Financial interaction terms	~8.77
LMM + Quadratic	Convexity terms (vol of vol)	8.41
LMM + Random Slopes (mod_lmm_5)	Asset-specific betas	8.38 ⭐
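
To illustrate the difference between a random intercept and random slopes, here is a small sketch using `lme4::lmer` on simulated panel data. The exact formula of `mod_lmm_5` is not given here, so the predictor `rv_z`, the simulated coefficients, and the in-sample `rmse` helper are all assumptions for illustration.

```r
# Simulated panel: 30 assets observed on 40 dates each (illustrative only).
set.seed(2025)
n_assets <- 30
n_dates  <- 40
panel <- data.frame(
  asset_id     = factor(rep(seq_len(n_assets), each = n_dates)),
  realized_vol = runif(n_assets * n_dates, min = 10, max = 40)
)
idx <- as.integer(panel$asset_id)
asset_intercept <- rnorm(n_assets, sd = 3)[idx]               # asset-specific level
asset_slope     <- rnorm(n_assets, mean = 0.8, sd = 0.2)[idx] # asset-specific beta
panel$implied_vol <- 5 + asset_intercept + asset_slope * panel$realized_vol +
  rnorm(nrow(panel), sd = 2)
panel$rv_z <- as.numeric(scale(panel$realized_vol))           # standardized predictor

if (requireNamespace("lme4", quietly = TRUE)) {
  # Random intercept: each asset gets its own baseline level.
  fit_intercept <- lme4::lmer(implied_vol ~ rv_z + (1 | asset_id), data = panel)
  # Quadratic term plus random slopes: each asset also gets its own sensitivity.
  fit_slopes <- lme4::lmer(
    implied_vol ~ rv_z + I(rv_z^2) + (1 + rv_z | asset_id),
    data = panel
  )
  # In-sample RMSE, for a quick side-by-side comparison only.
  rmse <- function(fit) sqrt(mean(residuals(fit)^2))
  print(c(random_intercept = rmse(fit_intercept), random_slopes = rmse(fit_slopes)))
}
```

The `(1 + rv_z | asset_id)` term is what lets the slope on realized volatility vary by asset, which corresponds to the "asset-specific betas" row of the table.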

Tree-Based Models

Model	Strategy	Validation RMSE	Training RMSE
XGBoost	Level-wise, Bayesian tuning	10.70	0.57
LightGBM	Leaf-wise, feature regularization	10.61 ⭐	10.90
Random Forest	Bagging	DNF*	-

*DNF: Did Not Finish (computational constraints)

Neural Networks

Model	Architecture	Status
MLP	128-64 units, tanh activation	Failed to converge

Results Summary

Model Comparison

RMSE Performance (Lower is Better)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Linear Mixed-Effects (LMM5)     8.38 ████████████████████ Best Linear
Linear Mixed-Effects (LMM4)     8.41 ███████████████████
Linear Mixed-Effects (Baseline) 8.77 ██████████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
LightGBM                       10.61 ███████████████ Best Non-Linear
XGBoost                        10.70 ██████████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
OLS (with interactions)        11.26 █████████████
OLS (baseline)                 12.01 ███████████
Lasso                          12.03 ███████████
Ridge                          12.48 ██████████
PLS                            12.79 █████████
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Key Findings

Best Linear Model: LMM with Random Slopes (RMSE = 8.38)
- Captures asset-specific volatility sensitivities
- Includes quadratic terms for convexity effects
Best Non-Linear Model: LightGBM (RMSE = 10.61)
- Superior generalization vs XGBoost
- Feature regularization prevents overfitting
Interpretability Insights (SHAP Analysis):
- realized_vol_mid dominates (57% of gain)
- Volatility clustering confirmed as primary driver
- Non-linear regime switching in stress_spread

Repository Structure

PROJECT/
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd     # Main analysis (Quarto)
├── Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.html    # Rendered report
├── packages.R                                         # R dependencies installer
├── Train_ISF.csv                                      # Training data (~1.9M rows)
├── Test_ISF.csv                                       # Test data (~1.25M rows)
├── hat_y.csv                                          # Final predictions
├── README.md                                          # This file
└── results/
    ├── lightgbm/                                      # LightGBM model outputs
    └── xgboost/                                       # XGBoost model outputs

Getting Started

Prerequisites

R ≥ 4.0
Required packages (auto-installed via packages.R)

Installation

# Install all dependencies
source("packages.R")

Or manually install key packages:

install.packages(c(
  "tidyverse", "tidymodels", "caret", "glmnet",
  "lme4", "lmerTest", "xgboost", "lightgbm",
  "ranger", "pls", "shapviz", "rBayesianOptimization"
))

Running the Analysis

Open the Quarto document:

# In RStudio
rstudioapi::navigateToFile("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")

Render the document:

quarto::quarto_render("Projet_MRC_DANJOU_LEGRAND_MERIC_VONSIEMENS.qmd")

Or run specific sections interactively using the code chunks in the .qmd file.

Technical Details

Data Split Strategy

Chronological split at 80th percentile of dates
Prevents look-ahead bias and data leakage
Training: ~1.53M observations
Validation: ~376K observations
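
A chronological split of this kind might look like the sketch below. It assumes the cutoff is taken as the 80th percentile of the distinct observation dates (the text does not say whether distinct dates or row-level dates are used), and the simulated data frame is purely illustrative.

```r
# Simulated observations with the identifier columns named in "Key Variables".
set.seed(2025)
dates <- seq(as.Date("2019-10-01"), as.Date("2022-03-31"), by = "day")
toy <- data.frame(
  asset_id = sample(1:50, 5000, replace = TRUE),
  obs_date = sample(dates, 5000, replace = TRUE)
)
toy$implied_vol_ref <- rlnorm(5000, meanlog = 3, sdlog = 0.4)

# Cutoff: the date sitting at the 80th percentile of the distinct dates (assumption).
unique_dates <- sort(unique(toy$obs_date))
cutoff <- unique_dates[ceiling(0.8 * length(unique_dates))]

# Everything up to the cutoff trains the model; everything after validates it.
train <- toy[toy$obs_date <= cutoff, ]
valid <- toy[toy$obs_date >  cutoff, ]
```

Because every validation date is strictly later than every training date, no future information reaches the model during fitting. The row counts need not be exactly 80/20, since the number of observations varies from date to date.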

Hyperparameter Tuning

Method: Bayesian Optimization (Gaussian Processes)
Acquisition: Upper Confidence Bound (UCB)
Goal: Maximize negative RMSE (equivalent to minimizing RMSE, since the optimizer maximizes its score)

Evaluation Metric

RMSE on the original scale, computed after exponentiating the log-scale predictions:

RMSE_{real} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \exp(\hat{y}_{\log, i}) - y_i \right)^2}

Models are trained on the log-transformed target to stabilize variance, so predictions are mapped back with exp before scoring.
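
The metric above translates directly into a short R function. The name `rmse_real` and the toy values are illustrative assumptions, not taken from the project code.

```r
# RMSE on the original scale, given predictions made on the log scale.
rmse_real <- function(y, y_hat_log) {
  stopifnot(length(y) == length(y_hat_log))
  sqrt(mean((exp(y_hat_log) - y)^2))
}

# Toy check: back-transformed errors are 2, -2 and 0.
y         <- c(20, 35, 50)
y_hat_log <- log(c(22, 33, 50))
rmse_real(y, y_hat_log)  # sqrt(8 / 3), about 1.633
```

Scoring on the original scale keeps the results comparable across models regardless of which target transformation each one was trained on.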

Key Concepts

Financial Theories Applied

Volatility Clustering – Past volatility predicts future volatility
Variance Risk Premium – Spread between implied and realized volatility
Fear Gauge – Put-call ratio as sentiment indicator
Mean Reversion – Volatility tends to return to long-term average
Liquidity Premium – Illiquid assets tend to carry higher implied volatility

Statistical Methods

Panel data modeling with fixed and random effects
Principal Component Analysis (PCA)
Bayesian hyperparameter optimization
SHAP values for model interpretability

Authors

Team:

Arthur DANJOU
Camille LEGRAND
Axelle MERIC
Moritz VON SIEMENS

Course: Classification and Regression (M2)
Academic Year: 2025-2026

Notes

Computational Constraints: On the available hardware (16 GB RAM, CPU-only), Random Forest did not finish and the MLP failed to converge
Reproducibility: Set seed = 2025 for consistent results
Language: Analysis documented in English, course materials in French

References

Key R packages used:

tidymodels – Modern modeling framework
glmnet – Regularized regression
lme4 / lmerTest – Mixed-effects models
xgboost / lightgbm – Gradient boosting
shapviz – Model interpretability
rBayesianOptimization – Hyperparameter tuning