
Building a Real-Time F1 Analytics Dashboard with ML

8 min read · 2026-02-26 · Kanisius Bagaskara


Formula 1 strategy is one of the most complex real-time optimization problems in sports. Teams spend millions on proprietary tools to predict when to pit, which tire compound to use, and how to respond to safety cars. I wanted to build something accessible for fans that actually predicts rather than just reports.

Why F1?

I've been obsessed with F1 data since Max Verstappen's championship run in 2021. The sport generates terabytes of telemetry per race — throttle position at 50Hz, GPS at 10Hz, tire temperatures, fuel loads. Most of this data is publicly available via the FastF1 Python library. It was too interesting to ignore.

The Stack

```python
# Core pipeline
import fastf1
import xgboost as xgb
import fastapi
import streamlit as st
import pandas as pd
import numpy as np
```

FastF1 is an unofficial Python library that wraps the F1 timing API. It gives you everything — session data, lap telemetry, car data, weather. The trick is that it caches aggressively (sessions can be 200MB+), so you build your pipeline around its caching layer.
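A minimal setup sketch built around that caching layer (the cache directory path and the Monza example session are my choices, not from the original pipeline):

```python
import os

import fastf1

# FastF1 refuses to use a cache directory that doesn't exist yet
os.makedirs("./f1_cache", exist_ok=True)

# Enable the on-disk cache before loading anything; repeat loads of the
# same session are then served from disk instead of the F1 timing API.
fastf1.Cache.enable_cache("./f1_cache")

session = fastf1.get_session(2024, "Monza", "R")  # year, event, session type
session.load()        # fetches (or reads cached) laps, telemetry, weather
laps = session.laps   # per-lap DataFrame used downstream for features
```

The first `session.load()` is the slow, 200MB+ download; every run after that reads from `./f1_cache`.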

Feature Engineering

The XGBoost model takes 40+ features per lap snapshot:

```python
features = [
    'lap_number',
    'tire_compound',       # SOFT, MEDIUM, HARD
    'tire_age_laps',       # how old the current tires are
    'lap_time_delta_p1',   # delta to race leader
    'sector_1_time',
    'sector_2_time',
    'sector_3_time',
    'fuel_load_estimate',  # calculated from weight-loss model
    'track_temp',
    'air_temp',
    'rain_probability',    # from weather API
    # ... 30+ more
]
```

The fuel load estimate was the hardest feature to engineer — F1 teams don't publish fuel loads. I approximated it using the known fuel consumption rate (~110kg over 305km) and fit a degradation curve per car.
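As a first pass, that approximation can be sketched as a constant-burn model (the constants and function name are my illustration; the per-car degradation curve the article fits would replace the flat burn rate):

```python
RACE_FUEL_KG = 110.0      # maximum race fuel under current regulations
RACE_DISTANCE_KM = 305.0  # minimum race distance (Monaco excepted)

def estimate_fuel_load(distance_covered_km: float,
                       start_fuel_kg: float = RACE_FUEL_KG) -> float:
    """Fuel remaining, assuming a constant burn rate over the race distance."""
    burn_rate = start_fuel_kg / RACE_DISTANCE_KM  # ~0.36 kg per km
    remaining = start_fuel_kg - burn_rate * distance_covered_km
    return max(remaining, 0.0)  # clamp: the tank can't go negative

# Halfway through a 305 km race, roughly half the fuel remains
print(round(estimate_fuel_load(152.5), 1))  # → 55.0
```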

The Model

XGBoost worked better than neural networks here because:

  • Small dataset: ~5,000 pit events across three seasons
  • Feature importance is interpretable — crucial for debugging wrong predictions
  • Training time is seconds, not hours

```python
model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
)
```

After Optuna hyperparameter tuning: 92% test accuracy on held-out 2024 season data.

The API Layer

FastAPI serves predictions with async refresh:

```python
@app.get("/predict/{session_key}/{driver_number}")
async def predict_pit_window(session_key: int, driver_number: int):
    features = await extract_live_features(session_key, driver_number)
    prediction = model.predict_proba(features)[0][1]
    return {
        "driver": driver_number,
        "pit_probability": float(prediction),
        "recommended_window": calculate_window(features, prediction),
    }
```

Response time: <50ms P99 in production (Streamlit Cloud, us-east-1 region).

Key Challenges

1. Missing Telemetry Data

FastF1 sometimes returns incomplete laps — crashes, red flags, VSC periods. I handled these with cubic spline interpolation over the gap, following the same approach F1 teams use internally.
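A sketch of that gap-filling step (`fill_telemetry_gap` is a hypothetical helper of mine, not part of FastF1):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fill_telemetry_gap(t, values):
    """Fill NaN gaps in a telemetry channel with a cubic spline
    fitted on the valid samples either side of the gap."""
    t = np.asarray(t, dtype=float)
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    spline = CubicSpline(t[valid], values[valid])
    filled = values.copy()
    filled[~valid] = spline(t[~valid])  # evaluate only at the missing samples
    return filled

# Speed trace with two missing samples (e.g. a brief timing dropout)
t = [0, 1, 2, 3, 4, 5]
speed = [280.0, 275.0, np.nan, np.nan, 250.0, 240.0]
print(fill_telemetry_gap(t, speed))
```

Splines are a reasonable fit here because speed and sector traces are smooth between samples; they would be a poor choice across a red flag, where the car genuinely stops.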

2. Rate Limiting

The unofficial F1 API has limits. I built a caching middleware that stores session data in SQLite and only fetches fresh data every 30 seconds during live sessions.
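A minimal sketch of that middleware (class name, schema, and the `fetch` callable are my assumptions; `fetch` stands in for the actual API call):

```python
import json
import sqlite3
import time

class SessionCache:
    """SQLite-backed cache with a TTL: a key is refetched from upstream
    at most once every `ttl` seconds, otherwise served from disk."""

    def __init__(self, path=":memory:", ttl=30):
        self.db = sqlite3.connect(path)
        self.ttl = ttl
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
        )

    def get(self, key, fetch):
        row = self.db.execute(
            "SELECT payload, fetched_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return json.loads(row[0])  # fresh enough: serve from SQLite
        payload = fetch()              # stale or missing: hit the upstream API
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, json.dumps(payload), time.time()),
        )
        self.db.commit()
        return payload

calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return {"lap": 12}

cache = SessionCache(ttl=30)
cache.get("monza-2024", fake_fetch)
cache.get("monza-2024", fake_fetch)  # within the TTL: served from SQLite
print(calls["n"])  # → 1
```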

3. Class Imbalance

Only ~3% of laps end in a pit stop. Class weight balancing (`scale_pos_weight=33`) in XGBoost handled this without needing SMOTE.
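The weight is just the negative-to-positive class ratio, so with ~3% positives it lands close to the value above (the counts below are illustrative, not the real dataset's):

```python
n_pit = 150       # laps ending in a pit stop (~3% of 5,000)
n_no_pit = 4850   # all other laps
scale_pos_weight = n_no_pit / n_pit  # negative / positive ratio
print(round(scale_pos_weight, 1))  # → 32.3
```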

What I Learned

  • Domain knowledge beats raw ML — knowing that tire degradation is exponential, not linear, was worth more than 50 extra features
  • Caching is architecture — a well-designed cache made the app feel real-time with 1/100th the API calls
  • Interpretability matters — when the model predicted a strange pit at lap 3, SHAP values let me trace it to a wet-track anomaly in the training data

What's Next

  • Multi-driver simultaneous strategy comparison
  • Live undercut/overcut simulation
  • MongoDB for persistent session cache across deployments
