
Building a Real-Time F1 Analytics Dashboard with ML

8 min read · 2026-02-26 · Kanisius Bagaskara


Formula 1 strategy is one of the most complex real-time optimization problems in sports. Teams spend millions on proprietary tools to predict when to pit, which tire compound to use, and how to respond to safety cars. I wanted to build something accessible for fans that actually predicts rather than just reports.

Why F1?

I've been obsessed with F1 data since Max Verstappen's championship run in 2021. The sport generates terabytes of telemetry per race — throttle position at 50Hz, GPS at 10Hz, tire temperatures, fuel loads. Most of this data is publicly available via the FastF1 Python library. It was too interesting to ignore.

The Stack

```python
# Core pipeline
import fastf1
import xgboost as xgb
import fastapi
import streamlit as st
import pandas as pd
import numpy as np
```

FastF1 is an unofficial Python library that wraps the F1 timing API. It gives you everything — session data, lap telemetry, car data, weather. The trick is that it caches aggressively (sessions can be 200MB+), so you build your pipeline around its caching layer.
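A minimal setup sketch built around that caching layer (the cache directory path and the Monza example session are my choices, not from the original pipeline):

```python
import os

import fastf1

# FastF1 refuses to use a cache directory that doesn't exist yet
os.makedirs("./f1_cache", exist_ok=True)

# Enable the on-disk cache before loading anything; repeat loads of the
# same session are then served from disk instead of the F1 timing API.
fastf1.Cache.enable_cache("./f1_cache")

session = fastf1.get_session(2024, "Monza", "R")  # year, event, session type
session.load()        # fetches (or reads cached) laps, telemetry, weather
laps = session.laps   # per-lap DataFrame used downstream for features
```

The first `session.load()` is the slow, 200MB+ download; every run after that reads from `./f1_cache`.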

Feature Engineering

The XGBoost model takes 40+ features per lap snapshot:

```python
features = [
    'lap_number',
    'tire_compound',       # SOFT, MEDIUM, HARD
    'tire_age_laps',       # how old the current tires are
    'lap_time_delta_p1',   # delta to race leader
    'sector_1_time',
    'sector_2_time',
    'sector_3_time',
    'fuel_load_estimate',  # calculated from weight-loss model
    'track_temp',
    'air_temp',
    'rain_probability',    # from weather API
    # ... 30+ more
]
```

The fuel load estimate was the hardest feature to engineer — F1 teams don't publish fuel loads. I approximated it using the known fuel consumption rate (~110kg over 305km) and fit a degradation curve per car.
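As a first pass, that approximation can be sketched as a constant-burn model (the constants and function name are my illustration; the per-car degradation curve the article fits would replace the flat burn rate):

```python
RACE_FUEL_KG = 110.0      # maximum race fuel under current regulations
RACE_DISTANCE_KM = 305.0  # minimum race distance (Monaco excepted)

def estimate_fuel_load(distance_covered_km: float,
                       start_fuel_kg: float = RACE_FUEL_KG) -> float:
    """Fuel remaining, assuming a constant burn rate over the race distance."""
    burn_rate = start_fuel_kg / RACE_DISTANCE_KM  # ~0.36 kg per km
    remaining = start_fuel_kg - burn_rate * distance_covered_km
    return max(remaining, 0.0)  # clamp: the tank can't go negative

# Halfway through a 305 km race, roughly half the fuel remains
print(round(estimate_fuel_load(152.5), 1))  # → 55.0
```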

The Model

XGBoost worked better than neural networks here because:

  • Small dataset: ~5,000 pit events across three seasons
  • Feature importance is interpretable — crucial for debugging wrong predictions
  • Training time is seconds, not hours

```python
model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
)
```

After Optuna hyperparameter tuning: 92% test accuracy on held-out 2024 season data.

The API Layer

FastAPI serves predictions with async refresh:

```python
@app.get("/predict/{session_key}/{driver_number}")
async def predict_pit_window(session_key: int, driver_number: int):
    features = await extract_live_features(session_key, driver_number)
    prediction = model.predict_proba(features)[0][1]
    return {
        "driver": driver_number,
        "pit_probability": float(prediction),
        "recommended_window": calculate_window(features, prediction),
    }
```

Response time: <50ms P99 in production (Streamlit Cloud, us-east-1 region).

Key Challenges

1. Missing Telemetry Data

FastF1 sometimes returns incomplete laps — crashes, red flags, VSC periods. I handled these with cubic spline interpolation over the gap, following the same approach F1 teams use internally.
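A sketch of that gap-filling step (`fill_telemetry_gap` is a hypothetical helper of mine, not part of FastF1):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def fill_telemetry_gap(t, values):
    """Fill NaN gaps in a telemetry channel with a cubic spline
    fitted on the valid samples either side of the gap."""
    t = np.asarray(t, dtype=float)
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    spline = CubicSpline(t[valid], values[valid])
    filled = values.copy()
    filled[~valid] = spline(t[~valid])  # evaluate only at the missing samples
    return filled

# Speed trace with two missing samples (e.g. a brief timing dropout)
t = [0, 1, 2, 3, 4, 5]
speed = [280.0, 275.0, np.nan, np.nan, 250.0, 240.0]
print(fill_telemetry_gap(t, speed))
```

Splines are a reasonable fit here because speed and sector traces are smooth between samples; they would be a poor choice across a red flag, where the car genuinely stops.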

2. Rate Limiting

The unofficial F1 API has limits. I built a caching middleware that stores session data in SQLite and only fetches fresh data every 30 seconds during live sessions.
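A minimal sketch of that middleware (class name, schema, and the `fetch` callable are my assumptions; `fetch` stands in for the actual API call):

```python
import json
import sqlite3
import time

class SessionCache:
    """SQLite-backed cache with a TTL: a key is refetched from upstream
    at most once every `ttl` seconds, otherwise served from disk."""

    def __init__(self, path=":memory:", ttl=30):
        self.db = sqlite3.connect(path)
        self.ttl = ttl
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, payload TEXT, fetched_at REAL)"
        )

    def get(self, key, fetch):
        row = self.db.execute(
            "SELECT payload, fetched_at FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row and time.time() - row[1] < self.ttl:
            return json.loads(row[0])  # fresh enough: serve from SQLite
        payload = fetch()              # stale or missing: hit the upstream API
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, json.dumps(payload), time.time()),
        )
        self.db.commit()
        return payload

calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return {"lap": 12}

cache = SessionCache(ttl=30)
cache.get("monza-2024", fake_fetch)
cache.get("monza-2024", fake_fetch)  # within the TTL: served from SQLite
print(calls["n"])  # → 1
```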

3. Class Imbalance

Only ~3% of laps end in a pit stop. Class weight balancing (`scale_pos_weight=33`) in XGBoost handled this without needing SMOTE.
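The weight is just the negative-to-positive class ratio, so with ~3% positives it lands close to the value above (the counts below are illustrative, not the real dataset's):

```python
n_pit = 150       # laps ending in a pit stop (~3% of 5,000)
n_no_pit = 4850   # all other laps
scale_pos_weight = n_no_pit / n_pit  # negative / positive ratio
print(round(scale_pos_weight, 1))  # → 32.3
```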

What I Learned

  • Domain knowledge beats raw ML — knowing that tire degradation is exponential, not linear, was worth more than 50 extra features
  • Caching is architecture — a well-designed cache made the app feel real-time with 1/100th the API calls
  • Interpretability matters — when the model predicted a strange pit at lap 3, SHAP values let me trace it to a wet-track anomaly in the training data

What's Next

  • Multi-driver simultaneous strategy comparison
  • Live undercut/overcut simulation
  • MongoDB for persistent session cache across deployments
