BOOM: Datadog's Observability Forecasting Benchmark
BOOM evaluates time series models on 2,807 real-world production monitoring series from Datadog. Here's how it works and why observability data demands its own benchmark.
Most forecasting benchmarks evaluate models on clean, curated academic datasets: retail sales, electricity demand, weather observations. These are useful for measuring general-purpose ability, but they do not reflect the statistical reality of a large and growing category of time series data: production infrastructure metrics. CPU utilization counters, request latency percentiles, error rates, memory pressure gauges, network throughput measurements — the telemetry that modern cloud systems generate at enormous scale.
BOOM (Benchmark Of Observability Metrics), created by Datadog and detailed in their research paper, fills this gap. It evaluates time series foundation models on 2,807 real-world observability time series drawn from production infrastructure, making it the largest domain-specific TSFM benchmark available and one of the most practically relevant for anyone operating cloud infrastructure.
We include BOOM results on our benchmarks page because observability forecasting is a first-class use case for our platform, and practitioners in this domain deserve benchmark results that reflect their actual data.
Why Observability Data Needs Its Own Benchmark
Observability metrics have statistical properties that differ fundamentally from the datasets used in general-purpose benchmarks like FEV Bench or GIFT-Eval. As we discuss in our Toto model overview, these differences are significant enough to change model rankings.
Heavy-tailed distributions
Production metrics regularly exhibit extreme values that would be outliers in retail or weather data but are routine in infrastructure. A 99th-percentile latency spike during a garbage collection pause, a 10x traffic surge from a viral event, a memory usage jump after a deployment — these events are normal operating behavior, not anomalies. Models trained primarily on well-behaved academic data often struggle to produce calibrated forecasts when the underlying distribution has fat tails.
Abrupt regime changes
Deployments, auto-scaling events, configuration changes, and infrastructure migrations can shift the baseline of a metric instantaneously. A CPU utilization series that hovered around 40% for weeks may jump to 70% after a code deployment that adds a new background process. These regime changes are far more common in observability data than in traditional forecasting domains, where shifts tend to be gradual (seasonal transitions) or structurally predictable (holiday effects).
Mixed and irregular sampling
Observability data arrives at various cadences. Some metrics report every 10 seconds, others every minute or every 5 minutes. Collection agents may drop data points under high load. Aggregation layers may switch resolution tiers based on data age. A model that assumes clean, regularly sampled input will encounter constant friction with real observability data. For a broader discussion of how context length and history interact with sampling irregularity, see our dedicated post.
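To make this concrete, here is a minimal pandas sketch of regularizing a mixed-cadence metric before forecasting. The timestamps, values, and 1-minute target grid are all illustrative, and forward-fill is just one gap-handling choice among several (interpolation or explicit missing-value masks also work):

```python
import pandas as pd

# Hypothetical CPU metric with mixed cadence and dropped points:
# 10-second reports for the first minute, then sparse 1-5 minute gaps.
times = pd.to_datetime([
    "2024-01-01 00:00:00", "2024-01-01 00:00:10", "2024-01-01 00:00:20",
    "2024-01-01 00:01:30", "2024-01-01 00:04:00", "2024-01-01 00:09:00",
])
cpu = pd.Series([40.0, 42.0, 41.0, 55.0, 43.0, 44.0], index=times)

# Regularize to a 1-minute grid: mean-aggregate within each bin,
# then forward-fill the bins where no data point arrived.
regular = cpu.resample("1min").mean().ffill()
print(len(regular))  # 10 one-minute bins, 00:00 through 00:09
```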
Correlated multivariate structure
Infrastructure metrics are deeply interrelated. CPU utilization, memory usage, request throughput, error rate, and latency for a single host are not independent series — they reflect the same underlying system state. Cross-metric correlations carry semantic meaning that a model can exploit if it has the architecture to do so. This is why multivariate forecasting capability matters more in observability than in many other domains.
The BOOM Dataset
BOOM evaluates on 2,807 individual time series extracted from 350 million observations of real production infrastructure data from Datadog's platform.
The series span five observability subcategories:
- Infrastructure metrics: CPU, memory, disk, and load measurements from compute instances.
- Networking metrics: throughput, packet rates, connection counts, and bandwidth utilization.
- Database metrics: query rates, connection pool sizes, replication lag, and cache hit ratios.
- Security metrics: authentication attempts, firewall event rates, and alert volumes.
- Application metrics: request rates, error rates, response times, and queue depths.
This categorization matters because model performance can vary significantly across these subcategories. A model that forecasts CPU utilization well may struggle with bursty security alert data. BOOM's breadth across subcategories provides a more complete picture than evaluating on a single type of observability metric.
Metrics: CRPS and MASE
BOOM uses two metrics that together capture both probabilistic and point-forecast accuracy.
CRPS (Continuous Ranked Probability Score)
CRPS is the primary ranking metric. It measures the quality of a model's full probabilistic forecast by comparing the predicted cumulative distribution function against the observed value. Lower CRPS is better.
CRPS has several advantages for evaluating observability forecasts. It rewards both calibration (the predicted distribution assigns appropriate probability to the range where the actual value falls) and sharpness (the distribution is concentrated rather than diffuse). A model that produces wide, uninformative prediction intervals will score poorly on CRPS even if the true value always falls within those intervals.
For observability use cases, calibrated probabilistic forecasts are essential. Anomaly detection systems that fire when observed values fall outside predicted intervals depend on those intervals being neither too tight (causing alert fatigue from false positives) nor too loose (missing genuine anomalies). CRPS directly measures this property. For background on what makes a good probabilistic forecast, see our post on prediction intervals vs. point forecasts, and for calibration techniques, see conformal prediction.
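CRPS can be estimated directly from forecast samples using its energy form, E|X - y| - ½E|X - X'|. A minimal NumPy sketch (the sample counts and distributions are illustrative; it shows how a sharp, well-centered forecast scores better than a diffuse one):

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate via the energy form:
    E|X - y| - 0.5 * E|X - X'|, where X, X' are independent draws
    from the forecast distribution and y is the observed value."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
obs = 0.0
sharp = rng.normal(0.0, 1.0, 2000)    # concentrated forecast
diffuse = rng.normal(0.0, 5.0, 2000)  # wide, uninformative forecast

# Both distributions cover the observed value, but the sharp one
# scores lower (better) because CRPS also rewards concentration.
print(crps_from_samples(sharp, obs) < crps_from_samples(diffuse, obs))  # True
```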
MASE (Mean Absolute Scaled Error)
MASE provides a complementary point-forecast accuracy measure, normalized against a seasonal naive baseline. It answers the practical question: is this model's forecast more accurate than simply repeating the last observed seasonal pattern?
For observability data, the seasonal naive baseline is often surprisingly strong. Many infrastructure metrics exhibit clear diurnal and weekly cycles driven by user activity patterns. A model needs to substantially beat this baseline to justify the complexity of running a foundation model in a monitoring pipeline. To put numbers on this: a MASE of 0.75 means the model's errors are 25% smaller than the naive baseline's. Below 0.5 represents a major improvement; above 1.0 means the model is worse than simply repeating the last week's pattern.
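The MASE computation itself is compact. A sketch against the seasonal-naive baseline, using a synthetic hourly series with a daily cycle (the series, noise level, and period of 24 are illustrative):

```python
import numpy as np

def mase(y_true, y_pred, y_train, season):
    """Forecast MAE scaled by the in-sample seasonal-naive MAE.
    season is the seasonal period, e.g. 24 for hourly data with a daily cycle."""
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))) / naive_mae

rng = np.random.default_rng(1)
t = np.arange(24 * 14)  # two weeks of hourly observations
series = 50 + 20 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 2, t.size)
train, actual = series[:-24], series[-24:]

# The seasonal-naive forecast: repeat the last observed day.
seasonal_naive = train[-24:]
print(round(mase(actual, seasonal_naive, train, season=24), 2))
```

By construction the seasonal-naive forecast itself lands near 1.0, so the 0.75 figure in the text corresponds to errors 25% below that baseline.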
When CRPS and MASE rankings diverge for a given model, it usually indicates one of two situations: the model produces accurate point forecasts but poorly calibrated uncertainty estimates (good MASE, poor CRPS), or the model produces excellent distributional forecasts but its median prediction is slightly off-center (good CRPS, mediocre MASE). The first case is problematic for anomaly detection; the second is usually acceptable since the full distribution is still reliable.
Evaluation Protocol
BOOM follows a zero-shot evaluation protocol consistent with other major benchmarks:
- No fine-tuning. Models receive raw observability series and must forecast directly. This tests whether a model's pretraining has equipped it to handle observability data patterns without adaptation.
- Fixed horizons. Each series has a predefined forecast horizon appropriate to its monitoring context. Short-term horizons (minutes to hours) test the model's ability to capture high-frequency dynamics. Longer horizons test trend and seasonality extraction.
- Full distributional evaluation. Models must produce probabilistic forecasts, not just point predictions. This is non-negotiable for the observability domain, where the forecast distribution directly drives alerting and capacity planning decisions.
- Aggregation across all series. CRPS and MASE are computed per-series and then aggregated across the full 2,807 series. This aggregation dilutes any lucky performance on individual series and rewards models that are consistently strong across the diverse mix of observability data types.
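A sketch of the per-series-then-aggregate pattern, with made-up MASE scores for three hypothetical series (the specific aggregation function BOOM uses is not detailed here; arithmetic and geometric means are both common choices for scaled metrics):

```python
import numpy as np

# Made-up per-series MASE scores from three different subcategories.
per_series = {"cpu.util": 0.7, "net.throughput": 0.9, "auth.attempts": 1.4}
scores = np.array(list(per_series.values()))

arith = scores.mean()                 # penalizes a single bad series linearly
geo = np.exp(np.log(scores).mean())   # damps the influence of outlier series
print(round(arith, 3), round(geo, 3))  # 1.0 0.959
```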
What BOOM Results Reveal
BOOM results consistently show patterns that differ from general-purpose benchmarks:
Domain-specific models can outperform larger general models. Toto, Datadog's own 151M-parameter model trained on observability data, demonstrates that pretraining on domain-relevant data can matter more than model scale. Models with 10x more parameters but trained on general corpora sometimes underperform smaller models whose pretraining distribution matches observability data. This echoes a broader pattern we see with tiny, specialized models outperforming generalist architectures in specific niches.
Tokenization strategy matters more on observability data. Models that use binning-based tokenization (like Chronos) and models that operate on continuous values directly can show larger performance gaps on observability data than on general benchmarks. The heavy-tailed, regime-shifting nature of infrastructure metrics stresses tokenization schemes differently than smooth, well-behaved academic series. Models built on diffusion-based approaches sidestep tokenization entirely and may handle heavy tails more naturally.
Probabilistic calibration separates models. On general benchmarks, many models produce similarly adequate uncertainty estimates. On BOOM's heavy-tailed observability data, calibration differences become pronounced. Models that assume Gaussian output distributions tend to produce over-confident intervals that underestimate tail risk — a real problem for alerting systems. Mixture-of-experts architectures and models with Student-t output heads handle this better by construction.
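The gap between a Gaussian and a heavy-tailed output head is easy to see numerically. A small simulation (the unit scale and 3 degrees of freedom are illustrative, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
gauss = rng.normal(0.0, 1.0, n)   # Gaussian output head
heavy = rng.standard_t(3, n)      # Student-t head, df=3: fat tails

# The 99.9th-percentile "alert threshold" each head implies:
g_q = np.quantile(gauss, 0.999)
t_q = np.quantile(heavy, 0.999)
print(round(g_q, 1), round(t_q, 1))
```

In this simulation the Gaussian threshold sits around 3 standardized units while the Student-t threshold is roughly three times higher: a Gaussian head fit to heavy-tailed data would flag routine tail excursions as anomalies.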
How We Use BOOM at TSFM.ai
BOOM is particularly important for our model routing system because observability forecasting is a core use case for our platform. When a user submits infrastructure metrics for forecasting, BOOM rankings provide the most relevant signal for model selection.
Specifically:
- BOOM CRPS rankings inform the router's model selection when incoming series are detected as observability data (based on frequency, statistical properties, and metadata).
- Models that perform well on BOOM but are not available through our inference API are flagged for prioritized onboarding.
- We use BOOM's subcategory breakdown to inform domain-specific routing — a model that excels on networking metrics may not be the best choice for application-level request rate data. See our network traffic and telecom forecasting post for more on this distinction.
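The detection step above can be sketched as a simple heuristic. This is a hypothetical illustration, not the actual router logic; the function name, thresholds, and fingerprints (sub-minute cadence, fat-tailed excursions) are all assumptions for the sake of the example:

```python
import numpy as np

def looks_like_observability(series, freq_seconds):
    """Hypothetical detector sketch: flag high-frequency series with
    fat-tailed excursions, two statistical fingerprints of telemetry."""
    series = np.asarray(series, dtype=float)
    sub_minute = freq_seconds <= 60
    z = (series - series.mean()) / (series.std() + 1e-9)
    spiky = np.mean(np.abs(z) > 3) > 0.005  # >0.5% of points past 3 sigma
    return bool(sub_minute and spiky)

rng = np.random.default_rng(0)
telemetry = rng.normal(50.0, 1.0, 5000)
telemetry[::100] += 12.0  # deployment/GC-style spikes

smooth = 50 + 10 * np.sin(np.linspace(0, 20, 500))  # gentle hourly cycle

print(looks_like_observability(telemetry, freq_seconds=10))  # True
print(looks_like_observability(smooth, freq_seconds=3600))   # False
```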
For practitioners working with infrastructure metrics, BOOM results on our benchmarks page provide the most directly relevant guidance. If your data looks like production telemetry — high-frequency, spiky, heavy-tailed, with regime changes — BOOM rankings are a better predictor of real-world performance than general-purpose benchmarks. For guidance on building a production-grade pipeline around these models, see building production forecast pipelines and our notes on scaling inference with GPU optimization.
We recommend using BOOM alongside FEV Bench and GIFT-Eval for a complete picture. FEV Bench tells you about general zero-shot ability, GIFT-Eval tells you about probabilistic accuracy and multivariate capability, and BOOM tells you whether the model can actually handle the messy reality of production infrastructure data. For an overview of how all these benchmarks and models fit together, see our 2026 TSFM toolkit guide.