GIFT-Eval: Salesforce's Comprehensive TSFM Benchmark
GIFT-Eval tests foundation models across 23 datasets, 7 domains, and both univariate and multivariate settings. Here's what makes it one of the most thorough TSFM benchmarks available.
Choosing a time series foundation model based on a single metric or a handful of datasets is a recipe for surprises in production. The GIFT-Eval benchmark (General Time Series Forecasting Model Evaluation), developed by Salesforce Research and introduced in their October 2024 paper, was built to address exactly this problem. It evaluates foundation models across 23 datasets, 7 domains, multiple frequencies, and — critically — both univariate and multivariate forecasting settings.
We surface GIFT-Eval rankings on our benchmarks page because it provides one of the most balanced and thorough assessments of how well a model performs across the full range of forecasting tasks that practitioners actually encounter.
Why GIFT-Eval Was Needed
Before GIFT-Eval, most TSFM evaluations relied on a handful of popular datasets — ETTh1/ETTh2, Electricity, Traffic, Weather — that appeared in nearly every paper. These datasets were often sourced from the Monash Forecasting Archive or the GluonTS dataset collection. The problem, as we discuss in our post on benchmarking challenges, is threefold.
First, many models include these exact datasets in their pretraining corpora, making zero-shot evaluation unreliable. Second, these datasets are overwhelmingly univariate, leaving multivariate capability untested. Third, they cluster heavily in a few domains (energy and transportation), giving a skewed picture of generalization.
GIFT-Eval was designed to close all three gaps simultaneously: contamination-aware dataset selection, systematic inclusion of multivariate tasks, and broad domain coverage.
The Dataset Collection
GIFT-Eval spans 23 datasets across 7 domains:
- Energy: electricity consumption, solar generation, wind power — series with strong daily and seasonal patterns, weather-driven variability, and occasional regime shifts. See our energy forecasting case study for how these patterns affect model selection.
- Transport: traffic volumes and transit ridership — high-frequency data with clear weekly cycles, holiday effects, and event-driven anomalies.
- Nature: temperature, river flows, vegetation indices — series with complex multi-scale seasonality (diurnal, annual) and trend components driven by climate patterns.
- Economics: macroeconomic indicators, employment figures, price indices — lower-frequency series where trend and structural breaks dominate. See our discussion of financial time series for the unique challenges in this domain.
- Web traffic: page views, API call volumes — bursty, high-variance data with heavy-tailed distributions.
- Healthcare: hospital admissions, disease incidence — series with seasonal patterns and occasional epidemic-driven outliers. Our healthcare forecasting overview covers this domain in detail.
- Sales: retail demand, product sales — the classic forecasting domain, with promotions, stockouts, and strong calendar effects. See retail demand planning with TSFMs.
This domain diversity is what makes GIFT-Eval useful for practitioners. If your data comes from any of these domains — and most production forecasting problems fall into one of them — the benchmark provides a relevant signal about which models are likely to work.
Univariate and Multivariate Evaluation
One of GIFT-Eval's most important design choices is systematic evaluation of both univariate and multivariate forecasting.
In the univariate setting, each series is forecast independently. The model sees only the target series' history and must predict its future. This tests the model's ability to extract patterns from a single time series, which is the most common deployment scenario for foundation models.
In the multivariate setting, the model receives multiple related series simultaneously and must forecast all of them. This tests whether the model can leverage cross-series correlations — for instance, using correlated energy production across multiple solar farms to improve individual farm forecasts. Multivariate capability is important for use cases involving covariates or hierarchical data structures.
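As a concrete illustration of the two settings, here is how the model inputs differ in shape. The `model.predict` interface in the comments is hypothetical, shown only to make the contrast explicit — it is not the API of any specific library:

```python
import numpy as np

# Univariate: each series is forecast from its own history alone.
history_uni = np.random.rand(512)        # shape (context_length,)

# Multivariate: the model sees several related series at once — e.g.
# output from five solar farms — and forecasts all of them jointly,
# so it can exploit cross-series correlations.
history_multi = np.random.rand(5, 512)   # shape (n_series, context_length)

# Hypothetical TSFM interface (illustrative names, not a real API):
# model.predict(history_uni, horizon=48)    -> forecast of shape (48,)
# model.predict(history_multi, horizon=48)  -> forecast of shape (5, 48)
```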
Some models excel in one setting but not the other. Moirai 2.0, for example, was explicitly designed for multivariate tasks and tends to perform relatively better in GIFT-Eval's multivariate evaluations. TimesFM from Google also handles multivariate inputs. Other models like Chronos are fundamentally univariate architectures and are evaluated only in that setting. For a broader view of where multivariate forecasting stands today, see our state of the art overview.
Metrics: Average Rank and WQL
GIFT-Eval uses two primary metrics.
Average Rank
The headline metric is Average Rank across all dataset-frequency evaluation slices. For each slice, models are ranked from best to worst. A model's Average Rank is the mean of its individual rankings across all slices.
Average Rank has a key property that makes it especially suitable for cross-dataset comparison: it is robust to outliers. If a model achieves a spectacular score on one dataset due to favorable data characteristics (or data contamination), that single result can dominate a mean-error aggregate, but under rank aggregation it only improves the model's rank on that one dataset. Conversely, if a model performs catastrophically on one unusual dataset, its Average Rank absorbs only one bad ranking rather than being dominated by an extreme error value.
A model with a low Average Rank (closer to 1.0) is consistently near the top across all evaluation slices. To illustrate: if a benchmark has 30 evaluation slices and a model places 1st on 10, 2nd on 10, and 3rd on 10, its Average Rank is 2.0 — excellent consistency. A model that places 1st on 15 slices but 8th on the other 15 has an Average Rank of 4.5 despite "winning" more often. This is the property most practitioners care about: reliable performance across diverse tasks, not a model that trades extreme highs for unpredictable lows.
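The arithmetic above can be sketched in a few lines. The error values here are invented purely to show how ranking within each slice absorbs an outlier: model B's catastrophic error on the last slice costs it exactly one last-place rank, nothing more.

```python
import numpy as np

# Hypothetical per-slice errors: rows = models, columns = evaluation
# slices. Lower error is better on every slice.
errors = np.array([
    [0.21, 0.35, 0.18, 0.40],  # model A
    [0.25, 0.30, 0.22, 0.90],  # model B: blows up on the last slice
    [0.19, 0.45, 0.30, 0.35],  # model C
])

# Rank models within each slice (1 = best), then average across slices.
ranks = errors.argsort(axis=0).argsort(axis=0) + 1
avg_rank = ranks.mean(axis=1)
print(avg_rank)  # → [1.75 2.25 2.  ]
```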
Weighted Quantile Loss (WQL)
WQL is the probabilistic accuracy metric. Models produce quantile forecasts at standard levels (0.1, 0.2, ..., 0.9), and WQL measures how well these predicted quantiles match the true data distribution. Lower WQL is better.
WQL evaluates probabilistic forecasting quality directly. A model that produces well-calibrated prediction intervals — tight when it is confident, wide when it is uncertain — will score well on WQL even if its point forecast is slightly less accurate than a competitor's. This matters because calibrated uncertainty estimates are often more valuable in practice than marginal improvements in point accuracy. For a deeper treatment of calibration, see our post on conformal prediction and calibrated intervals.
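One common formulation of weighted quantile loss is pinball (quantile) loss summed over the quantile levels and normalized by the total absolute value of the target. A minimal sketch with illustrative numbers — the exact normalization GIFT-Eval uses may differ in detail:

```python
import numpy as np

def weighted_quantile_loss(y_true, y_pred_q, quantiles):
    """Pinball loss at each quantile level, summed over levels and
    timesteps, normalized by the total absolute target value.
    y_pred_q has shape (len(quantiles), len(y_true))."""
    y_true = np.asarray(y_true, dtype=float)
    total = 0.0
    for q, y_hat in zip(quantiles, y_pred_q):
        diff = y_true - np.asarray(y_hat, dtype=float)
        # Pinball loss: penalizes under- and over-prediction asymmetrically.
        total += np.sum(np.maximum(q * diff, (q - 1) * diff))
    return 2.0 * total / np.sum(np.abs(y_true))

quantiles = [0.1, 0.5, 0.9]
y_true = [10.0, 12.0, 11.0]
y_pred_q = [
    [ 8.0, 10.0,  9.0],   # 0.1 quantile
    [10.0, 12.0, 11.0],   # 0.5 quantile (perfect median)
    [12.0, 14.0, 13.0],   # 0.9 quantile
]
print(round(weighted_quantile_loss(y_true, y_pred_q, quantiles), 4))  # → 0.0727
```

Note that the perfect median contributes zero loss; the score here comes entirely from the width of the 0.1/0.9 interval, which is exactly the calibration trade-off WQL measures.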
The interplay between Average Rank and WQL is informative. A model with a good Average Rank but mediocre WQL may be consistently placing in the top 3-5 on point accuracy but producing poorly calibrated uncertainty estimates. For applications where prediction intervals matter — inventory planning, risk management, capacity planning — WQL is the more relevant metric.
Evaluation Protocol
GIFT-Eval follows a standardized protocol designed for fair comparison:
- Zero-shot evaluation. Models are tested without any training or fine-tuning on the target datasets. This tests out-of-the-box generalization.
- Quantile forecasting. All participating models must produce quantile predictions at the standard levels. Models that only produce point forecasts are evaluated separately or adapted using conformal prediction methods.
- Multiple frequency bands. Results are reported across different temporal frequencies. A model that ranks well on hourly data but poorly on monthly data will show this variation in its per-frequency rankings, even if its Average Rank hides the discrepancy.
- Grouped aggregation. Results are grouped by univariate vs. multivariate setting, allowing practitioners to assess model suitability for their specific task type. The grouped CSVs published by Salesforce on the GIFT-Eval HuggingFace Space provide this breakdown transparently.
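Assuming the per-slice results are available as a table with model, setting, and rank columns (the column names and values below are illustrative, not the exact schema of the published CSVs), the grouped aggregation step reduces to a groupby:

```python
import pandas as pd

# Hypothetical per-slice results; the published GIFT-Eval CSVs are
# similar in spirit but not identical in schema.
results = pd.DataFrame({
    "model":   ["A", "A", "A", "B", "B", "B"],
    "setting": ["univariate", "univariate", "multivariate"] * 2,
    "rank":    [1, 2, 3, 2, 1, 1],
})

# Average Rank per model, grouped by univariate vs. multivariate setting.
grouped = results.groupby(["setting", "model"])["rank"].mean().unstack()
print(grouped)
```

This is the breakdown that lets a practitioner read off, say, that a model is strong univariately but weak multivariately before committing to it.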
Interpreting GIFT-Eval Results
When reading GIFT-Eval rankings on our benchmarks page, keep these considerations in mind:
Average Rank reflects consistency, not magnitude. A model ranked 1st with an Average Rank of 2.3 and a model ranked 2nd with 2.5 may be effectively tied. Small differences in Average Rank often fall within statistical noise, especially across only 23 datasets. Look for separation of at least 0.5-1.0 ranking positions to identify meaningful differences.
Univariate and multivariate rankings can diverge. If your use case is univariate forecasting, focus on the univariate rankings. If you need multivariate capability, the multivariate rankings are more predictive. Our benchmarks page shows univariate-grouped results by default.
Domain-specific performance varies. A model's aggregate rank can mask significant domain-level differences. If you know your data domain, the per-dataset results in the GIFT-Eval source data provide more targeted guidance.
How We Use GIFT-Eval at TSFM.ai
GIFT-Eval complements FEV Bench in our evaluation framework. Where FEV Bench emphasizes zero-shot point accuracy via MASE-derived Skill Score, GIFT-Eval adds probabilistic evaluation (WQL), multivariate assessment, and a rank-based aggregation that is more robust to outlier datasets.
For our model routing system, GIFT-Eval's domain-stratified results are especially valuable. When a user submits series from the energy domain, the router can weight GIFT-Eval's energy-domain rankings more heavily than aggregate rankings, providing a more targeted model recommendation. For more on how we approach model selection, see our 2026 toolkit guide.
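A toy sketch of this kind of domain-weighted blending — the weights and rank values are invented for illustration and do not reflect our production routing logic:

```python
# Hypothetical Average Ranks: aggregate vs. energy-domain-only.
aggregate_rank = {"model_a": 2.1, "model_b": 2.4}
energy_rank    = {"model_a": 3.0, "model_b": 1.8}

def routed_score(model, domain_weight=0.7):
    # Blend domain-specific and aggregate Average Rank; lower is better.
    return (domain_weight * energy_rank[model]
            + (1 - domain_weight) * aggregate_rank[model])

# With these numbers the energy specialist wins despite a worse
# aggregate rank, which is the point of domain stratification.
best = min(aggregate_rank, key=routed_score)
print(best)  # → model_b
```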
We also use GIFT-Eval results to validate new models before adding them to our inference platform. A model that performs well on FEV Bench but poorly on GIFT-Eval's probabilistic metrics may produce accurate point forecasts but unreliable prediction intervals — a combination that can be actively misleading for practitioners who rely on uncertainty estimates for decision-making.
For practitioners evaluating models, we recommend using GIFT-Eval alongside FEV Bench for general-purpose assessments, and adding BOOM if your data comes from the observability or infrastructure monitoring domain. If you are deciding between zero-shot and fine-tuned approaches, our fine-tuning vs. zero-shot comparison provides further context on when each strategy makes sense.