GPT-5.1

Closed
OpenAI
Proprietary
text
vision
GPT-5
Released 6mo ago
Avg score
56.4
/ 100
Context
400k
Output limit
100k
Input price
$1.25 /M
Output price
$10.00 /M

Pricing verified 17d ago · Estimate; tracks the GPT-5 base tier.
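As a rough illustration of how the per-million-token prices above translate into a per-request cost, here is a minimal sketch. The function name and the example token counts are hypothetical; only the two prices come from this page, and they are themselves marked as an estimate.

```python
# Prices as listed on this page (USD per million tokens).
INPUT_PRICE_PER_M = 1.25
OUTPUT_PRICE_PER_M = 10.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Example: a 10k-token prompt with a 2k-token reply.
example_cost = request_cost(10_000, 2_000)
print(example_cost)  # 0.0325
```

At these rates an output token costs eight times as much as an input token, so long replies dominate the bill even when prompts are large.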

Benchmarks

preference

Chatbot Arena Elo · Fresh
Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
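To make "more frequently preferred" concrete, the sketch below uses the conventional Elo expected-score formula on a 400-point logistic scale. This is an assumption for illustration: the page does not state how the ratings are fitted, and the function name and example ratings are hypothetical.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Probability that model A's response is preferred over model B's,
    under the conventional Elo formula (assumed, not confirmed by the page)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give an even split; a 100-point lead gives roughly 64%.
print(expected_preference(1500, 1500))            # 0.5
print(round(expected_preference(1600, 1500), 3))  # 0.64
```

Under this convention only the rating difference matters, so an Elo figure is meaningful relative to other models on the same leaderboard rather than as an absolute score.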

math

FrontierMath Tiers 1-3 · Fresh
%

Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models solve only a small fraction. It is among the hardest publicly reported benchmarks for general mathematical reasoning.

OTIS Mock AIME 2024-2025 · Fresh
%

AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. The problems are close in spirit to the public AIME but novel, so they are unlikely to appear in training data.

knowledge

SimpleQA Verified · Fresh
%

A human-validated factuality benchmark of short factual questions whose answers can be checked against a single ground truth. Penalises hallucinations by scoring confidently-wrong answers below abstentions.
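The idea of ranking confidently-wrong answers below abstentions can be sketched with a simple per-answer point scheme. The point values and the function name below are hypothetical and chosen only for illustration; they are not the benchmark's official metric, which this page does not specify.

```python
# Hypothetical points per graded answer (NOT the official weights):
# a wrong answer costs more than declining to answer.
POINTS = {"correct": 1.0, "abstain": 0.0, "incorrect": -1.0}


def penalised_score(grades: list[str]) -> float:
    """Average points over a list of grades drawn from POINTS' keys."""
    if not grades:
        raise ValueError("need at least one graded answer")
    return sum(POINTS[g] for g in grades) / len(grades)


# Same two correct answers; the only difference is guessing wrong vs. abstaining.
guesser = penalised_score(["correct", "correct", "incorrect", "incorrect"])
abstainer = penalised_score(["correct", "correct", "abstain", "abstain"])
print(guesser, abstainer)  # 0.0 0.5
```

Under any scheme of this shape, a model that knows when it does not know outranks one with the same number of correct answers that guesses on the rest.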

reasoning

Humanity's Last Exam · Fresh
%

A challenging multi-disciplinary exam aggregating expert-written questions from across academic fields. Designed to discriminate at the very top of the capability range when MMLU-style tests saturate.

ARC-AGI 2 · Fresh
%

Second-generation ARC challenge testing fluid reasoning over abstract visual puzzles. Resists training-data memorisation by construction: each puzzle is novel and solutions require multi-step pattern induction. Frontier models are only just starting to score above chance on the harder tier.

agentic

Terminal-Bench 2 · Fresh
%

Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.

composite

Frontier Composite · Fresh
ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we rescale it to the same min-max scale we use for every other benchmark, so it can be weighted directly in scenarios.
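A fitted difficulty and a discriminative slope per benchmark match the shape of a two-parameter logistic (2PL) IRT model, so the sketch below uses that form to show why saturated benchmarks barely separate strong models, followed by a min-max rescale. The exact model behind the index, the 0-100 target range, and all numbers and function names here are assumptions for illustration.

```python
import math


def two_pl(ability: float, difficulty: float, slope: float) -> float:
    """2PL IRT curve: expected score of a model with the given ability
    on a benchmark with the given difficulty and slope (assumed model form)."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))


def min_max(value: float, lo: float, hi: float) -> float:
    """Rescale value from [lo, hi] onto an assumed 0-100 range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return 100.0 * (value - lo) / (hi - lo)


# Two strong hypothetical models, abilities 2.0 and 3.0.
# Saturated benchmark (easy, difficulty -3): both score near the ceiling.
easy_gap = two_pl(3.0, -3.0, 1.0) - two_pl(2.0, -3.0, 1.0)
# Hard benchmark (difficulty 2.5, steeper slope): the same models are far apart.
hard_gap = two_pl(3.0, 2.5, 2.0) - two_pl(2.0, 2.5, 2.0)
print(round(easy_gap, 3), round(hard_gap, 3))

# Rescaling a raw index value of 130 when the observed range is 100 to 160.
print(min_max(130.0, 100.0, 160.0))  # 50.0
```

The hard benchmark separates the two models by a wide margin while the saturated one separates them by less than a percentage point, which is the behaviour the description above attributes to the composite.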

Hosted endpoints

Host    Input $/M  Output $/M  Context  Quant
Host A  $1.25      $10.00      400k     unknown
Host A  $1.25      $10.00      400k     unknown
Host C  $1.25      $10.00      400k     unknown
Anonymised third-party hosts. Sorted by lowest output price.
