GPT-4.1

Name: GPT-4.1
Brand: OpenAI
Rating: 32.6 (8 reviews)

Closed

OpenAI

Proprietary

text

vision

GPT-4

Released 1y ago

Avg score

32.6 / 100

Context

1.0M

Output limit

33k

Input price

—

Output price

—

Benchmarks

preference

Chatbot Arena Elo (Fresh)

Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
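As a rough illustration of how pairwise preferences turn into a rating, the sketch below uses the classic Elo formula. It is an assumption for illustration only: the page does not say how the leaderboard fits its ratings, and the function names, the K-factor of 32 and the starting ratings are all hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one pairwise vote (k is a hypothetical step size)."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta


# Two equally rated models: the preferred one gains k/2 points.
new_a, new_b = update(1200.0, 1200.0, a_won=True)
print(new_a, new_b)
```

Under this model a 400-point gap corresponds to the higher-rated model being preferred roughly ten times out of eleven.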

agentic

SWE-bench Verified (Some risk)

% resolved

Real GitHub issues solved end-to-end. The Verified subset is a 500-task, human-validated slice of SWE-bench.

math

FrontierMath Tiers 1-3 (Fresh)

Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models solve only a small fraction. It is among the hardest publicly reported benchmarks for general mathematical reasoning.

OTIS Mock AIME 2024-2025 (Fresh)

AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. The problems are close in spirit to the public AIME but novel, so they are unlikely to appear in training data.

reasoning

Humanity's Last Exam (Fresh)

A challenging multi-disciplinary exam aggregating expert-written questions from across academic fields. Designed to discriminate at the very top of the capability range when MMLU-style tests saturate.

ARC-AGI 2 (Fresh)

Second-generation ARC challenge testing fluid reasoning over abstract visual puzzles. Resists training-data memorisation by construction: each puzzle is novel and solutions require multi-step pattern induction. Frontier models are only just starting to score above chance on the harder tier.

coding

Aider Polyglot (Fresh)

Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.

composite

Frontier Composite (Fresh)

ECI

Saturation-resistant composite capability score built from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so strong results on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) move the score, while saturated benchmarks contribute almost nothing. Imported per model from Epoch AI's published index; we rescale it to the same min-max scale we use for every other benchmark, so it can be weighted directly in scenarios.
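To make the difficulty-and-slope idea concrete, here is a minimal sketch using the standard two-parameter logistic (2PL) item response function, followed by a min-max rescale to 0-100. The page does not state which IRT variant the index uses, so the 2PL form, the function names and every number below are assumptions chosen for illustration.

```python
import math


def p_solve(ability: float, difficulty: float, slope: float) -> float:
    """2PL item response function: chance a model of given ability solves an item."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))


def min_max(value: float, lo: float, hi: float) -> float:
    """Rescale value from the range [lo, hi] to a 0-100 scale."""
    if hi == lo:
        raise ValueError("lo and hi must differ")
    return 100.0 * (value - lo) / (hi - lo)


# Two hypothetical models of ability 2.0 and 3.0 on two hypothetical benchmarks.
# A saturated benchmark (difficulty -3) barely separates them ...
easy_gap = p_solve(3.0, -3.0, 1.0) - p_solve(2.0, -3.0, 1.0)
# ... while a hard benchmark (difficulty 2.5) separates them clearly.
hard_gap = p_solve(3.0, 2.5, 1.0) - p_solve(2.0, 2.5, 1.0)
print(round(easy_gap, 4), round(hard_gap, 4))

# Rescaling a raw composite value onto the shared 0-100 scale (hypothetical bounds).
print(min_max(120.0, 100.0, 150.0))
```

The gap on the saturated benchmark is a small fraction of the gap on the hard one, which is why a fitted model of this kind lets hard benchmarks dominate the composite.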

Hosted endpoints

Host	Input $/M	Output $/M	Context	Quant
Host A	$2.00	$8.00	1.0M	unknown
Host A	$2.00	$8.00	1.0M	unknown
Host B	$2.00	$8.00	1.0M	unknown

Anonymised third-party hosts. Sorted by lowest output price.
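The per-million-token prices in the table translate into a per-request cost as sketched below. The `Endpoint` class, the `request_cost` helper and the example token counts are hypothetical; only the $2.00 and $8.00 prices come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


def request_cost(ep: Endpoint, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the endpoint's listed prices."""
    total = input_tokens * ep.input_per_m + output_tokens * ep.output_per_m
    return total / 1_000_000


endpoints = [
    Endpoint("Host A", 2.00, 8.00),
    Endpoint("Host B", 2.00, 8.00),
]

# Same ordering rule as the table: lowest output price first.
ranked = sorted(endpoints, key=lambda ep: ep.output_per_m)

# Hypothetical request: 100k input tokens, 2k output tokens.
cost = request_cost(ranked[0], 100_000, 2_000)
print(f"{ranked[0].host}: ${cost:.3f}")
```

With long prompts the input side dominates the bill even though output tokens cost four times as much per token.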
