
Claude Opus 4

Closed · Anthropic · Proprietary · text · vision
Claude 4 · Released 1y ago
Avg score: 69.4 / 100
Context: 200k
Output limit: 32k
Input price: $15.00 /M tokens
Output price: $75.00 /M tokens

Pricing verified 11mo ago · source
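The per-million-token rates above translate to request cost with simple arithmetic. A minimal sketch, using only the prices listed on this card (the function name and example token counts are illustrative):

```python
# Rates from this card: $15.00 per million input tokens,
# $75.00 per million output tokens.
INPUT_PRICE_PER_M = 15.00
OUTPUT_PRICE_PER_M = 75.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the card's listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a 10k-token prompt with a 2k-token reply:
print(round(request_cost(10_000, 2_000), 4))  # → 0.3
```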

Benchmarks

preference

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
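The Elo mechanics behind such pairwise rankings can be sketched as follows. This is the textbook Elo update with an illustrative K-factor; production leaderboards typically fit a Bradley-Terry-style model over all votes instead of updating sequentially:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Modelled probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models; A is preferred once:
print(elo_update(1000.0, 1000.0, a_won=True))  # → (1016.0, 984.0)
```

A higher Elo therefore directly encodes "more frequently preferred": each win moves rating mass from the loser to the winner, scaled by how surprising the outcome was.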

math

AIME 2024 (High risk)
%

American Invitational Mathematics Examination 2024 problems. Answers are integers from 0 to 999; very hard for non-reasoning models.

coding

HumanEval (Saturated)
% pass@1

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
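The pass@1 metric reported here is the fraction of problems whose generated solution passes all unit tests. When multiple samples are drawn per problem, the standard unbiased pass@k estimator (introduced with HumanEval) is 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c passed, evaluated at k."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw pass rate c/n:
print(round(pass_at_k(10, 3, 1), 6))  # → 0.3
```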

% pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.

agentic

% resolved

Real GitHub issues solved end-to-end. The Verified subset is a 500-task, human-validated slice of SWE-bench.

vision

MMMU (Some risk)
%

Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.

MathVista (Some risk)
%

Math reasoning over visual contexts (charts, figures, geometry).

long context

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

performance

tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
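Both metrics fall out of timing a token stream: time-to-first-token (TTFT) is the gap from request to the first chunk, and throughput is tokens emitted divided by the time spent emitting them. A minimal sketch; `stream` and `count_tokens` are placeholders for whatever streaming iterator and tokenizer your client library provides:

```python
import time

def measure_stream(stream, count_tokens):
    """Time an iterable of text chunks.

    Returns (ttft_ms, tokens_per_sec). `stream` stands in for a streaming
    API response (e.g. SSE deltas); `count_tokens` maps a chunk to its
    token count. Adapt both to your client library.
    """
    start = time.perf_counter()
    first = None
    total_tokens = 0
    for chunk in stream:
        now = time.perf_counter()
        if first is None:
            first = now  # first output chunk → TTFT
        total_tokens += count_tokens(chunk)
    end = time.perf_counter()
    if first is None:
        return float("nan"), float("nan")
    ttft_ms = (first - start) * 1000
    gen_time = end - first  # generation window, excluding the wait for chunk 1
    tps = total_tokens / gen_time if gen_time > 0 else float("nan")
    return ttft_ms, tps
```

To match the card's methodology, you would repeat this over many medium-length prompts and report the median of each metric.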

general

Contamination-free average across LiveBench's seven task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.

data analysis

Data-analysis subcategory of LiveBench. Table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.

Providers

Provider         Slug             Input $/M   Output $/M   Context   Quant
Google           google-vertex    $15.00      $75.00       200k      unknown
Anthropic        anthropic        $15.00      $75.00       200k      unknown
Amazon Bedrock   amazon-bedrock   $15.00      $75.00       200k      unknown
Sourced from OpenRouter. Sorted by lowest output price.

Effort variants

Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
