
Claude Opus 4

Closed · Anthropic · Proprietary · text · vision
Claude 4 · Released 1y ago
Avg score: 69.4 / 100
Context: 200k
Output limit: 32k
Input price: $15.00 /M tokens
Output price: $75.00 /M tokens

Pricing verified 11mo ago · source
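The per-million-token rates above translate to request cost with simple arithmetic. A minimal sketch, using only the prices listed on this card (the function name and example token counts are illustrative):

```python
# Rates from this card: $15.00 per million input tokens,
# $75.00 per million output tokens.
INPUT_PRICE_PER_M = 15.00
OUTPUT_PRICE_PER_M = 75.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the card's listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# e.g. a 10k-token prompt with a 2k-token reply:
print(round(request_cost(10_000, 2_000), 4))  # → 0.3
```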

Benchmarks

preference

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
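The Elo mechanics behind such pairwise rankings can be sketched as follows. This is the textbook Elo update with an illustrative K-factor; production leaderboards typically fit a Bradley-Terry-style model over all votes instead of updating sequentially:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Modelled probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one pairwise vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models; A is preferred once:
print(elo_update(1000.0, 1000.0, a_won=True))  # → (1016.0, 984.0)
```

A higher Elo therefore directly encodes "more frequently preferred": each win moves rating mass from the loser to the winner, scaled by how surprising the outcome was.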

math

AIME 2024 (High risk)
%

American Invitational Mathematics Examination 2024 problems. Answers are integers from 0 to 999; very hard for non-reasoning models.

coding

HumanEval (Saturated)
% pass@1

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
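The pass@1 metric reported here is the fraction of problems whose generated solution passes all unit tests. When multiple samples are drawn per problem, the standard unbiased pass@k estimator (introduced with HumanEval) is 1 − C(n−c, k)/C(n, k), where n samples are drawn and c of them pass. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c passed, evaluated at k."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the raw pass rate c/n:
print(round(pass_at_k(10, 3, 1), 6))  # → 0.3
```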

% pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.

agentic

% resolved

Real GitHub issues solved end-to-end. The Verified subset is a 500-task, human-validated slice of SWE-bench.

vision

MMMU (Some risk)
%

Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.

MathVista (Some risk)
%

Math reasoning over visual contexts (charts, figures, geometry).

long context

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

performance

tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
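Both metrics fall out of timing a token stream: time-to-first-token (TTFT) is the gap from request to the first chunk, and throughput is tokens emitted divided by the time spent emitting them. A minimal sketch; `stream` and `count_tokens` are placeholders for whatever streaming iterator and tokenizer your client library provides:

```python
import time

def measure_stream(stream, count_tokens):
    """Time an iterable of text chunks.

    Returns (ttft_ms, tokens_per_sec). `stream` stands in for a streaming
    API response (e.g. SSE deltas); `count_tokens` maps a chunk to its
    token count. Adapt both to your client library.
    """
    start = time.perf_counter()
    first = None
    total_tokens = 0
    for chunk in stream:
        now = time.perf_counter()
        if first is None:
            first = now  # first output chunk → TTFT
        total_tokens += count_tokens(chunk)
    end = time.perf_counter()
    if first is None:
        return float("nan"), float("nan")
    ttft_ms = (first - start) * 1000
    gen_time = end - first  # generation window, excluding the wait for chunk 1
    tps = total_tokens / gen_time if gen_time > 0 else float("nan")
    return ttft_ms, tps
```

To match the card's methodology, you would repeat this over many medium-length prompts and report the median of each metric.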

general

Contamination-free average across LiveBench's seven task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.

data analysis

Data-analysis subcategory of LiveBench. Table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.

Providers

Provider         Slug             Input $/M   Output $/M   Context   Quant
Google           google-vertex    $15.00      $75.00       200k      unknown
Anthropic        anthropic        $15.00      $75.00       200k      unknown
Amazon Bedrock   amazon-bedrock   $15.00      $75.00       200k      unknown
Sourced from OpenRouter. Sorted by lowest output price.

Effort variants

Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
