
Claude Opus 4 (Thinking)

think
Closed
Anthropic
Proprietary
text
vision
Claude 4 · Released 1y ago
Avg score: 56.7 / 100
Context: 200k
Output limit: 32k
Input price: $15.00 /M
Output price: $75.00 /M

Pricing verified 11mo ago · Same per-token price as the standard variant; reasoning mode burns significantly more output tokens per request.
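To make the cost note concrete, here is a minimal sketch of the per-request arithmetic at the listed prices. The token counts in the usage lines are hypothetical, and the sketch assumes reasoning tokens are billed at the output rate, as the note above implies.

```python
# Prices taken from the listing above (USD per million tokens).
INPUT_PRICE_PER_M = 15.00
OUTPUT_PRICE_PER_M = 75.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of a single request."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical token counts: same prompt, with and without a long reasoning trace.
standard_cost = request_cost(input_tokens=2_000, output_tokens=500)
thinking_cost = request_cost(input_tokens=2_000, output_tokens=8_000)
print(f"standard: ${standard_cost:.4f}, thinking: ${thinking_cost:.4f}")
```

Because the output rate is five times the input rate, extra reasoning tokens dominate the bill long before the prompt does.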

Benchmarks

coding

LiveCodeBench · Fresh
% pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
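For readers unfamiliar with the unit, the sketch below shows one common way a pass@1 percentage is computed. It uses the widely cited unbiased pass@k estimator; whether this leaderboard computes the figure the same way is an assumption, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Average pass@1 over problems, as a percentage.

    results holds one (n_samples, n_correct) pair per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, so with a single sample per problem pass@1 is simply the share of problems solved on the first attempt.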

performance

Output Speed: N/A tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token: N/A ms

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
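The two performance metrics above can be derived from per-request timestamps. The following is a rough sketch under stated assumptions: the record fields (`request_sent`, `first_chunk`, `last_chunk`, `output_tokens`) are hypothetical names, and output speed is measured from the first chunk onward, which may differ from how this leaderboard defines it.

```python
from statistics import median


def summarize(runs: list[dict]) -> tuple[float, float]:
    """Return (median time to first token in ms, median output speed in tok/s).

    Each run holds timestamps in seconds plus the number of output tokens.
    """
    # Time to first token: delay between sending the request and the first chunk.
    ttft_ms = [(r["first_chunk"] - r["request_sent"]) * 1000 for r in runs]
    # Output speed: tokens divided by the streaming duration (assumed to exclude TTFT).
    speeds = [r["output_tokens"] / (r["last_chunk"] - r["first_chunk"]) for r in runs]
    return median(ttft_ms), median(speeds)
```

Under this definition a reasoning model's thinking phase shows up as a longer time to first token, which is consistent with the penalty described above.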

general

Rolling Contamination-Controlled Average · Fresh
/100

Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
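As an illustration of how a score like this could be assembled, the sketch below takes an unweighted mean over the seven categories named above. The equal weighting and the dictionary keys are assumptions; the description does not say how categories are weighted.

```python
from statistics import mean

# The seven categories named in the benchmark description.
CATEGORIES = (
    "reasoning",
    "coding",
    "agentic coding",
    "mathematics",
    "data analysis",
    "language",
    "instruction following",
)


def rolling_average(scores: dict[str, float]) -> float:
    """Unweighted mean of per-category scores on a 0-100 scale (assumed weighting)."""
    missing = [c for c in CATEGORIES if c not in scores]
    if missing:
        raise ValueError(f"missing category scores: {missing}")
    return mean(scores[c] for c in CATEGORIES)
```

Requiring all seven categories avoids silently averaging over a partial set, which would make scores incomparable across models.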

data analysis

Rolling Data Analysis · Fresh
/100

Rolling contamination-controlled data-analysis evaluation covering table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.

Hosted endpoints

Host    Input $/M  Output $/M  Context  Quant
Host D  $15.00     $75.00      200k     unknown
Host F  $15.00     $75.00      200k     unknown
Host C  $15.00     $75.00      200k     unknown
Anonymised third-party hosts. Sorted by lowest output price.

Effort variants

Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
