Claude Opus 4
Pricing verified 11mo ago · source
Benchmarks
preference
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
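As a rough illustration of what an Elo gap means, the sketch below converts two ratings into an expected preference rate. It assumes the standard Elo logistic curve (base 10, scale 400); the leaderboard behind this score may fit ratings differently (for example with a Bradley-Terry model), and the function name and ratings are hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of pairwise votes won by A over B.

    Assumption: standard Elo logistic curve with base 10 and scale 400.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Hypothetical ratings, not values taken from this page.
p = elo_expected_score(1300, 1200)
print(f"Expected preference rate for the higher-rated model: {p:.2f}")  # roughly 0.64
```

Under that assumption, a 100-point lead corresponds to being preferred in about 64% of head-to-head comparisons, and equal ratings correspond to 50%.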
math
Problems from the 2024 American Invitational Mathematics Examination (AIME). Each answer is an integer from 000 to 999; very hard for non-reasoning models.
coding
164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
agentic
Real GitHub issues solved end-to-end. The Verified subset is a 500-task, human-validated slice of SWE-bench.
vision
long context
Long-context retrieval and reasoning suite. We report the 128k token effective-context score.
performance
Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
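The two performance metrics can be combined into a back-of-the-envelope estimate of total response time. This is a minimal sketch that assumes output streams at a constant rate equal to the median speed once the first chunk arrives; the function name and the sample numbers are hypothetical and are not measurements for this model.

```python
def estimate_response_seconds(
    ttft_ms: float, tokens_per_second: float, output_tokens: int
) -> float:
    """Estimate wall-clock time for a full response.

    Assumption: constant output speed after the first chunk.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000.0 + output_tokens / tokens_per_second


# Hypothetical inputs: 1500 ms to first chunk, 50 tokens/s, 500 output tokens.
print(estimate_response_seconds(1500, 50, 500))  # 1.5 s wait + 10 s streaming = 11.5
```

For short answers the time to first chunk dominates; for long answers the output speed does.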
general
Contamination-free average across LiveBench's seven task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
data analysis
Data-analysis subcategory of LiveBench. Table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.
Providers
| Provider | Input ($/M tokens) | Output ($/M tokens) | Context (tokens) | Quantization |
|---|---|---|---|---|
| Google google-vertex | $15.00 | $75.00 | 200k | unknown |
| Anthropic anthropic | $15.00 | $75.00 | 200k | unknown |
| Amazon Bedrock amazon-bedrock | $15.00 | $75.00 | 200k | unknown |
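To make the listed prices concrete, the sketch below estimates the cost of a single request. It assumes the table's prices are USD per million tokens and that billing is linear, with no caching or batch discounts; the function name and token counts are hypothetical.

```python
# Prices from the table above, assumed to be USD per 1,000,000 tokens.
INPUT_USD_PER_M = 15.00
OUTPUT_USD_PER_M = 75.00


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Linear cost estimate for one request (assumes no discounts apply)."""
    return (
        input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000


# Hypothetical request: 10,000 input tokens and 2,000 output tokens.
print(f"${request_cost_usd(10_000, 2_000):.2f}")  # $0.15 input + $0.15 output = $0.30
```

Because output tokens are priced at five times the input rate, response length tends to drive cost more than prompt length.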
Effort variants
Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.