Claude Sonnet 4 (Thinking)
Pricing verified 11mo ago · Same per-token price as the standard variant, but reasoning mode typically emits 3-5x more output tokens per request.
Benchmarks
coding
Continuously refreshed coding benchmark that draws problems from LeetCode, AtCoder, and Codeforces; the rolling refresh reduces benchmark contamination.
performance
Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they generate reasoning tokens before the first visible output.
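The two performance metrics above can be combined into a rough end-to-end estimate: time to first chunk plus streaming time for the remaining output. The sketch below is only an illustration. The function name and the example figures are hypothetical, not measurements of this model, and it assumes the output streams at a steady rate once it starts.

```python
def estimated_response_seconds(ttft_ms: float,
                               tokens_per_second: float,
                               output_tokens: int) -> float:
    """Rough wall-clock estimate: wait for the first chunk, then stream the rest.

    Assumes a constant streaming rate, which real endpoints only approximate.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_ms / 1000.0 + output_tokens / tokens_per_second


# Hypothetical inputs chosen for easy arithmetic, not benchmark results.
example_seconds = estimated_response_seconds(
    ttft_ms=1500, tokens_per_second=50, output_tokens=500
)
print(f"{example_seconds:.1f} s")
```

For long answers the throughput term dominates; for short answers the time to first chunk does, which is where reasoning variants tend to lose ground.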
general
Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
data analysis
Rolling, contamination-controlled data-analysis evaluation covering table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Tables and questions are refreshed every six months to minimise contamination.
Hosted endpoints
| Host | Input ($/M tokens) | Output ($/M tokens) | Context (tokens) | Quant |
|---|---|---|---|---|
| Host E | $3.00 | $15.00 | 1.0M | unknown |
| Host C | $3.00 | $15.00 | 200k | unknown |
| Host F | $3.00 | $15.00 | 1.0M | unknown |
| Host G | $3.00 | $15.00 | 1.0M | unknown |
| Host D | $3.00 | $15.00 | 1.0M | unknown |
Effort variants
Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
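To see what the extra output tokens mean for cost, here is a minimal sketch using the listed prices ($3.00 input and $15.00 output per million tokens) and the 3-5x output multiplier quoted above. The function name and the token counts are hypothetical, and the sketch assumes reasoning tokens are billed at the ordinary output rate.

```python
INPUT_USD_PER_M = 3.00    # listed input price per million tokens
OUTPUT_USD_PER_M = 15.00  # listed output price per million tokens


def request_cost_usd(input_tokens: int,
                     output_tokens: int,
                     output_multiplier: float = 1.0) -> float:
    """Estimate the cost of one request in USD.

    output_multiplier scales the output token count to model the extra
    tokens a reasoning mode emits (assumed billed as ordinary output).
    """
    billed_output = output_tokens * output_multiplier
    total = input_tokens * INPUT_USD_PER_M + billed_output * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical request: 2,000 input tokens, 500 output tokens in standard mode.
standard = request_cost_usd(2_000, 500)
thinking_low = request_cost_usd(2_000, 500, output_multiplier=3.0)
thinking_high = request_cost_usd(2_000, 500, output_multiplier=5.0)

print(f"standard: ${standard:.4f}")
print(f"thinking: ${thinking_low:.4f} to ${thinking_high:.4f}")
```

Because only the output side scales, the overall cost increase is smaller than 3-5x whenever input tokens make up a meaningful share of the bill.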