Claude Sonnet 4 (Thinking)

Name: Claude Sonnet 4 (Thinking)
Brand: Anthropic
Price: 3 USD
Rating: 66.9 (5 reviews)

think

Closed

Anthropic

Proprietary

text

vision

Claude 4Released 1y ago

Avg score

66.9

/ 100

Context

200k

Output limit

64k

Input price

$3.00 /M

Output price

$15.00 /M

Pricing verified 11mo ago · Same per-token price as the standard variant, but reasoning mode typically emits 3-5x more output tokens per request.

Benchmarks

coding

LiveCodeBenchFresh

% pass@1

Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.

performance

Output SpeedN/A

tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First TokenN/A

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.

general

Rolling Contamination-Controlled AverageFresh

/100

Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.

data analysis

Rolling Data AnalysisFresh

/100

Rolling contamination-controlled data-analysis evaluation. Table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.

Hosted endpoints

Host	Input $/M	Output $/M	Context	Quant
Host E	$3.00	$15.00	1.0M	unknown
Host C	$3.00	$15.00	200k	unknown
Host F	$3.00	$15.00	1.0M	unknown
Host C	$3.00	$15.00	200k	unknown
Host G	$3.00	$15.00	1.0M	unknown
Host D	$3.00	$15.00	1.0M	unknown

Anonymised third-party hosts. Sorted by lowest output price.

Effort variants

Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.

Claude Sonnet 4

Compare with...

vs GPT-4o vs GPT-4o mini vs o1 vs o1-mini vs o3 vs o3-mini vs GPT-4 Turbo vs Claude 3.5 Sonnet vs Claude 3.5 Haiku vs Claude 3 Opus