DeepSeek V3 (Thinking)

think · Open source · DeepSeek · Open license · text
DeepSeek V3 · Released 1y ago
Avg score: 79.5 / 100
Context: 128k
Output limit: 8k
Input price: $0.27 /M
Output price: $1.10 /M

Pricing verified 1y ago · Same per-token price as the standard variant; reasoning mode emits many more output tokens per request.
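
To make the pricing note concrete, here is a minimal sketch of per-request cost at the listed rates ($0.27 per million input tokens, $1.10 per million output tokens). The token counts are hypothetical and chosen only to illustrate how a longer reasoning output dominates the bill; they are not measurements of this model.

```python
# Listed prices from this page, in USD per million tokens.
INPUT_PRICE_PER_M = 0.27
OUTPUT_PRICE_PER_M = 1.10


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-token prices."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# Hypothetical token counts: same prompt, short answer vs. long reasoning trace.
standard_cost = request_cost(input_tokens=2_000, output_tokens=500)
thinking_cost = request_cost(input_tokens=2_000, output_tokens=4_000)

print(f"standard: ${standard_cost:.5f}")
print(f"thinking: ${thinking_cost:.5f}")
```

Because the per-token price is identical, any cost difference between the two variants comes entirely from the number of output tokens emitted.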

Benchmarks

preference

Chatbot Arena Elo (Fresh)
Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
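
As a rough guide to reading Elo gaps, the sketch below uses the classic Elo logistic formula with a 400-point scale. This is an assumption: the page does not say how the ratings are fitted, and the actual leaderboard may use a different statistical model. The ratings in the example are hypothetical.

```python
def expected_preference_rate(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B under classic Elo.

    Assumes the standard logistic curve with a 400-point scale; the
    leaderboard's own fitting procedure is not described on this page.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Hypothetical ratings: a 100-point lead implies roughly a 64% preference rate.
print(round(expected_preference_rate(1400, 1300), 3))
```

Under this formula, equal ratings correspond to a 50% preference rate, and the rate depends only on the rating difference, not on the absolute values.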

general

Rolling Contamination-Controlled Average (Fresh)
/100

Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.

data analysis

Rolling Data Analysis (Fresh)
/100

Rolling contamination-controlled data-analysis evaluation covering table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.

coding

Aider Polyglot (Fresh)
%

Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.

composite

Frontier Composite (Fresh)
ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark so it is directly weightable in scenarios.
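
The idea of weighting by difficulty and slope can be illustrated with a two-parameter logistic (2PL) item response curve, followed by a min-max rescale. Both are assumptions for illustration: the description above names Item Response Theory but not the exact model, and the 0 to 100 target range and all numeric values below are hypothetical.

```python
import math


def solve_probability(ability: float, difficulty: float, slope: float) -> float:
    """2PL item response curve: chance a model of given ability passes a benchmark."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))


def min_max_scale(value: float, lowest: float, highest: float) -> float:
    """Map a raw score onto a 0-100 scale given the observed range."""
    return 100.0 * (value - lowest) / (highest - lowest)


# Two hypothetical strong models, compared on an easy and a hard benchmark.
weaker, stronger = 2.0, 3.0
easy_gap = solve_probability(stronger, -2.0, 1.0) - solve_probability(weaker, -2.0, 1.0)
hard_gap = solve_probability(stronger, 2.5, 1.0) - solve_probability(weaker, 2.5, 1.0)

print(f"gap on saturated benchmark: {easy_gap:.3f}")
print(f"gap on hard benchmark:      {hard_gap:.3f}")
print(min_max_scale(130.0, lowest=100.0, highest=150.0))
```

On the easy benchmark both models are near the ceiling, so the gap between them is small; the hard benchmark separates them clearly, which is why it carries more information about ability.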

Hosted endpoints

Host     Input $/M  Output $/M  Context  Quant
Host 36  $0.20      $0.77       164k     fp4
Host 38  $0.22      $0.80       164k     fp4
Host 37  $0.22      $0.88       131k     fp8
Host 39  $0.25      $1.00       164k     fp8
Host 30  $0.27      $1.12       164k     fp8
Host 40  $0.29      $1.14       164k     fp8
Anonymised third-party hosts. Sorted by lowest output price.
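
The ordering can be reproduced with a short sketch that loads the listed rows and sorts by output price. The `Endpoint` type and its field names are hypothetical, introduced only for this example; the values are copied from the list above.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    host: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    context: str
    quant: str


endpoints = [
    Endpoint("Host 36", 0.20, 0.77, "164k", "fp4"),
    Endpoint("Host 38", 0.22, 0.80, "164k", "fp4"),
    Endpoint("Host 37", 0.22, 0.88, "131k", "fp8"),
    Endpoint("Host 39", 0.25, 1.00, "164k", "fp8"),
    Endpoint("Host 30", 0.27, 1.12, "164k", "fp8"),
    Endpoint("Host 40", 0.29, 1.14, "164k", "fp8"),
]

# Sort by lowest output price, matching the ordering used on this page.
by_output_price = sorted(endpoints, key=lambda e: e.output_price)

# Example filter: cheapest endpoint that serves fp8 rather than fp4.
cheapest_fp8 = min(
    (e for e in endpoints if e.quant == "fp8"), key=lambda e: e.output_price
)

print(by_output_price[0].host)
print(cheapest_fp8.host)
```

Filtering by quantisation before comparing prices is worth considering, since the two lowest-priced hosts in the list serve fp4 while the rest serve fp8.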

Effort variants

Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
