DeepSeek V3 (Thinking)
Pricing verified 1y ago · Same per-token price as the standard variant; reasoning mode emits many more output tokens per request.
Benchmarks
preference
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
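A minimal sketch of the Elo machinery behind such a ranking, assuming a standard logistic expected score and an illustrative K-factor of 16 (the leaderboard's actual parameters and vote pipeline are not specified here):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 16.0) -> tuple[float, float]:
    """Apply one pairwise human vote; the preferred model's rating rises."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}  # illustrative starting Elo
ratings["model_a"], ratings["model_b"] = update(ratings["model_a"], ratings["model_b"], a_won=True)
print(ratings)  # model_a nudged above 1000, model_b symmetrically below
```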
general
Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
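Because every answer is objective, grading needs no judge model; a sketch under the assumption of exact-match scoring and an unweighted mean over the seven categories:

```python
CATEGORIES = ["reasoning", "coding", "agentic coding", "mathematics",
              "data analysis", "language", "instruction following"]

def grade(prediction: str, truth: str) -> float:
    """Objective grading: normalise trivially, then exact-match compare."""
    return 1.0 if prediction.strip().lower() == truth.strip().lower() else 0.0

def category_score(results: list[tuple[str, str]]) -> float:
    """Fraction of (prediction, ground_truth) pairs graded correct."""
    return sum(grade(p, t) for p, t in results) / len(results)

def general_score(per_category: dict[str, list[tuple[str, str]]]) -> float:
    """Headline number: unweighted mean of the seven category scores."""
    return sum(category_score(per_category[c]) for c in CATEGORIES) / len(CATEGORIES)
```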
data analysis
Rolling contamination-controlled data-analysis evaluation covering table comprehension, CSV/spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.
coding
Real-world refactoring and bug-fix tasks across multiple programming languages, scored by whether the model produces a passing patch in Aider's edit format. Tests practical coding ability beyond single-file generation; harder than HumanEval and not yet saturated.
composite
Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale used for every other benchmark so it is directly weightable in scenarios.
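A compact sketch of the idea, assuming a two-parameter logistic (2PL) item-response model; the grid-search fit and all parameter values are illustrative, not Epoch AI's actual procedure:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response: success probability given latent ability theta,
    discrimination (slope) a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fit_ability(benchmarks: list[tuple[float, float, float]]) -> float:
    """Grid-search the theta that best explains observed benchmark scores.
    Each entry is (observed_fraction, a, b); a saturated benchmark (every
    model near 1.0) is nearly flat in theta, so it barely moves the fit."""
    best_theta, best_err = 0.0, float("inf")
    for step in range(-400, 401):
        theta = step / 100.0
        err = sum((obs - p_correct(theta, a, b)) ** 2 for obs, a, b in benchmarks)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta

def anchor(theta: float, theta_min: float, theta_max: float) -> float:
    """Min-max rescale so the composite is weightable like other benchmarks."""
    return (theta - theta_min) / (theta_max - theta_min)
```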
Hosted endpoints
| Host | Input $/M tokens | Output $/M tokens | Context | Quant |
|---|---|---|---|---|
| Host 36 | $0.20 | $0.77 | 164k | fp4 |
| Host 38 | $0.22 | $0.80 | 164k | fp4 |
| Host 37 | $0.22 | $0.88 | 131k | fp8 |
| Host 39 | $0.25 | $1.00 | 164k | fp8 |
| Host 30 | $0.27 | $1.12 | 164k | fp8 |
| Host 40 | $0.29 | $1.14 | 164k | fp8 |
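A hypothetical helper for reading the table: filter to endpoints that meet a context and quantisation requirement, then rank by a blended price (the 3:1 input:output blend is an assumption, not a site convention):

```python
ENDPOINTS = [
    # (host, input $/M tokens, output $/M tokens, context, quant)
    ("Host 36", 0.20, 0.77, 164_000, "fp4"),
    ("Host 38", 0.22, 0.80, 164_000, "fp4"),
    ("Host 37", 0.22, 0.88, 131_000, "fp8"),
    ("Host 39", 0.25, 1.00, 164_000, "fp8"),
    ("Host 30", 0.27, 1.12, 164_000, "fp8"),
    ("Host 40", 0.29, 1.14, 164_000, "fp8"),
]

def cheapest(min_context: int, quant: str | None = None) -> tuple:
    """Cheapest qualifying endpoint by 3:1 input:output blended $/M tokens."""
    ok = [e for e in ENDPOINTS
          if e[3] >= min_context and (quant is None or e[4] == quant)]
    return min(ok, key=lambda e: 0.75 * e[1] + 0.25 * e[2])

print(cheapest(150_000, quant="fp8"))  # -> Host 39 at $0.25 / $1.00
```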
Effort variants
Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.
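A back-of-envelope sketch of why that matters for spend; the token counts and the 5x output multiplier are illustrative assumptions, with Host 39's rates taken from the table above:

```python
def request_cost(input_tok: int, output_tok: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request at per-million-token prices."""
    return input_tok / 1e6 * in_price + output_tok / 1e6 * out_price

IN_P, OUT_P = 0.25, 1.00  # same per-token rates for both effort variants

standard = request_cost(2_000, 500, IN_P, OUT_P)    # terse answer
thinking = request_cost(2_000, 2_500, IN_P, OUT_P)  # ~5x output tokens
print(f"standard ${standard:.4f} vs thinking ${thinking:.4f}")
# standard $0.0010 vs thinking $0.0030 -- 3x the cost at identical prices
```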