Qwen3 235B (Thinking)
Pricing verified 1y ago · Same per-token price as the standard variant; thinking mode emits substantially more output tokens per request.
Benchmarks
general
Contamination-controlled average across seven rolling task categories (reasoning, coding, agentic coding, mathematics, data analysis, language, instruction following). Questions are rotated every six months and ground-truth answers are objective, removing the need for LLM-as-judge scoring.
data analysis
Rolling, contamination-controlled data-analysis evaluation covering table comprehension, CSV / spreadsheet reasoning, SQL-style joins, and chart interpretation. Refreshed every six months with new tables and questions to minimise contamination.
math
Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models solve only a small fraction. It is the hardest publicly reported benchmark for general mathematical reasoning.
AIME-style competition problems written specifically for the OTIS mock contest, then run as an evaluation by Epoch AI. The problems are close in spirit to the public AIME but novel, so they are unlikely to appear in training data.
knowledge
A human-validated factuality benchmark of short factual questions whose answers can be checked against a single ground truth. Penalises hallucinations by scoring confidently-wrong answers below abstentions.
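A minimal sketch of how abstention-aware scoring of this kind can work. The exact rubric is not given here, so the point values (+1 correct, 0 abstain, negative for wrong), the `wrong_penalty` parameter, and the function names are all illustrative assumptions rather than the benchmark's actual scheme.

```python
from typing import Optional, Sequence


def score_answer(predicted: Optional[str], truth: str, wrong_penalty: float = 1.0) -> float:
    """Score one short-answer question.

    Assumed rubric (hypothetical): +1 for a correct answer, 0 for an
    abstention (predicted is None), and -wrong_penalty for a wrong answer,
    so a confidently wrong answer always scores below an abstention.
    """
    if predicted is None:  # the model declined to answer
        return 0.0
    if predicted.strip().lower() == truth.strip().lower():
        return 1.0
    return -wrong_penalty


def mean_score(predictions: Sequence[Optional[str]], truths: Sequence[str],
               wrong_penalty: float = 1.0) -> float:
    """Average the per-question scores over a set of questions."""
    scores = [score_answer(p, t, wrong_penalty) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)
```

Under this rubric, a model that answers one question correctly, abstains on one, and gets one wrong averages zero, while a model that abstains on the question it would have got wrong scores higher.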
composite
Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score and saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark so it's directly weightable in scenarios.
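To make the weighting idea concrete, here is a rough sketch using a two-parameter logistic (2PL) IRT model. The 2PL form, the use of Fisher information as a stand-in for "how much a benchmark moves the score", and all function names are assumptions for illustration; the published index may use a different formulation.

```python
import math


def p_correct(ability: float, difficulty: float, slope: float) -> float:
    """2PL item response function: probability of success on a benchmark item."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))


def item_information(ability: float, difficulty: float, slope: float) -> float:
    """Fisher information of a 2PL item at a given ability level.

    It peaks when difficulty matches ability and shrinks towards zero when the
    item is far too easy (saturated) or far too hard for the model.
    """
    p = p_correct(ability, difficulty, slope)
    return slope ** 2 * p * (1.0 - p)


def min_max(value: float, lo: float, hi: float) -> float:
    """Rescale a raw score onto a 0-1 scale given the observed min and max."""
    return (value - lo) / (hi - lo)


# Hypothetical numbers: a strong model (ability 2.0) on a saturated benchmark
# (difficulty -3.0) versus a hard one (difficulty 2.0), both with slope 1.5.
saturated = item_information(2.0, -3.0, 1.5)
hard = item_information(2.0, 2.0, 1.5)
```

With these made-up parameters the saturated benchmark carries almost no information about the model's ability, while the hard benchmark carries the maximum available for its slope, which mirrors the behaviour described above.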
Reliability monitor
Hosted endpoints
| Host | Input $/M | Output $/M | Context | Quant |
|---|---|---|---|---|
| Host 41 | $0.15 | $1.50 | 131k | unknown |
| Host 29 | $0.23 | $2.30 | 262k | fp8 |
| Host 37 | $0.28 | $2.30 | 128k | fp8 |
| Host 30 | $0.30 | $3.00 | 131k | fp8 |
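As a rough illustration of how these per-million-token prices translate into per-request cost, the sketch below uses the prices from the table. The token counts (2,000 input tokens, 500 versus 8,000 output tokens) are made-up values chosen only to show the effect of a longer thinking-mode output, and the helper name `request_cost` is hypothetical.

```python
# (input $/M tokens, output $/M tokens), copied from the table above.
HOSTS = {
    "Host 41": (0.15, 1.50),
    "Host 29": (0.23, 2.30),
    "Host 37": (0.28, 2.30),
    "Host 30": (0.30, 3.00),
}


def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars of one request, given prices per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: same prompt, short answer vs. long thinking trace.
INPUT_TOKENS = 2_000
SHORT_OUTPUT = 500
LONG_OUTPUT = 8_000

for host, (in_price, out_price) in HOSTS.items():
    short = request_cost(INPUT_TOKENS, SHORT_OUTPUT, in_price, out_price)
    long_ = request_cost(INPUT_TOKENS, LONG_OUTPUT, in_price, out_price)
    print(f"{host}: ${short:.5f} (short output) vs ${long_:.5f} (long output)")
```

Because output tokens are priced about ten times higher than input tokens on every host listed, output length dominates the bill even though the per-token price is unchanged.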
Effort variants
Same API model, different reasoning budget. Thinking / xHigh modes usually score better on reasoning benchmarks but emit many more output tokens per request.