Qwen2.5 72B Instruct
Pricing verified 1y ago · Median of hosted endpoints
Benchmarks
preference
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
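To make the Elo numbers concrete: under the standard logistic Elo model (base 10, scale 400), a rating gap translates directly into an expected preference rate. A minimal sketch:

```python
def elo_win_prob(elo_a: float, elo_b: float) -> float:
    """Probability that model A's response is preferred over model B's
    under the standard logistic Elo model (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

# A 100-point Elo lead corresponds to roughly a 64% preference rate.
print(round(elo_win_prob(1300, 1200), 2))  # → 0.64
```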
knowledge
A harder variant of MMLU testing knowledge across 57 academic subjects, designed to reduce the chance of scoring well by guessing.
reasoning
Graduate-level Google-proof Q&A in physics, chemistry, and biology. Diamond subset is the hardest tier with PhD-validated answers.
math
coding
164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
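Scores on unit-test benchmarks like this are commonly reported as pass@k, estimated with the standard unbiased estimator: generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k draws passes. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: of n generated samples, c pass all
    unit tests; estimate P(at least one of k random samples passes)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=20, c=5, k=1), 2))  # → 0.25
```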
instruction following
Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.
long context
Long-context retrieval and reasoning suite. We report the 128k token effective-context score.
performance
Throughput: median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
Time to first token: median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
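Both metrics fall out of the same streaming loop: time-to-first-token is the gap from request to the first chunk, and sustained speed is tokens generated divided by the time from first chunk to last. A minimal sketch, where `chunks` stands in for any streaming API response yielding lists of tokens:

```python
import time

def measure_stream(chunks):
    """Return (TTFT in ms, sustained tokens/s) for an iterable of
    streamed token chunks. `chunks` is a hypothetical stand-in for a
    first-party streaming API response."""
    start = time.monotonic()
    first = None
    n_tokens = 0
    for chunk in chunks:
        now = time.monotonic()
        if first is None:
            first = now  # first output chunk arrived
        n_tokens += len(chunk)
    end = time.monotonic()
    ttft_ms = (first - start) * 1000.0
    gen_s = end - first
    tok_per_s = n_tokens / gen_s if gen_s > 0 else float("inf")
    return ttft_ms, tok_per_s
```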
Providers
| Provider | Input $/M | Output $/M | Context | Quant |
|---|---|---|---|---|
| DeepInfra `deepinfra/fp8` | $0.36 | $0.40 | 33k | fp8 |
| Novita `novita/bf16` | $0.38 | $0.40 | 32k | bf16 |
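Per-million-token prices convert to a per-request cost by scaling each side of the exchange separately. A small sketch using the DeepInfra rates from the table above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one request given $/M-token input and output prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# 4k prompt tokens + 1k completion tokens at $0.36/M in, $0.40/M out.
cost = request_cost(4_000, 1_000, 0.36, 0.40)
print(f"${cost:.6f}")  # → $0.001840
```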