Llama 3.3 70B Instruct
Pricing verified 1 year ago
Benchmarks
preference
Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
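For intuition, an Elo gap maps to an expected head-to-head preference rate through the standard logistic Elo formula. A minimal sketch (the function name is illustrative, not taken from any leaderboard's code):

```python
def expected_preference_rate(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score: the probability that model A's
    response is preferred over model B's in a pairwise comparison."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# A 100-point Elo lead corresponds to roughly a 64% preference rate.
print(f"{expected_preference_rate(1300, 1200):.2f}")  # 0.64
```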
knowledge
Harder variant of MMLU testing knowledge across a broad range of academic subjects; questions are filtered and given more answer options so correct answers are harder to guess.
reasoning
Graduate-level Google-proof Q&A in physics, chemistry, and biology. Diamond subset is the hardest tier with PhD-validated answers.
math
coding
164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
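Scores of this kind are typically reported as pass@k, using the unbiased estimator from the original HumanEval paper (n samples per problem, c of which pass the tests). A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples that pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=20, c=5, k=1))  # 0.25 -- 5 of 20 samples passed
```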
instruction following
Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.
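"Verifiable" means each directive can be checked mechanically rather than judged by a model. A toy checker for one made-up directive, assuming nothing about the benchmark's actual harness:

```python
def follows_bullet_directive(response: str, n_bullets: int = 3) -> bool:
    """Toy verifier for 'answer in exactly three bullet points'.
    Illustrative only; not the benchmark's real checking code."""
    bullets = [ln for ln in response.splitlines()
               if ln.lstrip().startswith(("-", "*"))]
    return len(bullets) == n_bullets

print(follows_bullet_directive("- a\n- b\n- c"))  # True
```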
long context
Long-context retrieval and reasoning suite. We report the 128k token effective-context score.
performance
throughput
Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
time to first token
Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
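Both metrics can be approximated against any OpenAI-compatible streaming endpoint, which many of the providers below expose. A rough sketch; the base_url and model id are placeholders, not any listed provider's actual values:

```python
import time
from openai import OpenAI  # pip install openai

# Assumes an OpenAI-compatible endpoint; URL and model id are placeholders.
client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="...")

start = time.monotonic()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Summarise the history of Unix."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first output chunk arrives
        n_chunks += 1
elapsed = time.monotonic() - start

print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
# Chunks are a rough proxy for tokens; real harnesses count with the tokenizer.
print(f"~{n_chunks / (elapsed - (first_token_at - start)):.1f} chunks/s sustained")
```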
Providers
| Provider | Endpoint | Input $/M | Output $/M | Context | Quant |
|---|---|---|---|---|---|
| DeepInfra | deepinfra/turbo | $0.10 | $0.32 | 131k | fp8 |
| Inceptron | inceptron/fp8 | $0.12 | $0.38 | 131k | fp8 |
| Nebius | nebius/fp8 | $0.13 | $0.40 | 131k | fp8 |
| AkashML | akashml/fp8 | $0.13 | $0.40 | 131k | fp8 |
| Novita | novita/bf16 | $0.14 | $0.40 | 131k | bf16 |
| Parasail | parasail/int8 | $0.22 | $0.50 | 131k | int8 |
| Friendli | friendli | $0.60 | $0.60 | 131k | unknown |
| WandB | wandb/fp16 | $0.71 | $0.71 | 128k | fp16 |
| Google | google-vertex | $0.72 | $0.72 | 128k | unknown |
| Groq | groq | $0.59 | $0.79 | 131k | unknown |
| Together | together/fp8 | $0.88 | $0.88 | 131k | fp8 |
| SambaNova | sambanova-turbo | $0.45 | $0.90 | 16k | bf16 |
| SambaNova | sambanova/bf16 | $0.60 | $1.20 | 131k | bf16 |
| Cloudflare | cloudflare/fp8 | $0.29 | $2.25 | 24k | fp8 |
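All prices are per million tokens, so a request's cost is a simple weighted sum. A worked example using the DeepInfra row:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in dollars, given per-million-token prices from the table."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# DeepInfra row: $0.10 in / $0.32 out per million tokens.
# 10k prompt tokens + 1k completion tokens => $0.001 + $0.00032 = $0.00132
print(f"${request_cost(10_000, 1_000, 0.10, 0.32):.5f}")  # $0.00132
```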