Gemini 2.5 Flash

Name: Gemini 2.5 Flash
Brand: Google
Price: 0.3 USD
Rating: 74 (10 reviews)

Closed

Google

Proprietary

text

vision

Gemini 2.5

Released 1y ago

Avg score

74.0 / 100

Context

1.0M

Output limit

66k

Input price

$0.30 /M

Output price

$2.50 /M

Pricing verified 17d ago
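As a rough illustration of how the listed per-million-token prices translate into the cost of one request, here is a minimal sketch. It assumes flat pricing with no tiering, caching discounts, or separate billing for reasoning tokens, none of which this page specifies; the function name and the token counts are hypothetical.

```python
# Listed prices in USD per million tokens (taken from the listing above).
INPUT_PRICE_PER_M = 0.30
OUTPUT_PRICE_PER_M = 2.50


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request, assuming flat per-token pricing."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: 200k input tokens and 4k output tokens.
cost = estimate_cost(200_000, 4_000)
print(f"${cost:.4f}")
```

Because output tokens cost roughly eight times as much as input tokens at these prices, long generations tend to dominate the bill even when the prompt is large.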

Benchmarks

preference

Chatbot Arena Elo (Fresh)

Elo

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
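To make "more frequently preferred" concrete, the sketch below applies the conventional Elo logistic curve (base 10, scale 400) to turn a rating gap into an expected preference rate. This is the textbook Elo formula, not necessarily the exact fitting procedure behind the leaderboard, and the function name is hypothetical.

```python
def expected_preference(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B under the
    conventional Elo curve (assumed here: base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Equal ratings mean a coin flip; a 400-point gap means roughly 10-to-1 odds.
print(expected_preference(1200, 1200))
print(expected_preference(1600, 1200))
```

Under this curve, only the difference between two ratings matters, so an Elo number is meaningful only relative to other models scored on the same leaderboard.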

performance

Output Speed

N/A tok/s

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token

N/A

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
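The two performance metrics above can be derived from chunk arrival timestamps on a streaming response. The sketch below is one plausible way to compute them, assuming "sustained output speed" means tokens emitted after the first chunk divided by the time between first and last chunk; the page does not state the exact definition, and the timings shown are hypothetical, not measurements of this model.

```python
from statistics import median


def summarize_run(request_time, chunk_times, chunk_tokens):
    """Return (time to first token in ms, output speed in tok/s) for one run.

    request_time: when the request was sent, in seconds.
    chunk_times:  arrival time of each streamed chunk, in seconds.
    chunk_tokens: number of tokens carried by each chunk.
    """
    ttft_ms = (chunk_times[0] - request_time) * 1000
    generation_seconds = chunk_times[-1] - chunk_times[0]
    tokens_after_first = sum(chunk_tokens[1:])
    if generation_seconds > 0:
        tok_per_s = tokens_after_first / generation_seconds
    else:
        tok_per_s = float("nan")  # single-chunk responses have no measurable rate
    return ttft_ms, tok_per_s


# Hypothetical timings for three runs of the same medium-length prompt.
runs = [
    (0.0, [0.5, 1.0, 1.5], [10, 20, 20]),
    (0.0, [0.25, 0.75, 1.25], [10, 30, 30]),
    (0.0, [1.0, 1.5, 2.0], [10, 10, 10]),
]
results = [summarize_run(*run) for run in runs]
median_ttft_ms = median(ttft for ttft, _ in results)
median_tok_s = median(speed for _, speed in results)
print(median_ttft_ms, median_tok_s)
```

Taking the median rather than the mean keeps a single slow run from skewing either figure.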

math

FrontierMath Tiers 1-3 (Fresh)

Mathematical research problems spanning analysis, algebra, combinatorics and number theory. Tiers 1-3 are progressively harder; even frontier reasoning models only solve a small fraction. The hardest publicly reported benchmark for general mathematical reasoning.

agentic

Terminal-Bench 2 (Fresh)

Long-horizon shell-and-filesystem tasks executed in a sandboxed terminal, scored by whether the agent's final state matches a target state. Tests practical tool-using ability for everyday devops and data-wrangling work; one of the hardest agentic benchmarks today.

composite

Frontier Composite (Fresh)

ECI

Saturation-resistant composite capability score stitched together from ~40 underlying benchmarks using Item Response Theory. Each benchmark is weighted by its fitted difficulty and discriminative slope, so doing well on hard, contamination-resistant evals (FrontierMath, ARC-AGI 2, Humanity's Last Exam) moves the score, while saturated benchmarks contribute almost nothing. Imported per-model from Epoch AI's published index; we anchor it to the same min-max scale we use for every other benchmark so that it is directly weightable in scenarios.
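The sketch below illustrates why a saturated benchmark adds little information under Item Response Theory, and what min-max anchoring looks like. It assumes a two-parameter logistic (2PL) item model and a 0-100 target scale; the page names only "difficulty" and "discriminative slope", so the exact model, the parameter values, and the function names here are all illustrative assumptions.

```python
import math


def irt_2pl(ability: float, difficulty: float, slope: float) -> float:
    """Expected score on a benchmark under an assumed 2PL item model."""
    return 1.0 / (1.0 + math.exp(-slope * (ability - difficulty)))


def min_max(value: float, lo: float, hi: float) -> float:
    """Anchor a raw value to a 0-100 scale given the observed min and max."""
    return 100.0 * (value - lo) / (hi - lo)


# Two hypothetical models with different latent ability.
weaker, stronger = 2.0, 3.0

# A saturated (easy) benchmark barely separates them...
gap_easy = irt_2pl(stronger, -3.0, 1.0) - irt_2pl(weaker, -3.0, 1.0)
# ...while a hard, steep benchmark separates them clearly.
gap_hard = irt_2pl(stronger, 2.5, 2.0) - irt_2pl(weaker, 2.5, 2.0)
print(round(gap_easy, 3), round(gap_hard, 3))

# Anchoring a hypothetical raw index value of 50 on an observed range of 0 to 200.
print(min_max(50, 0, 200))
```

When both models already score near the ceiling on an easy benchmark, the gap between them is close to zero, which is the sense in which saturated evals "contribute almost nothing" to the fitted ability.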

reliability

Output Stability

N/A / 100

How consistent the model's outputs are across repeated runs of the same task. Higher means lower variance and fewer occasional hallucinations under identical inputs. Useful for production loops that need reproducible behaviour.
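One simple way to put a number on run-to-run consistency is to repeat the same prompt and measure how often the output agrees with the most common answer. The sketch below does exactly that; it is an illustrative proxy rather than the upstream metric, whose definition this page does not give, and the sample outputs are hypothetical.

```python
from collections import Counter


def stability_score(outputs: list[str]) -> float:
    """Share of runs (0-100) that match the most frequent output."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return 100.0 * most_common_count / len(outputs)


# Hypothetical answers from four runs of the same prompt.
print(stability_score(["42", "42", "42", "41"]))
```

Exact string matching is strict; for free-form text, outputs would usually be normalised or compared semantically before counting agreement.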

Format Adherence

N/A / 100

How reliably the model produces output in the requested format (JSON schemas, markdown structures, exact-string responses). Pairs well with IFEval but reflects how the deployed API is behaving day to day rather than how a frozen test set scores.
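For the JSON case, a format-adherence check can be as simple as counting the responses that parse and contain the expected keys. The sketch below assumes that definition; the required keys and sample responses are hypothetical, and the upstream scoring may also validate types or full schemas.

```python
import json


def format_adherence(responses: list[str], required_keys: set[str]) -> float:
    """Share of responses (0-100) that are JSON objects with all required keys."""
    if not responses:
        raise ValueError("need at least one response")
    ok = 0
    for text in responses:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and required_keys.issubset(obj):
            ok += 1
    return 100.0 * ok / len(responses)


# Hypothetical model responses to a prompt asking for {"name": ..., "score": ...}.
samples = [
    '{"name": "a", "score": 1}',  # valid
    "not json",                   # fails to parse
    '{"name": "b"}',              # missing "score"
    "[1, 2]",                     # valid JSON but not an object
]
print(format_adherence(samples, {"name", "score"}))
```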

Recovery Rate

N/A / 100

How often the model self-corrects after producing an incorrect intermediate step (reported upstream as the debugging axis). Critical for agentic loops that depend on the model noticing and repairing its own mistakes rather than barrelling forward.

Safety Handling

N/A / 100

How well the model handles safety-sensitive prompts without false-refusing benign requests or producing unsafe output. The upstream signal does not separate refusal counts from substantive content-safety behaviour, so this single axis covers both.

Hosted endpoints

Host	Input $/M	Output $/M	Context	Quant
Host M	$0.30	$2.50	1.0M	unknown
Host G	$0.30	$2.50	1.0M	unknown
Host L	$0.30	$2.50	1.0M	unknown
Host E	$0.30	$2.50	1.0M	unknown

Anonymised third-party hosts. Sorted by lowest output price.
