
About

Benchmark Explorer is a free, ad-free, open-data tool for comparing AI models. There is no backend, no database, no tracking. The whole site is a static bundle that recomputes everything in your browser.

Methodology

For each benchmark, raw scores are min-max normalised to a 0–100 scale across the participating models. This makes scores from very different benchmarks (Elo points, % pass rate, % accuracy) directly comparable.
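A rough sketch of that step in Python (the function name and data shapes are illustrative assumptions, not the site's actual code):

```python
def normalise(raw_scores: dict[str, float]) -> dict[str, float]:
    """Map each model's raw benchmark score onto a 0-100 scale via min-max scaling."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    span = (hi - lo) or 1.0  # guard against a degenerate benchmark where every model ties
    return {model: 100.0 * (score - lo) / span for model, score in raw_scores.items()}
```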

Scenario scores are a weighted average of normalised benchmark scores. If a model is missing data for some benchmarks, the weights are renormalised over what's available — we never penalise a model just because the upstream hasn't reported a score.
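A sketch of that renormalisation, under the same illustrative assumptions:

```python
def scenario_score(norm_scores: dict[str, float], weights: dict[str, float]) -> float | None:
    """Weighted average over only the benchmarks this model actually has scores for."""
    available = {bench: w for bench, w in weights.items() if bench in norm_scores}
    if not available:
        return None  # model has no data at all for this scenario
    total_weight = sum(available.values())
    return sum(norm_scores[bench] * w for bench, w in available.items()) / total_weight
```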

Cost is folded in via a separate cost-vs-quality slider. Cost is converted to a 0–100 score on a log scale (because pricing spans 4 orders of magnitude) and combined with the quality score to produce a composite ranking.
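One plausible reading of that combination, again as an illustrative sketch rather than the exact formula the site uses (the cheaper-is-higher mapping and the linear blend are assumptions):

```python
import math

def cost_scores(prices: dict[str, float]) -> dict[str, float]:
    """Map per-model prices onto 0-100 on a log scale; cheapest model scores 100."""
    logs = {model: math.log10(price) for model, price in prices.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0
    return {model: 100.0 * (hi - v) / span for model, v in logs.items()}

def composite(quality: float, cost: float, cost_weight: float) -> float:
    """cost_weight is the slider position in [0, 1]; 0 ranks on quality alone."""
    return (1.0 - cost_weight) * quality + cost_weight * cost
```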

Data sources

All data is ingested nightly via GitHub Actions. Each source has a fallback dataset baked into the repo so the site keeps rendering even when an upstream is down.

endpoints: ok, 0 rows
https://openrouter.ai/api/v1/models
Skipped 9 models without an openrouterId.

hf-open-llm: ok, 16 rows
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
Skipped 17972 unrecognised model rows.

image-arenas: stale, 18 rows
https://artificialanalysis.ai/text-to-image/arena
Used fallback: no stable upstream endpoint.

livecodebench: ok, 15 rows
https://livecodebench.github.io/leaderboard.html
Skipped 13 unrecognised models.

lmsys-arena: ok, 36 rows
https://huggingface.co/datasets/lmarena-ai/leaderboard-dataset
Skipped 318 unrecognised models.

speed: ok, 56 rows
https://artificialanalysis.ai/
Skipped 10 models without an aaSlug.

swe-bench: stale, 13 rows
https://www.swebench.com/
Used fallback: upstream response failed to parse (Expecting ',' delimiter: line 1 column 297 (char 296)).

vision-and-extras: stale, 120 rows
https://github.com/<this-repo>/blob/main/scripts/ingest/ingest_vision_benchmarks.py
Hand-curated fallback dataset.
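
A minimal sketch of the fetch-with-fallback behaviour described above; the function and file names here are assumptions, and the real scripts under scripts/ingest/ will differ:

```python
import json
from pathlib import Path

def load_source(fetch, fallback_path: Path) -> tuple[list, str]:
    """Try the live upstream; on any failure, use the dataset baked into the repo."""
    try:
        return fetch(), "ok"      # upstream responded; rows come from the live source
    except Exception:
        rows = json.loads(fallback_path.read_text())
        return rows, "stale"      # fallback used; shown as "stale" in the list above
```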

Tracked benchmarks

LMArena Elo
human preference

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.

MMLU Pro
knowledge

Harder successor to MMLU's 57-subject knowledge test; each question has ten answer options instead of four, so guessing pays off less.

GPQA Diamond
reasoning

Graduate-level, Google-proof Q&A in physics, chemistry, and biology. The Diamond subset is the hardest tier, with PhD-validated answers.

MATH-500
math

500 high-school competition math problems requiring multi-step solutions. Scored on final-answer correctness.

AIME 2024
math

American Invitational Mathematics Examination 2024 problems. Answers are integers from 0 to 999; very hard for non-reasoning models.

HumanEval
coding

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.

LiveCodeBench
coding

Continuously refreshed coding benchmark drawing problems from LeetCode, AtCoder, and Codeforces; regular refreshes reduce benchmark contamination.

SWE-bench Verified
coding

Real GitHub issues solved end-to-end. The Verified subset is a 500-task human-validated slice of SWE-bench.

IFEval
instruction following

Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.

MMMU
vision

Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.

MathVista
vision

Math reasoning over visual contexts (charts, figures, geometry).

RULER 128k
long context

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

Image Arena Elo
image

Crowdsourced pairwise human preference for image generation models (artificialanalysis.ai / lmarena image arena).

Prompt Adherence
image

How well the generated image matches the textual prompt, as evaluated by human raters.

Output Speed
performance

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token
performance

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.

By the numbers

  • 38 models
  • 16 benchmarks
  • 8 scenarios
  • 38 pricing entries

Contribute

Pricing changes? Want to add a model? File a PR against data/manual/pricing.yaml or data/manual/seed_models.yaml. CI runs the same nightly ingestion against your branch.

Found a wrong benchmark score? Most scores come from upstream leaderboards we trust; if one looks off, file an issue and we'll double-check the source.

The whole stack is open: a Next.js static export plus a small Python ingestion pipeline. Run it locally with python scripts/run_all.py.

Disclaimers

  • Benchmark scores are summary statistics. They don't predict how a model will do on your task.
  • Pricing is best-effort and changes constantly. Always confirm on the provider's page before relying on a number.
  • Pricing for open-source models reflects the median across major hosted endpoints (Together, Fireworks, Replicate). Your self-hosted cost will differ.
  • We don't run any of these models ourselves and have no commercial relationship with any provider.