
About

Benchmark Explorer is a free, ad-free, open-data tool for comparing AI models. There is no backend, no database, no tracking. The whole site is a static bundle that recomputes everything in your browser.

Methodology

For each benchmark, raw scores are min-max normalised to a 0–100 scale across the participating models. This makes scores from very different benchmarks (Elo points, % pass rate, % accuracy) directly comparable.
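A rough sketch of that step in Python (the function name and data shapes are illustrative assumptions, not the site's actual code):

```python
def normalise(raw_scores: dict[str, float]) -> dict[str, float]:
    """Map each model's raw benchmark score onto a 0-100 scale via min-max scaling."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    span = (hi - lo) or 1.0  # guard against a degenerate benchmark where every model ties
    return {model: 100.0 * (score - lo) / span for model, score in raw_scores.items()}
```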

Scenario scores are a weighted average of normalised benchmark scores. If a model is missing data for some benchmarks, the weights are renormalised over what's available — we never penalise a model just because the upstream hasn't reported a score.
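A sketch of that renormalisation, under the same illustrative assumptions:

```python
def scenario_score(norm_scores: dict[str, float], weights: dict[str, float]) -> float | None:
    """Weighted average over only the benchmarks this model actually has scores for."""
    available = {bench: w for bench, w in weights.items() if bench in norm_scores}
    if not available:
        return None  # model has no data at all for this scenario
    total_weight = sum(available.values())
    return sum(norm_scores[bench] * w for bench, w in available.items()) / total_weight
```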

Cost is folded in via a separate cost-vs-quality slider. Cost is converted to a 0–100 score on a log scale (because pricing spans 4 orders of magnitude) and combined with the quality score to produce a composite ranking.
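One plausible reading of that combination, again as an illustrative sketch rather than the exact formula the site uses (the cheaper-is-higher mapping and the linear blend are assumptions):

```python
import math

def cost_scores(prices: dict[str, float]) -> dict[str, float]:
    """Map per-model prices onto 0-100 on a log scale; cheapest model scores 100."""
    logs = {model: math.log10(price) for model, price in prices.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0
    return {model: 100.0 * (hi - v) / span for model, v in logs.items()}

def composite(quality: float, cost: float, cost_weight: float) -> float:
    """cost_weight is the slider position in [0, 1]; 0 ranks on quality alone."""
    return (1.0 - cost_weight) * quality + cost_weight * cost
```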

Data sources

All data is ingested nightly via GitHub Actions. Each source has a fallback dataset baked into the repo so the site keeps rendering even when an upstream is down.

endpoints: ok, 0 rows
https://openrouter.ai/api/v1/models
Skipped 9 models without an openrouterId.

hf-open-llm: ok, 16 rows
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
Skipped 17972 unrecognised model rows.

image-arenas: stale, 18 rows
https://artificialanalysis.ai/text-to-image/arena
Used fallback: no stable upstream endpoint.

livecodebench: ok, 15 rows
https://livecodebench.github.io/leaderboard.html
Skipped 13 unrecognised models.

lmsys-arena: ok, 36 rows
https://huggingface.co/datasets/lmarena-ai/leaderboard-dataset
Skipped 318 unrecognised models.

speed: ok, 56 rows
https://artificialanalysis.ai/
Skipped 10 models without an aaSlug.

swe-bench: stale, 13 rows
https://www.swebench.com/
Used fallback: upstream response failed to parse (Expecting ',' delimiter: line 1 column 297 (char 296)).

vision-and-extras: stale, 120 rows
https://github.com/<this-repo>/blob/main/scripts/ingest/ingest_vision_benchmarks.py
Hand-curated fallback dataset.
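
A minimal sketch of the fetch-with-fallback behaviour described above; the function and file names here are assumptions, and the real scripts under scripts/ingest/ will differ:

```python
import json
from pathlib import Path

def load_source(fetch, fallback_path: Path) -> tuple[list, str]:
    """Try the live upstream; on any failure, use the dataset baked into the repo."""
    try:
        return fetch(), "ok"      # upstream responded; rows come from the live source
    except Exception:
        rows = json.loads(fallback_path.read_text())
        return rows, "stale"      # fallback used; shown as "stale" in the list above
```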

Tracked benchmarks

LMArena Elo
human preference

Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.

MMLU Pro
knowledge

Harder successor to MMLU's 57-subject knowledge test; each question has ten answer options instead of four, so guessing pays off less.

GPQA Diamond
reasoning

Graduate-level, Google-proof Q&A in physics, chemistry, and biology. The Diamond subset is the hardest tier, with PhD-validated answers.

MATH-500
math

500 high-school competition math problems requiring multi-step solutions. Scored on final-answer correctness.

AIME 2024
math

American Invitational Mathematics Examination 2024 problems. Answers are integers from 0 to 999; very hard for non-reasoning models.

HumanEval
coding

164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.

LiveCodeBench
coding

Continuously refreshed coding benchmark drawing problems from LeetCode, AtCoder, and Codeforces; regular refreshes reduce benchmark contamination.

SWE-bench Verified
coding

Real GitHub issues solved end-to-end. The Verified subset is a 500-task human-validated slice of SWE-bench.

IFEval
instruction following

Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.

MMMU
vision

Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.

MathVista
vision

Math reasoning over visual contexts (charts, figures, geometry).

RULER 128k
long context

Long-context retrieval and reasoning suite. We report the 128k token effective-context score.

Image Arena Elo
image

Crowdsourced pairwise human preference for image generation models (artificialanalysis.ai / lmarena image arena).

Prompt Adherence
image

How well the generated image matches the textual prompt, as evaluated by human raters.

Output Speed
performance

Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.

Time to First Token
performance

Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.

By the numbers

  • 38 models
  • 16 benchmarks
  • 8 scenarios
  • 38 pricing entries

Contribute

Pricing changes? Want to add a model? File a PR against data/manual/pricing.yaml or data/manual/seed_models.yaml. CI runs the same nightly ingestion against your branch.

Found a wrong benchmark score? Most scores come from upstream leaderboards we trust; if one looks off, file an issue and we'll double-check the source.

The whole stack is open: a Next.js static export plus a small Python ingestion pipeline. Run it locally with python scripts/run_all.py.

Disclaimers

  • Benchmark scores are summary statistics. They don't predict how a model will do on your task.
  • Pricing is best-effort and changes constantly. Always confirm on the provider's page before relying on a number.
  • Pricing for open-source models reflects the median across major hosted endpoints (Together, Fireworks, Replicate). Your self-hosted cost will differ.
  • We don't run any of these models ourselves and have no commercial relationship with any provider.