About
Benchmark Explorer is a free, ad-free, open-data tool for comparing AI models. There is no backend, no database, no tracking. The whole site is a static bundle that recomputes everything in your browser.
Methodology
For each benchmark, raw scores are min-max normalised to a 0–100 scale across the participating models. This makes scores from very different benchmarks (Elo points, % pass rate, % accuracy) directly comparable.
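The normalisation step can be sketched like this (a minimal illustration, not the site's actual code; the function and variable names are invented):

```python
def minmax_normalise(scores):
    """Rescale raw benchmark scores to 0-100 across participating models.

    `scores` maps model name -> raw score (Elo points, % pass rate, ...).
    Models the benchmark never reported are simply absent from the dict.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # every model tied: place them all at the midpoint
        return {m: 50.0 for m in scores}
    return {m: 100.0 * (v - lo) / (hi - lo) for m, v in scores.items()}
```

So an Elo column like {1200, 1250, 1300} maps to {0, 50, 100}, putting it on the same footing as a pass-rate column.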
Scenario scores are a weighted average of normalised benchmark scores. If a model is missing data for some benchmarks, the weights are renormalised over what's available — we never penalise a model just because the upstream hasn't reported a score.
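The renormalisation over available benchmarks can be sketched as follows (illustrative only; names are made up):

```python
def scenario_score(norm_scores, weights):
    """Weighted average of normalised benchmark scores for one model.

    `norm_scores` maps benchmark -> 0-100 score; benchmarks the model has
    no data for are absent. `weights` maps benchmark -> scenario weight.
    Weights are renormalised over the benchmarks actually present, so a
    missing upstream score doesn't drag the model down.
    """
    present = [b for b in weights if b in norm_scores]
    if not present:
        return None  # no overlap between scenario and reported data
    total = sum(weights[b] for b in present)
    return sum(weights[b] * norm_scores[b] for b in present) / total
```

For example, with weights 0.5/0.3/0.2 and the 0.2-weight benchmark missing, the remaining weights are rescaled by 1/0.8 and the model is scored on the two benchmarks it has.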
Cost is folded in via a separate cost-vs-quality slider. Cost is converted to a 0–100 score on a log scale (because pricing spans 4 orders of magnitude) and combined with the quality score to produce a composite ranking.
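One way the log-scale cost score and slider blend could look (an assumption about the exact formula, not the site's verified implementation; bounds and names are invented):

```python
import math

def cost_score(price, cheapest, priciest):
    """Map a price onto 0-100: cheapest tracked model -> 100, priciest -> 0.

    Interpolates in log10 space because tracked prices span roughly four
    orders of magnitude; a linear scale would crush all the cheap models
    into one corner.
    """
    lo, hi = math.log10(cheapest), math.log10(priciest)
    t = (math.log10(price) - lo) / (hi - lo)
    return 100.0 * (1.0 - min(max(t, 0.0), 1.0))

def composite(quality, cost, slider):
    """Blend the two scores; slider 0.0 = quality only, 1.0 = cost only."""
    return (1.0 - slider) * quality + slider * cost
```

On a log scale the geometric midpoint of the price range lands at a cost score of 50, which is the behaviour a four-orders-of-magnitude spread calls for.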
Data sources
All data is ingested nightly via GitHub Actions. Each source has a fallback dataset baked into the repo so the site keeps rendering even when an upstream is down.
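The fetch-with-fallback pattern each source goes through can be sketched like this (illustrative only; the paths and function names are invented, not the repo's actual API):

```python
import json
from pathlib import Path

def ingest(source_name, fetch_fn, fallback_dir=Path("data/fallback")):
    """Try the live upstream; on any failure, load the baked-in snapshot.

    `fetch_fn` returns parsed rows from the upstream. The fallback file is
    a JSON snapshot committed to the repo, so the site renders even when
    the upstream is down or its response fails to parse.
    """
    try:
        rows = fetch_fn()
        status = "live"
    except Exception as err:  # network error, bad JSON, schema drift, ...
        path = fallback_dir / f"{source_name}.json"
        rows = json.loads(path.read_text())
        status = f"used fallback ({err})"
    return rows, status
```

The design choice is deliberate: a parse error in one upstream degrades that one source to its last known-good snapshot instead of taking the whole build down.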
Tracked benchmarks
- Crowdsourced pairwise human preference rankings of LLM responses. Higher Elo means more frequently preferred by users.
- Harder version of MMLU testing knowledge across 57 academic subjects; reduces guessing-friendly answers.
- Graduate-level Google-proof Q&A in physics, chemistry, and biology. Diamond subset is the hardest tier with PhD-validated answers.
- 500 high-school competition math problems requiring multi-step solutions. Scored on final-answer correctness.
- American Invitational Mathematics Examination 2024 problems. Three-digit integer answers; very hard for non-reasoning models.
- 164 hand-written Python programming problems scored by passing unit tests. Saturated for frontier models.
- Continuously refreshed coding benchmark drawing from LeetCode, AtCoder, and Codeforces; reduces benchmark contamination.
- Real GitHub issues solved end-to-end. Verified subset is a 500-task human-validated slice of SWE-bench.
- Verifiable instruction-following benchmark; 25 categories of strict formatting / structural directives.
- Massive Multi-discipline Multimodal Understanding; college-exam level questions with images across 30+ subjects.
- Math reasoning over visual contexts (charts, figures, geometry).
- Long-context retrieval and reasoning suite. We report the 128k-token effective-context score.
- Crowdsourced pairwise human preference for image generation models (artificialanalysis.ai / lmarena image arena).
- How well the generated image matches the textual prompt, as evaluated by human raters.
- Median sustained output speed in tokens per second on the model's first-party API for medium-length prompts. Higher is faster.
- Median time from request to first output chunk in milliseconds on the model's first-party API for medium-length prompts. Lower is snappier; reasoning models are penalised here because they think before talking.
Contribute
Pricing changed? Want to add a model? File a PR against data/manual/pricing.yaml or data/manual/seed_models.yaml. CI runs the same nightly ingestion pipeline against your branch.
Found a wrong benchmark score? Most scores come from upstream leaderboards we trust; if one looks off, file an issue and we'll double-check the source.
The whole stack is open: a Next.js static export plus a small Python ingestion pipeline. Run it locally with python scripts/run_all.py.
Disclaimers
- Benchmark scores are summary statistics. They don't predict how a model will do on your task.
- Pricing is best-effort and changes constantly. Always confirm on the provider's page before relying on a number.
- Open-source pricing reflects a median of major hosted endpoints (Together, Fireworks, Replicate). Your self-hosted cost will differ.
- We don't run any of these models ourselves and have no commercial relationship with any provider.