
Public benchmark · v0.7

CHINI-bench

The first deterministic, simulator-graded benchmark for AI system design.

Models emit a Chinilla architecture. The simulator runs it through stress scenarios. Pass or fail is mechanical. No LLM-as-judge.


450 problems · 3634 submissions scored

Chapter 1 · What we measure

Five classes, one simulator

Real systems break in specific ways. They double-charge a customer because a webhook fired twice. They fall over when traffic spikes 10x at launch. They blow the cloud bill on a fanout nobody capacity-planned. They take down everything when one pod dies.

CHINI-bench asks an AI to design a system on paper, then runs that paper design through a simulator that reproduces those exact failure modes. The problems span five classes, on purpose:

PC1 · SWE backend

URL shorteners, payment webhooks, rate limiters. The interview corpus.

PC2 · Operations

Cafes, restaurants, ER triage, pottery studios. Physical capacity, real bottlenecks.

PC3 · Personal

Habit loops, inbox zero, couch-to-5K. Cravings as packets, willpower as backpressure.

PC4 · Civic

Polling stations, vaccine rollouts, disaster shelters. Equity and cold-chain constraints.

PC5 · Adversarial

DDoS shields, phishing funnels. Attacker is in the graph; defenses must hold without dropping clean traffic.

The simulator is domain-blind. Same primitives (queue, retry, ratelimit, circuitbreaker, split, batch) score a backend service and a cafe morning rush with identical math. That is the moat against pretraining contamination: a model that crushes PC1 but tanks PC2-PC5 is recalling, not designing.
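To make "domain-blind" concrete, here is a minimal sketch of a bounded queue primitive. It is not the Chinilla simulator: `run_queue`, its parameters, and the traffic numbers are assumptions chosen for illustration. The point is only that the math never asks whether a packet is a webhook or a latte order.

```python
from dataclasses import dataclass


@dataclass
class QueueResult:
    delivered: int
    dropped: int

    @property
    def drop_rate(self) -> float:
        total = self.delivered + self.dropped
        return self.dropped / total if total else 0.0


def run_queue(arrivals: list[int], service_per_tick: int, capacity: int) -> QueueResult:
    """Bounded FIFO queue: each tick, admit arrivals up to capacity, then serve."""
    backlog = delivered = dropped = 0
    for incoming in arrivals:
        admitted = min(incoming, capacity - backlog)
        dropped += incoming - admitted  # overflow is lost, not retried
        backlog += admitted
        served = min(backlog, service_per_tick)
        delivered += served
        backlog -= served
    # Drain whatever is still queued after the last arrival tick.
    delivered += backlog
    return QueueResult(delivered, dropped)


# Same primitive, same inputs, two domains: only the variable names differ.
spike = [10, 10, 50, 10]  # a 5x burst on the third tick
webhook_service = run_queue(spike, service_per_tick=15, capacity=30)
cafe_morning_rush = run_queue(spike, service_per_tick=15, capacity=30)

print(webhook_service, webhook_service.drop_rate)
print(webhook_service == cafe_morning_rush)
```

Both calls return the same result because nothing in the function knows which domain it is scoring.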

See all 450 problems in the appendix →

First findings · v0.3 (single-shot)

Four frontier models. Combined, they solve only a third of the bench.

Four flagships. 30 problems. 120 single-shot runs through the deterministic simulator. Composite score per class, averaged across all problems in the class:

Model	PC1 SWE	PC2 Ops	PC3 Personal	PC4 Civic	PC5 Adversarial	Avg	Pass
claude-sonnet-4.6	77	60	57	75	75	70	5/30
gemini-3.1-pro	77	48	75	73	70	70	4/30
gpt-5.4	68	60	73	77	75	70	4/30
grok-4.20	64	71	69	67	77	69	2/30

Three models tie at 70 on average; Grok trails by one point. The averages hide the real story: every model has a class it tanks on. Gemini-3.1-Pro leads PC1 alongside Claude (77) but craters on PC2 everyday operations with a 48, the worst single-class score in the sweep. Claude is strong on PC1, PC4, and PC5 but drops to 57 on PC3 personal systems. GPT-5.4 leads PC4 (77) and bottoms out at 60 on PC2. Grok has the narrowest spread (64 to 77) and tops PC2 and PC5, yet posts the lowest average and the fewest passes. The deltas are not noise: same prompt, same simulator, same scoring math, every time.

Combined coverage: 10/30. Pool every passing run from all four frontier models and only ten of thirty problems have ever been solved by anyone. Two thirds of v0.3 remain open.
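Combined coverage is a set union, not a sum, which is why 15 individual passes collapse to 10 solved problems. A small sketch of that arithmetic follows. The problem IDs are hypothetical: only the per-model pass counts (5, 4, 4, 2) and the 10/30 total come from the results above, so the overlaps shown here are one arrangement that fits those numbers, not the actual ones.

```python
# Hypothetical problem IDs. Only the set sizes match the published counts.
passes = {
    "claude-sonnet-4.6": {"p01", "p02", "p03", "p04", "p05"},
    "gemini-3.1-pro": {"p01", "p02", "p06", "p07"},
    "gpt-5.4": {"p03", "p06", "p08", "p09"},
    "grok-4.20": {"p04", "p10"},
}
TOTAL_PROBLEMS = 30

total_passes = sum(len(solved) for solved in passes.values())
covered = set().union(*passes.values())  # problems solved by at least one model

print(f"passing runs: {total_passes}")
print(f"combined coverage: {len(covered)}/{TOTAL_PROBLEMS}")
```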

Full leaderboard →

Reflexion (multi-turn agentic) · v0.6

Two failure modes, opposite directions.

Same prompt, but now the model sees its v1 score, the failing scenarios, and which structural checks broke. One revision, one re-submit. Across 120 sweep runs (4 frontier models × 30 problems), only 3 v2 attempts passed. Per the table below, GPT-5.4 and Claude Sonnet 4.6 degrade after revision, while Grok 4.20 gains three points. Gemini 3.1 Pro holds steady and lands the most v2 passes, but plateaus on problems that need a structural rewrite.

Model	Avg v1	Avg v2	Δ	Pass after rev	Struct. fix
gemini-3.1-pro	pending*	73	pending*	2/30	pending*
grok-4.20	65	68	+3	1/30	90%
gpt-5.4	64	60	-4	0/30	80%
claude-sonnet-4.6	62	53	-9	0/30	74%

* Gemini's v1, Δ, and struct-fix values will sync on the next deploy. Its v2 average and pass count are final.
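The Δ column is v2 minus v1. A quick check using the averages from the table, with Gemini's v1 left as `None` because it has not been published yet (the `delta` helper is illustrative, not part of the CLI):

```python
# (avg v1, avg v2) copied from the table above; None marks an unpublished value.
rows = {
    "gemini-3.1-pro": (None, 73),
    "grok-4.20": (65, 68),
    "gpt-5.4": (64, 60),
    "claude-sonnet-4.6": (62, 53),
}


def delta(v1, v2):
    """Score change after one revision, or None when v1 is unavailable."""
    return None if v1 is None else v2 - v1


deltas = {model: delta(v1, v2) for model, (v1, v2) in rows.items()}

for model, change in deltas.items():
    label = "pending" if change is None else f"{change:+d}"
    print(f"{model:20s} {label}")
```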

Claude and GPT overshoot: feedback says "this broke," so they restructure aggressively, add a component, blow past the component count limit, and tank the design score. Gemini undershoots: it patches what the feedback flagged without restructuring, which preserves what worked and lifts its v2 average above every other frontier model, but leaves the harder "missing required behavior" failures untouched. The right move on those problems is to rewrite, not patch. One round of feedback exposes that frontier models read structural failure as either "add a piece" or "fix this exact thing," and almost never as "the shape is wrong, start over."

Reflexion leaderboard →

Chapter 2 · How it works

Same prompt, same simulator, same math

Same prompt, same simulator, same scoring weights, every model. No human judge, no LLM judge, no rubric. The same architecture submitted twice produces the same score, run by anyone, anywhere.

1. A model reads the prompt + constraints.
   Same prompt for every model. No retries, no hand-tuning.
2. It emits a CanvasState (architecture as JSON).
   Components, connections, behaviors. The same format Chinilla uses internally.
3. The simulator runs every stress scenario.
   Baseline. 5x spike. Outage of a critical component. Noisy network.
4. Each scenario is graded against numeric criteria.
   Stability score, drop rate, delivery rate, error count. Pass or fail. No vibes.
5. A composite score lands the model on the leaderboard.
   Weighted across stability, delivery, cost, and constraint compliance.
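The five steps above can be sketched end to end. Everything specific in this example is an assumption made for illustration: the CanvasState field names, the thresholds in `CRITERIA`, the percentages in `WEIGHTS`, and the simulator output are invented here, and the real schema and scoring formula are in the methodology. What the sketch does show is the shape of the grading: numeric comparisons and a weighted average, with no judgment call anywhere.

```python
import json

# Hypothetical CanvasState. Field names are illustrative, not the real schema.
CANVAS_JSON = """
{
  "components": [
    {"id": "api", "behavior": "ratelimit"},
    {"id": "jobs", "behavior": "queue"},
    {"id": "worker", "behavior": "retry"}
  ],
  "connections": [["api", "jobs"], ["jobs", "worker"]]
}
"""

# Hypothetical pass/fail thresholds for a scenario.
CRITERIA = {"min_stability": 0.90, "max_drop_rate": 0.05,
            "min_delivery_rate": 0.95, "max_errors": 0}

# Hypothetical weights in percent; they only need to sum to 100 here.
WEIGHTS = {"stability": 40, "delivery": 30, "cost": 15, "constraints": 15}


def scenario_passes(metrics: dict) -> bool:
    """Mechanical check: every numeric criterion must hold."""
    return (
        metrics["stability"] >= CRITERIA["min_stability"]
        and metrics["drop_rate"] <= CRITERIA["max_drop_rate"]
        and metrics["delivery_rate"] >= CRITERIA["min_delivery_rate"]
        and metrics["errors"] <= CRITERIA["max_errors"]
    )


def composite(subscores: dict) -> float:
    """Weighted average of 0-100 subscores."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS) / 100


canvas = json.loads(CANVAS_JSON)  # step 2: the model's emitted architecture

# Made-up simulator output standing in for step 3.
simulated = {
    "baseline": {"stability": 0.99, "drop_rate": 0.00, "delivery_rate": 1.00, "errors": 0},
    "spike_5x": {"stability": 0.92, "drop_rate": 0.12, "delivery_rate": 0.88, "errors": 0},
}

verdicts = {name: scenario_passes(m) for name, m in simulated.items()}  # step 4
score = composite({"stability": 90, "delivery": 80, "cost": 60, "constraints": 100})  # step 5

print(len(canvas["components"]), "components")
print(verdicts)
print(score)
```

In this made-up run the design survives the baseline but fails the 5x spike on drop rate and delivery rate, so it would score without passing.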

Full methodology, scoring formula, and reproducibility notes →

Chapter 3 · Why it matters

Why it matters for AI

Most popular AI benchmarks have one of three problems. They were memorized during pretraining. They are graded by another LLM (which has its own biases). Or they only measure what a model says, not what its output actually does when run.

CHINI-bench was designed to dodge all three:

Benchmark	Judge	Contamination risk	Tests output behavior?
MMLU / GPQA	String match (multiple choice)	High (public for years)	No
HumanEval	Unit tests	High (public, scraped)	Yes (small)
SWE-bench	Repo test suites	Medium	Yes
Chatbot Arena	Human pairwise vote	N/A	No (taste only)
MT-Bench / AlpacaEval	LLM-as-judge	Medium	No
CHINI-bench	Discrete-event simulator	Low (designs are novel + scored mechanically)	Yes (under stress)

Frontier models are getting deployed as system designers, on-call engineers, and architecture reviewers. The industry needs a public, reproducible signal for whether a model's design choices actually survive contact with traffic. Right now, that signal does not exist anywhere else.

Chapter 4 · How to submit

One CLI, your machine

One CLI. Your API key, your machine. No accounts, no queue, no upload of anything but the canvas the model produced.

Three steps

From clone to leaderboard in under a minute.

1 Get the CLI

git clone https://github.com/collapseindex/chini-bench-cli

2 Set a key

export OPENROUTER_API_KEY=...

Any provider. Key never leaves your machine.

3 Run

chini-bench

Interactive menu. No flags to remember.

Or one-shot, end-to-end

chini-bench run chini-001-url-shortener \
  --provider openrouter --model google/gemini-3.1-pro-preview \
  --as alex

Providers: openai · anthropic · google · openrouter · ollama


Your API key never leaves your machine. Only the resulting CanvasState is uploaded.

How rankings work

One row per user × model. Same person on different models = multiple rows; same model run by multiple people = multiple comparable rows.
Sorted by average composite score across every problem the row has run. Tie-breakers: pass rate, then run count.
The Reflexion track sorts by pass-after-revision rate instead.
Submitting the same problem twice keeps only the most recent run. Re-running cannot inflate your average.
You need 3+ scored runs to enter the ranked table. Below that you appear in Recent submissions.
Click any ranked row on the leaderboard for the full per-problem breakdown.
The By model tab aggregates the same data across submitters, which is useful for answering "how does this model do overall?"
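The ranking rules above translate into a few lines of code. This is a sketch under stated assumptions, not the leaderboard's implementation: the `Run` record and `rank` function are hypothetical, tie-breakers are assumed to favor the higher pass rate and then the higher run count, and run count is assumed to be taken after duplicates are dropped.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    user: str
    model: str
    problem: str
    score: int
    passed: bool
    submitted_at: int  # any increasing timestamp


MIN_RUNS = 3  # rows below this appear only in Recent submissions


def rank(runs: list[Run]) -> list[tuple]:
    # Keep only the most recent run per (user, model, problem).
    latest = {}
    for run in sorted(runs, key=lambda r: r.submitted_at):
        latest[(run.user, run.model, run.problem)] = run

    # One row per user x model.
    rows = defaultdict(list)
    for (user, model, _problem), run in latest.items():
        rows[(user, model)].append(run)

    table = []
    for key, kept in rows.items():
        if len(kept) < MIN_RUNS:
            continue
        average = sum(r.score for r in kept) / len(kept)
        pass_rate = sum(r.passed for r in kept) / len(kept)
        table.append((key, average, pass_rate, len(kept)))

    # Sort by average composite, then pass rate, then run count (all descending).
    table.sort(key=lambda row: (row[1], row[2], row[3]), reverse=True)
    return table


runs = [
    Run("alex", "model-a", "p1", 60, False, 1),
    Run("alex", "model-a", "p1", 80, True, 2),  # replaces the earlier p1 run
    Run("alex", "model-a", "p2", 70, False, 3),
    Run("alex", "model-a", "p3", 90, True, 4),
    Run("sam", "model-a", "p1", 80, True, 5),
    Run("sam", "model-a", "p2", 80, False, 6),
    Run("sam", "model-a", "p3", 80, False, 7),
    Run("kim", "model-b", "p1", 99, True, 8),
    Run("kim", "model-b", "p2", 99, True, 9),  # only 2 runs: unranked
]
table = rank(runs)
for row in table:
    print(row)
```

Here alex and sam tie on average, so pass rate decides the order, and kim stays out of the ranked table despite the highest scores.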

See the leaderboard → / Full methodology / Try Chinilla yourself

Built by alex · @chinillaboard · DMs open for human-baseline submissions.