CHINI-bench
The first deterministic, simulator-graded benchmark for AI system design.
Models emit a Chinilla architecture. The simulator runs it through stress scenarios. Pass or fail is mechanical. No LLM-as-judge.
Five classes, one simulator
Real systems break in specific ways. They double-charge a customer because a webhook fired twice. They fall over when traffic spikes 10x at launch. They blow the cloud bill on a fanout nobody capacity-planned. They take down everything when one pod dies.
CHINI-bench asks an AI to design a system on paper, then runs that paper design through a simulator that reproduces those exact failure modes. The problems span five classes, on purpose:
The simulator is domain-blind. Same primitives (queue, retry, ratelimit, circuitbreaker, split, batch) score a backend service and a cafe morning rush with identical math. That is the moat against pretraining contamination: a model that crushes PC1 but tanks PC2-PC5 is recalling, not designing.
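To make "domain-blind" concrete, here is a minimal sketch of a single bounded-queue primitive scored on two differently labeled systems. It is not the CHINI-bench simulator: the `QueueStage` class, the `drop_rate` function, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueueStage:
    """A bounded queue with a fixed service rate (hypothetical primitive)."""
    label: str         # human-readable only; never used in the math
    capacity: int      # maximum items allowed to wait
    service_rate: int  # items served per tick


def drop_rate(stage: QueueStage, arrivals: list[int]) -> float:
    """Run a tick-based simulation and return the fraction of arrivals dropped."""
    waiting = 0
    dropped = 0
    for arriving in arrivals:
        # Serve first, then admit new arrivals up to the remaining capacity.
        waiting = max(0, waiting - stage.service_rate)
        admitted = min(arriving, stage.capacity - waiting)
        dropped += arriving - admitted
        waiting += admitted
    total = sum(arrivals)
    return dropped / total if total else 0.0


# Same load shape for both systems: steady traffic, a 5x spike, steady again.
load = [10] * 5 + [50] * 3 + [10] * 5

api = QueueStage("payments-api request queue", capacity=40, service_rate=20)
cafe = QueueStage("cafe espresso counter", capacity=40, service_rate=20)

# The label changes, the arithmetic does not.
assert drop_rate(api, load) == drop_rate(cafe, load)
print(f"drop rate under spike: {drop_rate(api, load):.2f}")
```

Because the label never enters the computation, a design that handles the spike in one domain handles it in the other, which is the property the paragraph above relies on.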
Four frontier models. Combined, they solve only a third of the bench.
Four flagships. 30 problems. 120 single-shot runs through the deterministic simulator. Composite score per class, average across all problems in the class:
| Model | PC1 SWE | PC2 Ops | PC3 Personal | PC4 Civic | PC5 Adversarial | Avg | Pass |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4.6 | 77 | 60 | 57 | 75 | 75 | 70 | 5/30 |
| gemini-3.1-pro | 77 | 48 | 75 | 73 | 70 | 70 | 4/30 |
| gpt-5.4 | 68 | 60 | 73 | 77 | 75 | 70 | 4/30 |
| grok-4.20 | 64 | 71 | 69 | 67 | 77 | 69 | 2/30 |
Three models tie at 70 on average; Grok trails by one point. The averages hide the real story: every model has a class it tanks on. Gemini-3.1-Pro leads PC1 alongside Claude (77) but craters on PC2 everyday operations with a 48, the worst single-class score in the sweep. Claude is strong on PC1, PC4, and PC5 but weak on PC3 personal systems (57) and PC2 (60). GPT-5.4 stays between 60 and 77. Grok has the narrowest spread (64 to 77) and tops PC2 and PC5, yet posts the lowest average and the fewest passes. The deltas are not noise: same prompt, same simulator, same scoring math, every time.
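One way to quantify how "peaky" each model is would be the gap between its best and worst class. The sketch below uses the per-class scores from the table; the `spread` helper is an illustrative measure and not part of the official scoring.

```python
# Per-class composite scores, copied from the table above (PC1..PC5).
scores = {
    "claude-sonnet-4.6": [77, 60, 57, 75, 75],
    "gemini-3.1-pro":    [77, 48, 75, 73, 70],
    "gpt-5.4":           [68, 60, 73, 77, 75],
    "grok-4.20":         [64, 71, 69, 67, 77],
}
classes = ["PC1", "PC2", "PC3", "PC4", "PC5"]


def spread(row: list[int]) -> int:
    """Best class score minus worst class score (illustrative peakiness measure)."""
    return max(row) - min(row)


for model, row in scores.items():
    weakest = classes[row.index(min(row))]
    print(f"{model:18} spread={spread(row):2d}  weakest={weakest} ({min(row)})")
```

On these numbers the spreads are 20 (Claude), 29 (Gemini), 17 (GPT-5.4), and 13 (Grok).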
Two failure modes, opposite directions.
Same prompt, but now the model sees its v1 score, the failing scenarios, and which structural checks broke. One revision, one re-submit. Across 120 sweep runs (4 frontier models × 30 problems), only 3 v2 attempts passed. Of the three models with a published v1 average, GPT-5.4 and Claude score lower after revision, while Grok gains three points. The fourth (Gemini 3.1 Pro) holds steady and lands the most v2 passes, but plateaus on problems that need a structural rewrite.
| Model | Avg v1 | Avg v2 | Δ | Pass after rev | Struct. fix |
|---|---|---|---|---|---|
| gemini-3.1-pro | —* | 73 | —* | 2/30 | —* |
| grok-4.20 | 65 | 68 | +3 | 1/30 | 90% |
| gpt-5.4 | 64 | 60 | -4 | 0/30 | 80% |
| claude-sonnet-4.6 | 62 | 53 | -9 | 0/30 | 74% |
* Gemini's v1, Δ, and struct-fix figures will sync on the next deploy. Its v2 average and pass count are final.
Two failure modes, opposite directions. Claude and GPT overshoot: feedback says "this broke," they restructure aggressively, add a component, blow past the count limit, and tank the design score. Gemini undershoots: it patches what the feedback flagged without restructuring, which preserves what worked and lifts the v2 average above every other frontier model, but it leaves the harder "missing required behavior" failures untouched. The right move on those problems is to rewrite, not patch. One round of feedback exposes that frontier models read structural failure as either "add a piece" or "fix this exact thing" — almost never as "the shape is wrong, start over."
Same prompt, same simulator, same math
Same prompt, same simulator, same scoring weights, every model. No human judge, no LLM judge, no rubric. The same architecture submitted twice produces the same score, run by anyone, anywhere.
1. A model reads the prompt + constraints.
   Same prompt for every model. No retries, no hand-tuning.
2. It emits a CanvasState (architecture as JSON).
   Components, connections, behaviors. The same format Chinilla uses internally.
3. The simulator runs every stress scenario.
   Baseline. 5x spike. Outage of a critical component. Noisy network.
4. Each scenario is graded against numeric criteria.
   Stability score, drop rate, delivery rate, error count. Pass or fail. No vibes.
5. A composite score lands the model on the leaderboard.
   Weighted across stability, delivery, cost, and constraint compliance.
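Steps 4 and 5 can be sketched as a threshold check per scenario followed by a weighted sum. This is only an illustration of the shape of the computation: the thresholds in `CRITERIA`, the weights in `WEIGHTS`, and the sample numbers are assumptions, not the published scoring formula.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    name: str
    stability: float      # 0..1
    drop_rate: float      # fraction of work dropped
    delivery_rate: float  # fraction of work delivered
    errors: int


# Hypothetical numeric criteria; the real thresholds are not stated here.
CRITERIA = {"min_stability": 0.8, "max_drop_rate": 0.05,
            "min_delivery_rate": 0.95, "max_errors": 0}

# Hypothetical weights over 0..100 sub-scores; they sum to 1.
WEIGHTS = {"stability": 0.4, "delivery": 0.3, "cost": 0.15, "constraints": 0.15}


def scenario_passes(r: ScenarioResult) -> bool:
    """Mechanical pass/fail: every numeric criterion must hold."""
    return (r.stability >= CRITERIA["min_stability"]
            and r.drop_rate <= CRITERIA["max_drop_rate"]
            and r.delivery_rate >= CRITERIA["min_delivery_rate"]
            and r.errors <= CRITERIA["max_errors"])


def composite(parts: dict[str, float]) -> float:
    """Weighted sum of the four sub-scores."""
    return sum(WEIGHTS[k] * parts[k] for k in WEIGHTS)


results = [
    ScenarioResult("baseline", 0.97, 0.00, 1.00, 0),
    ScenarioResult("spike_5x", 0.82, 0.04, 0.96, 0),
    ScenarioResult("outage", 0.55, 0.30, 0.70, 12),
    ScenarioResult("noisy_network", 0.90, 0.02, 0.97, 0),
]
failing = [r.name for r in results if not scenario_passes(r)]
score = composite({"stability": 80, "delivery": 90, "cost": 60, "constraints": 100})
print(failing, round(score, 1))
```

Both functions are pure, so the same inputs always yield the same verdict and score, which is what makes the grading reproducible.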
Full methodology, scoring formula, and reproducibility notes →
Why it matters for AI
Most popular AI benchmarks have one of three problems. They were memorized during pretraining. They are graded by another LLM (which has its own biases). Or they only measure what a model says, not what its output actually does when run.
CHINI-bench was designed to dodge all three:
| Benchmark | Judge | Contamination risk | Tests output behavior? |
|---|---|---|---|
| MMLU / GPQA | String match (multiple choice) | High (public for years) | No |
| HumanEval | Unit tests | High (public, scraped) | Yes (small) |
| SWE-bench | Repo test suites | Medium | Yes |
| Chatbot Arena | Human pairwise vote | N/A | No (taste only) |
| MT-Bench / AlpacaEval | LLM-as-judge | Medium | No |
| CHINI-bench | Discrete-event simulator | Low (designs are novel + scored mechanically) | Yes (under stress) |
Frontier models are getting deployed as system designers, on-call engineers, and architecture reviewers. The industry needs a public, reproducible signal for whether a model's design choices actually survive contact with traffic. Right now, that signal does not exist anywhere else.
One CLI, your machine
One CLI. Your API key, your machine. No accounts, no queue, no upload of anything but the canvas the model produced.

    git clone https://github.com/collapseindex/chini-bench-cli
    export OPENROUTER_API_KEY=...
    chini-bench run chini-001-url-shortener \
      --provider openrouter --model google/gemini-3.1-pro-preview \
      --as alex

How rankings work
- One row per user × model. Same person on different models = multiple rows; same model run by multiple people = multiple comparable rows.
- Sorted by average composite score across every problem the row has run. Tie-breakers: pass rate, then run count.
- The Reflexion track sorts by pass-after-revision rate instead.
- Submitting the same problem twice keeps only the most recent run. Re-running cannot inflate your average.
- You need 3+ scored runs to enter the ranked table. Below that you appear in Recent submissions.
- Click any ranked row on the leaderboard for the full per-problem breakdown.
- The By model tab aggregates the same data across submitters — useful for "how does this model do overall?"
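The ranking rules above can be sketched as follows. The `Run` record, its field names, and the sample data are hypothetical, and sorting tie-breakers from highest to lowest is an assumption; only the rules themselves come from the list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    user: str
    model: str
    problem: str
    score: float       # composite score for this run
    passed: bool
    submitted_at: int  # larger means more recent

MIN_RUNS = 3  # scored runs needed to enter the ranked table


def rank(runs: list[Run]) -> list[tuple]:
    # Keep only the most recent run per (user, model, problem).
    latest: dict[tuple, Run] = {}
    for r in runs:
        key = (r.user, r.model, r.problem)
        if key not in latest or r.submitted_at > latest[key].submitted_at:
            latest[key] = r

    # One row per user x model.
    rows = defaultdict(list)
    for (user, model, _problem), r in latest.items():
        rows[(user, model)].append(r)

    table = []
    for (user, model), rs in rows.items():
        if len(rs) < MIN_RUNS:
            continue  # below the threshold: not ranked
        avg = sum(r.score for r in rs) / len(rs)
        pass_rate = sum(r.passed for r in rs) / len(rs)
        table.append((user, model, avg, pass_rate, len(rs)))

    # Sort by average composite, then pass rate, then run count (all descending).
    table.sort(key=lambda t: (-t[2], -t[3], -t[4]))
    return table


runs = [
    Run("alex", "model-a", "problem-1", 60, False, 1),
    Run("alex", "model-a", "problem-1", 80, True, 2),   # replaces the earlier run
    Run("alex", "model-a", "problem-2", 70, False, 3),
    Run("alex", "model-a", "problem-3", 90, True, 4),
    Run("sam", "model-b", "problem-1", 95, True, 5),
    Run("sam", "model-b", "problem-2", 95, True, 6),    # only 2 runs: unranked
]
ranked = rank(runs)
print(ranked)
```

Deduplicating before averaging is what keeps a re-run from inflating a row: the older `problem-1` score is discarded rather than averaged in.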