CHINI-bench · Public benchmark · v0.3
The first deterministic, simulator-graded benchmark for AI system design.
Models emit a Chinilla architecture. The simulator runs it through stress scenarios. Pass or fail is mechanical. No LLM-as-judge.
What we measure
Real systems break in specific ways. They double-charge a customer because a webhook fired twice. They fall over when traffic spikes 10x at launch. They blow the cloud bill on a fanout nobody capacity-planned. They take down everything when one pod dies.
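The double-charge failure above comes from non-idempotent webhook handling. A minimal sketch of the standard fix, deduplicating deliveries by event id (all names here are hypothetical illustrations, not part of the benchmark):

```python
# Hypothetical sketch: dedupe webhook deliveries with an idempotency key.
# A retried delivery carries the same event id, so the charge runs only once.
processed: set[str] = set()
charges: list[int] = []

def handle_webhook(event_id: str, amount_cents: int) -> bool:
    """Charge once per event id; repeated deliveries are no-ops."""
    if event_id in processed:
        return False          # duplicate delivery: skip the charge
    processed.add(event_id)
    charges.append(amount_cents)
    return True

# The provider fires the same event twice (e.g. a timeout-triggered retry).
handle_webhook("evt_123", 4999)
handle_webhook("evt_123", 4999)   # deduplicated: no second charge
```

This is exactly the kind of behavior the simulator checks mechanically: one charge recorded, not two.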
CHINI-bench asks an AI to design a system on paper, then runs that paper design through a simulator that reproduces those exact failure modes. The problems span five classes, on purpose: software engineering (PC1), everyday operations (PC2), personal systems (PC3), civic systems (PC4), and adversarial scenarios (PC5).
The simulator is domain-blind. Same primitives (queue, retry, ratelimit, circuitbreaker, split, batch) score a backend service and a cafe morning rush with identical math. That is the moat against pretraining contamination: a model that crushes PC1 but tanks PC2-PC5 is recalling, not designing.
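Domain blindness can be pictured with a toy fixed-rate queue model (an illustration only, not the benchmark's actual simulator): the same function grades a backend API and a cafe counter, and identical ratios of arrival rate, service rate, and capacity produce identical scores.

```python
# Toy illustration of domain-blind scoring: one queue model, two domains.
def simulate_queue(arrivals_per_tick: int, service_per_tick: int,
                   capacity: int, ticks: int) -> float:
    """Return delivery rate: fraction of arrivals served; overflow is dropped."""
    backlog = served = 0
    for _ in range(ticks):
        backlog += arrivals_per_tick
        if backlog > capacity:          # queue overflows: excess is dropped
            backlog = capacity
        done = min(backlog, service_per_tick)
        served += done
        backlog -= done
    return served / (arrivals_per_tick * ticks)

# Same math, two "domains": HTTP requests vs. cafe customers.
api_score  = simulate_queue(arrivals_per_tick=120, service_per_tick=100,
                            capacity=50, ticks=60)
cafe_score = simulate_queue(arrivals_per_tick=12, service_per_tick=10,
                            capacity=5, ticks=60)
```

Both runs score 5/12 of arrivals delivered: the simulator never knows whether a "unit" is a request or a customer.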
Four frontier models. Combined, they solve only a third of the bench.
Four flagships. 30 problems. 120 single-shot runs through the deterministic simulator. Composite score per class, average across all problems in the class:
| Model | PC1 SWE | PC2 Ops | PC3 Personal | PC4 Civic | PC5 Adversarial | Avg | Pass |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4.6 | 77 | 60 | 57 | 75 | 75 | 70 | 5/30 |
| gemini-3.1-pro | 77 | 48 | 75 | 73 | 70 | 70 | 4/30 |
| gpt-5.4 | 68 | 60 | 73 | 77 | 75 | 70 | 4/30 |
| grok-4.20 | 64 | 71 | 69 | 67 | 77 | 69 | 2/30 |
Three models tie at 70 on average; Grok trails by one point. The averages hide the real story: every model has a class it tanks on. Gemini-3.1-Pro leads PC1 alongside Claude (77) but craters on PC2 everyday operations with a 48, the worst single-class score in the sweep. Claude is strong across the board except PC3 personal systems (57). GPT-5.4 is the most balanced. Grok is the least peaky but never the strongest. The deltas are not noise: same prompt, same simulator, same scoring math, every time.
How it works
Same prompt, same simulator, same scoring weights, every model. No human judge, no LLM judge, no rubric. The same architecture submitted twice produces the same score, run by anyone, anywhere.
1. A model reads the prompt + constraints.
   Same prompt for every model. No retries, no hand-tuning.
2. It emits a CanvasState (architecture as JSON).
   Components, connections, behaviors. The same format Chinilla uses internally.
3. The simulator runs every stress scenario.
   Baseline. 5x spike. Outage of a critical component. Noisy network.
4. Each scenario is graded against numeric criteria.
   Stability score, drop rate, delivery rate, error count. Pass or fail. No vibes.
5. A composite score lands the model on the leaderboard.
   Weighted across stability, delivery, cost, and constraint compliance.
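The pipeline above can be sketched end to end. The CanvasState fields and the scoring weights below are illustrative guesses, not the real schema or the published scoring formula:

```python
import json

# Illustrative CanvasState: components, connections, behaviors (field names guessed).
canvas_state = {
    "components": [
        {"id": "api", "type": "service"},
        {"id": "q1", "type": "queue", "behaviors": {"retry": 3, "ratelimit": 100}},
        {"id": "db", "type": "datastore"},
    ],
    "connections": [["api", "q1"], ["q1", "db"]],
}

# Illustrative composite: weighted across stability, delivery, cost, compliance.
WEIGHTS = {"stability": 0.35, "delivery": 0.35, "cost": 0.15, "compliance": 0.15}

def composite(scenario_scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (each 0-100)."""
    return sum(WEIGHTS[k] * scenario_scores[k] for k in WEIGHTS)

# Purely numeric grading: the same inputs always yield the same score.
score = composite({"stability": 80, "delivery": 90, "cost": 60, "compliance": 70})

# Determinism also holds at the input side: a canonical serialization of the
# same architecture is byte-identical on every run, anywhere.
canonical = json.dumps(canvas_state, sort_keys=True)
```

Because grading is arithmetic over scenario outputs, resubmitting the same canvas reproduces the same leaderboard score, which is the property the benchmark relies on.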
Full methodology, scoring formula, and reproducibility notes →
Why it matters for AI
Most popular AI benchmarks have one of three problems. They were memorized during pretraining. They are graded by another LLM (which has its own biases). Or they only measure what a model says, not what its output actually does when run.
CHINI-bench was designed to dodge all three:
| Benchmark | Judge | Contamination risk | Tests output behavior? |
|---|---|---|---|
| MMLU / GPQA | String match (multiple choice) | High (public for years) | No |
| HumanEval | Unit tests | High (public, scraped) | Yes (small) |
| SWE-bench | Repo test suites | Medium | Yes |
| Chatbot Arena | Human pairwise vote | N/A | No (taste only) |
| MT-Bench / AlpacaEval | LLM-as-judge | Medium | No |
| CHINI-bench | Discrete-event simulator | Low (designs are novel + scored mechanically) | Yes (under stress) |
Frontier models are getting deployed as system designers, on-call engineers, and architecture reviewers. The industry needs a public, reproducible signal for whether a model's design choices actually survive contact with traffic. Right now, that signal does not exist anywhere else.
How to submit
One CLI. Your API key, your machine. No accounts, no queue, no upload of anything but the canvas the model produced.
```sh
git clone github.com/collapseindex/chini-bench-cli
export OPENROUTER_API_KEY=...
chini-bench run chini-001-url-shortener \
  --provider openrouter --model google/gemini-3.1-pro-preview \
  --as alex --x chinillaboard --linkedin alexk444
```