Public benchmark · v0.3

CHINI-bench

The first deterministic, simulator-graded benchmark for AI system design.

Models emit a Chinilla architecture. The simulator runs it through stress scenarios. Pass or fail is mechanical. No LLM-as-judge.

30 problems · 120 submissions scored · 4 distinct submitters
Chapter 1

What we measure

Real systems break in specific ways. They double-charge a customer because a webhook fired twice. They fall over when traffic spikes 10x at launch. They blow the cloud bill on a fanout nobody capacity-planned. They take down everything when one pod dies.

CHINI-bench asks an AI to design a system on paper, then runs that paper design through a simulator that reproduces those exact failure modes. The problems span five classes, on purpose:

PC1 · SWE backend
  URL shorteners, payment webhooks, rate limiters. The interview corpus.
PC2 · Operations
  Cafes, restaurants, ER triage, pottery studios. Physical capacity, real bottlenecks.
PC3 · Personal systems
  Habit loops, inbox zero, couch-to-5K. Cravings as packets, willpower as backpressure.
PC4 · Civic
  Polling stations, vaccine rollouts, disaster shelters. Equity and cold-chain constraints.
PC5 · Adversarial
  DDoS shields, phishing funnels. Attacker is in the graph; defenses must hold without dropping clean traffic.

The simulator is domain-blind. Same primitives (queue, retry, ratelimit, circuitbreaker, split, batch) score a backend service and a cafe morning rush with identical math. That is the moat against pretraining contamination: a model that crushes PC1 but tanks PC2-PC5 is recalling, not designing.
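Domain-blindness can be made concrete with a toy queue primitive. This is an illustrative sketch, not CHINI-bench's actual simulator: `simulate_queue`, its parameters, and the numbers are invented, but it shows how one piece of math scores a backend request backlog and a cafe counter line identically.

```python
def simulate_queue(arrivals, service_rate, capacity):
    """Domain-blind queue primitive: 'arrivals' can be HTTP requests
    or cafe customers -- the math never knows the difference."""
    queued = served = dropped = 0
    for n in arrivals:                        # one entry per time step
        admit = min(n, capacity - queued)     # backpressure: queue is finite
        dropped += n - admit
        queued += admit
        done = min(service_rate, queued)      # drain at the service rate
        queued -= done
        served += done
    total = served + dropped + queued
    return {"served": served, "dropped": dropped, "queued": queued,
            "delivery_rate": served / total if total else 1.0}

# Identical primitive, two "domains": a launch-day traffic spike and a
# cafe morning rush are the same arrival curve to the simulator.
backend_rush = simulate_queue(arrivals=[10, 50, 10], service_rate=20, capacity=30)
cafe_rush    = simulate_queue(arrivals=[10, 50, 10], service_rate=20, capacity=30)
assert backend_rush == cafe_rush              # same inputs, same math, same score
```

A model that has memorized URL-shortener designs gets no help here: the primitive only rewards capacity planning that actually holds under the arrival curve.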

See all 30 problems in the appendix →

First findings · v0.3

Four frontier models. Combined, they solve only a third of the bench.

Four flagships. 30 problems. 120 single-shot runs through the deterministic simulator. Composite score per class, averaged across all problems in the class:

Model              PC1 SWE  PC2 Ops  PC3 Personal  PC4 Civic  PC5 Adversarial  Avg  Pass
claude-sonnet-4.6     77       60        57            75            75          70  5/30
gemini-3.1-pro        77       48        75            73            70          70  4/30
gpt-5.4               68       60        73            77            75          70  4/30
grok-4.20             64       71        69            67            77          69  2/30

Three models tie at 70 on average; Grok trails by one point. The averages hide the real story: every model has a class it tanks. Gemini-3.1-Pro leads PC1 alongside Claude (both 77) but craters to 48 on PC2 everyday operations, the worst single-class score in the sweep. Claude is strong everywhere except PC3 personal systems (57). GPT-5.4 is the most balanced. Grok has the narrowest spread and the top class scores on PC2 (71) and PC5 (77), yet still trails on average. The deltas are not noise: same prompt, same simulator, same scoring math, every time.

Combined coverage: 10/30. Pool every passing run from all four frontier models and only ten of thirty problems have ever been solved by anyone. Two thirds of v0.3 remain open.
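Pooled coverage is just the union of each model's solved set. The problem IDs below are invented for illustration; only the per-model counts match the v0.3 findings.

```python
# Hypothetical per-model solved-problem IDs (invented; counts match v0.3).
solved = {
    "claude-sonnet-4.6": {1, 3, 7, 12, 19},   # 5/30
    "gemini-3.1-pro":    {2, 3, 9, 24},       # 4/30
    "gpt-5.4":           {3, 7, 15, 24},      # 4/30
    "grok-4.20":         {1, 28},             # 2/30
}

# A problem counts as "solved by anyone" if any model ever passed it.
pooled = set().union(*solved.values())
print(f"combined coverage: {len(pooled)}/30")
```

Overlap is why 15 passing runs collapse to 10 covered problems: the models tend to pass the same easy ones.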

Full leaderboard →

Chapter 2

How it works

Same prompt, same simulator, same scoring weights, every model. No human judge, no LLM judge, no rubric. The same architecture submitted twice produces the same score, run by anyone, anywhere.
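One way to get that run-anywhere determinism is to derive every random seed from the submission itself. CHINI-bench's actual seeding scheme is not documented on this page, so treat the following as a sketch of the idea, with invented names throughout.

```python
import hashlib
import json
import random

def scenario_seed(canvas_state: dict, scenario: str) -> int:
    """Derive the RNG seed from the submission + scenario name, so the
    same architecture always replays the same stress trace anywhere.
    (Sketch only -- not CHINI-bench's actual seeding scheme.)"""
    blob = json.dumps(canvas_state, sort_keys=True) + scenario
    return int.from_bytes(hashlib.sha256(blob.encode()).digest()[:8], "big")

canvas = {"components": ["lb", "api", "queue"], "connections": [["lb", "api"]]}
a = random.Random(scenario_seed(canvas, "noisy-network")).random()
b = random.Random(scenario_seed(canvas, "noisy-network")).random()
assert a == b   # same submission, same scenario, same trace -- every run
```

With the seed pinned to the submission, "run by anyone, anywhere" falls out for free: there is no machine-local randomness left to disagree about.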

  1. A model reads the prompt + constraints.
     Same prompt for every model. No retries, no hand-tuning.

  2. It emits a CanvasState (architecture as JSON).
     Components, connections, behaviors. The same format Chinilla uses internally.

  3. The simulator runs every stress scenario.
     Baseline. 5x spike. Outage of a critical component. Noisy network.

  4. Each scenario is graded against numeric criteria.
     Stability score, drop rate, delivery rate, error count. Pass or fail. No vibes.

  5. A composite score lands the model on the leaderboard.
     Weighted across stability, delivery, cost, and constraint compliance.
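The last two steps can be sketched in a few lines. The weights, thresholds, and metric names below are assumptions for illustration; the real formula lives in the methodology notes.

```python
# Assumed weights for illustration only -- see the methodology for the real ones.
WEIGHTS = {"stability": 0.4, "delivery": 0.3, "cost": 0.2, "compliance": 0.1}

def composite(sub_scores: dict) -> float:
    """Weighted composite over the four scoring axes, on a 0-100 scale."""
    return sum(WEIGHTS[axis] * sub_scores[axis] for axis in WEIGHTS)

def grade_scenario(metrics: dict, criteria: dict) -> bool:
    """Mechanical pass/fail: every numeric criterion must hold. No vibes."""
    return all(metrics[name] >= threshold for name, threshold in criteria.items())

# Step 4: one scenario graded against numeric criteria (invented thresholds).
ok = grade_scenario(
    metrics={"stability_score": 0.92, "delivery_rate": 0.98},
    criteria={"stability_score": 0.90, "delivery_rate": 0.95},
)

# Step 5: composite score from the four axes.
score = composite({"stability": 80, "delivery": 90, "cost": 60, "compliance": 100})
# 0.4*80 + 0.3*90 + 0.2*60 + 0.1*100 = 81
```

Because both functions are pure arithmetic over the simulator's outputs, two runs of the same CanvasState cannot land on different leaderboard scores.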

Full methodology, scoring formula, and reproducibility notes →

Chapter 3

Why it matters for AI

Most popular AI benchmarks have at least one of three problems. They were memorized during pretraining. They are graded by another LLM, which has its own biases. Or they only measure what a model says, not what its output actually does when run.

CHINI-bench was designed to dodge all three:

Benchmark              Judge                           Contamination risk                         Tests output behavior?
MMLU / GPQA            String match (multiple choice)  High (public for years)                    No
HumanEval              Unit tests                      High (public, scraped)                     Yes (small)
SWE-bench              Repo test suites                Medium                                     Yes
Chatbot Arena          Human pairwise vote             N/A                                        No (taste only)
MT-Bench / AlpacaEval  LLM-as-judge                    Medium                                     No
CHINI-bench            Discrete-event simulator        Low (designs are novel + scored mechanically)  Yes (under stress)

Frontier models are getting deployed as system designers, on-call engineers, and architecture reviewers. The industry needs a public, reproducible signal for whether a model's design choices actually survive contact with traffic. Right now, that signal does not exist anywhere else.

Chapter 4

How to submit

One CLI. Your API key, your machine. No accounts, no queue, no upload of anything but the canvas the model produced.

Three steps
From clone to leaderboard in under a minute.
1 Get the CLI
git clone https://github.com/collapseindex/chini-bench-cli
2 Set a key
export OPENROUTER_API_KEY=...
Any provider. Key never leaves your machine.
3 Run
chini-bench
Interactive menu. No flags to remember.
Or one-shot, end-to-end
chini-bench run chini-001-url-shortener \
  --provider openrouter --model google/gemini-3.1-pro-preview \
  --as alex --x chinillaboard --linkedin alexk444
Providers: openai · anthropic · google · openrouter · ollama
Your API key never leaves your machine. Only the resulting CanvasState is uploaded.
Built by alex · @chinillaboard · DMs open for human-baseline submissions.