Specification · v0.3 · April 23, 2026

CHINI-bench Methodology

A simulator-graded benchmark for AI system design. This document specifies the protocol completely: input schema, scenario taxonomy, scoring math, default thresholds, abuse protections, and known limitations. Everything needed to reproduce a result independently.

1. Abstract

CHINI-bench measures whether large language models can design systems that survive specified failure modes. A model is given a natural-language brief plus a fixed set of stress scenarios, and emits a directed graph (CanvasState) describing the proposed architecture. The graph is executed by a deterministic discrete-event simulator under each scenario. Five sub-scores (stability, delivery, cost, constraints, design) are computed from simulator output and combined with problem-specific weights into a composite score in [0, 100]. No language model participates in grading.

As of v0.3 the bench contains 30 problems across 5 problem classes, totaling 112 scenarios. Four frontier models (Claude Sonnet 4.6, GPT-5.4, Grok 4.20, Gemini 3.1 Pro) cluster tightly at a 69-70/100 average. The best (Claude Sonnet 4.6) passes only 5/30 problems; the lowest (Grok 4.20) passes 2/30. Across all four models combined, only 10 of the 30 problems have ever been solved.

Scope note

v0.3 is a one-shot evaluation: the model sees the brief once and emits one CanvasState. It does not see simulator output, cannot inspect failures, and cannot revise. This measures design ability under a single pass, not agentic refinement. An agentic mode (multi-turn, simulator-in-the-loop) is planned for a future version and will be reported as a separate track so the two signals do not contaminate each other.

There is also a substantive claim embedded in this choice. If a model's one-shot design is structurally broken (cyclic topology, no terminal sink, missing retry on flaky edges), iteration tends to preserve those priors rather than correct them: verbal self-reflection frameworks like Reflexion improve task performance primarily when the feedback signal is informative and the underlying policy is close to correct [1], and SWE-bench results show that agent scaffolds multiply base-model capability rather than replace it [2]. v0.3 measures whether the base-model design priors are sound.

We acknowledge the open question this leaves on the table: are broken priors fixable with feedback? That is a policy-relevant question and v0.3 does not answer it. It is the explicit motivation for the agentic track, which will pair each one-shot score with a matched multi-turn score on the same problem so the delta is directly readable. Until that data exists, the honest framing is: v0.3 reports the floor a model brings to the task before any scaffolding, not a verdict on what scaffolding can recover.

2. Evaluation pipeline

Figure: the evaluation pipeline, diagrammed in Chinilla itself. Problem Input -> LLM Model -> Sanitizer -> Graph Simulator -> Metrics Checker -> Score Calculator -> Disk Writer.
  1. The model receives the problem brief, constraints, success criteria, and the CanvasState schema as a single user message. The system prompt is fixed (section 12).
  2. The model emits a JSON object conforming to the schema (section 3).
  3. The submission is sanitized: unknown component types and behavior modes fall back to their defaults, edges with missing endpoints are removed, and position fields are stripped (section 13).
  4. For each of the problem's scenarios, the simulator runs the graph end-to-end and returns metrics (section 7).
  5. Each scenario's metrics are checked against scenario-specific and problem-overall criteria. Failures are recorded but do not abort the run.
  6. The five sub-scores are computed and combined (section 9). The result, including the full canvas and per-scenario metrics, is written to disk.
  7. A submission passes iff every scenario passed AND no constraint notes were emitted. A passing submission is shown with a green badge on the per-problem and leaderboard pages.

The pipeline is implemented in src/lib/bench/scoring.ts as a pure function. Same input always yields the same output.
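The end-to-end flow can be sketched as a pure function. This is an illustrative Python sketch, not the real implementation (which lives in src/lib/bench/scoring.ts); the stage functions are injected, and all names here are hypothetical:

```python
from typing import Callable

def evaluate(canvas: dict, problem: dict, *,
             sanitize: Callable, simulate: Callable,
             check: Callable, score: Callable,
             constraint_notes: Callable) -> dict:
    """Sketch of the pipeline: sanitize -> simulate -> check -> score.

    Injecting the stages keeps the pipeline itself pure: the same
    inputs always yield the same output.
    """
    clean = sanitize(canvas)                                   # sanitizer (section 13)
    results = []
    for scenario in problem["scenarios"]:
        metrics = simulate(clean, scenario)                    # deterministic simulator (section 7)
        results.append({"id": scenario["id"],
                        "metrics": metrics,
                        "passed": check(metrics, scenario)})   # criteria check (section 8)
    composite = score(clean, problem, results)                 # composite scoring (section 9)
    # Pass iff every scenario passed AND no constraint notes were emitted.
    passed = all(r["passed"] for r in results) and not constraint_notes(clean, problem)
    return {"composite": composite, "passed": passed, "scenarios": results}
```

Because every stage is deterministic, re-running `evaluate` on a stored canvas reproduces the published numbers byte-for-byte.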

3. CanvasState input schema

Every submission is a single JSON object. Position and styling fields are optional and ignored by the simulator. The minimum viable submission has at least one component and at least one connection.

{
  "name": "string",                          // optional, free text
  "components": [
    {
      "id":          "string",               // unique within graph
      "type":        "person|step|storage|decision|trigger|tool|channel",
      "label":       "string",               // human-readable
      "description": "string",               // optional
      "behavior": {                          // optional
        "mode":         "passthrough|transform|filter|queue|split|delay|condition|retry|ratelimit|circuitbreaker|batch|replicate",
        "capacity":     0,                   // queue depth, rate cap, batch size
        "dropRate":     0.0,                 // 0-1, for queue overflow / filter
        "maxRetries":   0,                   // for retry mode
        "failureRate":  0.0,                 // 0-1, for circuitbreaker threshold
        "delayMs":      0,                   // for delay mode
        "splitRatio":   [0.5, 0.5]           // for split mode
      },
      "cost":    { "monthly": 0, "setup": 0 },     // optional, USD
      "metrics": { "throughput": "10k req/s",      // optional, free-text strings
                   "capacity":   "50k req/s",
                   "processingTime": "8ms" }
    }
  ],
  "connections": [
    {
      "id":         "string",                // unique within graph
      "from":       "componentId",
      "to":         "componentId",
      "label":      "string",                // optional
      "latency_ms": 0                        // optional
    }
  ]
}

Schema enforced by src/lib/bench/sanitize.ts. Unknown fields are silently stripped. Unknown enum values fall back to the default (step / passthrough).
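The sanitizer's behavior described above can be sketched in a few lines of Python (illustrative only; the real code is src/lib/bench/sanitize.ts, and the enum sets mirror sections 4-5):

```python
COMPONENT_TYPES = {"person", "step", "storage", "decision", "trigger", "tool", "channel"}
BEHAVIOR_MODES = {"passthrough", "transform", "filter", "queue", "split", "delay",
                  "condition", "retry", "ratelimit", "circuitbreaker", "batch", "replicate"}

def sanitize(canvas: dict) -> dict:
    """Sketch: unknown enums fall back to defaults, dangling edges are removed."""
    components = []
    for c in canvas.get("components", []):
        c = dict(c)                                   # never mutate the submission
        if c.get("type") not in COMPONENT_TYPES:
            c["type"] = "step"                        # unknown type -> default
        behavior = c.get("behavior")
        if behavior and behavior.get("mode") not in BEHAVIOR_MODES:
            c["behavior"] = {**behavior, "mode": "passthrough"}  # unknown mode -> default
        c.pop("position", None)                       # styling fields are stripped
        components.append(c)
    ids = {c["id"] for c in components}
    # Edges with a missing endpoint are dropped entirely.
    connections = [e for e in canvas.get("connections", [])
                   if e.get("from") in ids and e.get("to") in ids]
    return {"components": components, "connections": connections}
```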

4. Component types (7)

person: A human actor in the system. Customers, baristas, dispatchers, attackers. Usually a source or sink.
step: A unit of work. Process, validate, transform, route. The default workhorse.
storage: Persistent state. Databases, caches, file stores, ledgers. Has read/write affinity.
decision: A branch point. Routes packets down different connections based on a condition.
trigger: An event source. Cron jobs, webhooks, timers, alerts. Generates packets without an upstream input.
tool: An external service or instrument. Payment APIs, ML models, espresso machines, sensors.
channel: A transport. Queues, message buses, network links, hallways. Carries packets between work components.

5. Behavior modes (12)

Behaviors attach to a component and modify how it processes packets. Most have parameters; sensible defaults apply if a parameter is omitted.

passthrough: Default. Forwards every packet to all outgoing edges with no logic.
transform: Applies a stateless mutation. Modeled as a small processing delay.
filter: Drops a fraction of incoming packets. Use for content moderation, spam, deduplication.
queue: Buffers packets up to `capacity`. Drops on overflow at `dropRate`. Smooths bursts.
split: Distributes packets across outgoing edges (round-robin or by `splitRatio`).
delay: Holds each packet for `delayMs`. Models human latency, sync calls, batching windows.
condition: Routes by predicate. Pairs with `decision` components.
retry: Retries failed downstream calls up to `maxRetries`. Hedges against transient failure.
ratelimit: Caps throughput at `capacity` packets/sec. Excess packets dropped or queued depending on downstream.
circuitbreaker: Fails fast when downstream `failureRate` exceeds threshold. Prevents cascade failure.
batch: Accumulates `capacity` packets, releases as one bundle. Trades latency for throughput.
replicate: Sends a copy of each packet to all outgoing edges. Use for fan-out and warm secondaries.
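For intuition, here is a sketch of the `queue` mode's overflow semantics as described above, with `capacity` and `dropRate` as parameters. This is an illustration under simplifying assumptions (a single batch of arrivals), not the simulator's exact implementation, which models arrival timing:

```python
def offer_packets(packets: int, capacity: int, drop_rate: float) -> tuple:
    """Buffer up to `capacity` packets; overflow is dropped at `drop_rate`.

    Returns (buffered, dropped). Overflow that is not dropped is assumed
    to back-pressure upstream in this sketch.
    """
    buffered = min(packets, capacity)
    overflow = packets - buffered
    dropped = round(overflow * drop_rate)
    return buffered, dropped
```

A burst of 10 packets into a queue of capacity 8 with `dropRate: 1.0` buffers 8 and drops 2; lowering `dropRate` converts drops into back-pressure.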

6. Scenario kinds (5)

A problem comprises a baseline plus 1-3 stress scenarios. The simulator runs each scenario independently against the same submitted canvas. A submission passes the problem only if it passes every scenario.

baseline: Normal traffic at the design load. The "happy path" stress test.
spike: Traffic multiplier (typically 2x-10x baseline). Tests headroom, queues, rate limits.
outage: A named component is disabled mid-run. Tests fallback, redundancy, retry, circuit-breaker.
cascade: Ambient failure rate elevated (every step has a chance to fail). Tests resilience under partial degradation.
adversarial: Two-pass: clean traffic plus hostile traffic injected at the same trigger. Scored on attack-block-rate AND clean-delivery-rate (must satisfy both).

7. Per-scenario metrics

The simulator emits the following measurements per scenario. Adversarial scenarios additionally emit attackBlockRate and cleanDeliveryRate (section 11).

8. Default success thresholds

Each scenario's metrics are checked against criteria computed by merging three layers (later wins):

  1. Default thresholds (below)
  2. Problem-level overallCriteria (per-problem, applies to all scenarios)
  3. Scenario-level scenarioCriteria[id] (per-scenario override)

Default thresholds:

minHealthScore  = 70
maxDropRate     = 0.05   // 5%
minDeliveryRate = 0.90   // 90%
maxErrors       = 5
maxDurationMs   = unset  // no timeout unless set per problem

Adversarial scenarios additionally enforce minAttackBlockRate=0.7 and minCleanDeliveryRate=0.8 by default.
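The three-layer "later wins" merge is plain dictionary layering. A minimal sketch (field names match the spec; the function name is illustrative):

```python
DEFAULTS = {
    "minHealthScore": 70,
    "maxDropRate": 0.05,
    "minDeliveryRate": 0.90,
    "maxErrors": 5,
    # maxDurationMs intentionally absent: no timeout unless a problem sets one.
}

def effective_criteria(problem: dict, scenario_id: str) -> dict:
    """Merge defaults <- overallCriteria <- scenarioCriteria[id]; later wins."""
    return {
        **DEFAULTS,
        **problem.get("overallCriteria", {}),
        **problem.get("scenarioCriteria", {}).get(scenario_id, {}),
    }
```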

9. Composite scoring

Each problem defines five sub-scores in [0, 100] and weights them into a single composite, also in [0, 100]:

composite = (stability·ws + delivery·wd + cost·wc + constraints·wk + design·wg)
            / (ws + wd + wc + wk + wg)

Weights are problem-specific and sum to 1.0. A typical PC1 problem weights stability at 0.40, delivery at 0.25, design at 0.15, cost at 0.10, constraints at 0.10. PC5 problems shift weight onto delivery and design, away from cost.
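The formula transcribes directly; dividing by the weight sum also makes it robust to weights that do not sum to exactly 1.0:

```python
def composite(scores: dict, weights: dict) -> float:
    """Weighted average of the five sub-scores, each in [0, 100]."""
    keys = ("stability", "delivery", "cost", "constraints", "design")
    total_weight = sum(weights[k] for k in keys)
    return sum(scores[k] * weights[k] for k in keys) / total_weight
```

With the typical PC1 weights above, a submission scoring (stability 80, delivery 60, cost 100, constraints 100, design 50) composites to 74.5.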

10. The design subscore (D1-D4)

Stability and delivery measure outcomes. Design measures structure: did the model include the right primitives for the failure modes the problem actually exercises? Each check is worth 25 points and is conditional: if the corresponding failure mode is not in the scenario set, the check is awarded full credit automatically.

D1 (Queue presence): Triggered when any scenario has trafficMultiplier ≥ 2, arrivalInterval > 0, or kind = spike. Passes when a component has behavior.mode = queue.
D2 (Resilience): Triggered when any scenario kind is outage or cascade, or ambientFailRate > 0. Passes when a component has circuitbreaker or retry.
D3 (Rate discipline): Triggered when any scenario has trafficMultiplier ≥ 5, or kind = adversarial. Passes when a component has ratelimit or batch.
D4 (Terminal sink): Always triggered. Passes when at least one component has zero outgoing edges.

D4 is the most-failed check in the v0.3 sweep. Several model submissions returned fully-cyclic graphs where every component had outgoing edges, meaning no packet could ever exit. The simulator runs them anyway and they tank delivery; D4 isolates the root cause structurally.
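D4 reduces to a one-line graph property. A sketch over the CanvasState shape from section 3 (the function name is illustrative):

```python
def has_terminal_sink(canvas: dict) -> bool:
    """D4: at least one component with zero outgoing edges.

    A fully cyclic graph fails this check: every component feeds
    another, so no packet can ever exit the system.
    """
    sources = {e["from"] for e in canvas.get("connections", [])}
    return any(c["id"] not in sources for c in canvas.get("components", []))
```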

11. Adversarial scoring

A pure-throughput optimizer would happily forward attack traffic to its target and score perfectly on deliveryRate. To prevent this, adversarial scenarios run two passes through the simulator:

  1. Pass A (clean): seedPackets only, no attack volume, no inflated failure rate. Produces cleanDelivered.
  2. Pass B (under attack): seedPackets + attackPackets injected at the same trigger, ambient failure rate raised to model hostile inputs flaking downstream. Produces totalDeliveredUnderAttack.

Two derived metrics:

slipped = max(0, totalDeliveredUnderAttack - cleanDelivered)
attackBlockRate = clamp(1 - slipped / attackPackets, 0, 1)
cleanDeliveryRate = clamp(cleanDelivered / cleanSeeds, 0, 1)

A defense that drops everything scores 1.0 on attack-block but ~0 on clean-delivery (and fails). A pure throughput optimizer scores ~1.0 on clean-delivery but ~0 on attack-block (and fails). Both must clear their thresholds (default 0.7 and 0.8). This is the only place the bench has explicit dual-objective scoring; everywhere else, a single composite suffices.
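The two derived metrics transcribe directly from the formulas above; the simulator supplies the four raw counts (function and parameter names here are illustrative):

```python
def adversarial_metrics(clean_delivered: int, total_delivered_under_attack: int,
                        attack_packets: int, clean_seeds: int) -> dict:
    """attackBlockRate and cleanDeliveryRate from the two-pass counts."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))

    # Attack packets that "slipped" past the defense show up as deliveries
    # beyond what the clean pass alone produced.
    slipped = max(0, total_delivered_under_attack - clean_delivered)
    return {
        "attackBlockRate": clamp(1 - slipped / attack_packets),
        "cleanDeliveryRate": clamp(clean_delivered / clean_seeds),
    }
```

A pure throughput optimizer that forwards all 100 attack packets alongside 100 clean deliveries scores attackBlockRate 0.0 and cleanDeliveryRate 1.0, failing the 0.7 block threshold exactly as intended.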

12. Default harness

The CLI ships a single fixed harness used for every published model run:

Harness verification. Every chini-bench run command computes a SHA-256 of the system prompt and sends the first 12 hex characters along with the submission as harness=chini-bench-cli:<hash>. The server stores it in the result file. The leaderboard then renders one of three states:

Canonical harness hashes for this version
chini-bench-cli:06d0ffb42f19
chini-bench-server:77a8db71de89
The CLI runner and the server-side runner ship slightly different system prompts (the server adds a JSON schema constraint via the API). Both are unmodified, both are official, both produce default-badged results. Anyone can verify the CLI hash by running python -c "from chini_bench.prompt import system_prompt_hash; print(system_prompt_hash())" against an unmodified install of chini-bench-cli >= 0.4.0.

Beating the default harness with a better prompt is itself a useful research result, as long as the prompt is published. We treat this as transparent harness-engineering, not cheating, and the custom badge keeps the two categories cleanly separated on the leaderboard.

13. Sanitization & abuse protection

The submission endpoint POST /api/bench/submit applies the following before scoring:

Implemented in src/pages/api/bench/submit.ts and src/lib/bench/sanitize.ts.

14. Reproducibility

Determinism: the simulator (src/lib/flowRuntime.ts) and the scorer (src/lib/bench/scoring.ts) are pure functions. Same canvas + same problem definition always yields the same composite score and per-scenario metrics, byte-for-byte. The 120 result JSONs in src/content/bench/results/ can be regenerated from the published canvases by re-running the scorer.

Independence: every problem definition, scenario, and weight lives in src/content/bench/problems/ as plain JSON. Every result JSON includes the full submitted canvas (a) for inspection and (b) so any reader can re-score it locally and verify the published number.

The repository is currently private. Researchers who want to verify a result independently can request the source bundle by emailing squeak@chinilla.com. A public mirror under a research-use license is planned alongside the v1 leaderboard.

15. Limitations

We document these limitations because a benchmark that hides its weaknesses is not a benchmark; it is marketing.

16. Ethics & broader impact

Intended use. CHINI-bench is intended to surface concrete weaknesses in current frontier models for system-design tasks, particularly for safety-relevant deployments (PC4 civic systems, PC5 adversarial). Improvements here translate, in principle, to safer agentic systems in production.

Misuse potential. Public benchmark scores can be optimized against rather than learned from. We mitigate via (a) a deterministic non-LLM judge that cannot be flattered, (b) cross-class composite scoring that punishes overfitting to PC1, and (c) versioning so a model that overfits v0.3 will not transfer to v1.

Data & privacy. The benchmark contains no personal data. Submissions store only the canvas, the submitter handle, and a SHA-256 hash of the source IP (for rate-limiting and abuse triage; never reversed, never displayed).

Carbon & cost. A full v0.3 sweep is 120 LLM calls (~$4 in API spend across the 4 frontier models, ~1 hour wall-clock including retries on rate limits and malformed JSON). The simulator runs locally and uses negligible energy.

Appendix A: problem set (v0.3)

Problem classes
  • PC1: SWE backend systems. The familiar interview corpus.
  • PC2: Operations and physical workflows. Cafes, kitchens, ER triage.
  • PC3: Personal systems and habits. Cravings as packets, willpower as backpressure.
  • PC4: Civic and public-service systems. Polling, vaccine rollout, shelter.
  • PC5: Adversarial. Attacker in the graph, defenses not just throughput.
ID · Class · Scenarios
chini-001-url-shortener · PC1 · 3 scenarios
  URL Shortener (TinyURL). Map long URLs to short tokens. Survive spike traffic on the redirect path.
chini-002-checkout · PC1 · 4 scenarios
  E-commerce Checkout with Idempotent Payments. Process checkouts without ever charging a customer twice. Survive a downstream payment-API outage.
chini-003-twitter-timeline · PC1 · 4 scenarios
  Social Timeline (Twitter-style fanout). Generate a personalized timeline for millions of users. Don't melt when a celebrity posts.
chini-004-uber-dispatch · PC1 · 4 scenarios
  Ride Dispatch (Uber-style matching). Match riders to drivers in real time. Stay alive when a region's matcher dies.
chini-005-chat-fanout · PC1 · 4 scenarios
  Group Chat Fanout (WhatsApp-style). Deliver messages to large group chats in order. No drops, no duplicates.
chini-006-rate-limiter · PC1 · 3 scenarios
  Distributed Rate Limiter. Allow bursty legitimate traffic. Reject abuse without blocking the world.
chini-007-payment-webhook · PC1 · 3 scenarios
  Payment Webhook Receiver. Accept inbound webhooks. Never lose one. Never double-process one.
chini-008-search-autocomplete · PC1 · 3 scenarios
  Search Autocomplete. Suggest as you type. Stay snappy when one shard goes dark.
chini-009-video-upload · PC1 · 4 scenarios
  Video Upload Pipeline. Accept large uploads. Transcode in the background. Survive a worker meltdown.
chini-010-notification-fanout · PC1 · 3 scenarios
  Notification Fanout (Push + Email + SMS). One event, three channels. Slow SMS provider must not block push.
chini-011-cafe-morning-rush · PC2 · 4 scenarios
  Cafe Morning Rush. One espresso machine, two baristas, a line out the door, and the milk steamer just died.
chini-012-energy-drink-habit · PC3 · 4 scenarios
  Quitting the Energy Drink Habit. A craving is a packet. Willpower is backpressure. Design the system that keeps you off the 4pm Red Bull.
chini-013-pottery-studio · PC2 · 4 scenarios
  Pottery Studio Firing Schedule. Two kilns, twenty members, four firing stages, one electrical limit. Don't crack the work.
chini-014-restaurant-friday-night · PC2 · 4 scenarios
  Restaurant Friday Night Service. Eight tables turning every 90 minutes. Three stations on the line. The walk-in just got delivered short on prep.
chini-015-er-triage · PC2 · 4 scenarios
  Emergency Department Triage. Five severity levels, finite beds, one CT scanner. The wrong queue means someone dies.
chini-016-inbox-zero · PC3 · 4 scenarios
  Inbox Zero Maintenance. 300 emails a day, three contexts, two devices, one human attention budget.
chini-017-couch-to-5k · PC3 · 4 scenarios
  Couch to 5K. Three runs a week, nine weeks, one knee that hurts on Wednesdays. Get to the 5K without quitting.
chini-018-polling-station · PC4 · 4 scenarios
  Election Day Polling Station. One precinct, eight booths, three machines, a thousand voters, and the printer for ballot paper just jammed.
chini-019-vaccine-rollout · PC4 · 4 scenarios
  County Vaccine Rollout. Cold chain from a -70C freezer to a 95-year-old's deltoid. Don't waste a single dose.
chini-020-disaster-shelter · PC4 · 4 scenarios
  Disaster Shelter Intake. 500 evacuees in 12 hours, finite cots, dietary restrictions, medical needs, families that must not be split.
chini-021-ddos-shield · PC5 · 4 scenarios
  DDoS Mitigation Shield. 100M packets per second of garbage. Your customer's checkout still has to clear in 200ms.
chini-022-phishing-funnel · PC5 · 4 scenarios
  Phishing Defense Funnel. 10,000 emails an hour. One of them is the spear-phish that gets the CFO's credentials. Find it.
chini-023-airline-gate-turnaround · PC2 · 4 scenarios
  Airline Gate Turnaround. 25 minutes to deplane, refuel, clean, cater, and board 180 passengers. Anything later costs the airline a delay slot.
chini-024-meal-prep-sunday · PC3 · 4 scenarios
  Meal Prep Sunday. Cook once, eat for a week. Without the Wednesday-night takeout collapse.
chini-025-job-search-pipeline · PC3 · 4 scenarios
  Job Search Pipeline. 100 applications in, 3 offers out. Without ghosting yourself in the middle.
chini-026-food-bank-distribution · PC4 · 4 scenarios
  Food Bank Distribution. Fresh produce in, hungry families out, nothing rots in the warehouse, nobody waits 4 hours.
chini-027-911-dispatch · PC4 · 4 scenarios
  911 Dispatch. Cardiac arrest at 9:01am, fender bender at 9:02am, fire at 9:03am. Three calls, two ambulances, one decision per second.
chini-028-credential-stuffing · PC5 · 3 scenarios
  Credential Stuffing Defense. 100k stolen credentials replayed against your login. Block the attack without locking out 50k real users.
chini-029-comment-spam-flood · PC5 · 3 scenarios
  Comment Spam Flood. An LLM-driven spammer floods your forum with 50k near-human comments. Block them without false-flagging real users.
chini-030-api-scraper · PC5 · 3 scenarios
  API Scraper Defense. A distributed scraper drains your public API at 10x normal volume. Block it without blinding real apps.

Appendix B: version history

Version Date Changes
v0.3 2026-04-23 Added 8 new problems (n=22 -> 30). Added design as a 5th subscore with 4 conditional structural checks (D1-D4). Added adversarial as a 5th scenario kind with two-pass scoring (attackBlockRate, cleanDeliveryRate). Wired OpenRouter; ran the first 4-model frontier sweep (Claude Sonnet 4.6, GPT-5.4, Grok 4.20, Gemini 3.1 Pro). Bumped max_tokens from 3000 to 12000 to accommodate thinking models.
v0.2 2026-04-22 22 problems across 5 classes. Single-model baseline (Grok). Public CLI + dashboard launch.
v0.1 2026-04-22 Internal pilot. 12 problems, PC1 + PC2 only. Established the simulator-as-judge protocol.

References

  1. Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366.
  2. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770.

Citation

If you reference CHINI-bench in academic or industry work, please cite it as:

@misc{chinibench2026,
  title  = {{CHINI-bench}: A simulator-graded benchmark for {AI} system design},
  author = {Kwon, Alex},
  year   = {2026},
  note   = {Version 0.3. https://chinilla.com/bench},
  url    = {https://chinilla.com/bench}
}

Plain text:

Kwon, A. (2026). CHINI-bench: A simulator-graded benchmark for AI system design (Version 0.3) [Benchmark]. https://chinilla.com/bench