CHINI-bench Methodology
A simulator-graded benchmark for AI system design. This document specifies the protocol completely: input schema, scenario taxonomy, scoring math, default thresholds, abuse protections, and known limitations. Everything needed to reproduce a result independently.
1. Abstract
CHINI-bench measures whether large language models can design systems that survive specified failure modes. A model is given a natural-language brief plus a fixed set of stress scenarios, and emits a directed graph (CanvasState) describing the proposed architecture. The graph is executed by a deterministic discrete-event simulator under each scenario. Five sub-scores (stability, delivery, cost, constraints, design) are computed from simulator output and combined with problem-specific weights into a composite score in [0, 100]. No language model participates in grading.
As of v0.3 the bench contains 30 problems across 5 problem classes, totaling 112 scenarios. Four frontier models (Claude Sonnet 4.6, GPT-5.4, Grok 4.20, Gemini 3.1 Pro) cluster tightly at 69-70/100 average. The best (Claude Sonnet 4.6) passes only 5/30 problems; the lowest (Grok 4.20) passes 2/30. Across all four models combined, 10/30 problems have ever been solved by anyone.
v0.3 is a one-shot evaluation: the model sees the brief once and emits one CanvasState. It does not see simulator output, cannot inspect failures, and cannot revise. This measures design ability under a single pass, not agentic refinement. An agentic mode (multi-turn, simulator-in-the-loop) is planned for a future version and will be reported as a separate track so the two signals do not contaminate each other.
There is also a substantive claim embedded in this choice. If a model's one-shot design is structurally broken (cyclic topology, no terminal sink, missing retry on flaky edges), iteration tends to preserve those priors rather than correct them: verbal self-reflection frameworks like Reflexion improve task performance primarily when the feedback signal is informative and the underlying policy is close to correct [1], and SWE-bench results show that agent scaffolds multiply base-model capability rather than replace it [2]. v0.3 measures whether the base-model design priors are sound.
We acknowledge the open question this leaves on the table: are broken priors fixable with feedback? That is a policy-relevant question and v0.3 does not answer it. It is the explicit motivation for the agentic track, which will pair each one-shot score with a matched multi-turn score on the same problem so the delta is directly readable. Until that data exists, the honest framing is: v0.3 reports the floor a model brings to the task before any scaffolding, not a verdict on what scaffolding can recover.
2. Evaluation pipeline
- 2.1 The model receives the problem brief, constraints, success criteria, and the CanvasState schema as a single user message. The system prompt is fixed (section 12).
- 2.2 The model emits a JSON object conforming to the schema (section 3).
- 2.3 The submission is sanitized: unknown component types and behavior modes fall back to defaults, edges with missing endpoints are removed, and position fields are stripped (section 13).
- 2.4 For each of the problem's scenarios, the simulator runs the graph end-to-end and returns metrics (section 7).
- 2.5 Each scenario's metrics are checked against scenario-specific and problem-overall criteria. Failures are recorded but do not abort the run.
- 2.6 The five sub-scores are computed and combined (section 9). The result, including the full canvas and per-scenario metrics, is written to disk.
- 2.7 A submission passes iff every scenario passed AND no constraint notes were emitted. A passing submission is shown with a green badge on the per-problem and leaderboard pages.
The pipeline is implemented in src/lib/bench/scoring.ts as a pure function. Same input always yields the same output.
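The pass rule in step 2.7 is a strict conjunction, which a minimal sketch makes concrete. Names here are illustrative, not the actual `scoring.ts` API:

```typescript
interface ScenarioResult {
  id: string;
  passed: boolean; // all criteria for this scenario were met
}

// A submission passes iff every scenario passed AND no constraint notes exist.
function submissionPasses(
  scenarios: ScenarioResult[],
  constraintNotes: string[],
): boolean {
  return scenarios.every((s) => s.passed) && constraintNotes.length === 0;
}
```

Note that a single constraint note fails the submission even when every scenario passes.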
3. CanvasState input schema
Every submission is a single JSON object. Position and styling fields are optional and ignored by the simulator. The minimum viable submission has at least one component and at least one connection.
{
  "name": "string",                          // optional, free text
  "components": [
    {
      "id": "string",                        // unique within graph
      "type": "person|step|storage|decision|trigger|tool|channel",
      "label": "string",                     // human-readable
      "description": "string",               // optional
      "behavior": {                          // optional
        "mode": "passthrough|transform|filter|queue|split|delay|condition|retry|ratelimit|circuitbreaker|batch|replicate",
        "capacity": 0,                       // queue depth, rate cap, batch size
        "dropRate": 0.0,                     // 0-1, for queue overflow / filter
        "maxRetries": 0,                     // for retry mode
        "failureRate": 0.0,                  // 0-1, for circuitbreaker threshold
        "delayMs": 0,                        // for delay mode
        "splitRatio": [0.5, 0.5]             // for split mode
      },
      "cost": { "monthly": 0, "setup": 0 },  // optional, USD
      "metrics": {                           // optional, free-text strings
        "throughput": "10k req/s",
        "capacity": "50k req/s",
        "processingTime": "8ms"
      }
    }
  ],
  "connections": [
    {
      "id": "string",                        // unique within graph
      "from": "componentId",
      "to": "componentId",
      "label": "string",                     // optional
      "latency_ms": 0                        // optional
    }
  ]
}
The schema is enforced by src/lib/bench/sanitize.ts. Unknown fields are silently stripped. Unknown enum values fall back to the default (`step` / `passthrough`).
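For concreteness, here is a minimal valid submission under this schema, with illustrative values:

```typescript
// Smallest graph the bench accepts: one source, one sink, one edge.
// Position and styling fields are omitted; the simulator ignores them anyway.
const minimalCanvas = {
  name: "minimal example",
  components: [
    { id: "in", type: "trigger", label: "Request source" },
    { id: "done", type: "step", label: "Handle request" }, // no outgoing edges: a terminal sink
  ],
  connections: [{ id: "c1", from: "in", to: "done" }],
};
```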
4. Component types (7)
| Type | Description |
|---|---|
| person | A human actor in the system. Customers, baristas, dispatchers, attackers. Usually a source or sink. |
| step | A unit of work. Process, validate, transform, route. The default workhorse. |
| storage | Persistent state. Databases, caches, file stores, ledgers. Has read/write affinity. |
| decision | A branch point. Routes packets down different connections based on a condition. |
| trigger | An event source. Cron jobs, webhooks, timers, alerts. Generates packets without an upstream input. |
| tool | An external service or instrument. Payment APIs, ML models, espresso machines, sensors. |
| channel | A transport. Queues, message buses, network links, hallways. Carries packets between work components. |
5. Behavior modes (12)
Behaviors attach to a component and modify how it processes packets. Most have parameters; sensible defaults apply if a parameter is omitted.
| Mode | Description |
|---|---|
| passthrough | Default. Forwards every packet to all outgoing edges with no logic. |
| transform | Applies a stateless mutation. Modeled as a small processing delay. |
| filter | Drops a fraction of incoming packets. Use for content moderation, spam, deduplication. |
| queue | Buffers packets up to `capacity`. Drops on overflow at `dropRate`. Smooths bursts. |
| split | Distributes packets across outgoing edges (round-robin or by `splitRatio`). |
| delay | Holds each packet for `delayMs`. Models human latency, sync calls, batching windows. |
| condition | Routes by predicate. Pairs with `decision` components. |
| retry | Retries failed downstream calls up to `maxRetries`. Hedges against transient failure. |
| ratelimit | Caps throughput at `capacity` packets/sec. Excess packets dropped or queued depending on downstream. |
| circuitbreaker | Fails fast when downstream `failureRate` exceeds threshold. Prevents cascade failure. |
| batch | Accumulates `capacity` packets, releases as one bundle. Trades latency for throughput. |
| replicate | Sends a copy of each packet to all outgoing edges. Use for fan-out and warm secondaries. |
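As a sketch of one of these modes, queue overflow under the semantics described above might look like the following. This is an assumption about the engine's behavior, not its actual code (which lives in src/lib/flowRuntime.ts); in particular, the real engine may apply `dropRate` probabilistically, whereas this version drops deterministically on overflow:

```typescript
// Queue mode: buffer packets up to `capacity`, reject on overflow.
function enqueue<T>(
  buffer: T[],
  packet: T,
  capacity: number,
): { buffer: T[]; dropped: boolean } {
  if (buffer.length >= capacity) {
    return { buffer, dropped: true }; // overflow: packet is dropped
  }
  return { buffer: [...buffer, packet], dropped: false };
}
```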
6. Scenario kinds (5)
A problem comprises a baseline plus 1-3 stress scenarios. The simulator runs each scenario independently against the same submitted canvas. A submission passes the problem only if it passes every scenario.
| Kind | Description |
|---|---|
| baseline | Normal traffic at the design load. The "happy path" stress test. |
| spike | Traffic multiplier (typically 2x-10x baseline). Tests headroom, queues, rate limits. |
| outage | A named component is disabled mid-run. Tests fallback, redundancy, retry, circuit-breaker. |
| cascade | Ambient failure rate elevated (every step has a chance to fail). Tests resilience under partial degradation. |
| adversarial | Two-pass: clean traffic + hostile traffic injected at the same trigger. Scored on attack-block-rate AND clean-delivery-rate (must satisfy both). |
7. Per-scenario metrics
The simulator emits the following measurements per scenario. Adversarial scenarios additionally emit attackBlockRate and cleanDeliveryRate (section 11).
- `healthScore` (0-100). Internal aggregate of queue pressure, drop count, and throughput. Computed by `aggregateFlowStats`.
- `delivered`. Raw count of packets that reached a primary exit (a sink component).
- `dropped`. Raw count of packets dropped along the way.
- `deliveryRate`. delivered / max(injected, 1), clamped to [0, 1].
- `dropRate`. dropped / max(injected + delivered, 1), clamped to [0, 1].
- `durationMs`. Simulated wall-clock time for the scenario to complete.
- `errorCount`. Distinct error events emitted by the engine (cycles, missing endpoints, runtime exceptions).
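The two rate definitions translate directly to code. A sketch with the same clamping as the definitions above (function names are illustrative):

```typescript
const clamp01 = (x: number) => Math.min(1, Math.max(0, x));

// deliveryRate = delivered / max(injected, 1), clamped to [0, 1]
function deliveryRate(delivered: number, injected: number): number {
  return clamp01(delivered / Math.max(injected, 1));
}

// dropRate = dropped / max(injected + delivered, 1), clamped to [0, 1]
function dropRate(dropped: number, injected: number, delivered: number): number {
  return clamp01(dropped / Math.max(injected + delivered, 1));
}
```

The `max(..., 1)` denominators keep degenerate scenarios (zero injected packets) from dividing by zero.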
8. Default success thresholds
Each scenario's metrics are checked against criteria computed by merging three layers (later wins):
- Default thresholds (below)
- Problem-level `overallCriteria` (per-problem, applies to all scenarios)
- Scenario-level `scenarioCriteria[id]` (per-scenario override)

Adversarial scenarios additionally enforce `minAttackBlockRate=0.7` and `minCleanDeliveryRate=0.8` by default.
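The "later wins" merge is what a plain object spread gives you. A sketch (type and function names are illustrative, not the scorer's actual API):

```typescript
type Criteria = Record<string, number>;

// Precedence: defaults < problem-level overallCriteria < scenarioCriteria[id].
// Later spreads overwrite earlier keys, which is exactly "later wins".
function effectiveCriteria(
  defaults: Criteria,
  overall: Criteria = {},
  scenario: Criteria = {},
): Criteria {
  return { ...defaults, ...overall, ...scenario };
}
```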
9. Composite scoring
Each problem defines five sub-scores in [0, 100] and weights them into a single composite, also in [0, 100]:
- stability. Mean `healthScore` across all scenarios.
- delivery. Mean `deliveryRate` x 100 across all scenarios.
- cost. `round(100 x min(1, budget / max(1, totalMonthly)))`. 100 if no budget is set.
- constraints. Starts at 100. Subtract 10 per excess component (cap 40). Subtract 25 per missing required behavior. Subtract up to 40 for budget overshoot.
- design. Sum of 4 conditional structural checks worth 25 each (section 10).
Weights are problem-specific and sum to 1.0. A typical PC1 problem weights stability at 0.40, delivery at 0.25, design at 0.15, cost at 0.10, constraints at 0.10. PC5 problems shift weight onto delivery and design, away from cost.
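The cost subscore and the weighted combination above can be sketched as follows (a hedged illustration of the formulas in this section, not the scorer's actual code):

```typescript
interface SubScores {
  stability: number;
  delivery: number;
  cost: number;
  constraints: number;
  design: number;
}

// Cost subscore: 100 when no budget is set, otherwise proportional
// to budget / totalMonthly, capped at 100 and rounded.
function costScore(budget: number | undefined, totalMonthly: number): number {
  if (budget === undefined) return 100;
  return Math.round(100 * Math.min(1, budget / Math.max(1, totalMonthly)));
}

// Weighted composite; weights are problem-specific and sum to 1.0.
function composite(scores: SubScores, weights: SubScores): number {
  return (Object.keys(weights) as (keyof SubScores)[]).reduce(
    (sum, k) => sum + scores[k] * weights[k],
    0,
  );
}
```

With the typical PC1 weights, a submission scoring 100 on every sub-score composites to exactly 100; going 2x over budget halves the cost subscore.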
10. The design subscore (D1-D4)
Stability and delivery measure outcomes. Design measures structure: did the model include the right primitives for the failure modes the problem actually exercises? Each check is worth 25 points and is conditional, awarded full credit if the corresponding failure mode is not in the scenario set.
| ID | Check | Triggered when | Passes when |
|---|---|---|---|
| D1 | Queue presence | Any scenario has trafficMultiplier ≥ 2, or arrivalInterval > 0, or kind=spike | A component has behavior.mode = queue |
| D2 | Resilience | Any scenario kind is outage or cascade, or ambientFailRate > 0 | A component has circuitbreaker or retry |
| D3 | Rate discipline | Any scenario has trafficMultiplier ≥ 5, or kind=adversarial | A component has ratelimit or batch |
| D4 | Terminal sink | Always | At least one component has zero outgoing edges |
D4 is the most-failed check in the v0.3 sweep. Several model submissions returned fully-cyclic graphs where every component had outgoing edges, meaning no packet could ever exit. The simulator runs them anyway and they tank delivery; D4 isolates the root cause structurally.
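D4 reduces to a one-pass set check: a terminal sink exists iff some component never appears as the source of an edge. A sketch (illustrative names):

```typescript
interface Edge {
  from: string;
  to: string;
}

// D4: the graph has a terminal sink iff some component has zero outgoing edges.
function hasTerminalSink(componentIds: string[], edges: Edge[]): boolean {
  const withOutgoing = new Set(edges.map((e) => e.from));
  return componentIds.some((id) => !withOutgoing.has(id));
}
```

A two-node cycle (a -> b, b -> a) fails this check: every component has an outgoing edge, so no packet can ever exit.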
11. Adversarial scoring
A pure-throughput optimizer would happily forward attack traffic to its target and score perfectly on deliveryRate. To prevent this, adversarial scenarios run two passes through the simulator:
- Pass A (clean): seedPackets only, no attack volume, no inflated failure rate. Produces `cleanDelivered`.
- Pass B (under attack): seedPackets + attackPackets injected at the same trigger, ambient failure rate raised to model hostile inputs flaking downstream. Produces `totalDeliveredUnderAttack`.

Two metrics are derived from the passes: `attackBlockRate` (how much hostile traffic was stopped) and `cleanDeliveryRate` (how much legitimate traffic still got through under attack).
A defense that drops everything scores 1.0 on attack-block but ~0 on clean-delivery (and fails). A pure throughput optimizer scores ~1.0 on clean-delivery but ~0 on attack-block (and fails). Both must clear their thresholds (default 0.7 and 0.8). This is the only place the bench has explicit dual-objective scoring; everywhere else, a single composite suffices.
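The dual-objective gate is a simple conjunction over the two thresholds. A sketch (illustrative names; thresholds are the documented defaults):

```typescript
// Both thresholds must clear; optimizing either metric alone fails.
function adversarialPasses(
  attackBlockRate: number,
  cleanDeliveryRate: number,
  minBlock = 0.7, // default minAttackBlockRate
  minClean = 0.8, // default minCleanDeliveryRate
): boolean {
  return attackBlockRate >= minBlock && cleanDeliveryRate >= minClean;
}
```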
12. Default harness
The CLI ships a single fixed harness used for every published model run:
- System prompt: names the CanvasState schema, lists the 7 component types and 12 behavior modes, and instructs the model to emit one valid JSON object and nothing else. Frozen; lives in `chini_bench/prompt.py`.
- User prompt: the problem brief, constraints, success criteria, and scenario list, exactly as published.
- Sampling: temperature 0.2, top-p default, no system messages besides the harness, no chain-of-thought scaffolding, no retries on bad JSON. `max_tokens=12000` for OpenRouter routes (raised from 3000 in v0.3 to accommodate thinking models).
Harness verification. Every chini-bench run command computes a SHA-256 of the system prompt and sends the first 12 hex characters along with the submission as harness=chini-bench-cli:<hash>. The server stores it in the result file. The leaderboard then renders one of three states:
- default. Hash matches the canonical bench-version hash. Unmodified CLI, default prompt, no scaffolding.
- custom. Hash differs. Either someone forked the CLI and changed the prompt, or they re-ran via a different runner that declares its own harness id. The leaderboard surfaces the run but flags it.
- no badge. Submission came in via `chini-bench submit -f file.json` or directly via the HTTP endpoint. No harness was declared. Treated as a community contribution, not a calibrated frontier-model result.

The canonical hash can be reproduced by running `python -c "from chini_bench.prompt import system_prompt_hash; print(system_prompt_hash())"` against an unmodified install of chini-bench-cli >= 0.4.0.
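The harness id computation itself is small. A sketch of the scheme described above (the real implementation is the Python CLI; this TypeScript version is illustrative):

```typescript
import { createHash } from "node:crypto";

// Harness id: "chini-bench-cli:" + first 12 hex chars of SHA-256(system prompt).
function harnessId(systemPrompt: string): string {
  const digest = createHash("sha256").update(systemPrompt, "utf8").digest("hex");
  return `chini-bench-cli:${digest.slice(0, 12)}`;
}
```

Because the hash covers the full prompt text, any edit (even whitespace) flips the run from the default badge to the custom badge.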
Beating the default harness with a better prompt is itself a useful research result, as long as the prompt is published. We treat this as transparent harness-engineering, not cheating, and the custom badge keeps the two categories cleanly separated on the leaderboard.
13. Sanitization & abuse protection
The submission endpoint POST /api/bench/submit applies the following before scoring:
- Body cap. 64 KB. Larger bodies return 413.
- Rate limit. 5 submissions per IP per 10-minute rolling window.
- Honeypot. A hidden form field rejects naive bots without surfacing why.
- Submitter validation. Letters, digits, dot, dash, underscore; 1-40 characters. Auto-prefixed with `community:` to prevent impersonation of official model identifiers (e.g. `x-ai/grok-4.20`).
- Content filter. Submitter strings are checked against a categorized blocklist. Rejections surface a generic message; the category is logged for triage.
- Optional metadata. `model` (regex-validated), `x` handle (1-15 chars, no @), `linkedin` slug (full URL accepted, slug extracted).
- Canvas sanitization. Unknown component types fall back to `step`. Unknown behavior modes fall back to `passthrough`. Connections referencing missing components are dropped. Position, color, and styling fields are stripped before storage.
Implemented in src/pages/api/bench/submit.ts and src/lib/bench/sanitize.ts.
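The enum fallbacks are the load-bearing part of canvas sanitization: unknown values never reject a submission, they degrade to defaults. A sketch of that rule (illustrative; the real code is src/lib/bench/sanitize.ts):

```typescript
const TYPES = new Set([
  "person", "step", "storage", "decision", "trigger", "tool", "channel",
]);
const MODES = new Set([
  "passthrough", "transform", "filter", "queue", "split", "delay",
  "condition", "retry", "ratelimit", "circuitbreaker", "batch", "replicate",
]);

// Unknown enum values fall back to the defaults rather than rejecting.
const sanitizeType = (t: string): string => (TYPES.has(t) ? t : "step");
const sanitizeMode = (m: string): string => (MODES.has(m) ? m : "passthrough");
```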
14. Reproducibility
Determinism: the simulator (src/lib/flowRuntime.ts) and the scorer (src/lib/bench/scoring.ts) are pure functions. Same canvas + same problem definition always yields the same composite score and per-scenario metrics, byte-for-byte. The 120 result JSONs in src/content/bench/results/ can be regenerated from the published canvases by re-running the scorer.
Independence: every problem definition, scenario, and weight lives in src/content/bench/problems/ as plain JSON. Every result JSON includes the full submitted canvas (a) for inspection and (b) so any reader can re-score it locally and verify the published number.
The repository is currently private. Researchers who want to verify a result independently can request the source bundle by emailing squeak@chinilla.com. A public mirror under a research-use license is planned alongside the v1 leaderboard.
15. Limitations
- Construct validity. The simulator is a fair model of the failure modes it explicitly models (queue overflow, retry storms, circuit-breaker logic, rate limits, ambient failure). It is not a model of real-world cloud infrastructure. Beating CHINI-bench is necessary but not sufficient evidence of system-design competence.
- Technology-blindness. A "queue" component is not Kafka or SQS or a Redis list, just a queue with capacity and drop semantics. Models that reason at the level of named services have no advantage over models that reason at the level of behaviors.
- No human baseline yet. v0.3 reports model scores only. A controlled baseline study with hired senior engineers is planned for v1.
- Adversarial scenarios are stylized. The two-pass attack model is an abstraction. Real adversarial inputs (prompt injection, malicious payloads) are not directly tested; the bench tests whether a design has the structural primitives that would resist attack.
- Problem authorship is centralized. All 30 problems were authored by one person (the author). A community problem-submission process is planned for v1 to mitigate authorial blind spots.
- One-shot, not agentic. Models emit one CanvasState per problem. They cannot inspect simulator output and revise. This is a deliberate design choice, not an oversight: one-shot isolates design intuition under uncertainty from iterative search given a feedback loop, which are separable skills that deserve separable benchmarks. A multi-turn agentic track (model sees sim metrics, edits canvas, re-runs) is planned as a parallel leaderboard so the two signals are reported side-by-side, never conflated. Until then, v0.3 scores understate the capability of models given an agentic harness.
- Sample size. 30 problems, 4 models, 120 results. Adequate for the headline finding (the best model passes only 5/30 problems) but not for fine-grained model-vs-model claims at the <5pt level.
We document these limitations because a benchmark that hides its weaknesses is not a benchmark, it is marketing.
16. Ethics & broader impact
Intended use. CHINI-bench is intended to surface concrete weaknesses in current frontier models for system-design tasks, particularly for safety-relevant deployments (PC4 civic systems, PC5 adversarial). Improvements here translate, in principle, to safer agentic systems in production.
Misuse potential. Public benchmark scores can be optimized against rather than learned from. We mitigate via (a) a deterministic non-LLM judge that cannot be flattered, (b) cross-class composite scoring that punishes overfitting to PC1, and (c) versioning so a model that overfits v0.3 will not transfer to v1.
Data & privacy. The benchmark contains no personal data. Submissions store only the canvas, the submitter handle, and a SHA-256 hash of the source IP (for rate-limiting and abuse triage; never reversed, never displayed).
Carbon & cost. A full v0.3 sweep is 120 LLM calls (~$4 in API spend across the 4 frontier models, ~1 hour wall-clock including retries on rate limits and malformed JSON). The simulator runs locally and uses negligible energy.
Appendix A: problem set (v0.3)
- PC1. SWE backend systems. The familiar interview corpus.
- PC2. Operations and physical workflows. Cafes, kitchens, ER triage.
- PC3. Personal systems and habits. Cravings as packets, willpower as backpressure.
- PC4. Civic and public-service systems. Polling, vaccine rollout, shelter.
- PC5. Adversarial. Attacker in the graph; defenses, not just throughput.
| ID | Class | Title | Scenarios |
|---|---|---|---|
| chini-001-url-shortener | PC1 | URL Shortener (TinyURL). Map long URLs to short tokens. Survive spike traffic on the redirect path. | 3 |
| chini-002-checkout | PC1 | E-commerce Checkout with Idempotent Payments. Process checkouts without ever charging a customer twice. Survive a downstream payment-API outage. | 4 |
| chini-003-twitter-timeline | PC1 | Social Timeline (Twitter-style fanout). Generate a personalized timeline for millions of users. Don't melt when a celebrity posts. | 4 |
| chini-004-uber-dispatch | PC1 | Ride Dispatch (Uber-style matching). Match riders to drivers in real time. Stay alive when a region's matcher dies. | 4 |
| chini-005-chat-fanout | PC1 | Group Chat Fanout (WhatsApp-style). Deliver messages to large group chats in order. No drops, no duplicates. | 4 |
| chini-006-rate-limiter | PC1 | Distributed Rate Limiter. Allow bursty legitimate traffic. Reject abuse without blocking the world. | 3 |
| chini-007-payment-webhook | PC1 | Payment Webhook Receiver. Accept inbound webhooks. Never lose one. Never double-process one. | 3 |
| chini-008-search-autocomplete | PC1 | Search Autocomplete. Suggest as you type. Stay snappy when one shard goes dark. | 3 |
| chini-009-video-upload | PC1 | Video Upload Pipeline. Accept large uploads. Transcode in the background. Survive a worker meltdown. | 4 |
| chini-010-notification-fanout | PC1 | Notification Fanout (Push + Email + SMS). One event, three channels. Slow SMS provider must not block push. | 3 |
| chini-011-cafe-morning-rush | PC2 | Cafe Morning Rush. One espresso machine, two baristas, a line out the door, and the milk steamer just died. | 4 |
| chini-012-energy-drink-habit | PC3 | Quitting the Energy Drink Habit. A craving is a packet. Willpower is backpressure. Design the system that keeps you off the 4pm Red Bull. | 4 |
| chini-013-pottery-studio | PC2 | Pottery Studio Firing Schedule. Two kilns, twenty members, four firing stages, one electrical limit. Don't crack the work. | 4 |
| chini-014-restaurant-friday-night | PC2 | Restaurant Friday Night Service. Eight tables turning every 90 minutes. Three stations on the line. The walk-in just got delivered short on prep. | 4 |
| chini-015-er-triage | PC2 | Emergency Department Triage. Five severity levels, finite beds, one CT scanner. The wrong queue means someone dies. | 4 |
| chini-016-inbox-zero | PC3 | Inbox Zero Maintenance. 300 emails a day, three contexts, two devices, one human attention budget. | 4 |
| chini-017-couch-to-5k | PC3 | Couch to 5K. Three runs a week, nine weeks, one knee that hurts on Wednesdays. Get to the 5K without quitting. | 4 |
| chini-018-polling-station | PC4 | Election Day Polling Station. One precinct, eight booths, three machines, a thousand voters, and the printer for ballot paper just jammed. | 4 |
| chini-019-vaccine-rollout | PC4 | County Vaccine Rollout. Cold chain from a -70C freezer to a 95-year-old's deltoid. Don't waste a single dose. | 4 |
| chini-020-disaster-shelter | PC4 | Disaster Shelter Intake. 500 evacuees in 12 hours, finite cots, dietary restrictions, medical needs, families that must not be split. | 4 |
| chini-021-ddos-shield | PC5 | DDoS Mitigation Shield. 100M packets per second of garbage. Your customer's checkout still has to clear in 200ms. | 4 |
| chini-022-phishing-funnel | PC5 | Phishing Defense Funnel. 10,000 emails an hour. One of them is the spear-phish that gets the CFO's credentials. Find it. | 4 |
| chini-023-airline-gate-turnaround | PC2 | Airline Gate Turnaround. 25 minutes to deplane, refuel, clean, cater, and board 180 passengers. Anything later costs the airline a delay slot. | 4 |
| chini-024-meal-prep-sunday | PC3 | Meal Prep Sunday. Cook once, eat for a week. Without the Wednesday-night takeout collapse. | 4 |
| chini-025-job-search-pipeline | PC3 | Job Search Pipeline. 100 applications in, 3 offers out. Without ghosting yourself in the middle. | 4 |
| chini-026-food-bank-distribution | PC4 | Food Bank Distribution. Fresh produce in, hungry families out, nothing rots in the warehouse, nobody waits 4 hours. | 4 |
| chini-027-911-dispatch | PC4 | 911 Dispatch. Cardiac arrest at 9:01am, fender bender at 9:02am, fire at 9:03am. Three calls, two ambulances, one decision per second. | 4 |
| chini-028-credential-stuffing | PC5 | Credential Stuffing Defense. 100k stolen credentials replayed against your login. Block the attack without locking out 50k real users. | 3 |
| chini-029-comment-spam-flood | PC5 | Comment Spam Flood. An LLM-driven spammer floods your forum with 50k near-human comments. Block them without false-flagging real users. | 3 |
| chini-030-api-scraper | PC5 | API Scraper Defense. A distributed scraper drains your public API at 10x normal volume. Block it without blinding real apps. | 3 |
Appendix B: version history
| Version | Date | Changes |
|---|---|---|
| v0.3 | 2026-04-23 | Added 8 new problems (n=22 -> 30). Added design as a 5th subscore with 4 conditional structural checks (D1-D4). Added adversarial as a 5th scenario kind with two-pass scoring (attackBlockRate, cleanDeliveryRate). Wired OpenRouter; ran the first 4-model frontier sweep (Claude Sonnet 4.6, GPT-5.4, Grok 4.20, Gemini 3.1 Pro). Bumped max_tokens from 3000 to 12000 to accommodate thinking models. |
| v0.2 | 2026-04-22 | 22 problems across 5 classes. Single-model baseline (Grok). Public CLI + dashboard launch. |
| v0.1 | 2026-04-22 | Internal pilot. 12 problems, PC1 + PC2 only. Established the simulator-as-judge protocol. |
References
- Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366.
- Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770.
Citation
If you reference CHINI-bench in academic or industry work, please cite it as:
@misc{chinibench2026,
title = {{CHINI-bench}: A simulator-graded benchmark for {AI} system design},
author = {Kwon, Alex},
year = {2026},
note = {Version 0.3. https://chinilla.com/bench},
url = {https://chinilla.com/bench}
}

Plain text:
Kwon, A. (2026). CHINI-bench: A simulator-graded benchmark for AI system design (Version 0.3) [Benchmark]. https://chinilla.com/bench