chini-016-inbox-zero

Inbox Zero Maintenance

300 emails a day, three contexts, two devices, one human attention budget.

Source: Productivity literature, GTD methodology, every knowledge worker drowning in email

Prompt

Design a personal email-processing system to keep a 300-email-per-day inbox at zero by EOD without destroying focus.

Functional:
- Email arrives all day across three contexts: work, personal, newsletters/promo.
- Each email gets one of: archive (no action), reply now (<2 min), defer (snooze with action), delegate (forward + tag), file (reference).
- Two processing windows per day (morning + late afternoon). Outside those windows, email is queued, not read.
- Newsletters auto-route to a read-later bucket, never trigger a notification.
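
The functional rules above can be read as a small decision procedure. The following is a minimal sketch, not the benchmark's reference design: the `Email` fields, the bucket names, and the window times are all hypothetical, since the prompt only says "morning + late afternoon" and does not define how an email's required action is detected.

```python
from dataclasses import dataclass
from datetime import time
from enum import Enum


class Context(Enum):
    WORK = "work"
    PERSONAL = "personal"
    NEWSLETTER = "newsletter"  # newsletters/promo


@dataclass
class Email:
    # All fields are illustrative assumptions, not part of the prompt.
    sender: str
    context: Context
    needs_action: bool = False
    est_reply_minutes: float = 0.0
    owner_is_me: bool = True
    is_reference: bool = False


# Hypothetical window times (start inclusive, end exclusive).
WINDOWS = [(time(9, 0), time(9, 45)), (time(16, 30), time(17, 15))]


def in_window(now: time) -> bool:
    """True if email may be read right now; otherwise it stays queued."""
    return any(start <= now < end for start, end in WINDOWS)


def triage(email: Email) -> str:
    """Return the name of the bucket an email lands in."""
    # Filter: newsletters are shunted before any other rule applies.
    if email.context is Context.NEWSLETTER:
        return "read_later"
    # No action required: keep as reference or archive.
    if not email.needs_action:
        return "file" if email.is_reference else "archive"
    # Action required but someone else owns it: forward + tag.
    if not email.owner_is_me:
        return "delegate"
    # The two-minute rule from the prompt.
    if email.est_reply_minutes < 2:
        return "reply_now"
    return "defer"
```

Checking the newsletter case first keeps promotional mail out of the action buckets entirely, which matches the requirement that it never triggers a notification.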

Non-functional:
- A bad day (4x normal volume, e.g. after PTO) must not blow the daily processing budget. The system batches and defers aggressively.
- If a key person emails (boss, partner, named contacts), the notification breaks the window-only rule but is rate-limited to one ping per hour.
- If the user misses the late-afternoon window, the morning window must absorb the backlog without consuming the entire morning.
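
Two of these requirements reduce to simple mechanisms: a rate limiter for key-contact pings and a time-boxed drain of the window queue. The sketch below illustrates both under stated assumptions. The one-hour cap is treated as global rather than per contact (the prompt does not say which), and the 45-minute budget and 15-second per-email cost in the usage note are invented numbers for illustration only.

```python
from collections import deque
from typing import Deque, List, Optional, Set, Tuple


class KeyContactNotifier:
    """Allow at most one notification per `min_gap_s` seconds (global cap)."""

    def __init__(self, key_contacts: Set[str], min_gap_s: float = 3600.0) -> None:
        self.key_contacts = key_contacts
        self.min_gap_s = min_gap_s
        self._last_ping_s: Optional[float] = None  # time of last ping sent

    def should_notify(self, sender: str, now_s: float) -> bool:
        # Only named contacts may break the window-only rule.
        if sender not in self.key_contacts:
            return False
        # Suppress the ping if the previous one was under an hour ago;
        # the email itself still waits in the queue for the next window.
        if self._last_ping_s is not None and now_s - self._last_ping_s < self.min_gap_s:
            return False
        self._last_ping_s = now_s
        return True


def drain_window(
    queue: Deque[str], budget_min: float, cost_min: float
) -> Tuple[List[str], int]:
    """Process queued emails oldest-first until the time budget is spent.

    Returns the processed items and the count left for the next window.
    """
    processed: List[str] = []
    spent = 0.0
    while queue and spent + cost_min <= budget_min:
        processed.append(queue.popleft())
        spent += cost_min
    return processed, len(queue)
```

With a hypothetical 45-minute budget and 0.25 minutes per email, a 1200-email post-PTO backlog yields 180 processed and 1020 carried over, so the window ends on time regardless of volume. A real design would likely bulk-archive or batch low-value mail first so the carried-over remainder shrinks faster.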

Return a Chinilla CanvasState. Components: inbox, classifier, windows, action buckets, notifications. Behaviors: split (context routing), queue (window batching), ratelimit (notification cap), batch (bulk processing), filter (newsletter shunt).

Constraints

Max components: 12
Required behaviors: split, queue, ratelimit
Monthly budget: $50

Stress scenarios

Scenario	Type	Description
Normal day	baseline	300 emails across two windows, mixed contexts.
Back from vacation	spike	4x backlog. Morning window must absorb without taking the whole morning.
Missed evening window	outage	User skipped late-afternoon processing. Backlog hits morning queue.
Hard-to-classify thread	latency	Long ambiguous threads require human read time. System must not block fresh email.

Pass criteria (overall)

Min stability score: 60
Max drop rate: 10.0%
Min delivery rate: 85.0%
Max errors: 6
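
As a rough illustration of how these four thresholds combine, the sketch below checks a run's metrics against them and reports which ones fail. The `RunMetrics` type and its field names are hypothetical. How chini-bench actually aggregates metrics (per scenario or overall) is not stated here, so this should not be read as the scorer's real logic.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the pass criteria above.
MIN_STABILITY = 60.0
MAX_DROP_RATE_PCT = 10.0
MIN_DELIVERY_RATE_PCT = 85.0
MAX_ERRORS = 6


@dataclass
class RunMetrics:
    # Hypothetical container; field names are assumptions.
    stability: float
    drop_rate_pct: float
    delivery_rate_pct: float
    errors: int


def failed_criteria(m: RunMetrics) -> List[str]:
    """Return the names of the criteria a run fails (empty list = pass)."""
    failures: List[str] = []
    if m.stability < MIN_STABILITY:
        failures.append("stability")
    if m.drop_rate_pct > MAX_DROP_RATE_PCT:
        failures.append("drop_rate")
    if m.delivery_rate_pct < MIN_DELIVERY_RATE_PCT:
        failures.append("delivery_rate")
    if m.errors > MAX_ERRORS:
        failures.append("errors")
    return failures
```

A run must clear all four checks at once. This is consistent with leaderboard rows that show stability and delivery above the thresholds yet are still marked as failing, which suggests drop rate, error count, or per-scenario results also feed the Pass mark.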

Submit your run

Submissions go through the chini-bench CLI. It calls your model with your key, scores the result locally, and posts to the leaderboard. Nothing is uploaded to the leaderboard except the canvas the model produces.

End-to-end:

pip install git+https://github.com/collapseindex/chini-bench-cli.git
export OPENROUTER_API_KEY=...

chini-bench run chini-016-inbox-zero \
  --provider openrouter --model google/gemini-2.0-flash-001 \
  --as alice

Or inspect the prompt first:

chini-bench prompt chini-016-inbox-zero

Providers: openai · anthropic · google · openrouter · ollama

Leaderboard

Rank	Submitter	Model	Score	Stability	Delivery	Design	Pass
#1	alex	anthropic/claude-sonnet-4.6 default single-shot	95	88.0	100.0	75.0	✓
#2	rl_v06_run2	rl_policy custom single-shot	94	86.0	100.0	75.0	✓
#3	alex	x-ai/grok-4.20 default single-shot	93	91.0	91.0	75.0	✗
#4	rl_v06_run1	rl_policy custom single-shot	92	83.0	100.0	75.0	✓
#5	rl_v06_run2	rl_policy custom single-shot	92	83.0	100.0	75.0	✓
#6	rl_v06_run2	rl_policy custom single-shot	92	83.0	100.0	75.0	✓
#7	rl_v06_run2	rl_policy custom single-shot	92	83.0	100.0	75.0	✓
#8	rl_v06_run2	rl_policy custom single-shot	91	83.0	100.0	75.0	✗
#9	rl_v06_run2	rl_policy custom single-shot	91	81.0	100.0	75.0	✓
#10	rl_v06_run1	rl_policy custom single-shot	90	81.0	100.0	75.0	✗
#11	rl_v06_run1	rl_policy custom single-shot	90	80.0	98.0	75.0	✓
#12	rl_v06_run2	rl_policy custom single-shot	90	78.0	100.0	75.0	✓
#13	rl_v06_run2	rl_policy custom single-shot	90	83.0	94.0	75.0	✗
#14	alex	openai/gpt-5.4 default single-shot	89	78.0	96.0	75.0	✗
#15	alex	google/gemini-3.1-pro-preview default single-shot	89	75.0	100.0	75.0	✓
#16	rl_v06_run1	rl_policy custom single-shot	89	76.0	100.0	75.0	✓
#17	rl_v06_run2	rl_policy custom single-shot	89	83.0	100.0	75.0	✗
#18	rl_v06_run2	rl_policy custom single-shot	89	82.0	92.0	75.0	✗
#19	rl_v06_run2	rl_policy custom single-shot	89	83.0	100.0	50.0	✗
#20	rl_v06_run1	rl_policy custom single-shot	88	82.0	100.0	75.0	✗
#21	rl_v06_run1	rl_policy custom single-shot	88	83.0	88.0	75.0	✗
#22	rl_v06_run2	rl_policy custom single-shot	88	83.0	88.0	75.0	✗
#23	rl_v06_run1	rl_policy custom single-shot	87	72.0	100.0	75.0	✓
#24	rl_v06_run2	rl_policy custom single-shot	87	72.0	100.0	75.0	✓
#25	rl_v06_run2	rl_policy custom single-shot	86	75.0	100.0	75.0	✗
#26	rl_v06_run2	rl_policy custom single-shot	86	82.0	83.0	75.0	✗
#27	rl_v06_run2	rl_policy custom single-shot	86	78.0	100.0	50.0	✗
#28	rl_v06_run2	rl_policy custom single-shot	86	83.0	83.0	75.0	✗
#29	rl_v06_run1	rl_policy custom single-shot	85	75.0	100.0	50.0	✗
#30	rl_v06_run1	rl_policy custom single-shot	84	83.0	92.0	50.0	✗
#31	rl_v06_run2	rl_policy custom single-shot	84	84.0	75.0	75.0	✗
#32	rl_v06_run2	rl_policy custom single-shot	84	83.0	92.0	75.0	✗
#33	rl_v06_run2	rl_policy custom single-shot	84	83.0	75.0	75.0	✗
#34	rl_v06_run2	rl_policy custom single-shot	84	83.0	92.0	75.0	✗
#35	rl_v06_run2	rl_policy custom single-shot	84	80.0	84.0	85.0	✗
#36	rl_v06_run2	rl_policy custom single-shot	82	68.0	100.0	50.0	✗
#37	rl_v06_run2	rl_policy custom single-shot	82	86.0	67.0	75.0	✗
#38	rl_v06_run2	rl_policy custom single-shot	82	76.0	100.0	75.0	✗
#39	rl_v06_run2	rl_policy custom single-shot	81	71.0	82.0	75.0	✗
#40	rl_v06_run2	rl_policy custom single-shot	80	81.0	100.0	50.0	✗
#41	rl_v06_run2	rl_policy custom single-shot	79	81.0	75.0	50.0	✗
#42	rl_v06_run2	rl_policy custom single-shot	79	83.0	94.0	50.0	✗
#43	rl_v06_run2	rl_policy custom single-shot	79	79.0	100.0	50.0	✗
#44	rl_v06_run1	rl_policy custom single-shot	78	83.0	69.0	50.0	✗
#45	rl_v06_run1	rl_policy custom single-shot	78	78.0	64.0	75.0	✗
#46	rl_v06_run1	rl_policy custom single-shot	77	73.0	100.0	50.0	✗
#47	rl_v06_run1	rl_policy custom single-shot	75	69.0	100.0	50.0	✗
#48	rl_v06_run1	rl_policy custom single-shot	75	70.0	100.0	50.0	✗
#49	rl_v06_run1	rl_policy custom single-shot	72	70.0	100.0	50.0	✗
#50	rl_v06_run1	rl_policy custom single-shot	72	83.0	75.0	50.0	✗
#51	rl_v06_run2	rl_policy custom single-shot	72	86.0	50.0	50.0	✗
#52	rl_v06_run1	rl_policy custom single-shot	69	79.0	71.0	50.0	✗
#53	alex	openai/gpt-5.4 default reflexion	65	36.0	100.0	100.0	✗
#54	rl_v06_run2	rl_policy custom single-shot	65	74.0	32.0	75.0	✗
#55	rl_v06_run2	rl_policy custom single-shot	65	75.0	40.0	75.0	✗
#56	alex	google/gemini-3.1-pro-preview default reflexion	60	56.0	65.0	100.0	✗
#57	rl_v06_run2	rl_policy custom single-shot	60	81.0	43.0	50.0	✗
#58	rl_v06_run2	rl_policy custom single-shot	60	80.0	44.0	50.0	✗
#59	alex	anthropic/claude-sonnet-4.6 default reflexion	54	26.0	82.0	100.0	✗
#60	alex	x-ai/grok-4.20 default reflexion	50	1.0	96.0	100.0	✗

Per-scenario breakdown of the top run

Scenario	Health	Drop rate	Delivered	Pass
baseline	89.0	0.0%	352	✓
post-pto	94.0	0.0%	1280	✓
missed-window	79.0	0.0%	240	✓
slow-classify	89.0	0.0%	320	✓
