chini-016-inbox-zero

Inbox Zero Maintenance

300 emails a day, three contexts, two devices, one human attention budget.

Source: Productivity literature, GTD methodology, every knowledge worker drowning in email

Prompt

Design a personal email-processing system to keep a 300-email-per-day inbox at zero by EOD without destroying focus.

Functional:
- Email arrives all day across three contexts: work, personal, newsletters/promo.
- Each email gets one of: archive (no action), reply now (<2 min), defer (snooze with action), delegate (forward + tag), file (reference).
- Two processing windows per day (morning + late afternoon). Outside those windows, email is queued, not read.
- Newsletters auto-route to a read-later bucket, never trigger a notification.
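One way to picture the window-batching and newsletter rules above is the minimal sketch below. The window times, the `Email` fields, and the `Triage` class are illustrative assumptions, not part of the problem statement or the Chinilla API; the prompt only specifies "morning + late afternoon".

```python
from dataclasses import dataclass, field
from datetime import time

# Hypothetical window times; the prompt does not give exact hours.
WINDOWS = [(time(9, 0), time(9, 45)), (time(16, 30), time(17, 15))]


@dataclass
class Email:
    sender: str
    context: str  # "work", "personal", or "newsletter"


@dataclass
class Triage:
    queue: list = field(default_factory=list)       # held until a window opens
    read_later: list = field(default_factory=list)  # newsletters, never notify

    def receive(self, email: Email) -> None:
        # filter: newsletters are shunted to read-later and bypass the queue
        if email.context == "newsletter":
            self.read_later.append(email)
        else:
            self.queue.append(email)

    def in_window(self, now: time) -> bool:
        return any(start <= now < end for start, end in WINDOWS)

    def drain(self, now: time) -> list:
        # queue: outside a window nothing is released for reading
        if not self.in_window(now):
            return []
        batch, self.queue = self.queue, []
        return batch
```

Calling `drain` outside a window returns an empty list and leaves the queue untouched, which is the "queued, not read" behavior the prompt asks for.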

Non-functional:
- A bad day (4x normal volume, e.g. after PTO) must not blow the daily processing budget. The system should batch and defer aggressively.
- If a key person emails (boss, partner, named contacts), a notification may break the window-only rule, but notifications are rate-limited to one ping per hour.
- If the user misses an evening window, the morning window must absorb the backlog without consuming the entire morning.
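The key-contact exception can be sketched as a simple rate limiter. This is a sketch under assumptions: `KeyContactNotifier` and its parameters are hypothetical names, and the one-ping-per-hour cap is treated as global across all key contacts, which the prompt does not spell out.

```python
from datetime import datetime, timedelta


class KeyContactNotifier:
    """Allow at most one notification per `min_gap`, and only for key contacts."""

    def __init__(self, key_contacts, min_gap=timedelta(hours=1)):
        self.key_contacts = set(key_contacts)
        self.min_gap = min_gap
        self.last_ping = None  # time of the most recent notification, if any

    def should_notify(self, sender: str, now: datetime) -> bool:
        if sender not in self.key_contacts:
            return False  # everyone else waits for the next window
        if self.last_ping is not None and now - self.last_ping < self.min_gap:
            return False  # cap reached: the email stays queued, no ping
        self.last_ping = now
        return True
```

A suppressed ping does not drop the email; it only means the message waits in the queue like any other until the cap resets or a window opens.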

Return a Chinilla CanvasState. Components: inbox, classifier, windows, action buckets, notifications. Behaviors: split (context routing), queue (window batching), ratelimit (notification cap), batch (bulk processing), filter (newsletter shunt).

Constraints

Max components: 12
Required behaviors: split, queue, ratelimit
Monthly budget: $50
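The constraints above can be checked mechanically before submitting. The sketch below assumes a design is available as a plain list of dicts with hypothetical `name` and `behavior` keys and that a monthly cost estimate is supplied by the caller; the real CanvasState schema and cost model are not shown on this page.

```python
MAX_COMPONENTS = 12
REQUIRED_BEHAVIORS = {"split", "queue", "ratelimit"}
MONTHLY_BUDGET_USD = 50


def check_constraints(components, monthly_cost_usd):
    """Return a list of constraint violations (empty if the design fits).

    `components` is a list of dicts with hypothetical 'name' and 'behavior'
    keys; this is an illustrative shape, not the CanvasState format.
    """
    problems = []
    if len(components) > MAX_COMPONENTS:
        problems.append(f"too many components: {len(components)} > {MAX_COMPONENTS}")
    used = {c.get("behavior") for c in components}
    missing = REQUIRED_BEHAVIORS - used
    if missing:
        problems.append(f"missing required behaviors: {sorted(missing)}")
    if monthly_cost_usd > MONTHLY_BUDGET_USD:
        problems.append(f"over budget: ${monthly_cost_usd} > ${MONTHLY_BUDGET_USD}")
    return problems
```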

Stress scenarios

- Normal day (baseline): 300 emails across two windows, mixed contexts.
- Back from vacation (spike): 4x backlog. Morning window must absorb without taking the whole morning.
- Missed evening window (outage): User skipped late-afternoon processing. Backlog hits morning queue.
- Hard-to-classify thread (latency): Long ambiguous threads require human read time. System must not block fresh email.
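To see why the spike and missed-window scenarios force bulk deferral, a rough capacity calculation helps. The 45-minute window and 10 seconds per email are made-up figures for illustration; only the 300-per-day volume and the 4x multiplier come from the problem.

```python
def plan_window(backlog: int, budget_minutes: float, seconds_per_email: float):
    """Split a backlog into (handled individually, bulk-deferred).

    Anything beyond what fits in the time budget is deferred in bulk
    instead of stretching the window.
    """
    capacity = int(budget_minutes * 60 // seconds_per_email)
    handled = min(backlog, capacity)
    return handled, backlog - handled


# Normal day: 300 emails over two windows is about 150 per window.
normal = plan_window(150, budget_minutes=45, seconds_per_email=10)

# Back from vacation: 4 x 300 = 1200 emails arrive at the morning window.
spike = plan_window(4 * 300, budget_minutes=45, seconds_per_email=10)
```

Under these assumed numbers a normal window clears everything, while the post-PTO window can touch 270 emails individually and must batch or defer the remaining 930.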

Pass criteria (overall)

Min stability score: 60
Max drop rate: 10.0%
Min delivery rate: 85.0%
Max errors: 6
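The thresholds above amount to a simple conjunction. The sketch below assumes the bounds are inclusive and that the four metrics are available as plain numbers; `RunMetrics` is a hypothetical container, not a type from the chini-bench CLI.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    stability: float          # overall stability score
    drop_rate_pct: float      # percent of emails dropped
    delivery_rate_pct: float  # percent of emails delivered
    errors: int


def passes(m: RunMetrics) -> bool:
    # Inclusive bounds are an assumption; the page only lists min/max values.
    return (
        m.stability >= 60
        and m.drop_rate_pct <= 10.0
        and m.delivery_rate_pct >= 85.0
        and m.errors <= 6
    )
```

For example, a run with stability 88.0, 0.0% drops, 100.0% delivery, and no errors would pass, while one with 32.0% delivery would fail regardless of its other numbers.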

Submit your run

Submissions go through the chini-bench CLI. It calls your model with your key, scores the result locally, and posts to the leaderboard. Nothing leaves your machine except the canvas it produces.

End-to-end:

    pip install git+https://github.com/collapseindex/chini-bench-cli.git
    export OPENROUTER_API_KEY=...

    chini-bench run chini-016-inbox-zero \
      --provider openrouter --model google/gemini-2.0-flash-001 \
      --as alice

Or inspect the prompt first:

    chini-bench prompt chini-016-inbox-zero

Providers: openai · anthropic · google · openrouter · ollama

Leaderboard

Rank | Submitter | Model | Score | Stability | Delivery | Design | Pass
#1 | alex | anthropic/claude-sonnet-4.6 (default single-shot) | 95 | 88.0 | 100.0 | 75.0
#2 | rl_v06_run2 | rl_policy (custom single-shot) | 94 | 86.0 | 100.0 | 75.0
#3 | alex | x-ai/grok-4.20 (default single-shot) | 93 | 91.0 | 91.0 | 75.0
#4 | rl_v06_run1 | rl_policy (custom single-shot) | 92 | 83.0 | 100.0 | 75.0
#5 | rl_v06_run2 | rl_policy (custom single-shot) | 92 | 83.0 | 100.0 | 75.0
#6 | rl_v06_run2 | rl_policy (custom single-shot) | 92 | 83.0 | 100.0 | 75.0
#7 | rl_v06_run2 | rl_policy (custom single-shot) | 92 | 83.0 | 100.0 | 75.0
#8 | rl_v06_run2 | rl_policy (custom single-shot) | 91 | 83.0 | 100.0 | 75.0
#9 | rl_v06_run2 | rl_policy (custom single-shot) | 91 | 81.0 | 100.0 | 75.0
#10 | rl_v06_run1 | rl_policy (custom single-shot) | 90 | 81.0 | 100.0 | 75.0
#11 | rl_v06_run1 | rl_policy (custom single-shot) | 90 | 80.0 | 98.0 | 75.0
#12 | rl_v06_run2 | rl_policy (custom single-shot) | 90 | 78.0 | 100.0 | 75.0
#13 | rl_v06_run2 | rl_policy (custom single-shot) | 90 | 83.0 | 94.0 | 75.0
#14 | alex | openai/gpt-5.4 (default single-shot) | 89 | 78.0 | 96.0 | 75.0
#15 | alex | google/gemini-3.1-pro-preview (default single-shot) | 89 | 75.0 | 100.0 | 75.0
#16 | rl_v06_run1 | rl_policy (custom single-shot) | 89 | 76.0 | 100.0 | 75.0
#17 | rl_v06_run2 | rl_policy (custom single-shot) | 89 | 83.0 | 100.0 | 75.0
#18 | rl_v06_run2 | rl_policy (custom single-shot) | 89 | 82.0 | 92.0 | 75.0
#19 | rl_v06_run2 | rl_policy (custom single-shot) | 89 | 83.0 | 100.0 | 50.0
#20 | rl_v06_run1 | rl_policy (custom single-shot) | 88 | 82.0 | 100.0 | 75.0
#21 | rl_v06_run1 | rl_policy (custom single-shot) | 88 | 83.0 | 88.0 | 75.0
#22 | rl_v06_run2 | rl_policy (custom single-shot) | 88 | 83.0 | 88.0 | 75.0
#23 | rl_v06_run1 | rl_policy (custom single-shot) | 87 | 72.0 | 100.0 | 75.0
#24 | rl_v06_run2 | rl_policy (custom single-shot) | 87 | 72.0 | 100.0 | 75.0
#25 | rl_v06_run2 | rl_policy (custom single-shot) | 86 | 75.0 | 100.0 | 75.0
#26 | rl_v06_run2 | rl_policy (custom single-shot) | 86 | 82.0 | 83.0 | 75.0
#27 | rl_v06_run2 | rl_policy (custom single-shot) | 86 | 78.0 | 100.0 | 50.0
#28 | rl_v06_run2 | rl_policy (custom single-shot) | 86 | 83.0 | 83.0 | 75.0
#29 | rl_v06_run1 | rl_policy (custom single-shot) | 85 | 75.0 | 100.0 | 50.0
#30 | rl_v06_run1 | rl_policy (custom single-shot) | 84 | 83.0 | 92.0 | 50.0
#31 | rl_v06_run2 | rl_policy (custom single-shot) | 84 | 84.0 | 75.0 | 75.0
#32 | rl_v06_run2 | rl_policy (custom single-shot) | 84 | 83.0 | 92.0 | 75.0
#33 | rl_v06_run2 | rl_policy (custom single-shot) | 84 | 83.0 | 75.0 | 75.0
#34 | rl_v06_run2 | rl_policy (custom single-shot) | 84 | 83.0 | 92.0 | 75.0
#35 | rl_v06_run2 | rl_policy (custom single-shot) | 84 | 80.0 | 84.0 | 85.0
#36 | rl_v06_run2 | rl_policy (custom single-shot) | 82 | 68.0 | 100.0 | 50.0
#37 | rl_v06_run2 | rl_policy (custom single-shot) | 82 | 86.0 | 67.0 | 75.0
#38 | rl_v06_run2 | rl_policy (custom single-shot) | 82 | 76.0 | 100.0 | 75.0
#39 | rl_v06_run2 | rl_policy (custom single-shot) | 81 | 71.0 | 82.0 | 75.0
#40 | rl_v06_run2 | rl_policy (custom single-shot) | 80 | 81.0 | 100.0 | 50.0
#41 | rl_v06_run2 | rl_policy (custom single-shot) | 79 | 81.0 | 75.0 | 50.0
#42 | rl_v06_run2 | rl_policy (custom single-shot) | 79 | 83.0 | 94.0 | 50.0
#43 | rl_v06_run2 | rl_policy (custom single-shot) | 79 | 79.0 | 100.0 | 50.0
#44 | rl_v06_run1 | rl_policy (custom single-shot) | 78 | 83.0 | 69.0 | 50.0
#45 | rl_v06_run1 | rl_policy (custom single-shot) | 78 | 78.0 | 64.0 | 75.0
#46 | rl_v06_run1 | rl_policy (custom single-shot) | 77 | 73.0 | 100.0 | 50.0
#47 | rl_v06_run1 | rl_policy (custom single-shot) | 75 | 69.0 | 100.0 | 50.0
#48 | rl_v06_run1 | rl_policy (custom single-shot) | 75 | 70.0 | 100.0 | 50.0
#49 | rl_v06_run1 | rl_policy (custom single-shot) | 72 | 70.0 | 100.0 | 50.0
#50 | rl_v06_run1 | rl_policy (custom single-shot) | 72 | 83.0 | 75.0 | 50.0
#51 | rl_v06_run2 | rl_policy (custom single-shot) | 72 | 86.0 | 50.0 | 50.0
#52 | rl_v06_run1 | rl_policy (custom single-shot) | 69 | 79.0 | 71.0 | 50.0
#53 | alex | openai/gpt-5.4 (default reflexion) | 65 | 36.0 | 100.0 | 100.0
#54 | rl_v06_run2 | rl_policy (custom single-shot) | 65 | 74.0 | 32.0 | 75.0
#55 | rl_v06_run2 | rl_policy (custom single-shot) | 65 | 75.0 | 40.0 | 75.0
#56 | alex | google/gemini-3.1-pro-preview (default reflexion) | 60 | 56.0 | 65.0 | 100.0
#57 | rl_v06_run2 | rl_policy (custom single-shot) | 60 | 81.0 | 43.0 | 50.0
#58 | rl_v06_run2 | rl_policy (custom single-shot) | 60 | 80.0 | 44.0 | 50.0
#59 | alex | anthropic/claude-sonnet-4.6 (default reflexion) | 54 | 26.0 | 82.0 | 100.0
#60 | alex | x-ai/grok-4.20 (default reflexion) | 50 | 1.0 | 96.0 | 100.0

Per-scenario breakdown of the top run

Scenario | Health | Drop rate | Delivered | Pass
baseline | 89.0 | 0.0% | 352
post-pto | 94.0 | 0.0% | 1280
missed-window | 79.0 | 0.0% | 240
slow-classify | 89.0 | 0.0% | 320