chini-012-energy-drink-habit

Quitting the Energy Drink Habit

A craving is a packet. Willpower is backpressure. Design the system that keeps you off the 4pm Red Bull.

Source: Behavioral psychology, habit-loop literature, and a personal problem the author refuses to discuss further

Prompt

Design a personal system to taper someone off a 3-can-per-day energy drink habit over 30 days without crashing their workday.

Functional:
- Cravings arrive throughout the day (model them as packets). Each craving must be routed to a healthy substitute (water, walk, snack, deep breath) OR, in capped quantity, an actual drink.
- A cap on real-drink consumption per day. The cap shrinks weekly.
- Triggers are tracked: 9am wake, 2pm crash, 8pm gym. Each trigger emits a craving packet.
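
The functional requirements above can be sketched in a few lines of Python. This is an illustration only, not the Chinilla CanvasState format: the taper schedule in `WEEKLY_CAPS`, the 1-10 intensity scale, and the intensity thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed taper schedule (not given by the task): cans allowed per day,
# indexed by week. Days 29-30 fall into a fifth, zero-can week.
WEEKLY_CAPS = [3, 2, 1, 1, 0]


def daily_cap(day: int) -> int:
    """Return the real-drink cap for a 1-indexed day of the 30-day plan."""
    if not 1 <= day <= 30:
        raise ValueError("day must be in 1..30")
    return WEEKLY_CAPS[(day - 1) // 7]


@dataclass
class DayState:
    day: int
    drinks_used: int = 0


def route_craving(state: DayState, intensity: int) -> str:
    """Route one craving packet. Intensity is a hypothetical 1-10 scale."""
    # Split + ratelimit: only strong cravings are eligible for a real drink,
    # and only while the day's cap has room.
    if intensity >= 8 and state.drinks_used < daily_cap(state.day):
        state.drinks_used += 1
        return "drink"
    # Everything else is routed to a substitute by intensity band.
    if intensity >= 5:
        return "walk"
    if intensity >= 3:
        return "snack"
    return "water"
```

Because the cap check sits in front of the drink branch, a burst of strong cravings on one day can never yield more drinks than `daily_cap(day)`; the overflow lands on substitutes instead.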

Non-functional:
- A bad day (work stress pushing cravings to 4x the normal rate) must not blow the daily cap. The system rate-limits real drinks and routes the overflow to substitutes.
- If the planned substitute is unavailable (out of LaCroix, gym closed), the system must fail gracefully to a different substitute, not directly to a drink.
- The system must not be so restrictive that the user just abandons it. Some real drinks are allowed; the goal is taper, not cold-turkey.
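
One way to read the failover requirement is as a small circuit breaker wrapped around each substitute. The sketch below assumes a fallback order, a failure threshold of 2, and an `attempt` callback that reports whether a substitute worked; none of these come from the task, and a real breaker would also need a reset (half-open) path, which is left out here.

```python
from typing import Callable, Dict, List

# Assumed fallback order; the task names the substitutes but not a priority.
FALLBACK_ORDER: List[str] = ["lacroix", "walk", "snack", "deep_breath"]


class SubstituteBreaker:
    """Tracks consecutive failures per substitute (hypothetical helper)."""

    def __init__(self, failure_threshold: int = 2) -> None:
        self.failure_threshold = failure_threshold
        self.failures: Dict[str, int] = {}

    def is_open(self, name: str) -> bool:
        # An open breaker means: stop trying this substitute for now.
        return self.failures.get(name, 0) >= self.failure_threshold

    def record(self, name: str, ok: bool) -> None:
        # A success resets the count; a failure increments it.
        self.failures[name] = 0 if ok else self.failures.get(name, 0) + 1


def resolve_craving(
    preferred: str,
    breaker: SubstituteBreaker,
    attempt: Callable[[str], bool],
) -> str:
    """Try the preferred substitute, then the rest of the fallback order."""
    order = [preferred] + [s for s in FALLBACK_ORDER if s != preferred]
    for name in order:
        if breaker.is_open(name):
            continue  # skip substitutes that keep failing
        ok = attempt(name)
        breaker.record(name, ok)
        if ok:
            return name
    # Last resort needs no supplies. Note that "drink" is never returned here.
    return "deep_breath"
```

The point of the design is that the failure path has no edge to a real drink: an empty LaCroix stash degrades to a walk, a snack, or a deep breath, and repeated failures stop the system from retrying the empty stash on every craving.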

Return a Chinilla CanvasState. Components are routines, substitutes, the user, and the cap. Behaviors are the same primitives: queue (urge backlog), ratelimit (daily cap), circuitbreaker (substitute failover), retry (try substitute again before caving), storage (snack stash), split (route by craving intensity).

Constraints

Max components: 10
Required behaviors: ratelimit, circuitbreaker, split
Monthly budget: $200
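
Before submitting, a design can be checked against these constraints with a few lines of Python. The component shape used here (a dict with `behavior` and `monthly_cost` keys) is a hypothetical stand-in, not the actual CanvasState schema, and how the benchmark prices components is not stated on this page.

```python
from typing import Dict, List

MAX_COMPONENTS = 10
REQUIRED_BEHAVIORS = {"ratelimit", "circuitbreaker", "split"}
MONTHLY_BUDGET_USD = 200.0


def constraint_violations(components: List[Dict]) -> List[str]:
    """Return human-readable constraint violations (empty list if none).

    Each component is assumed to be a dict with optional "behavior" and
    "monthly_cost" keys; this shape is an illustration, not the real schema.
    """
    problems: List[str] = []
    if len(components) > MAX_COMPONENTS:
        problems.append(
            f"too many components: {len(components)} > {MAX_COMPONENTS}"
        )
    used = {c.get("behavior") for c in components}
    missing = sorted(REQUIRED_BEHAVIORS - used)
    if missing:
        problems.append("missing behaviors: " + ", ".join(missing))
    cost = sum(c.get("monthly_cost", 0.0) for c in components)
    if cost > MONTHLY_BUDGET_USD:
        problems.append(
            f"over budget: ${cost:.2f} > ${MONTHLY_BUDGET_USD:.2f}"
        )
    return problems
```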

Stress scenarios

Normal day

baseline

Three trigger windows, baseline craving rate. System should keep within daily cap.

Bad work day

spike

Cravings 4x baseline. Cap must hold, substitutes must absorb the rest.

Primary substitute unavailable

outage

LaCroix stash is empty. System must reroute to walk/snack/breath, not collapse to a drink.

Walk takes longer than planned

latency

Substitute resolution time spikes (long meeting, no break). System must hold without dumping the queue.
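
The latency scenario is essentially a bounded-queue problem: cravings keep arriving while substitutes resolve slowly. The toy simulation below shows how backlog size trades off against drops. The tick model, parameter names, and drop-on-full policy are assumptions for illustration and are not how the chini-bench simulator is known to work.

```python
from collections import deque
from typing import Dict, List


def simulate_backlog(
    arrivals: List[int],
    served_per_tick: List[int],
    max_backlog: int,
) -> Dict[str, float]:
    """Run a tick-based urge backlog and report delivery and drop counts.

    arrivals[t] cravings arrive at tick t; up to served_per_tick[t] of the
    waiting cravings are resolved by a substitute in that same tick.
    """
    queue: deque = deque()
    delivered = dropped = 0
    for tick, count in enumerate(arrivals):
        for _ in range(count):
            if len(queue) < max_backlog:
                queue.append(tick)  # craving waits for a substitute
            else:
                dropped += 1  # backlog full: the craving is lost
        for _ in range(min(served_per_tick[tick], len(queue))):
            queue.popleft()
            delivered += 1
    total = delivered + dropped + len(queue)
    return {
        "delivered": delivered,
        "dropped": dropped,
        "waiting": len(queue),
        "drop_rate_pct": 100.0 * dropped / total if total else 0.0,
    }
```

With one craving per tick and a two-tick stall in the middle (`served_per_tick=[1, 1, 0, 0, 1, 1]`), a backlog of 3 absorbs the stall with no drops, while a backlog of 1 loses two of the six cravings.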

Pass criteria (overall)

Min stability score: 60
Max drop rate: 15.0%
Min delivery rate: 80.0%
Max errors: 8
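
The four thresholds above translate directly into a check. The sketch assumes each bound is inclusive and that a run's metrics are available as plain numbers; `RunMetrics` and its field names are hypothetical, and how the scorer aggregates the four scenarios into overall values is not specified here.

```python
from dataclasses import dataclass
from typing import List

MIN_STABILITY = 60.0
MAX_DROP_RATE_PCT = 15.0
MIN_DELIVERY_RATE_PCT = 80.0
MAX_ERRORS = 8


@dataclass(frozen=True)
class RunMetrics:
    """Hypothetical container for a run's overall metrics."""

    stability: float
    drop_rate_pct: float
    delivery_rate_pct: float
    errors: int


def failed_criteria(m: RunMetrics) -> List[str]:
    """Return the names of the pass criteria this run misses."""
    failed: List[str] = []
    if m.stability < MIN_STABILITY:
        failed.append("stability")
    if m.drop_rate_pct > MAX_DROP_RATE_PCT:
        failed.append("drop_rate")
    if m.delivery_rate_pct < MIN_DELIVERY_RATE_PCT:
        failed.append("delivery_rate")
    if m.errors > MAX_ERRORS:
        failed.append("errors")
    return failed
```

A run passes this check when `failed_criteria(...)` returns an empty list; returning the failed names rather than a bare boolean makes it easier to see which threshold to fix.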

Submit your run

Submissions go through the chini-bench CLI. It calls your model with your key, scores the result locally, and posts to the leaderboard. Nothing leaves your machine except the canvas it produces.

End-to-end:

pip install git+https://github.com/collapseindex/chini-bench-cli.git
export OPENROUTER_API_KEY=...

chini-bench run chini-012-energy-drink-habit \
  --provider openrouter --model google/gemini-2.0-flash-001 \
  --as alice

Or inspect the prompt first:

chini-bench prompt chini-012-energy-drink-habit

Providers: openai · anthropic · google · openrouter · ollama

Leaderboard

Rank	Submitter	Model	Score	Stability	Delivery	Design	Pass
#1	alex	openai/gpt-5.4 default single-shot	92	83.0	100.0	100.0	✓
#2	rl_v06_run2	rl_policy custom single-shot	89	81.0	96.0	60.0	✓
#3	rl_v06_run2	rl_policy custom single-shot	88	80.0	92.0	100.0	✓
#4	rl_v06_run2	rl_policy custom single-shot	87	73.0	100.0	75.0	✓
#5	rl_v06_run2	rl_policy custom single-shot	87	73.0	100.0	60.0	✓
#6	rl_v06_run2	rl_policy custom single-shot	87	80.0	89.0	60.0	✓
#7	rl_v06_run2	rl_policy custom single-shot	86	83.0	100.0	85.0	✗
#8	rl_v06_run2	rl_policy custom single-shot	85	72.0	100.0	75.0	✗
#9	rl_v06_run1	rl_policy custom single-shot	84	80.0	89.0	60.0	✗
#10	rl_v06_run2	rl_policy custom single-shot	84	68.0	100.0	85.0	✓
#11	rl_v06_run2	rl_policy custom single-shot	84	67.0	100.0	60.0	✓
#12	rl_v06_run2	rl_policy custom single-shot	84	79.0	83.0	60.0	✗
#13	alex	google/gemini-3.1-pro-preview default single-shot	83	66.0	100.0	100.0	✓
#14	rl_v06_run2	rl_policy custom single-shot	83	74.0	100.0	60.0	✗
#15	rl_v06_run2	rl_policy custom single-shot	83	71.0	90.0	75.0	✓
#16	rl_v06_run2	rl_policy custom single-shot	82	66.0	96.0	60.0	✓
#17	rl_v06_run2	rl_policy custom single-shot	82	69.0	93.0	60.0	✓
#18	rl_v06_run2	rl_policy custom single-shot	81	76.0	78.0	60.0	✗
#19	rl_v06_run2	rl_policy custom single-shot	81	75.0	77.0	60.0	✗
#20	rl_v06_run2	rl_policy custom single-shot	81	69.0	100.0	100.0	✗
#21	rl_v06_run2	rl_policy custom single-shot	81	73.0	83.0	100.0	✓
#22	rl_v06_run2	rl_policy custom single-shot	81	69.0	87.0	75.0	✓
#23	rl_v06_run2	rl_policy custom single-shot	80	71.0	82.0	75.0	✓
#24	rl_v06_run1	rl_policy custom single-shot	79	77.0	69.0	60.0	✗
#25	rl_v06_run1	rl_policy custom single-shot	78	83.0	92.0	50.0	✗
#26	rl_v06_run2	rl_policy custom single-shot	78	79.0	100.0	50.0	✗
#27	rl_v06_run1	rl_policy custom single-shot	77	53.0	100.0	75.0	✗
#28	rl_v06_run2	rl_policy custom single-shot	77	58.0	92.0	60.0	✗
#29	rl_v06_run2	rl_policy custom single-shot	77	84.0	62.0	60.0	✗
#30	rl_v06_run2	rl_policy custom single-shot	76	82.0	88.0	50.0	✗
#31	rl_v06_run2	rl_policy custom single-shot	75	60.0	96.0	60.0	✗
#32	rl_v06_run2	rl_policy custom single-shot	72	61.0	73.0	60.0	✗
#33	rl_v06_run1	rl_policy custom single-shot	71	65.0	61.0	60.0	✗
#34	rl_v06_run2	rl_policy custom single-shot	71	62.0	65.0	60.0	✗
#35	rl_v06_run1	rl_policy custom single-shot	70	68.0	100.0	50.0	✗
#36	rl_v06_run2	rl_policy custom single-shot	70	60.0	68.0	75.0	✗
#37	alex	google/gemini-3.1-pro-preview default reflexion	67	61.0	66.0	100.0	✗
#38	rl_v06_run1	rl_policy custom single-shot	66	43.0	100.0	75.0	✗
#39	alex	x-ai/grok-4.20 default reflexion	65	53.0	73.0	100.0	✗
#40	rl_v06_run1	rl_policy custom single-shot	65	58.0	67.0	60.0	✗
#41	alex	openai/gpt-5.4 default reflexion	63	38.0	100.0	100.0	✗
#42	rl_v06_run2	rl_policy custom single-shot	63	61.0	55.0	60.0	✗
#43	rl_v06_run1	rl_policy custom single-shot	59	43.0	57.0	75.0	✗
#44	alex	x-ai/grok-4.20 default single-shot	58	19.0	94.0	100.0	✗
#45	rl_v06_run1	rl_policy custom single-shot	58	46.0	50.0	75.0	✗
#46	rl_v06_run2	rl_policy custom single-shot	57	47.0	45.0	60.0	✗
#47	rl_v06_run1	rl_policy custom single-shot	55	32.0	78.0	60.0	✗
#48	rl_v06_run1	rl_policy custom single-shot	53	43.0	51.0	100.0	✗
#49	rl_v06_run2	rl_policy custom single-shot	51	47.0	38.0	100.0	✗
#50	rl_v06_run2	rl_policy custom single-shot	50	38.0	35.0	60.0	✗
#51	rl_v06_run2	rl_policy custom single-shot	49	1.0	96.0	75.0	✗
#52	rl_v06_run1	rl_policy custom single-shot	40	0.0	100.0	60.0	✗
#53	rl_v06_run2	rl_policy custom single-shot	39	27.0	18.0	85.0	✗
#54	alex	anthropic/claude-sonnet-4.6 default single-shot	29	18.0	0.0	75.0	✗
#55	alex	anthropic/claude-sonnet-4.6 default reflexion	14	0.0	0.0	75.0	✗

Per-scenario breakdown of the top run

Scenario	Health	Drop rate	Delivered	Pass
baseline	83.0	0.8%	372	✓
stress-day	85.0	1.6%	1274	✓
substitute-out	82.0	0.0%	192	✓
delayed-relief	83.0	0.8%	372	✓
