chini-020-disaster-shelter

Disaster Shelter Intake

500 evacuees in 12 hours, finite cots, dietary restrictions, medical needs, families that must not be split.

Source: FEMA shelter operations, Red Cross intake protocols, post-hurricane lessons learned

Prompt

Design the intake and resource allocation flow for a 500-person disaster shelter activated for a hurricane evacuation.

Functional:
- Evacuee arrives at the door. Intake records: family unit, medical needs, dietary restrictions, mobility status, pets.
- Routed to one of 4 sleeping zones: family, single adult, medical (oxygen/insulin/dialysis), accessibility.
- Resources: cots, blankets, meal service (3x daily), medical station, charging stations, pet area.
- Family units cannot be split across zones. Medical-need evacuees get priority for medical zone cots.
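
The routing rule in the functional requirements can be sketched as a small classifier. This is only an illustration, not the benchmark's reference design: the `Evacuee` and `FamilyUnit` types, the zone names, and the choice to send a whole unit to the medical zone when any member has a medical need are all assumptions made for the example. The key idea is that the zone is chosen per unit, never per person, so a family cannot end up in two zones.

```python
from dataclasses import dataclass, field

# Zone names are hypothetical labels for the four zones in the prompt.
ZONES = ("family", "single_adult", "medical", "accessibility")


@dataclass
class Evacuee:
    name: str
    medical_needs: list = field(default_factory=list)  # e.g. ["insulin"]
    mobility_limited: bool = False


@dataclass
class FamilyUnit:
    members: list  # list of Evacuee; a lone adult is a unit of one


def route_unit(unit: FamilyUnit) -> str:
    """Pick ONE zone for the whole unit so the unit is never split."""
    # Assumption: a medical need anywhere in the unit sends everyone
    # to the medical zone together.
    if any(m.medical_needs for m in unit.members):
        return "medical"
    if any(m.mobility_limited for m in unit.members):
        return "accessibility"
    if len(unit.members) > 1:
        return "family"
    return "single_adult"


if __name__ == "__main__":
    family = FamilyUnit([Evacuee("parent"), Evacuee("child", medical_needs=["insulin"])])
    print(route_unit(family))  # medical
```

Routing the whole family with its medical-need member consumes more medical-zone cots, which is one reason the prompt also requires overflow conversion.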

Non-functional:
- A late surge (arrivals at 4x the normal rate in the last 4 hours before the storm hits) must NOT cause families to be split or medical evacuees to be turned away.
- If meal service runs short on a dietary restriction (kosher, halal, allergen-free), the system must source from a neighboring shelter or document the gap, NOT serve a non-compliant meal.
- If the medical zone hits capacity, the scheduler must convert overflow space rather than turn away an insulin-dependent evacuee.
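
The dietary rule above amounts to a filter with two fallbacks and no path that hands out a non-compliant meal. A minimal sketch, assuming stock is tracked as simple per-restriction counters (the `serve_meal` function, its arguments, and the return labels are hypothetical, not part of the benchmark):

```python
def serve_meal(restriction, stock, neighbor_stock, gap_log):
    """Return where a compliant meal came from, or None if the gap was documented.

    stock / neighbor_stock: dicts mapping restriction -> meals available (assumed shape).
    gap_log: list that records every unmet request.
    """
    if stock.get(restriction, 0) > 0:
        stock[restriction] -= 1
        return "local"
    if neighbor_stock.get(restriction, 0) > 0:
        neighbor_stock[restriction] -= 1
        return "neighbor"
    # No compliant meal anywhere: document the gap instead of substituting.
    gap_log.append(restriction)
    return None


if __name__ == "__main__":
    stock, neighbor, gaps = {"halal": 1}, {"halal": 1}, []
    print([serve_meal("halal", stock, neighbor, gaps) for _ in range(3)])
    # ['local', 'neighbor', None]
    print(gaps)  # ['halal']
```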

Return a Chinilla CanvasState. Components: intake desk, classifier, zones, meal service, medical station, overflow logic. Behaviors: split (zone routing), filter (dietary check), ratelimit (zone capacity), circuitbreaker (overflow conversion), queue (cot wait), batch (meal cadence).
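
The circuitbreaker behavior (overflow conversion) listed above can be pictured roughly as follows. This is a loose sketch in the circuit-breaker spirit, not Chinilla's actual behavior semantics, which this page does not describe: once primary medical cots are exhausted, the zone trips and admits into converted overflow space. The `MedicalZone` class, its cot counts, and the `"escalate"` outcome are assumptions for illustration.

```python
class MedicalZone:
    def __init__(self, cots, overflow_cots):
        self.free = cots
        self.overflow_free = overflow_cots  # convertible space (hypothetical figure)
        self.tripped = False                # True once overflow has been opened

    def admit(self):
        """Admit one medical-need evacuee and return where they were placed."""
        if self.free > 0:
            self.free -= 1
            return "medical"
        if self.overflow_free > 0:
            self.tripped = True
            self.overflow_free -= 1
            return "medical-overflow"
        # Assumption: even with overflow full, the evacuee is handed to
        # staff rather than silently refused.
        return "escalate"


if __name__ == "__main__":
    zone = MedicalZone(cots=1, overflow_cots=1)
    print(zone.admit(), zone.admit(), zone.admit())
    # medical medical-overflow escalate
```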

Constraints

Max components: 14
Required behaviors: split, filter, circuitbreaker
Monthly budget: $180000
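
Before submitting, a design could be checked against the two structural constraints above. The CanvasState schema is not shown on this page, so the sketch below assumes a deliberately simple stand-in: a list of dicts with `name` and `behavior` keys. The budget is not checked because the cost model is not described here.

```python
MAX_COMPONENTS = 14
REQUIRED_BEHAVIORS = {"split", "filter", "circuitbreaker"}


def check_constraints(components):
    """Return a list of constraint violations (empty list means OK).

    components: list of {"name": str, "behavior": str or None} dicts.
    This shape is an assumption, not the real CanvasState format.
    """
    problems = []
    if len(components) > MAX_COMPONENTS:
        problems.append(f"too many components: {len(components)} > {MAX_COMPONENTS}")
    used = {c.get("behavior") for c in components}
    missing = REQUIRED_BEHAVIORS - used
    if missing:
        problems.append("missing behaviors: " + ", ".join(sorted(missing)))
    return problems


if __name__ == "__main__":
    canvas = [
        {"name": "classifier", "behavior": "split"},
        {"name": "dietary check", "behavior": "filter"},
    ]
    print(check_constraints(canvas))  # ['missing behaviors: circuitbreaker']
```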

Stress scenarios

Steady arrivals

baseline

Normal evacuee flow over 12 hours, mixed needs.

Pre-landfall surge

spike

Arrivals 4x in the final hours. Families must not be split, medical must not be turned away.

Medical zone full

outage

Medical zone at capacity. Overflow must be converted, not refused.

Halal meals short

outage

Dietary restriction supply low. Must source externally or document, not serve non-compliant.
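
To get a feel for the pre-landfall surge, here is a back-of-the-envelope calculation. It assumes one particular reading of the prompt: 500 evacuees in total over 12 hours, with a constant base rate for the first 8 hours and 4x that rate for the last 4. The simulator's actual arrival volumes are not specified on this page and may differ.

```python
TOTAL_EVACUEES = 500
WINDOW_HOURS = 12
SURGE_HOURS = 4
SURGE_MULTIPLIER = 4

normal_hours = WINDOW_HOURS - SURGE_HOURS
# total = base * normal_hours + (multiplier * base) * surge_hours
base_rate = TOTAL_EVACUEES / (normal_hours + SURGE_MULTIPLIER * SURGE_HOURS)
surge_rate = SURGE_MULTIPLIER * base_rate
surge_arrivals = surge_rate * SURGE_HOURS

print(f"base rate:  {base_rate:.1f} evacuees/hour")   # 20.8
print(f"surge rate: {surge_rate:.1f} evacuees/hour")  # 83.3
print(f"arrivals during surge: {surge_arrivals:.0f} of {TOTAL_EVACUEES}")  # 333 of 500
```

Under this reading, about two thirds of all evacuees arrive in the final 4 hours, which is why zone capacity and overflow logic have to be sized for the surge rather than the average.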

Pass criteria (overall)

Min stability score: 60
Max drop rate: 8.0%
Min delivery rate: 88.0%
Max errors: 7
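
Read literally, the four thresholds above combine into a single check like the one below. This is a sketch of the stated criteria only. The `RunMetrics` type is hypothetical, and the real scorer may apply further checks: the leaderboard does not show drop rate or error count, and the per-scenario table carries its own Pass column.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    stability: float      # stability score, 0-100
    drop_rate: float      # percent of requests dropped
    delivery_rate: float  # percent of requests delivered
    errors: int


def passes(m: RunMetrics) -> bool:
    """Apply the overall pass criteria exactly as listed."""
    return (
        m.stability >= 60
        and m.drop_rate <= 8.0
        and m.delivery_rate >= 88.0
        and m.errors <= 7
    )


if __name__ == "__main__":
    # Illustrative numbers, not taken from any real run.
    print(passes(RunMetrics(stability=79.0, drop_rate=1.0, delivery_rate=96.0, errors=0)))  # True
    print(passes(RunMetrics(stability=78.0, drop_rate=1.0, delivery_rate=81.0, errors=0)))  # False
```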

Submit your run

Submissions go through the chini-bench CLI. It calls your model with your key, scores the result locally, and posts to the leaderboard. Nothing leaves your machine except the canvas it produces.

End-to-end:

pip install git+https://github.com/collapseindex/chini-bench-cli.git
export OPENROUTER_API_KEY=...

chini-bench run chini-020-disaster-shelter \
  --provider openrouter --model google/gemini-2.0-flash-001 \
  --as alice

Or inspect the prompt first:

chini-bench prompt chini-020-disaster-shelter

Providers: openai · anthropic · google · openrouter · ollama

Leaderboard

Rank	Submitter	Model	Score	Stability	Delivery	Design	Pass
#1	rl_v06_run1	rl_policy custom single-shot	88	80.0	94.0	60.0	✗
#2	rl_v06_run2	rl_policy custom single-shot	88	79.0	96.0	85.0	✓
#3	rl_v06_run1	rl_policy custom single-shot	87	74.0	100.0	60.0	✓
#4	rl_v06_run1	rl_policy custom single-shot	87	80.0	89.0	60.0	✗
#5	rl_v06_run1	rl_policy custom single-shot	86	71.0	100.0	60.0	✓
#6	rl_v06_run2	rl_policy custom single-shot	86	80.0	100.0	85.0	✗
#7	rl_v06_run1	rl_policy custom single-shot	85	77.0	88.0	85.0	✗
#8	rl_v06_run1	rl_policy custom single-shot	84	73.0	98.0	85.0	✗
#9	rl_v06_run2	rl_policy custom single-shot	84	79.0	95.0	85.0	✗
#10	rl_v06_run2	rl_policy custom single-shot	84	80.0	93.0	60.0	✗
#11	alex	openai/gpt-5.4 default single-shot	83	78.0	81.0	100.0	✗
#12	alex	google/gemini-3.1-pro-preview default single-shot	83	71.0	91.0	100.0	✗
#13	rl_v06_run2	rl_policy custom single-shot	83	73.0	100.0	85.0	✗
#14	rl_v06_run2	rl_policy custom single-shot	83	80.0	89.0	60.0	✗
#15	rl_v06_run1	rl_policy custom single-shot	82	80.0	87.0	85.0	✗
#16	rl_v06_run2	rl_policy custom single-shot	82	74.0	97.0	60.0	✗
#17	rl_v06_run2	rl_policy custom single-shot	82	79.0	88.0	60.0	✗
#18	alex	anthropic/claude-sonnet-4.6 default single-shot	81	80.0	70.0	100.0	✗
#19	rl_v06_run2	rl_policy custom single-shot	81	67.0	96.0	60.0	✗
#20	rl_v06_run2	rl_policy custom single-shot	81	70.0	100.0	60.0	✗
#21	rl_v06_run2	rl_policy custom single-shot	81	77.0	87.0	60.0	✗
#22	rl_v06_run1	rl_policy custom single-shot	80	77.0	72.0	60.0	✗
#23	rl_v06_run2	rl_policy custom single-shot	80	80.0	78.0	60.0	✗
#24	rl_v06_run1	rl_policy custom single-shot	79	76.0	69.0	70.0	✗
#25	rl_v06_run2	rl_policy custom single-shot	79	66.0	86.0	85.0	✗
#26	rl_v06_run2	rl_policy custom single-shot	79	78.0	68.0	85.0	✗
#27	rl_v06_run2	rl_policy custom single-shot	79	78.0	66.0	70.0	✗
#28	rl_v06_run2	rl_policy custom single-shot	78	64.0	86.0	75.0	✗
#29	rl_v06_run2	rl_policy custom single-shot	77	79.0	72.0	60.0	✗
#30	rl_v06_run2	rl_policy custom single-shot	77	54.0	100.0	60.0	✗
#31	rl_v06_run1	rl_policy custom single-shot	76	70.0	87.0	85.0	✗
#32	rl_v06_run1	rl_policy custom single-shot	76	54.0	98.0	60.0	✗
#33	rl_v06_run1	rl_policy custom single-shot	76	73.0	94.0	50.0	✗
#34	rl_v06_run1	rl_policy custom single-shot	76	66.0	98.0	85.0	✗
#35	rl_v06_run1	rl_policy custom single-shot	76	77.0	71.0	60.0	✗
#36	rl_v06_run2	rl_policy custom single-shot	75	71.0	77.0	85.0	✗
#37	rl_v06_run1	rl_policy custom single-shot	74	78.0	51.0	60.0	✗
#38	rl_v06_run2	rl_policy custom single-shot	74	57.0	97.0	60.0	✗
#39	rl_v06_run2	rl_policy custom single-shot	73	69.0	75.0	85.0	✗
#40	rl_v06_run1	rl_policy custom single-shot	72	66.0	100.0	50.0	✗
#41	rl_v06_run1	rl_policy custom single-shot	71	68.0	100.0	50.0	✗
#42	rl_v06_run1	rl_policy custom single-shot	69	65.0	67.0	85.0	✗
#43	rl_v06_run2	rl_policy custom single-shot	69	65.0	67.0	60.0	✗
#44	rl_v06_run2	rl_policy custom single-shot	67	63.0	63.0	60.0	✗
#45	alex	google/gemini-3.1-pro-preview default reflexion	66	63.0	61.0	100.0	✗
#46	rl_v06_run2	rl_policy custom single-shot	65	57.0	66.0	60.0	✗
#47	rl_v06_run2	rl_policy custom single-shot	65	57.0	55.0	60.0	✗
#48	rl_v06_run1	rl_policy custom single-shot	64	71.0	67.0	50.0	✗
#49	rl_v06_run2	rl_policy custom single-shot	64	43.0	75.0	85.0	✗
#50	alex	x-ai/grok-4.20 default reflexion	63	58.0	46.0	100.0	✗
#51	rl_v06_run2	rl_policy custom single-shot	61	60.0	50.0	85.0	✗
#52	rl_v06_run2	rl_policy custom single-shot	61	47.0	58.0	60.0	✗
#53	rl_v06_run1	rl_policy custom single-shot	59	21.0	94.0	100.0	✗
#54	rl_v06_run2	rl_policy custom single-shot	59	72.0	23.0	60.0	✗
#55	alex	openai/gpt-5.4 default reflexion	57	26.0	100.0	100.0	✗
#56	rl_v06_run1	rl_policy custom single-shot	56	38.0	57.0	100.0	✗
#57	rl_v06_run2	rl_policy custom single-shot	56	62.0	15.0	60.0	✗
#58	rl_v06_run2	rl_policy custom single-shot	56	63.0	26.0	60.0	✗
#59	alex	x-ai/grok-4.20 default single-shot	55	73.0	0.0	75.0	✗
#60	rl_v06_run1	rl_policy custom single-shot	55	17.0	100.0	60.0	✗
#61	rl_v06_run2	rl_policy custom single-shot	54	44.0	53.0	85.0	✗
#62	rl_v06_run2	rl_policy custom single-shot	54	39.0	48.0	100.0	✗
#63	rl_v06_run2	rl_policy custom single-shot	50	49.0	19.0	100.0	✗
#64	rl_v06_run2	rl_policy custom single-shot	49	52.0	10.0	70.0	✗
#65	rl_v06_run2	rl_policy custom single-shot	47	15.0	64.0	60.0	✗
#66	rl_v06_run2	rl_policy custom single-shot	46	22.0	64.0	75.0	✗
#67	alex	anthropic/claude-sonnet-4.6 default reflexion	44	0.0	100.0	100.0	✗
#68	rl_v06_run1	rl_policy custom single-shot	44	0.0	91.0	85.0	✗
#69	rl_v06_run2	rl_policy custom single-shot	43	15.0	52.0	100.0	✗
#70	rl_v06_run1	rl_policy custom single-shot	42	21.0	82.0	60.0	✗

Per-scenario breakdown of the top run

Scenario	Health	Drop rate	Delivered	Pass
baseline	84.0	1.0%	519	✓
late-surge	83.0	1.3%	1894	✓
medical-overflow	76.0	0.0%	308	✓
meal-shortfall	76.0	0.0%	308	✗
