chini-022-phishing-funnel
Phishing Defense Funnel
10,000 emails an hour. One of them is the spear-phish that gets the CFO's credentials. Find it.
Source: Enterprise email security, anti-phishing playbooks, the eternal war between SOC teams and attackers
Prompt
Design the inbound email defense pipeline for a 5,000-employee company.
Functional:
- Inbound email hits an MX gateway. Pre-filter: SPF/DKIM/DMARC, reputation, known-bad attachment hashes.
- Surviving mail goes through a sandbox detonation layer for attachments and link unfurling. Verdict: clean, suspicious, malicious.
- Suspicious mail routes to a quarantine with user-visible release option (with warning banner). Malicious mail dropped, alert raised.
- VIP accounts (executives, finance) get an extra anomaly check (sender history deviation, unusual urgency phrasing).
Non-functional:
- A 4x volumetric campaign must NOT cause clean business mail to be delayed beyond 2 minutes end-to-end.
- A targeted spear-phish to the CFO must trigger the VIP anomaly path even when SPF/DKIM pass and the sender domain looks legitimate.
- If the sandbox is overloaded, attachments must be held in a soft-quarantine, NOT delivered without scanning to keep latency down.
Return a Chinilla CanvasState.
Components: MX gateway, pre-filter, sandbox, quarantine, VIP anomaly path, alerting.
Behaviors: filter (reputation/auth checks), split (clean/quarantine/drop routing), ratelimit (sandbox capacity), circuitbreaker (sandbox overload soft-quarantine), batch (SOC alerting cadence).
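The split routing and the VIP anomaly path described in the prompt can be sketched in plain Python. This is only an illustration of the decision logic, not the Chinilla CanvasState format, whose schema is not shown here. The phrase list, weights, and threshold are hypothetical, and sending a VIP anomaly hit to quarantine is an assumption, since the prompt does not say what the anomaly path does with a hit.

```python
from enum import Enum
from typing import AbstractSet, Mapping


class Verdict(Enum):
    CLEAN = "clean"
    SUSPICIOUS = "suspicious"
    MALICIOUS = "malicious"


class Action(Enum):
    DELIVER = "deliver"
    QUARANTINE = "quarantine"          # user-releasable, with warning banner
    DROP_AND_ALERT = "drop_and_alert"


# Hypothetical phrase list, weights, and threshold, chosen only for illustration.
URGENCY_TERMS = ("urgent", "immediately", "wire transfer", "asap", "today")
NOVELTY_WEIGHT = 0.6
URGENCY_WEIGHT = 0.4
ANOMALY_THRESHOLD = 0.7


def anomaly_score(sender: str, body: str, history: Mapping[str, int]) -> float:
    """Score in [0, 1]. `history` maps sender -> prior message count to this VIP."""
    novelty = 1.0 / (1 + history.get(sender, 0))   # 1.0 for a never-seen sender
    text = body.lower()
    hits = sum(term in text for term in URGENCY_TERMS)
    urgency = min(hits / 2, 1.0)                   # two or more hits saturate
    return NOVELTY_WEIGHT * novelty + URGENCY_WEIGHT * urgency


def route(verdict: Verdict, sender: str, recipient: str, body: str,
          vips: AbstractSet[str], history: Mapping[str, int]) -> Action:
    """Route mail that already survived the pre-filter and has a sandbox verdict."""
    if verdict is Verdict.MALICIOUS:
        return Action.DROP_AND_ALERT
    if verdict is Verdict.SUSPICIOUS:
        return Action.QUARANTINE
    # Clean verdict: VIP recipients still get the extra anomaly check.
    # Assumption: an anomaly hit is quarantined rather than dropped.
    if recipient in vips and anomaly_score(sender, body, history) >= ANOMALY_THRESHOLD:
        return Action.QUARANTINE
    return Action.DELIVER
```

With these illustrative weights, an unknown sender using urgency phrasing crosses the threshold, while a long-standing correspondent using the same phrasing does not. That is the balance the CFO spear-phish scenario asks for: catch the targeted message without flagging legitimate executive mail.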
Constraints
- Max components: 13
- Required behaviors: filter, split, circuitbreaker
- Monthly budget: $30,000
Stress scenarios
- Normal mail flow (baseline): Steady inbound volume, mixed clean and spam.
- 4x phishing campaign (adversarial): Volumetric campaign. Block phish, deliver clean mail without delay.
- CFO spear-phish (adversarial): Targeted message passes auth checks. VIP anomaly path must catch it without flagging legit exec mail.
- Sandbox queue full (latency): Detonation layer overloaded. Soft-quarantine must hold, not bypass scanning.
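The "Sandbox queue full" requirement comes down to one invariant: when the sandbox is at capacity, overflow is held, never delivered unscanned. A minimal sketch of that invariant follows, using a bounded queue with an overflow hold. It is a simplified stand-in for a full circuit breaker (no failure counting or half-open state), and the class name, method names, and capacity value are hypothetical.

```python
from collections import deque
from typing import Optional


class SandboxGate:
    """Bounded sandbox queue; overflow goes to soft-quarantine, never to delivery."""

    def __init__(self, max_queue: int) -> None:
        self.max_queue = max_queue
        self.queue: deque = deque()            # waiting for detonation
        self.soft_quarantine: deque = deque()  # held until capacity frees up

    def submit(self, message_id: str) -> str:
        # There is deliberately no branch that returns "deliver" here.
        if len(self.queue) >= self.max_queue:
            self.soft_quarantine.append(message_id)
            return "soft_quarantine"
        self.queue.append(message_id)
        return "queued"

    def complete_one(self) -> Optional[str]:
        """Mark the oldest queued item as scanned and refill from soft-quarantine."""
        done = self.queue.popleft() if self.queue else None
        if self.soft_quarantine and len(self.queue) < self.max_queue:
            self.queue.append(self.soft_quarantine.popleft())
        return done
```

Held messages re-enter the scan queue in arrival order as capacity frees up, so overload costs latency for attachment-bearing mail but never skips a scan.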
Pass criteria (overall)
- Min stability score: 65
- Max drop rate: 5.0%
- Min delivery rate: 92.0%
- Max errors: 5
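Read as a conjunction, the four thresholds above can be checked as follows. This is a sketch of one plausible reading, not the scorer's actual code: it assumes the bounds are inclusive and that a run must satisfy all four, and the function name is hypothetical.

```python
from typing import List


def failed_criteria(stability: float, drop_rate_pct: float,
                    delivery_pct: float, errors: int) -> List[str]:
    """Return the names of the overall pass criteria a run misses (empty = pass)."""
    checks = {
        "stability": stability >= 65,      # min stability score
        "drop_rate": drop_rate_pct <= 5.0, # max drop rate, percent
        "delivery": delivery_pct >= 92.0,  # min delivery rate, percent
        "errors": errors <= 5,             # max errors
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical run: perfect delivery, but stability below the floor.
print(failed_criteria(stability=59.0, drop_rate_pct=1.0, delivery_pct=100.0, errors=0))
```

Under this reading, a stability score of 59.0 fails the run on its own, even with 100% delivery. The drop rate and error count in the example call are made-up values.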
Submit your run
Submissions go through the chini-bench CLI. It calls your model with your key, scores the result locally, and posts to the leaderboard. Nothing leaves your machine except the canvas it produces.
End-to-end:
pip install git+https://github.com/collapseindex/chini-bench-cli.git
export OPENROUTER_API_KEY=...
chini-bench run chini-022-phishing-funnel \
  --provider openrouter --model google/gemini-2.0-flash-001 \
  --as alice --x alice --linkedin alice-builds
Or inspect the prompt first:
chini-bench prompt chini-022-phishing-funnel
Providers: openai · anthropic · google · openrouter · ollama
Leaderboard
| Rank | Submitter | Model | Score | Stability | Delivery | Design | Pass | Links |
|---|---|---|---|---|---|---|---|---|
| #1 | alex default | anthropic/claude-sonnet-4.6 | 92 | 59.0 | 100.0 | 100.0 | ✗ | X |
| #2 | alex default | x-ai/grok-4.20 | 88 | 40.0 | 100.0 | 100.0 | ✗ | X |
| #3 | alex default | google/gemini-3.1-pro-preview | 82 | 21.0 | 87.0 | 100.0 | ✗ | X |
| #4 | alex default | openai/gpt-5.4 | 70 | 25.0 | 25.0 | 100.0 | ✗ | X |
Per-scenario breakdown of the top run
| Scenario | Health | Drop rate | Delivered | Pass |
|---|---|---|---|---|
| baseline | 68.0 | 0.9% | 513 | ✓ |
| campaign | 47.0 | 100.0% | 1978 | ✗ |
| spear-phish | 45.0 | 100.0% | 932 | ✗ |
| sandbox-overload | 74.0 | 0.3% | 483 | ✓ |