Part 4: When Things Go Wrong
On the mountain, things WILL go wrong. Eagles swoop. Landslides happen. Burrows collapse. The question isn’t “will something fail?” It’s “when it fails, does the colony survive?”
4.1 The Hurt Paw
The problem: A chinchilla hurts its front paw. It can’t climb as well. Option A: keep climbing and risk falling to death. Option B: stay on the ground, move slower, but survive.
The solution: Degrade gracefully. Do LESS, but do it SAFELY. Don’t pretend you’re fine when you’re not.
The principle: When part of a system fails, the whole system shouldn’t die. It should reduce functionality to what it can still do reliably.
The real names:
Graceful degradation: When a service is overloaded or partially broken, serve reduced functionality instead of erroring completely.
- Video streaming: Lower quality instead of buffering forever
- E-commerce: Show products but disable recommendations
- Search: Return cached results instead of “service unavailable”
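Here’s a minimal Python sketch of graceful degradation, using a hypothetical recommendations endpoint and the `requests` library: if the call fails or is slow, serve a static fallback list instead of an error page.

```python
import requests  # any HTTP client works; requests is just for illustration

FALLBACK_RECOMMENDATIONS = ["bestsellers", "staff-picks"]  # made-up fallback

def recommendations_for(user_id: str) -> list[str]:
    try:
        resp = requests.get(
            f"https://recs.internal/users/{user_id}",  # hypothetical endpoint
            timeout=0.2,  # fail fast so the rest of the page still renders
        )
        resp.raise_for_status()
        return resp.json()["items"]
    except requests.RequestException:
        # Degrade: show something useful instead of "service unavailable".
        return FALLBACK_RECOMMENDATIONS
```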
Circuit breaker: If a downstream service keeps failing, STOP calling it for a while. Let it recover. Try again later.
- CLOSED (healthy): Requests flow normally
- OPEN (broken): Requests are blocked, return fallback/error immediately
- HALF-OPEN (testing): Allow a few requests through to see if it recovered
This prevents cascading failure, where one broken service causes every service that calls it to also break, which causes every service that calls THEM to break, domino-style.
Like a chinchilla that stops going to the bird feeder after getting shocked 3 times. Tries again after waiting. If shocked again, backs off longer.
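A toy circuit breaker in Python might look like this (the thresholds and timeouts are made-up numbers, and a production version would also need thread safety):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker with the three states described above."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # let a probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self.failures = 0          # success: reset and close the circuit
        self.state = "CLOSED"
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()
```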
Bulkhead: Isolate failures to one section. A ship has bulkheads (walls between compartments). If one compartment floods, the walls contain it: the whole ship doesn’t sink.
- In systems: Use separate thread pools, connection pools, or even separate services for different features. If the payment service has a problem, it shouldn’t affect search.
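One lightweight way to sketch a bulkhead in Python is a bounded semaphore per downstream dependency, so a hung payment service can only tie up its own slots (the pool sizes here are arbitrary):

```python
import threading

# One bounded pool per dependency: payments can exhaust its 5 slots
# without touching the 20 slots reserved for search.
POOLS = {
    "payments": threading.BoundedSemaphore(5),
    "search": threading.BoundedSemaphore(20),
}

def call_with_bulkhead(pool_name, fn, *args, **kwargs):
    sem = POOLS[pool_name]
    if not sem.acquire(timeout=0.05):   # don't queue forever behind a hang
        raise RuntimeError(f"{pool_name} bulkhead is full, rejecting call")
    try:
        return fn(*args, **kwargs)
    finally:
        sem.release()
```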
Instinct: SURVIVE
4.2 The Backup Den
The problem: A storm destroys the chinchilla’s den. It has nowhere to sleep. Winter is coming.
The solution: Always have a backup den, already built, ready to move into.
The principle: Redundancy. Have more than one of everything critical. When the primary fails, switch to the backup.
The real names:
Failover: When the primary system fails, a backup takes over.
- Active-passive: Backup sits idle, ready to activate. Fast switchover, but wasted resources while idle.
- Active-active: Both systems run simultaneously, sharing load. If one dies, the other absorbs the traffic. More efficient, but more complex (both must stay in sync).
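As a rough sketch of the active-passive case (the health-check endpoints are hypothetical, and real failover usually lives in a load balancer or cluster manager rather than application code):

```python
import requests  # assumed HTTP client

PRIMARY = "https://db-primary.internal"   # hypothetical hosts
STANDBY = "https://db-standby.internal"

def healthy(url: str) -> bool:
    try:
        return requests.get(f"{url}/health", timeout=1).ok
    except requests.RequestException:
        return False

def active_endpoint() -> str:
    # Active-passive: prefer the primary while it answers health checks,
    # otherwise fail over to the standby.
    return PRIMARY if healthy(PRIMARY) else STANDBY
```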
Hot, warm, cold standby:
- Hot: Backup is running and in sync. Failover in seconds.
- Warm: Backup is running but not fully in sync. Failover in minutes.
- Cold: Backup hardware exists but isn’t running. Failover in hours.
RTO and RPO:
- Recovery Time Objective (RTO): “How fast must we recover?” 5 seconds? 1 hour? 1 day?
- Recovery Point Objective (RPO): “How much data can we afford to lose?” Zero? 5 minutes? 1 hour?
A bank needs RTO of seconds and RPO of zero. A personal blog could tolerate RTO of hours and RPO of a day.
Active-passive: backup chinchilla sleeps until needed. Active-active: both chinchillas work, if one naps the other covers.
Tradeoff: More redundancy = more cost. Hot standby for everything is expensive. Choose based on how critical each component is.
Instinct: SURVIVE + REMEMBER
4.3 The Canary Seed
The problem: A chinchilla finds a new food source. Is it safe? It doesn’t know. Eating all of it could be fatal. Eating none means missing out.
The solution: Eat ONE piece first. Wait. Feel fine? Eat more. Feel sick? Stop immediately. You only lost one seed’s worth of risk.
The principle: Test changes on a small subset before rolling them out to everyone. Limit the blast radius of a bad change.
The real names:
Canary deployment: Route 1-5% of traffic to the new version. Monitor error rates, latency, and user behavior. If everything looks good, gradually increase to 100%. If anything looks bad, roll back: only 5% of users were affected.
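A canary rollout can be as simple as a deterministic traffic split; this sketch hashes the user id so each user consistently lands on the same version (the percentage and version names are made up):

```python
import hashlib

CANARY_PERCENT = 5  # start small; raise it as metrics stay healthy

def route_version(user_id: str) -> str:
    # Hash the user id so each user lands in the same bucket every request.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < CANARY_PERCENT else "v1-stable"
```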
Blue-green deployment: Run two identical environments. “Blue” is the current live version. “Green” is the new version. Switch traffic from blue to green instantly. If green breaks, switch back to blue in seconds.
Feature flags: Wrap new features in if-statements. Turn them on for specific users (employees first, then beta users, then 10%, then everyone). Turn them off instantly if problems emerge.
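A toy in-process flag check might look like this; the flag table, flag name, and `user` object (with `id` and `is_employee`) are all hypothetical, and real systems fetch flags from a config service so they can be flipped without a deploy:

```python
import hashlib

ROLLOUT = {"new_checkout": {"employees": True, "percent": 10}}  # made-up flag

def flag_enabled(name: str, user) -> bool:
    cfg = ROLLOUT.get(name)
    if cfg is None:
        return False                                   # unknown flag: off
    if cfg["employees"] and getattr(user, "is_employee", False):
        return True                                    # employees see it first
    bucket = int(hashlib.sha256(str(user.id).encode()).hexdigest(), 16) % 100
    return bucket < cfg["percent"]                     # then a stable % of users

# Usage: wrap the new code path in an if-statement.
# if flag_enabled("new_checkout", user):
#     render_new_checkout()
# else:
#     render_old_checkout()
```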
Chaos engineering: Intentionally break things in production to test your resilience. Netflix’s Chaos Monkey randomly kills servers. If your system handles it, it’s truly fault-tolerant. If it doesn’t, you found a bug before your users did.
Canary = taste one seed first. Blue-Green = build a whole new stash, switch instantly if it’s good.
Tradeoff: All of these add deployment complexity. But they prevent the alternative: deploying a bug to 100% of users simultaneously.
Instinct: SURVIVE + PROTECT
4.4 The Squeak That Never Came
The problem: Two chinchillas agree: squeak every 5 minutes as a signal that everything’s OK. One stops squeaking. Is it dead? Asleep? Or did the wind just drown out the squeak?
The principle: In a distributed system, you can’t tell the difference between “the other node is dead” and “the network between us is broken.” Both look the same: silence.
The real names:
Heartbeat: Periodic “I’m alive” messages. If you miss N consecutive heartbeats, assume the node is dead and take action (failover, trigger alert).
Timeout: How long do you wait before declaring failure? Too short = false alarms (you thought it was dead, but the network was just slow). Too long = slow detection (it’s been dead for 5 minutes and you didn’t notice).
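Putting these two ideas together, a minimal monitor records the last heartbeat it saw and suspects the peer is dead after N intervals of silence (the interval and threshold here are arbitrary):

```python
import time

class HeartbeatMonitor:
    """Suspect a peer is dead after `max_missed` heartbeat intervals of silence."""

    def __init__(self, interval=5.0, max_missed=3):
        self.interval = interval
        self.max_missed = max_missed
        self.last_seen = time.monotonic()

    def record_heartbeat(self):
        self.last_seen = time.monotonic()

    def is_suspected_dead(self) -> bool:
        silence = time.monotonic() - self.last_seen
        return silence > self.interval * self.max_missed
```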
Phi Accrual Failure Detector: Instead of a binary alive/dead decision, compute a PROBABILITY that the node is dead, based on the history of heartbeat intervals. If heartbeats normally come every 100ms and you haven’t seen one in 500ms, that’s very suspicious. If they normally vary between 50-200ms, 500ms is less alarming.
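The real detector fits a distribution to the observed heartbeat gaps; this deliberately simplified sketch assumes the gaps are roughly exponentially distributed, which keeps the math to one line:

```python
import math

class SimplePhiDetector:
    """Simplified phi accrual: phi = -log10 P(silence this long), assuming
    heartbeat gaps are exponentially distributed around their recent mean."""

    def __init__(self):
        self.gaps = []            # recent inter-arrival times, in seconds
        self.last_arrival = None

    def heartbeat(self, now: float):
        if self.last_arrival is not None:
            self.gaps.append(now - self.last_arrival)
            self.gaps = self.gaps[-100:]       # sliding window of history
        self.last_arrival = now

    def phi(self, now: float) -> float:
        if not self.gaps or self.last_arrival is None:
            return 0.0
        mean_gap = sum(self.gaps) / len(self.gaps)
        silence = now - self.last_arrival
        p_still_fine = math.exp(-silence / mean_gap)   # P(gap > silence)
        return -math.log10(max(p_still_fine, 1e-12))

# The caller picks a threshold, e.g. treat phi > 8 as "almost certainly dead".
```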
Split brain: The scariest failure. Network splits in half. Both halves think the OTHER half is dead. Both elect a new leader. Now you have TWO leaders accepting writes. When the network heals, the data has diverged. This is why consensus algorithms exist: they prevent split-brain by requiring a majority (quorum).
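The quorum rule itself is tiny: a side of the split may only act if it can still see a strict majority of the cluster, so at most one side can ever win.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    # A strict majority means at most one side of a network split qualifies.
    return reachable_nodes >= cluster_size // 2 + 1

# A 5-node cluster split 3/2: only the 3-node side keeps quorum, so only
# one leader can be elected and writes never diverge.
print(has_quorum(3, 5), has_quorum(2, 5))   # True False
```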
Instinct: SURVIVE + AGREE
4.5 Don’t Overreact
The problem: A leaf falls and hits the chinchilla. The chinchilla FREAKS OUT, bolts to another ledge, drops its seed, alerts the whole colony. It was just a leaf.
The solution: Don’t react to every tiny signal. Wait for confirmation. React proportionally. Scale the response to the actual threat.
The principle: Systems that react too quickly to noise cause more damage than the noise itself. Stability requires deliberate slowness.
The real names:
Debouncing: Don’t react until the signal is stable. “Wait 300ms after the user stops typing before firing the search query”: catches the complete word, ignores partial keystrokes.
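A small Python debounce decorator (the 300ms delay and the search function are just for illustration):

```python
import threading

def debounce(wait_seconds: float):
    """Run the wrapped function only after `wait_seconds` of quiet."""
    def decorator(fn):
        timer = None
        def wrapper(*args, **kwargs):
            nonlocal timer
            if timer is not None:
                timer.cancel()              # a new event resets the clock
            timer = threading.Timer(wait_seconds, fn, args, kwargs)
            timer.start()
        return wrapper
    return decorator

@debounce(0.3)
def run_search(query: str):
    print(f"searching for {query!r}")

# Three rapid keystrokes: only the last query fires, 300ms after it.
for partial in ("c", "ch", "chinchilla"):
    run_search(partial)
```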
Hysteresis: Create a dead zone between thresholds. Like a thermostat set to 68-72 degrees: it doesn’t constantly cycle the furnace at 70. You have to move SIGNIFICANTLY past the boundary before the system reacts.
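A thermostat with a dead zone, in a few lines (temperatures in °F, matching the example above):

```python
class Thermostat:
    """Hysteresis: the furnace only changes state outside the 68-72°F band."""

    def __init__(self, low=68.0, high=72.0):
        self.low, self.high = low, high
        self.heating = False

    def update(self, temperature: float) -> bool:
        if temperature < self.low:
            self.heating = True        # clearly too cold: turn the furnace on
        elif temperature > self.high:
            self.heating = False       # clearly warm enough: turn it off
        # Inside the band we leave it alone, so it never rapidly cycles at 70.
        return self.heating
```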
Exponential backoff: When retrying a failed operation, don’t retry immediately. Wait 1 second. Then 2. Then 4. Then 8. If the service is overloaded, hammering it with retries makes it WORSE. Exponential backoff gives it time to recover.
Jitter: Add randomness to the backoff. If 1,000 clients all back off to “retry at exactly 8 seconds,” you get a thundering herd at T+8. Add random jitter so they spread between 6-10 seconds.
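Backoff and jitter usually travel together; here is a minimal retry helper (attempt counts and delays are arbitrary):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0):
    """Retry `fn`, doubling the wait each attempt and adding random jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                 # out of retries: give up
            delay = base_delay * (2 ** attempt)       # 1, 2, 4, 8, ...
            delay *= random.uniform(0.5, 1.5)         # jitter: spread the herd
            time.sleep(delay)
```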
Rate limiting: Cap the number of requests a client can make per time window. “Max 100 requests per minute.” Protects the system from abuse, buggy clients, and DDoS attacks.
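One common way to implement this is a token bucket; this single-process sketch allows bursts up to `capacity` and refills at `rate` tokens per second (a real deployment would track one bucket per client, often in a shared store):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False        # over the limit: reject (e.g. HTTP 429) or queue
```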
Don’t hammer. Wait longer each time. Add randomness so 1,000 chinchillas don’t all retry simultaneously.
Tradeoff: Every smoothing mechanism adds latency. You react slower to real threats too. The art is tuning: fast enough to catch real problems, slow enough to ignore noise.
Instinct: SURVIVE + SUSTAIN