Skip to content

Part 2: Where Do the Seeds Go?

A chinchilla’s entire life revolves around one question: where are the seeds? Stash them wrong, and you starve. Stash them right, and you survive winter. Data is your seeds.

The problem: A chinchilla has 100 seeds and 1 burrow. It works fine. Then it has 10,000 seeds. The burrow is full. It can’t fit more. And finding one specific seed in 10,000 takes forever.

The solution: Multiple stashes, organized by some system. Alphabetically. By type. By season. Anything, as long as there’s a system.

The principle: One database can hold some data. At scale, you need to split it.

Vertical scaling (bigger burrow): Get a bigger database server: more RAM, faster CPU, bigger disk. Simple but has a ceiling. You can’t make a burrow infinitely deep.

Horizontal scaling (more burrows): Add more database servers. Each holds a subset of the data. No ceiling in theory, but now you have coordination problems: which seed is in which burrow?

Sharding: The specific technique of splitting data across multiple databases by a key. “Seeds A-M go in burrow 1, N-Z go in burrow 2.” The key choice MATTERS:

  • Shard by user ID? Each user’s data is together (fast for user queries, hard for cross-user analytics)
  • Shard by date? Recent data is together (fast for recent queries, slow for historical)
  • Bad shard key = uneven distribution = one burrow overflowing while others are empty (hot spot)

Tradeoff: Horizontal scaling = more capacity, but cross-shard queries (finding seeds across multiple burrows) are slow and complex.

Instinct: SUSTAIN + ORGANIZE

The problem: 10,000 seeds in a burrow. You need seed #7,382. Do you dig through all 10,000? That’s insane.

The solution: Keep a separate, sorted map: “seed #7,382, three inches from the left wall, seven inches deep.” Now you go directly to it.

The principle: Searching unsorted data is slow (O(n): check every item). A sorted index turns it into a sorted lookup (O(log n): much faster).

Index: A data structure (usually a B-tree) that maps keys to row locations. Like a book’s index: instead of reading every page to find “photosynthesis,” you check the index: “page 247.”

B-tree index: The default. Sorted, balanced, works for range queries (“all seeds between #100 and #200”). Most databases use this.

Hash index: Exact-match only. Like a hash map: O(1) lookups for “give me seed #7,382” but can’t do ranges. Faster for specific lookups, useless for “give me all seeds after #5,000.”

The tradeoff: Every index speeds up READS but slows down WRITES (need to update the index on every insert/update/delete) and uses extra STORAGE (the index itself takes space). Too many indexes = writes are slow. Too few = reads are slow.

The instinct question: “What queries will this table get 90% of the time?” Index THOSE columns.

Instinct: SUSTAIN

The problem: A landslide buries the chinchilla’s seed stash. Every seed is gone. Winter comes. Chinchilla dies.

The solution: Store copies in 3 different spots. If one gets buried, 2 remain.

The principle: If data exists in only one place, it’s one failure away from being gone forever. Copies protect against loss.

Replication: Copying data across multiple machines.

Leader-follower (master-slave): One node is the boss. All writes go to the leader. The leader copies changes to followers. Reads can go to any copy.

  • Pro: Simple. Clear authority. No write conflicts.
  • Con: Leader is a single point of failure. If it dies, you need to elect a new one (failover).

Multi-leader: Multiple nodes can accept writes.

  • Pro: No single point of failure for writes. Can write locally in different regions.
  • Con: CONFLICTS. Two leaders both modify the same seed: who wins? You need conflict resolution rules.

Leaderless (Dynamo-style): Any node can accept reads or writes. Quorum decides truth: “if 2 out of 3 copies agree, that’s the answer.”

  • Pro: Maximum availability. No single point of failure at all.
  • Con: Eventual consistency: copies might temporarily disagree.

Tradeoff: More copies = more durability, but more storage cost and more complexity keeping copies in sync.

Instinct: REMEMBER + SURVIVE

The problem: Seeds are round. Strips of bark are long and thin. Dried plants are oddly shaped. One storage shape doesn’t fit everything.

The solution: Different shapes of stashes for different shapes of things.

The principle: Different data has different access patterns. The structure that’s perfect for one use case is terrible for another.

Relational database (SQL): Structured, tabular, relationships between tables. Like a spreadsheet with rules. Great for: transactions, relationships, complex queries. Examples: PostgreSQL, MySQL.

Document database (NoSQL): Each item is a self-contained document (JSON blob). No rigid schema. Great for: varied data shapes, rapid development, applications where each record is independent. Examples: MongoDB, CouchDB.

Key-value store: Just a lookup table: key in, value out. Simplest possible. Great for: caching, session storage, configuration. Examples: Redis, DynamoDB.

Column-family store: Groups data by column rather than row. Great for: analytics, time-series, write-heavy workloads. Examples: Cassandra, HBase.

Graph database: Optimized for relationships between things. Great for: social networks, recommendation engines, fraud detection. Examples: Neo4j, Amazon Neptune.

Time-series database: Optimized for timestamped data. Great for: metrics, IoT sensors, monitoring. Examples: InfluxDB, TimescaleDB.

How to choose:

  • “I need transactions and relationships” -> SQL
  • “My data shapes vary and I iterate fast” -> Document
  • “I need blazing fast lookups by key” -> Key-value
  • “I have massive write throughput and analytics queries” -> Column-family
  • “Everything is about connections between things” -> Graph
  • “Everything is timestamped observations” -> Time-series

Instinct: ORGANIZE

The problem: A chinchilla is moving 5 seeds from stash A to stash B. After moving 3, a fox charges. The chinchilla bolts. Now there are 3 seeds in stash B and 2 still in stash A. But the chinchilla’s count says “0 in A, 5 in B.” The books don’t balance.

The solution: All 5 move or none do. The move is treated as a single, indivisible action.

The principle: Some operations must be all-or-nothing to maintain correctness.

Atomic (all or nothing): Move all 5 seeds or move none. No partial moves. If anything goes wrong mid-operation, roll it all back.

Consistent (the books balance): Before the move: 5 total seeds. After the move: 5 total seeds. The system’s rules (invariants) are never violated, even temporarily.

Isolated (privacy during the move): While the chinchilla is mid-move, no other chinchilla sees a half-finished state. They either see “5 in A, 0 in B” (before) or “0 in A, 5 in B” (after). Never “2 in A, 3 in B.”

Durable (permanent when done): Once the move is confirmed as complete, it STAYS complete. Even if the mountain shakes. Even if the power goes out. It’s written to disk, not just in memory.

The tradeoff: ACID is expensive. Guaranteeing all four properties requires locks, logs, and coordination: which slows things down. Some systems relax one or more properties for performance:

  • NoSQL databases often trade Consistency for Availability (see: CAP theorem)
  • Eventual consistency = “it’ll be correct… eventually”

Instinct: REMEMBER + AGREE

The problem: The chinchilla wants to stash a seed but gets interrupted mid-dig. When it comes back, it doesn’t know: did the stash complete? Was the burrow half-dug? Is the seed in there or not?

The solution: Before doing anything, write down “I’m ABOUT to stash a seed in burrow 7.” Do the stash. Then write “DONE: seed is in burrow 7.” If you get interrupted, check the diary on restart: if it says “about to” but never “done,” redo it.

The principle: Write your intention before doing the action. If you crash, replay the intention log.

Write-ahead log (WAL): Before modifying data, write the intended change to a log file. If the system crashes, replay the log on recovery to complete or undo partial operations.

This is how databases guarantee durability. PostgreSQL’s WAL. MySQL’s redo log. Kafka’s commit log. Redis’s AOF. Same idea everywhere.

Why this is powerful: The log is append-only (writes go to the end, never modify existing entries). Append-only writes are:

  • Fast (no seeking around the disk)
  • Safe (you never corrupt existing data)
  • Replayable (recovery is just “read the log, redo everything”)

Instinct: REMEMBER