
The Approach
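In distributed systems, traditional debuggers are of little use: you cannot pause and step through a request that crosses several processes and machines. Telemetry (logs, metrics, traces) is usually the only source of clues. The key insight: treat debugging like science. Form hypotheses and try to falsify them.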
What Makes a Good Hypothesis?
A hypothesis must be falsifiable: there must be some observation that would prove it wrong.
Good: "CPU is high because of traffic." → Test: "If I increase traffic and CPU doesn't rise, the hypothesis is wrong."
Bad: "CPU is high because it hasn't run long enough." → "Long enough" is never defined, so no test can ever disprove this.
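The "good" hypothesis can be written down as a check that is able to fail. The sketch below is only an illustration: the function name, the sample format, and the 5-point `min_rise` threshold are assumptions chosen for the example, not part of any real monitoring tool.

```python
from typing import Sequence, Tuple

# Hypothetical sample format: (requests_per_second, cpu_percent)
Sample = Tuple[float, float]


def traffic_hypothesis_falsified(samples: Sequence[Sample],
                                 min_rise: float = 5.0) -> bool:
    """Return True if the data contradicts "CPU is high because of traffic".

    The hypothesis predicts that CPU rises when traffic rises. If CPU at the
    highest load is not at least `min_rise` points above CPU at the lowest
    load, the prediction failed and the hypothesis is falsified.
    """
    if len(samples) < 2:
        raise ValueError("need at least two load levels to test the hypothesis")
    ordered = sorted(samples)  # ascending by requests_per_second
    cpu_at_lowest_load = ordered[0][1]
    cpu_at_highest_load = ordered[-1][1]
    return (cpu_at_highest_load - cpu_at_lowest_load) < min_rise


# Made-up measurements, for illustration only.
flat_cpu = [(100, 82.0), (200, 83.0), (400, 81.5)]
rising_cpu = [(100, 40.0), (200, 62.0), (400, 91.0)]

print(traffic_hypothesis_falsified(flat_cpu))    # CPU ignores traffic
print(traffic_hypothesis_falsified(rising_cpu))  # CPU follows traffic
```

The "bad" hypothesis cannot be expressed this way: with no definition of "long enough", there is no threshold to compare against and therefore no outcome that would return True.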
Real-World Example
System: User → HTTP request → App → Queue → Worker → Database. The 99th percentile processing time is 10 seconds — too slow.
Hypothesis 1: "Slowness is due to queuing technology." Test: Replace queue with in-memory function calls. Result: Still slow. Falsified.
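Hypothesis 2: "Slowness is due to database performance." Test: Replace database with in-memory cache. Result: 99th percentile dropped from 10s to 5s. Not fully solved, but the hypothesis survived the test: the database accounts for about half of the tail latency, and the remaining 5s needs a new hypothesis.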
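The two experiments follow the same pattern: swap one component for an in-memory stand-in, measure the 99th percentile again, and compare it with the baseline. A minimal sketch of that comparison is shown here. The latency lists are synthetic numbers chosen to mirror the example, and the names `p99`, `evaluate`, and the 20% `min_improvement` threshold are assumptions for illustration.

```python
from typing import Dict, Sequence, Union


def p99(latencies: Sequence[float]) -> float:
    """99th percentile using the nearest-rank method."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def evaluate(baseline: Sequence[float], variant: Sequence[float],
             min_improvement: float = 0.2) -> Dict[str, Union[float, str]]:
    """Compare p99 with the suspected component replaced by a stand-in.

    If removing the component does not improve p99 by at least
    `min_improvement` (as a fraction of the baseline), the hypothesis
    "this component causes the slowness" is treated as falsified.
    """
    base, var = p99(baseline), p99(variant)
    improvement = (base - var) / base
    verdict = "survives" if improvement >= min_improvement else "falsified"
    return {"baseline_p99": base, "variant_p99": var, "verdict": verdict}


# Synthetic latencies in seconds (100 requests each), for illustration only.
baseline = [1.0] * 98 + [10.0] * 2
without_queue = [1.0] * 98 + [10.0] * 2    # queue replaced by function calls
without_database = [1.0] * 98 + [5.0] * 2  # database replaced by a cache

print(evaluate(baseline, without_queue))
print(evaluate(baseline, without_database))
```

Deciding the threshold before running the test is what keeps the hypothesis falsifiable; choosing it after seeing the numbers would let any result count as support.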
The Power
This method provides a systematic way to narrow down problems. Every test, whether or not it falsifies the hypothesis, teaches you a fact about the system, and those facts add up until you can trace even a hard-to-find root cause.