This article introduces the author's experience and insights on building stable, practical systems. It argues that good design comes from predictable, 'boring' architectures rather than complex ones, and covers core principles, tools, and real-world practices that are useful when actually building systems. Each key concept is explained in an accessible way with anecdotes, concrete examples, and vivid advice.
1. What Is System Design?
System design is distinct from software design (designing program code structure) -- it's the domain concerned with how to assemble multiple services to complete an overall system. The services here refer to various components such as app servers, databases, caches, queues, event buses, proxies, and more.
"If software design is about how you assemble individual lines of code, system design is about how you assemble services."
The author aims to compile principles, drawn from experience, for building systems that are easy to manage and don't easily run into errors.
2. What Makes Good System Design?
Good design is surprisingly simple and unnoticeable. If a system goes a long time without problems and developers don't need to worry about certain parts because they just work, that's a well-designed system. In fact, the more complex and impressive a structure appears, the more likely it is hiding problems or is over-engineered.
"In practice, good design has a faint presence. Whenever I see an impressive-looking system, my first instinct is suspicion."
Experienced developers find stability in simplicity rather than clever-looking tricks or patterns. Complexity isn't necessarily bad, but you must always start with a simple system and incrementally add complexity.
3. The Difficulty and Principles of Handling State
The most challenging part of system design is handling 'state.' Whenever an app stores information in any form, countless considerations arise about where and how to store and manipulate that state.
Conversely, if you don't store information (i.e., stateless), the design becomes much simpler. For example, a stateless service like GitHub's PDF-to-HTML conversion API can recover from errors simply by restarting the container.
Stateful components should be minimized. If possible, have only one service communicate directly with the database, and connect everything else through API requests or events.
"Don't have five services writing directly to one table. Have one service do the writing, and have the rest send requests to that service."
4. Databases: Core Design Principles
The core of managing system state is, of course, the database. The author offers several practical tips based on real experience with MySQL and PostgreSQL.
(1) Schema and Indexes
- Table design should consider flexibility, but shouldn't be too loose either.
- Maintain a human-readable schema.
- Create indexes matching frequently used query conditions, and prioritize fields with the highest cardinality.
- Too many indexes actually degrade performance.
"If you have more than a few tables, make sure you add indexes."
(2) Bottlenecks and Scalability
- Database access is often the bottleneck for high-traffic systems.
- Use `JOIN` to combine data across tables. Splitting queries isn't always better, but is occasionally necessary.
- Most systems operate with a single write node + read replicas. Read from replicas, and only use the master when real-time synchronization is essential.
- Throttling and rate limiting for write transactions and bulk query requests should also be considered.
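The read-replica rule above can be sketched as a tiny router. This is an illustrative example (the `DBRouter` class is hypothetical; real applications usually do this inside a connection pool or ORM hook): writes and read-your-own-write queries go to the primary, everything else to a replica.

```python
import random

class DBRouter:
    """Route reads to replicas and writes to the single write node."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def connection_for(self, sql, needs_fresh_read=False):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS or needs_fresh_read or not self.replicas:
            # Writes, and reads that must see the latest write immediately,
            # go to the primary; everything else can tolerate replica lag.
            return self.primary
        return random.choice(self.replicas)

router = DBRouter(primary="primary-db", replicas=["replica-1", "replica-2"])
```

The `needs_fresh_read` flag captures the article's caveat: only fall back to the write node when real-time synchronization is essential.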
5. Slow Tasks, Fast Tasks: Background Jobs
System interfaces require fast responses, but some tasks inevitably take a long time. These should be handled by doing the bare minimum immediately and offloading the rest to background jobs.
"For PDF-to-HTML conversion, show the result for just the first page immediately and process the rest in the background."
Background jobs typically consist of queues (e.g., Redis) and job execution services. For scheduled tasks (e.g., run one month later) that can't sit in a queue for long, you can record them in a DB table and process them periodically.
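The DB-table approach to scheduled jobs can be sketched like this (table and function names are illustrative, not from the article): jobs live in a table with a `run_at` timestamp, and a periodic worker picks up whatever is due.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scheduled_jobs ("
    "id INTEGER PRIMARY KEY, run_at REAL, payload TEXT, done INTEGER DEFAULT 0)"
)

def schedule(run_at, payload):
    conn.execute(
        "INSERT INTO scheduled_jobs (run_at, payload) VALUES (?, ?)",
        (run_at, payload),
    )
    conn.commit()

def run_due_jobs(now):
    """Called periodically (e.g. every minute by a cron-like worker)."""
    due = conn.execute(
        "SELECT id, payload FROM scheduled_jobs WHERE done = 0 AND run_at <= ?",
        (now,),
    ).fetchall()
    for job_id, payload in due:
        # ... do the actual work for `payload` here ...
        conn.execute("UPDATE scheduled_jobs SET done = 1 WHERE id = ?", (job_id,))
    conn.commit()
    return [payload for _, payload in due]

now = time.time()
schedule(now - 1, "send-reminder")         # already due
schedule(now + 30 * 24 * 3600, "expire")   # due in a month -- too long for a queue
ran = run_due_jobs(now)
```

Unlike a Redis queue entry, a row can safely wait a month, survives restarts, and is easy to inspect or cancel with plain SQL.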
6. Caching: Only When Absolutely Necessary!
Caching should only be introduced when frequently accessed data is genuinely slow to fetch. Junior engineers tend to want to apply caching everywhere, but in reality, DB indexes and design optimizations are often sufficient.
"Bad caching is the starting point for 'weird state' in a system."
If large-scale data caching is necessary, various strategies can be applied, such as scheduled generation + document storage (e.g., saving results to S3).
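If a cache does turn out to be necessary, even a minimal one should have an explicit expiry so stale entries can't become the 'weird state' the article warns about. A small in-process TTL cache sketch (the `TTLCache` class is hypothetical, not a library API):

```python
import time

class TTLCache:
    """Minimal in-process cache with a time-to-live per entry.
    Only reach for this after indexes and query tuning have failed."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, stored_at)

    def get_or_compute(self, key, compute, now=None):
        now = time.time() if now is None else now
        hit = self.store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                 # fresh enough: serve from cache
        value = compute()                 # expired or missing: recompute
        self.store[key] = (value, now)
        return value

cache = TTLCache(ttl_seconds=60)
calls = []

def expensive_report():
    calls.append(1)                       # track how often we really compute
    return "report"

a = cache.get_or_compute("daily-report", expensive_report, now=0)
b = cache.get_or_compute("daily-report", expensive_report, now=30)   # cache hit
c = cache.get_or_compute("daily-report", expensive_report, now=120)  # expired
```

The explicit `now` parameter is just for deterministic testing; in production you'd rely on the `time.time()` default.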
7. Event Processing and Appropriate Usage
Most tech companies have an event hub (e.g., Kafka). The event hub serves to notify multiple services that "something happened."
- Remember that rather than over-using events, API integration is often more appropriate.
- Events are suitable when the success of downstream processing isn't immediately critical, or when volume is high but real-time requirements are low.
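The decoupling that makes events suitable can be shown with a tiny in-memory stand-in for an event hub (the `EventBus` class is an illustrative sketch, not Kafka's API): the publisher doesn't wait on, or even know about, its consumers, so a failing consumer can't break it.

```python
from collections import defaultdict

class EventBus:
    """Tiny in-memory stand-in for an event hub like Kafka."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        delivered = 0
        for handler in self.subscribers[topic]:
            try:
                handler(event)
                delivered += 1
            except Exception:
                # The publisher doesn't care whether a consumer succeeds --
                # exactly the situation where events (not API calls) fit.
                pass
        return delivered

bus = EventBus()
seen = []
bus.subscribe("user.signed_up", seen.append)
bus.subscribe("user.signed_up", lambda e: 1 / 0)  # a broken consumer
delivered = bus.publish("user.signed_up", {"user_id": 7})
```

If the publisher *did* need to know the downstream work succeeded, that would be the signal to use a direct API call instead.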
8. Data Flow: Pull vs Push
When delivering data to many destinations, choose between Pull (request-based) and Push (server proactively delivers) depending on the situation.
- Push is easier to manage when providing changed data to a small number of services.
- When serving many clients (e.g., Gmail delivering mail), scale out as needed with read-replica servers, event queues, and similar tools.
9. Focus on the 'Hot Path'
Even in complex systems, the 'hot path' (critical path) deserves special attention in design. A single mistake here can bring down the entire service.
"A settings page can be built a thousand different ways and they'll all work about the same. But code that aggregates every user action has far fewer viable options."
10. Logging and Monitoring: Make It Solid
To quickly identify problems, log aggressively on the 'unhappy path' -- ultimately, practicality matters more than elegant code.
- Basic operational metrics like resource usage (CPU, memory), queue sizes, and processing time per request/job are essential.
- Don't overlook upper percentile metrics like p95 and p99, not just averages.
"A slow request or two could be coming from a key customer with serious complaints."
11. Preparing for Failure: Kill Switches and Safe Failures
Parts of the system will inevitably fail, so you must design in advance how those failures will be handled.
- Blindly retrying only increases load, so use the 'circuit breaker' pattern.
- To prevent duplicate execution, include an idempotency key in requests so that the same request can be safely retried.
- When failures occur, choosing between 'fail open' (allow functionality) and 'fail closed' (block) is important.
- Example: Rate limiting should fail open (briefly allow traffic when the limiter fails).
- Authentication should fail closed (deny access rather than risk it).
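The circuit breaker mentioned above can be sketched in a few lines (the `CircuitBreaker` class and its thresholds are illustrative, not from the article): after a few consecutive failures it stops calling the downstream entirely for a cooldown period, instead of retrying blindly and piling on load.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trip after max_failures consecutive
    failures, reject calls until reset_after seconds have passed,
    then allow one trial call (the 'half-open' state)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, fn, now=None):
        now = time.time() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after:
                # Open: fail fast without touching the struggling downstream.
                raise RuntimeError("circuit open: not calling downstream")
            self.opened_at = None  # half-open: let one trial call through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # trip the breaker
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker(max_failures=2, reset_after=30.0)

def flaky():
    raise ConnectionError("downstream is down")

for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(flaky, now=0.0)
    except ConnectionError:
        pass
```

The explicit `now` parameter is only for deterministic testing. Combined with an idempotency key on each request, the eventual retry after the cooldown is safe even if the original request partially succeeded.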
12. In Closing
This article repeatedly emphasizes that good system design means using boring but proven patterns and components in the right places, rather than clever tricks. Especially at large companies where such infrastructure already exists, ordinary system design is what practically protects teams and products -- more so than novel architectures.
"If you chase what's too new and exciting, you might end up with a complete mess. Truly good system design goes unnoticed."
Conclusion
In summary, good system design means a structure composed of components that are as simple as possible, reliable, and faithful to their roles. As experience accumulates, the 'reliability of boring, ordinary combinations' shines brightest -- remember that.