Marc Brooker on Incidents and AI Engineering

This is an interview with Marc Brooker, AWS Distinguished Engineer, in which he shares technical insights drawn from reviewing more than 3,000 cloud-system postmortems and offers a deep take on how software engineering is evolving in the AI era. He covers how to find important problems, the downsides of caching, the value of on-call rotations, and why writing is a superpower for engineers. He also gives candid advice to junior and senior engineers on the skills and adaptation strategies they need right now.

1. How to Find Problems That Matter 🤔

Early in his career, Marc was puzzled by why senior engineers had so much more impact than he did. He came to realize that it wasn't about working more hours or writing more code — what matters far more is the direction of the work, i.e., which problems you choose to solve. He describes three lenses for finding important problems.

Listen to customers: He spends significant time with AWS customers, paying attention to what they still find hard, what they're investing in, and where they don't want to burn time.

"I spend a lot of time with AWS customers listening to what they find still hard in our space, what they're investing in, where they don't want to spend time."
Track technology trends: Watch the pace of change in networking, storage, GPUs, and other layers — those shifts open new possibilities that didn't exist before.
Understand big shifts in the world: Look at the macro picture of how industries and society are changing. Moments of large-scale change are precisely when opportunities arise to build new things and recognize new problems.

Using this approach, while on the Lambda team in 2020, Marc noticed that customers were excited about serverless and container-based development but that relational data didn't fit that paradigm well. That insight led him to join the Aurora team, where he contributed to Aurora Serverless and DSQL — a case where customer need and technology trend aligned perfectly.

2. Lessons from 3,000+ Postmortems 📝

Marc spent 15 years doing on-call rotations, and he credits that experience with giving him genuine, hands-on knowledge of how distributed systems behave. On-call work taught him how systems actually operate, what happens when customers use them in unexpected ways, and how to make systems more robust.

The value of on-call: On-call isn't just repetitive firefighting — it's one of the best ways to understand how a system really works, how it behaves, and how customers actually use it. Repeat issues should be automated away; deep problems should be dug into, the system improved, and the knowledge shared.

"On-call is one of the best ways to learn those things. It's one of the best ways to see how systems actually work, how they actually behave, and how customers are actually using them."
What makes a great postmortem:
1. Deep understanding: Understand every detail of what happened, drawing on logs, metrics, other observability data, and simulations, so you know precisely what occurred.
2. Root-cause analysis: Go beyond the immediate code bug and ask "why" repeatedly through multiple layers. Were there gaps in testing or verification? Were assumptions about system behavior wrong? Find the deeper causes.
3. Multi-dimensional fixes: Pursue not just a tactical fix for the proximate cause, but broad remedies spanning technology, organization, and product. When the same pattern appears across multiple postmortems, build a service or library that eliminates that class of problem at the root.
  
  "A great postmortem identifies not just fixes for the proximate cause but also broader fixes for technology, organization, and product."
AWS's postmortem culture: AWS holds weekly meetings where engineers and leaders review postmortems together and propagate the lessons company-wide. Marc counts this practice among the core reasons for AWS's success, because it builds a deep understanding of how and why systems operate the way they do.
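
One way to spot the "same pattern across multiple postmortems" mentioned above is to tag each postmortem with its contributing factors and count how often each tag recurs. The sketch below is a minimal illustration in Python, not AWS's tooling: the `Postmortem` record, the factor tags, and `recurring_factors` are all hypothetical names invented for this example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical postmortem record; the fields are assumptions for this sketch."""
    title: str
    contributing_factors: list[str] = field(default_factory=list)


def recurring_factors(postmortems, min_count=2):
    """Return (factor, count) pairs seen in at least `min_count` postmortems.

    Each factor is counted once per postmortem, and the result is sorted by
    descending count, then alphabetically, so the output is deterministic.
    """
    counts = Counter(
        factor
        for pm in postmortems
        for factor in set(pm.contributing_factors)
    )
    frequent = [(factor, n) for factor, n in counts.items() if n >= min_count]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    history = [
        Postmortem("Checkout latency spike", ["cold cache", "retry storm"]),
        Postmortem("Search outage", ["retry storm", "missing alarm"]),
        Postmortem("Login errors", ["retry storm", "cold cache"]),
    ]
    for factor, n in recurring_factors(history):
        print(f"{factor}: {n} postmortems")
```

A factor that keeps reappearing at the top of this list is a candidate for the kind of shared service or library that removes the whole class of problem, rather than another one-off fix.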

3. Why Caching Can Be Dangerous 😱

"Just add a cache" is common advice in system design, but Marc argues there are plenty of cases where caching is the wrong call.

The upside and downside of caches: Caches exploit temporal and spatial locality — a fundamental principle in computer science — and are highly effective at speeding up systems and improving scalability. But in a distributed system, a cache has two modes: one where it's working correctly and the system is fast and healthy, and another where the cache is empty or holds stale data, causing the system to slow down or fall over.
The danger of metastable failures: When a cache empties or becomes corrupted, the backend gets hammered by traffic it normally never sees, and customers suffer. This condition is called a metastable failure: the system settles into a broken state that sustains itself, so it struggles to recover on its own even after the original trigger is gone. For example, excess traffic can overload the database or saturate the network, which prevents the cache from being repopulated and keeps the backend overloaded.
Alternatives to caching: To avoid metastable failures, Marc says he prefers to avoid caching wherever possible. Instead, he favors using a complete materialized view of the data or a scalable backend that can handle the required load directly, without a cache. In DSQL, the storage layer itself functions as a complete cache containing every row in the database, eliminating the empty-cache problem. Aurora uses a different approach: the primary continuously tells failover targets what to keep cached, so the cache stays warm through a failover.

"I prefer teams that avoid caching as much as they can. I prefer the pattern of having a complete materialized view of the data."
Frequency and impact: Metastable failures aren't especially common, but when they do occur they tend to cause large-scale outages, long recovery times, and complex incident response. The industry and engineering community need to understand this phenomenon deeply and prepare for it.
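
To make the two modes concrete, here is a toy simulation in Python. It is a sketch under loudly stated assumptions, not a model of any AWS system: cached entries expire at a fixed rate, only successful backend responses refill the cache, and an overloaded backend wastes most of its capacity on requests that time out (the quadratic falloff in `backend_goodput` is an arbitrary choice made for illustration). All names and numbers are hypothetical.

```python
def backend_goodput(misses: float, capacity: float) -> float:
    """Successful backend responses per tick under an assumed overload model."""
    if misses <= capacity:
        return misses
    # Assumption: past saturation, queueing and timeouts waste work, so
    # useful throughput falls off quadratically with the overload factor.
    return capacity * (capacity / misses) ** 2


def simulate(
    initial_hit_ratio: float,
    ticks: int = 200,
    arrivals: float = 1000.0,  # requests per tick (constant offered load)
    capacity: float = 200.0,   # backend requests per tick
    expiry: float = 0.1,       # fraction of cached entries expiring per tick
) -> float:
    """Return the cache hit ratio after `ticks` steps of the toy model."""
    hit_ratio = initial_hit_ratio
    for _ in range(ticks):
        misses = arrivals * (1.0 - hit_ratio)
        # Only requests the backend actually completes can refill the cache.
        refilled = backend_goodput(misses, capacity) / arrivals
        hit_ratio = min(1.0, hit_ratio * (1.0 - expiry) + refilled)
    return hit_ratio


if __name__ == "__main__":
    print(f"warm start:  hit ratio {simulate(0.9):.2f}")
    print(f"after flush: hit ratio {simulate(0.0):.2f}")
```

Under these assumptions the offered load is identical in both runs. Started warm, the hit ratio settles near 0.91 and the backend sees about 90 misses per tick, well under its capacity of 200. Started from an empty cache, it stalls near 0.10: the backend is so overloaded that it cannot complete enough requests to refill the cache, which is the self-sustaining loop described above.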

4. How AI Is Changing Software Engineering 🤖

Marc offers some compelling observations on how AI will reshape the field.

Software remains a world of unlimited opportunity: Software is still supply-constrained, and the opportunities for more software, bigger software, better software, and more personalized software in the world are enormous. AI is changing the economics of software development, and that represents a massive opportunity for software developers to build more.

"We have only begun to see the beginning of the impact software is going to have on the world. There are so many opportunities for more software, bigger software, better software, more personalized software."
Two career paths in the AI era, plus an open frontier:
- "The old way": Like analog circuit design, building software with older techniques and languages will still exist but will increasingly become a niche. Economic opportunity will remain there, but it will shrink in relative terms.
- "The new way" (mainstream): AI-assisted development, agent-based development, and specification-driven development will become the mainstream approach, and this is where the majority of careers and economic opportunity will be created.
- Physical world integration: In domains where software meets the physical world, there will be fascinating open questions about how to apply new techniques and practices.
Advice for junior engineers:
- Understand customers and the business: Understanding customers, the business, the systems, and the economics will matter as much as writing code. These were once senior-engineer concerns; now they'll be required of junior engineers too.
- Problem-solving over pure coding: Roles defined purely by writing code will shrink. The ability to understand customer problems and business context and solve them collaboratively will matter more.
- Deep technical knowledge as leverage: Engineers with deep expertise in specific areas, such as optimization problems, infrastructure, or databases, will find that AI gives them far greater leverage to apply that knowledge. Work that once involved too much friction to be worthwhile is now much more accessible.
- Learning and mentorship: Organizations should not expect new graduates to arrive with all the skills they need. Companies must provide guidelines, mentoring, and feedback so junior engineers can learn new techniques and customer communication skills on the job.
  
  "This requires understanding of customers, understanding of business, understanding of economics, understanding of systems. And this is almost moving out of the role of the senior engineer…"
Advice for senior engineers:
- Get back into the work: Senior engineers shouldn't rely solely on past experience and knowledge. They need to get back to actually building things in order to deeply understand new technologies and the changing nature of software development.
  
  "The thing I think really matters is that you need to go build again. You need to get back into the work."
- Stay curious and use the tools: New tools enable developers to have dramatically more impact — so stay curious, learn them, and use them. If you've been spending your time in meetings trying to look smart, it's time to recover that original instinct to learn, build, solve customer problems, and explore new technology.
- Ground yourself in reality: In the AI era, if you aren't doing the work hands-on, your understanding of new technology can be wildly inaccurate. No matter how impressive your title or track record, intellectual humility and a willingness to learn are non-negotiable.
  
  "Right now, at this moment, if you're not doing it, there's a very high chance your opinion of it is just completely wrong."

5. Why Engineers Should Write ✍️

Marc argues that writing is one of the most powerful things an engineer can do.

Extending your expertise: Writing is a powerful tool for sharing the ideas in your head with the world and extending your expertise across time and space. Building great products and sharing knowledge through mentoring are both valuable, but writing can reach far more people for far longer.

"Writing extends the impact of your expertise in space and time."
Clarifying your thinking: Writing forces a level of mental clarity that talking or making slides simply doesn't. The act of writing compels you to think rigorously and structure your ideas. Thinking through writing is a core part of Amazon's culture, and Marc himself sometimes writes documents that he never shares with anyone, purely to clarify his own thinking.

"Writing forces a different level of mental clarity than speaking, making slide decks, and so on."
Communication and institutional memory:
- Collaboration and historical record: Writing creates important artifacts that document complex technical decisions so that teammates can understand them and future team members can reference them when improving the system.
- Capturing intent: Documenting important technical decisions helps people who come later distinguish between decisions that were carefully considered and ones that were arbitrary. That reduces unnecessary rework and lets people focus their deep thinking where it's actually needed.

6. Balancing Visibility and Expertise ⚖️

In a blog post called "hobbies and apparent expertise," Marc presents a 2×2 matrix of doing vs. discussing and hobby vs. gear, and explores the subtle balance between genuine expertise and external visibility.

The cost of all doing, no talking: An engineer who does nothing but code can accumulate tremendous expertise but struggles to communicate their value to the outside world. There's also a risk: if you keep your head down all the time, you may miss what the truly important problems are and end up working on the wrong things.

"If you've got your head in your IDE all day, there's a reasonably high probability you're doing the wrong things."
The cost of all talking, no doing: On the other side, engineers who focus only on communication and visibility can gain a high profile but may lack real coding ability or technical depth. Their opinions are likely to be disconnected from reality.
The right balance: Marc says that for him, roughly 75–80% doing and 20–25% communicating has been the right balance. Within that range, engineers maintain real expertise while still sharing knowledge and expanding their influence.
Overrated vs. underrated: Asked whether it's better to be overrated or underrated in your career, Marc says that over the long run, being underrated is better. Being overrated can feel good in the moment but isn't sustainable, because reality has a way of catching up with you. He points to athletics and craft work as fields where you simply can't fool yourself about your actual ability.

"In the long run, I think being underrated, if I'm using that term, is better. Being overrated can feel good in the moment but is not sustainable."

7. Engineers He Admires and Book Recommendations 📚

Marc counts it as one of his great fortunes to have worked with so many exceptional people at AWS. He singles out Al Vermeulen as someone he particularly admires: a key contributor to the design of S3 who could hold the fine-grained details of the Paxos paper in mind while simultaneously engaging in high-level conversations about cloud strategy. That combination of depth and breadth is rare and impressive.

For technical reading, he recommends:

Martin Kleppmann's work on distributed systems: Strongly recommended for anyone building distributed systems.
Hennessy and Patterson's Computer Architecture: A Quantitative Approach: A useful reference covering the full breadth of computer architecture.

Marc says he reads technical papers more than books, often using AI tools (Claude included) to summarize a paper before diving into it in depth. He also emphasizes that older papers and textbooks can contain surprising insight. For example, some of the algorithms Lambda uses for traffic management and burst handling trace back to work that the mathematician A. K. Erlang did about a hundred years ago on managing telephone switching centers.
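
The interview does not say which of Erlang's results Lambda builds on. As an illustration of the kind of century-old mathematics Marc is referring to, here is a sketch of Erlang's classic loss formula (Erlang B), which gives the probability that a request is turned away when a given offered load meets a fixed number of servers and there is no queue. The function names and example numbers are assumptions chosen for this sketch, not anything taken from Lambda.

```python
def erlang_b(offered_load: float, servers: int) -> float:
    """Blocking probability for `offered_load` erlangs on `servers` servers.

    Uses the standard recurrence
        B(0) = 1,  B(m) = E * B(m-1) / (m + E * B(m-1)),
    which avoids the large factorials in the closed-form expression.
    """
    if offered_load < 0 or servers < 0:
        raise ValueError("offered_load and servers must be non-negative")
    blocking = 1.0
    for m in range(1, servers + 1):
        blocking = offered_load * blocking / (m + offered_load * blocking)
    return blocking


def min_servers(offered_load: float, target_blocking: float) -> int:
    """Smallest server count whose blocking probability meets the target."""
    if not 0.0 < target_blocking < 1.0:
        raise ValueError("target_blocking must be strictly between 0 and 1")
    servers = 0
    while erlang_b(offered_load, servers) > target_blocking:
        servers += 1
    return servers


if __name__ == "__main__":
    load = 10.0  # offered load in erlangs (arrival rate x mean service time)
    needed = min_servers(load, 0.01)
    print(f"{needed} servers keep blocking at or below 1% for {load} erlangs")
```

With an offered load of 10 erlangs, `min_servers(10.0, 0.01)` returns 18: far more than the 10 servers the average load suggests, which is the headroom that bursty arrivals demand.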

8. Advice to His Younger Self 🚀

If he could go back and give his younger self one piece of advice, Marc says it would be: "Be a little bit bolder." He loved the teams he worked on, but looking back, he thinks he should sometimes have moved to a new team or taken on a new challenge sooner, because staying put meant missing opportunities to learn and grow.

He made about four major organizational moves over his career, but guesses five or six would have been optimal. The key is to continually ask yourself what you're learning, who you're learning from, and whether there's an environment where you could learn and grow faster. Every time he followed his curiosity and made a move, it turned out to be personally satisfying and worthwhile — and that, he says, is ultimately what sustains compounding growth as an engineer.

Closing Thoughts ✨

Marc Brooker offered deep insight into the skills and mindset engineers need in a rapidly changing technological landscape. Above all, he emphasized that the path to a successful engineering career means going beyond writing code — understanding customer problems, building on deep technical knowledge, extending your influence through writing, and maintaining the humility to keep learning and stay in the work. His advice will resonate with any engineer trying to find their footing in the wave of new technology. 🌊
