Runbooks are not checklists
They are how teams turn operational knowledge into repeatable judgment
Great engineering managers don’t just ship—they build organizations capable of shipping. But that meta-skill is rarely named, let alone taught.
Our book Engineering Manager’s Compass focuses on the unspoken rules of the role: how to read organizational structures, how to turn messy metrics into real decisions, and how to build teams that deliver without you holding everything together.
After almost every incident, I hear teams say that they need better runbooks. So we create a page and fill it with commands, a few dashboard links, and a warning in bold. Yet when the next incident comes, we still cannot trust what we’ve written.
That is the core problem: a runbook can exist without capturing operational knowledge.
A useful runbook is not a long checklist of commands to be run. It is a snapshot of what to check first, what to avoid, when to branch, and when to escalate. When done correctly, runbooks should reduce dependence on specific people, lower onboarding risk, and make incident response less fragile.
Start with orientation, not commands
When someone opens a runbook, they often don’t know exactly what they are looking for. They may be on call and tired, or supporting a system they do not normally own. So a runbook should first help the reader understand the context.
A good runbook should try to answer the following questions:
What problem does this page cover?
How do I confirm this is that problem?
What should I try first?
What should I avoid?
When do I escalate, and to whom?
A command like “restart a worker” is not enough. A useful instruction includes conditions and boundaries. For example: restart one worker only if queue depth is increasing, error rate is stable, and no deploy is running. This will allow the reader to make the right decision in moments of stress.
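As a rough illustration, the restart condition above could be written as a small guard. This is only a sketch: the `WorkerSignals` fields and the `may_restart_one_worker` function are hypothetical names, and the values would come from whatever monitoring your team already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSignals:
    """Hypothetical snapshot of the three signals named in the instruction."""
    queue_depth_increasing: bool
    error_rate_stable: bool
    deploy_running: bool


def may_restart_one_worker(signals: WorkerSignals) -> tuple[bool, str]:
    """Return (allowed, reason) so the responder sees why, not just yes or no."""
    if not signals.queue_depth_increasing:
        return False, "queue depth is not increasing"
    if not signals.error_rate_stable:
        return False, "error rate is not stable"
    if signals.deploy_running:
        return False, "a deploy is running"
    return True, "conditions met: restart one worker only"


allowed, reason = may_restart_one_worker(
    WorkerSignals(queue_depth_increasing=True, error_rate_stable=True, deploy_running=True)
)
print(allowed, reason)
```

Returning the reason alongside the decision keeps the boundary visible to a responder who did not write the runbook.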
Encode judgment, not heroics
It happens more often than you would expect: you are in the middle of an outage and, if you’re lucky, you find a runbook that vaguely resembles the issue you’re facing. You’re excited and hopeful. You open it up and you see:
Check queue depth
Restart worker
Escalate if unresolved
This list might as well have said “Resolve the outage.” Yeah, sure. As if we didn’t know that. But why should I check the queue depth? How would restarting the worker help? None of that information is there.
Wouldn’t it be better to come across something like this:
Confirm queue depth has increased for at least 10 minutes
Check for deploys or backfills before restarting anything
Restart one instance first; do not purge queues without customer-impact confirmation
Escalate immediately for payroll, billing, security, privacy, or compliance impact
Move to incident mode if likely cause is still unclear after 10 minutes
This list would allow me to use my judgment instead of shoving me into the depths of hell.
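A step like “queue depth has increased for at least 10 minutes” is precise enough to check mechanically. The sketch below is one possible reading, under the assumption that “increased” means every sample is higher than the one before it. The function name `rising_for_at_least` and the sample format are hypothetical.

```python
from datetime import datetime, timedelta


def rising_for_at_least(samples, minimum=timedelta(minutes=10)):
    """samples: list of (timestamp, queue_depth) tuples, oldest first.

    Walks back from the newest sample while depth keeps rising, then
    checks whether that unbroken rise lasted at least `minimum`.
    """
    if not samples:
        return False
    i = len(samples) - 1
    while i > 0 and samples[i][1] > samples[i - 1][1]:
        i -= 1
    return samples[-1][0] - samples[i][0] >= minimum


start = datetime(2024, 1, 1, 12, 0)  # arbitrary example time
readings = [
    (start, 100),
    (start + timedelta(minutes=6), 180),
    (start + timedelta(minutes=12), 260),
]
print(rising_for_at_least(readings))
```

A team might reasonably prefer a looser definition, such as a net rise over the window, to tolerate noisy samples.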
Add decision points and branches
Operations are rarely linear. A linear checklist can create false confidence, because real incidents always vary. Good runbooks include branches:
If this signal is present, use path A
If mitigation fails, stop and escalate
If impact is internal only, continue with lower-severity path
If regulated or customer-critical data is involved, use high-severity path
These branches do two important things:
They make the escalation path clear
They prevent responders from blindly following steps
Incidents often worsen when responders have to abandon the runbook because the situation is slightly different from the one it describes. Having such decision points embedded in the runbook helps immensely.
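To make the branching concrete, here is a minimal sketch of the four example branches as a single decision function. The `Path` names, the order in which the branches are checked, and the fallback to escalation when nothing matches are all assumptions made for illustration. A real runbook would state its own precedence.

```python
from enum import Enum


class Path(Enum):
    ESCALATE = "stop and escalate"
    HIGH_SEVERITY = "high-severity path"
    LOW_SEVERITY = "lower-severity path"
    PATH_A = "path A"


def choose_path(*, signal_present: bool, mitigation_failed: bool,
                internal_only: bool, regulated_or_customer_critical: bool) -> Path:
    """Pick the first matching branch; the ordering here is an assumption."""
    if mitigation_failed:
        return Path.ESCALATE
    if regulated_or_customer_critical:
        return Path.HIGH_SEVERITY
    if internal_only:
        return Path.LOW_SEVERITY
    if signal_present:
        return Path.PATH_A
    # No branch matched: treat the case as outside this runbook (assumed default).
    return Path.ESCALATE


print(choose_path(signal_present=True, mitigation_failed=False,
                  internal_only=False, regulated_or_customer_critical=True).value)
```

Writing the branches down this way forces the authors to decide what happens when two conditions are true at once, which is exactly the question a tired responder should not have to answer alone.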
Use runbooks for training, not only emergencies
Runbooks decay fast. As services change, alerts get renamed, dashboards move, and owners rotate, runbooks quickly become outdated. A stale runbook can be worse than no runbook, because it gives false confidence and misguides the reader.
Therefore, the first real use of a runbook should not happen during a customer-facing outage. Use runbooks during onboarding, on-call shadowing, and game days. Practice will reveal issues with your runbook early on:
missing access
ambiguous dashboard names
region-specific mitigations
escalation paths that are socially known but undocumented
Fixing these as they surface keeps our runbooks up to date.
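One way to notice decay before an outage does is to record when each runbook was last exercised and flag the ones that have gone too long. The sketch below assumes a 90-day limit and uses made-up runbook names. Both are placeholders, not recommendations.

```python
from datetime import date, timedelta


def stale_runbooks(last_tested: dict[str, date], today: date,
                   max_age: timedelta = timedelta(days=90)) -> list[str]:
    """Return the names of runbooks not exercised within max_age, sorted."""
    return sorted(
        name for name, tested_on in last_tested.items()
        if today - tested_on > max_age
    )


# Hypothetical runbook names and dates, for illustration only.
history = {
    "queue-backlog": date(2024, 5, 1),
    "db-failover": date(2023, 12, 1),
}
print(stale_runbooks(history, today=date(2024, 6, 1)))
```

The output of a check like this could feed the agenda for the next game day or on-call shadowing session.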
A lightweight template
Start simple:
Situation: what symptoms does this cover?
Non-situations: what similar cases does it not cover?
Confirmation: how do we know this is the known problem?
Impact: who or what is affected?
First action: what is safe to try first?
Boundaries: what should we not do?
Decision points: where does the path branch?
Escalation: when do we stop and who do we call?
Ownership: who maintains this and when was it last tested?
That is enough to produce a useful first version.
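If your runbooks live in a repository, the template can also be kept honest with a small structural check. The following is a sketch only: the `Runbook` class and `missing_sections` helper are hypothetical, with field names taken from the template headings above.

```python
from dataclasses import dataclass, fields


@dataclass
class Runbook:
    """One field per template section; empty string means not yet written."""
    situation: str = ""
    non_situations: str = ""
    confirmation: str = ""
    impact: str = ""
    first_action: str = ""
    boundaries: str = ""
    decision_points: str = ""
    escalation: str = ""
    ownership: str = ""


def missing_sections(runbook: Runbook) -> list[str]:
    """List the template sections that are still blank, in template order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name).strip()]


draft = Runbook(
    situation="Queue depth rising on the worker pool",
    first_action="Restart one worker if no deploy is running",
)
print(missing_sections(draft))
```

A check like this says nothing about whether the content is good, but it does catch the common first draft that lists actions and skips boundaries and escalation.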
Conclusion
A runbook is not proof your team is prepared. It is a tool for transferring judgment under pressure. If it stores only commands, it will be ignored or misused. If it stores context, boundaries, and escalation logic, it becomes part of how the team operates.

