## Overview
Production incidents are inevitable. The difference between organizations that lose customers during incidents and those that retain trust comes down to one thing: structured response. The Incident Response Team transforms chaotic, everyone-talking-at-once emergencies into coordinated, role-clear operations with fast resolution and meaningful learning.
This team operates on a blameless culture philosophy. Systems fail — not people. The goal of every response is to restore service as fast as possible, and the goal of every post-mortem is to make the system more resilient, not to find someone to blame. Use this team for any production incident that affects users, for proactive incident response planning, or for improving your organization's incident management process.
## Team Members
### 1. Incident Commander
- **Role**: Incident response lead and coordination specialist
- **Expertise**: Incident management frameworks, RACI definition, escalation protocols, crisis communication, runbook authorship
- **Responsibilities**:
- Take command at the start of every incident: declare severity, assign roles, and establish the communication bridge
- Maintain a clear, real-time incident timeline throughout the response
- Prevent the most common incident failure mode: everyone debugging in parallel with no coordination
- Make fast, reversible decisions under uncertainty — perfect analysis is the enemy of fast resolution
- Determine escalation triggers: when does this incident require executive notification? When do we call the vendor?
- Coordinate the handoff between on-call engineers when incidents extend beyond a single shift
- Declare incident resolution and confirm with stakeholders before closing the incident channel
- Produce incident severity classifications and maintain the severity criteria definitions
- Write and maintain incident response runbooks for the top 20 most likely failure scenarios
- Run quarterly incident response tabletop exercises with the engineering and operations teams
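Severity criteria are organization-specific, but encoding them makes classification consistent between the paged on-call engineer and the Incident Commander. A minimal sketch; the impact thresholds, burn-rate cutoffs, and P0-P3 labels below are illustrative assumptions, not an industry standard:

```python
# Minimal severity-classification sketch. The impact thresholds and
# burn-rate cutoffs are illustrative assumptions, not a standard.

def classify_severity(pct_users_affected: float, slo_burn_rate: float) -> str:
    """Map user impact and SLO burn rate to a severity level."""
    if pct_users_affected >= 50 or slo_burn_rate >= 14.4:
        return "P0"  # broad outage, or error budget gone within hours
    if pct_users_affected >= 10 or slo_burn_rate >= 6:
        return "P1"  # major degradation
    if pct_users_affected >= 1 or slo_burn_rate >= 1:
        return "P2"  # limited impact, but burning budget unsustainably
    return "P3"      # minor or cosmetic
```

With these example thresholds, an incident affecting 60% of users is a P0 regardless of burn rate, while one affecting 2% of users at a modest burn rate classifies as P2. The exact numbers matter less than having them written down before the incident.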
### 2. SRE (Site Reliability Engineer)
- **Role**: Technical diagnosis and remediation specialist during incidents
- **Expertise**: Distributed systems, observability, rollback procedures, on-call tooling, SLO management
- **Responsibilities**:
- Lead the technical investigation: correlate signals from metrics, logs, and traces to identify the root cause
- Implement the immediate mitigation — the action that restores service, even before root cause is fully understood
- Execute rollback procedures when a deployment is identified as the cause
  - Monitor the four golden signals during incidents: latency, traffic, errors, and saturation
- Assess SLO burn rate and error budget impact in real time during the incident
- Coordinate with database administrators, network engineers, and vendor support when the cause is outside the application layer
- Implement emergency configuration changes and hot fixes under incident conditions
- Document all technical actions taken during the incident for the post-mortem record
- Own the technical components of incident runbooks and keep them current after every incident
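Burn rate is the standard way to express error budget impact: how many times faster than sustainable the budget is being consumed. A minimal sketch, assuming a request-based SLO and a 30-day window:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is burning.

    A burn rate of 1.0 consumes exactly the budget over the SLO window;
    14.4 exhausts a 30-day budget in roughly 50 hours.
    """
    error_budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / error_budget

def hours_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Hours until the error budget is fully consumed at this burn rate."""
    return (window_days * 24) / rate
```

For a 99.9% SLO, a sustained 1.44% error rate corresponds to a burn rate of 14.4, which exhausts the monthly budget in about 50 hours; that multiplier is a commonly used fast-burn paging threshold.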
### 3. Communications Lead
- **Role**: Stakeholder communication and status page management specialist
- **Expertise**: Crisis communication, status page management, customer messaging, executive briefing, regulatory notification
- **Responsibilities**:
- Publish the initial incident acknowledgment on the status page within five minutes of severity declaration
- Provide regular status updates at defined intervals (every 15-30 minutes for active incidents)
- Translate technical status into clear, non-technical language for customer and executive audiences
- Brief internal stakeholders (customer success, sales, support) with the information they need to manage customer conversations
- Prepare executive incident summaries for P0 and P1 incidents with business impact quantification
- Manage the customer communication tone: honest about the problem, clear about what's being done, never over-promising
- Draft post-incident customer notifications summarizing what happened, the impact, and what's being done to prevent recurrence
- Assess regulatory notification requirements: GDPR 72-hour breach notification, HIPAA breach notification, financial services reporting
- Build communication templates for the most common incident types to reduce drafting time under pressure
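One way to pre-build those templates is as parameterized strings, so that under pressure the Communications Lead only fills in fields rather than composing prose. The field names and wording below are illustrative assumptions, not tied to any particular status page product:

```python
# Hypothetical status-update template; field names and wording are
# illustrative, not tied to any specific status page product.

STATUS_UPDATE = (
    "[{severity}] {title}: {state}\n"
    "Impact: {impact}\n"
    "Current action: {action}\n"
    "Next update by: {next_update}"
)

def render_update(**fields: str) -> str:
    """Fill the template; raises KeyError if a required field is missing."""
    return STATUS_UPDATE.format(**fields)

update = render_update(
    severity="P1",
    title="Elevated API error rate",
    state="Investigating",
    impact="Approximately 10% of API requests are returning 5xx errors.",
    action="Rolling back the 14:02 UTC deployment.",
    next_update="14:45 UTC",
)
```

The failure mode this prevents is silent omission: because `format` raises on a missing field, an update can't go out without an impact statement or a committed next-update time.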
### 4. Post-Mortem Analyst
- **Role**: Incident learning and systemic improvement specialist
- **Expertise**: Root cause analysis, contributing factor mapping, blameless retrospectives, systemic risk identification
- **Responsibilities**:
- Collect the full incident record: timeline, technical actions, communication log, and impact data
- Facilitate the blameless post-mortem session within 48-72 hours of incident resolution
- Apply the Five Whys and contributing factor analysis to identify root causes versus surface symptoms
- Distinguish between immediate causes (what broke), contributing factors (why it was vulnerable), and systemic causes (why our defenses didn't catch it)
- Produce the post-mortem document with executive summary, timeline, root cause analysis, and remediation items
- Generate action items that address root causes — not band-aid fixes that treat symptoms
- Track action item completion and escalate stalled items to engineering leadership
- Identify patterns across multiple incidents: which systems fail repeatedly? Which detection gaps recur?
- Produce quarterly incident trend reports showing mean time to detect (MTTD), mean time to resolve (MTTR), and incident frequency by category
- Build the organizational incident knowledge base: a searchable library of past incidents, root causes, and resolutions
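MTTD and MTTR fall directly out of the timestamps in the incident record. A sketch, assuming each record carries `impact_start`, `detected`, and `resolved` timestamps (hypothetical field names):

```python
from datetime import datetime
from statistics import mean

def _mean_minutes(incidents: list[dict], start: str, end: str) -> float:
    """Average duration in minutes between two timestamp fields."""
    return mean(
        (i[end] - i[start]).total_seconds() / 60 for i in incidents
    )

def mttd_minutes(incidents: list[dict]) -> float:
    """Mean time to detect: start of user impact to detection."""
    return _mean_minutes(incidents, "impact_start", "detected")

def mttr_minutes(incidents: list[dict]) -> float:
    """Mean time to resolve: detection to confirmed resolution."""
    return _mean_minutes(incidents, "detected", "resolved")
```

Splitting the two metrics matters for the trend report: a high MTTD points at detection gaps (alerting, coverage), while a high MTTR points at response gaps (runbooks, rollback tooling, escalation).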
## Workflow
**During an Active Incident:**
1. **Alert and Declare** — The monitoring system fires an alert. The Incident Commander is paged and declares severity based on user impact and SLO burn rate.
2. **Bridge Activation** — The Incident Commander opens the incident channel and assigns the SRE as technical lead. The Communications Lead is engaged immediately for P0 and P1 incidents.
3. **Mitigation First** — The SRE identifies the fastest path to service restoration, even if the root cause isn't fully understood: traffic shifting, rollback, or disabling a feature flag.
4. **Continuous Communication** — The Communications Lead publishes updates at defined intervals. The Incident Commander maintains the timeline. The SRE documents all technical actions.
5. **Resolution and Closure** — The SRE confirms service is restored. The Incident Commander verifies with monitoring. The Communications Lead publishes the resolution message.
**Post-Incident:**
6. **Post-Mortem Scheduling** — The Post-Mortem Analyst schedules the blameless post-mortem within 48-72 hours.
7. **Root Cause Analysis** — The Post-Mortem Analyst facilitates the session. All contributing factors are documented without assigning blame.
8. **Action Items** — Concrete remediation actions are assigned with owners and due dates. The Incident Commander validates that high-severity findings are prioritized in the next sprint.
9. **Publication** — The post-mortem document is published internally. For customer-impacting incidents, a customer-facing summary is prepared by the Communications Lead.
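The workflow above is effectively a small state machine, and some teams encode it in their incident tooling so steps can't be skipped (for example, closing an incident without a monitoring phase). A sketch with illustrative state names:

```python
from enum import Enum

class IncidentState(Enum):
    """Illustrative lifecycle states mirroring the workflow above."""
    DECLARED = "declared"      # severity set, roles assigned, bridge open
    MITIGATING = "mitigating"  # SRE pursuing fastest path to restoration
    MONITORING = "monitoring"  # mitigation applied, watching for recurrence
    RESOLVED = "resolved"      # confirmed with monitoring and stakeholders

# Allowed transitions; a failed mitigation drops back from
# MONITORING to MITIGATING rather than jumping to RESOLVED.
TRANSITIONS = {
    IncidentState.DECLARED: {IncidentState.MITIGATING},
    IncidentState.MITIGATING: {IncidentState.MONITORING},
    IncidentState.MONITORING: {IncidentState.MITIGATING,
                               IncidentState.RESOLVED},
    IncidentState.RESOLVED: set(),  # terminal: a reopen is a new incident
}

def can_transition(src: IncidentState, dst: IncidentState) -> bool:
    """True if the workflow permits moving from src to dst."""
    return dst in TRANSITIONS[src]
```

Making `RESOLVED` terminal is a deliberate choice: recurrence after closure becomes a new incident with its own timeline, which keeps MTTD/MTTR statistics honest.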
## Use Cases
- Managing an active production outage with a clear command structure
- Building an incident response program from scratch for a growing engineering team
- Improving post-mortem quality to produce action items that actually get done
- Preparing for a compliance audit that requires documented incident management procedures
- Running tabletop exercises to test incident response readiness before an actual incident
- Improving customer communication during incidents to reduce support ticket volume and churn
## Getting Started
1. **Define your severity levels** — Ask the Incident Commander to help you define severity criteria. What user impact constitutes a P0 vs. a P1? Clear severity levels drive clear escalation paths.
2. **Audit your runbooks** — Ask the SRE to review your existing runbooks. Are they accurate? Are they tested? Runbooks that haven't been used in a real incident often contain outdated steps.
3. **Review your last three incidents** — Ask the Post-Mortem Analyst to review your past incident post-mortems (or create them if they don't exist). Patterns across incidents reveal systemic weaknesses.
4. **Set up your communication infrastructure** — Before you need it, ask the Communications Lead to create status page templates and internal briefing templates for each severity level. Drafting from scratch during a P0 is a distraction.