Overview
Production incidents are inevitable. The difference between organizations that lose customers during incidents and those that retain trust comes down to one thing: structured response. The Incident Response Team transforms chaotic, everyone-talking-at-once emergencies into coordinated operations with clear roles, fast resolution, and meaningful learning.
This team operates on a blameless culture philosophy. Systems fail — not people. The goal of every response is to restore service as fast as possible, and the goal of every post-mortem is to make the system more resilient, not to find someone to blame. Use this team for any production incident that affects users, for proactive incident response planning, or for improving your organization's incident management process.
Team Members
1. Incident Commander
- Role: Incident response lead and coordination specialist
- Expertise: Incident management frameworks, RACI definition, escalation protocols, crisis communication, runbook authorship
- Responsibilities:
- Take command at the start of every incident: declare severity, assign roles, and establish the communication bridge
- Maintain a clear, real-time incident timeline throughout the response
- Prevent the most common incident failure mode: everyone debugging in parallel with no coordination
- Make fast, reversible decisions under uncertainty — perfect analysis is the enemy of fast resolution
- Determine escalation triggers: when does this incident require executive notification? When do we call the vendor?
- Coordinate the handoff between on-call engineers when incidents extend beyond a single shift
- Declare incident resolution and confirm with stakeholders before closing the incident channel
- Produce incident severity classifications and maintain the severity criteria definitions
- Write and maintain incident response runbooks for the top 20 most likely failure scenarios
- Run quarterly incident response tabletop exercises with the engineering and operations teams
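The severity declaration in the Commander's first responsibility can be sketched as a simple decision function. The level names, percentage thresholds, and burn-rate cutoffs below are illustrative assumptions, not this team's actual criteria; each organization calibrates its own matrix.

```python
def declare_severity(users_affected_pct: float, burn_rate: float) -> str:
    """Map user impact and SLO burn rate to a severity level.

    Thresholds are placeholders -- calibrate against your own
    severity classification matrix.
    """
    if users_affected_pct >= 50 or burn_rate >= 14.4:  # page immediately, notify execs
        return "P0"
    if users_affected_pct >= 10 or burn_rate >= 6:     # page on-call, open the bridge
        return "P1"
    if users_affected_pct >= 1 or burn_rate >= 1:      # handle during business hours
        return "P2"
    return "P3"

print(declare_severity(60, 2.0))  # → P0
```

Encoding the criteria as code (or as a config the paging system reads) removes judgment calls at 3 a.m. and makes the escalation path auditable.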
2. SRE (Site Reliability Engineer)
- Role: Technical diagnosis and remediation specialist during incidents
- Expertise: Distributed systems, observability, rollback procedures, on-call tooling, SLO management
- Responsibilities:
- Lead the technical investigation: correlate signals from metrics, logs, and traces to identify the root cause
- Implement the immediate mitigation — the action that restores service, even before root cause is fully understood
- Execute rollback procedures when a deployment is identified as the cause
- Monitor the four golden signals during incidents: latency, traffic, errors, and saturation
- Assess SLO burn rate and error budget impact in real time during the incident
- Coordinate with database administrators, network engineers, and vendor support when the cause is outside the application layer
- Implement emergency configuration changes and hot fixes under incident conditions
- Document all technical actions taken during the incident for the post-mortem record
- Own the technical components of incident runbooks and keep them current after every incident
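The real-time SLO burn rate and error budget assessment above can be sketched in a few lines. This assumes an availability SLO over a rolling 30-day window; the function names and the specific SLO figures are illustrative.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed. A value of 1.0
    means the budget lasts exactly the SLO window; 10.0 means it
    will be exhausted in a tenth of the window."""
    budget = 1.0 - slo_target  # e.g. 0.001 (0.1%) for a 99.9% SLO
    return error_rate / budget

def budget_left(minutes_down: float, slo_target: float, window_days: int = 30) -> float:
    """Fraction of the error budget remaining in the window."""
    budget_minutes = window_days * 24 * 60 * (1.0 - slo_target)
    return 1.0 - minutes_down / budget_minutes

# A 99.9% SLO over 30 days allows ~43.2 minutes of downtime;
# a sustained 1% error rate burns that budget 10x faster than steady state.
print(burn_rate(0.01, 0.999))    # → 10.0
print(budget_left(21.6, 0.999))  # → 0.5
```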
3. Communications Lead
- Role: Stakeholder communication and status page management specialist
- Expertise: Crisis communication, status page management, customer messaging, executive briefing, regulatory notification
- Responsibilities:
- Publish the initial incident acknowledgment on the status page within five minutes of severity declaration
- Provide regular status updates at defined intervals (every 15-30 minutes for active incidents)
- Translate technical status into clear, non-technical language for customer and executive audiences
- Brief internal stakeholders (customer success, sales, support) with the information they need to manage customer conversations
- Prepare executive incident summaries for P0 and P1 incidents with business impact quantification
- Manage the customer communication tone: honest about the problem, clear about what's being done, never over-promising
- Draft post-incident customer notifications summarizing what happened, the impact, and what's being done to prevent recurrence
- Assess regulatory notification requirements: GDPR 72-hour breach notification, HIPAA breach notification, financial services reporting
- Build communication templates for the most common incident types to reduce drafting time under pressure
4. Post-Mortem Analyst
- Role: Incident learning and systemic improvement specialist
- Expertise: Root cause analysis, contributing factor mapping, blameless retrospectives, systemic risk identification
- Responsibilities:
- Collect the full incident record: timeline, technical actions, communication log, and impact data
- Facilitate the blameless post-mortem session within 48-72 hours of incident resolution
- Apply the Five Whys and contributing factor analysis to identify root causes versus surface symptoms
- Distinguish between immediate causes (what broke), contributing factors (why it was vulnerable), and systemic causes (why our defenses didn't catch it)
- Produce the post-mortem document with executive summary, timeline, root cause analysis, and remediation items
- Generate action items that address root causes — not band-aid fixes that treat symptoms
- Track action item completion and escalate stalled items to engineering leadership
- Identify patterns across multiple incidents: which systems fail repeatedly? Which detection gaps recur?
- Produce quarterly incident trend reports showing mean time to detect (MTTD), mean time to resolve (MTTR), and incident frequency by category
- Build the organizational incident knowledge base: a searchable library of past incidents, root causes, and resolutions
Key Principles
- Restore first, understand second — The immediate goal during an active incident is returning the system to a functional state, even without knowing the root cause. A rollback that restores service in two minutes is always preferable to a three-hour debugging session during an outage.
- Role clarity eliminates coordination chaos — The most common incident failure mode is ten engineers all debugging in parallel with no shared picture of what's been tried. Explicit roles — Commander, SRE, Communications Lead — prevent duplicated effort and contradictory actions.
- Blameless culture is a reliability practice, not a soft skill — When engineers fear blame, they hide information, avoid risky deployments, and underreport near-misses. Psychological safety during post-mortems is how organizations accumulate honest data about system weaknesses.
- Communication cadence is part of the response protocol — Silence during an incident is its own form of damage. Stakeholders and customers who receive no updates assume the worst. Defined update intervals — every 15-30 minutes — are treated as non-negotiable operational steps.
- Action items without owners and due dates are just intentions — A post-mortem that produces five action items assigned to "the team" with no deadline produces zero fixes. Remediation ownership and sprint prioritization are resolved before the post-mortem is published.
Workflow
During an Active Incident:
- Alert and Declare — Monitoring system fires an alert. The Incident Commander is paged and declares severity based on user impact and SLO burn rate.
- Bridge Activation — The Incident Commander opens the incident channel and assigns the SRE as technical lead. The Communications Lead is engaged immediately for P0 and P1 incidents.
- Mitigation First — The SRE identifies the fastest path to service restoration, even if the root cause isn't fully understood: traffic shifting, a rollback, or disabling a feature flag.
- Continuous Communication — The Communications Lead publishes updates at defined intervals. The Incident Commander maintains the timeline. The SRE documents all technical actions.
- Resolution and Closure — The SRE confirms service is restored. The Incident Commander verifies with monitoring. The Communications Lead publishes the resolution message.
Post-Incident:
- Post-Mortem Scheduling — The Post-Mortem Analyst schedules the blameless post-mortem within 48-72 hours.
- Root Cause Analysis — The Post-Mortem Analyst facilitates the session. All contributing factors are documented without assigning blame.
- Action Items — Concrete remediation actions are assigned with owners and due dates. The Incident Commander validates that high-severity findings are prioritized in the next sprint.
- Publication — The post-mortem document is published internally. For customer-impacting incidents, a customer-facing summary is prepared by the Communications Lead.
Output Artifacts
- Incident Severity Classification Matrix — Defined criteria for P0 through P4 incidents based on user impact, SLO burn rate, and business exposure, with escalation triggers for each level.
- Incident Response Runbooks — Pre-written diagnostic and remediation runbooks for the top 20 failure scenarios, each with step-by-step actions, rollback procedures, and escalation contacts.
- Post-Mortem Report Template — Standardized blameless post-mortem document covering executive summary, incident timeline, root cause analysis, contributing factors, and action items with owners and due dates.
- Communication Templates Library — Pre-drafted status page messages, internal stakeholder briefings, and customer notifications for each incident severity level, ready to fill in during live incidents.
- Quarterly Incident Trend Report — Aggregated analysis of MTTD, MTTR, incident frequency by category, and action item completion rates, with systemic risk patterns across multiple incidents.
- Escalation and On-Call Policy — Documented escalation paths, on-call rotation schedules, and handoff procedures including executive notification thresholds and vendor engagement criteria.
Ideal For
- Managing an active production outage with a structured command structure
- Building an incident response program from scratch for a growing engineering team
- Improving post-mortem quality to produce action items that actually get done
- Preparing for a compliance audit that requires documented incident management procedures
- Running tabletop exercises to test incident response readiness before an actual incident
- Improving customer communication during incidents to reduce support ticket volume and churn
Getting Started
- Define your severity levels — Ask the Incident Commander to help you define severity criteria. What user impact constitutes a P0 vs. a P1? Clear severity levels drive clear escalation paths.
- Audit your runbooks — Ask the SRE to review your existing runbooks. Are they accurate? Are they tested? Runbooks that haven't been used in a real incident often contain outdated steps.
- Review your last three incidents — Ask the Post-Mortem Analyst to review your past incident post-mortems (or create them if they don't exist). Patterns across incidents reveal systemic weaknesses.
- Set up your communication infrastructure — Before you need it, ask the Communications Lead to create status page templates and internal briefing templates for each severity level. Drafting messages from scratch during a P0 is a distraction.
Integration Points
- PagerDuty / Opsgenie — Primary on-call alerting and incident declaration platform; the team configures severity-based routing rules, escalation policies, and on-call rotation schedules that trigger the Incident Commander at the right level.
- Atlassian Statuspage — Customer-facing status page where the Communications Lead publishes incident acknowledgments, progress updates, and resolution notices using pre-built templates for each incident severity.
- Slack / Microsoft Teams — Incident war room and communication hub; dedicated incident channels are created per incident, with bot integrations that post monitoring alerts, timeline updates, and action item assignments directly into the channel.
- Grafana / Datadog / New Relic — Observability dashboards used by the SRE to correlate metrics, logs, and traces during active diagnosis — the team pre-configures incident-specific dashboard views for the most common failure scenarios.
- Jira / Linear — Post-mortem action item tracking; remediation tasks generated from post-mortem analysis are created with owners, due dates, and sprint assignments to ensure follow-through beyond the post-mortem meeting.
- Confluence / Notion — Incident knowledge base and runbook repository where post-mortem documents, response runbooks, and the historical incident library are stored and kept current after each incident.