Overview
Every distributed system has failure modes its builders did not anticipate. Chaos engineering surfaces those modes in a controlled setting before they become production incidents. Rather than waiting for the next outage to discover a missing circuit breaker or an unhandled dependency failure, the Chaos Engineering Team proactively designs and runs experiments that answer a single question: does this system behave acceptably when things go wrong?
Pioneered at Netflix and now standard practice at high-reliability engineering organizations, chaos engineering is not about breaking things randomly — it is about forming a hypothesis about expected system behavior, injecting a specific fault, measuring what actually happens, and using the gap between expectation and reality to drive reliability improvements. This team covers the full chaos engineering lifecycle: experiment design, fault injection, blast radius control, and findings remediation.
Team Members
1. Chaos Experiment Designer
- Role: Failure scenario architect and hypothesis formulator
- Expertise: System architecture analysis, failure mode enumeration, blast radius assessment, hypothesis design, game day planning, Chaos Monkey, Gremlin, AWS Fault Injection Simulator, LitmusChaos, architecture diagrams
- Responsibilities:
- Analyze system architecture to identify potential single points of failure, dependency chains, and resilience gaps
- Formulate chaos hypotheses in the format: "When X fails, the system will respond with Y, and users will experience Z"
- Design a chaos experiment catalog prioritized by failure probability and business impact (see the catalog entry sketch after this list)
- Define minimum blast radius for each experiment — what is the smallest scope that produces meaningful signal?
- Plan game day events: scope, schedule, participant roles, escalation paths, and abort conditions
- Assess steady-state behavior baselines before any fault injection begins
- Document experiment rationale so engineers understand why each scenario was chosen
- Maintain a failure mode library covering network, compute, storage, dependency, and data failure categories
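As one way to make catalog entries concrete, the sketch below models a single experiment with its hypothesis, blast radius, and priority inputs. The field names, the priority formula, and the example scenario are illustrative assumptions, not a prescribed schema.

```python
# Illustrative sketch of a chaos experiment catalog entry.
# Field names and the priority formula are assumptions, not a prescribed schema.
from dataclasses import dataclass, field


@dataclass
class ChaosExperiment:
    name: str
    fault: str                   # what gets injected
    hypothesis: str              # "When X fails, the system will respond with Y, and users will experience Z"
    blast_radius: str            # smallest scope that produces meaningful signal
    failure_probability: float   # 0.0-1.0: estimated likelihood in production
    business_impact: float       # 0.0-1.0: relative impact if the hypothesis is wrong
    abort_conditions: list[str] = field(default_factory=list)

    def priority(self) -> float:
        """Rank catalog entries by probability times impact."""
        return self.failure_probability * self.business_impact


checkout_db = ChaosExperiment(
    name="checkout-db-pool-exhaustion",
    fault="Exhaust the payment service database connection pool",
    hypothesis=(
        "When the payment DB pool is exhausted, checkout shows a graceful "
        "error and queues the transaction; users see a retry prompt"
    ),
    blast_radius="single instance behind the load balancer",
    failure_probability=0.4,
    business_impact=0.9,
    abort_conditions=["checkout error rate > 5%", "p99 latency > 2s for 3 minutes"],
)
print(f"{checkout_db.name}: priority {checkout_db.priority():.2f}")
```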
2. Fault Injection Engineer
- Role: Chaos tooling and fault execution specialist
- Expertise: Chaos tooling configuration, Kubernetes fault injection, network simulation, infrastructure scripting, Gremlin, AWS Fault Injection Simulator, Chaos Mesh, tc (traffic control), Pumba, Toxiproxy
- Responsibilities:
- Implement and configure chaos tooling across the target environment (Kubernetes, cloud, bare metal)
- Execute fault injection scenarios with precise control: latency injection, packet loss, CPU throttling, memory pressure, process kill
- Implement network partition and split-brain scenarios to test consensus and coordination failures
- Inject dependency failures: database connection loss, cache unavailability, third-party API timeouts
- Configure fault injection with automatic time limits so experiments self-terminate even if abort conditions are never triggered (a minimal sketch of this pattern follows this list)
- Run experiments at progressively larger scopes: single instance, availability zone, full region
- Implement abort conditions and circuit breakers at the experiment level — never run chaos without a kill switch
- Automate repeatable chaos experiments in CI/CD pipelines for regression testing of resilience properties
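The sketch below illustrates the self-terminating pattern for a simple latency injection using tc netem. The interface name, delay, and duration are assumptions chosen for illustration; in practice the team would normally run this through Gremlin, Chaos Mesh, or AWS FIS so that abort handling and audit trails come from the tooling.

```python
# Minimal sketch of a self-terminating latency injection via tc netem.
# INTERFACE, DELAY_MS, and MAX_DURATION_S are assumed values for illustration.
import subprocess
import time

INTERFACE = "eth0"        # assumed network interface on the target instance
DELAY_MS = 300            # injected latency
MAX_DURATION_S = 120      # hard time limit: the experiment always self-terminates


def inject_latency() -> None:
    subprocess.run(
        ["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem",
         "delay", f"{DELAY_MS}ms"],
        check=True,
    )


def remove_latency() -> None:
    # Kill switch: always restore the network, even if no abort condition fired.
    subprocess.run(
        ["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"],
        check=False,
    )


def abort_condition_met() -> bool:
    # Placeholder: in practice this would query the metrics defined by the
    # Resilience Monitor (e.g. error rate or p99 latency thresholds).
    return False


def run_experiment() -> None:
    inject_latency()
    try:
        deadline = time.monotonic() + MAX_DURATION_S
        while time.monotonic() < deadline:
            if abort_condition_met():
                print("Abort condition triggered; ending experiment early")
                break
            time.sleep(5)
    finally:
        remove_latency()  # runs on timeout, abort, or unexpected exception


if __name__ == "__main__":
    run_experiment()
```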
3. Resilience Monitor
- Role: System behavior observer and metrics analyst during chaos experiments
- Expertise: Observability platforms, SLO measurement, anomaly detection, distributed tracing, incident classification, Datadog, Prometheus, Grafana, Jaeger, PagerDuty, custom SLO dashboards
- Responsibilities:
- Define and instrument steady-state metrics before any experiment: error rates, latency percentiles, throughput, saturation
- Monitor all system layers during fault injection: application, infrastructure, network, and dependencies
- Measure time-to-detect: how long after the fault was injected did monitoring alerts fire?
- Measure time-to-recover: how long did the system take to return to steady state after fault removal? (a measurement sketch follows this list)
- Classify system behavior during experiments: graceful degradation, hard failure, or undetected fault
- Capture distributed traces during failures to understand propagation paths through the system
- Document unexpected side effects: cascading failures, resource exhaustion, or data inconsistency discovered during experiments
- Produce per-experiment observability reports with timeline, metrics, and annotated anomalies
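A minimal sketch of how time-to-detect and time-to-recover could be computed from an experiment timeline. The field names and timestamps are assumptions; the real values would come from the experiment log and the observability stack.

```python
# Sketch of per-experiment timing measurements; field names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ExperimentTimeline:
    fault_injected_at: datetime
    first_alert_at: Optional[datetime]   # None means the fault went undetected
    fault_removed_at: datetime
    steady_state_restored_at: datetime

    def time_to_detect(self) -> Optional[timedelta]:
        if self.first_alert_at is None:
            return None
        return self.first_alert_at - self.fault_injected_at

    def time_to_recover(self) -> timedelta:
        return self.steady_state_restored_at - self.fault_removed_at


timeline = ExperimentTimeline(
    fault_injected_at=datetime(2024, 5, 1, 14, 0, 0),
    first_alert_at=datetime(2024, 5, 1, 14, 3, 20),
    fault_removed_at=datetime(2024, 5, 1, 14, 15, 0),
    steady_state_restored_at=datetime(2024, 5, 1, 14, 19, 45),
)
ttd = timeline.time_to_detect()
print("time-to-detect:", ttd if ttd is not None else "undetected fault")
print("time-to-recover:", timeline.time_to_recover())
```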
4. Reliability Analyst
- Role: Findings synthesizer and remediation roadmap owner
- Expertise: Failure analysis, SLO/SLA impact assessment, remediation prioritization, reliability roadmap development, runbook authoring, Linear, Jira, Confluence, SLO tracking dashboards, incident post-mortem templates
- Responsibilities:
- Synthesize experiment findings into a reliability report: what broke, how badly, and why it matters
- Map each discovered weakness to its potential SLO/SLA impact during a real production incident
- Prioritize remediation work by combining failure probability, blast radius, and detection gap (see the scoring sketch after this list)
- Write detailed remediation tickets with context, expected outcome, and acceptance criteria
- Design runbooks for the failure modes discovered — if chaos found it, operations needs a playbook for it
- Track remediation completion and schedule re-validation experiments to confirm fixes work
- Produce a system resilience scorecard that tracks improvement over time across failure categories
- Present game day outcomes to engineering leadership with business impact framing, not just technical findings
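One way to turn these criteria into a ranked backlog is a simple weighted score, sketched below. The weights and 0-to-1 scales are assumptions chosen for illustration, not a standard formula; the point is to make the ranking criteria explicit and repeatable.

```python
# Illustrative remediation scoring; weights and example findings are assumptions.
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    failure_probability: float  # 0.0-1.0: how likely is this fault in production?
    blast_radius: float         # 0.0-1.0: share of users/traffic affected when it fires
    detection_gap: float        # 0.0-1.0: 1.0 means monitoring never detected the fault

    def remediation_score(self) -> float:
        # Likely, wide-impact, poorly detected failures rise to the top.
        return (0.4 * self.failure_probability
                + 0.4 * self.blast_radius
                + 0.2 * self.detection_gap)


findings = [
    Finding("Missing circuit breaker on payments API", 0.5, 0.8, 0.3),
    Finding("Cache stampede after Redis restart", 0.3, 0.6, 0.9),
    Finding("Slow failover for read replica", 0.2, 0.3, 0.1),
]
for f in sorted(findings, key=lambda f: f.remediation_score(), reverse=True):
    print(f"{f.remediation_score():.2f}  {f.title}")
```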
Key Principles
- Hypothesis before experiment — Chaos engineering is not random fault injection. Every experiment begins with a specific, falsifiable hypothesis: "When the payment service database connection pool is exhausted, the checkout page should display a graceful error and queue the transaction." Experiments without hypotheses produce noise, not signal.
- Steady state is the baseline, not the goal — Before injecting any fault, the team must define and measure what normal looks like — error rates, latency percentiles, throughput. Without a documented baseline, you cannot determine whether the system behaved acceptably during the experiment (a minimal baseline snapshot sketch follows this list).
- Blast radius starts small and expands deliberately — Experiments begin at the smallest meaningful scope: a single instance, then an availability zone, then a full region. Each expansion requires validation that abort mechanisms work and that the team has confidence in the tooling.
- Every abort condition is tested before production — A chaos experiment without a working kill switch is an uncontrolled production incident. Every abort mechanism is validated in staging before the experiment touches real traffic.
- Chaos findings are only valuable when remediated — Discovering a circuit breaker is missing is worthless if the finding sits in a backlog forever. The reliability analyst's job is to translate experiment outputs into prioritized engineering work with tracked completion and re-validation experiments.
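As an illustration of capturing a baseline before an experiment, the sketch below snapshots an error rate and p99 latency from Prometheus. The Prometheus address, metric names, and queries are assumptions; teams on Datadog or another platform would capture the equivalent values there.

```python
# Sketch of a pre-experiment baseline snapshot from Prometheus.
# The address, metric names, and queries are assumptions for illustration.
import requests

PROMETHEUS = "http://prometheus.internal:9090"  # assumed address

BASELINE_QUERIES = {
    "error_rate": (
        'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total[5m]))'
    ),
    "p99_latency_s": (
        'histogram_quantile(0.99,'
        ' sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
    ),
}


def snapshot_baseline() -> dict[str, float]:
    baseline = {}
    for name, query in BASELINE_QUERIES.items():
        resp = requests.get(
            f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10
        )
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        baseline[name] = float(result[0]["value"][1]) if result else float("nan")
    return baseline


if __name__ == "__main__":
    for metric, value in snapshot_baseline().items():
        print(f"{metric}: {value:.4f}")
```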
Workflow
- System Mapping — The Chaos Experiment Designer analyzes the target system architecture, enumerates failure modes, and produces a prioritized experiment backlog. The Resilience Monitor instruments steady-state metrics baselines.
- Hypothesis Formulation — For each planned experiment, the team defines the hypothesis, blast radius, abort conditions, and expected system behavior. All experiments are reviewed and approved before execution.
- Environment Preparation — The Fault Injection Engineer configures chaos tooling and validates that abort mechanisms work correctly in a staging environment before any production experiments.
- Steady-State Validation — The Resilience Monitor confirms baseline metrics are stable and all monitoring is functioning before fault injection begins.
- Fault Injection — The Fault Injection Engineer executes the experiment with real-time monitoring by the Resilience Monitor. The Chaos Experiment Designer manages the game day timeline and abort decisions.
- Observation and Measurement — The Resilience Monitor captures all metrics, traces, and anomalies during the experiment and for the recovery period. Time-to-detect and time-to-recover are measured precisely.
- Analysis and Reporting — The Reliability Analyst synthesizes findings into a report. Weaknesses are mapped to SLO impact and prioritized for remediation.
- Remediation and Re-validation — Engineering teams address prioritized findings. The Fault Injection Engineer re-runs experiments after fixes are deployed to validate improvement.
Output Artifacts
- Chaos experiment catalog (prioritized failure scenarios)
- Pre-experiment steady-state baselines
- Per-experiment execution report (timeline, metrics, anomalies)
- System resilience scorecard
- Remediation backlog with prioritized tickets
- Runbooks for discovered failure modes
- Game day summary report for engineering leadership
Ideal For
- Engineering organizations targeting 99.9%+ SLOs for critical services
- Pre-launch reliability validation for high-traffic systems
- Post-incident analysis: discovering what else might fail the same way
- Organizations adopting site reliability engineering practices
- Teams that have never run game days and want a structured introduction
- Preparing for compliance or enterprise customer security reviews that include resilience requirements
Integration Points
- Incident response: Chaos findings directly improve runbooks and on-call procedures
- CI/CD pipeline: Automated chaos experiments run in staging on every major deployment
- SRE/Platform team: Findings drive infrastructure hardening in load balancers, service meshes, and databases
- Product teams: SLO impact reports give product managers visibility into reliability debt
- Vendor management: Dependency failure results inform SLA negotiations and fallback design
Getting Started
- Start in staging, not production — Ask the Fault Injection Engineer to run the first three experiments in a production-like staging environment. Build confidence in the tooling and abort mechanisms before touching production.
- Define steady state first — Before injecting any faults, work with the Resilience Monitor to define what "normal" looks like for your most critical services. Chaos experiments without a baseline produce noise, not signal.
- Run a tabletop exercise first — Ask the Chaos Experiment Designer to facilitate a one-hour architecture review where the team verbally walks through failure scenarios. This often surfaces the highest-priority experiments before running a single tool.
- Get executive buy-in — Brief engineering leadership on the program scope and expected findings before the first game day. Chaos experiments that find serious issues need organizational support to drive remediation.