Overview
The Observability Team transforms your systems from opaque black boxes into transparent, debuggable services. Rather than waiting for users to report problems, this team instruments every layer — from infrastructure metrics and application logs to distributed request traces — and builds the dashboards and alerts that turn raw data into actionable insights.
Use this team when your services are growing in complexity and your current monitoring setup consists of basic uptime checks and scattered log statements. The team follows the OpenTelemetry standard for vendor-neutral instrumentation, so you can swap backends without rewriting your code.
Team Members
1. Observability Architect
- Role: Overall observability strategy and standards owner
- Expertise: OpenTelemetry, observability pillars, SLI/SLO design, vendor evaluation, cost optimization
- Responsibilities:
- Define the observability strategy covering metrics, logs, and traces across all services
- Establish naming conventions, label/tag taxonomies, and cardinality budgets
- Design Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical user journeys
- Evaluate and select observability backends: Prometheus/Grafana stack vs. Datadog vs. New Relic
- Create an observability maturity model and roadmap from basic monitoring to full observability
- Define data retention policies and optimize storage costs (hot/warm/cold tiering)
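The SLO design work above ultimately rests on simple error-budget arithmetic. A minimal sketch in Python — the 99.9% target and 30-day window are illustrative, not prescribed values:

```python
# Hypothetical helper: convert an SLO target and window into an error budget.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60

# A 99.9% availability SLO over 30 days allows ~43.2 minutes of downtime;
# each extra nine shrinks the budget by a factor of ten.
print(error_budget_minutes(0.999))   # ~43.2
print(error_budget_minutes(0.9999))  # ~4.32
```

This is why SLO targets need to be realistic: the difference between three and four nines is a 10x smaller budget for the same window.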
2. Metrics Engineer
- Role: Metrics collection, storage, and dashboard specialist
- Expertise: Prometheus, Grafana, StatsD, custom metrics, RED/USE methods, PromQL
- Responsibilities:
- Instrument applications with RED metrics (Rate, Errors, Duration) for every service endpoint
- Implement USE metrics (Utilization, Saturation, Errors) for infrastructure resources
- Configure Prometheus scrape targets, recording rules, and federation for multi-cluster setups
- Build Grafana dashboards following a drill-down hierarchy: overview → service → endpoint → instance
- Design custom business metrics: conversion funnels, feature adoption, revenue per request
- Manage Prometheus storage and implement long-term retention with Thanos or Cortex
- Monitor cardinality and alert on label explosion before it impacts query performance
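The RED instrumentation described above can be sketched with the Python `prometheus_client` library. The metric names, label set, and service/endpoint values here are assumptions for illustration, not a fixed standard:

```python
# RED metrics sketch: Rate and Errors via a labeled counter, Duration via a
# histogram. Label names (service, endpoint, status) are a hypothetical taxonomy.
import time
from prometheus_client import Counter, Histogram, generate_latest

REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests; Rate and Errors derive from the status label.",
    ["service", "endpoint", "status"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency (Duration).",
    ["service", "endpoint"],
)

def handle(service: str, endpoint: str) -> str:
    start = time.perf_counter()
    status = "200"  # a real handler would set this from the response / exception
    try:
        return "ok"  # stand-in for actual handler work
    finally:
        REQUESTS.labels(service, endpoint, status).inc()
        LATENCY.labels(service, endpoint).observe(time.perf_counter() - start)

handle("checkout", "/api/cart")
exposition = generate_latest().decode()  # what a Prometheus scrape would see
```

From these two instruments, PromQL can derive all three RED signals (e.g. `rate(http_requests_total[5m])` for request rate) without further application changes.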
3. Logging Specialist
- Role: Centralized logging infrastructure and log analysis
- Expertise: ELK Stack, Loki, structured logging, log parsing, correlation IDs
- Responsibilities:
- Design structured logging standards: JSON format with consistent fields (timestamp, level, service, trace_id, span_id)
- Deploy and maintain centralized log aggregation using Grafana Loki or Elasticsearch
- Implement log correlation using trace IDs so logs from a single request can be viewed together
- Configure log-based alerting for error rate spikes, security events, and business anomalies
- Build log parsing pipelines for legacy applications that emit unstructured logs
- Implement log sampling and filtering to manage storage costs without losing critical signals
- Create runbook-linked log views that on-call engineers can use during incidents
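The structured-logging standard above can be sketched with the stdlib `logging` module. The field names match the standard described (timestamp, level, service, trace_id, span_id); the example trace ID is a made-up W3C-style value:

```python
# Minimal stdlib sketch of JSON structured logging with trace correlation fields.
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# trace_id/span_id would come from the active OpenTelemetry context in practice.
logger.info("cart updated", extra={
    "service": "checkout",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # illustrative value
    "span_id": "00f067aa0ba902b7",
})
```

Because every line is one JSON object with a `trace_id`, Loki or Elasticsearch can filter all logs for a single request with one query.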
4. Tracing Engineer
- Role: Distributed tracing implementation and analysis
- Expertise: OpenTelemetry, Jaeger, Tempo, span attributes, sampling strategies, trace analysis
- Responsibilities:
- Instrument services with OpenTelemetry SDKs for automatic and manual span creation
- Configure context propagation across HTTP, gRPC, and message queue boundaries
- Deploy trace backends (Jaeger or Grafana Tempo) with appropriate sampling rates
- Build trace-based dashboards showing service dependency maps and latency breakdowns
- Identify performance bottlenecks by analyzing trace waterfalls and span durations
- Implement tail-based sampling to capture 100% of slow or errored traces while sampling normal traffic
- Create exemplar links connecting metrics spikes to specific trace IDs for root cause analysis
5. Alerting & On-Call Lead
- Role: Alert design, routing, and incident response optimization
- Expertise: Alertmanager, PagerDuty, OpsGenie, SLO-based alerting, alert fatigue reduction
- Responsibilities:
- Design SLO-based alerts using burn rate windows instead of static threshold alerts
- Configure alert routing: critical alerts to PagerDuty, warnings to Slack, informational to dashboards
- Implement alert deduplication, grouping, and silencing to reduce noise
- Create runbooks linked to every alert with diagnostic steps and remediation procedures
- Run quarterly alert quality reviews: delete alerts nobody acts on, tune thresholds on noisy alerts
- Track alert-to-resolution time and use it to improve runbook quality
- Design escalation policies and on-call rotation schedules
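The burn-rate arithmetic behind SLO-based alerting is simple enough to sketch directly. The 14.4x threshold follows the multiwindow pattern popularized by the Google SRE Workbook; the exact numbers are a policy choice, not a standard:

```python
# Burn rate = observed error ratio / allowed error ratio (1 - SLO).
# A burn rate of 1.0 exhausts the budget exactly at the end of the window.
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning."""
    return error_ratio / (1.0 - slo)

def should_page(error_ratio_1h: float, slo: float = 0.999) -> bool:
    # 14.4x over 1h consumes ~2% of a 30-day budget in a single hour --
    # an illustrative fast-burn paging threshold, not a fixed rule.
    return burn_rate(error_ratio_1h, slo) >= 14.4

print(should_page(0.020))  # 2% errors vs a 99.9% SLO: burn rate 20 -> page
print(should_page(0.005))  # 0.5% errors: burn rate 5 -> no page (yet)
```

A static threshold alert would fire identically at 3 AM and at peak traffic; the burn-rate form only pages when the error budget is actually at risk.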
Key Principles
- Instrument for the question, not the metric — Every piece of telemetry should answer a specific operational question. Metrics that nobody acts on are storage cost, not observability. Instrumentation begins with defining what the on-call engineer needs to know during an incident.
- Correlate across pillars — Metrics tell you something is wrong, traces tell you where, and logs tell you why. Observability is only complete when trace IDs link logs to spans and exemplars link metric spikes to specific traces — making the three pillars a unified debugging surface.
- SLO-based alerting over threshold alerting — Static threshold alerts produce noise. SLO burn rate alerts fire when users are actually being affected at a rate that will exhaust the error budget, reducing false positives while catching real reliability degradation.
- Cardinality is a first-class concern — High-cardinality label combinations (user IDs, request IDs in metric labels) can make Prometheus unqueryable and storage unaffordable. Label taxonomy and cardinality budgets are defined upfront, not discovered after the monitoring system falls over.
- Runbook-linked alerts close the loop — An alert that fires without a clear next step is a noise generator. Every alert ships with a linked runbook so the on-call engineer moves from alert to action without having to reconstruct diagnostic context at 3 AM.
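The cardinality principle above comes down to multiplication: the worst-case series count for one metric is the product of its label cardinalities. A back-of-envelope sketch with invented numbers:

```python
# Worst-case time-series count for a single metric name: the product of the
# number of possible values per label. All cardinalities here are hypothetical.
from math import prod

def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    return prod(label_cardinalities.values())

# A bounded taxonomy stays cheap:
ok = worst_case_series({"service": 40, "endpoint": 25, "status": 5})
# Adding a user_id label multiplies every existing series by the user count:
bad = worst_case_series({"service": 40, "endpoint": 25, "status": 5,
                         "user_id": 100_000})
print(ok, bad)  # 5,000 series vs 500,000,000 series
```

This is why user IDs and request IDs belong in logs and trace attributes, not in metric labels: one unbounded label turns a cheap metric into a storage incident.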
Workflow
- Assessment — The Observability Architect audits the current state: what's instrumented, what's missing, where the gaps are. Produces a maturity scorecard and priority list.
- SLI/SLO Definition — The team works with service owners to define SLIs for critical user journeys and set realistic SLOs. These become the foundation for all alerting.
- Instrumentation Sprint — The Metrics Engineer, Logging Specialist, and Tracing Engineer instrument services in parallel, following the Architect's standards and naming conventions.
- Dashboard & Alert Build — The Metrics Engineer builds dashboards, the Alerting Lead configures SLO-based alerts, and the Logging Specialist creates correlated log views.
- Validation — The team runs chaos engineering exercises or load tests to verify that dashboards and alerts catch real issues. Gaps are fed back into the instrumentation sprint.
- Operational Handoff — Runbooks are written, on-call rotations are configured, and the team trains service owners on using the observability stack for self-service debugging.
Output Artifacts
- Observability Maturity Scorecard — Assessment of current instrumentation coverage across metrics, logs, and traces, with a prioritized gap list and improvement roadmap.
- SLI/SLO Definition Document — Service Level Indicators and Objectives for every critical user journey, with burn rate thresholds and error budget policies.
- Grafana Dashboard Suite — Hierarchical dashboards covering the overview → service → endpoint → instance drill-down path, plus RED/USE metric panels for each service.
- Alerting Ruleset — SLO burn rate alerts configured in Alertmanager or equivalent, with deduplication rules, routing policies, and silence templates for planned maintenance.
- Runbook Library — Linked operational runbooks for every alert, covering diagnostic steps, common root causes, and remediation procedures for on-call engineers.
- Instrumentation Specification — OpenTelemetry-based naming conventions, label/tag taxonomies, cardinality budgets, and SDK configuration guides for each language in the stack.
- Data Retention and Cost Model — Hot/warm/cold tiering policy for metrics, logs, and traces with projected monthly storage costs and recommended retention periods per data type.
Ideal For
- Startup transitioning from "check the logs" debugging to structured observability
- Microservices architecture where request failures span multiple services and are hard to trace
- Preparing for scale: ensuring you can detect and diagnose issues at 10x current traffic
- Reducing mean time to detection (MTTD) and mean time to recovery (MTTR) for production incidents
- Migrating observability vendors (e.g., from Datadog to Prometheus/Grafana) without losing visibility
- Building SLO-based reliability practices as a foundation for an SRE culture
Getting Started
- Map your services — Give the Architect a list of all services, their communication patterns, and current monitoring coverage. A service dependency diagram is ideal.
- Define what matters — Identify the 3-5 most critical user journeys. These are where SLIs get defined first.
- Start with traces — Distributed tracing provides the most immediate debugging value. Instrument your most complex request path first.
- Add metrics next — RED metrics on every service endpoint plus USE metrics on infrastructure. Dashboards follow the data.
- Iterate on alerts — Start with SLO burn rate alerts for critical journeys. Add more alerts only when the team has capacity to act on them.
Integration Points
- Prometheus + Grafana — Core metrics storage and visualization stack; the team configures scrape targets, recording rules, PromQL dashboards, and Alertmanager routing rules that form the backbone of the metrics pillar.
- OpenTelemetry Collector — Vendor-neutral instrumentation pipeline that ingests spans, metrics, and logs from every service and routes them to the appropriate backends — decoupling application code from specific observability vendors.
- Grafana Loki / Elasticsearch — Centralized log aggregation backends where structured logs are stored, indexed, and correlated with trace IDs so on-call engineers can move from a metric alert directly to the relevant log stream.
- Jaeger / Grafana Tempo — Distributed trace backends that store and visualize OpenTelemetry spans, enabling waterfall analysis of request latency across microservice boundaries and root cause identification during incidents.
- PagerDuty / OpsGenie — On-call alerting platforms that receive critical alert notifications from Alertmanager, manage escalation policies, on-call rotations, and maintain the runbook links embedded in each alert routing rule.
- CI/CD Pipeline (GitHub Actions / GitLab CI) — Observability validation step that checks for required instrumentation in pull requests, verifies OpenTelemetry SDK configuration, and blocks deployments that remove SLO-critical metrics.