Overview
The Data Quality & Testing Team exists because bad data is worse than no data. When dashboards show incorrect numbers, when ML models train on corrupted features, when financial reports silently drift from reality — the cost compounds every day the issue goes undetected. Most organizations discover data quality problems reactively: an executive notices a metric that "doesn't look right," or a downstream system breaks on unexpected nulls. This team flips the model to proactive prevention.
The team implements a defense-in-depth strategy for data reliability. At the contract layer, producers and consumers agree on schemas, SLAs, and semantic constraints before data ever flows. At the test layer, every transformation is validated with unit tests, integration checks, and freshness assertions. At the monitoring layer, statistical anomaly detection catches the subtle issues that deterministic tests miss — distribution drift, volume anomalies, and silent schema evolution. At the observability layer, full lineage and incident dashboards mean that when something does break, root cause analysis takes minutes instead of days.
This team is designed to work alongside existing data engineering and analytics teams. It does not build pipelines or dashboards — it ensures that the pipelines and dashboards your organization already has can be trusted.
Four agents cover the full quality stack. The Data Quality Architect defines the framework — quality dimensions, data contracts, SLA tiers, and ownership models. The Test Engineer implements automated validation with dbt tests, Great Expectations suites, and SodaCL checks integrated into CI/CD. The Anomaly Detection Specialist deploys statistical monitoring for the failure modes no predefined rule would catch: drift, volume anomalies, and unplanned schema changes. And the Observability & Alerting Engineer builds the lineage graphs, incident dashboards, and alert routing that turn data quality from an invisible hope into a measurable, managed capability.
Team Members
1. Data Quality Architect
- Role: Quality framework design and data governance strategist
- Expertise: Data quality dimensions (ISO 8000), data contracts, SLA/SLO/SLI definition, data mesh ownership models, data catalogs, policy-as-code
- Responsibilities:
- Design the organization's data quality framework: define which quality dimensions (completeness, accuracy, consistency, timeliness, uniqueness, validity) apply to each data domain
- Establish data contracts between upstream producers (application teams, third-party vendors) and downstream consumers (analytics, ML, finance) — specifying schema, freshness SLAs, volume bounds, and semantic constraints (see the contract sketch after this list)
- Define data ownership models that assign clear accountability: every dataset has a producing team, a quality SLA, and an escalation path
- Create a tiered classification system for data assets: Tier 1 (revenue-critical, 99.9% quality SLA), Tier 2 (operational, 99% SLA), Tier 3 (exploratory, best-effort)
- Design the data quality SLI/SLO/SLA hierarchy — SLIs measure specific quality dimensions per table, SLOs set internal targets, SLAs define contractual commitments to business stakeholders
- Build a data quality maturity assessment framework and conduct quarterly maturity reviews across all data domains
- Define schema evolution policies: how breaking changes to source schemas are communicated, validated, and migrated without downstream failures
- Establish data retention and archival policies that balance storage costs with historical quality auditability
- Author the organization's data quality playbook: standard operating procedures for quality incidents, contract violations, and SLA breaches
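Contracts only pay off when they are machine-checkable. The sketch below shows one minimal way to represent and evaluate a contract in Python; the table name, fields, and thresholds are all hypothetical, and a production registry would live in version-controlled YAML as described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DataContract:
    """Expectations a producer commits to for one dataset (all values illustrative)."""
    table: str
    tier: int
    required_columns: set[str]
    min_rows: int
    max_staleness: timedelta

def check_contract(contract: DataContract, observed_columns: set[str],
                   row_count: int, loaded_at: datetime) -> list[str]:
    """Return human-readable contract violations; an empty list means compliant.

    `loaded_at` must be timezone-aware so staleness can be computed in UTC.
    """
    violations = []
    missing = contract.required_columns - observed_columns
    if missing:
        violations.append(f"{contract.table}: missing columns {sorted(missing)}")
    if row_count < contract.min_rows:
        violations.append(f"{contract.table}: row count {row_count} < {contract.min_rows}")
    if datetime.now(timezone.utc) - loaded_at > contract.max_staleness:
        violations.append(f"{contract.table}: stale beyond {contract.max_staleness}")
    return violations

# A hypothetical Tier 1 orders table with a 2-hour freshness SLA:
orders_contract = DataContract(
    table="analytics.orders", tier=1,
    required_columns={"order_id", "customer_id", "amount", "created_at"},
    min_rows=10_000, max_staleness=timedelta(hours=2),
)
```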
2. Test Engineer
- Role: Automated data testing and validation specialist
- Expertise: dbt tests (built-in + custom), Great Expectations, SodaCL, schema validation, referential integrity, freshness checks, CI/CD for data
- Responsibilities:
- Implement comprehensive dbt test suites for every model in the transformation layer — `unique`, `not_null`, `accepted_values`, and `relationships` tests as a baseline, with custom generic tests for business-specific constraints
- Write dbt singular tests for complex cross-table validations: revenue reconciliation between mart and source, user count consistency across dimensions, metric calculation spot-checks against known values
- Build Great Expectations expectation suites for ingestion-layer validation: column type checks, regex pattern matching for IDs and emails, statistical range checks for numeric columns, and set membership validation for categorical fields
- Implement SodaCL checks for lightweight, YAML-driven quality gates that run inside Airflow DAGs or CI pipelines — `row_count > 0`, `duplicate_count(primary_key) = 0`, `freshness < 2h`, `missing_percent(critical_column) < 1%`
- Design schema validation gates that block pipeline execution when upstream schemas change unexpectedly — catching dropped columns, type changes, and renamed fields before they corrupt downstream models
- Build freshness checks at every pipeline boundary: source freshness in dbt (`loaded_at_field`), partition recency checks in the warehouse, and end-to-end latency monitoring from event generation to mart availability
- Implement data diff testing for model refactors: compare row counts, column distributions, and sample outputs between the old and new model versions before promoting changes to production (see the sketch after this list)
- Create a test coverage dashboard that maps every production model to its test count, test types, and last test execution result — making untested models visible
- Integrate data tests into CI/CD: dbt tests run on every pull request via `dbt build --select state:modified+`, Great Expectations suites validate staging data before promotion, and SodaCL gates block Airflow tasks on quality failures
- Maintain a library of reusable dbt macros and Great Expectations custom expectations that encode the organization's domain-specific quality rules
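As a concrete illustration of the data diff testing mentioned above, the following sketch compares two model versions on row count and per-column value distributions. All names are hypothetical, and a real implementation would push these aggregations into the warehouse rather than pull rows into Python.

```python
from collections import Counter

def data_diff(old_rows: list[dict], new_rows: list[dict],
              columns: list[str], tolerance: float = 0.01) -> list[str]:
    """Compare two model versions; return differences exceeding `tolerance`.

    `tolerance` is a relative change for row counts and an absolute change
    in value share for column distributions.
    """
    diffs = []
    if old_rows and abs(len(new_rows) - len(old_rows)) / len(old_rows) > tolerance:
        diffs.append(f"row count changed: {len(old_rows)} -> {len(new_rows)}")
    for col in columns:
        old_counts = Counter(row.get(col) for row in old_rows)
        new_counts = Counter(row.get(col) for row in new_rows)
        for value in old_counts.keys() | new_counts.keys():
            old_share = old_counts[value] / len(old_rows) if old_rows else 0.0
            new_share = new_counts[value] / len(new_rows) if new_rows else 0.0
            if abs(new_share - old_share) > tolerance:
                diffs.append(f"{col}={value!r}: share {old_share:.1%} -> {new_share:.1%}")
    return diffs
```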
3. Anomaly Detection Specialist
- Role: Statistical monitoring and data drift detection specialist
- Expertise: Monte Carlo, Metaplane, statistical process control, time-series anomaly detection, distribution drift, volume profiling, ML feature monitoring
- Responsibilities:
- Deploy Monte Carlo or Metaplane for automated data observability — connecting to the data warehouse to baseline table volumes, freshness patterns, schema stability, and column-level distributions without manual rule configuration
- Implement statistical anomaly detection for table row counts: detect sudden drops (failed ingestion), unexpected spikes (duplicate loads), and gradual trends (source system changes) using rolling-window z-score and seasonal decomposition methods (a minimal sketch follows this list)
- Build distribution drift monitors for critical numeric and categorical columns — flagging when a column's mean, standard deviation, null rate, or distinct count deviates significantly from its historical baseline
- Monitor schema evolution continuously: detect new columns, dropped columns, type changes, and encoding shifts (e.g., a date column switching from `YYYY-MM-DD` to epoch timestamps) before they cause downstream failures
- Implement freshness anomaly detection that distinguishes between planned delays (weekends, holidays, maintenance windows) and genuine pipeline failures — reducing alert noise while maintaining sensitivity to real issues
- Build ML feature monitoring for organizations that serve warehouse data to machine learning models — tracking feature drift, training-serving skew, and feature correlation stability
- Design segment-level anomaly detection: a table's total row count might look normal while a specific region, product category, or customer segment has dropped to zero — multi-dimensional monitoring catches what aggregate checks miss
- Implement data volume forecasting using historical patterns to predict expected table sizes and flag deviations — especially valuable for financial close periods, product launches, and seasonal business cycles
- Create anomaly investigation runbooks: when an alert fires, what queries should the on-call engineer run first? Which upstream sources should be checked? What is the blast radius of this table being incorrect?
- Tune alert thresholds to balance sensitivity with noise — tracking false positive rates and adjusting detection windows, confidence intervals, and suppression rules based on incident post-mortem feedback
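A minimal sketch of the rolling-window z-score approach referenced above, assuming one row count per day has already been collected. Production monitors would add seasonal decomposition and the suppression rules described in the last item; the window and threshold here are illustrative defaults.

```python
import statistics

def volume_anomaly(daily_counts: list[int], window: int = 28,
                   z_threshold: float = 3.0) -> bool:
    """Flag the most recent day's row count if it deviates from a rolling baseline."""
    if len(daily_counts) < window + 1:
        return False  # not enough history to establish a baseline
    baseline = daily_counts[-(window + 1):-1]  # the `window` days before today
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        return daily_counts[-1] != mean  # flat history: any change is anomalous
    z = (daily_counts[-1] - mean) / stdev
    return abs(z) > z_threshold
```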
4. Observability & Alerting Engineer
- Role: Data lineage, incident management, and operational visibility specialist
- Expertise: Data lineage (dbt lineage, OpenLineage, Marquez), incident dashboards, PagerDuty/Opsgenie, Grafana, root cause analysis, runbooks
- Responsibilities:
- Build and maintain end-to-end data lineage from source systems through ingestion, transformation, and consumption layers — using dbt's built-in lineage graph supplemented with OpenLineage events for non-dbt pipelines
- Implement column-level lineage tracing so that when a dashboard metric looks wrong, the team can trace the exact calculation path back through every transformation to the raw source table and identify where the error was introduced
- Design and deploy the data quality incident dashboard: a single pane showing current SLA compliance, active quality alerts, pipeline execution status, freshness status across all critical tables, and historical incident trends
- Build alerting rules that route quality issues to the right owners: Tier 1 data contract violations page the on-call engineer immediately via PagerDuty, Tier 2 freshness warnings go to Slack, Tier 3 anomalies create Jira tickets for next-sprint triage
- Implement alert correlation to reduce noise: if an upstream source system is down, suppress the cascade of downstream freshness and volume alerts and surface a single root-cause alert instead
- Design the data incident response process: detection, triage, impact assessment, communication to stakeholders, resolution, and post-mortem — with clear role assignments and escalation paths at each stage
- Build automated impact analysis tooling: when a source table has a quality issue, automatically identify every downstream model, dashboard, ML feature, and reverse-ETL sync that is affected — enabling targeted stakeholder communication (see the traversal sketch after this list)
- Maintain the data health status page: an internal page (similar to a service status page) where business stakeholders can check whether their critical data sources are healthy, delayed, or under investigation
- Implement data SLA reporting: weekly automated reports to data domain owners showing their SLA compliance rate, number of incidents, mean time to detection (MTTD), and mean time to resolution (MTTR)
- Integrate data quality signals into the broader engineering observability stack — correlating data pipeline failures with infrastructure events (warehouse maintenance, network issues, source system deployments) in Grafana or Datadog
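The impact analysis above reduces to a graph traversal once lineage is collected. A minimal sketch, with a hypothetical lineage mapping standing in for what dbt or OpenLineage would provide:

```python
from collections import deque

# Hypothetical lineage: each key maps a node to its direct downstream consumers.
LINEAGE = {
    "raw.orders": ["stg_orders"],
    "stg_orders": ["fct_orders", "ml.order_features"],
    "fct_orders": ["dash.revenue", "reverse_etl.salesforce_sync"],
}

def blast_radius(node: str, lineage: dict[str, list[str]]) -> set[str]:
    """Breadth-first traversal returning every downstream asset affected by `node`."""
    affected, queue = set(), deque([node])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# blast_radius("raw.orders", LINEAGE) returns all five downstream assets,
# which is exactly the list a stakeholder notification would be scoped to.
```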
Key Principles
- Shift Left on Quality — Data defects caught at ingestion cost a fraction of those discovered in executive dashboards; every pipeline must pass quality gates before data reaches the next layer, just as application code must pass tests before deployment.
- Contracts Over Assumptions — Implicit expectations about data shape, freshness, and semantics are the root cause of most data incidents; explicit data contracts between producers and consumers make expectations enforceable and violations detectable.
- Statistical and Deterministic Testing Together — Deterministic tests (not null, unique, accepted values) catch known failure modes; statistical monitoring catches unknown unknowns — distribution drift, gradual volume decay, and subtle semantic changes that no predefined rule would flag (a combined sketch follows this list).
- Observability Is Not Optional — Lineage, dashboards, and alerting are core infrastructure, not nice-to-haves; without them, every data incident becomes an expensive, manual investigation that erodes stakeholder trust.
- Noise Destroys Trust — An alerting system that fires false positives trains engineers to ignore it; every alert threshold, suppression rule, and routing policy must be tuned continuously based on incident feedback to maintain signal quality.
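To make the deterministic-plus-statistical principle concrete, the sketch below evaluates a single column's daily null rate both ways. The thresholds are illustrative, not recommendations.

```python
import statistics

def check_null_rate(history: list[float], today: float,
                    hard_limit: float = 0.05, z_threshold: float = 3.0) -> list[str]:
    """Evaluate one column's daily null rate two ways and return any findings."""
    findings = []
    # Deterministic: a known failure mode with an explicit, contract-backed bound.
    if today > hard_limit:
        findings.append(f"null rate {today:.1%} exceeds hard limit {hard_limit:.1%}")
    # Statistical: drift relative to history catches changes the fixed rule misses.
    if len(history) >= 2:
        mean, stdev = statistics.mean(history), statistics.stdev(history)
        if stdev > 0 and abs(today - mean) / stdev > z_threshold:
            findings.append(f"null rate {today:.1%} drifted from baseline {mean:.1%}")
    return findings
```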
Workflow
- Quality Assessment — The Data Quality Architect audits the current data stack: catalogs all critical datasets, interviews stakeholders to identify known trust issues, and maps existing test coverage. The Observability & Alerting Engineer deploys lineage tracing across the warehouse.
- Contract Definition — The Data Quality Architect works with data producers and consumers to establish data contracts for Tier 1 datasets. Contracts specify schema expectations, freshness SLAs, volume bounds, and semantic constraints. Contracts are stored as version-controlled YAML alongside pipeline code.
- Test Implementation — The Test Engineer builds dbt test suites, Great Expectations checkpoints, and SodaCL scans for every contracted dataset. Tests are integrated into CI/CD so that model changes cannot merge without passing quality gates. Coverage gaps are tracked on the test coverage dashboard.
- Anomaly Baseline — The Anomaly Detection Specialist connects Monte Carlo or deploys custom statistical monitors across all Tier 1 and Tier 2 tables. Monitors learn baseline patterns for volume, freshness, schema, and distributions over a two-week training window before alerting is enabled.
- Alerting & Routing — The Observability & Alerting Engineer configures alert routing: Tier 1 violations to PagerDuty, Tier 2 to Slack channels, Tier 3 to the Jira backlog (a routing sketch follows this workflow). Alert correlation rules suppress cascading alerts from single root causes.
- Incident Response Dry Run — The team simulates a data quality incident end-to-end: detection, triage, impact analysis, stakeholder communication, resolution, and post-mortem. Runbooks and escalation paths are validated and refined.
- Ongoing Operations — Tests run on every pipeline execution. Anomaly monitors evaluate continuously. The team conducts weekly quality reviews, tunes alert thresholds based on false positive feedback, and publishes monthly SLA compliance reports to data domain owners.
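The tier-based routing configured in the Alerting & Routing step can start as a small lookup before growing into full PagerDuty or Opsgenie configuration. A sketch with hypothetical destinations:

```python
from dataclasses import dataclass

# Illustrative routing policy mirroring the workflow above; destinations are hypothetical.
ROUTING = {
    1: "pagerduty:data-oncall",
    2: "slack:#data-quality-warnings",
    3: "jira:DQ-backlog",
}

@dataclass
class QualityAlert:
    table: str
    tier: int
    message: str

def route(alert: QualityAlert) -> str:
    """Map an alert to a destination by dataset tier, defaulting to the backlog."""
    destination = ROUTING.get(alert.tier, ROUTING[3])
    return f"[{destination}] {alert.table}: {alert.message}"
```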
Output Artifacts
- Data Quality Framework Document — Comprehensive specification of quality dimensions, tier classifications, SLA definitions, ownership assignments, and escalation procedures for every data domain in the organization.
- Data Contract Registry — Version-controlled collection of YAML contract files defining schema, freshness SLAs, volume expectations, and semantic constraints for every Tier 1 and Tier 2 dataset — with contract validation integrated into CI/CD.
- dbt Test Suite — Complete set of built-in tests, custom generic tests, and singular tests covering every production dbt model — with a test coverage dashboard showing model-level coverage percentages and test execution history.
- Great Expectations & SodaCL Checkpoints — Expectation suites and SodaCL scan definitions for ingestion-layer validation, including schema checks, distribution validations, and freshness assertions — executable as Airflow task dependencies.
- Anomaly Detection Configuration — Monte Carlo connection setup or custom statistical monitor definitions with tuned thresholds, training windows, suppression rules, and segment-level monitoring for all critical tables.
- Data Quality Incident Dashboard — Grafana or Looker dashboard showing real-time SLA compliance, active alerts, pipeline status, freshness heatmaps, historical incident trends, and MTTD/MTTR metrics.
- Incident Response Runbooks — Per-table investigation guides specifying which queries to run, which upstream sources to check, blast radius of failures, stakeholder notification templates, and resolution procedures.
- Monthly SLA Compliance Report — Automated report delivered to data domain owners showing per-dataset quality scores, SLA attainment rates, incident counts, and trend analysis with improvement recommendations.
Ideal For
- Organizations where stakeholders have lost trust in dashboard accuracy and need a systematic path to data reliability
- Data teams adopting dbt that want to build comprehensive test coverage from day one rather than retroactively
- Companies implementing data mesh or domain-owned data products that need enforceable data contracts between teams
- ML engineering teams experiencing model degradation caused by upstream feature quality drift and training-serving skew
- Regulated industries (finance, healthcare, insurance) where data accuracy has compliance implications and audit requirements
- Organizations with growing data warehouse costs driven by silent pipeline failures that reprocess data unnecessarily
Integration Points
- dbt Core / dbt Cloud — Primary transformation testing layer. The Test Engineer writes tests alongside models, uses `dbt build` for test-aware execution, leverages `state:modified` for CI efficiency, and publishes test results to the dbt Cloud dashboard for visibility.
- Great Expectations / SodaCL — Ingestion and staging layer validation frameworks. Great Expectations provides Python-native expectation suites with rich profiling; SodaCL provides YAML-driven checks that integrate directly into Airflow DAGs as task-level quality gates.
- Monte Carlo / Metaplane — Automated data observability platforms that connect directly to Snowflake, BigQuery, or Redshift to monitor table health without manual rule configuration — providing out-of-the-box anomaly detection, schema change tracking, and lineage visualization.
- OpenLineage / Marquez — Open-source lineage collection and storage for pipelines outside the dbt graph — Spark jobs, Python scripts, Airflow operators — ensuring end-to-end traceability from source to dashboard across heterogeneous pipeline technologies.
- PagerDuty / Opsgenie / Slack — Alert routing and incident management. Tier 1 data contract violations trigger pages; Tier 2 warnings post to dedicated Slack channels; all incidents are tracked with MTTD and MTTR metrics for continuous improvement.
- Grafana / Datadog — Observability dashboards that correlate data quality signals with infrastructure metrics — enabling root cause analysis that spans pipeline failures, warehouse performance issues, and upstream system outages in a single view.
- Snowflake / BigQuery / Redshift — Data warehouse platforms where the Test Engineer runs quality checks, the Anomaly Detection Specialist monitors table health, and the Observability Engineer queries information_schema for schema change tracking and query history analysis.
- Airflow / Dagster / Prefect — Pipeline orchestration platforms where SodaCL checks and Great Expectations validations are embedded as task-level quality gates — failing a quality check blocks downstream processing and triggers alerting (the gate pattern is sketched below).
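Across all these orchestrators, the quality-gate pattern is the same: a task evaluates checks and raises on failure so downstream tasks never run. A framework-agnostic sketch (the check names are illustrative, and a real task would populate them from SodaCL or Great Expectations results):

```python
class QualityGateError(Exception):
    """Raised to fail the orchestrator task and block downstream processing."""

def quality_gate(checks: dict[str, bool]) -> None:
    """Evaluate named checks; raise (and thereby halt the pipeline) if any fail."""
    failures = [name for name, passed in checks.items() if not passed]
    if failures:
        raise QualityGateError(f"quality gate failed: {', '.join(failures)}")

# Inside an orchestrator task, after running the scan:
# quality_gate({"row_count > 0": True, "freshness < 2h": False})  # raises
```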
Getting Started
- Identify your Tier 1 datasets — Tell the Data Quality Architect which five to ten tables would cause the most business damage if they were wrong. These become the first datasets to receive contracts, tests, and monitoring.
- Map your current test coverage — Share your dbt project with the Test Engineer so they can audit existing test coverage and identify the highest-risk untested models.
- List your known data trust issues — Every data team has a mental list of "tables we don't fully trust." Sharing this list with the Anomaly Detection Specialist helps prioritize where statistical monitoring should be deployed first.
- Define your alert routing preferences — Tell the Observability & Alerting Engineer who should be paged for critical data failures, which Slack channels should receive warnings, and what your current incident response process looks like.