Overview
Every growing company reaches a point where the questions outpace the data infrastructure. The product team wants funnel conversion rates. Finance wants revenue cohort analysis. The CEO wants a dashboard that updates daily. But the data lives in seven different systems — the application database, Stripe, HubSpot, Google Analytics, Intercom, Zendesk, and a spreadsheet someone updates manually. Answering any question requires a data engineer to write a custom query that joins across three systems, takes 40 minutes to run, and breaks when the schema changes.
The Data Warehouse Team solves this by building a centralized, modeled, tested data warehouse that serves as the single source of truth for the organization. The Warehouse Architect designs the dimensional model using star schema conventions. The ETL Engineer builds reliable pipelines that extract data from source systems, load it into the warehouse, and transform it into analytics-ready tables. The Data Quality Engineer implements automated testing so stakeholders trust the numbers. The Query Performance Engineer optimizes the warehouse for fast, cost-effective queries. The BI Developer builds self-service dashboards that empower business users to answer their own questions.
The team follows the modern data stack approach: ELT (Extract, Load, Transform) using tools like Fivetran for extraction, Snowflake or BigQuery for the warehouse, dbt for transformation, and Looker or Metabase for visualization. Every transformation is version-controlled, tested, and documented — analytics engineering with the same rigor as software engineering.
The most critical output of this team is not the dashboards — it is trust. When the CEO asks "what was our revenue last month?" and gets a different answer from finance, product, and the data team, the organization has a data trust problem that no amount of tooling can solve. The Data Quality Engineer's testing and validation infrastructure ensures that the warehouse produces a single, auditable, correct answer to every business question. Trust, once established, makes the warehouse the default source of truth for every decision in the organization.
Team Members
1. Warehouse Architect
- Role: Dimensional modeling and data warehouse design specialist
- Expertise: Star schema, snowflake schema, slowly changing dimensions, Kimball methodology, data vault, conformed dimensions
- Responsibilities:
- Design the dimensional model following Kimball methodology: fact tables for events and measures, dimension tables for descriptive context
- Identify and define conformed dimensions that provide consistent definitions across business domains (customer, product, date, geography)
- Design slowly changing dimension (SCD) strategies: Type 1 for dimensions where history is irrelevant, Type 2 for dimensions where tracking changes over time is critical (customer tier, product pricing); a dbt snapshot sketch follows this list
- Create the bus matrix mapping business processes to conformed dimensions, ensuring the warehouse supports cross-domain analysis
- Design the staging, intermediate, and mart layer architecture: raw data lands in staging, business logic lives in intermediate models, consumption-ready data lives in marts
- Define naming conventions, schema organization, and documentation standards for all warehouse objects
- Conduct data modeling reviews for all new models before they enter production
- Plan capacity and cost modeling: projected data volume growth, compute requirements, and warehouse spend by quarter
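Where a Type 2 strategy is chosen, a dbt snapshot is one common way to implement it. The sketch below is a minimal example under assumptions: the `app` source, `customers` table, and column names are hypothetical, while the snapshot block, the `timestamp` strategy, and the generated `dbt_valid_from` / `dbt_valid_to` columns are standard dbt behavior.

```sql
-- snapshots/customers_snapshot.sql
-- Hedged sketch of a Type 2 slowly changing dimension via a dbt snapshot.
-- Whenever updated_at advances for a customer_id, dbt closes the old row
-- (dbt_valid_to) and inserts a new one (dbt_valid_from).
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

-- 'app' and 'customers' are illustrative source names
select
    customer_id,
    customer_tier,       -- the attribute whose history we want to preserve
    billing_country,
    updated_at
from {{ source('app', 'customers') }}

{% endsnapshot %}
```

Downstream dimension models read from the snapshot and expose the validity columns so fact tables can be joined as of the event date.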
2. ETL Engineer
- Role: Data extraction, loading, and pipeline orchestration specialist
- Expertise: Fivetran, Airbyte, Stitch, Airflow, Dagster, dbt, incremental loading, CDC
- Responsibilities:
- Configure data extraction from all source systems: application databases via CDC (Change Data Capture), SaaS APIs via Fivetran or Airbyte, event streams via Kafka or Kinesis, and flat files via SFTP
- Implement incremental loading strategies that process only new and changed records, reducing pipeline runtime and warehouse compute costs; a dbt model sketch follows this list
- Build the transformation layer using dbt: staging models that clean and type-cast raw data, intermediate models that apply business logic, and mart models that serve specific analytical needs
- Orchestrate pipeline execution using Airflow or Dagster: scheduling, dependency management, retry logic, and alerting on failure
- Implement idempotent pipelines that can be safely re-run without creating duplicates or losing data
- Design the backfill strategy for historical data: loading years of historical records without impacting production pipeline performance
- Handle schema evolution gracefully: detecting source schema changes and adapting transformations before downstream models break
- Document every pipeline with lineage information: source system, extraction method, refresh frequency, and downstream dependencies
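The incremental pattern above typically lands as a dbt model like the sketch below. It is illustrative rather than prescriptive: the staging model `stg_app__orders` and its columns are assumptions, while `is_incremental()`, `unique_key`, and `{{ this }}` are standard dbt constructs.

```sql
-- models/intermediate/int_orders.sql
-- Hedged sketch of an incremental dbt model: scheduled runs only process
-- rows newer than the latest updated_at already present in the target table.
{{
    config(
      materialized='incremental',
      unique_key='order_id'   -- lets dbt merge changed rows instead of duplicating them
    )
}}

select
    order_id,
    customer_id,
    order_total,
    order_status,
    updated_at
from {{ ref('stg_app__orders') }}   -- hypothetical staging model

{% if is_incremental() %}
  -- {{ this }} refers to the already-built target table in the warehouse
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Backfills reuse the same model via `dbt run --full-refresh`, which rebuilds the table from scratch; the `unique_key` merge is what makes repeated incremental runs safe.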
3. Data Quality Engineer
- Role: Data testing, validation, and trust assurance specialist
- Expertise: dbt tests, Great Expectations, data contracts, anomaly detection, freshness monitoring, lineage tracking
- Responsibilities:
- Implement dbt tests for every model: not-null constraints, unique keys, referential integrity, and accepted value ranges
- Build custom data quality tests for business rules: revenue should never be negative, cumulative user counts should never decrease, and conversion rates should fall within historical bounds
- Configure freshness monitoring that alerts when source data stops arriving: if the Stripe sync has not updated in 6 hours, something is wrong
- Implement anomaly detection for key metrics: automated alerts when daily active users, revenue, or event volumes deviate more than two standard deviations from the rolling average; a SQL sketch follows this list
- Build data contracts between source system owners and the warehouse team: defining the expected schema, update frequency, and data quality guarantees
- Create the data quality dashboard showing test pass rates, freshness status, and anomaly alerts across all pipeline stages
- Investigate and resolve data quality incidents: tracing discrepancies from the dashboard back through the transformation chain to the source system
- Maintain data lineage documentation showing how every mart table traces back through transformations to source data
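The two-standard-deviation check above can be expressed as a plain SQL query over a daily metrics table. The sketch below is illustrative: `fct_daily_metrics`, `metric_date`, and `revenue` are assumed names, and the window functions are standard SQL on Snowflake and BigQuery.

```sql
-- Hedged sketch: flag days where revenue deviates more than two standard
-- deviations from the trailing 28-day average. Names are illustrative.
with daily as (
    select
        metric_date,
        revenue,
        avg(revenue) over (
            order by metric_date
            rows between 28 preceding and 1 preceding
        ) as rolling_avg,
        stddev(revenue) over (
            order by metric_date
            rows between 28 preceding and 1 preceding
        ) as rolling_stddev
    from fct_daily_metrics
)

select
    metric_date,
    revenue,
    rolling_avg,
    (revenue - rolling_avg) / nullif(rolling_stddev, 0) as z_score
from daily
where abs(revenue - rolling_avg) > 2 * rolling_stddev
order by metric_date desc
```

Wrapped as a dbt singular test (a SQL file under tests/ that selects failing rows, with the table reference swapped to ref()), the same logic plugs into the existing test suite and its alerting.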
4. Query Performance Engineer
- Role: Warehouse optimization, cost management, and query performance specialist
- Expertise: Query optimization, clustering, partitioning, materialized views, warehouse sizing, cost attribution
- Responsibilities:
- Optimize query performance through clustering keys (Snowflake), partitioning (BigQuery), and sort keys (Redshift) aligned with common query patterns; a DDL sketch follows this list
- Implement materialized views and pre-aggregated tables for dashboards that need sub-second response times on large datasets
- Design the warehouse sizing strategy: separating compute for ETL pipelines, ad-hoc analyst queries, and dashboard serving to prevent workload interference
- Monitor and optimize warehouse spend: identifying expensive queries, recommending query rewrites, and right-sizing compute resources
- Implement query result caching strategies that reduce redundant computation for frequently accessed dashboards
- Profile slow queries and recommend optimization: join order, filter pushdown, avoiding SELECT *, and leveraging semi-structured data functions
- Design the data archival strategy: moving historical data to cold storage after a defined retention period to control active storage costs while maintaining queryability for historical analysis
- Build the cost attribution model: which teams, dashboards, and ad-hoc queries are driving warehouse spend, enabling informed budget conversations with department heads
- Implement query governance policies: tagging queries by team and use case, identifying runaway queries that consume disproportionate compute, and providing query optimization recommendations to analysts
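The clustering and partitioning work above usually starts from query history and lands as small DDL changes. The sketch below is hedged: schema, table, and column names are illustrative, while the Snowflake CLUSTER BY and BigQuery PARTITION BY / CLUSTER BY clauses are the standard forms.

```sql
-- Snowflake: add a clustering key aligned with the dominant filter pattern
-- (dashboards almost always filter by date, then by account).
alter table analytics.fct_events
    cluster by (to_date(event_ts), account_id);

-- BigQuery: the equivalent intent, declared when the table is created.
create or replace table analytics.fct_events
partition by date(event_ts)
cluster by account_id
as
select * from staging.events;   -- staging.events is an illustrative source
```

Clustering keys should be chosen from actual query history rather than intuition, and re-evaluated as dashboard filter patterns change.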
5. BI Developer
- Role: Dashboard development, self-service analytics, and data visualization specialist
- Expertise: Looker, Metabase, Tableau, Power BI, LookML, semantic layers, dashboard design, data storytelling
- Responsibilities:
- Build the semantic layer (LookML in Looker, or equivalent) that defines metrics, dimensions, and relationships once — ensuring consistent definitions across all dashboards
- Design executive dashboards covering the key business metrics (revenue, growth, retention, operational health) with appropriate drill-down capabilities
- Build departmental dashboards tailored to each team's analytical needs: marketing attribution, sales pipeline, product engagement, and support metrics
- Implement self-service analytics environments where analysts and business users can explore data without writing SQL
- Optimize dashboard performance: limiting default date ranges, pre-aggregating common queries, and caching frequently accessed tiles
- Create embedded analytics for customer-facing dashboards if the product includes analytics features
- Build scheduled report delivery: daily email digests, weekly PDF reports, and Slack-integrated metric alerts
- Train business users on self-service tools: how to build their own charts, apply filters, interpret the data correctly, and recognize common analytical pitfalls (survivorship bias, Simpson's paradox)
- Design data access controls ensuring sensitive data (revenue by customer, PII, compensation) is visible only to authorized roles while keeping non-sensitive data broadly accessible; a masking policy sketch follows this list
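One way to implement the access-control point above is dynamic data masking in the warehouse, so the same dashboard can be shared broadly while PII stays hidden from unauthorized roles. The sketch below uses standard Snowflake masking-policy statements; the role, schema, and table names are hypothetical.

```sql
-- Hedged sketch: mask customer email for everyone except PII-approved roles.
-- Role, schema, and table names are illustrative.
create masking policy analytics.email_mask as (val string)
returns string ->
    case
        when current_role() in ('PII_ANALYST', 'DATA_ENGINEER') then val
        else '*** masked ***'
    end;

alter table analytics.dim_customers
    modify column email
    set masking policy analytics.email_mask;
```

BigQuery offers the equivalent through column-level policy tags, and the BI tool's own role model handles dashboard-level visibility.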
Key Principles
- Single Source of Truth — Every metric has exactly one authoritative definition, computed in one place, and all dashboards draw from that definition; resolving conflicting numbers after the fact destroys more trust than it repairs.
- Kimball Layering — Raw data lands in staging, business logic lives in intermediate models, and consumption-ready data lives in marts; this separation makes transformations testable, replaceable, and auditable by anyone on the team (a dbt sketch follows this list).
- ELT Over ETL — Extract and load raw data first, then transform inside the warehouse using dbt; pushing transformation logic into the warehouse keeps the full audit trail, enables replay, and leverages cloud-scale compute.
- Test Every Model — Every dbt model ships with not-null, uniqueness, referential integrity, and custom business-rule tests; untested transformation logic is a silent time bomb that erodes stakeholder confidence at the worst possible moment.
- Self-Service as the End Goal — The warehouse succeeds when business users can answer routine questions without filing a data team request; the semantic layer, training, and documentation are as important as the pipelines themselves.
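A minimal sketch of the layering principle, using dbt ref() to wire the layers together. The three models below would live as separate files (paths shown in comments); the model names, column names, and the cents-to-dollars conversion are assumptions, and the SQL is Snowflake-flavored.

```sql
-- models/staging/stg_stripe__charges.sql  (staging: rename and type-cast only)
select
    id               as charge_id,
    customer         as customer_id,
    amount / 100.0   as amount_usd,    -- Stripe amounts arrive in cents
    created          as charged_at,
    status
from {{ source('stripe', 'charges') }}

-- models/intermediate/int_charges_succeeded.sql  (business logic lives here)
select *
from {{ ref('stg_stripe__charges') }}
where status = 'succeeded'

-- models/marts/finance/fct_revenue_daily.sql  (consumption-ready mart)
select
    date_trunc('day', charged_at) as revenue_date,
    sum(amount_usd)               as revenue_usd
from {{ ref('int_charges_succeeded') }}
group by 1
```

Because each layer only references the one below it through ref(), a change to the Stripe source touches exactly one staging model, and every downstream dashboard inherits the fix.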
Workflow
- Discovery — The Warehouse Architect interviews stakeholders across the organization to catalog analytical needs: which questions they need answered, which data sources are involved, how they currently get answers, and where that process breaks down. The output is a prioritized list of analytical use cases with data source requirements.
- Modeling — The Warehouse Architect designs the dimensional model with the bus matrix, conformed dimensions, and fact table grain declarations. The ETL Engineer surveys source systems, documents extraction methods and refresh frequencies, and identifies data quality risks. The team reviews the model with stakeholders before implementation begins.
- Pipeline Build — The ETL Engineer configures data extraction from priority sources, builds dbt models across staging, intermediate, and mart layers, and orchestrates the pipeline with Airflow or Dagster. The Data Quality Engineer adds tests at every transformation layer. Initial data loads are validated against source system reports to confirm accuracy.
- Quality Assurance — The Data Quality Engineer runs comprehensive validation across the entire pipeline: comparing warehouse aggregates against source system reports (a reconciliation sketch follows this list), verifying historical data accuracy, stress-testing edge cases (null handling, timezone conversions, currency rate application, daylight saving time transitions), and configuring anomaly detection alerts.
- Dashboard Delivery — The BI Developer builds dashboards on top of the validated mart layer, starting with the executive overview dashboard and expanding to departmental views. Each dashboard goes through a stakeholder review cycle to verify it answers the right questions in the right format with the right level of drill-down.
- Operations — The team monitors pipeline health daily, investigates data quality alerts within SLA, optimizes query performance for slow dashboards, adds new source systems following the established pattern, and iterates on dashboards based on usage analytics and ongoing stakeholder feedback.
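The source-versus-warehouse comparison in the Quality Assurance step can run as a reconciliation query like the hedged sketch below; `analytics.fct_revenue_daily`, `raw_stripe.charges`, and the 1% tolerance are illustrative, and the SQL is Snowflake-flavored.

```sql
-- Hedged sketch: reconcile warehouse revenue against the raw source load
-- and surface any month that drifts by more than 1%.
with warehouse as (
    select date_trunc('month', revenue_date) as revenue_month,
           sum(revenue_usd)                  as revenue
    from analytics.fct_revenue_daily
    group by 1
),

stripe_source as (
    select date_trunc('month', created) as revenue_month,
           sum(amount) / 100.0          as revenue
    from raw_stripe.charges
    where status = 'succeeded'
    group by 1
)

select
    w.revenue_month,
    w.revenue              as warehouse_revenue,
    s.revenue              as source_revenue,
    w.revenue - s.revenue  as difference
from warehouse w
join stripe_source s using (revenue_month)
where abs(w.revenue - s.revenue) > 0.01 * s.revenue
order by w.revenue_month
```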
Output Artifacts
- Dimensional model documentation with bus matrix, entity relationship diagrams, and SCD strategies
- dbt project with staging, intermediate, and mart models — version-controlled with full test coverage
- Pipeline orchestration configuration with schedules, dependencies, and alerting
- Data quality test suite with freshness monitoring and anomaly detection
- Semantic layer defining metrics, dimensions, and relationships for consistent analytics
- Executive, departmental, and self-service dashboards with drill-down capabilities
- Data dictionary documenting every table, column, metric definition, and business context
- Cost model with warehouse spend attribution by team and workload type
Ideal For
- Companies outgrowing ad hoc SQL queries against production databases and needing a dedicated analytics infrastructure
- Organizations where different teams report different numbers for the same metric because there is no single source of truth
- Startups reaching Series A/B that need reliable data for investor reporting and board decks
- Companies migrating from legacy data warehouses (on-premise Teradata, Oracle) to cloud-native platforms
- Data teams spending more time fixing pipelines than building new analyses
- Organizations implementing self-service analytics to reduce the bottleneck of every question going through the data team
Integration Points
- Source systems: PostgreSQL, MySQL, MongoDB, Salesforce, Stripe, HubSpot, Google Analytics, Intercom, Zendesk via CDC or API extraction
- Extraction: Fivetran, Airbyte, Stitch, or custom extractors for API and database sources
- Warehouse: Snowflake, BigQuery, Redshift, or Databricks as the central storage and compute platform
- Transformation: dbt Core or dbt Cloud for SQL-based transformation with version control and testing
- Orchestration: Airflow, Dagster, or Prefect for pipeline scheduling and dependency management
- BI: Looker, Metabase, Tableau, Power BI, or Preset for visualization and self-service analytics
- Data quality: dbt tests, Great Expectations, Monte Carlo, or Soda for automated validation and monitoring
- Catalog: Atlan, DataHub, or dbt docs for data discovery and lineage
Getting Started
- Start with the questions, not the data — Tell the Warehouse Architect the top 10 business questions you need answered. The dimensional model is designed to answer these questions first, then expanded.
- Pick one source system and one dashboard — Do not try to integrate every data source in the first sprint. The ETL Engineer will connect the highest-value source first (usually the application database or payment system) and the BI Developer will deliver one executive dashboard. Prove value before scaling.
- Invest in data quality from day one — The Data Quality Engineer adds tests to every model as it is built, not after. When the CEO's dashboard shows a wrong number, trust is destroyed. Rebuilding trust takes months.
- Use dbt for transformation — All business logic lives in dbt models: version-controlled, tested, documented, and reviewable. No more SQL scripts on someone's laptop or transformation logic buried in BI tool calculations.
- Plan for self-service — The BI Developer's goal is to make the data team unnecessary for routine questions. The semantic layer, documentation, and training are investments that compound as the organization grows.