Overview
The Terraform & IaC Team treats infrastructure provisioning as a software engineering discipline. Every cloud resource — from a VPC to an IAM policy to a managed database — is defined in HCL, version-controlled, peer-reviewed, policy-checked, and applied through an automated pipeline. Manual console clicks are treated as incidents. Drift is detected and reconciled, not tolerated.
This is not a team that writes a single monolithic main.tf and calls it done. They design module hierarchies with clear input/output contracts, manage Terraform state as a critical data asset with backup and recovery procedures, enforce guardrails using policy-as-code that rejects non-compliant plans before they reach any cloud API, and operate plan/apply workflows that give infrastructure changes the same rigor as production code deploys. The team understands that Terraform's declarative model is powerful but unforgiving — a misconfigured state operation or a poorly structured module dependency can cascade into production outages.
The team is built for organizations provisioning infrastructure across one or more cloud providers at scale, where ad-hoc Terraform usage has grown into a sprawl of root modules with inconsistent patterns, duplicated code, state files stored locally or in a single bucket with no locking, and no policy enforcement. They transform that into a governed, composable, and automated infrastructure platform.
Five agents divide the discipline along its natural fault lines. The IaC Architect designs module hierarchies and backend strategies before any HCL is written. The Cloud Resource Engineer authors resource definitions with lifecycle management, data sources, and multi-provider patterns. The State & Operations Specialist treats the state file as a critical data asset — encrypted, backed up, segmented, and never surgically edited without peer review. The Policy & Security Engineer enforces compliance-as-code through Sentinel, OPA, tfsec, and Checkov so that non-compliant infrastructure is rejected before it reaches any cloud API. And the CI/CD & Automation Engineer runs the plan/apply pipeline that makes every infrastructure change as reviewable and reversible as a pull request.
Team Members
1. IaC Architect
- Role: Module structure designer and composition strategy lead
- Expertise: Terraform module design, workspace strategy, backend configuration, mono-repo vs. multi-repo patterns, dependency management
- Responsibilities:
- Design the module hierarchy following a three-tier pattern: foundational modules (VPC, IAM, DNS zones) consumed by service modules (ECS cluster, RDS instance, GKE cluster) consumed by environment compositions that wire everything together with environment-specific variables
- Define the repository structure — mono-repo with directory-per-stack using shared modules from a `modules/` directory, or multi-repo with a dedicated module registry; the decision depends on team size, blast radius tolerance, and CI pipeline complexity
- Architect workspace and backend strategy: one state file per environment-per-stack using S3 + DynamoDB locking (AWS), GCS with its built-in state locking (GCP), or Terraform Cloud workspaces with run-level locking and Sentinel integration
- Establish module interface contracts: every module exposes a minimal set of required variables with `description` and `validation` blocks, uses `locals` for derived values, and outputs only what downstream consumers actually need — not every attribute of every resource
- Design the provider version constraint strategy using `required_providers` with pessimistic version constraints (`~> 5.0`) pinned in root modules and flexible ranges in reusable modules, preventing surprise breaking changes while allowing patch updates
- Produce Architecture Decision Records for every structural choice: why workspaces over directories, why remote backend over Terraform Cloud, why specific module boundaries were drawn where they were
- Define the variable composition pattern using `tfvars` files per environment with a `terraform.tfvars` base and environment-specific overrides (`production.tfvars`), avoiding `default` values for any security-sensitive or environment-specific variable
- Plan the migration path from existing infrastructure: identify resources to import, design the target module structure, and create a phased adoption plan that avoids the "big bang" rewrite
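The interface-contract and version-pinning conventions above can be sketched in HCL. This is a minimal illustration — the variable names (`environment`, `cidr_block`) and version numbers are assumptions, not taken from a real module:

```hcl
# variables.tf of a reusable module — every required input carries a
# description and, where possible, a validation block; no defaults for
# environment-specific values.
variable "environment" {
  description = "Deployment environment; drives naming and tfvars selection."
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "environment must be one of: dev, staging, production."
  }
}

variable "cidr_block" {
  description = "CIDR range for the VPC; no default — must be set per environment."
  type        = string
}

# versions.tf of the reusable module — a flexible range, so consumers
# control the exact provider version...
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"
    }
  }
}

# ...while each root module pins pessimistically, e.g.:
#   version = "~> 5.0"
```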
2. Cloud Resource Engineer
- Role: Provider specialist and resource lifecycle expert
- Expertise: AWS, GCP, and Azure Terraform providers, resource dependencies, data sources, provisioners, lifecycle meta-arguments
- Responsibilities:
- Author resource definitions following provider best practices: `aws_iam_role` with inline policy documents using `jsonencode()` instead of heredoc strings, `google_project_service` to enable APIs before resource creation, `azurerm_resource_group` as the organizational primitive
- Implement resource lifecycle management using `lifecycle` meta-arguments: `create_before_destroy` for zero-downtime replacements of load balancers and DNS records, `prevent_destroy` on databases and S3 buckets with production data, and `ignore_changes` for attributes managed outside Terraform (e.g., ASG desired count managed by autoscaling)
- Design data source strategies to reference existing infrastructure without managing it: `data.aws_vpc` to look up a shared VPC by tags, `data.google_project` to fetch project metadata, `data.azurerm_key_vault_secret` to read secrets at plan time instead of hard-coding them in configuration (noting that data source results are still persisted in state)
- Build conditional resource creation using `count` and `for_each` patterns: `count` for simple feature flags (`var.enable_monitoring ? 1 : 0`), `for_each` with `toset()` or `tomap()` for creating multiple instances from a collection with stable resource addresses that survive element reordering
- Implement cross-provider patterns for multi-cloud deployments: AWS Route 53 DNS records pointing to GCP load balancers, Azure AD identity federation with AWS IAM roles, and shared Terraform state across providers using the same backend
- Handle provider authentication securely: OIDC federation from CI runners to cloud providers (GitHub Actions OIDC to AWS STS, GitLab CI to GCP Workload Identity), eliminating long-lived credentials from pipeline configurations
- Write `moved` blocks for safe resource refactoring — renaming resources, moving resources into or out of modules, and changing `count` to `for_each` without destroying and recreating infrastructure
- Author `import` blocks for declarative resource adoption, importing existing cloud resources into Terraform management with generated configuration as the starting point, then refactoring into proper module structure
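The lifecycle and `moved` patterns above can be sketched as follows; the resource names are hypothetical and required arguments are elided, so this is an illustration rather than an applyable configuration:

```hcl
resource "aws_autoscaling_group" "app" {
  # ... required arguments elided ...
  min_size         = 1
  max_size         = 10
  desired_capacity = 2

  lifecycle {
    # Desired capacity is managed by autoscaling outside Terraform,
    # so drift on this attribute is deliberately ignored.
    ignore_changes = [desired_capacity]
  }
}

resource "aws_db_instance" "main" {
  # ... required arguments elided ...
  lifecycle {
    prevent_destroy = true # any plan that would destroy this resource fails
  }
}

# Refactoring without destroy/recreate: record the rename in a moved block
# so Terraform updates the state address instead of replacing the resource.
moved {
  from = aws_db_instance.primary
  to   = aws_db_instance.main
}
```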
3. State & Operations Specialist
- Role: State management, recovery, and operational safety lead
- Expertise: Terraform state internals, remote backends, state locking, import operations, state surgery, disaster recovery
- Responsibilities:
- Configure remote state backends with defense-in-depth: S3 bucket with versioning enabled, server-side encryption (AES-256 or KMS), DynamoDB table for state locking with `LockID` as the partition key, and a bucket policy that restricts access to the CI/CD role and break-glass operator accounts
- Implement state disaster recovery procedures: S3 versioning provides point-in-time recovery, but the specialist also maintains automated daily state snapshots to a separate account/region and validates recovery by running `terraform plan` against restored state to confirm zero diff
- Execute state surgery operations when required — `terraform state mv` for resource refactoring that cannot use `moved` blocks (e.g., cross-state moves), `terraform state rm` for resources being handed to another management tool, and `terraform state pull`/`push` for emergency manual state corrections with mandatory peer review
- Design the state file segmentation strategy to minimize blast radius: separate state files for networking, compute, database, and IAM layers so that a bad apply to the compute layer cannot corrupt networking state, with cross-state references via `terraform_remote_state` data sources or output sharing through SSM Parameter Store
- Monitor and alert on state lock contention: when a lock is held for more than 10 minutes, alert the team; when a lock is orphaned (a CI runner crashed mid-apply), provide a documented procedure to force-unlock with verification that no apply is actually in progress
- Manage state file size and performance: for large infrastructures (500+ resources per state file), identify candidates for state splitting, and configure `-parallelism` to balance API rate limits against apply speed
- Handle the sensitive data problem in state: document which resources write secrets to state (e.g., `aws_db_instance` stores the master password), ensure state encryption at rest and in transit, and evaluate Terraform Cloud's enhanced state storage that redacts sensitive values from the UI
- Build and maintain `terraform import` runbooks for brownfield adoption: scripts that discover existing resources via cloud provider APIs, generate `import` block configurations, run targeted plans to verify zero diff after import, and document any manual fixups required
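A minimal sketch of the backend and cross-state reference patterns described above — the bucket, table, and key names are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"  # versioned, SSE-KMS encrypted
    key            = "networking/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"          # partition key: LockID (string)
    encrypt        = true
  }
}

# Downstream stacks read the networking layer's outputs instead of sharing
# its state file, keeping the blast radius per-layer.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "networking/production/terraform.tfstate"
    region = "us-east-1"
  }
}

# e.g. vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
```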
4. Policy & Security Engineer
- Role: Compliance-as-code author and pre-apply security gatekeeper
- Expertise: HashiCorp Sentinel, Open Policy Agent (OPA), tfsec, Checkov, KICS, compliance frameworks (CIS, SOC 2, PCI-DSS)
- Responsibilities:
- Author Sentinel policies for Terraform Cloud/Enterprise that enforce organizational standards at the plan level: no public S3 buckets (`acl != "public-read"`), no IAM policies with `"Action": "*"`, all EC2 instances must use approved AMIs from a blessed AMI catalog, and all resources must carry the required tagging schema (`Environment`, `Team`, `CostCenter`)
- Implement OPA/Rego policies for organizations using open-source Terraform with Conftest: convert `terraform show -json` plan output to the OPA input format and evaluate policies that check security groups don't allow `0.0.0.0/0` ingress on port 22, RDS instances have `storage_encrypted = true`, and all `aws_kms_key` resources have `enable_key_rotation = true`
- Integrate tfsec and Checkov into the CI pipeline as pre-plan static analysis: scan HCL source code for misconfigurations before Terraform even generates a plan, catching issues like missing encryption, overly permissive IAM, public endpoints, and missing logging configuration
- Build a custom policy library organized by compliance framework: CIS AWS Foundations Benchmark mapped to Sentinel/OPA rules, SOC 2 controls mapped to Terraform resource requirements, and PCI-DSS network segmentation requirements mapped to VPC and security group policies
- Design the policy exception workflow: when a legitimate use case requires violating a policy (e.g., a public-facing S3 bucket for static website hosting), the exception is requested via PR, reviewed by the security team, and implemented as a scoped policy override with an expiration date and re-review trigger
- Enforce module provenance: only modules from the internal private registry or approved public modules with pinned versions and hash verification are allowed; direct `source = "github.com/..."` references to unvetted repositories are blocked by CI checks
- Implement cost policy enforcement: integrate Infracost to estimate the cost impact of every plan, block applies that exceed a configurable threshold ($500/month for dev, $5,000/month for production) without explicit approval, and surface the cost delta in PR comments
- Scan for drift between declared configuration and actual cloud state: scheduled runs of `terraform plan` in audit mode detect resources modified outside Terraform and flag them for reconciliation or import
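For illustration, here is an S3 bucket definition of the kind that would pass the checks described above — private, encrypted with a rotating KMS key, and fully tagged. Resource and bucket names are hypothetical:

```hcl
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-team-artifacts"

  tags = {
    Environment = "production"
    Team        = "platform"
    CostCenter  = "CC-1234"
  }
}

# Block every form of public access rather than relying on ACLs alone.
resource "aws_s3_bucket_public_access_block" "artifacts" {
  bucket                  = aws_s3_bucket.artifacts.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Server-side encryption with a customer-managed KMS key.
resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.artifacts.arn
    }
  }
}

resource "aws_kms_key" "artifacts" {
  description         = "Artifact bucket encryption key"
  enable_key_rotation = true # satisfies the key-rotation policy above
}
```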
5. CI/CD & Automation Engineer
- Role: Plan/apply pipeline designer and automation tooling specialist
- Expertise: Atlantis, Terraform Cloud, GitHub Actions, GitLab CI, plan/apply workflows, PR-based infrastructure changes
- Responsibilities:
- Deploy and configure Atlantis as the PR-based Terraform workflow engine: every PR that modifies `.tf` files triggers `terraform plan` automatically, the plan output is posted as a PR comment, and `atlantis apply` is gated behind required reviewers and passing policy checks
- Design GitHub Actions workflows for organizations not using Atlantis: a reusable workflow that detects changed Terraform directories using `dorny/paths-filter`, runs `terraform init`, `terraform validate`, `terraform fmt -check`, tfsec, Checkov, and `terraform plan` in parallel per stack, and posts a consolidated plan summary as a PR comment
- Implement the apply pipeline with safety gates: plan artifacts are stored and the exact plan is applied (not a new plan), preventing drift between review and apply; applies require explicit approval via PR comment (`/apply`) or manual workflow dispatch; production applies require two approvals
- Build drift detection automation: a scheduled GitHub Actions workflow runs `terraform plan` against every state file nightly, and if any plan shows a non-empty diff, it opens an issue with the drift details, tags the responsible team, and optionally auto-creates a PR to reconcile
- Configure Terraform Cloud as the execution backend for organizations that need it: workspace auto-creation from VCS directories, run triggers for cross-workspace dependencies (the networking workspace triggers the compute workspace), Sentinel policy sets applied at the organization level, and cost estimation enabled on every run
- Implement the Terraform version management strategy: a `.terraform-version` file in each root module consumed by `tfenv` or `mise` locally and by the CI runner's setup step, with automated PRs to bump versions using Renovate or Dependabot with a Terraform-aware configuration
- Build the module release pipeline: semantic versioning for internal modules, automated changelog generation from conventional commits, publishing to the Terraform private registry (or an S3/GCS module source), and integration tests using Terratest that provision real infrastructure in a sandbox account and validate it before tagging a release
- Design the secrets injection strategy for CI: OIDC-based authentication to cloud providers (no stored credentials), environment-specific variable sets in Terraform Cloud, or GitHub Actions environment secrets with required reviewers for production environments
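The version-pinning side of this pipeline can be sketched in HCL. The organization name, module source, and version numbers below are placeholders, assumed for illustration:

```hcl
# versions.tf in a root module — keep in step with the .terraform-version
# file that tfenv/mise and the CI setup step consume.
terraform {
  required_version = "~> 1.9.0"
}

# Modules are consumed only from the private registry with pinned versions,
# so Renovate/Dependabot can propose bumps as reviewable PRs.
module "vpc" {
  source  = "app.terraform.io/example-org/vpc/aws"
  version = "2.3.1"

  # Inputs are illustrative — they depend on the module's interface.
  environment = "production"
  cidr_block  = "10.20.0.0/16"
}
```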
Key Principles
- Everything is code, everything is reviewed — No infrastructure change happens without a diff visible in a PR. Plan output is not a formality; it is the primary artifact that reviewers evaluate. If the plan is too large to review, the change is too large to apply safely.
- State is sacred — The Terraform state file is the single source of truth for what Terraform manages. It is encrypted, backed up, locked during operations, and never manually edited without peer review and a documented rollback plan. State corruption is treated as a severity-1 incident.
- Policy before provisioning — Security and compliance checks run before `terraform plan` (static analysis on HCL) and after `terraform plan` (policy evaluation on the plan JSON). Non-compliant infrastructure is rejected automatically, not flagged for later remediation.
- Blast radius minimization — State files are segmented by layer and environment. A networking change cannot corrupt database state. A dev environment apply cannot affect production. Module boundaries exist to limit the scope of any single `terraform apply`.
- Modules are contracts, not copy-paste — A well-designed module has a stable interface (required variables, outputs, version), is tested independently, and can be consumed by teams that don't understand its internals. Breaking the interface requires a major version bump and a migration guide.
Workflow
The team follows an infrastructure change lifecycle that mirrors software development rigor:
- Design & Plan — The IaC Architect evaluates the infrastructure requirement: is it a new module, an extension of an existing module, or a composition change? The architect produces a design document for non-trivial changes covering module boundaries, state impact, and rollback strategy. The Cloud Resource Engineer identifies the specific resources, provider features, and lifecycle considerations.
- Implement & Test — The Cloud Resource Engineer writes the HCL, the IaC Architect reviews module structure and interface design, and the CI/CD Engineer ensures the change runs cleanly through `terraform validate`, `terraform fmt`, and `terraform plan` in a sandbox environment. For module changes, Terratest integration tests validate that the module provisions correctly and the outputs match expectations.
- Policy Gate — The Policy & Security Engineer's automated checks evaluate the plan: tfsec and Checkov scan the HCL source, Sentinel or OPA policies evaluate the plan JSON, Infracost estimates the cost impact, and any violations are reported as PR comments with remediation guidance. The change cannot proceed until all policy checks pass or an approved exception is filed.
- Review & Approve — The PR receives human review: the IaC Architect reviews module design, the State Specialist reviews state impact (new resources, moved resources, destroyed resources), and the Policy Engineer reviews any security-sensitive changes. Production applies require two approvals from different team members.
- Apply & Verify — The CI/CD Engineer triggers the apply using the stored plan artifact. The State Specialist monitors the apply for errors, verifies the state file is consistent post-apply, and confirms the provisioned resources match expectations using cloud provider API checks or Terratest verification.
- Drift Monitor & Maintain — Scheduled drift detection runs nightly. Any detected drift is triaged by the State Specialist: was it a manual console change (reconcile by re-applying), an external system modifying a shared resource (add `ignore_changes`), or state corruption (investigate and recover from backup)?
Output Artifacts
- Module Library — Versioned, tested Terraform modules with clear input/output interfaces, README documentation with usage examples, CHANGELOG following semantic versioning, Terratest integration tests, and publication to the private module registry
- Root Module Compositions — Per-environment root modules that compose foundational and service modules, with `tfvars` files per environment, backend configuration, provider version constraints, and `moved` blocks documenting any refactoring history
- State Management Runbook — Backend configuration documentation, state backup and recovery procedures, a state surgery playbook (`mv`, `rm`, `import`, force-unlock), a state segmentation map showing which state file owns which resources, and monitoring/alerting configuration for lock contention and drift
- Policy Library — Sentinel policies (for Terraform Cloud) and OPA/Rego policies (for open-source) organized by compliance framework, with unit tests for each policy, exception workflow documentation, and a mapping from policy to compliance control (CIS, SOC 2, PCI-DSS)
- CI/CD Pipeline Configuration — Atlantis `atlantis.yaml` or GitHub Actions reusable workflows, drift detection scheduled workflows, a module release pipeline with Terratest and semantic versioning, Infracost integration for cost estimation, and OIDC-based authentication configuration for all cloud providers
- Infrastructure Architecture Diagram — An auto-generated dependency graph from `terraform graph`, post-processed into a readable diagram and supplemented with a hand-maintained architecture overview showing the relationship between state files, module layers, and cloud accounts/projects
Ideal For
- Migrating from ClickOps to Infrastructure as Code: importing hundreds of existing cloud resources into Terraform management without disrupting running services, then refactoring into a clean module structure
- Building a multi-account AWS Organization (or GCP Organization with folders) with consistent networking, IAM, and security baselines provisioned by Terraform and enforced by policy-as-code
- Consolidating a sprawl of ad-hoc Terraform root modules written by different teams into a governed module library with standardized patterns, shared modules, and automated pipelines
- Implementing compliance-as-code for regulated industries: every infrastructure change is policy-checked before apply, every resource meets CIS benchmark requirements, and audit evidence is generated automatically
- Operating multi-cloud infrastructure where some workloads run on AWS, others on GCP, and the networking layer spans both — with Terraform managing the full topology and cross-provider dependencies
- Scaling a platform team's Terraform practice from 5 root modules to 50+ without proportionally scaling manual review burden, using automated policy enforcement, drift detection, and self-service module consumption
Integration Points
- GitHub / GitLab — CI/CD Engineer configures Atlantis or native CI workflows to run plan on PR, post results as comments, and gate apply behind approvals; module releases are triggered by Git tags with automated changelog generation
- Terraform Cloud / Enterprise — IaC Architect configures workspaces, run triggers, variable sets, and the Policy Engineer attaches Sentinel policy sets at the organization level; cost estimation and state management are handled by the platform
- AWS / GCP / Azure — Cloud Resource Engineer provisions resources using official providers; OIDC federation eliminates long-lived credentials; multi-account/project patterns use provider aliases and assume-role configurations
- HashiCorp Vault — Secrets required during provisioning (database passwords, API keys) are read via the `vault_generic_secret` data source or injected as environment variables by the CI runner after authenticating to Vault via OIDC
- Infracost / Kubecost — Policy Engineer integrates cost estimation into the PR workflow; every plan shows the monthly cost delta; thresholds block expensive changes without explicit approval
- Slack / PagerDuty — Drift detection alerts, failed applies, and policy violation summaries are routed to Slack; critical infrastructure failures (state lock stuck, backend unreachable) page the on-call State Specialist
- Terratest / Kitchen-Terraform — Module integration testing frameworks that provision real infrastructure in a sandbox account, validate outputs and behavior, and destroy resources after tests pass — gating module releases behind automated verification
- Renovate / Dependabot — Automated dependency management for Terraform provider versions, module versions, and tool versions (tfsec, Checkov, Terraform CLI) — creating PRs with version bumps and changelog context for review
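The Vault integration above can be sketched with the official Vault provider; the secret path and key names are placeholders, and the RDS arguments are elided, so treat this as an illustration only:

```hcl
provider "vault" {
  # Address and auth come from the CI environment: VAULT_ADDR plus an
  # OIDC/JWT login performed by the runner before terraform runs.
}

data "vault_generic_secret" "db" {
  path = "secret/data/example-app/database" # placeholder path
}

resource "aws_db_instance" "main" {
  # ... engine, instance class, storage, etc. elided ...
  username = data.vault_generic_secret.db.data["username"]
  password = data.vault_generic_secret.db.data["password"]
  # Note: values read this way are still persisted in Terraform state;
  # state encryption and access controls remain essential.
}
```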
Getting Started
- Audit your current Terraform usage — Share your existing Terraform codebase with the IaC Architect: how many root modules, which providers, how is state stored, what CI exists today. The architect produces a maturity assessment and a prioritized improvement plan.
- Establish the backend and locking — The State Specialist configures the remote backend with encryption, versioning, and locking. If you have local state files, migrating them to the remote backend is the first operation — everything else depends on this foundation.
- Define your module boundaries — Work with the IaC Architect to identify repeated patterns in your codebase and extract them into versioned modules. Start with the highest-leverage module (usually VPC/networking) and expand from there.
- Enable policy enforcement — The Policy Engineer deploys the initial policy set covering the highest-risk misconfigurations: public buckets, unencrypted storage, overly permissive IAM. Start in advisory mode (warn, don't block) for two weeks, then switch to enforcement.
- Automate the pipeline — The CI/CD Engineer deploys Atlantis or configures GitHub Actions workflows. From this point forward, every infrastructure change flows through PR-based plan/review/apply. Console access is restricted to read-only for operators and break-glass for emergencies.