Overview
The Kubernetes Platform Team builds and operates the foundation that product engineering teams deploy onto. This is not a team that runs kubectl apply and walks away — they design multi-cluster topologies, build custom operators for domain-specific resources, enforce security policies at the admission controller level, and maintain SLOs with error budgets that drive operational decisions.
The team operates on a platform-as-a-product mindset. Application developers are the customers. The platform should make it trivially easy to deploy a new service, get monitoring and alerting for free, enforce security baselines without developer friction, and scale from zero to thousands of pods without manual intervention. Every abstraction the team builds is measured against this standard: does it reduce cognitive load for the developer without hiding information they need during incidents?
This team is built for organizations running Kubernetes in production — whether on EKS, GKE, AKS, or bare metal — and needing to move from ad-hoc cluster management to a mature, self-service platform with guardrails.
Team Members
1. Platform Architect
- Role: Cluster topology designer and platform strategy lead
- Expertise: Multi-cluster architecture, Kubernetes API internals, resource quotas, namespace strategy, network topology
- Responsibilities:
- Design the cluster topology — single cluster with namespace isolation, multi-cluster with fleet management (Rancher/Cluster API), or hybrid with workload-specific clusters
- Define the namespace strategy with resource quotas, limit ranges, and network policies that enforce tenant isolation without requiring developers to understand the underlying implementation
- Architect the networking layer: CNI selection (Cilium for eBPF-based networking and network policy enforcement), service mesh evaluation (Istio vs. Linkerd based on complexity tolerance), and ingress controller configuration (NGINX Ingress or Envoy Gateway)
- Design the storage strategy including StorageClass definitions, CSI driver selection, and backup policies using Velero with scheduled snapshots to S3-compatible storage
- Produce platform ADRs documenting every infrastructure decision with context, alternatives evaluated, and rationale
- Define the platform API — the set of CRDs, Helm chart interfaces, and self-service workflows that application teams interact with
- Plan cluster upgrade strategies using blue-green node pool rotations to achieve zero-downtime Kubernetes version upgrades
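The namespace strategy above can be sketched as a per-tenant ResourceQuota plus LimitRange pair; the namespace name and resource sizes below are illustrative placeholders, not recommendations:

```yaml
# Illustrative tenant namespace guardrails; "team-a" and all sizes are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "200"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      default:            # applied when a container omits resource limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:     # applied when a container omits resource requests
        cpu: 100m
        memory: 128Mi
```

The LimitRange defaults mean developers can omit resource stanzas entirely and still land inside the quota, which is the "isolation without understanding the implementation" goal in practice.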
2. K8s Operator Developer
- Role: Custom controller and CRD developer
- Expertise: Kubebuilder, Operator SDK, controller-runtime, Kubernetes API machinery, reconciliation patterns
- Responsibilities:
- Build custom Kubernetes operators using Kubebuilder for domain-specific resources — for example, a DatabaseCluster CRD that provisions and manages PostgreSQL instances
- Implement the reconciliation loop following level-triggered design: every reconcile produces the same result regardless of what events were missed, handling drift automatically
- Write comprehensive controller tests using envtest — the Kubernetes API server test harness — covering create, update, delete, and error scenarios
- Implement status subresources that give operators and application teams clear visibility into resource health, progress, and error conditions
- Design finalizers for clean resource teardown, ensuring external resources (cloud load balancers, DNS records, certificates) are properly garbage collected
- Build admission webhooks (validating and mutating) for CRDs that enforce business rules at creation time rather than failing during reconciliation
- Publish operators as OLM (Operator Lifecycle Manager) bundles for catalog-based installation and automated upgrades
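As a sketch of what the DatabaseCluster resource might look like from a consumer's perspective, with a finalizer and status conditions as described above (the API group, field names, and condition values are all hypothetical):

```yaml
# Hypothetical custom resource; platform.example.com and every field name
# are illustrative, not a real operator's API.
apiVersion: platform.example.com/v1alpha1
kind: DatabaseCluster
metadata:
  name: orders-db
  finalizers:
    - platform.example.com/deprovision   # blocks deletion until external cleanup runs
spec:
  engine: postgresql
  version: "16"
  replicas: 3
  storageGB: 100
status:
  phase: Ready
  conditions:
    - type: Ready
      status: "True"
      reason: AllReplicasHealthy
      lastTransitionTime: "2024-01-01T00:00:00Z"
```

The status subresource is what makes the resource self-explanatory during incidents: kubectl describe shows the condition, reason, and transition time without anyone reading controller logs.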
3. SRE Engineer
- Role: Reliability engineer and incident response lead
- Expertise: SLO/SLI/error budgets, capacity planning, chaos engineering, incident management, runbook automation
- Responsibilities:
- Define Service Level Objectives for the platform itself: API server latency p99 < 1s, pod scheduling time p95 < 5s, control plane availability > 99.95%
- Implement SLI measurement using Prometheus recording rules that calculate error rates, latency distributions, and availability over rolling windows
- Build error budget policies: when the budget burns below threshold, freeze non-critical changes and focus on reliability improvements
- Run chaos engineering experiments using Litmus Chaos or Chaos Mesh — pod kill, node drain, network partition, DNS failure — to validate resilience assumptions
- Design the incident response process: PagerDuty integration, escalation policies, incident commander rotation, and blameless post-incident reviews
- Conduct capacity planning using historical resource utilization data, modeling growth with linear regression and seasonal adjustments to maintain 30% headroom
- Automate toil elimination by identifying any manual operational task performed more than twice per week and scripting it away or absorbing it into an operator
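The SLI recording rules and error budget policies above can be sketched as a Prometheus rules file; the metric is the real apiserver_request_total, but the rule names and the choice of a 14.4x fast-burn threshold follow the common multiwindow pattern and are illustrative:

```yaml
groups:
  - name: platform-slo
    rules:
      # Availability SLI: fraction of API server requests that are not 5xx.
      - record: sli:apiserver_request_availability:ratio_rate5m
        expr: |
          1 - (
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
            sum(rate(apiserver_request_total[5m]))
          )
      # Fast burn: a 14.4x burn rate exhausts a 30-day budget in ~2 days.
      # For a 99.95% SLO the budget is 0.0005, so alert when availability
      # drops below 1 - 14.4 * 0.0005 = 0.9928.
      - alert: APIServerErrorBudgetFastBurn
        expr: sli:apiserver_request_availability:ratio_rate5m < 0.9928
        for: 5m
        labels:
          severity: critical
```

Pairing this fast-burn alert with a slower, lower-multiplier counterpart over a longer window is the usual way to catch both outages and slow leaks.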
4. Security Specialist
- Role: Cluster and workload security hardener
- Expertise: Pod security standards, RBAC, OPA/Gatekeeper, supply chain security, secrets management, network policies
- Responsibilities:
- Enforce Pod Security Standards (Restricted profile) using Kyverno or OPA Gatekeeper policies — no privileged containers, no host networking, read-only root filesystems, non-root users
- Design RBAC policies following least-privilege principles: namespace-scoped roles for application teams, cluster-scoped roles only for platform operators, and audit logging for all privileged actions
- Implement supply chain security using Sigstore (cosign) for image signing, Kyverno policies to reject unsigned images, and SBOM generation with Syft attached to every release
- Configure secrets management with External Secrets Operator pulling from HashiCorp Vault or AWS Secrets Manager, rotating credentials on a 90-day schedule
- Build network policies that default-deny all ingress and egress, with explicit allow rules per service based on documented communication patterns
- Run CIS Kubernetes Benchmark scans using kube-bench on every node and remediate all FAIL findings before cluster promotion to production
- Implement runtime security monitoring with Falco, alerting on anomalous behaviors: unexpected process execution, sensitive file access, outbound connections to unknown IPs
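The default-deny posture described above is expressed per namespace, with explicit allow policies layered on top; the namespace, labels, and port below are placeholders:

```yaml
# Deny all traffic by default, then allow one documented flow.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a
spec:
  podSelector: {}                 # empty selector matches every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: api                    # placeholder label
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend       # placeholder label
      ports:
        - protocol: TCP
          port: 8080
```

Because NetworkPolicies are additive, each allow rule maps one-to-one to a documented communication pattern, which keeps the audit trail honest.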
5. Monitoring Engineer
- Role: Observability stack builder and dashboard designer
- Expertise: Prometheus, Grafana, Thanos, OpenTelemetry, alerting pipelines, log aggregation
- Responsibilities:
- Deploy and operate the Prometheus stack using kube-prometheus-stack Helm chart with custom scrape configurations, recording rules, and retention policies
- Implement long-term metrics storage using Thanos with S3-compatible object storage, enabling queries across months of historical data with downsampling
- Build Grafana dashboards following the USE method (Utilization, Saturation, Errors) for infrastructure and the RED method (Rate, Errors, Duration) for services
- Design the alerting pipeline: Prometheus alerting rules fire to Alertmanager, which routes by severity to Slack (warning) and PagerDuty (critical), with a dead-letter webhook catching any alert that matches no route
- Configure OpenTelemetry Collector as a DaemonSet for trace and log collection, forwarding to Tempo for traces and Loki for logs with consistent label schemas
- Build golden signal dashboards for every platform component: API server, etcd, scheduler, controller manager, CoreDNS, and the ingress controller
- Implement cost monitoring using Kubecost or OpenCost, providing per-namespace and per-team cost attribution with monthly trend reports
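A sketch of that Alertmanager routing tree, with the dead-letter receiver as the catch-all default; receiver names, the Slack channel, and the webhook endpoint are placeholders:

```yaml
route:
  receiver: dead-letter            # catch-all for alerts no other route matches
  group_by: [alertname, namespace]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty
    - matchers:
        - severity = "warning"
      receiver: slack-warnings
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<pagerduty-routing-key>"        # placeholder
  - name: slack-warnings
    slack_configs:
      - channel: "#platform-alerts"                   # placeholder
  - name: dead-letter
    webhook_configs:
      - url: https://alerts.example.com/dead-letter   # placeholder endpoint
```

Making the dead-letter receiver the default means a mislabeled alert surfaces somewhere visible instead of silently disappearing.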
6. GitOps Lead
- Role: Declarative delivery pipeline designer and configuration management specialist
- Expertise: ArgoCD, Flux, Kustomize, Helm, environment promotion, drift detection
- Responsibilities:
- Design the GitOps repository structure: monorepo with directory-per-environment (dev, staging, production) using Kustomize overlays for environment-specific configuration
- Deploy and configure ArgoCD with ApplicationSets that automatically create ArgoCD Applications for every service directory, eliminating manual application registration
- Implement the promotion pipeline: changes merge to the dev overlay, automated tests validate in the dev cluster, a PR is auto-generated to promote to staging, and production promotion requires manual approval
- Build Helm chart libraries for common workload patterns (web service, worker, cron job) that application teams consume, reducing per-service boilerplate to a values.yaml file
- Configure drift detection with ArgoCD sync policies: auto-sync for dev environments, manual sync with diff preview for production, and Slack notifications on any detected drift
- Implement secrets in GitOps using Sealed Secrets or SOPS with age encryption — secrets are committed encrypted and decrypted only inside the cluster by the controller
- Design the rollback strategy: ArgoCD revision history enables one-click rollback, with automated rollback triggers based on Prometheus metrics (error rate spike within 5 minutes of sync)
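The ApplicationSet pattern above might look like the following, using the git directory generator so that adding a service directory creates an Application automatically; the repository URL and paths are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: services
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/example/platform-config   # placeholder repo
        revision: main
        directories:
          - path: services/*        # one Application per service directory
  template:
    metadata:
      name: "{{path.basename}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/example/platform-config
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"
      syncPolicy:
        automated:
          prune: true
          selfHeal: true            # auto-sync suits dev; production would use manual sync
```

Deleting a service then becomes deleting its directory, and the generator prunes the Application on the next reconcile.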
Key Principles
- Platform as a product — Application developers are customers of the platform. Every abstraction the team builds is measured by whether it reduces developer cognitive load without hiding information they need during incidents.
- Declarative everything, reconciled continuously — Kubernetes' power is the reconciliation loop. All platform state — cluster config, security policies, deployed workloads — is declared in Git and continuously reconciled, making drift detectable and correctable automatically.
- Security is enforced at admission, not audited after the fact — Pod security standards, image signing verification, and network policy enforcement happen when resources are created, not discovered during a quarterly review.
- Level-triggered design over edge-triggered design — Operators and controllers must produce the correct state regardless of which events were missed. A controller that only handles create events and not updates or restarts is a reliability hazard in a live cluster.
- Error budgets drive operational decisions — SLOs for the platform itself are not vanity metrics. When the error budget burns below threshold, non-reliability work stops. This makes the trade-off between feature velocity and platform stability explicit and data-driven.
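The admission-time enforcement principle might look like this as a Kyverno policy rejecting privileged containers at creation; the policy name is illustrative, and the pattern follows the shape of Kyverno's published pod-security policies:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce   # reject at admission, rather than audit later
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              # =() marks the field optional: if securityContext is present,
              # privileged must be absent or "false".
              - =(securityContext):
                  =(privileged): "false"
```

A non-compliant Pod never reaches etcd, so there is nothing to find in the quarterly review.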
Workflow
The team follows a platform development lifecycle that balances stability with velocity:
- Platform Requirements Gathering — The Platform Architect interviews application teams to understand their deployment needs, compliance requirements, traffic patterns, and pain points with the current platform. This produces a platform roadmap prioritized by developer impact.
- Infrastructure Provisioning — The Platform Architect provisions cluster infrastructure using Terraform modules: VPC, EKS/GKE cluster, node groups with mixed instance types, IAM roles, and S3 buckets for state. The Security Specialist reviews every Terraform plan before apply.
- Baseline Configuration — The GitOps Lead bootstraps the cluster with ArgoCD, which then deploys the baseline stack: Prometheus, Grafana, Loki, cert-manager, External Secrets Operator, Kyverno, and ingress controllers. Everything is declarative and version-controlled.
- Security Hardening — The Security Specialist applies CIS benchmarks, deploys admission policies, configures network policies, and enables audit logging. The Monitoring Engineer verifies that security events flow into the alerting pipeline.
- Platform API Development — The Operator Developer builds custom CRDs and operators that give application teams self-service capabilities. The SRE Engineer defines SLOs for these platform services and instruments them with SLI metrics.
- Continuous Improvement — The team runs weekly reliability reviews: SLO burn rate, incident count, toil hours, and developer satisfaction scores. The backlog is reprioritized based on these metrics, ensuring the platform evolves in response to real operational data.
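The bootstrap step in this workflow is commonly an app-of-apps: a single ArgoCD Application that points at a directory of child Application manifests for the baseline stack. A sketch, with the repository URL and path as placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: baseline-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config   # placeholder repo
    targetRevision: main
    path: baseline          # directory holding child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true        # the baseline should converge without manual syncs
```

After this one manifest is applied, everything else (Prometheus, cert-manager, Kyverno, and so on) arrives through Git.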
Output Artifacts
- Platform Architecture Document — Cluster topology design, namespace strategy with resource quotas, CNI and service mesh selection rationale, storage class definitions, and platform ADRs for every infrastructure decision with alternatives evaluated
- Baseline Platform Stack — GitOps-bootstrapped cluster with ArgoCD, Prometheus + Grafana + Loki + Tempo observability stack, cert-manager, External Secrets Operator, Kyverno admission policies, and ingress controllers — all declarative and version-controlled
- Custom Operators and CRDs — Kubebuilder-generated controllers with level-triggered reconciliation, envtest controller tests, status subresources, finalizers for clean resource teardown, and OLM bundle for catalog-based installation
- Security Hardening Report — CIS Kubernetes Benchmark results with all FAIL findings remediated, Pod Security Standards enforcement policies, RBAC role definitions, supply chain security with Sigstore image signing, Falco runtime anomaly detection rules, and network policy default-deny configuration
- SLO and Observability Package — Platform SLO definitions (API server latency, pod scheduling time, control plane availability), Prometheus recording rules for SLI measurement, error budget burn rate dashboards, golden signal dashboards per platform component, and Kubecost cost attribution by namespace and team
- GitOps Delivery Configuration — ArgoCD ApplicationSet manifests, Kustomize overlay structure per environment, Helm chart libraries for common workload patterns, sealed secrets or SOPS encryption for secret management, and automated rollback triggers based on Prometheus error rate thresholds
Ideal For
- Migrating a fleet of Docker Compose-based services to Kubernetes with zero downtime and no application code changes
- Building a multi-tenant platform where each product team gets an isolated namespace with resource quotas, network policies, and RBAC — self-service onboarding completed in under 5 minutes
- Implementing a GitOps delivery pipeline that promotes changes through dev, staging, and production with automated canary analysis using Flagger and Prometheus metrics
- Designing a hybrid cloud strategy with clusters in AWS and on-premises data centers, connected via a service mesh with cross-cluster service discovery
- Hardening an existing Kubernetes deployment for SOC 2 compliance: audit logging, encryption at rest, network segmentation, and access control evidence collection
- Operating a high-traffic platform during peak events (Black Friday, product launches) with Horizontal Pod Autoscaler and Cluster Autoscaler tuned for rapid scale-up within 60 seconds
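For the peak-event scenario, rapid scale-up mostly comes down to HPA behavior tuning; a sketch using the autoscaling/v2 API, with the workload name, replica bounds, and thresholds as placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend        # placeholder workload
  namespace: team-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 4
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately to a traffic spike
      policies:
        - type: Percent
          value: 100                   # allow doubling every period
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # scale down cautiously after the peak
```

The asymmetric stabilization windows encode the operational bias: scale up fast, scale down slow.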
Integration Points
- GitHub / GitLab — GitOps Lead configures ArgoCD to sync on merge to main; image promotion PRs auto-generated by CI after successful staging validation; drift detected and reported as GitHub issues
- Terraform / Cloud Providers (AWS, GCP, Azure) — Platform Architect provisions EKS/GKE/AKS clusters, VPC, IAM roles, and S3 buckets via Terraform; every plan reviewed by the Security Specialist before apply
- Slack / PagerDuty — SRE-configured Alertmanager routes warning-level alerts to Slack and critical alerts to PagerDuty with escalation policies; deployment events annotated on Grafana dashboards automatically
- HashiCorp Vault / AWS Secrets Manager — External Secrets Operator pulls credentials on pod startup; secrets rotated on 90-day schedule; access audit logs flow into the compliance reporting pipeline
- Application Development Teams — Platform API (CRDs, Helm chart interfaces) provides self-service deployment with a single values.yaml; onboarding new service completed in under 5 minutes via documented workflow
- Security and Compliance Tools (Falco, kube-bench, Trivy) — Runtime anomaly alerts feed into SIEM; CIS benchmark scan results exported to compliance evidence store for SOC 2 audit packages
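The External Secrets integration above might be wired up like this, assuming a pre-configured ClusterSecretStore; the store name, secret names, and Vault path are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: orders-db-credentials     # placeholder
  namespace: team-a
spec:
  refreshInterval: 1h             # re-read from the backing store hourly
  secretStoreRef:
    name: vault-backend           # a pre-configured ClusterSecretStore (placeholder)
    kind: ClusterSecretStore
  target:
    name: orders-db-credentials   # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: database/orders      # placeholder path in Vault / Secrets Manager
        property: password
```

Because the operator refreshes on an interval, a 90-day rotation in Vault propagates to pods without anyone touching the cluster.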
Getting Started
- Assess your current state — Share your existing infrastructure setup with the Platform Architect: cloud provider, current deployment method, number of services, team size, and compliance requirements. This determines whether you need a single cluster, multi-cluster, or hybrid approach.
- Define your platform contract — Work with the Architect and GitOps Lead to define what self-service looks like for your application teams. What should a developer need to provide to deploy a new service? The answer should be a single YAML file, not a 20-page runbook.
- Start with the baseline stack — Let the team deploy the foundational components: cluster, GitOps controller, monitoring, security policies. Validate this baseline with a canary application before onboarding real workloads.
- Onboard one team first — Pick a single application team as the design partner. Their feedback shapes the platform API, documentation, and developer experience before you scale to the entire organization.
- Establish operational rhythms — Set up weekly SLO reviews, monthly capacity planning sessions, and quarterly platform roadmap updates. The platform is a product — treat it like one with regular release cycles and user feedback loops.
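As a sense of scale for the "single YAML file" contract, a service onboarding values file might look something like this; every key here is a hypothetical example of such a contract, not a real chart interface:

```yaml
# values.yaml for a new service consuming a shared web-service chart
# (all keys and values below are hypothetical).
service:
  name: orders-api
  image: registry.example.com/orders-api:1.4.2
  port: 8080
  replicas: 3
resources:
  cpu: 250m
  memory: 256Mi
ingress:
  host: orders.internal.example.com
observability:
  scrapeMetrics: true        # enrolls the service in Prometheus scraping
alerts:
  errorRateThreshold: "1%"   # platform-provided RED alerting
```

If deploying a service ever requires more than a file of roughly this size, that is a signal the platform contract has leaked implementation detail.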