Build Self‑Service Infrastructure: Design and scale highly available Infrastructure as Code (IaC) modules using Terraform, empowering development teams to provision resources autonomously and securely.
Champion Platform Reliability: Partner closely with engineering teams to define, measure, and operationalize SRE metrics (SLIs, SLOs, and Error Budgets) to balance feature velocity with system stability.
Drive Advanced Observability: Architect a comprehensive, unified observability stack (Elastic Cloud, Grafana, Prometheus) to monitor APM, logs, and metrics; implement event correlation to reduce alert fatigue and Mean Time to Resolution (MTTR).