Position Overview
Responsibilities
- Own reliability, availability, scalability, and security of production systems
- Design and operate highly available, fault‑tolerant, multi‑region cloud architectures
- Define and manage SLOs, SLIs, SLAs, and error budgets for critical services
- Lead high‑severity incidents and drive effective post‑incident reviews
- Improve MTTD and MTTR through automation, tooling, and runbooks
- Operate and evolve Kubernetes (EKS) platforms and multi‑tenant deployments
- Build and maintain Infrastructure‑as‑Code (Terraform, CloudFormation, Pulumi) at scale
- Build and improve CI/CD pipelines and deployment safeguards
- Design and maintain observability (metrics, logs, traces, alerting)
- Drive capacity planning, performance optimisation, and cloud cost efficiency
- Partner with Security & Compliance on SOC 2, ISO 27001, GDPR, and DORA controls
- Mentor SREs and influence reliability‑first engine...