Position Overview
We are looking for a Site Reliability Engineer (SRE) responsible for ensuring the reliability, availability, performance, and efficiency of our production systems and services. The role involves collaborating closely with development, infrastructure, and support teams to build robust, scalable, and observable platforms.
Resilience
- Enhance the resilience of application services and infrastructure through self‑healing and automated failover, targeting 99.99% uptime.
- Assist with planned, randomized disruption of production infrastructure so that teams are held accountable for building resilient, always‑on systems.
- Build resilience into the application so that underlying system failures are handled gracefully and do not impact end users.
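To put the 99.99% uptime target in concrete terms, the short sketch below converts an availability percentage into a downtime budget. The function name and the 30-day and 365-day measurement windows are illustrative assumptions; this posting does not define how uptime is measured.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    Hypothetical helper for illustration only; the measurement window
    (period_days) is an assumption, not something the posting specifies.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The budget is the fraction of the period that may be spent unavailable.
    return total_minutes * (100 - availability_pct) / 100


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        budget = allowed_downtime_minutes(99.99, days)
        print(f"{label}: {budget:.2f} minutes of downtime allowed")
```

Under these assumptions, 99.99% leaves roughly 4.32 minutes of downtime in a 30-day month and about 52.56 minutes in a 365-day year. A budget that small is why the role emphasizes self‑healing and automated failover over manual recovery.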
Efficiency
- Identify opportunities to eliminate manual, repeatable activities (toil) via tooling and automation.
- Reduce repeat incidents by permanently fixing their underlying root causes.