Position Overview
As a Site Reliability Engineer, you'll bridge the gap between software development and operations, applying software engineering principles to infrastructure and operations problems. You'll help design, build, and maintain the systems that keep our services reliable and scalable while working closely with development teams to improve application performance and resilience.
Responsibilities
- Design, implement, and maintain infrastructure with a focus on security, scalability, reliability, and automation, using infrastructure-as-code tools like Terraform or CloudFormation
- Build and maintain scalable, resilient production systems, automating operational tasks wherever possible
- Develop and implement monitoring solutions to ensure system health, performance, and availability
- Lead incident response, perform root cause analysis, and implement preventative measures
- Track SLOs and SLAs to measure and improve service reliability and error budgets to dr...