Position Overview
Job Description
- Leadership & Strategy
- Define and implement SRE best practices across the organization.
- Proven expertise in production support, resilience engineering, disaster recovery (DR), automation, and cloud operations.
- Mentor and guide a team of SREs, fostering growth and technical excellence.
- Collaborate with senior stakeholders to align reliability goals with business objectives.
- Reliability & Performance
- Establish SLIs, SLOs, and SLAs for critical services and ensure adherence.
- Drive initiatives to improve system resilience and reduce operational toil.
- Excel at designing systems that detect and remediate issues without manual intervention, such as self-healing systems and runbook automation.
- Exposure to chaos engineering tools such as Gremlin, Chaos Monkey, and AWS FIS to simulate outages and improve fault tolerance.
- Incident Manageme...