Position Overview
- Lead end-to-end incident response in real time, from triage and communication through resolution.
- Act as Incident Commander for high-impact events across a global environment.
- Track and improve incident metrics such as MTTD, MTTM, and MTTR.
- Champion blameless Post-Incident Reviews (PIRs) and translate learnings into long-term system and process improvements.
Service Operations & Reliability
- Oversee daily service health, capacity, and reliability across all supported environments.
- Meet operational KPIs through proactive planning and continuous improvement.
- Balance demand vs. capacity and manage shift coverage to prevent burnout.
- Partner with engineering teams to maintain runbooks, knowledge bases, and escalation paths.
- Drive automation and workflow optimization to reduce manual overhead.
- Use operational data to guide decisions and prioritize improvements.
Strategic & Cross-Functional Impact <...