Position Overview
Job Summary
SRE Specialist with 7–8 years of experience in production support, specializing in incident management, system reliability, monitoring, and automation. Proven ability to reduce downtime, improve system performance, and ensure high availability of critical applications.
Key Responsibilities
- Manage 24/7 production support for critical applications, ensuring high availability and system stability
- Monitor application and infrastructure health with alerting and observability tools to identify issues proactively
- Handle the full incident management lifecycle, including detection, troubleshooting, escalation, and resolution
- Perform root cause analysis (RCA) for major incidents and implement preventive measures
- Reduce MTTR (Mean Time to Resolution) through automation and improved runbooks
- Participate in on-call rotations and provide timely resolution for production incidents
- Collaborate with ...