Looking for a Site Reliability Engineer (SRE) with prior development and architecture experience building Java-based enterprise applications, who is also seasoned in handling operational/production support issues.
Adopt SRE best practices: Work with dev teams to define Non-Functional Requirements such as reliability, performance, scalability, and application logging for observability. Define SLIs/SLOs and error budgets, with a focus on automation.
Incident Management: Lead the response to production issues, from identifying and troubleshooting problems to implementing immediate fixes. Ensure minimal downtime and adherence to service level agreements (SLAs). Recent and frequent engagement during incidents is a must.
Observability: Build alerting, monitoring, and dashboards that identify problems proactively. Recent hands-on experience with threshold-based Al...