Position Overview:
Improve system reliability at Kaseya as a Site Reliability Engineer focused on maintaining production environments in AWS. You will lead incident response and expand automation across services.
You will define SLOs and manage incident resolution, keeping the systems that thousands of MSPs depend on stable. Your work will include building reliable, automated infrastructure and proactively monitoring systems to catch issues early. Team collaboration is key to integrating reliability practices into workflows.
Key Responsibilities:
• Define, enforce, and monitor SLOs and SLIs
• Lead and document incident response and troubleshooting efforts
• Develop automated infrastructure management solutions
• Oversee cloud infrastructure cost and resilience
• Improve system observability using dashboards and alerts
Requirements:
• 4 to 5 years of AWS production experience
• Proficiency in Terraform or CloudFormation for IaC
•...