Define and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets to balance reliability with feature velocity and keep system availability within agreed targets.
Respond to production incidents during business hours, conduct post-mortem analysis, and implement preventive measures to reduce MTTR and improve system resilience.
Monitor system performance, troubleshoot issues, and proactively identify bottlenecks to maintain throughput and reliability.
Design, implement, and maintain CI/CD pipelines to automate the software delivery process.
Collaborate with development and operations teams to ensure smooth integration of new features and applications.
Implement and manage infrastructure as code (IaC) using automation tools.
Implement security best practices and participate in vulnerability assessments.
Manage and maintain on-premises and cloud-based infrast...