Position Overview
Site Reliability Engineer (SRE) Role Overview We are seeking an SiteReliability Engineer to own the Production Readiness of our cloud-based AI solutions. This hybrid role combines automated software testing and Site Reliability Engineering (SRE). You will build the automated frameworks that validate our AI outputs and ensure the underlying Azure/AWS infrastructure is resilient, performant, and compliant with banking standards.
Key Responsibilities - Resiliency Engineering (SRE): Implement Chaos Engineering and load testing to ensure web/mobile backends can handle banking-scale traffic. Maintain high availability through automated recovery scripts.
- Automated Regression: Build CI/CD‑integrated test suites using Python that validate both the application logic and the infrastructure state (IaC validation).
- Observability & SLIs: Define and monitor Service Level Indicators (SLIs) and Objectives (SLOs). Set up advanced alerting in Azu...