Position Overview
About the Job
An opportunity to grow your SRE craft in a fast-paced, collaborative environment on Google Cloud Platform, with exposure to multi‑cloud technologies and modern data engineering.
Reliability & Incident Response
- Monitor production systems using observability tooling (dashboards, alerts, and logs) to detect and triage issues before they impact end users
- Participate in on‑call rotations, respond to incidents following established runbooks, and escalate appropriately when needed
- Contribute to blameless post‑mortems, documenting root causes and follow‑up action items to prevent recurrence
- Help maintain and improve SLO dashboards and alerting thresholds to ensure platform health is visible and measurable
Toil Reduction & Automation
- Identify repetitive manual tasks and build automation to eliminate them, reducing toil for yourself and the broader team
- Write and maintain scripts, tooling, ...