Position Overview
Job Description
Lead SRE project planning and implementation for distributed applications across GCP and Azure, covering APIs, data pipelines, messaging/event-driven systems, and external data platforms.
Job Description:
- Design and implement comprehensive SRE monitoring for distributed applications
- Implement distributed tracing and logging using W3C Trace Context headers and OpenTelemetry standards across all applications
- Create drill-down Grafana dashboards with correlation between metrics, logs, and traces
- Integrate GCP and Azure Monitoring, Logging, and Trace with the existing OpenTelemetry standards used by enterprise teams
- Implement zero-code instrumentation for monitoring and traceability
- Experience in defining and working with core SRE models such as SLIs, SLOs, and error budgets
- Design dashboards for reliability-focused metrics (latency, request rate, errors, duration, availability)
- Build service health dashboards with drill-down capab...