Flexible Work, Better Balance
Responsibilities:
Reliability Engineering & Operations
Own and improve service reliability through SLO/SLI definition, error budgets, and operational best practices.
Design, implement, and maintain observability (monitoring, logging, tracing, alerting) to reduce MTTR and improve proactive detection.
Lead incident response practices, including on-call improvements, runbooks, post-incident reviews with root cause analysis (RCA), and preventive actions.
Partner with application teams to improve performance, capacity planning, and resiliency under failure scenarios.
Infrastructure & Cloud Architecture<...