Position Overview
Tata Consultancy Services seeks a Lead Site Reliability Engineer to improve the reliability of its production applications in Canada. The role focuses on incident response, performance optimization, and automation in support of operational excellence.
You will manage the reliability and performance of TCS Canada’s production applications. Your role includes implementing monitoring solutions, carrying out capacity planning, and optimizing infrastructure costs. You will also lead incident resolution and automate daily operational procedures to improve efficiency.
Key Responsibilities:
• Manage production applications for high availability
• Set up monitoring and resolve incidents
• Analyze and optimize resource usage and costs
• Automate operational workflows and integrations
• Use tools such as LogicMonitor and PowerShell
Requirements:
• Expertise in Dynatrace and ELK Stack
• Experience with Python and Shell scripting
• Knowledge of AIOps and observability