Position Overview
Responsibilities
- Deploy, monitor, and recover containerized AI training environments.
- Troubleshoot infrastructure bottlenecks and resolve system failures in real time.
- Build and manage resilient systems for stability and performance optimization.
- Collaborate with engineering teams to improve CI/CD pipelines and automation.
- Manage filesystem structures, storage, and process scheduling in containerized environments.
- Execute dynamic replanning during runtime issues and system failures.
- Document system processes, solutions, and best practices.
Requirements
- Strong experience with terminal-based system administration and troubleshooting.
- Expertise in containerized environments such as Docker or Kubernetes.
- Strong Python skills for scripting, automation, and debugging.
- Proficiency in Bash and familiarity with additional programming languages.