Position Overview
We are looking for an AI Infrastructure Engineer to join our growing team. Our stack includes Kubernetes, Slurm, Python, C++, and PyTorch, and we run primarily on AWS. In this role, you will partner closely with our Inference and Research teams to build, deploy, and optimize our large-scale AI training and inference clusters.
Responsibilities
- Design, deploy, and maintain scalable Kubernetes clusters for AI model inference and training workloads
- Manage and optimize Slurm-based HPC environments for distributed training of large language models
- Develop robust APIs and orchestration systems for both training pipelines and inference services
- Implement resource scheduling and job management systems across heterogeneous compute environments
- Benchmark system performance, diagnose bottlenecks, and implement improvements across both training and inference infrastructure
- Build monitoring, alerting, a...