Flexible Work, Better Balance
Job Responsibilities:
This role supports the deployment needs of Tencent's overseas gaming business in large language model (LLM) and reinforcement learning scenarios, and is responsible for the development, performance optimization, and engineering implementation of high-quality AI computing infrastructure. Specific responsibilities include:
1. Distributed Training Engineering: Participate in the implementation of large-scale distributed training solutions; own the engineering delivery of data parallelism, model parallelism (Tensor Parallelism / Pipeline Parallelism), and ZeRO techniques; continuously tune GPU utilization and ensure the stability of ultra-large-scale training jobs.
2. Compute Scheduling Optimization: Play a hands-on role in developing and optimizing AI job scheduling logic; address compute bottlenecks in complex gaming scenarios through fine-grained resource management, fault self-healing mechanisms, and efficient checkpointing st...