[AI] AIGC Distributed Training & Optimization Engineer (Pre-training)
Shopee
What you'll do
About Us
Sea Group is establishing a new, strategic AI department dedicated to exploring how generative AI can transform human connection, self-expression, diverse communication, and social interaction. We are building the next generation of AI-native applications and a comprehensive Model-as-a-Service (MaaS) product support system. Drawing on massive multi-country data, we are building a multilingual AI ecosystem from the ground up. We welcome outstanding talent to join us in building leading Southeast Asian multilingual models and exploring innovative AI-native applications.
The AIGC team in Sea's AI department is dedicated to pushing the boundaries of visual synthesis. We aim to achieve industry leadership in high-fidelity portrait and video generation. The team focuses on fundamental research and on scaling generative models to empower next-generation social and e-commerce platforms.
About the Job
- Toolchain Development: Design and build distributed training toolchains to support ultra-large-scale AIGC model training.
- System Optimization: Optimize distributed training performance across computation, communication, and storage layers.
- Stability & Scalability: Analyze and resolve technical bottlenecks in the training process, with a focus on improving training stability and efficiency.
- Frontier Research: Track and explore cutting-edge distributed training technologies, and lead project planning and production-grade implementation.
Requirements
- Master's degree or above in Computer Science or a related field; candidates with a Bachelor's degree and strong industry experience will also be considered.
- Minimum 2 years of relevant experience.
- Distributed Expertise: Deep understanding of distributed training principles (Data/Pipeline/Tensor/Expert Parallelism) with proven hands-on experience.
- Framework Proficiency: Expert-level proficiency with deep learning frameworks and training libraries such as PyTorch, DeepSpeed, and Megatron-LM.
- Low-level Knowledge: Familiar with GPU hardware architecture and CUDA programming; experience in CUDA kernel development/debugging and familiarity with NCCL and cuDNN.
- AIGC Background: Understanding of AIGC pre-training methodologies, Transformer architectures, and diffusion models (e.g., Stable Diffusion, Flux).
- Core Competency: Strong problem-solving skills, innovative thinking, and excellent team collaboration/communication skills.