ume.
Candidates may apply to a maximum of two positions and will be considered for jobs in the order in which they apply. The application limit applies to TikTok and its affiliates' jobs globally. Applications will be reviewed on a rolling basis - we encourage you to apply early.
Responsibilities
- Optimize model performance and memory efficiency on GPU-based systems.
- Collaborate with research and infra teams to deploy high-throughput training and inference pipelines.
- Develop tools and libraries to accelerate deep learning workloads at scale.
- Analyze system performance (e.g., GPU profiling, kernel analysis, throughput tuning).
Qualifications
Minimum Qualifications:
- Final-year graduate with a background in Computer Science, Electrical Engineering, or a related field.
- Solid programming skills in C++/CUDA/Triton/Python.
- Familiarity with GPU architecture and distributed training.
Preferred Qualifications:
- Experience building production-grade training and inference systems for large-scale models.
- Hands-on experience optimizing Large Language Models (LLMs), including memory efficiency, latency, and throughput improvements.
- Knowledge of distributed training frameworks (e.g., NCCL, Horovod, DeepSpeed, FSDP) is a plus.
- Familiarity with deep learning compiler frameworks such as TVM or LLVM, and understanding of their underlying principles.
- Contributions to open-source projects or relevant research publications.
By submitting an application for this role, you accept and agree to our global applicant privacy policy, which may be accessed here: https://careers.tiktok.com/legal/privacy
If you have any questions, please reach out to us at [email protected]