Research Scientist (Machine Learning Training Systems), TikTok Applied Machine Learning

TikTok

4.5

(6)

Singapore

Why you should apply for a job to TikTok:

  • 4.5/5 in overall job satisfaction
  • 4.5/5 in supportive management
  • 100% say women are treated fairly and equally to men
  • 100% would recommend this company to other women
  • 100% say the CEO supports gender diversity
  • Ratings are based on anonymous reviews by Fairygodboss members.
  • Employee well-being is supported via hybrid work, short-term counseling through our EAP and a premium subscription to Headspace.
  • We embrace diversity across all dimensions and provide employees with 9 employee resource groups globally, including our WOMEN ERG.
  • Comprehensive parental leave policy as well as fertility treatment through healthcare providers with a $20,000 lifetime maximum.

    Position summary

    That's how we drive impact - for ourselves, our company, and the communities we serve.
    Join us.

    About the Team
    The Machine Learning (ML) Systems team in Applied Machine Learning provides an end-to-end (E2E) machine learning experience and machine learning resources for the company. The team builds heterogeneous ML training and inference systems based on GPUs and AI chips and advances the state of the art in ML systems technology to accelerate models such as Stable Diffusion and LLMs.

    The team is also responsible for research and development of hardware acceleration technologies for AI and cloud computing, drawing on distributed systems, compilers, HPC, and RDMA networking. The team is reinventing ML infrastructure for large-scale language models. We have published papers at top-tier conferences, including SIGCOMM, NSDI, EuroSys, OSDI, SOSP, MLSys, and NeurIPS.

    Responsibilities

    • Research and develop our machine learning systems, including heterogeneous computing architecture, management, scheduling, and monitoring
    • Manage cross-layer optimisation of systems, AI algorithms, and hardware for machine learning (GPU, ASIC)
    • Implement both general-purpose training framework features and model-specific optimisations (e.g. LLMs, diffusion models)
    • Improve efficiency and stability for extremely large-scale distributed training jobs
    • Plan and lead the development of new and advanced data analytic techniques, methodologies, and analytical solutions, from design through prototyping and testing.
    • Identify and develop core data and AI science components for project delivery, architect specialised database and computing environments, and explore and visualise complex data sets to provide incremental business value.
    • Extract and integrate data from various sources, and create advanced models and algorithms suitable for the business use case.
    • Conduct testing on data and AI models, interpret the findings, and evaluate model performance for scaling and deployment.
    • Work in a team setting and apply proficiency in statistics and in the scripting and programming languages required by the firm.
    • Work with relevant software platforms on which the solution is deployed.

    Qualifications

    • Bachelor's degree or above, with a solid grasp of distributed and parallel computing principles and knowledge of recent advances in computing, storage, networking, and hardware technologies;
    • Familiar with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX;
    • Have a basic understanding of how GPUs and/or ASICs work;
    • Expert in at least one or two of the following programming languages in a Linux environment: C/C++, CUDA, Python.

    Preferred Qualifications:
    The following experiences will be a big plus:

    • GPU-based high-performance computing, RDMA high-performance networking (MPI, NCCL, ibverbs);
    • Distributed training framework optimizations such as DeepSpeed, FSDP, Megatron, GSPMD;
    • AI compiler stacks such as torch.fx, XLA and MLIR;
    • Large-scale data processing and parallel computing;
    • Experience in designing and operating large-scale systems in cloud computing or machine learning;
    • Experience in in-depth CUDA programming and performance tuning (CUTLASS, Triton).

    TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.
