The team also uses frameworks like Ray to orchestrate large-scale distributed ML workflows.
Responsibilities:
- Design and develop core Flink operators, connectors, or runtime modules to support TikTok's exabyte-scale real-time processing needs.
- Build and maintain low-latency, high-throughput streaming pipelines powering online learning, recommendation, and ranking systems.
- Collaborate with ML engineers to design end-to-end real-time ML pipelines, enabling efficient feature generation, training data streaming, and online inference.
- Leverage Velox for compute-optimized ML data transformation and training acceleration on multimodal datasets (e.g., video, audio, and text).
- Use Ray to coordinate distributed machine learning workflows and integrate real-time feature pipelines with ML model training/inference.
- Optimize Flink job performance, diagnose bottlenecks, and deliver scalable solutions across EB-scale streaming workloads.
Qualifications
Minimum Qualifications:
- Currently pursuing a PhD in Computer Science, Software Engineering, Data Engineering, or a related technical field.
- Strong programming skills in Java, Scala, or Python.
- Understanding of distributed systems, stream processing, and event-driven architecture.
- Familiarity with system design concepts such as fault tolerance, backpressure, and horizontal scalability.
- Demonstrated ability to debug and analyze complex distributed jobs in production environments.
Preferred Qualifications:
- Graduating in December 2025 or later, with the intent to return to your academic program.
- Experience with Apache Flink, Spark Streaming, or Kafka Streams.
- Hands-on experience with Ray for distributed ML or workflow orchestration.
- Familiarity with Velox, Arrow, or similar columnar execution engines for training/feature pipelines.
- Understanding of multimodal data processing (e.g., combining video, audio, and text in model training pipelines).
- Experience working with data lake ecosystems (e.g., Iceberg, Hudi, Delta Lake) and cloud-native storage at PB-EB scale.
- Contributions to open-source projects or participation in ML/engineering hackathons or competitions.