© 2022 Fairygodboss. All rights reserved.

Site Reliability Engineer ML Systems - TikTok

TikTok

Mountain View, CA

Why you should apply for a job with TikTok:

  • Employee well-being is supported via hybrid work, short-term counseling through our EAP and a premium subscription to Headspace.

  • We embrace diversity across all dimensions and provide employees with 9 employee resource groups globally, including our WOMEN ERG.

  • Comprehensive parental leave policy as well as fertility treatment through healthcare providers with a $20,000 lifetime maximum.

#JC9U2

Position summary

TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. TikTok has global offices including Los Angeles, New York, London, Paris, Berlin, Dubai, Mumbai, Singapore, Jakarta, Seoul and Tokyo.

What You'll Do:

The Machine Learning Systems Site Reliability Engineering (SRE) team combines systems engineering and the art of machine learning to develop and run massively distributed ML training and inference systems around the world. On the SRE team, you'll have the opportunity to build and enrich your expertise in coding, performance analysis and large-scale system management, and to get heavily involved in hardware and capacity decision-making. SRE ensures that ML systems run with a high level of availability, reliability and scalability. SRE builds and runs the automated systems and platform that manage a huge number of GPU machines, and applies machine learning technology to make operations more efficient.

Responsibilities:

1. Deploy and maintain the machine learning system and platform, including training, inference and pipeline orchestration, in the production environment
2. Build software and systems to monitor and manage the platform infrastructure and services to ensure system health
3. Manage GPU clusters to improve availability, reliability and efficiency
4. Handle on-call duties and incidents

Qualifications:

1. Bachelor's degree or above in Computer Science or a related technical discipline, with 2+ years of working experience
2. Programming experience with at least one language such as Golang, Python or Shell
3. Familiarity with Kubernetes / YARN orchestration
4. Strong problem-solving and data analysis abilities
5. Self-motivation, teamwork and good communication skills

TikTok perks and benefits

Lactation facilities

Fertility

Unconscious bias training

Networking

Succession planning

Diversity recruiting

Diversity performance

Short term disability

Paid paternity

Paid maternity

Paid adoptive

About the company

TikTok

Industry: Technology: Consumer Internet

As the leading destination for short-form mobile video, our platform helps people around the world become a part of a global community. In a world that feels more divided than ever, we are here to inspire creativity and bring joy. We do this by embracing change, thriving in ambiguity, and always looking for solutions.
