Senior Architect, Large Scale Distributed Training

NVIDIA

Yokne'am Illit, Israel

#JR1968653

Position summary

  • Design and define protocols and APIs for leveraging our technology in a data center

  • Research and evaluate algorithms currently used in related applications

  • Participate in defining hardware and system features, and assist software and hardware groups in enabling new technologies.

What we need to see:

  • B.Sc./M.Sc. or equivalent experience in Electrical Engineering or Computer Science from a leading university

  • 3-5 years of proven industry experience in software engineering, specifically in distributed AI system training

  • Familiarity with networking concepts, terms, and software stack

  • Passion for problem-solving and for algorithm research and development

  • Background in distributed training of AI/ML models on GPU clusters

Ways to stand out from the crowd:

  • Background in data center architecture

  • Experience with collective communications libraries such as NCCL

  • Good understanding of the OS, driver, and performance aspects of a system

  • Background in network synchronization protocols such as IEEE 1588 PTP

  • Good command of Python and C/C++

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) based on race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law. We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request an accommodation.