Senior DL Performance Infrastructure and MLOps Engineer

NVIDIA

Multiple Locations (Remote)

#JR1979768

Position summary

Improve all tooling and automation the team uses, from simple data-collection scripts to datacenter-scale ML CI/CD systems.

  • Understand and internalize workflows for GPU performance analysis and optimization so you can help us re-invent them.

  • Build Python-based machinery hooking into common Deep Learning software like PyTorch or JAX to support performance analysis work.

  • Ruthlessly discover and chase down workflow- and tool-related inefficiencies in the team's daily work, and dream up and implement ways to eliminate them.

What we need to see

  • MS degree in CS or an adjacent field, or equivalent experience

  • 3+ years of relevant work experience

  • Background in deep learning fundamentals and common deep learning software, especially PyTorch/JAX

  • Experience in GPU computing, i.e. a fundamental understanding of heterogeneous multi-node accelerated computing systems

  • Background in analyzing and optimizing application performance

  • Familiarity with containerized CI/CD flows, e.g. GitLab + Docker

  • Programming skills in C++, Python, and CUDA

  • Deep passion for tools, scripts, and automation

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you! Come, join our DL Architecture team and help build the real-time, cost-effective AI computing platform driving our success in this exciting and quickly growing field.