Senior DL Performance Infrastructure and MLOps Engineer

NVIDIA

Multiple Locations (Remote)

#JR1979768

Position summary

Improve all tooling and automation used by the team, from simple data collection scripts to datacenter-scale ML CI/CD systems.

Understand and internalize workflows for GPU performance analysis and optimization so you can help us re-invent them.
Build Python-based machinery hooking into common Deep Learning software like PyTorch or JAX to support performance analysis work.
Ruthlessly discover and chase down workflow- and tool-related inefficiencies in the team's daily work, and dream up and implement ways to eliminate them.

What we need to see

MS degree in CS or an adjacent field, or equivalent experience
3+ years of relevant work experience
Background in deep learning fundamentals and common deep learning software, especially PyTorch/JAX
Experience in GPU computing, i.e. fundamental understanding of heterogeneous multi-node accelerated computing systems
Background in analyzing and optimizing application performance
Familiarity with containerized CI/CD flows, e.g. GitLab + Docker
Programming skills in C++, Python, and CUDA
Deep passion for tools, scripts, and automation

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. Are you creative and autonomous? Do you love a challenge? If so, we want to hear from you! Come, join our DL Architecture team and help build the real-time, cost-effective AI computing platform driving our success in this exciting and quickly growing field.