Senior Deep Learning Systems Software Engineer - AI Infrastructure

NVIDIA

Multiple Locations

#JR1986479

Position summary

This is an opportunity to directly impact the hardware and software roadmap at a fast-growing technology company that leads the AI revolution, while helping deep learning users around the globe enjoy ever-higher training speeds.

What you'll be doing:

  • Understand, analyze, profile, and optimize deep learning workloads on state-of-the-art hardware and software platforms.

  • Build tools to automate workload analysis, workload optimization, and other critical workflows.

  • Collaborate with cross-functional teams to analyze and optimize cloud application performance on diverse GPU architectures.

  • Identify bottlenecks and inefficiencies in application code and propose optimizations to enhance GPU utilization.

  • Drive end-to-end platform optimization, from the hardware level to the application and service levels.

  • Design and implement performance benchmarks and testing methodologies to evaluate application performance.

  • Provide guidance and recommendations on optimizing cloud-native applications for speed, scalability, and resource efficiency.

  • Share knowledge and best practices with domain expert teams as they transition applications to distributed environments.

What we need to see:

  • Master's degree in CS, EE, or CSEE, or equivalent experience

  • 5+ years of experience in application performance engineering

  • Experience using large-scale, multi-node GPU infrastructure on premises or in CSPs

  • Background in deep learning model architectures and experience with PyTorch and large-scale distributed training

  • Experience with application profiling tools such as NVIDIA Nsight and Intel VTune

  • Deep understanding of computer architecture and familiarity with the fundamentals of GPU architecture. Experience with NVIDIA's infrastructure and software stacks.

  • Proven experience analyzing, modeling, and tuning DL application performance

  • Proficiency in Python and C/C++ for analyzing and optimizing application code

Ways to stand out from the crowd:

  • Strong fundamentals in algorithms and GPU programming experience (CUDA or OpenCL)

  • Understanding of NVIDIA's server and software ecosystem

  • Hands-on experience in performance optimization and benchmarking on large-scale distributed systems

  • Hands-on experience with NVIDIA GPUs, HPC storage, networking, and cloud computing.

  • In-depth understanding of storage systems, Linux file systems, and RDMA networking

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you.