Senior Manager - Storage Production Engineering and SRE




Santa Clara, CA


Position summary

of both our internal and external GPU cloud services, you align closely with our commitments to users. Simultaneously, you empower developers to enact system changes with meticulous preparation and planning, placing a sharp focus on critical elements such as capacity, latency, and performance. This position embodies a unique attitude and a suite of engineering strategies geared toward amplifying the efficiency of production systems and implementing innovative optimizations. A substantial part of our software development endeavors is dedicated to automating tasks, fine-tuning performance, and elevating the overall efficiency of production systems. With a comprehensive responsibility for understanding the intricate interconnectedness of our systems, you'll use a diverse range of tools and approaches to tackle a wide array of challenges. This role promises a daily dose of engaging and dynamic work, underscored by a commitment to continuous improvement, ensuring the success of our groundbreaking AI/ML solutions.

What You Will Be Doing:

  • Leadership: Formulating and executing strategic initiatives to enhance the reliability and performance of storage systems, aligning with organizational goals.

  • Team Management: Leading and mentoring a team of Storage SRE professionals, fostering a collaborative and innovative work environment.

  • Cloud Storage Expertise: Supervising the planning, execution, and enhancement of storage solutions, encompassing file, block, and object storage, to meet the requirements of an expanding cloud infrastructure. Ensuring the efficient use of cloud-native storage services offered by platforms like AWS S3 and Azure Blob Storage.

  • System Optimization: Collaborating with multi-functional teams to optimize storage systems, implement best practices, and ensure seamless integration with other technology stacks.

  • Incident Response: Overseeing incident response and resolution for storage-related issues, minimizing downtime, and ensuring a resilient storage environment.

  • Capacity Planning: Conducting capacity planning exercises and collaborating with team members to forecast and meet storage demands efficiently.

  • Automation and Tooling: Driving automation initiatives to streamline storage operations and developing tools for monitoring, alerting, and performance analysis.

  • Continuous Improvement: Implementing continuous improvement processes to enhance storage systems' overall reliability and efficiency.

What We Need To See:

  • Extensive experience in a senior-level role within Site Reliability Engineering, particularly in managing storage infrastructure.

  • Technical Expertise: In-depth knowledge of storage technologies, file systems, and experience with cloud-based storage solutions. Proficiency in scripting and automation tools is essential.

  • Leadership Skills: Strong leadership and people management skills, with the ability to inspire and guide a team towards achieving common objectives.

  • Problem-Solving Skills: Exceptional analytical and problem-solving skills, with the ability to address complex storage-related issues effectively.

  • Collaboration: Demonstrated ability to collaborate with multi-functional teams and communicate effectively with technical and non-technical collaborators.

  • Prior engineering experience with a hands-on coding background in storage systems.

  • Master's degree in Computer Science, Information Technology, or a related field, or equivalent experience.

  • 10+ years of overall relevant experience and 5+ years of management experience.

Ways to stand out from the crowd:

  • A demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success.

  • Professional certifications in relevant technologies (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator). Experience with container orchestration platforms and software-defined storage solutions.

  • Proven track record of implementing and managing storage solutions in a large-scale, enterprise environment. Ability to thrive in collaborative environments, work well with various teams, and adapt to different working styles.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and dedicated people on the planet working for us. If you're creative and autonomous, we want to hear from you!

NVIDIA's invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI - the next era of computing - with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as "the AI computing company." We're looking to grow our company and establish teams with the most thoughtful people in the world.

The base salary range is 272,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.