IT InfiniBand/GPU - Sr Staff Systems Engineer

Cadence Design Systems

4.4

(53)

San Jose, CA

Why you should apply for a job to Cadence Design Systems:

  • 4.4/5 in overall job satisfaction
  • 4.4/5 in supportive management
  • 87% say women are treated fairly and equally to men
  • 89% would recommend this company to other women
  • 87% say the CEO supports gender diversity
  • Ratings are based on anonymous reviews by Fairygodboss members.
  • Parental leave is available for both paternity and maternity
  • Flexible work options available
  • 88% of employees at Cadence say it is a great place to work compared to 57% of employees at a typical U.S.-based company.
  • #R43864

    Position summary

    • Experience in technical roles supporting GPU infrastructure setup using InfiniBand

    • Experience with interconnections between InfiniBand and GPUs

    • Experience with GPU-enabled MPI implementations

    • Experience with NVIDIA CUDA or AMD ROCm

    • Experience with H100 and AMD MI210 GPU servers in a cluster

    • Support customer deployments and ensure on-time bring-up of GPU servers, including InfiniBand fabric bring-up, configuration, and subnet management on the IB switch

    • Participate in engagements with various SW and FW (BMC/SBIOS/OS/drivers, etc.) teams to develop best-in-class practices and tools; you will analyze, debug, and resolve critical firmware and software issues that affect workload performance at scale

    • Provide engineering solutions to enable large-scale performance strategies for Datacenter GPU Computing products and software stacks, maintain technical relationships with internal and external engineering teams, and assist systems engineers in building creative solutions

    • Strong knowledge of Linux operating systems, networking, and security concepts

    • Document and drive acceptance and qualification test plans, procedures, and reports

    Requirements

    • Accelerate strategic customer deployments and ensure on-time bring-up and deployment of HPC infrastructure

    • Develop and implement server- and rack-level telemetry, and collaborate to establish continuous improvements in our design flows

    • Recent experience in critical data center technologies such as server architectures, software containers, job schedulers, and parallel computing, as well as the deployment and operation of large-scale systems, resilient system design, and clustering of computing resources

    • Manage HPC clusters, actively communicate with management about any equipment problems, and propose resolutions

    • Establish and maintain IT infrastructure and procedures for customer-facing and internal systems

    • Actively establish technical relationships with our customers' engineers, management, and architects at focus accounts

    • Create and develop test plans for new features on each product. Recommend improvements to enable automated scripting for testing and archiving of results. Develop HPC computing strategies for cloud-based computing, GPU-accelerated computing, etc.

    • Provide remote cluster support to large environments, including scalability/flexibility and troubleshooting end-user issues involving job submission, runtime, and resource access.

    • InfiniBand fabric configuration and administration on Red Hat/CentOS/Linux, with experience configuring PKeys and troubleshooting the end-to-end InfiniBand environment

    • InfiniBand fabric bring-up, configuration, subnet management, and monitoring on the IB switch and client side for multi-tenancy setups; understanding of IPoIB communication modes

    • Compare InfiniBand network performance with other cluster interconnects and debug InfiniBand performance-related issues

    • Automate configuration management, software updates, and system availability maintenance and monitoring using modern DevOps tools (Ansible, GitLab, etc.)

    • Be a technical specialist on GPU computing and networking products, directly supporting GPU customers

    • Direct experience and strong knowledge of parallel programming, GPU CUDA/ROCm development, and applications.

    • Actively partner with the R&D teams delivering services to our infrastructure to gather the requirements their services need to run within it

    • Automate repetitive tasks and implement custom solutions using scripting/programming languages such as Bash or Python

    • Configure and troubleshoot a heterogeneous (QDR, FDR, EDR) InfiniBand network and associated subnet manager

    • Experience with high-performance computing interconnects (e.g., 10 and 40 Gigabit Ethernet, InfiniBand)

    • Able to move 50+ pounds

    The annual salary range for California is $133,000 to $247,000. You may also be eligible to receive incentive compensation: bonus, equity, and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the salary range is a guideline and compensation may vary based on factors such as qualifications, skill level, competencies and work location. Our benefits programs include: paid vacation and paid holidays, 401(k) plan with employer match, employee stock purchase plan, a variety of medical, dental and vision plan options, and more.
    We're doing work that matters. Help us solve what others can't.
