© 2022 Fairygodboss. All rights reserved.

(USA) Director, Engineering

Walmart

2.7

Bella Vista, AR

Why you should apply for a job with Walmart:

  • All eligible associates have affordable options that include no lifetime maximum and eligible preventive care covered at 100%.

  • The enhanced maternity benefit supports birth moms with up to 10 weeks of protected paid time away from work.

  • Associates can take advantage of Resources for Living, a free confidential counseling and health information service.

FGB'ers' job reviews

51%
Say women are treated fairly and equally to men

#8261_R-809002-e415f9843cb6ddebbbd6036dbf548157

Position summary

What you'll do...

Job Description

Introduction

It’s an exciting time to join our Walmart journey. Want to use design and coding skills to solve real-life large-scale problems impacting millions of lives?

We are seeking a Director, Site Reliability Engineering (SRE) for a key position within the People Technology organization.

You'll sweep us off our feet if:

  • You’re a problem solver
  • You’re comfortable influencing others, leading teams, managing stakeholders, and getting buy-in from leadership
  • You have a “test and learn” mentality and an agile way of working to improve your team’s products
  • Experience designing and developing highly available systems that use load balancing and horizontal scaling
  • Experience in activities like architecture reviews, code reviews, creating platforms and frameworks, and capacity planning
  • Demonstrated experience leading engineering and operational teams responsible for supporting and deploying Enterprise scale cloud services and products
  • Proven track record of improving reliability, availability, incident management and performance of cloud services
  • Proven experience managing software development lifecycle platforms and tools and/or designing, building, servicing, and driving ongoing improvement of service infrastructure systems
  • Experience defining, measuring, and improving Reliability Metrics (SLO/SLI), Observability (Monitoring, Logging, and Tracing solutions), Operations Processes (Incident, Problem Management), and Operations Toil Reduction through Automation
  • Experience driving the development of dashboards from application and infrastructure health perspectives using tools such as Splunk, Dynatrace, and Prometheus-based metrics platforms
  • Experience in implementing Chaos Engineering concepts and familiarity with tools such as Gremlin and Chaos Monkey
  • Deep understanding of and experience in implementing resiliency design patterns, frameworks, and validations
  • Understanding of containerization concepts, including Docker and Kubernetes
  • Understanding of Continuous Integration and Continuous Delivery frameworks, including deployment automation and configuration management components, and familiarity with DevOps and CI/CD tools

What you'll do...

  • You will be responsible for building and executing a vision to bring everyday engineering excellence to our on-prem and cloud deployments in terms of availability, scalability, and resiliency.
  • You will implement strategies around tooling for monitoring, deployments, and automated remediation for the heterogeneous tech stack within People Technology.
  • You will be responsible for evangelizing and socializing the Site Reliability Engineering discipline across People Technology, serving as a change agent for driving SLOs, and helping promote a culture of continuous improvement measured by operational metrics and KPIs.
  • You will be expected to strategize portfolio and program reliability by working with cross-functional organizations, building roadmaps to drive reliability into the product, enabling the enterprise to standardize and adopt application reliability metrics, and improving application health.
  • Leads the work of small groups of six to eight engineers in software development and design by identifying short- and long-term solutions and timelines; reviewing and approving proposed solutions; implementing new architectural patterns; and performing design and code reviews of changes.
  • Provides support to the business for new and existing systems by responding to user questions, concerns, and issues (for example, technical feasibility); researching and identifying needed solutions; determining implementation designs; providing guidance regarding implications of new and enhanced systems; and directing users to appropriate contacts for issues outside of own domain.
  • Troubleshoots business and production issues by gathering information (for example, issue, impact, criticality); performing root cause analysis to reduce future issues; engaging support teams when needed; developing solutions; driving the development of an action plan; performing actions as designated in the plan; and completing online documentation.
  • Demonstrates up-to-date expertise and applies this to the development, execution, and improvement of action plans by providing expert advice and guidance to others in the application of information and best practices; supporting and aligning efforts to meet customer and business needs; and building commitment for perspectives and rationales.
  • Provides and supports the implementation of business solutions by building relationships and partnerships with key stakeholders; identifying business needs; determining and carrying out necessary processes and practices; monitoring progress and results; recognizing and capitalizing on improvement opportunities; and adapting to competing demands, organizational changes, and new responsibilities.
  • Models compliance with company policies and procedures and supports company standards of ethics and integrity by incorporating these into the development and implementation of business plans; using the Open Door Policy; and demonstrating and assisting others with how to apply these in executing business processes and practices.

What you'll bring...

  • 10+ years of experience in building scaled distributed systems with high throughput and low latency
  • 4+ years of running a production engineering or SRE team to drive service reliability by developing tooling that enables metric visibility using SLIs, SLOs, and SLAs
  • Minimum of 4 years of supervisory experience.
  • Working knowledge of Docker and Kubernetes.
  • Working knowledge of the Linux OS.
  • Working knowledge of Networking (TCP/IP and Application).
  • Understanding of monitoring and observability concepts and tooling (Prometheus, AlertManager, Grafana, Splunk, Ansible, …).
  • Experience using Git.
  • Experience using CI/CD tools.
  • Experience in 360º people management, growing and grooming teams, and ensuring the happiness and productivity of the team’s software engineers.
  • Experience with risk management (technical, product, personnel).
  • Experience partnering on cross-functional project development and collaborating with other product development teams, QA, Release Management, and Program Management.

Minimum Qualifications...

Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.

As permitted by applicable law, provide evidence of full vaccination as defined by CDC guidelines OR secure approval of medical or religious accommodation for the vaccination mandate.

Preferred Qualifications...

Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.

Primary Location...

508 SW 8TH ST, BENTONVILLE, AR 72712, United States of America

About the company

Walmart

Industry: Retail: Supermarket Company

In 1962, Sam Walton started a single mom-and-pop shop and transformed it into the world’s largest retailer. Since those founding days, one thing has remained consistent: helping our customers save money so they can live better. Today, we’re reinventing the shopping experience - and our associates are at the heart of it (all 2.2 million of them).  When you join our Walmart family of ...
