#JR-159040
observability practices, DR readiness, reliability hygiene, and AIOps enablement.
You will partner with engineering, platform, and transformation teams to architect reliability by design: defining SLI/SLO frameworks, ensuring Operational Readiness Review (ORR) compliance, building reusable reliability patterns, and driving the unified observability and AIOps strategy across the ecosystem. This role is deeply technical and hands-on, requiring strong architectural judgment, systems thinking, and leadership in guiding teams toward operational excellence.
Responsibilities
Lead Reliability Architecture for New Transformations
Ensure all new programs, services, and platforms are onboarded with the right SRE foundations, covering DR readiness, observability, capacity, performance, and operational hygiene
Drive & Enable ORR (Operational Readiness Review) Compliance
Define architectural guardrails, review technical designs, and ensure teams meet ORR requirements before production launch
Define and Operationalize SLI/SLO Frameworks
Partner with engineering to define service-level indicators and objectives, ensuring transformations and new features adhere to reliability goals
Architect Unified Observability & AIOps Integration
Ensure all services are onboarded correctly onto the unified observability stack, with proper instrumentation, dashboards, alerting, and correlation patterns
Define AIOps Enablement Use Cases
Identify and define patterns that leverage telemetry, automation, and intelligence, including anomaly detection, event deduplication, and predictive insights
Reliability Architecture Reviews
Conduct deep technical reviews of system architecture, focusing on resilience, failure modes, performance, availability, and operational workflows
Handhold Transformations Through Hypercare to BAU
Guide new transformations end to end: architecture reviews → observability setup → DR completion → ORR readiness → launch → BAU stabilization
Build Reusable SRE Blueprints
Create standardized templates and patterns for logging, monitoring, alerting, DR design, chaos readiness, and performance baselines
Partner with SRE & Platform Teams
Work closely with Infrastructure, SRE, and Platform Engineering to ensure architectural alignment and drive adoption of reliability engineering best practices
Qualifications
12+ years of experience in large-scale distributed systems, SRE, or platform engineering roles, with deep architectural responsibilities
Expertise in SRE foundations: SLIs/SLOs, error budgets, incident response, capacity planning, chaos engineering, DR, and reliability patterns
Strong hands-on background in observability stacks (Datadog, Splunk, Prometheus, Grafana, OpenTelemetry)
Experience with modern cloud-native architecture, container platforms, and microservices
Strong familiarity with DevOps practices, CI/CD pipelines, deployment strategies (blue/green, canary, progressive rollout)
Experience defining and enforcing ORR, reliability gates, operational hygiene, and launch readiness
Ability to influence architecture & engineering teams with strong systems thinking and operational rigor
Excellent communication skills with the ability to translate reliability goals into actionable engineering guidance
Equinix is committed to ensuring that our employment process is open to all individuals, including those with a disability. If you are a qualified candidate and need assistance or an accommodation, please let us know by completing this form.
Equinix is an Equal Employment Opportunity and, in the U.S., an Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to unlawful consideration of race, color, religion, creed, national or ethnic origin, ancestry, place of birth, citizenship, sex, pregnancy / childbirth or related medical conditions, sexual orientation, gender identity or expression, marital or domestic partnership status, age, veteran or military status, physical or mental disability, medical condition, genetic information, political / organizational affiliation, status as a victim or family member of a victim of crime or abuse, or any other status protected by applicable law.
We use artificial intelligence in our hiring process. Learn more here.