#210690597
stems.
Build and maintain observability, monitoring, and telemetry for AI applications and platforms.
Build and support automation for alerting, anomaly detection, and self-healing workflows.
Collaborate with engineering and other stakeholders to drive operational excellence.
Mentor and guide engineers on AIOps standards and operational excellence.
Define and execute the roadmap for AI-assisted SRE and observability.
Required qualifications, capabilities, and skills
Formal training or certification on software engineering concepts and 5+ years of applied experience.
Demonstrated strong experience in SRE, DevOps, or Platform Engineering roles.
Strong hands-on experience with AWS (ECS, Lambda, API Gateway, Bedrock, CloudWatch, RDS, EKS).
Hands-on experience with AWS Bedrock, OpenAI, or LLM APIs.
Expertise in observability tools: OpenTelemetry, Grafana, Prometheus, ELK, CloudWatch.
Experience with CI/CD tools (GitHub Actions, Jenkins, Spinnaker).
Proven track record in automation, operational tooling, and event-driven workflows.
In-depth understanding of distributed systems, microservices, and cloud architectures.
Preferred qualifications, capabilities, and skills