#210700008
Defines, measures, and reports on Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services.
Supports the deployment, monitoring, and reliability of Large Language Model (LLM) applications and systems utilizing Model Context Protocol (MCP).
Ensures high availability and reliability of applications running on modern infrastructure such as AWS, Kubernetes, and related cloud-native platforms, as well as batch processing environments.
Deploys, monitors, and troubleshoots workloads on AWS, leveraging cloud-native services.
Manages, monitors, and optimizes batch jobs using schedulers such as Autosys and Control-M.
Writes and optimizes SQL queries for data extraction, transformation, and reporting.
Participates in on-call rotations, responds to production incidents, and drives root cause analysis and postmortems.
Works closely with data science, engineering, and operations teams to support AI/ML model deployment, LLM workflows, and batch processes.
Identifies reliability gaps and drives initiatives to improve system resilience, scalability, and efficiency.
Required qualifications, capabilities, and skills
Formal training or certification in software engineering concepts and 2+ years of applied experience
Proven experience as an SRE, DevOps Engineer, or similar role supporting AI/ML, LLM, and batch processing environments.
Exposure to Large Language Models (LLMs) and Model Context Protocol (MCP).
Proficiency in Python for automation and scripting.
Strong knowledge of AWS cloud services and infrastructure.
Experience with SQL and relational databases.
Hands-on experience with job schedulers (Autosys, Control-M, or similar).
Familiarity with observability and telemetry tools (e.g., Prometheus, Grafana, CloudWatch, Datadog).
Understanding of SLI/SLO concepts and their application in production environments.
Solid troubleshooting and incident management skills.
Preferred qualifications, capabilities, and skills