#R10231023
s are not only part of history, they're making history.
Northrop Grumman Defense Systems (NGDS), Beavercreek, Ohio, is seeking a Site Reliability Engineer to define what "reliable enough" means from the user's perspective, instrument and measure against those targets, and build the tooling and runbooks that make failure recoverable. Candidates will partner with development teams to push operational quality upstream before code ships and will lead problem resolution in production. SREs are comfortable debugging distributed systems, resolving incidents, and translating findings into lasting reliability improvements. They will work closely with software development teams to accomplish the following:
Incident Response - Lead real-time detection, triage, and resolution of production incidents; conduct postmortems and drive corrective actions; complete work independently and as part of an Agile team
Toil Reduction - Identify repetitive operational work, develop automation and runbooks, and implement CI/CD pipelines to reduce manual effort
Reliability Evaluations - Define service level objectives (SLOs) and error budget policies; assess system reliability against those targets using observability data
Platform Enablement - Build and maintain shared tooling (e.g., Kubernetes clusters, GitOps workflows); enable development teams with SDKs, instrumentation guidance, and reliability best practices
This requisition may be filled at a higher level based on qualifications listed below and is contingent on funding.
Basic Qualifications:
Engineer (Level 2): 2+ years related experience with Bachelor's degree in Computer Science or related STEM degree from an accredited institution; 0 years with Master's degree
Principal Engineer (Level 3): 5+ years related experience with Bachelor's degree in Computer Science or related STEM degree from an accredited institution; 3 years with Master's degree
U.S. citizenship and ability to obtain a Top Secret security clearance
Systems-thinking mindset - understand how components fail together and assess blast radius
Observability fundamentals - beyond the three signals (metrics, logs, and traces), know how to use telemetry to optimize services and engineers' quality of life
Basic software-engineering skills - build automation, non-trivial APIs, follow Git workflows, and actively participate in code reviews
Linux and networking fundamentals
Strong communication, collaboration, and organizational abilities
Specialty Skills (1 or more):
Platform & Infrastructure - Kubernetes, Argo CD/GitOps, disaster recovery planning, capacity forecasting
Observability - OpenTelemetry standards, Grafana/Perses, Tempo, ClickHouse, VictoriaMetrics
Automation & Toil Reduction - Scripting, CI/CD pipeline development, runbook automation, "DevOps" practices
Developer Enablement - Instrumentation SDKs, onboarding of SRE practices for engineering teams
Data & Alerting - High-quality dashboards, alert design, anomaly detection techniques
Preferred Qualifications: