and build self-healing systems that adapt to infrastructure changes, migrations, and global-scale challenges
- Design smart traffic and load management to keep performance steady during viral spikes, large events, and global campaigns
- Develop monitoring, alerting, and automation that spots and fixes issues before they affect users
- Lead the creation of reliability frameworks for topology mapping, capacity planning, automated recovery, and disaster readiness
- Continuously refine system architecture for better performance, fault tolerance, and maintainability
- Apply chaos engineering, fault injection, and failure simulations to stress-test our systems
- Use A/B testing to measure the real-world impact of your improvements
- Mentor engineers and help set the team's technical direction
Qualifications
Minimum Qualifications:
- 5+ years of experience in backend, infrastructure, or reliability engineering
- Strong coding skills in Python, Go, Java, C++, or similar
- Solid grasp of distributed systems, networking, and fault-tolerant design
- Experience with Linux/Unix and large-scale infrastructure (cloud or on-prem)
- Proven track record of delivering high-availability systems in production
- Strong debugging, analysis, and problem-solving skills
- Strong communication and writing skills
Preferred Qualifications:
- Experience with video platforms, streaming, or CDN optimization
- Background in operating highly reliable production systems
- Knowledge of service mesh, edge routing, or traffic shaping at scale
- Hands-on experience with chaos engineering and incident response
- Strong system design and technical leadership skills
- Excellent communication skills and the ability to work across global teams