Bangalore Urban, India
As a Reliability Engineer, you will design, build and operate the features and services that make Xi, Nutanix's cloud services, secure, reliable, completely elastic, scalable and self-healing. To deliver reliable, high-performance services and features, Nutanix requires engineers with exceptional expertise and boundless creativity.
- Work in concert with engineering teams to evolve services for better scalability, reliability and development velocity
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health, with a focus on improving reliability
- Practice sustainable incident response and blameless postmortems
- Define and develop software for tasks associated with developing, designing and debugging applications
- Develop tools to improve the ability to rapidly deploy and effectively monitor custom applications in large-scale environments
- Participate in a 24x7 on-call rotation
- Strong understanding of Linux operating systems
- Proficiency in one or more cloud providers, such as AWS, Azure or GCP
- Deep experience with the setup and architecture of queuing, caching, microservices and service mesh systems
- Systematic problem-solving approach, strong communication skills, a sense of ownership and drive
- Deep understanding of service metrics and alarms, gained through developing dashboards, service KPIs and alarming systems
- Knowledge of data structures, relational and non-relational databases, networking, Linux internals, filesystems, web architecture
- Highly skilled in one or more domains: Infrastructure as Code tools (Docker, Terraform, Puppet, Helm), monitoring tools (Prometheus, Zabbix), container orchestration tools (Kubernetes, Docker), database technologies (Cassandra, Postgres), CI/CD tools (Jenkins, Spinnaker)
- Passion for automating everything repetitive
- Experience working in an operational environment with mission-critical tier-one services and associated on-call support
- Experience designing monitoring, logging and reliability processes for systems at scale
How do I know if this role is for me?
- Do you like thinking about large-scale problems that have a lot of moving parts?
- Do you like thinking about how to make large systems more reliable?
- Are you okay with working on software that will likely never be overtly seen by an external user?
- Do you enjoy the process of diagnosing and fixing a problem?
- Do you like looking through metrics and logs as if it were a treasure hunt?
- Do you enjoy thinking about system information (e.g. disk space, CPU, OS, kernel, etc.) and system-level functionality (e.g. ssh, proc, cron, swap, etc.)?
- Are you comfortable with the idea of being on call, where you are likely to face high-stakes scenarios in which something needs to be fixed?
- Are you able to stay calm under pressure?
- Do you approach problems in a logical, process-oriented way?
- Are you comfortable attempting a problem that has never been solved before?
- Are you someone who thinks about how you can make things better?