k and affiliate are developing the next-generation high-performance analytical database, with a mission to enable efficient, real-time, data-driven decision-making on petabyte-scale data sets. The initial product was forked from ClickHouse and has since been substantially re-architected. The product now not only improves on the efficiency of ClickHouse but also fits into elastic cloud-native infrastructure, with better scalability and resource utilization. After years of refinement in internal exabyte-scale scenarios, we are now ready to serve our business partners via various cloud vendors.
Software engineers in our product infrastructure role combine software and systems engineering disciplines to run high-performance, large-scale distributed infrastructure. This means you will be deeply involved in the development lifecycle of critical software services, collaborating closely with product engineers and combining software code with systems knowledge to ensure that cloud-native OLAP engines are reliable, fault-tolerant, efficiently scalable and cost-effective. You will also apply your software engineering expertise to develop platforms and tools that optimize the operational and engineering efficiency of complex systems at scale, with a particular focus on improving the systems' observability, performance and maintainability.
In this role, you will:
- Build and manage the Global SRE team, including recruitment, training of new talent, coordination of system operation and maintenance, and team culture building.
- Improve cooperation mechanisms across teams, time zones and regions, and provide SRE solutions that fit actual business scenarios and business direction.
- Organize SRE team assignments and manage projects, guiding day-to-day SRE work to be more effective and improving overall SRE efficiency.
- Develop process specifications and plans for compliant access, configuration, disaster recovery and fault handling of critical paths of overseas SRE services.
- Continuously improve the core SRE capabilities of the OLAP engine in efficiency, cost, quality, security, etc.
- Develop automation, data visualization and automated monitoring processes to facilitate the optimization of the cloud-native OLAP engine infrastructure.
- Drive the design and engineering of tools, as well as platform solutions, to optimize product engineering and operation efficiencies.
- Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.
Qualifications
- Bachelor's degree or above in Computer Science or a related technical discipline, and good English communication skills.
- Familiar with SRE-related processes and with industry trends in SRE technology, with a strong ability to build an SRE system. 6+ years of SRE experience; big-data or OLAP engine SRE experience is preferred.
- Familiar with SRE technologies, including Kubernetes, Terraform, Ansible, Bash scripting, etc.
- Familiar with the cloud computing technologies of Amazon Web Services, Google Cloud Platform and other providers.
- Expertise in operating, deploying and troubleshooting large-scale distributed systems, including high availability and quality assurance, with a strong focus on stability and performance.
- Possesses a strong sense of responsibility, a proactive team spirit, and a strong ability to comprehensively analyze and solve problems.
TikTok is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At TikTok, our mission is to inspire creativity and bring joy. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.