Incident Response Manager - Data Center

TikTok

San Jose, CA

Why you should apply for a job to TikTok:

  • 4.5/5 in overall job satisfaction
  • 4.5/5 in supportive management
  • 100% say women are treated fairly and equally to men
  • 100% would recommend this company to other women
  • 100% say the CEO supports gender diversity
  • Ratings are based on anonymous reviews by Fairygodboss members.
  • Employee well-being is supported via hybrid work, short-term counseling through our EAP and a premium subscription to Headspace.
  • We embrace diversity across all dimensions and provide employees with 9 employee resource groups globally, including our WOMEN ERG.
  • Comprehensive parental leave policy as well as fertility treatment through healthcare providers with a $20,000 lifetime maximum.

    Position summary

    …driven thinking, and a proactive approach to continuous improvement and operational resilience.

    Responsibilities

    • Serve as the first responder in the IRC Operation Center, detecting and responding to events across infrastructure and facilities using tools such as Server Automation, Data Center Infrastructure Management, network monitoring, Grafana, and related systems.
    • Respond promptly to events including but not limited to:
      • Environmental systems (e.g. high temperature, humidity, power fluctuations or failures)
      • IT infrastructure (e.g. server performance issues, network outages, system failures)
      • Facility and environmental alerts relevant to operations
      • External-facing services (e.g. colocation maintenance notices, service requests from CDN partners, and critical notifications)
    • Conduct detailed investigations to diagnose the root cause of events, assess their impact, and determine appropriate response actions.
    • Monitor and analyze detected events, accurately classify incidents based on potential or actual customer impact, and proactively communicate risks.
    • Coordinate timely escalations by notifying and collaborating with relevant support teams to ensure swift incident resolution.
    • Monitor incident response performance against agreed SLAs, ensuring timely alerts and notifications.
    • Manage incidents calmly and efficiently, performing in-depth investigations to determine root causes and impacts, while promptly engaging and coordinating with the designated resolver teams to facilitate timely resolution.
    • Draft detailed incident reports and conduct post-mortem reviews to document lessons learned.
    • Generate regular reports to deliver comprehensive insights into the effectiveness of incident response and recovery processes.
    • Analyze trends and patterns in events to identify opportunities for improvement and optimization.
    • Own and drive the Incident, Problem, and Change Management processes in alignment with ITIL or internal ITSM frameworks.
    • Develop and maintain a comprehensive library of Standard Operating Procedures (SOPs), Methods of Procedure (MOPs), runbooks, and operational guides to ensure consistency and readiness across teams.
    • Lead or support continuous improvement projects aimed at enhancing incident response capabilities, operational security, system reliability, and overall infrastructure performance. Collaborate with cross-functional teams to implement engineering solutions and process optimizations.
    • Provide technical and operational leadership to the incident response center team, ensuring consistent performance and adherence to best practices.

    Qualifications

    Minimum Qualifications

    • Bachelor's degree in Computer Science, Information Technology, Engineering, or a related technical field.
    • Strong technical background, preferably with experience in Data Center Facility Operations Center (DC FOC) management. Experience in IT infrastructure, network operations, or systems monitoring is also desirable.
    • Proven ability to analyze complex systems, investigate incidents, and identify root causes effectively.
    • Familiarity with monitoring and alerting tools such as Grafana, Nagios, or similar platforms.
    • Experience in incident and problem management processes, with the ability to drive corrective actions and coordinate cross-functional teams.
    • Excellent troubleshooting skills and the ability to work in fast-paced environments during critical incidents.
    • Strong communication skills to draft reports, conduct reviews, and liaise with technical and non-technical stakeholders.

    Preferred Qualifications

    • 5 years of experience in IT environments (such as data centers or enterprise systems), combined with hands-on incident and problem management experience.
    • Proactive mindset with a focus on continuous improvement and operational excellence.
    • Proven ability to perform effectively within tight time constraints to resolve issues and meet deliverables.
    • Hands-on experience with ticketing systems, monitoring tools such as Grafana, server infrastructure, and data center systems.
    • Working knowledge and/or certifications in one or more of the following:
      ITIL Foundation, CompTIA Server+, Schneider Electric Data Center Certified Associate (DCCA), Cisco Certified Network Associate (CCNA), Project Management Professional (PMP), Data Analytics and Visualization tools or methodologies
    • Demonstrated experience in driving or contributing to improvement projects focused on operational efficiency, security enhancements, or infrastructure reliability.
    • Ability to manage multiple tasks and projects, ensuring timely delivery and alignment with organizational goals.
    • This position is part of a team that provides 24/7 support and requires working scheduled shifts, which may include holidays.

    Job Information

    [For Pay Transparency] Compensation Description (annually)

    The base salary range for this position in the selected city is $100,800 - $220,400 annually.

    Compensation may vary outside of this range depending on a number of factors, including a candidate's qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.

    Benefits may vary depending on the nature of employment and the country work location. Employees have day one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short-term and long-term disability coverage, life insurance, wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year and 17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure).

    The Company reserves the right to modify or change these benefits programs at any time, with or without notice.

    For Los Angeles County (unincorporated) Candidates:

    Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:

    1. Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;

    2. Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and

    3. Exercising sound judgment.
