Site Operations Lead

AI Fabrik

San Mateo, CA, USA

Published: 6/14/2022

Real Estate

Full Time

Job Description

About AI Fabrik

AI Fabrik builds an edge inference delivery network for high-performance tokens, with faster time-to-market from grid to tokens. Our mission is to build the inference infrastructure we wish every enterprise already had: close to users, close to the cloud, and extremely resilient for real-time workloads. We are builders, architects, engineers, and researchers with hands-on experience deploying AI in production and decades of data center experience that taught us exactly what needs to change.

AI Fabrik was incubated inside Gruve and is backed by Mayfield, Xora (Temasek), Acclimate Ventures, and Cisco Investments, all existing Gruve investors who followed us into this new chapter. We are deploying five initial production sites, with the first one coming online in July 2026.

About the Role

We are seeking an experienced operations leader to oversee the day-to-day management of our mission-critical infrastructure. In this role, you will be responsible for ensuring the reliability, availability, and scalability of live 24x7 production environments, while maintaining exceptional service levels for customers and stakeholders. The ideal candidate has hands-on experience operating critical facilities, establishing and managing service level agreements (SLAs), building strong vendor and partner relationships, and proactively identifying and mitigating risks before they impact operations. You will lead incident response efforts, drive capacity planning initiatives, manage operating budgets, and continuously improve operational processes to support business growth. Experience with high-density GPU deployments, AI infrastructure, and liquid cooling technologies is highly desirable. This is a unique opportunity to help shape and scale the operational foundation of next-generation AI infrastructure.

Key Responsibilities

Own day-to-day operation and uptime of our live sites — keeping power, cooling, network, and compute infrastructure available and healthy in a 24x7 environment
Manage the ongoing vendor ecosystem (facility maintenance, smart/remote hands, cooling, UPS and generator service, fire systems, physical security) — defining, tracking, and enforcing SLAs and holding each vendor to performance, response times, and budget
Build and run the preventive and corrective maintenance program, scheduling maintenance windows and coordinating vendors with minimal disruption to live workloads
Lead incident and outage response — own on-call and escalation, drive rapid resolution, and close the loop with root-cause analysis and preventive actions
Monitor facility health continuously (DCIM, building management, environmental) and manage capacity — power, cooling, space, and rack utilization — ahead of the engineering team's growth
Run change management for the live environment, and coordinate ongoing hardware operations (installs, moves, decommissions, cabling, cross-connects, spares) in support of engineering
Own operating budget and efficiency (opex, utility costs, PUE), physical security operations, and compliance, inspections, and audits (fire/safety, environmental, frameworks such as SOC 2)
Maintain operational documentation (runbooks, MOPs, SOPs/EOPs), report to leadership on uptime, capacity, incidents, and spend, and support new site bring-up and handover into operations as locations come online

Basic Qualifications

Proven experience operating live data centers or critical facilities, including ownership of uptime, maintenance, and vendor performance in a 24x7 environment (this is a hard requirement)
Strong vendor and service-provider management: setting and enforcing SLAs and maintenance contracts, and holding multiple vendors accountable on availability, quality, and cost
Working knowledge of critical facility systems in operation — power (utility, switchgear, UPS, generators, PDUs), mechanical and liquid cooling, fire suppression, cabling, and physical security
Hands-on experience with monitoring and management tooling (DCIM, building/facility management, environmental), plus solid capacity planning for power, cooling, and space
A track record in incident and outage management — on-call ownership, fast resolution, root-cause analysis, and preventive follow-through
Experience managing operating budgets with demonstrated cost and efficiency control (including PUE/energy), and familiarity with relevant codes, standards, and audits (fire/safety, Uptime Institute, TIA-942)
Strong documentation discipline and stakeholder communication — crisp reporting to leadership and coordination across a distributed US/India team; willing to be on-site, carry on-call, and travel as operations demand
Exposure to high-density and GPU/AI infrastructure and liquid/immersion cooling is a strong plus, as are new-site bring-up experience and relevant certifications (e.g., CDCP/CDCDP/DCOM, PMP)

Salary Range

$160,000 - $200,000 USD + Benefits

Why AI Fabrik

At AI Fabrik, we hire for impact. We want those who challenge how inference infrastructure is built and who excel at delivering it in production. We are builders, architects, engineers, and researchers. We move fast, work with rigor, and care deeply about what runs in the real world.

We are committed to building a diverse and inclusive team. AI Fabrik is an equal opportunity employer. We welcome applicants from all backgrounds and thank all who apply; however, only those selected for an interview will be contacted.

Please note that this is an on-site position based at AI Fabrik’s Redwood City, California office.