Job Description
Site Reliability Engineer
Onsite - Bay Area, CA
Relevant Skills and Experience
What You’ll Do (Day-to-Day)
- Own and manage our cloud infrastructure (GCP or AWS, on-prem).
- Build, maintain, and optimize Kubernetes clusters (including GPU-backed clusters).
- Implement and improve CI/CD pipelines (GitHub Actions).
- Write and maintain Infrastructure as Code (Terraform).
- Monitor system health and performance using Grafana and other observability tools.
- Ensure high availability, reliability, and uptime across platforms.
- Handle infrastructure maintenance, upgrades, and scaling.
- Administer and improve our platform architecture and apply general security best practices across the stack.
Note: This is an internal-facing role — no customer interaction.
Must-Have:
- 4+ years in SRE, DevOps, or Infrastructure Engineering
- Solid experience with GCP or AWS (hybrid/on-prem a plus)
- Experience with Kubernetes cluster management (GPU experience a bonus)
- Hands-on experience with Terraform and CI/CD (GitHub Actions)
- Experience with monitoring/observability (Grafana, etc.)
- Strong understanding of high availability and infrastructure reliability
- Familiarity with platform/cluster architecture and administration
- Security mindset and ability to apply best practices
Nice-to-Have:
- Startup experience (you enjoy building, not just maintaining)
- Experience with scalable GPU infrastructure for AI/ML