Senior Site Reliability Engineer

Gruve

All India • 1 month ago

Experience: 6 to 10 Yrs

Job Description

Role Overview:
At Gruve, you will lead reliability strategy and architectural improvements across infrastructure, GPU systems, observability, ML Ops, and IT Ops. The role involves mentoring engineers, managing high-severity incidents, and driving SLO governance. Working with a team of SRE engineers, you will set up, maintain, and troubleshoot the stack from bare metal through the application layer.

Key Responsibilities:
- Architect reliability improvements across Kubernetes, GPU infrastructure, ML Ops, networking, and monitoring.
- Lead incident management, blameless post-mortems, and error-budget policies.
- Drive automation, IaC, and reliability tooling at scale.
- Oversee metrics, logs, tracing, and dashboards; ensure actionable alerting.
- Integrate GPU operators/exporters and model lifecycle workflows for inference platforms.
- Mentor junior and mid-level SREs and guide cross-team initiatives.

Qualifications Required:
- 6 to 9 years of SRE or platform engineering experience.
- Expertise in Kubernetes operations and experience with a cloud platform (AWS/GCP/Azure).
- Advanced networking and security fundamentals.
- Strong coding background in Python, Go, or Java.
- Deep observability knowledge in Prometheus, Grafana, and ELK / Fluentd.

About Gruve:
Gruve is a software services startup dedicated to transforming enterprises into AI powerhouses. It specializes in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs), and its mission is to help customers use their data to make more intelligent decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks. For those who are passionate about technology and eager to make an impact, Gruve fosters a culture of innovation, collaboration, and continuous learning in a diverse and inclusive workplace.

Gruve is an equal opportunity employer welcoming applicants from all backgrounds.

Skills Required

Kubernetes, networking, monitoring, Python, Go, Java, GPU infrastructure, ML Ops, Prometheus, Grafana, ELK, Fluentd

Posted on: March 6, 2026
