Senior Site Reliability Engineer

Gruve

All India • 1 month ago

Experience: 6 to 10 Yrs

Job Description

Role Overview:

At Gruve, you will lead reliability strategy and architectural improvements across infrastructure, GPU systems, observability, ML Ops, and IT Ops. The role involves mentoring engineers, managing high-severity incidents, and driving SLO governance. Working with a team of SRE engineers, you will be responsible for setting up, maintaining, and troubleshooting the stack from bare metal through the application layer.

Key Responsibilities:

- Architect reliability improvements across Kubernetes, GPU infrastructure, ML Ops, networking, and monitoring.
- Lead incident management, blameless post-mortems, and error-budget policies.
- Drive automation, IaC, and reliability tooling at scale.
- Oversee metrics, logs, tracing, and dashboards; ensure actionable alerting.
- Integrate GPU operators/exporters and model lifecycle workflows for inference platforms.
- Mentor junior and mid-level SREs and guide cross-team initiatives.

Qualifications Required:

- 6-9 years of SRE or platform engineering experience.
- Expertise in Kubernetes operations and cloud platform experience (AWS/GCP/Azure).
- Advanced networking and security fundamentals.
- Strong coding background in Python, Go, or Java.
- Deep observability knowledge in Prometheus, Grafana, ELK / Fluentd.

About Gruve:

Gruve is an innovative software services startup dedicated to transforming enterprises into AI powerhouses. It specializes in cybersecurity, customer experience, cloud infrastructure, and advanced technologies such as Large Language Models (LLMs), and its mission is to help customers use their data to make more intelligent decisions. As a well-funded early-stage startup, Gruve offers a dynamic environment with strong customer and partner networks. For people who are passionate about technology and eager to make an impact, Gruve fosters a culture of innovation, collaboration, and continuous learning in a diverse and inclusive workplace.

Gruve is an equal opportunity employer welcoming applicants from all backgrounds.

Posted on: March 6, 2026
