Principal Reliability and Automation Engineer

Hydrolix

All India, Delhi • 2 months ago

Experience: 10 to 14 Yrs

Job Description

As a Principal Site Reliability Engineer on our Services team, you will play a crucial role in ensuring the reliability and scalability of our platform. Your deep expertise in system reliability and automation will be instrumental in delivering solutions tailored to our customers' needs.

**Key Responsibilities:**

- **Reliability Engineering:** Design and build automated systems to ensure the reliability and scalability of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.
- **Automation and Efficiency:** Identify and eliminate repetitive manual work through automation and improved tooling, freeing the team to focus on high-value work.
- **Observability Infrastructure:** Enhance observability systems to provide deep visibility into system behavior and to support debugging, troubleshooting, and data-driven reliability decisions.
- **CI/CD and Deployment Automation:** Design robust CI/CD pipelines and deployment automation for safe, frequent releases with minimal human intervention.
- **Infrastructure Reliability:** Deploy and maintain a highly reliable fleet of Kubernetes clusters and Hydrolix deployments.
- **Service Optimization:** Implement systems and processes to enhance the reliability, availability, and performance of our services.
- **Root Cause Analysis:** Conduct comprehensive root cause analyses for system failures and implement long-term preventive measures.
- **Collaboration and Customer Engagement:** Work closely with cross-functional teams, share knowledge, and champion SRE best practices.

**Qualifications and Skills:**

- **SRE Expertise:** 10+ years of experience as a Site Reliability Engineer or DevOps Engineer supporting large-scale distributed systems.
- **Architecture, Performance & Scalability:** Deep experience in designing system architectures with reliability, scalability, and operability as primary concerns.
- **Automation, Platform & Infrastructure Engineering:** Track record of eliminating toil through automation, and expertise in configuration management tools.
- **Observability & Reliability Engineering:** Deep expertise in observability tools and reliability concepts, and experience with chaos engineering.
- **Kubernetes & Distributed Systems:** Understanding of Kubernetes architecture and operations, and experience in operating multi-cluster environments.
- **Cloud & Multi-Cloud Expertise:** Proficiency in at least one major cloud platform and familiarity with multi-cloud architectures.
- **Networking, Security & Traffic Management:** Experience in network load balancing, security technology stacks, and standard networking protocols.
- **Data & Storage Systems:** Experience with SQL databases and the ability to reason about the performance and scaling characteristics of data-intensive systems.
- **Programming & Systems Engineering:** Strong programming skills in Go, Python, or Rust, with the ability to build and maintain production-quality tools.
- **Linux & Infrastructure Fundamentals:** Deep experience with Linux systems, including performance tuning and low-level troubleshooting.
- **Incident Management & Operational Excellence:** Extensive experience in leading high-severity incidents, driving post-incident reviews, and improving operational standards.

We are excited to see how your expertise can contribute to the success of Hydrolix and make a significant impact on our platform.

Skills Required

Reliability Engineering, Automation, Root Cause Analysis, Knowledge Sharing, Architecture, Networking, Observability Infrastructure, CI/CD, Deployment Automation, Infrastructure Reliability, Service Optimization, Cross-Functional Teamwork, Reliability Advocacy, Global Team Collaboration, Customer-Facing Reliability, Performance & Scalability, Platform & Infrastructure Engineering, Observability & Reliability Engineering, Kubernetes & Distributed Systems, Cloud & Multi-Cloud Expertise, Security & Traffic Management, Data & Storage Systems, Programming & Systems Engineering, Linux & Infrastructure Fundamentals, Incident Management, Operational Excellence

Posted on: March 1, 2026
