Principal Reliability and Automation Engineer
Hydrolix
All India, Delhi • 2 months ago
Experience: 10 to 14 Yrs
Job Description
As a Principal Site Reliability Engineer on our Services team, you will play a crucial role in ensuring the reliability and scalability of our platform. Your deep expertise in system reliability and automation will be instrumental in delivering solutions tailored to our customers' unique needs.
**Key Responsibilities:**
- **Reliability Engineering:** Design and build automated systems to ensure the reliability and scalability of Kubernetes clusters and Hydrolix deployments across multiple cloud platforms.
- **Automation and Efficiency:** Identify and eliminate repetitive manual work through automation and improved tooling, freeing the team to focus on high-value work.
- **Observability Infrastructure:** Enhance observability systems for deep visibility into system behavior, debugging, troubleshooting, and data-driven reliability decisions.
- **CI/CD and Deployment Automation:** Design robust CI/CD pipelines and deployment automation for safe, frequent releases with minimal human intervention.
- **Infrastructure Reliability:** Deploy and maintain a highly reliable fleet of Kubernetes clusters and Hydrolix deployments.
- **Service Optimization:** Implement systems and processes to enhance the reliability, availability, and performance of our services.
- **Root Cause Analysis:** Conduct comprehensive root cause analyses for system failures and implement long-term preventive measures.
- **Collaboration and Customer Engagement:** Work closely with cross-functional teams, share knowledge, and champion SRE best practices.
**Qualifications and Skills:**
- **SRE Expertise:** At least 10 years of experience as a Site Reliability Engineer or DevOps Engineer supporting large-scale distributed systems.
- **Architecture, Performance & Scalability:** Deep experience in designing system architectures with reliability, scalability, and operability as primary concerns.
- **Automation, Platform & Infrastructure Engineering:** Track record of eliminating toil through automation, plus expertise in configuration management tools.
- **Observability & Reliability Engineering:** Deep expertise in observability tools and reliability concepts, plus experience with chaos engineering.
- **Kubernetes & Distributed Systems:** Understanding of Kubernetes architecture and operations, plus experience operating multi-cluster environments.
- **Cloud & Multi-Cloud Expertise:** Proficiency in at least one major cloud platform and familiarity with multi-cloud architectures.
- **Networking, Security & Traffic Management:** Experience in network load balancing, security technology stacks, and standard networking protocols.
- **Data & Storage Systems:** Experience with SQL databases and ability to reason about performance and scaling characteristics of data-intensive systems.
- **Programming & Systems Engineering:** Strong programming skills in Go, Python, or Rust with the ability to build and maintain production-quality tools.
- **Linux & Infrastructure Fundamentals:** Deep experience with Linux systems, including performance tuning and low-level troubleshooting.
- **Incident Management & Operational Excellence:** Extensive experience in leading high-severity incidents, driving post-incident reviews, and improving operational standards.
We are excited to see how your expertise can contribute to the success of Hydrolix and make a significant impact on our platform.
Skills Required
Reliability Engineering
Automation
Root Cause Analysis
Knowledge Sharing
Architecture
Networking
Observability Infrastructure
CI/CD
Deployment Automation
Infrastructure Reliability
Service Optimization
Cross-Functional Teamwork
Reliability Advocacy
Global Team Collaboration
Customer-Facing Reliability
Performance & Scalability
Platform & Infrastructure Engineering
Observability & Reliability Engineering
Kubernetes & Distributed Systems
Cloud & Multi-Cloud Expertise
Security & Traffic Management
Data & Storage Systems
Programming & Systems Engineering
Linux & Infrastructure Fundamentals
Incident Management & Operational Excellence
Posted on: March 1, 2026