Site Reliability Engineer Senior
RSM
Job Description
You will be joining a global professional services firm that aims to instill confidence in a constantly changing world, empowering both clients and employees to reach their full potential. The inclusive culture and talent experience at RSM are unparalleled, providing you with an inspiring environment to thrive personally and professionally.
As a Senior Platform Site Reliability Engineer at RSM, your primary responsibility will be to ensure the reliability, scalability, and availability of NAS AI Ecosystem platforms. This role involves a combination of software engineering and operations to automate platform operations, enhance observability, and maintain stable production environments for AI, data, and backend services.
**Key Responsibilities:**
- Implement reliability engineering practices for AI and data platforms
- Define and monitor SLIs, SLOs, and SLAs
- Automate operational processes to reduce manual effort
- Manage monitoring, logging, and alerting systems
- Perform incident response and root cause analysis
- Enhance scalability, resilience, and disaster recovery capabilities
- Collaborate with engineering teams to integrate reliability into system design
- Maintain CI/CD pipelines and deployment strategies
- Ensure security and compliance across infrastructure
- Participate in production support and on-call rotations
**Qualifications:**
**Minimum Requirements:**
- Experience in Site Reliability Engineering, DevOps, or Platform Engineering
- Proficiency in Python, Go, or Bash
- Experience with Azure, AWS, or GCP
- Hands-on experience with Docker and Kubernetes
- Experience with Prometheus, Grafana, Azure Monitor, or ELK
- Experience with Terraform, ARM templates, or CloudFormation
- Strong understanding of networking and distributed systems
**Preferred Requirements:**
- Experience supporting AI/ML or data platforms
- Knowledge of chaos engineering and resiliency testing
- Cloud or Kubernetes certifications
- Experience with high-availability, multi-region systems
**Educational Requirements:**
- Bachelor's degree
RSM offers a competitive benefits and compensation package, providing flexibility in your schedule to help you balance work and personal life commitments while serving clients effectively. To explore more about the total rewards offered by RSM, visit https://rsmus.com/careers/india.html.
RSM is dedicated to fostering an inclusive environment and does not tolerate discrimination or harassment based on any protected characteristic. Accommodations for applicants with disabilities are available upon request to ensure a fair recruitment process. If you need assistance or accommodation during the recruiting process, please contact us at careers@rsmus.com.
Posted on: April 8, 2026