AI Agent Performance & Reliability Engineer
Pilotcrew AI
All India, Delhi • 1 month ago
Experience: 2 to 6 Yrs
Job Description
As an AI Agent Performance & Reliability Engineer at Pilotcrew AI, you will design and build scalable evaluation infrastructure for Large Language Models (LLMs) and AI agents. The role involves architecting distributed inference pipelines, implementing automated benchmarking systems, developing adversarial testing frameworks, and optimizing inference for latency, cost, and throughput.
Key Responsibilities:
- Design and implement distributed LLM inference pipelines
- Build automated benchmarking systems for reasoning, planning, and tool use
- Implement pass@k, reliability metrics, variance analysis, and statistical confidence evaluation
- Develop adversarial testing frameworks for stress-testing agents
- Create structured evaluation pipelines (rule-based and model-based graders)
- Build trace capture, logging, and telemetry systems for multi-step agent workflows
- Validate tool calls and sandboxed execution environments
- Optimize inference for latency, cost, and throughput
- Manage dataset versioning and reproducible benchmark pipelines
- Deploy and monitor GenAI systems in production (AWS/GCP/Azure)
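To illustrate the pass@k metric listed among the responsibilities: pass@k estimates the probability that at least one of k sampled generations passes, given n total samples of which c passed. A minimal sketch of the standard unbiased estimator (the function name `pass_at_k` is illustrative, not from the posting):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n -- total samples generated per task
    c -- number of those samples that passed
    k -- budget of samples we imagine drawing
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    # Product form of 1 - C(n-c, k)/C(n, k), numerically stable for large n.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with 2 samples of which 1 passed, pass@1 is 0.5; with all samples passing, pass@k is 1.0 for any k ≤ n.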
Qualifications Required:
- Strong Python programming and system design skills
- Hands-on experience with Generative AI systems and LLM APIs
- Experience with PyTorch or TensorFlow
- Experience building production ML or GenAI systems
- Strong understanding of decoding strategies, temperature effects, and sampling variance
- Familiarity with async processing, distributed task execution, or job scheduling
- Experience with Docker and cloud deployment
- Strong debugging, observability, and reliability engineering mindset
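The "temperature effects and sampling variance" item above can be illustrated with a minimal softmax-sampling sketch (all names here are illustrative, assuming plain list-of-float logits): lower temperature sharpens the token distribution and reduces output variance, higher temperature flattens it.

```python
import math
import random

def temperature_softmax(logits, temperature):
    """Convert logits to probabilities; T < 1 sharpens, T > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

At very low temperature the sampler is effectively greedy (always the argmax token); as temperature rises, repeated calls spread across more tokens, which is exactly the run-to-run variance an evaluation pipeline must account for.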
Additional Company Details:
Pilotcrew AI builds infrastructure for AI Agent Evaluation, benchmarking large language models, running automated agent evaluations, and hosting AI arenas for competitive testing. The company's mission is to make AI agents measurable, reliable, and production-ready through structured, scalable evaluation systems.
Why Join Pilotcrew AI:
- Work on cutting-edge AI agent evaluation infrastructure
- Solve real-world GenAI reliability challenges
- High technical ownership and autonomy
- Opportunity to shape how AI agents are benchmarked at scale
Posted on: March 16, 2026