Zendesk’s people have one goal in mind: to make Customer Experience better.
Our products help more than 125,000 global brands (Airbnb, Uber, JetBrains, and Slack, among others) make their billions of customers happy, every day.
The AI/ML Platform team is at the forefront of this mission.
We build the foundation that powers every AI-driven experience at Zendesk, enabling product teams to build, evaluate, and deploy state-of-the-art Large Language Model (LLM) applications reliably and at scale.
We're looking for a Senior ML Engineer to lead the next wave of GenAI infrastructure at Zendesk.
This includes our internal research platform, LLM Proxy, A/B testing and evaluation benchmarking, and agentic workflow orchestration tools.
You’ll empower Zendesk’s ML/AI teams by building secure, cost-optimized, and developer-friendly ML platforms that scale across use cases and products.
You’ll work closely with Staff Engineers, Tech Leads, Product Managers, and other ML teams to deliver robust, production-grade systems that accelerate the impact of AI across Zendesk.
What you’ll be doing
 
Help build benchmarking frameworks for LLMs, including A/B testing and offline evaluation capabilities, to assess quality, latency, and cost trade-offs.
Contribute to the design and implementation of Zendesk’s LLM Proxy to enable safe, observable, and cost-optimized access to multiple foundation models.
Partner with applied ML, product, and platform teams to ensure GenAI infrastructure meets the needs of diverse product use cases.
Implement best practices for monitoring, observability, rate-limiting, and cost attribution for LLM services.
Establish strong engineering practices around observability, reliability, security, and cost monitoring.
Work on orchestration tooling to enable multi-step, tool-using AI agents that integrate with Zendesk’s products.
Qualifications
5+ years of experience developing and deploying ML systems in production, with hands-on experience scaling infrastructure and ensuring service reliability.
Familiarity with core ML infrastructure components such as model registries, feature stores, orchestration tools, and inference serving systems.
Understanding of LLM systems, GenAI applications, or ML/AI platform components such as vector databases, serving layers, and orchestration tools.
Experience with GCP, AWS, or Azure; Kubernetes; Docker; and distributed systems.
Proficiency in at least one server-side language (Python, Java, Scala, Golang, or Ruby) and solid grounding in testing and CI/CD workflows.
Understanding of architecture principles and patterns for building scalable, resilient backend services.
Experience taking projects from design to production deployment, with a focus on maintainability and performance.
Preferred Qualifications
Agentic AI and automation: experience applying AI technologies to automate processes and to build agentic solutions and frameworks.
Experience building tools that improve developer productivity and platform adoption across multiple teams.
What our tech stack looks like
Our code is written in Python.
Our servers live in AWS.
LLM vendors: OpenAI, Anthropic, Google, and Meta (Llama)
Infra: Kubernetes, Docker, Kafka, AWS
What we offer
Full ownership of the projects you work on.
Work that has a huge impact.
Team of passionate people who love what they do.
Exciting projects, ability to implement your own ideas and improvements.
Opportunity to learn and grow.
...and everything you need to be effective and maintain work-life balance:
Flexible working hours.
Professional development funds.
Comfortable office and a remote setup.
Choice of your laptop and other equipment.
Premium Medical Insurance as well as Private Life Assurance.
Hybrid: In this role, our hybrid experience is designed at the team level to give you a rich onsite experience packed with connection, collaboration, learning, and celebration, while also giving you the flexibility to work remotely for part of the week.
This role must attend our local office for part of the week.
The specific in-office schedule is to be determined by the hiring manager.