Site Reliability Engineer, Siri Evaluation Reliability
Apple
Yokohama, Kanagawa-ken, Japan
5+ years
Siri’s quality signal drives every model and product decision before a release ships. But a signal is only as trustworthy as the infrastructure behind it.
The Evaluation Reliability Engineering (ERE) team exists to make that infrastructure bulletproof. Within ERE, Core SRE owns the production backbone: resource management, session orchestration, on-call response, and the observability systems that surface failures before they corrupt the evaluation signal. We sit at the intersection of distributed systems, ML evaluation infrastructure, and operational excellence.
This is a senior hands-on role. You share primary on-call as part of a global follow-the-sun rotation, lead incident investigations end-to-end, and set the operational bar the rest of the team works against. You are fluent with agentic coding tools like Claude Code, Cursor, or Copilot, and use them as a force multiplier across runbook authoring, automation, and log analysis.
- Own reliability outcomes across the evaluation infrastructure stack: orchestration, capacity, and service health
- Own runbook quality across the team: author runbooks for complex failure categories and set the standard that other engineers are expected to match
- Build deep expertise in the device orchestration and provisioning layers — understand quota management, retry behavior, and failure modes well enough to diagnose upstream issues independently
- Instrument infrastructure components that lack observability; if a failure is hard to detect, make it easy to detect before the next occurrence
- Balance incident response with proactive reliability work — automation and eliminating recurring failures are core deliverables
- Partner on SLO definition and burn-rate alerting; bring the operational depth that turns reliability targets from aspirational to measurable
- Influence the team’s technical roadmap, mentor junior SREs, and represent infrastructure reliability posture to leadership and in cross-team reviews
- 5+ years of site reliability, infrastructure, or platform engineering experience with direct on-call ownership in production systems
- Hands-on orchestration experience (Kubernetes or equivalent): cluster health, resource management, scheduling, and failure diagnosis at scale
- Experience owning or closely operating a device or VM provisioning pipeline; familiarity with virtualization-layer failure modes is a strong plus
- Track record of improving system reliability against measurable outcomes — uptime, MTTR, incident frequency — not just responding to incidents but eliminating their causes
- Incident command discipline: able to lead a multi-team incident from declaration to close-out
- Depth in at least one of: distributed systems reliability, device management infrastructure, evaluation or ML platform operations
- Demonstrated cross-team technical influence; prior experience shaping reliability practices beyond the immediate team