About This Role
Airbnb was born in 2007 when two hosts welcomed three guests to their San Francisco home, and has since grown to over 5 million hosts who have welcomed over 2 billion guest arrivals in almost every country across the globe. Every day, hosts offer unique stays and experiences that make it possible for guests to connect with communities in a more authentic way.
Key Responsibilities
- Embrace an AI-first engineering approach, using LLM-powered agents to generate and iterate on code while you focus on problem-solving, system design, and quality oversight.
- Investigate and resolve complex production issues by analyzing distributed traces, resource utilization patterns, and system metrics to identify root causes and implement durable fixes.
- Design and implement observability features including span instrumentation, SLO dashboards, and fine-grained attribution for blocking time, memory, and CPU across tenant workloads.
- Develop and iterate on tooling for deployment triage, service health monitoring, and incident response automation using LLM capabilities.
- Lead technical design discussions and RFCs for initiatives like performance regression testing pipelines, emergency deployment workflows, and runtime resiliency improvements.
- Partner with tenant teams to debug performance issues, provide guidance on GraphQL best practices, and enable self-service capabilities for common operational tasks.
- Contribute to open-source Viaduct by ensuring platform improvements are generalizable and well-documented for the broader engineering community.
Requirements
- 9+ years of software engineering experience, with significant depth in backend systems, distributed architectures, and platform engineering.
- Deep expertise in observability and monitoring, including experience designing SLO frameworks, distributed tracing systems, and metrics pipelines at scale.
- Proven track record in reliability engineering, with hands-on experience in incident response, root cause analysis, and building systems that maintain high availability (99.99%+).
- Strong experience with performance tuning and resource management in JVM-based systems, including profiling, garbage collection optimization, and understanding of concurrency models (blocking I/O, thread pools, coroutines in Kotlin).
- Experience operating critical, high-traffic systems with a focus on deployment safety, automated rollbacks, and progressive delivery strategies.
- Familiarity with GraphQL or similar API gateway/data access layer technologies.
- Experience building developer tooling and platforms, with a product mindset focused on developer experience and self-service capabilities.
- Strong leadership and communication skills with the ability to partner effectively across infrastructure and product engineering teams.