Manager, Software Engineering (Resilience Engineering)
Remote | Full-time | Posted 2 weeks ago
Salary
$178,000 - $228,000
Experience
Mid
Job Type
Full-time
Posted
2 weeks ago
About This Role
Affirm is reinventing credit to make it more honest and friendly, giving consumers the flexibility to buy now and pay later without any hidden fees or compounding interest.
Key Responsibilities
- Own the design and evolution of platforms that enable safe, controlled production load testing and fault injection.
- Ensure strong safeguards are in place, including isolation boundaries, approval workflows, and automated rollback mechanisms to protect real users.
- Build systems that provide end-to-end observability, traceability, and auditability for all resilience experiments.
- Drive reliability improvements by systematically identifying weaknesses through load testing and chaos experiments.
- Establish monitoring, alerting, and incident response practices tailored to proactive resilience validation.
- Work closely with engineering teams to design and execute production load tests and chaos experiments safely.
- Partner with infrastructure teams to build guardrails around tests and experiments.
- Enable teams to adopt resilience practices by providing reusable tooling, frameworks, and standardized workflows.
- Identify systemic weaknesses and lead cross-functional efforts to improve reliability and fault tolerance.
- Evangelize a culture of “test failure before failure tests you” across the organization.
Requirements
- Proven experience leading engineering teams in reliability, infrastructure, or distributed systems.
- Hands-on experience with production load testing, chaos engineering, or large-scale system validation.
- Experience using a chaos engineering vendor such as Gremlin or Harness, or a similar tool.
- Strong understanding of failure modes in distributed systems, including latency, partial failure, and cascading outages.
- Experience building or operating systems with strong safety guarantees (isolation, rate limiting, guardrails, auditability).
- Familiarity with cloud-native environments (AWS, Kubernetes) and observability tooling.
- Strong programming background (e.g., Python, Kotlin, Java, or similar).
- Excellent problem-solving skills and the ability to balance long-term resilience investments with immediate business needs.
- Strong communication and leadership skills, with a track record of influencing engineering practices across teams.
Perks & Benefits
- Health care coverage: Affirm covers all premiums for all levels of coverage for you and your dependents.
- Flexible Spending Wallets: generous stipends for spending on technology, food, various lifestyle needs, and family-forming expenses.
- Time off: competitive vacation and holiday schedules that let you take time off to rest and recharge.
- ESPP: an employee stock purchase plan enabling you to buy shares of Affirm at a discount.