Postmortem: 2024-02-15 Login Outage
Timeline (UTC)
- 14:00: Deployment of v1.2.0 completes.
- 14:02: Alert High401Rate fires. PagerDuty notifies on-call.
- 14:10: On-call confirms users cannot log in.
- 14:15: Rollback to v1.1.9 initiated.
- 14:20: System recovered.
Root Cause
The JWT_SECRET environment variable was accidentally removed from the Doppler prd configuration during a cleanup. With the secret missing, token signatures no longer matched those produced with the original secret, so token validation failed.
Impact
- 1,200 users experienced login failures.
- 15 minutes of total downtime for the Auth module.
Temporary Fix
Rolled back the deployment and manually re-injected the JWT_SECRET.
Permanent Fix
Implement a startup assertion in JwtAuthenticationFilter that fails fast: the application refuses to boot if JWT_SECRET is null or shorter than 32 characters.
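
The startup assertion could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual change: the class name `JwtSecretCheck` and the method `requireValidSecret` are hypothetical, and the real check would live in or be called from JwtAuthenticationFilter. The environment is passed in as a map so the check can be unit-tested without touching the real process environment.

```java
import java.util.Map;

// Hypothetical helper; the name and location are assumptions for illustration.
final class JwtSecretCheck {

    // Minimum length taken from the permanent fix described in this postmortem.
    static final int MIN_SECRET_LENGTH = 32;

    private JwtSecretCheck() {}

    /**
     * Returns JWT_SECRET from the given environment if it is present and
     * long enough; otherwise throws so that startup aborts.
     */
    static String requireValidSecret(Map<String, String> env) {
        String secret = env.get("JWT_SECRET");
        if (secret == null) {
            throw new IllegalStateException(
                "JWT_SECRET is not set; refusing to start");
        }
        if (secret.length() < MIN_SECRET_LENGTH) {
            throw new IllegalStateException(
                "JWT_SECRET must be at least " + MIN_SECRET_LENGTH
                    + " characters; refusing to start");
        }
        return secret;
    }
}
```

Calling `JwtSecretCheck.requireValidSecret(System.getenv())` while the filter is constructed would let the exception propagate and stop the boot, so a missing secret surfaces at deploy time rather than as login failures. The error messages deliberately never include the secret value.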
Action Items
- [x] Add startup assertion for JWT_SECRET (Owner: @backend-lead, Due: 2024-02-20)
- [ ] Implement Doppler secret drift detection in CI (Owner: @sre, Due: 2024-03-01)
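
The core of the drift check can be sketched as a set difference between the secret names the service requires and the names present in the target configuration. This is a minimal sketch under assumptions: `SecretDriftCheck` and `missingSecrets` are hypothetical names, the required list is assumed to be checked into the repository, and how CI obtains the names present in the Doppler prd configuration is not shown.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; names are assumptions for illustration.
final class SecretDriftCheck {

    private SecretDriftCheck() {}

    /**
     * Returns the required secret names that are absent from the
     * configuration, sorted so that CI output is stable.
     */
    static Set<String> missingSecrets(Set<String> required, Set<String> present) {
        Set<String> missing = new TreeSet<>(required);
        missing.removeAll(present);
        return missing;
    }
}
```

A CI step could fail the build whenever `missingSecrets(...)` returns a non-empty set, which would have flagged the removal of JWT_SECRET before deployment. Only names are compared, so secret values never need to enter the CI logs.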