Runbook: Backend API High Error Rate
Trigger: Elevated 4xx or 5xx rate on the application API or the frontend BFF.
Impact: Workspace actions, auth refresh, billing, file workflows, and project/session operations may fail.
Quick triage
- Check whether the failure is in the BFF or the backend API.
- Check backend health. Replace the host and port with the environment-specific backend target if needed:
      curl -fsS http://localhost:7000/actuator/health
- If run-related endpoints are failing, confirm whether the Python worker is also unhealthy:
      curl -fsS http://localhost:7002/api/v1/health/readiness
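The two health checks above can be wrapped into one pass so both results are visible together. This is a minimal sketch, not an official tool: the function names `probe` and `triage_health` and the override variables `BACKEND_HEALTH_URL` and `WORKER_HEALTH_URL` are hypothetical, and the 5-second timeout is an assumed value.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints "healthy" when curl gets a non-error response
# within 5 seconds (curl -f fails on HTTP 4xx/5xx), otherwise "unhealthy".
probe() {
  if curl -fsS --max-time 5 -o /dev/null "$1" 2>/dev/null; then
    echo "healthy"
  else
    echo "unhealthy"
  fi
}

# Probes both targets from this runbook. The *_HEALTH_URL variables are
# assumed overrides for environment-specific hosts and ports.
triage_health() {
  local backend="${BACKEND_HEALTH_URL:-http://localhost:7000/actuator/health}"
  local worker="${WORKER_HEALTH_URL:-http://localhost:7002/api/v1/health/readiness}"
  echo "backend: $(probe "$backend")"
  echo "python-worker: $(probe "$worker")"
}
```

Source the file and run `triage_health`. A healthy backend with an unhealthy worker points toward run-related failures rather than a general API outage.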
If the errors are mostly 401
Check:
- frontend auth refresh behavior
- JWT signing and validation configuration
- backend auth routes and token issuance
- any recent secret/config drift affecting auth
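When checking JWT validation, it can help to look at the claims of a failing token (for example `exp`, `iss`, `aud`) and compare them with what the backend is configured to accept. The sketch below only decodes the payload segment and does not verify the signature. The function name `jwt_payload` is hypothetical, and it assumes a `base64` binary that accepts `-d`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the decoded payload (second segment) of a JWT.
# It does NOT verify the signature; it is for inspecting claims only.
jwt_payload() {
  local seg pad
  # Take the middle segment and map base64url characters back to base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding strips.
  pad=$(( (4 - ${#seg} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    seg="${seg}="
    pad=$((pad - 1))
  done
  printf '%s' "$seg" | base64 -d
}
```

Usage: `jwt_payload "$TOKEN"`, where `TOKEN` holds a token captured from a failing request. Avoid pasting production tokens into shared channels or the incident issue.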
Use the auth module and postmortem docs as the first references.
If the errors are mostly 429
Check:
- whether the spike is a request-rate issue or a quota exhaustion issue
- Redis health and key pressure for rate limiting
- user tier / quota changes that may have tightened limits
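For the Redis check, `redis-cli ping` confirms reachability, and the `INFO` output exposes counters such as `evicted_keys` and `used_memory` that hint at key pressure. The sketch below extracts a single field from `INFO` output. The function name `redis_info_field` is hypothetical, and the connection flags depend on the environment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `redis-cli info` output on stdin and prints the
# value of one field. INFO lines look like "name:value" and end in CRLF.
redis_info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Example (add -h/-p/-a flags as the environment requires):
#   redis-cli ping
#   redis-cli info stats  | redis_info_field evicted_keys
#   redis-cli info memory | redis_info_field used_memory
```

A climbing `evicted_keys` value suggests Redis is dropping keys under memory pressure, which is worth ruling out before treating the 429 spike as purely traffic-driven.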
Relevant docs:
If the errors are mostly 500 or 502
Check:
- BFF configuration for BACKEND_ORIGIN, CF_CLIENT_ID, and CF_CLIENT_SECRET
- backend application logs and actuator health
- Python worker reachability for run-related failures
- recent billing, file upload, or R2 changes if the errors cluster around those workflows
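A quick way to rule out missing BFF configuration is to check that the three variables named above are non-empty. This is a minimal sketch that assumes the settings are exposed as environment variables in the shell where it runs, which may not match how the BFF actually loads them. The function name `missing_bff_vars` is hypothetical. It prints names only, never values.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the names of required BFF settings that are
# unset or empty, and returns non-zero if any are missing.
missing_bff_vars() {
  local name missing=""
  for name in BACKEND_ORIGIN CF_CLIENT_ID CF_CLIENT_SECRET; do
    # ${!name} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      missing="${missing:+$missing }$name"
    fi
  done
  printf '%s\n' "$missing"
  [ -z "$missing" ]
}
```

Run `missing_bff_vars` inside the BFF's runtime environment (for example a shell in the same container). An empty line and a zero exit status mean all three are set.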
Common BFF-originated error bodies:
- 500 Proxy misconfigured
- 502 Upstream fetch failed
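The two BFF-originated bodies listed above give a cheap way to tell whether a failing response was produced by the BFF itself. The sketch below matches a captured response body against those strings. The function name `error_origin` and its output labels are hypothetical, and a non-match only means the body is not one of the known BFF messages.

```bash
#!/usr/bin/env bash
# Hypothetical helper: labels a response body as BFF-originated when it
# contains one of the known BFF error messages from this runbook.
error_origin() {
  case "$1" in
    *"Proxy misconfigured"*|*"Upstream fetch failed"*)
      echo "bff"
      ;;
    *)
      echo "not-a-known-bff-body"
      ;;
  esac
}
```

Usage: `error_origin "$(cat captured_response_body.txt)"`, where the file name stands in for wherever the failing body was saved.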
Escalation
If customer-facing write flows are failing for more than 5 minutes:
- Page the backend on-call owner
- Open an incident issue
- Capture whether the failure is auth, quota, file, billing, or Python-worker related
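To back the captured category with numbers, a tally of response status codes shows which class dominates (401 suggests auth, 429 suggests quota or rate limiting). This sketch assumes the status codes have already been extracted from logs, one per line. The extraction step depends on the log format, which this runbook does not specify, and the function name `status_tally` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads HTTP status codes on stdin (one per line) and
# prints "<status> <count>" lines, most frequent first.
status_tally() {
  awk 'NF { count[$1]++ } END { for (s in count) print s, count[s] }' |
    sort -k2,2nr -k1,1
}
```

Usage: pipe the extracted codes in, for example `cut -d' ' -f9 access.log | status_tally`, where the field number and file name are placeholders for the actual log layout. The first output line is the dominant status to record in the incident issue.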
Post-action
- Link the failing surface to the relevant API reference page.
- Add or update a runbook if the failure mode was not already documented.