Runbook: Backend API High Error Rate

Trigger: Elevated 4xx or 5xx rate on the application API or the frontend BFF. Impact: Workspace actions, auth refresh, billing, file workflows, and project/session operations may fail.

Quick triage

Check whether the failure is in the BFF or the backend API.
Check backend health:
bash
```
curl -fsS http://localhost:7000/actuator/health
```
Replace the host and port with the environment-specific backend target if needed.
Confirm whether the Python worker is also unhealthy if run-related endpoints are failing:
bash
```
curl -fsS http://localhost:7002/api/v1/health/readiness
```

If the errors are mostly `401`

Check:

frontend auth refresh behavior
JWT signing and validation configuration
backend auth routes and token issuance
any recent secret/config drift affecting auth

Use the auth module and postmortem docs as the first references:

If the errors are mostly `429`

Check:

whether the spike is a request-rate issue or a quota exhaustion issue
Redis health and key pressure for rate limiting
user tier / quota changes that may have tightened limits

Relevant docs:

If the errors are mostly `500` or `502`

Check:

BFF configuration for BACKEND_ORIGIN, CF_CLIENT_ID, and CF_CLIENT_SECRET
backend application logs and actuator health
Python worker reachability for run-related failures
recent billing, file upload, or R2 changes if the errors cluster around those workflows

Common BFF-originated error bodies:

500 Proxy misconfigured
502 Upstream fetch failed

Escalation

If customer-facing write flows are failing for more than 5 minutes:

Page the backend on-call owner
Open an incident issue
Capture whether the failure is auth, quota, file, billing, or Python-worker related

Post-action

Link the failing surface to the relevant API reference page.
Add or update a runbook if the failure mode was not already documented.

Runbook: Backend API High Error Rate ​

Quick triage ​

If the errors are mostly 401 ​

If the errors are mostly 429 ​

If the errors are mostly 500 or 502 ​

Escalation ​

Post-action ​

Runbook: Backend API High Error Rate

Quick triage

If the errors are mostly `401`

If the errors are mostly `429`

If the errors are mostly `500` or `502`

Escalation

Post-action