Runbook: Backend API High Error Rate

Trigger: Elevated 4xx or 5xx rate on the application API or the frontend BFF.

Impact: Workspace actions, auth refresh, billing, file workflows, and project/session operations may fail.

Quick triage

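  1. Check whether the failure originates in the BFF or the backend API. Response bodies such as 500 Proxy misconfigured or 502 Upstream fetch failed come from the BFF (see "If the errors are mostly 500 or 502").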
  2. Check backend health:
    ```bash
    curl -fsS http://localhost:7000/actuator/health
    ```
    Replace the host and port with the environment-specific backend target if needed.
  3. If run-related endpoints are failing, check whether the Python worker is also unhealthy:
    ```bash
    curl -fsS http://localhost:7002/api/v1/health/readiness
    ```
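The two probes above can be combined into one pass. The following is a minimal sketch, not part of any existing tooling: the function names `check_health` and `triage_summary`, the `BACKEND_URL` and `WORKER_URL` variables, and the 5-second timeout are assumptions. The default URLs are the local ones used in the steps above.

```bash
# Hypothetical defaults; override both for the environment under investigation.
BACKEND_URL="${BACKEND_URL:-http://localhost:7000/actuator/health}"
WORKER_URL="${WORKER_URL:-http://localhost:7002/api/v1/health/readiness}"

# Print "healthy" if the URL answers without an HTTP error (4xx/5xx)
# within 5 seconds, otherwise print "unhealthy".
check_health() {
  if curl -fsS --max-time 5 -o /dev/null "$1" 2>/dev/null; then
    echo healthy
  else
    echo unhealthy
  fi
}

# Probe both services and print one line per service.
triage_summary() {
  echo "backend: $(check_health "$BACKEND_URL")"
  echo "worker:  $(check_health "$WORKER_URL")"
}
```

Call `triage_summary` after setting `BACKEND_URL` and `WORKER_URL` for the target environment. A healthy backend with an unhealthy worker suggests the problem is limited to run-related endpoints.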

If the errors are mostly 401

Check:

  • frontend auth refresh behavior
  • JWT signing and validation configuration
  • backend auth routes and token issuance
  • any recent secret/config drift affecting auth
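When checking JWT validation, it can help to inspect the claims of a token that was rejected. The following is a minimal sketch, assuming the tokens are standard three-segment JWTs and that `base64 -d` (GNU coreutils) is available; `jwt_payload` is a hypothetical helper name. It only decodes the payload and does not verify the signature.

```bash
# Print the decoded JSON payload (second segment) of a JWT.
# This does NOT verify the signature; it is for inspection only.
jwt_payload() {
  # Take the second dot-separated segment and map base64url to base64.
  seg="$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')"
  # Restore the '=' padding that base64url omits.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}
```

Run `jwt_payload "$TOKEN"` and compare claims such as `exp`, `iss`, and `aud` with what the backend is configured to accept. Which claims are present depends on how the backend issues tokens.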

Use the auth module and postmortem docs as the first references.

If the errors are mostly 429

Check:

  • whether the spike is a request-rate issue or a quota exhaustion issue
  • Redis health and key pressure for rate limiting
  • user tier / quota changes that may have tightened limits
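For the Redis check, memory pressure can be read from the `INFO memory` output. The following is a minimal sketch, assuming `redis-cli` can reach the Redis instance used for rate limiting; `redis_memory_pct` is a hypothetical helper, and `REDIS_HOST` in the usage line is a placeholder. It reads `INFO` output on stdin and reports `used_memory` as a percentage of `maxmemory`.

```bash
# Read `redis-cli info memory` output on stdin and print used_memory as an
# integer percentage of maxmemory, or "unlimited" when maxmemory is 0.
redis_memory_pct() {
  # INFO lines end in CRLF; strip the carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%d\n", used * 100 / max
      else print "unlimited"
    }'
}
```

Usage: `redis-cli -h "$REDIS_HOST" info memory | redis_memory_pct`. A value close to 100 suggests the instance is near its memory limit; `unlimited` means no `maxmemory` is configured.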

If the errors are mostly 500 or 502

Check:

  • BFF configuration for BACKEND_ORIGIN, CF_CLIENT_ID, and CF_CLIENT_SECRET
  • backend application logs and actuator health
  • Python worker reachability for run-related failures
  • recent billing, file upload, or R2 changes if the errors cluster around those workflows
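The BFF configuration check can be done without printing secret values. The following is a minimal sketch, assuming the three settings are exposed as environment variables in the shell where it runs (for example, inside the BFF's runtime environment); `check_bff_env` is a hypothetical helper name.

```bash
# Report which required BFF settings are unset or empty.
# Values are never printed. Returns 1 if anything is missing, 0 otherwise.
check_bff_env() {
  missing=0
  for name in BACKEND_ORIGIN CF_CLIENT_ID CF_CLIENT_SECRET; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_bff_env` in the BFF's environment. No output and exit status 0 mean all three variables are set; it does not check that the values are correct.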

Common BFF-originated error bodies:

  • 500 Proxy misconfigured
  • 502 Upstream fetch failed
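These bodies make it possible to tell BFF-originated failures from backend ones when sampling failed responses. The following is a minimal sketch, assuming the message text appears somewhere in the response body (the exact body format is not specified here); `classify_origin` and its output labels are hypothetical.

```bash
# Classify a failed response by HTTP status code and body text.
# Usage: classify_origin <status-code> <body>
classify_origin() {
  code="$1"
  body="$2"
  case "$code:$body" in
    500:*"Proxy misconfigured"*)   echo "bff-config" ;;
    502:*"Upstream fetch failed"*) echo "bff-upstream" ;;
    5??:*)                         echo "backend-or-other" ;;
    *)                             echo "not-5xx" ;;
  esac
}
```

For example, `classify_origin 502 'Upstream fetch failed'` prints `bff-upstream`, which points at backend reachability from the BFF, while `bff-config` points at the BFF settings listed above.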

Escalation

If customer-facing write flows have been failing for more than 5 minutes:

  1. Page the backend on-call owner.
  2. Open an incident issue.
  3. Record in the issue whether the failure is auth, quota, file, billing, or Python-worker related.
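To capture the failure category for the incident issue, a status-code tally shows which class of error dominates. The following is a minimal sketch, assuming the status codes have already been extracted from logs one per line (the extraction step depends on the log format, which is not specified here); `status_breakdown` and `runbook_section` are hypothetical helper names.

```bash
# Read one HTTP status code per line on stdin and print "<code> <count>",
# most frequent first.
status_breakdown() {
  sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Map a dominant status code to the section of this runbook that covers it.
runbook_section() {
  case "$1" in
    401)     echo "If the errors are mostly 401" ;;
    429)     echo "If the errors are mostly 429" ;;
    500|502) echo "If the errors are mostly 500 or 502" ;;
    *)       echo "Quick triage" ;;
  esac
}
```

Pipe the extracted codes into `status_breakdown` and pass the top code to `runbook_section`. The result only narrows the search; file, billing, and Python-worker failures still need to be identified from the failing endpoints.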

Post-action

  1. Link the failing surface to the relevant API reference page.
  2. Add or update a runbook if the failure mode was not already documented.