# Runbook: Python Worker Degraded
**Trigger:** AI runs fail, indexing stalls, or the Python worker health endpoints degrade.

**Impact:** PRECHECK, T661, LOGBOOK, and file indexing workflows may fail or stall.
## Health checks
Check the worker health endpoints first:
```bash
curl -fsS http://localhost:7002/api/v1/health/
curl -fsS http://localhost:7002/api/v1/health/readiness
curl -fsS http://localhost:7002/api/v1/health/liveness
```

If the environment runs the worker on port 8000, replace the port accordingly.
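The three probes above can be wrapped in a small loop that reports each endpoint's status at a glance. A minimal sketch; the base URL defaults to the port 7002 used in the examples above:

```bash
# Probe all three health endpoints and print a one-line status for each.
probe() {
  local base="${1:-http://localhost:7002}"
  local ep
  for ep in "" readiness liveness; do
    if curl -fsS --max-time 5 "$base/api/v1/health/$ep" >/dev/null 2>&1; then
      echo "OK   /api/v1/health/$ep"
    else
      echo "FAIL /api/v1/health/$ep"
    fi
  done
}

probe            # or: probe http://localhost:8000
```

A `FAIL` on `/health/` alone usually means the process is down or unreachable; `FAIL` only on `readiness` points at missing dependencies (see below).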
## Common failure classes
### Java cannot reach Python
Check:
- `PYTHON_SERVICE_BASE_URL` in the backend
- network reachability from the backend to the worker
- whether the worker process is actually listening on the expected port
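These checks can be run from a shell on each host. A sketch, assuming the default port 7002 from the health-check section; `ss` may need to be swapped for `netstat` on older distributions:

```bash
# Step through the backend→worker path. PYTHON_SERVICE_BASE_URL is the backend
# setting named in this runbook; port 7002 matches the health-check examples.
worker_url="${PYTHON_SERVICE_BASE_URL:-http://localhost:7002}"
echo "PYTHON_SERVICE_BASE_URL=${PYTHON_SERVICE_BASE_URL:-<unset>} (using $worker_url)"

# 1. On the worker host: is anything listening on the expected port?
ss -ltn 2>/dev/null | grep -q ':7002' \
  && echo "listener found on 7002" \
  || echo "no listener on 7002 (run this on the worker host)"

# 2. On the backend host: can we reach the worker over the network?
curl -fsS --max-time 5 "$worker_url/api/v1/health/" >/dev/null 2>&1 \
  && echo "worker reachable from this host" \
  || echo "worker NOT reachable from this host"
```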
### Internal token mismatch
Symptoms often show up as backend-side call failures even when both services are healthy.
Check:
- `INTERNAL_API_TOKEN` on both sides
- whether recent secret rotation updated both services consistently
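To confirm both services hold the same token without logging the secret itself, compare short fingerprints. A sketch using `sha256sum`:

```bash
# Print a short fingerprint of the shared token; never log the raw secret.
token_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-12
}

# Run on the backend host:
echo "backend token fingerprint: $(token_fingerprint "${INTERNAL_API_TOKEN:-}")"
# Then run the same command on the worker host; the two fingerprints must match.
```

Mismatched fingerprints after a rotation mean one side is still holding the old secret.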
### Readiness fails because dependencies are missing
Check:
- `OPENAI_API_KEY`, `REDIS_URL`, `PGVECTOR_CONNECTION_STRING`, and related settings when `RAG_ENABLED=true`
- R2 settings if indexing or download URL flows are failing
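A quick way to spot an unset prerequisite is to loop over the variable names. A sketch; extend the list with your R2 settings as needed:

```bash
# Report which prerequisite variables are set (values are not validated here).
check_vars() {
  local missing=0 v
  for v in "$@"; do
    if [ -n "$(printenv "$v")" ]; then
      echo "set: $v"
    else
      echo "MISSING: $v"
      missing=1
    fi
  done
  return "$missing"
}

check_vars OPENAI_API_KEY REDIS_URL PGVECTOR_CONNECTION_STRING \
  || echo "readiness will likely fail until the missing variables are provided"
```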
### Long-running run failures
Check:
- backend-side Python client timeouts
- worker logs around task execution
- whether selected files are actually `READY`
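Worker-side evidence for timeouts usually lives in the task-execution logs. A sketch assuming the worker runs under systemd with a unit name of `python-worker` (an assumption; adjust for your deployment, e.g. `kubectl logs` on Kubernetes):

```bash
# Surface timeout/failure evidence from the worker's task-execution logs.
# "python-worker" is an assumed systemd unit name; adjust for your deployment.
journalctl -u python-worker --since "1 hour ago" --no-pager 2>/dev/null \
  | grep -Ei 'timeout|timed out|task.*(fail|error)' \
  || echo "no timeout/failure lines found (or journalctl is unavailable)"
```

If the worker shows tasks completing while the backend reports failures, suspect the backend-side client timeout rather than the worker.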
## Indexing-specific checks
If uploads succeed but files never become READY:
- Confirm the upload was acknowledged through the confirm endpoint.
- Check file `indexStatus` values in the application.
- Check worker readiness and any RAG prerequisites.
- Inspect whether the issue is broad or isolated to a specific file/content type.
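To judge whether the problem is broad or isolated, a status breakdown across all files helps. A hypothetical sketch: the `files` table and `index_status` column are assumptions, so adjust the query to your actual application schema:

```bash
# Hypothetical schema: a "files" table with an "index_status" column.
# Adjust the query to your actual application schema before using it.
summarize_index_status() {
  echo "index_status|count"
  psql "${DATABASE_URL:-}" -Atc \
    "SELECT index_status, count(*) FROM files GROUP BY 1 ORDER BY 2 DESC;" \
    2>/dev/null \
    || echo "query failed (check psql availability and DATABASE_URL)"
}

summarize_index_status
```

A large count of stuck files suggests a systemic worker or RAG-prerequisite issue; a single stuck file points at that file's content type.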
## Escalation
If customer-visible run execution is degraded for more than 5 minutes:
- Page the backend or AI workflow owner
- Capture sample failing session IDs and file IDs
- Record whether the failure is health, auth, timeout, or indexing related
## Post-action
- Update the relevant API reference page if a new operational caveat was discovered.
- Add a postmortem if the incident was customer-visible.