# Runbook: Python Worker Degraded
**Trigger:** AI runs fail, indexing stalls, or the Python worker health endpoints degrade.

**Impact:** PRECHECK, T661, LOGBOOK, and file indexing workflows may fail or stall.
## Health checks
Check the worker health endpoints first:
```bash
curl -fsS http://localhost:7002/api/v1/health/
curl -fsS http://localhost:7002/api/v1/health/readiness
curl -fsS http://localhost:7002/api/v1/health/liveness
```

If the environment runs the worker on port 8000, replace the port accordingly.
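The three probes above can be wrapped in a small loop that reports each endpoint's status at a glance. A minimal sketch; the base URL defaults to the port 7002 used in the examples above:

```bash
# Probe all three health endpoints and print a one-line status for each.
probe() {
  local base="${1:-http://localhost:7002}"
  local ep
  for ep in "" readiness liveness; do
    if curl -fsS --max-time 5 "$base/api/v1/health/$ep" >/dev/null 2>&1; then
      echo "OK   /api/v1/health/$ep"
    else
      echo "FAIL /api/v1/health/$ep"
    fi
  done
}

probe            # or: probe http://localhost:8000
```

A `FAIL` on `/health/` alone usually means the process is down or unreachable; `FAIL` only on `readiness` points at missing dependencies (see below).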
## Common failure classes
### Java cannot reach Python
Check:
- `PYTHON_SERVICE_BASE_URL` in the backend
- network reachability from the backend to the worker
- whether the worker process is actually listening on the expected port
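These checks can be run from a shell on each host. A sketch, assuming the default port 7002 from the health-check section; `ss` may need to be swapped for `netstat` on older distributions:

```bash
# Step through the backend→worker path. PYTHON_SERVICE_BASE_URL is the backend
# setting named in this runbook; port 7002 matches the health-check examples.
worker_url="${PYTHON_SERVICE_BASE_URL:-http://localhost:7002}"
echo "PYTHON_SERVICE_BASE_URL=${PYTHON_SERVICE_BASE_URL:-<unset>} (using $worker_url)"

# 1. On the worker host: is anything listening on the expected port?
ss -ltn 2>/dev/null | grep -q ':7002' \
  && echo "listener found on 7002" \
  || echo "no listener on 7002 (run this on the worker host)"

# 2. On the backend host: can we reach the worker over the network?
curl -fsS --max-time 5 "$worker_url/api/v1/health/" >/dev/null 2>&1 \
  && echo "worker reachable from this host" \
  || echo "worker NOT reachable from this host"
```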
### Internal token mismatch
Symptoms often show up as backend-side call failures even when both services are healthy.
Check:
- `INTERNAL_API_TOKEN` on both sides
- whether recent secret rotation updated both services consistently
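To confirm both services hold the same token without logging the secret itself, compare short fingerprints. A sketch using `sha256sum`:

```bash
# Print a short fingerprint of the shared token; never log the raw secret.
token_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-12
}

# Run on the backend host:
echo "backend token fingerprint: $(token_fingerprint "${INTERNAL_API_TOKEN:-}")"
# Then run the same command on the worker host; the two fingerprints must match.
```

Mismatched fingerprints after a rotation mean one side is still holding the old secret.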
### Readiness fails because dependencies are missing
Check:
- `OPENAI_API_KEY`, `REDIS_URL`, `PGVECTOR_CONNECTION_STRING`, and related settings when `RAG_ENABLED=true`
- R2 settings if indexing or download URL flows are failing
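A quick way to spot an unset prerequisite is to loop over the variable names. A sketch; extend the list with your R2 settings as needed:

```bash
# Report which prerequisite variables are set (values are not validated here).
check_vars() {
  local missing=0 v
  for v in "$@"; do
    if [ -n "$(printenv "$v")" ]; then
      echo "set: $v"
    else
      echo "MISSING: $v"
      missing=1
    fi
  done
  return "$missing"
}

check_vars OPENAI_API_KEY REDIS_URL PGVECTOR_CONNECTION_STRING \
  || echo "readiness will likely fail until the missing variables are provided"
```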
### Long-running run failures
Check:
- backend-side Python client timeouts
- worker logs around task execution
- whether selected files are actually `READY`
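Worker-side evidence for timeouts usually lives in the task-execution logs. A sketch assuming the worker runs under systemd with a unit name of `python-worker` (an assumption; adjust for your deployment, e.g. `kubectl logs` on Kubernetes):

```bash
# Surface timeout/failure evidence from the worker's task-execution logs.
# "python-worker" is an assumed systemd unit name; adjust for your deployment.
journalctl -u python-worker --since "1 hour ago" --no-pager 2>/dev/null \
  | grep -Ei 'timeout|timed out|task.*(fail|error)' \
  || echo "no timeout/failure lines found (or journalctl is unavailable)"
```

If the worker shows tasks completing while the backend reports failures, suspect the backend-side client timeout rather than the worker.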
## Indexing-specific checks
If uploads succeed but files never become READY:
- Confirm the upload was acknowledged through the confirm endpoint.
- Check file `indexStatus` values in the application.
- Check worker readiness and any RAG prerequisites.
- Inspect whether the issue is broad or isolated to a specific file/content type.
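To judge whether the problem is broad or isolated, a status breakdown across all files helps. A hypothetical sketch: the `files` table and `index_status` column are assumptions, so adjust the query to your actual application schema:

```bash
# Hypothetical schema: a "files" table with an "index_status" column.
# Adjust the query to your actual application schema before using it.
summarize_index_status() {
  echo "index_status|count"
  psql "${DATABASE_URL:-}" -Atc \
    "SELECT index_status, count(*) FROM files GROUP BY 1 ORDER BY 2 DESC;" \
    2>/dev/null \
    || echo "query failed (check psql availability and DATABASE_URL)"
}

summarize_index_status
```

A large count of stuck files suggests a systemic worker or RAG-prerequisite issue; a single stuck file points at that file's content type.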
## Escalation
If customer-visible run execution is degraded for more than 5 minutes:
- Page the backend or AI workflow owner
- Capture sample failing session IDs and file IDs
- Record whether the failure is health, auth, timeout, or indexing related
## Post-action
- Update the relevant API reference page if a new operational caveat was discovered.
- Add a postmortem if the incident was customer-visible.