Service level objectives (SLOs)

This page defines customer-facing reliability targets for the SREDSimplify production environment (prd). It is a living document: when architecture or traffic changes, update SLOs and linked runbooks together.

Scope

| Surface | Users | Notes |
| --- | --- | --- |
| Web application (Next.js) | End customers and internal operators | Includes marketing pages and authenticated workspace |
| API (Spring Boot) | Web client and integrations | JWT on custom auth header per API contract |
| Python document service | Invoked from backend workflows | Long-running AI and document jobs |

Availability SLOs (draft)

These targets are provisional until historical metrics back them; treat the percentages as design goals when setting alerting thresholds.

| Service | Monthly availability target | Measurement window |
| --- | --- | --- |
| Web + API (synthetic or edge checks) | 99.5% | Rolling 30 days |
| Background document jobs | 99.0% | Job success rate over completed jobs |
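
As a minimal sketch of how these targets could be evaluated, the snippet below computes an availability ratio from success and total counts and compares it with a target. The function names and the sample counts are hypothetical and are not taken from any existing dashboard or job table; real numbers would come from your synthetic checks and job records.

```python
def availability(successes: int, total: int) -> float:
    """Fraction of successful checks (or completed jobs).

    Returns 1.0 when nothing was measured, so an empty window does not
    register as an outage. That choice is an assumption of this sketch.
    """
    if total == 0:
        return 1.0
    return successes / total


def meets_target(successes: int, total: int, target: float) -> bool:
    """True when the measured ratio is at or above the target."""
    return availability(successes, total) >= target


# Hypothetical counts for one rolling 30-day window.
# Web + API: one synthetic check per minute is 43,200 checks in 30 days.
web_api_ok = meets_target(successes=43_050, total=43_200, target=0.995)

# Background document jobs: success rate over completed jobs.
jobs_ok = meets_target(successes=1_975, total=2_000, target=0.990)

print(f"Web + API within target: {web_api_ok}")
print(f"Document jobs within target: {jobs_ok}")
```

With these made-up counts the Web + API ratio is about 99.65% and the job success rate is 98.75%, so only the first meets its target.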

Error budget (conceptual)

For a 99.5% availability target over a rolling 30-day window, the error budget is roughly 3.6 hours of combined outage (0.5% of 720 hours). When burn is high:

  1. Triage with on-call or engineering lead.
  2. Open or update a tracking issue with customer impact.
  3. Link a postmortem if a user-visible failure occurred (example).
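
The 3.6-hour figure can be reproduced with a short calculation. The sketch below also derives a simple burn rate (budget consumed relative to an even pace across the window). This page does not define what counts as "high" burn, so no threshold is encoded here; the function names and the sample downtime are hypothetical.

```python
def error_budget_hours(target: float, window_days: int = 30) -> float:
    """Allowed downtime in hours for an availability target over the window."""
    return (1.0 - target) * window_days * 24


def burn_rate(downtime_hours: float, elapsed_days: float,
              target: float, window_days: int = 30) -> float:
    """Budget consumed so far, relative to an even spend over the window.

    A value of 1.0 means the budget would be exactly used up at the end of
    the window; values above 1.0 mean it is being spent faster than that.
    """
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    consumed_fraction = downtime_hours / error_budget_hours(target, window_days)
    elapsed_fraction = elapsed_days / window_days
    return consumed_fraction / elapsed_fraction


budget = error_budget_hours(0.995)  # 0.5% of 720 hours
print(f"Monthly budget at 99.5%: {budget:.1f} h")

# Hypothetical: 1.8 h of outage in the first 7.5 days of the window.
rate = burn_rate(downtime_hours=1.8, elapsed_days=7.5, target=0.995)
print(f"Burn rate: {rate:.1f}x")
```

In the hypothetical case, half the budget is gone after a quarter of the window, which is a burn rate of about 2x.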

Latency (draft)

| Path class | Target (p95) | Notes |
| --- | --- | --- |
| Authenticated workspace shell | Under 2s TTFB at edge | Excludes long AI runs |
| Core REST mutations | Under 5s server-side | AI-heavy endpoints may use async patterns |
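
For readers unfamiliar with p95, the sketch below shows one common way to compute it, the nearest-rank method, and checks a sample set against the 2-second TTFB target. Observability tools often use interpolated or histogram-based percentiles instead, so results may differ slightly; the sample values and function names are hypothetical.

```python
def p95(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method.

    Sorts the samples and returns the value at 1-based rank ceil(0.95 * n).
    Integer arithmetic is used for the ceiling to avoid float rounding.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def within_target(samples: list[float], limit_seconds: float) -> bool:
    """True when the p95 of the samples is strictly under the limit."""
    return p95(samples) < limit_seconds


# Hypothetical TTFB samples in seconds for the authenticated workspace shell.
ttfb_samples = [0.4] * 18 + [1.9, 3.2]

print(f"p95 TTFB: {p95(ttfb_samples)} s")
print(f"Under 2 s target: {within_target(ttfb_samples, 2.0)}")
```

Note that one slow sample (3.2 s) out of twenty does not breach the target, because p95 tolerates the slowest 5% of requests.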

Document concrete probes and dashboards in your observability tool of choice; keep deep links out of this repo if they rotate frequently.

Dependencies that affect SLOs

  • PostgreSQL: primary data store; see the database runbooks from the Architecture hub.
  • Redis: quotas, auth tokens, rate limits; see Redis high memory.
  • External LLM and document providers: third-party outages may consume error budget without a code defect.