Observability — SLO & Distributed Tracing

AKKO ships an opinionated SLO stack out of the box: Prometheus metrics, Grafana dashboards, multi-window multi-burn-rate alerts, and (Sprint 27) distributed tracing via Grafana Tempo. This page explains the concept, the dashboards, and how to adjust the budget.

Concept — Error budget & burn rate

An SLO (Service Level Objective) is the reliability target users rely on, e.g. "ADEN answers 99% of queries successfully over 30 days". The error budget is the complement of that target: 1% of queries may fail without breaching the promise.

A burn rate of 1x means you are spending the budget at exactly the pace that uses it up over the full 30-day window. 14.4x means you are burning 14.4 times faster; at that rate, the 30-day budget is exhausted in about two days (30 / 14.4 ≈ 2.1 days). AKKO follows the Google SRE Workbook multi-window multi-burn-rate pattern:

Window combo | Burn rate | Severity | Action
1h AND 5m | 14.4x | warning | page on-call
6h AND 1h | 6x | info | create a ticket
1d (records) | n/a | - | dashboard only
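
The arithmetic behind the table can be sketched in a few lines of Python. This is an illustration only, assuming the 99% / 30-day ADEN objective quoted above; the function names are hypothetical and are not part of AKKO.

```python
# Illustrative sketch of burn-rate arithmetic; names are hypothetical, not AKKO code.
ERROR_BUDGET = 0.01      # 99% SLO -> 1% of queries may fail
SLO_WINDOW_DAYS = 30.0   # SLO window from the text


def burn_rate(error_rate: float, budget: float = ERROR_BUDGET) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / budget


def days_to_exhaustion(rate: float, window_days: float = SLO_WINDOW_DAYS) -> float:
    """Days until the whole budget is spent if this burn rate is sustained."""
    return window_days / rate


def should_alert(long_window_error_rate: float,
                 short_window_error_rate: float,
                 threshold: float,
                 budget: float = ERROR_BUDGET) -> bool:
    """Multi-window rule: BOTH windows must exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_rate, budget) > threshold
            and burn_rate(short_window_error_rate, budget) > threshold)


if __name__ == "__main__":
    print(round(days_to_exhaustion(14.4), 2))  # 2.08
    print(days_to_exhaustion(6.0))             # 5.0
    # 20% errors over 1h and 18% over 5m: fast burn fires.
    print(should_alert(0.20, 0.18, threshold=14.4))   # True
    # 20% errors over 1h but only 0.1% over 5m: the incident is already over.
    print(should_alert(0.20, 0.001, threshold=14.4))  # False
```

Requiring both windows is what keeps the fast-burn rule from firing on a spike that has already ended.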

The three SLO dashboards

All three are auto-provisioned into Grafana under the AKKO SLO folder by the grafana_dashboard: "akko" sidecar label.

  1. AKKO — Platform SLO Overview (akko-platform-slo) — top-level availability per tier (ingress, auth, data, AI) + error budget remaining (30d) per tier.
  2. AKKO — ADEN SLO (akko-aden-slo) — p50/p95/p99 query latency, success rate, multi-window burn rate, queries per second.
  3. AKKO — Trino AI Plugin (akko-trino-ai-plugin) — per-function latency (akko_ai_search, ai_gen, akko_ai_embed), cache hit rate, circuit breaker state, requests + errors.

Find them at https://grafana.<your-domain>/dashboards once you have signed in with SSO (Keycloak → Grafana OIDC).

Distributed tracing — Grafana Tempo

ADEN is instrumented with OpenTelemetry (FastAPIInstrumentor). When AKKO_TRACING_ENABLED=true is set on the ADEN pod, spans are exported via OTLP/gRPC to the in-cluster Tempo service at akko-tempo:4317. Tempo's query API is wired into Grafana as a datasource, so clicking a span in the ADEN dashboard opens the corresponding trace.
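
For orientation, the wiring described above might look roughly like the following. This is a minimal sketch and not ADEN's actual source: the function names, the way the flag is parsed, the service.name value, and the plaintext (insecure) gRPC channel are all assumptions. The OpenTelemetry imports are deferred so the module still loads when tracing is off or the packages are missing.

```python
import os
from typing import Any, Mapping


def tracing_enabled(env: Mapping[str, str]) -> bool:
    """True only when AKKO_TRACING_ENABLED is 'true' (parsing rules are assumed)."""
    return env.get("AKKO_TRACING_ENABLED", "false").strip().lower() == "true"


def setup_tracing(app: Any,
                  env: Mapping[str, str] = os.environ,
                  endpoint: str = "akko-tempo:4317",  # in-cluster Tempo, OTLP/gRPC
                  service_name: str = "aden") -> bool:  # assumed service.name
    """Instrument a FastAPI app and export spans to Tempo; returns True if enabled."""
    if not tracing_enabled(env):
        return False

    # Deferred imports: only needed when tracing is actually switched on.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    # insecure=True assumes plaintext gRPC inside the cluster.
    exporter = OTLPSpanExporter(endpoint=endpoint, insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Creates one server span per incoming HTTP request.
    FastAPIInstrumentor.instrument_app(app)
    return True
```

A caller would invoke setup_tracing(app) once, right after constructing the FastAPI application and before serving requests.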

Enable tracing platform-wide by flipping the values:

akko-aden:
  tracing:
    enabled: true
akko-tempo:
  enabled: true     # default

The Tempo backend defaults to the local filesystem (10 GiB PVC), which is fine for dev and small prod. For high-volume deployments, swap the backend to S3 in helm/akko/charts/akko-tempo/files/tempo.yaml (see the comment in that file).

Adjusting the budget

The SLO targets live in helm/akko/templates/prometheusrule-slo.yaml. To tighten ADEN's target from 99% → 99.5%, change the budget factor in both the aden_query_error_rate_* threshold expressions:

# Was: > (14.4 * 0.01)   # 99% SLO → 1% budget
# New: > (14.4 * 0.005)  # 99.5% SLO → 0.5% budget

Then helm upgrade — Prometheus picks up the new rule within 30 s.
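
To sanity-check a new threshold before editing the rule file, the factor can be recomputed as burn rate × (1 − SLO target). The helper below is a hypothetical sketch, not something shipped with AKKO.

```python
def burn_threshold(slo_target: float, burn_rate: float) -> float:
    """Error-rate threshold for an alert: burn rate times the error budget."""
    error_budget = 1.0 - slo_target
    return burn_rate * error_budget


if __name__ == "__main__":
    # Compare the current 99% target with the tighter 99.5% target
    # for both the fast-burn (14.4x) and slow-burn (6x) rules.
    for slo in (0.99, 0.995):
        for rate in (14.4, 6.0):
            threshold = burn_threshold(slo, rate)
            print(f"SLO {slo:.1%}, {rate}x burn: alert when error rate > {threshold:.4f}")
```

Halving the budget halves every threshold, so the fast-burn rule moves from an error rate of roughly 0.144 to roughly 0.072.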

Alerts shipped

Three ADEN- and Tempo-related alert rules are defined:

  • AdenSLOFastBurn — fast-burn (14.4x, 1h+5m windows), warning.
  • AdenSLOSlowBurn — slow-burn (6x, 6h+1h windows), info.
  • AdenTraceIngestionStalled — Tempo received zero spans for 10 min, warning (tracing pipeline broken).

Recording rules pre-compute aden_query_success_rate_5m and aden_query_error_rate_{5m,1h,6h,1d} so dashboards and alerts share the same series and stay cheap to evaluate.
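
Because the recorded series are ordinary Prometheus metrics, they can also be read directly through the Prometheus HTTP API (/api/v1/query). The sketch below only builds the request URL; the base URL is a placeholder, since the in-cluster Prometheus address is not stated on this page.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Placeholder host; substitute your Prometheus service address.
    print(instant_query_url("http://prometheus.example:9090",
                            "aden_query_error_rate_1h"))
```

Fetching that URL returns a JSON document whose data.result list holds the current value of the recorded series.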

Where it runs

  • Metrics — Prometheus (kube-prometheus-stack), /metrics exposed by ADEN via prometheus-fastapi-instrumentator.
  • Dashboards — Grafana sidecar, auto-discovered via grafana_dashboard: "akko" label on three ConfigMaps.
  • Alerts — Alertmanager, routed through the standard AKKO pipeline.
  • Traces — Grafana Tempo (akko-tempo sub-chart), OTLP/gRPC on 4317, HTTP on 4318, query API on 3200.