Observability — SLO & Distributed Tracing
AKKO ships an opinionated SLO stack out of the box: Prometheus metrics, Grafana dashboards, multi-window multi-burn-rate alerts, and (as of Sprint 27) distributed tracing via Grafana Tempo. This page explains the concepts, describes the dashboards, and shows how to adjust the error budget.
Concept — Error budget & burn rate
An SLO (Service Level Objective) is the reliability target users rely on, e.g. "ADEN answers 99% of queries successfully over 30 days". The error budget is the complement of that target: 1% of queries may fail over the same 30 days without breaching the promise.
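The arithmetic behind an error budget can be sketched in a few lines. This is a minimal illustration, not AKKO code: the function names and the request counts in the usage note are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail without breaching the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    return 1.0 - failed_requests / error_budget(slo_target, total_requests)
```

For example, with a 99% target and 1,000,000 queries in the window, `error_budget(0.99, 1_000_000)` is roughly 10,000 failed queries, and 2,500 failures so far leave about 75% of the budget.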
A burn rate of 1x means you are spending the budget at exactly the pace that
uses it up at the end of the 30-day window. 14.4x means you are burning it
14.4 times faster; at that rate, you exhaust the monthly budget in about two
days (30 d / 14.4 ≈ 50 h). AKKO follows the Google SRE Workbook multi-window
multi-burn-rate pattern:
| Window combo | Burn rate | Severity | Action |
|---|---|---|---|
| 1h AND 5m | 14.4x | warning | page on-call |
| 6h AND 1h | 6x | info | create a ticket |
| 1d (recording rule only) | n/a | n/a | dashboard only |
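The numbers in the table follow from simple arithmetic, sketched below. This is an illustration under stated assumptions, not the rules AKKO ships: the function names are hypothetical, and the alert condition assumes the usual Workbook form, where both the long and the short window must exceed burn rate × error budget.

```python
def hours_to_exhaustion(burn_rate: float, slo_window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return slo_window_days * 24 / burn_rate


def budget_consumed(burn_rate: float, alert_window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the budget spent during one alert window at this burn rate."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def should_fire(long_error_rate: float, short_error_rate: float,
                burn_rate: float, slo_target: float = 0.99) -> bool:
    """Multi-window check: both windows must exceed burn_rate * error budget."""
    threshold = burn_rate * (1.0 - slo_target)
    return long_error_rate > threshold and short_error_rate > threshold
```

At 14.4x, the 1h window corresponds to about 2% of the monthly budget; at 6x, the 6h window corresponds to about 5%. Requiring the short window as well means the alert stops firing soon after the error rate recovers, even though the long window still looks bad.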
The three SLO dashboards
All three are auto-provisioned into Grafana, under the AKKO SLO folder, by
the dashboard sidecar, which discovers them through the
grafana_dashboard: "akko" label.
- AKKO — Platform SLO Overview (akko-platform-slo): top-level availability per tier (ingress, auth, data, AI), plus error budget remaining (30d) per tier.
- AKKO — ADEN SLO (akko-aden-slo): p50/p95/p99 query latency, success rate, multi-window burn rate, queries per second.
- AKKO — Trino AI Plugin (akko-trino-ai-plugin): per-function latency (akko_ai_search, ai_gen, akko_ai_embed), cache hit rate, circuit breaker state, requests and errors.
Find them at https://grafana.<your-domain>/dashboards once you have signed
in with SSO (Keycloak → Grafana OIDC).
Distributed tracing — Grafana Tempo
ADEN is instrumented with OpenTelemetry (FastAPIInstrumentor). When
AKKO_TRACING_ENABLED=true is set on the ADEN pod, spans are exported via
OTLP/gRPC to the in-cluster Tempo service at akko-tempo:4317. Tempo's query
API is wired into Grafana as a datasource, so clicking a span in the ADEN
dashboard opens the corresponding trace.
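The wiring described above could look roughly like the following. This is a minimal sketch, not ADEN's actual code: the helper names, the "aden" service name, and the plaintext (insecure) gRPC channel are assumptions; only AKKO_TRACING_ENABLED, FastAPIInstrumentor, and the akko-tempo:4317 endpoint come from the text.

```python
import os

TEMPO_OTLP_ENDPOINT = "akko-tempo:4317"  # in-cluster Tempo OTLP/gRPC endpoint


def tracing_settings(env=None):
    """Read the tracing switch from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    enabled = env.get("AKKO_TRACING_ENABLED", "false").strip().lower() == "true"
    return {"enabled": enabled, "endpoint": TEMPO_OTLP_ENDPOINT}


def configure_tracing(app, env=None):
    """Instrument a FastAPI app when tracing is enabled; return whether it was."""
    settings = tracing_settings(env)
    if not settings["enabled"]:
        return False

    # Imported lazily so the module still loads where OpenTelemetry is not installed.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # "aden" as the service name is an assumption made for this sketch.
    provider = TracerProvider(resource=Resource.create({"service.name": "aden"}))
    # insecure=True assumes plaintext gRPC inside the cluster.
    exporter = OTLPSpanExporter(endpoint=settings["endpoint"], insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app)  # one server span per HTTP request
    return True
```

Calling `configure_tracing(app)` once at startup, right after the FastAPI app is created, keeps tracing a pure configuration switch: with the variable unset, the function returns early and no exporter is created.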
Enable tracing platform-wide by flipping the values:
The Tempo backend defaults to the local filesystem (10 GiB PVC), which is
fine for dev and small production deployments. For high-volume deployments,
swap the backend to S3 in helm/akko/charts/akko-tempo/files/tempo.yaml (see
the comment in that file).
Adjusting the budget
The SLO targets live in helm/akko/templates/prometheusrule-slo.yaml. To
tighten ADEN's target from 99% to 99.5%, which halves the error budget from
1% to 0.5%, change the budget factor in both of the
aden_query_error_rate_* threshold expressions.
Then run helm upgrade; Prometheus picks up the new rule within 30 s.
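To see what the change does to the alert thresholds, the sketch below assumes each threshold is the burn rate multiplied by the error budget (1 − target), as in the Workbook pattern. The actual expressions in prometheusrule-slo.yaml are not reproduced here, and the helper name is hypothetical.

```python
# Burn rates per alert, taken from the table in the concept section.
BURN_RATES = {"AdenSLOFastBurn": 14.4, "AdenSLOSlowBurn": 6.0}


def burn_thresholds(slo_target: float) -> dict:
    """Error-rate threshold per alert: burn rate * (1 - SLO target)."""
    budget = 1.0 - slo_target
    return {alert: rate * budget for alert, rate in BURN_RATES.items()}


if __name__ == "__main__":
    for target in (0.99, 0.995):
        rounded = {name: round(value, 4) for name, value in burn_thresholds(target).items()}
        print(target, rounded)
```

Under that assumption, moving from 99% to 99.5% halves both thresholds: the fast-burn alert fires at an error rate of about 7.2% instead of 14.4%, and the slow-burn alert at about 3% instead of 6%.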
Alerts shipped
Three ADEN- and Tempo-related alert rules are defined:
- AdenSLOFastBurn: fast-burn (14.4x, 1h+5m windows), warning.
- AdenSLOSlowBurn: slow-burn (6x, 6h+1h windows), info.
- AdenTraceIngestionStalled: Tempo received zero spans for 10 min, warning (tracing pipeline broken).
Recording rules pre-compute aden_query_success_rate_5m and
aden_query_error_rate_{5m,1h,6h,1d} so dashboards and alerts share the
same series and stay cheap to evaluate.
Where it runs
- Metrics: Prometheus (kube-prometheus-stack), /metrics exposed by ADEN via prometheus-fastapi-instrumentator.
- Dashboards: Grafana sidecar, auto-discovered via the grafana_dashboard: "akko" label on three ConfigMaps.
- Alerts: Alertmanager, routed through the standard AKKO pipeline.
- Traces: Grafana Tempo (akko-tempo sub-chart), OTLP/gRPC on 4317, HTTP on 4318, query API on 3200.