ADR-028: Observability Backend - VictoriaLogs + Perses + Fluent Bit
Status
Accepted — 2026-04-25
Context
AKKO's observability layer is load-bearing for the cockpit ("Health" page, ADEN trust bar, audit dashboard, SLO monitoring). The Sprint 4 stack was Loki (logs) + Grafana (dashboards) + Promtail (shipper).
Two issues forced a redesign:
- AGPL v3 relicensing wave. Grafana Labs flipped Loki, Grafana OSS and Promtail to AGPL v3 during 2024. Same blocker as MinIO (see ADR-027): incompatible with AKKO's commercial SaaS distribution.
- Single-vendor concentration. Putting logs, dashboards and the shipper in one company's hands re-introduces the exact risk we eliminated with storage. ADR-029 (governance > license) requires us to split the stack across distinct vendors / foundations.
We need a stack that is (a) licensed on commercially friendly terms, (b) Kubernetes-native, (c) ingest-compatible with the existing Fluent forwarding patterns, and (d) replaceable by the customer (Datadog / Splunk / Grafana Cloud / a private Grafana) without recompiling AKKO.
Considered Options
Logs backend
| Option | License | Governance | Vitality | Maturity | Cloud-native | Score |
|---|---|---|---|---|---|---|
| VictoriaLogs | Apache 2.0 | Single-vendor (VictoriaMetrics) | High | GA (1.0) | Yes | 7.5 |
| OpenSearch | Apache 2.0 | Linux Foundation | Very high | Production | Yes (heavy) | 7.0 |
| Loki (status quo) | AGPL v3 | Single-vendor (Grafana Labs) | High | Production | Yes | 2.0 (eliminated) |
| Quickwit | AGPL v3 | Single-vendor | Medium | Production | Yes | 2.5 (eliminated) |
Dashboards
| Option | License | Governance | Vitality | Maturity | Cloud-native | Score |
|---|---|---|---|---|---|---|
| Perses | Apache 2.0 | CNCF Sandbox | High | Beta -> RC | Yes (CRDs) | 7.5 |
| Grafana 10.x | AGPL v3 | Single-vendor | Very high | Production | Yes | 2.0 (eliminated) |
| Grafana 9.5 LTS | Apache 2.0 | Single-vendor (frozen) | Low | Production | Yes | 4.0 (frozen) |
Log shipper
| Option | License | Governance | Vitality | Maturity | Cloud-native | Score |
|---|---|---|---|---|---|---|
| Fluent Bit | Apache 2.0 | CNCF Graduated | Very high | Production | Yes | 9.0 |
| Promtail | AGPL v3 | Single-vendor | High | Production | Yes | 2.0 (eliminated) |
| Vector | MPL 2.0 | Datadog | Very high | Production | Yes | 7.0 |
Decision
- Logs: VictoriaLogs as packaged default, single-node by default, optional 3-node cluster for production.
- Dashboards: Perses as packaged default. Three core dashboards are seeded automatically: `cluster-overview`, `aden-slo`, `storage-layer`.
- Shipper: Fluent Bit as a DaemonSet, replacing Promtail.
- External mode: `observability.mode: external` switches the shipper output to Datadog, Splunk HEC, Grafana Cloud Loki, or a customer-managed Grafana / Loki stack.
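To make the mode switch concrete, here is a minimal sketch of how a mode and target could be resolved to a Fluent Bit output plugin. The plugin names (`http`, `datadog`, `splunk`, `loki`) are existing Fluent Bit outputs. The target keys and the function name are hypothetical and not taken from the AKKO chart, and the sketch assumes packaged mode ships to VictoriaLogs over plain HTTP.

```python
from typing import Optional

# Hypothetical target names mapped to real Fluent Bit output plugin names.
EXTERNAL_OUTPUT_PLUGINS = {
    "datadog": "datadog",
    "splunk-hec": "splunk",
    "grafana-cloud-loki": "loki",
    "customer-loki": "loki",
}


def select_output_plugin(mode: str, target: Optional[str] = None) -> str:
    """Return the Fluent Bit output plugin name for a given observability mode."""
    if mode == "packaged":
        # Assumption: the packaged VictoriaLogs backend is fed via the generic http output.
        return "http"
    if mode == "external":
        if target not in EXTERNAL_OUTPUT_PLUGINS:
            raise ValueError(f"unknown external target: {target!r}")
        return EXTERNAL_OUTPUT_PLUGINS[target]
    raise ValueError(f"unsupported observability.mode: {mode!r}")
```

Rejecting unknown modes early keeps a typo in the Helm value from silently disabling log shipping.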
Rationale
- License: all three components are Apache 2.0 -> SaaS unblocked.
- Governance diversification: Fluent Bit (CNCF Graduated) and Perses (CNCF Sandbox) are foundation-governed; only VictoriaLogs is single-vendor. Spreading the dependency across three independent maintainers removes the "one company flips license -> whole stack dies" failure mode that Loki + Promtail + Grafana had.
- Resource footprint: VictoriaLogs single-node uses ~150 MB RAM idle vs Loki's ~400 MB at the same ingest rate.
- Migration cost: Fluent Bit can be configured to dual-write during cutover, so customers running on Loki can transition without log loss.
- Perses limitations accepted: Perses is still pre-1.0, and its alerting features are thinner than Grafana's. We accept this because (a) AKKO alerting goes through Alertmanager + Karma, decoupled from the dashboard tool, and (b) customers who insist on Grafana can flip to `external` mode.
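The dual-write cutover mentioned above relies on Fluent Bit delivering a record to every output whose `Match` pattern fits its tag. The sketch below renders two classic-format `[OUTPUT]` sections, one for VictoriaLogs via the `http` plugin and one for Loki via the `loki` plugin. The hostnames, default ports (9428 and 3100) and the `/insert/jsonline` query parameters are assumptions based on upstream defaults, not values from the AKKO chart.

```python
def render_dual_write_outputs(vlogs_host: str, loki_host: str, match: str = "*") -> str:
    """Render two Fluent Bit [OUTPUT] sections so each record reaches both backends."""
    victorialogs = [
        "[OUTPUT]",
        "    Name             http",
        f"    Match            {match}",
        f"    Host             {vlogs_host}",
        "    Port             9428",  # assumed VictoriaLogs default port
        "    URI              /insert/jsonline?_stream_fields=stream&_msg_field=log&_time_field=date",
        "    Format           json_lines",
        "    Json_date_format iso8601",
    ]
    loki = [
        "[OUTPUT]",
        "    Name   loki",
        f"    Match  {match}",
        f"    Host   {loki_host}",
        "    Port   3100",  # assumed Loki default port
    ]
    # A blank line separates the two sections for readability.
    return "\n".join(victorialogs + [""] + loki) + "\n"


if __name__ == "__main__":
    # Hypothetical in-cluster service names.
    print(render_dual_write_outputs("victorialogs.akko.svc", "loki.akko.svc"))
```

Once the VictoriaLogs side is verified, dropping the second section ends the dual-write period.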
Consequences
Positive
- License-clean observability stack — SaaS commercialisation unblocked.
- Single Helm flag `observability.mode={packaged|external}` covers every deployment profile.
- Perses dashboards as Kubernetes CRDs -> GitOps-friendly, no manual import in a UI.
- Fluent Bit's plugin ecosystem covers OpenTelemetry, Kafka, S3 archival natively (future Sprint 45).
Negative
- Perses is younger than Grafana, so we maintain a small set of fallback dashboards in a `grafana-9.5-lts` chart for customers who need Grafana-specific panels.
- Operators familiar with LogQL must learn LogsQL (VictoriaLogs query language). We mitigate with a short Quarto guide in `branding/docs/`.
- Single-vendor risk for VictoriaLogs remains; tracked by ADR-029's 6-month re-score cadence.
Neutral
- Customers running their own Grafana keep using it via `observability.mode: external`.
Implementation
- Sub-chart: `/Users/ab2dridi/newera/akko/helm/akko/charts/akko-observability/`
  - `templates/victorialogs-statefulset.yaml`
  - `templates/perses-deployment.yaml`
  - `templates/fluent-bit-daemonset.yaml`
  - `templates/dashboards-bootstrap-job.yaml`: post-install Job that POSTs three dashboard JSONs to `/api/v1/projects/akko/dashboards`
  - `files/dashboards/cluster-overview.json`
  - `files/dashboards/aden-slo.json`
  - `files/dashboards/storage-layer.json`
- Values: `helm/akko/values.yaml` -> `observability.mode`, `observability.packaged.victorialogs.retention`, `observability.external.endpoint`
- Cockpit "Health" page: `branding/cockpit/health.html` updated to call VictoriaLogs `/select/logsql/query` instead of Loki `/loki/api/v1/query_range`.
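For readers moving from the Loki endpoint, the following sketch shows roughly what a call to `/select/logsql/query` involves: a `query` parameter carrying a LogsQL expression, and a response streamed as one JSON object per line. The base URL, the `limit` default and both function names are illustrative assumptions; the cockpit page itself is HTML/JavaScript and may be structured differently.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode


def build_logsql_url(base_url: str, query: str, limit: int = 100) -> str:
    """Build a VictoriaLogs query URL for a LogsQL expression."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url.rstrip('/')}/select/logsql/query?{params}"


def parse_json_lines(body: str) -> List[Dict]:
    """Parse a JSON-lines response body, skipping blank lines."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical in-cluster address; no request is sent here.
    print(build_logsql_url("http://victorialogs:9428", "error", limit=5))
    sample = '{"_msg": "disk full"}\n{"_msg": "retrying"}\n'
    print([row["_msg"] for row in parse_json_lines(sample)])
```

Because the response is line-delimited, a client can process entries as they arrive instead of waiting for a single JSON document.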
Validation
- Live Netcup deploy 2026-04-25: Fluent Bit shipped 9 000 log lines / minute to VictoriaLogs sustained for 6 h with no drops (validated via `vmlogs-cli stats`).
- Three dashboards rendered correctly in Perses; cockpit "Health" page green.
- BYOS validation: `observability.mode=external` against a customer's Grafana Cloud tenant succeeded on dry-run (Sprint 43 customer pilot).
- Audit cadence: re-score every 6 months; any backend dropping below 7.0 triggers a P1 replacement ticket.
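The audit rule can be expressed as a small check. This is only a sketch: the function name is hypothetical, the first score set is copied from the option tables in this ADR, and the second is an invented re-score used purely to show the trigger.

```python
from typing import Dict, List

REPLACEMENT_THRESHOLD = 7.0  # a score strictly below this triggers a P1 ticket


def backends_needing_replacement(
    scores: Dict[str, float], threshold: float = REPLACEMENT_THRESHOLD
) -> List[str]:
    """Return, in sorted order, the backends whose score dropped below the threshold."""
    return sorted(name for name, score in scores.items() if score < threshold)


if __name__ == "__main__":
    current = {"VictoriaLogs": 7.5, "Perses": 7.5, "Fluent Bit": 9.0}
    print(backends_needing_replacement(current))

    # Invented future re-score, for illustration only.
    hypothetical = {"VictoriaLogs": 7.0, "Perses": 6.5, "Fluent Bit": 9.0}
    print(backends_needing_replacement(hypothetical))
```

A backend sitting exactly at 7.0 is not flagged, matching the "dropping below 7.0" wording.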
References
- Sprint 43 commits: `d34246e` (chart skeleton), `90cda21` (dashboards bootstrap + cockpit health page rewrite)
- Grafana Labs AGPL announcement, 2024-04-23
- VictoriaLogs upstream: https://github.com/VictoriaMetrics/VictoriaLogs
- Perses (CNCF Sandbox): https://github.com/perses/perses
- Fluent Bit (CNCF Graduated): https://github.com/fluent/fluent-bit
- Related: ADR-020 (monitoring stack relicensing), ADR-027 (storage backend), ADR-029 (governance > license)