ADR-028: Observability Backend — VictoriaLogs + Perses + Fluent Bit

Status

Accepted — 2026-04-25

Context

AKKO's observability layer is load-bearing for the cockpit ("Health" page, ADEN trust bar, audit dashboard, SLO monitoring). The Sprint 4 stack was Loki (logs) + Grafana (dashboards) + Promtail (shipper).

Two issues forced a redesign:

  1. AGPL v3 relicensing wave. Grafana Labs relicensed Grafana OSS and Loki (whose repository includes Promtail) from Apache 2.0 to AGPL v3 in April 2021. Same blocker as MinIO (see ADR-027): incompatible with AKKO's commercial SaaS distribution.
  2. Single-vendor concentration. Putting logs, dashboards and the shipper in one company's hands re-introduces the exact risk we eliminated with storage. ADR-029 (governance > license) requires us to split the stack across distinct vendors / foundations.

We need a stack that is (a) licensed permissively enough for commercial distribution, (b) Kubernetes-native, (c) ingest-compatible with the existing Fluent forwarding patterns, and (d) replaceable by the customer (Datadog / Splunk / Grafana Cloud / a private Grafana) without recompiling AKKO.

Considered Options

Logs backend

| Option | License | Governance | Vitality | Maturity | Cloud-native | Score |
| --- | --- | --- | --- | --- | --- | --- |
| VictoriaLogs | Apache 2.0 | Single-vendor (VictoriaMetrics) | High | GA (1.0) | Yes | 7.5 |
| OpenSearch | Apache 2.0 | Linux Foundation | Very high | Production | Yes (heavy) | 7.0 |
| Loki (status quo) | AGPL v3 | Single-vendor (Grafana Labs) | High | Production | Yes | 2.0 (eliminated) |
| Quickwit | AGPL v3 | Single-vendor | Medium | Production | Yes | 2.5 (eliminated) |

Dashboards

| Option | License | Governance | Vitality | Maturity | Cloud-native | Score |
| --- | --- | --- | --- | --- | --- | --- |
| Perses | Apache 2.0 | CNCF Sandbox | High | Beta -> RC | Yes (CRDs) | 7.5 |
| Grafana 10.x | AGPL v3 | Single-vendor | Very high | Production | Yes | 2.0 (eliminated) |
| Grafana 9.5 LTS | Apache 2.0 | Single-vendor (frozen) | Low | Production | Yes | 4.0 (frozen) |

Log shipper

| Option | License | Governance | Vitality | Maturity | Cloud-native | Score |
| --- | --- | --- | --- | --- | --- | --- |
| Fluent Bit | Apache 2.0 | CNCF Graduated | Very high | Production | Yes | 9.0 |
| Promtail | AGPL v3 | Single-vendor | High | Production | Yes | 2.0 (eliminated) |
| Vector | MPL 2.0 | Datadog | Very high | Production | Yes | 7.0 |

Decision

  • Logs: VictoriaLogs as the packaged default. It runs single-node out of the box, with an optional 3-node cluster for production.
  • Dashboards: Perses as packaged default. Three core dashboards seeded automatically: cluster-overview, aden-slo, storage-layer.
  • Shipper: Fluent Bit as a DaemonSet, replacing Promtail.
  • External mode: observability.mode: external switches the shipper output to Datadog, Splunk HEC, Grafana Cloud Loki, or a customer-managed Grafana / Loki stack.
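
The mode switch above can be sketched as a small validation step over the Helm values tree. This is an illustration only, not the chart's actual logic: `observability.mode` and `observability.external.endpoint` come from this ADR, while the function name, the default mode, and the in-cluster service URL are assumptions.

```python
from typing import Any

ALLOWED_MODES = {"packaged", "external"}


def resolve_observability(values: dict[str, Any]) -> dict[str, Any]:
    """Return the effective shipper target for a Helm values tree.

    Only observability.mode and observability.external.endpoint are named
    in the ADR; everything else here is assumed for the sketch.
    """
    obs = values.get("observability", {})
    mode = obs.get("mode", "packaged")  # assumed default
    if mode not in ALLOWED_MODES:
        raise ValueError(
            f"observability.mode must be one of {sorted(ALLOWED_MODES)}, got {mode!r}"
        )
    if mode == "external":
        endpoint = obs.get("external", {}).get("endpoint")
        if not endpoint:
            raise ValueError(
                "observability.external.endpoint is required in external mode"
            )
        # External mode: no packaged backends, ship to the customer's endpoint.
        return {"mode": mode, "deploy_backends": False, "endpoint": endpoint}
    # Packaged mode: ship to the in-cluster VictoriaLogs.
    # The service name is hypothetical; 9428 is VictoriaLogs' default port.
    return {
        "mode": mode,
        "deploy_backends": True,
        "endpoint": "http://akko-victorialogs:9428",
    }
```

Failing fast on a missing endpoint keeps a misconfigured external deployment from silently shipping logs nowhere.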

Rationale

  • License: all three components are Apache 2.0 -> SaaS unblocked.
  • Governance diversification: Fluent Bit (CNCF Graduated) and Perses (CNCF Sandbox) are foundation-governed; only VictoriaLogs is single-vendor. Spreading the dependency across three independent maintainers removes the "one company flips license -> whole stack dies" failure mode that Loki + Promtail + Grafana had.
  • Resource footprint: VictoriaLogs single-node uses ~150 MB RAM idle vs Loki's ~400 MB at the same ingest rate.
  • Migration cost: Fluent Bit can be configured to dual-write during cutover, so customers running on Loki can transition without log loss.
  • Perses limitations accepted: Perses is still pre-1.0 and its alerting features are thinner than Grafana's. We accept this because (a) AKKO alerting goes through Alertmanager + Karma, decoupled from the dashboard tool, and (b) customers who insist on Grafana can flip to external mode.
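
The dual-write cutover mentioned above relies on Fluent Bit delivering a record to every output whose Match pattern fits the record's tag. The sketch below renders two classic-mode [OUTPUT] sections, one for the existing Loki and one for VictoriaLogs over its JSON-lines ingestion endpoint. Host names and the `kube.*` tag are assumptions, and the helper functions are hypothetical, not part of the chart.

```python
def render_output(name: str, match: str, **options: str) -> str:
    """Render one classic-mode Fluent Bit [OUTPUT] section."""
    lines = ["[OUTPUT]", f"    Name  {name}", f"    Match {match}"]
    lines += [f"    {key} {value}" for key, value in options.items()]
    return "\n".join(lines)


def dual_write_config(match: str = "kube.*") -> str:
    """Return two outputs with the same Match so both backends get every record."""
    # Existing Loki target (hypothetical host), kept until cutover is complete.
    legacy = render_output("loki", match, host="loki.monitoring.svc", port="3100")
    # VictoriaLogs target (hypothetical host) via the generic http output.
    target = render_output(
        "http",
        match,
        host="akko-victorialogs",
        port="9428",
        uri="/insert/jsonline?_stream_fields=stream&_msg_field=log&_time_field=date",
        format="json_lines",
        json_date_format="iso8601",
    )
    return legacy + "\n\n" + target


if __name__ == "__main__":
    print(dual_write_config())
```

Once the VictoriaLogs side has been verified, dropping the first section ends the dual-write phase.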

Consequences

Positive

  • License-clean observability stack — SaaS commercialisation unblocked.
  • Single Helm flag observability.mode={packaged|external} covers every deployment profile.
  • Perses dashboards as Kubernetes CRDs -> GitOps-friendly, no manual import in a UI.
  • Fluent Bit's plugin ecosystem covers OpenTelemetry, Kafka, S3 archival natively (future Sprint 45).

Negative

  • Perses is younger than Grafana, so we maintain a small set of fallback dashboards in a grafana-9.5-lts chart for customers who need Grafana-specific panels.
  • Operators familiar with LogQL must learn LogsQL (VictoriaLogs query language). We mitigate with a short Quarto guide in branding/docs/.
  • Single-vendor risk for VictoriaLogs remains; tracked by ADR-029's 6-month re-score cadence.

Neutral

  • Customers running their own Grafana keep using it via observability.mode: external.

Implementation

  • Sub-chart: helm/akko/charts/akko-observability/
      • templates/victorialogs-statefulset.yaml
      • templates/perses-deployment.yaml
      • templates/fluent-bit-daemonset.yaml
      • templates/dashboards-bootstrap-job.yaml: post-install Job that POSTs three dashboard JSONs to /api/v1/projects/akko/dashboards
      • files/dashboards/cluster-overview.json
      • files/dashboards/aden-slo.json
      • files/dashboards/storage-layer.json
  • Values: helm/akko/values.yaml -> observability.mode, observability.packaged.victorialogs.retention, observability.external.endpoint
  • Cockpit "Health" page: branding/cockpit/health.html updated to call VictoriaLogs /select/logsql/query instead of Loki /loki/api/v1/query_range.
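
As an illustration of the endpoint change on the "Health" page, the sketch below queries VictoriaLogs' /select/logsql/query endpoint and parses its newline-delimited JSON response. It is written in Python for readability and is not the cockpit's actual code; the function names, the base URL, and the example query are assumptions.

```python
import json
import urllib.parse
import urllib.request


def build_logsql_request(
    base_url: str, query: str, limit: int = 100
) -> urllib.request.Request:
    """Build a POST request for VictoriaLogs' /select/logsql/query endpoint."""
    body = urllib.parse.urlencode({"query": query, "limit": limit}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/select/logsql/query", data=body, method="POST"
    )


def parse_json_lines(payload: bytes) -> list[dict]:
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in payload.decode().splitlines() if line.strip()]


def recent_errors(base_url: str) -> list[dict]:
    """Fetch up to 20 entries from the last 5 minutes containing the word 'error'."""
    req = build_logsql_request(base_url, "_time:5m error", limit=20)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_json_lines(resp.read())
```

Each returned object is expected to carry fields such as `_msg` and `_time`. Loki's query_range returns a single JSON document instead, so the parsing step changes along with the URL.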

Validation

  • Live Netcup deploy 2026-04-25: Fluent Bit shipped a sustained 9 000 log lines per minute to VictoriaLogs for 6 h with no drops (validated via vmlogs-cli stats).
  • Three dashboards rendered correctly in Perses; cockpit "Health" page green.
  • BYOS validation: observability.mode=external against a customer's Grafana Cloud tenant succeeded on dry-run (Sprint 43 customer pilot).
  • Audit cadence: re-score every 6 months; any backend dropping below 7.0 triggers a P1 replacement ticket.
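
The re-score rule can be written as a simple threshold check. The sketch below uses the scores from the Considered Options tables; the `Backend` type and `needs_replacement` function are hypothetical names, and treating the threshold as strict ("below 7.0") follows the wording above.

```python
from dataclasses import dataclass

REPLACEMENT_THRESHOLD = 7.0  # from the ADR: a score below this opens a P1 ticket


@dataclass(frozen=True)
class Backend:
    name: str
    score: float


def needs_replacement(
    backends: list[Backend], threshold: float = REPLACEMENT_THRESHOLD
) -> list[str]:
    """Return names of backends whose latest score is strictly below the threshold."""
    return [b.name for b in backends if b.score < threshold]


# Scores of the three packaged components at the time of this ADR.
current = [
    Backend("VictoriaLogs", 7.5),
    Backend("Perses", 7.5),
    Backend("Fluent Bit", 9.0),
]

if __name__ == "__main__":
    print(needs_replacement(current))
```

With the current scores no component is flagged. A component re-scored at exactly 7.0 would not be flagged either, since the rule is "below 7.0".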

References

  • Sprint 43 commits: d34246e (chart skeleton), 90cda21 (dashboards bootstrap + cockpit health page rewrite)
  • Grafana Labs AGPL announcement, 2021-04-20
  • VictoriaLogs upstream: https://github.com/VictoriaMetrics/VictoriaLogs
  • Perses (CNCF Sandbox): https://github.com/perses/perses
  • Fluent Bit (CNCF Graduated): https://github.com/fluent/fluent-bit
  • Related: ADR-020 (monitoring stack relicensing), ADR-027 (storage backend), ADR-029 (governance > license)