Monitoring Stack¶

Metrics, logs, dashboards, and alerting.


Dashboards URL	`https://metrics.akko.local`
Prometheus URL	`https://prometheus.akko.local`
Authentication	Keycloak SSO (Dashboards)
Helm sub-charts	`monitoring` (kube-prometheus-stack), `loki` (logs layer and log shipper)

Overview¶

AKKO includes a full observability stack that provides metrics collection, log aggregation, visualization, and alerting. All five monitoring services start with the core profile -- no extra flags needed.

                  metrics.akko.local
                           |
                     Traefik (TLS)
                           |
                  Dashboards (:3000)
                 /         |          \
        Prometheus     logs layer     Alertmanager
         (:9090)        (:3100)         (:9093)
            |              |
         scrape       log shipper
         targets        (:9080)
            |              |
    +-------+-------+   Docker
    |       |       |  container
 object    JHub    ...   logs
 storage

Components¶

Prometheus¶

Metrics collection engine. Scrapes targets and evaluates alert rules every 15 seconds.

Setting	Value
URL	`https://prometheus.akko.local`
Internal port	`9090`
Config file	`monitoring/prometheus/prometheus.yml`
Alert rules	`monitoring/prometheus/rules/akko-alerts.yml`
Scrape interval	15s

Active scrape targets:

Job	Target	Metrics Path
`prometheus`	`localhost:9090`	`/metrics`
`alertmanager`	`akko-alertmanager:9093`	`/metrics`
`minio`	`akko-minio:9000`	`/minio/v2/metrics/cluster`
`jupyterhub`	`akko-jupyterhub:8000`	`/hub/metrics`

Trino and Airflow metrics

Trino and Airflow scrape targets are currently disabled because their health endpoints return JSON, not Prometheus format. To enable them:

Trino: Deploy JMX Exporter as a Java agent in the Trino image (port 9483)
Airflow: Deploy a statsd_exporter sidecar or install apache-airflow-providers-statsd
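To see why a JSON health endpoint cannot serve as a scrape target, it may help to compare the two formats directly. The following is a minimal sketch, not part of AKKO: the function names are hypothetical, and the regular expression is only a rough heuristic for the Prometheus text exposition format (it does not handle every escaping rule).

```python
import json
import re

# Rough shape of one sample line in the Prometheus text exposition format:
#   metric_name{optional="labels"} value [timestamp]
_SAMPLE_LINE = re.compile(
    r'^[a-zA-Z_:][a-zA-Z0-9_:]*'   # metric name
    r'(\{[^}]*\})?'                # optional label set
    r'\s+\S+'                      # sample value
    r'(\s+-?\d+)?$'                # optional timestamp
)


def is_json(body: str) -> bool:
    """True if the body parses as JSON (what a health endpoint typically returns)."""
    try:
        json.loads(body)
        return True
    except ValueError:
        return False


def looks_like_prometheus_text(body: str) -> bool:
    """Heuristic: every non-blank, non-comment line must look like a sample."""
    lines = [line.strip() for line in body.splitlines()]
    samples = [line for line in lines if line and not line.startswith("#")]
    return bool(samples) and all(_SAMPLE_LINE.match(line) for line in samples)
```

A body such as `up{job="minio"} 1` passes the check, while `{"status": "healthy"}` is valid JSON but not a metrics payload, which is why an exporter is needed in front of Trino and Airflow.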

Dashboards¶

Visualization and dashboard platform. Pre-configured with three datasources and four dashboards.

Setting	Value
URL	`https://metrics.akko.local`
Internal port	`3000`
Authentication	Keycloak SSO
Datasource config	`monitoring/grafana/provisioning/datasources/datasource.yml`
Dashboard config	`monitoring/grafana/provisioning/dashboards/dashboards.yml`

Pre-configured datasources:

Datasource	Type	URL	Default
Prometheus	`prometheus`	`http://akko-prometheus-server:9090`	Yes
logs layer	`loki`	`http://akko-loki:3100`	No
Alertmanager	`alertmanager`	`http://akko-alertmanager:9093`	No

Pre-built dashboards:

Dashboard	File	Description
AKKO Overview	`akko-overview.json`	Platform-wide health and resource usage
AKKO object storage	`akko-minio.json`	Object storage metrics (objects, bandwidth)
AKKO Trino	`akko-trino.json`	Trino query metrics (requires JMX Exporter)
AKKO JupyterHub	`akko-jupyterhub.json`	JupyterHub user and spawner metrics

logs layer¶

Log aggregation engine. Receives logs from log shipper and makes them queryable through Dashboards using LogQL.

Setting	Value
Internal port	`3100`
Config file	`monitoring/loki/loki-config.yml`
Max query lines	1000

log shipper¶

Log collection agent. Scrapes Docker container logs and forwards them to logs layer with labels for service name, container ID, and other metadata.

Setting	Value
Internal port	`9080`
Config file	`monitoring/promtail/promtail-config.yml`
Source	Docker container logs

Alertmanager¶

Alert routing and notification engine. Receives alerts from Prometheus and routes them based on severity.

Setting	Value
Internal port	`9093`
Config file	`monitoring/alertmanager/alertmanager.yml`
Default receiver	Logging only (stdout)

Alert Rules¶

AKKO ships with pre-configured alert rules in monitoring/prometheus/rules/akko-alerts.yml:

Alert	Severity	Condition	For
ServiceDown	Critical	`up == 0`	2 min
HighLatency	Warning	Response time > 2s	5 min
HighMemoryUsage	Warning	Container memory > 90% of limit	2 min
HighCPUUsage	Warning	Container CPU > 80%	5 min
DiskSpaceRunningLow	Warning	Filesystem > 85% used	5 min
PostgresConnectionsHigh	Warning	Connections > 80% of max	5 min
StorageBucketEmpty	Info	Bucket has 0 objects	10 min
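The "For" column means the condition must hold continuously for that long before the alert fires; until then the alert is pending. The sketch below is a simplified model of that behaviour, assuming the 15-second evaluation interval stated in the Prometheus section. The function name is hypothetical, and the model ignores details of the real rule evaluator.

```python
from typing import Iterable, List

EVAL_INTERVAL_S = 15  # rule evaluation interval from the Prometheus section


def alert_states(condition: Iterable[bool], for_s: int,
                 interval_s: int = EVAL_INTERVAL_S) -> List[str]:
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition):
        now = i * interval_s
        if not is_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states
```

Under this model, ServiceDown (`for_s=120`) stays pending for the first eight evaluations of `up == 0` and fires on the ninth, while a target that recovers within two minutes never fires.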

Alert routing is configured with severity-based grouping:

Critical alerts: 5-second group wait, 15-minute repeat interval
All other alerts: 10-second group wait, 1-hour repeat interval
Inhibition: A critical alert suppresses warnings for the same service
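The routing rules listed here can be modelled in a few lines. This is an illustrative sketch only, not the Alertmanager configuration itself: the names are hypothetical, and it assumes the service is identified by the `job` label, which the alert rules file may or may not use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Timing:
    group_wait_s: int
    repeat_interval_s: int


CRITICAL = Timing(group_wait_s=5, repeat_interval_s=15 * 60)
DEFAULT = Timing(group_wait_s=10, repeat_interval_s=60 * 60)


def timing_for(labels: Dict[str, str]) -> Timing:
    """Critical alerts are sent sooner and repeated more often."""
    return CRITICAL if labels.get("severity") == "critical" else DEFAULT


def is_inhibited(alert: Dict[str, str], firing: Iterable[Dict[str, str]],
                 service_label: str = "job") -> bool:
    """A warning is muted while a critical alert fires for the same service."""
    if alert.get("severity") != "warning":
        return False
    return any(
        other.get("severity") == "critical"
        and other.get(service_label) == alert.get(service_label)
        for other in firing
    )
```

For example, while ServiceDown is firing for a service, a HighLatency warning for that same service is suppressed, but warnings for other services are still delivered.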

Accessing Dashboards¶

Open https://metrics.akko.local
Log in with Keycloak SSO (e.g., alice with the admin role)
Navigate to Dashboards in the left sidebar to view pre-built dashboards
Use Explore to query Prometheus metrics or logs layer logs directly

Querying Logs in Dashboards¶

To view logs from a specific AKKO service:

Go to Explore (compass icon in the left sidebar)
Select the logs layer datasource
Use a LogQL query:

{container_name="akko-trino"}

Filter to lines that contain the string ERROR (a line filter, which approximates filtering by log level):

{container_name="akko-airflow"} |= "ERROR"
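The same LogQL queries can be sent to the logs layer HTTP API from a script. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint. It assumes the internal address from the datasource table is reachable from where the script runs, and the helper name is hypothetical. It does not escape quotes inside the arguments.

```python
from urllib.parse import urlencode

LOKI_URL = "http://akko-loki:3100"  # datasource URL from the Dashboards section
MAX_LINES = 1000                    # "Max query lines" from the logs layer settings


def log_query_url(container: str, contains: str = "", limit: int = MAX_LINES) -> str:
    """Build a query_range URL for one container, optionally with a line filter."""
    query = '{container_name="%s"}' % container
    if contains:
        query += ' |= "%s"' % contains
    # Never ask for more lines than the configured maximum.
    params = urlencode({"query": query, "limit": min(limit, MAX_LINES)})
    return f"{LOKI_URL}/loki/api/v1/query_range?{params}"
```

Calling `log_query_url("akko-airflow", "ERROR")` yields a URL that can be fetched with any HTTP client. When no time range is given, the endpoint falls back to its default window.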

Querying Metrics¶

Select the Prometheus datasource in Explore and use PromQL:

# Service availability (1 = up, 0 = down)
up

# JupyterHub running single-user servers
jupyterhub_running_servers

# object storage requests per second
rate(minio_http_requests_total[5m])
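Outside the Explore view, the same PromQL can be run against the Prometheus HTTP API (`/api/v1/query`). The following sketch builds the request URL and interprets the JSON response for the `up` query. It assumes the internal address from the datasource table is reachable, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

PROMETHEUS_URL = "http://akko-prometheus-server:9090"  # datasource URL


def instant_query_url(promql: str) -> str:
    """URL for an instant query evaluated at the current time."""
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


def down_jobs(response_body: str) -> list:
    """Return the job label of every series whose `up` value is 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each vector sample is [timestamp, "value-as-string"].
    return sorted(
        series["metric"].get("job", "")
        for series in payload["data"]["result"]
        if float(series["value"][1]) == 0.0
    )
```

Fetching `instant_query_url("up")` and passing the body to `down_jobs` gives a quick list of scrape jobs that are currently unreachable.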

Adding Custom Alerts¶

Create a new YAML file in monitoring/prometheus/rules/:

monitoring/prometheus/rules/my-alerts.yml

groups:
  - name: my-custom-alerts
    rules:
      - alert: SlowTrinoQueries
        expr: trino_query_execution_time_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow Trino queries detected"
          description: "Queries taking longer than 30 seconds for 5+ minutes."

Prometheus loads all *.yml files from the rules directory when it starts or reloads its configuration.

Adding Custom Dashboards¶

Place JSON dashboard files in monitoring/grafana/dashboards/. The Dashboards provisioning system detects new files and loads them automatically.

To export an existing dashboard:

Open the dashboard in Dashboards
Click the Share icon (top bar)
Select Export > Save to file
Place the JSON file in monitoring/grafana/dashboards/
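Before dropping an exported file into the provisioning folder, it can be worth checking that it is valid JSON and carries a dashboard title. The helper below is a hypothetical sketch of that last step, not an AKKO tool. The only assumption about the export format is that the top-level object has a `title` field.

```python
import json
from pathlib import Path

DASHBOARD_DIR = Path("monitoring/grafana/dashboards")


def install_dashboard(exported: Path, dest_dir: Path = DASHBOARD_DIR) -> Path:
    """Validate an exported dashboard and copy it into the provisioning folder."""
    dashboard = json.loads(exported.read_text(encoding="utf-8"))
    if not isinstance(dashboard, dict) or not dashboard.get("title"):
        raise ValueError(f"{exported} does not look like a dashboard export")
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / exported.name
    # Re-serialise with indentation so later diffs stay readable.
    target.write_text(json.dumps(dashboard, indent=2), encoding="utf-8")
    return target
```

Run from the repository root, `install_dashboard(Path("my-dashboard.json"))` writes the file into monitoring/grafana/dashboards/ or raises an error if the export is malformed.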

Configuring Notifications¶

The default Alertmanager configuration uses a logging-only receiver (alerts appear in stdout). To enable real notifications, edit monitoring/alertmanager/alertmanager.yml:

Slack:

receivers:
  - name: default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#akko-alerts'
        send_resolved: true

Webhook:

receivers:
  - name: default
    webhook_configs:
      - url: 'http://your-webhook-endpoint:5001/'
        send_resolved: true

Email:

global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'akko-alerts@example.com'
  smtp_auth_username: 'user'
  smtp_auth_password: 'pass'

receivers:
  - name: default
    email_configs:
      - to: 'team@example.com'
        send_resolved: true

After editing, restart Alertmanager:

kubectl rollout restart deploy/akko-alertmanager -n akko
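For the webhook option, the receiving side has to accept the JSON document that Alertmanager posts. Below is a minimal sketch of such a receiver using only the Python standard library. The handler, the `summarize` helper, and the `serve` function are hypothetical names, and port 5001 simply mirrors the placeholder URL in the webhook example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload: dict) -> list:
    """One line per alert: status, severity, name and summary annotation."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        lines.append(
            f"[{alert.get('status', '?')}] "
            f"{labels.get('severity', 'none')} "
            f"{labels.get('alertname', 'unknown')}: {summary}"
        )
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Alertmanager sends the notification as a JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 5001) -> None:
    """Block forever, printing alerts that Alertmanager posts to this port."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. With `send_resolved: true`, the same handler also receives a notification with status `resolved` once the alert clears.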

Configuration Files Reference¶

File	Purpose
`monitoring/prometheus/prometheus.yml`	Prometheus scrape targets and global config
`monitoring/prometheus/rules/akko-alerts.yml`	Alert rule definitions
`monitoring/grafana/provisioning/datasources/datasource.yml`	Datasource auto-provisioning
`monitoring/grafana/provisioning/dashboards/dashboards.yml`	Dashboard provisioning config
`monitoring/grafana/dashboards/*.json`	Pre-built Dashboard definitions
`monitoring/loki/loki-config.yml`	logs layer storage and retention config
`monitoring/promtail/promtail-config.yml`	log shipper log scraping config
`monitoring/alertmanager/alertmanager.yml`	Alert routing and notification config

Kubernetes Deployment

In Helm/k8s mode, Prometheus, Dashboards, and Alertmanager are deployed via the kube-prometheus-stack chart (the monitoring key in values.yaml). The logs layer and the log shipper are deployed via a separate loki chart. In this mode, the log shipper collects pod logs from the Kubernetes node filesystem rather than from the Docker socket.