Troubleshooting

A comprehensive problem-solution guide for common AKKO issues, organized by component. Each entry describes the symptom, root cause, and fix.


Traefik

Traefik crashes or fails to start with Docker Engine v29+

Symptom: Traefik container exits immediately or enters a restart loop after a Docker Desktop upgrade (recent Docker Desktop releases bundle Docker Engine v29+).

Cause: Traefik v3.2 and v3.3 have a compatibility issue with Docker Engine v29+.

Fix: Use Traefik v3.4 or later. AKKO pins Traefik to v3.6.9 in the Helm chart.


PostgreSQL

Password in .env changed but login still fails

Symptom: After regenerating .env, services cannot connect to PostgreSQL with the new password. Error: password authentication failed for user "akko".

Cause: PostgreSQL stores user passwords in its data volume. The init scripts in postgres/init/ only run once (on first startup when the volume is empty). Changing .env does not update the password inside the database.

Fix: Manually update the password in the running database:

kubectl exec -n akko statefulset/akko-postgresql -- psql -U postgres -c \
  "ALTER USER akko WITH PASSWORD 'paste-new-password';"

Repeat for any affected user (postgres, akko, keycloak_user).

Warning

This applies to all PostgreSQL users. The init scripts in postgres/init/ (mounted to docker-entrypoint-initdb.d) only execute when the data volume is empty. The postgres-init sidecar handles idempotent schema/extension creation, but password changes still require ALTER USER.
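When several users are affected, the per-user ALTER USER statements can be generated in one pass and piped into psql inside the pod. A minimal sketch, assuming your .env defines POSTGRES_PASSWORD, AKKO_DB_PASSWORD, and KEYCLOAK_DB_PASSWORD (adjust the variable names to match your file):

```shell
# Emit one ALTER USER statement per user/password pair; the env var
# names are assumptions -- match them to your .env.
gen_alter_sql() {
  for pair in "postgres:$POSTGRES_PASSWORD" \
              "akko:$AKKO_DB_PASSWORD" \
              "keycloak_user:$KEYCLOAK_DB_PASSWORD"; do
    printf "ALTER USER %s WITH PASSWORD '%s';\n" "${pair%%:*}" "${pair#*:}"
  done
}

# Load .env, then pipe the statements into psql inside the pod:
# set -a; . ./.env; set +a
# gen_alter_sql | kubectl exec -i -n akko statefulset/akko-postgresql -- psql -U postgres
```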

Init scripts did not run

Symptom: Expected databases, extensions, or schemas are missing.

Cause: Docker PostgreSQL init scripts (docker-entrypoint-initdb.d/) only run on the first startup with an empty data volume.

Fix: For anything that must survive across restarts and rebuilds, AKKO uses the postgres-init sidecar with ensure.sql. If you need to re-run init scripts from scratch:

kubectl delete pvc -n akko -l app.kubernetes.io/name=akko-postgres
helm upgrade akko helm/akko/ -n akko

Danger

This destroys all database data. Export anything you need first.


Apache Polaris (Iceberg Catalog)

Bootstrap credentials not applied

Symptom: POLARIS_BOOTSTRAP_CREDENTIALS (set in Helm values or Kubernetes Secret) are ignored.

Cause: Polaris only reads bootstrap credentials on first startup (empty database).

Fix: If you need to reset Polaris credentials, remove the Polaris data from PostgreSQL and restart:

kubectl delete deploy -n akko -l app.kubernetes.io/name=akko-polaris
kubectl exec -n akko statefulset/akko-postgresql -- psql -U postgres -d akko -c "DROP SCHEMA IF EXISTS polaris CASCADE;"
helm upgrade akko helm/akko/ -n akko

Catalog disappears after database recreation

Symptom: polaris-init reports "catalog already exists" but Trino/Spark cannot find any tables. The management API returns an empty catalog list.

Cause: After a PostgreSQL volume reset, polaris-init may cache stale state. The catalog metadata is gone but the init script thinks it is still present.

Fix: Always recreate catalogs using the init sidecar:

# The polaris-init job runs automatically on helm upgrade:
helm upgrade akko helm/akko/ -n akko

Never create Polaris catalogs manually

The polaris-init script contains the correct storageConfigInfo format (with stsUnavailable, endpoint, and pathStyleAccess as top-level fields) and the full RBAC setup (principal role, catalog role, grants). Manual curl commands frequently get the format wrong -- Polaris silently ignores dot-notation keys (s3.endpoint) and nested objects ({"s3": {...}}).
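Based on the constraints above, a correct storageConfigInfo block looks roughly like this sketch (the endpoint and allowedLocations values are placeholders, and the exact field set beyond the three named above is an assumption):

```json
{
  "storageType": "S3",
  "endpoint": "http://akko-minio:9000",
  "pathStyleAccess": true,
  "stsUnavailable": true,
  "allowedLocations": ["s3://akko-warehouse/"]
}
```

Note that endpoint, pathStyleAccess, and stsUnavailable sit at the top level of the object, not under a nested "s3" key.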

Trino gets "invalid_scope" from Polaris

Symptom: Trino Iceberg connector fails with invalid_scope when authenticating to Polaris.

Cause: Trino's default OAuth2 scope is rejected by Polaris.

Fix: Set the explicit scope in trino/catalog/iceberg.properties:

iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL
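In context, the catalog file might look like the following sketch (the URI, warehouse name, and credential placeholder are assumptions; only the scope line comes from this guide):

```properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://akko-polaris:8181/api/catalog
iceberg.rest-catalog.warehouse=akko
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.credential=<client-id>:<client-secret>
iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL
```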

DROP TABLE forbidden

Symptom: DROP TABLE fails with DROP_TABLE_WITH_PURGE denied.

Cause: Polaris RBAC does not grant purge privileges by default.

Fix: Use CALL iceberg.system.unregister_table('schema', 'table') instead of DROP TABLE.


Keycloak (SSO)

Safari login loop with self-signed certs

Symptom: Login redirects endlessly in Safari. Other browsers work fine.

Cause: Safari is strict about self-signed TLS certificates and blocks cross-origin cookies for untrusted certs.

Fix: Trust the AKKO certificate in macOS Keychain:

  1. Open Keychain Access
  2. Drag traefik/certs/akko.crt into the System keychain
  3. Double-click it, expand Trust, set Always Trust
  4. Restart Safari completely

emailVerified error on JupyterHub login

Symptom: oauth2-proxy rejects user login with a verification error.

Cause: Keycloak users need emailVerified: true to pass oauth2-proxy validation.

Fix: In Keycloak admin console, edit the user and toggle Email verified on. All test users (alice, bob, carol, dave) must have this flag set.

KC_HOSTNAME_URL silently ignored

Symptom: Keycloak issuer mismatch errors. Token validation fails.

Cause: KC_HOSTNAME_URL is a v1 configuration key, silently ignored in Keycloak 26.x.

Fix: Use KC_HOSTNAME (not KC_HOSTNAME_URL). Set KC_HOSTNAME_BACKCHANNEL_DYNAMIC=false for a consistent issuer.
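As a sketch, the corresponding environment block might look like this (the hostname value is an example; the exact Helm values path depends on your chart):

```yaml
env:
  KC_HOSTNAME: "https://identity.akko.local"
  KC_HOSTNAME_BACKCHANNEL_DYNAMIC: "false"
  # KC_HOSTNAME_URL is a v1 key -- Keycloak 26.x silently ignores it
```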


Trino

Access denied for user

Symptom: Trino queries fail with Access Denied.

Cause: The user is not in the correct RBAC group.

Fix: Check trino/etc/group.txt and ensure the user is listed in the appropriate group. AKKO uses five roles: akko-admin, akko-engineer, akko-analyst, akko-user, akko-viewer. See the RBAC guide for details.
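The file follows Trino's file-based group provider format: one group per line, with a colon separating the group name from a comma-separated member list. The memberships below are illustrative, using the test users from this guide:

```
akko-admin:alice
akko-engineer:bob
akko-analyst:carol
akko-viewer:dave
```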

Authentication failed for PostgreSQL connector

Symptom: Trino cannot query the PostgreSQL catalog. Error: password authentication failed for user "akko".

Cause: The akko user password in PostgreSQL does not match what Trino has in its configuration (from Helm values or Kubernetes Secrets). This happens when secrets were regenerated but the PostgreSQL volume still has the old password.

Fix: See the PostgreSQL password persistence fix above.


Spark

JARs not found at runtime (ClassNotFoundException)

Symptom: Spark jobs fail with ClassNotFoundException for Iceberg or AWS classes.

Cause: Using --jars or --packages flags to load JARs does not work reliably with Spark Connect due to ClassLoader isolation.

Fix: All JARs must be baked into the Docker image in the akko-spark Dockerfile. After adding JARs:

# Rebuild the image and push to registry, then upgrade:
bash helm/scripts/build-images.sh
helm upgrade akko helm/akko/ -n akko

SerializedLambda error on .collect()

Symptom: SerializedLambda exception when calling .collect() on Iceberg metadata tables in Spark Connect mode.

Cause: Known Spark Connect serialization issue with Iceberg metadata tables.

Fix: Use .show() instead of .collect() for metadata table queries. For programmatic access, use .toPandas() which works around the serialization path.


Superset

EXPRESSION_NOT_AGGREGATE error on charts

Symptom: Superset chart fails with EXPRESSION_NOT_AGGREGATE when using a virtual dataset.

Cause: Superset metrics on virtual datasets must use aggregate functions. A bare column reference like amount is not valid as a metric.

Fix: Define metrics using aggregate functions: SUM(amount), COUNT(*), AVG(score), etc.

Dashboard chart association not working via REST API

Symptom: Charts created via the Superset REST API do not appear on the dashboard.

Cause: The REST API POST for dashboards does not create the many-to-many relationship with slices (charts).

Fix: Use the ORM approach in the bootstrap script: dash.slices = [slice1, slice2, ...] with a database session commit.


JupyterHub / Notebooks

code-server extensions not loading

Symptom: VS Code extensions are missing in code-server sessions.

Cause: Extensions installed at user level are lost between container restarts.

Fix: Extensions must be installed system-wide in the Dockerfile:

RUN code-server --install-extension ms-python.python \
    --extensions-dir /opt/code-server-extensions

A before-notebook.d hook symlinks the extensions directory into each user session.
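A minimal sketch of such a hook (the script name and the user-side path are assumptions; only /opt/code-server-extensions comes from the Dockerfile snippet above):

```shell
#!/bin/bash
# before-notebook.d/10-link-extensions.sh (hypothetical name):
# point the per-user extensions directory at the system-wide install.
mkdir -p "$HOME/.local/share/code-server"
ln -sfn /opt/code-server-extensions "$HOME/.local/share/code-server/extensions"
```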

CHOWN_HOME fails with read-only volumes

Symptom: Notebook container fails to start. Error: chown: read-only file system.

Cause: CHOWN_HOME_OPTS=-R tries to chown all mounted volumes, including read-only mounts like notebooks/.

Fix: Set CHOWN_HOME=no in the spawner configuration and use a custom before-notebook.d hook that does chown 2>/dev/null || true (ignores errors on read-only mounts).
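A sketch of that hook (the script name and the uid/gid defaults are assumptions):

```shell
#!/bin/bash
# before-notebook.d/20-chown-home.sh (hypothetical name): take ownership
# of the home directory, ignoring failures on read-only mounts.
chown -R "${NB_UID:-1000}:${NB_GID:-100}" "/home/${NB_USER:-jovyan}" 2>/dev/null || true
```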


Ollama (Local AI)

Out of memory with large models

Symptom: Ollama container crashes or restarts when loading a large LLM model.

Cause: Large models (7B+ parameters) require more RAM than Docker Desktop typically allocates. AKKO ships with qwen2.5-coder:7b (4.7 GB), qwen2.5:3b, and nomic-embed-text.

Fix: Use a smaller model like qwen2.5:3b for chat, or increase Docker Desktop memory allocation to 16 GB+. The qwen2.5-coder:7b model requires at least 10 GB of available RAM.

Healthcheck fails (no curl in image)

Symptom: Ollama healthcheck returns unhealthy even though the service works.

Cause: The Ollama image does not include curl or wget.

Fix: Use a TCP check in the healthcheck:

healthcheck:
  test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/localhost/11434"]

Note

Use CMD (not CMD-SHELL) because the Ollama image's /bin/sh does not support /dev/tcp.


OpenMetadata (Governance Profile)

Out of memory / restart loop

Symptom: OpenMetadata server or OpenSearch crash repeatedly.

Cause: Insufficient Docker Desktop memory. OpenMetadata + OpenSearch need ~2.5 GB combined.

Fix: Allocate at least 16 GB to Docker Desktop. Only start these services with the governance profile:

./scripts/start.sh --governance

OpenSearch crashes at 768m heap

Symptom: OpenSearch enters a restart loop with OOM errors.

Cause: OpenSearch needs ~1 GB for heap + native memory. A 768 MB limit is insufficient.

Fix: Ensure OpenSearch is configured with at least -Xms512m -Xmx512m and a container memory limit above 1 GB.
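A sketch of matching settings (the values structure is an assumption about your chart; only the heap figures and the 1 GB floor come from this guide):

```yaml
opensearch:
  env:
    OPENSEARCH_JAVA_OPTS: "-Xms512m -Xmx512m"
  resources:
    limits:
      memory: 1536Mi  # above 1 GB, leaving headroom for native memory
```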

API serialization errors on PUT

Symptom: OpenMetadata REST API returns 400/500 on PUT requests for entities.

Cause: PUT endpoints expect FQN (fully qualified name) strings for fields like service, testSuite, and domain -- not {id, type} objects.

Fix: Always use string FQN values in PUT payloads. For data products, assets must be added via a separate PATCH call after creation.
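For example, a PUT body following this rule might look like the sketch below (the entity and all field values are hypothetical; the point is that service and testSuite are plain FQN strings, not {id, type} objects):

```json
{
  "name": "orders_row_count",
  "service": "akko-trino",
  "testSuite": "akko-trino.sales.orders.testSuite"
}
```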

Migrate container fails

Symptom: OpenMetadata migrate sidecar exits with an error.

Cause: Wrong entrypoint command.

Fix: The migrate command must be:

./bootstrap/openmetadata-ops.sh migrate

Not ./openmetadata.sh, which is the server entrypoint.


Inter-Service Communication

Services cannot reach each other using external URLs

Symptom: A pod tries to reach identity.akko.local and gets connection refused or a DNS resolution failure.

Cause: External domain names (e.g., *.akko.local) may not resolve inside the cluster, or may resolve to the ingress IP which cannot route back to pods.

Fix: Always use Kubernetes service DNS names for inter-service communication:

Instead of              Use
identity.akko.local     akko-akko-keycloak:8080
minio.akko.local        akko-minio:9000
federation.akko.local   akko-trino:8080

In k3d, KC_HOSTNAME_BACKCHANNEL_DYNAMIC=true enables pod-to-pod OIDC token exchange using internal URLs.


published-reports Volume

Quarto render fails with permission denied

Symptom: Rendering a Quarto report from JupyterHub fails because the output directory is not writable.

Cause: The published-reports Docker volume is created as root. The notebook user (uid 1000) cannot write to it.

Fix: Ensure the volume is world-writable:

kubectl exec -n akko deploy/akko-akko-docs -- chmod 777 /usr/share/nginx/html/reports

This is handled automatically by the AKKO Docs service entrypoint, but may need to be re-applied if the volume is recreated.