Troubleshooting¶
A comprehensive problem-solution guide for common AKKO issues, organized by component. Each entry describes the symptom, root cause, and fix.
Traefik¶
Traefik crashes or fails to start with Docker Engine v29+¶
Symptom: Traefik container exits immediately or enters a restart loop after upgrading Docker Desktop.
Cause: Traefik v3.2 and v3.3 have a compatibility issue with Docker Engine v29+.
Fix: Use Traefik v3.4 or later. AKKO pins Traefik to v3.6.9 in
the Helm chart.
PostgreSQL¶
Password in .env changed but login still fails¶
Symptom: After regenerating .env, services cannot connect to PostgreSQL with
the new password. Error: password authentication failed for user "akko".
Cause: PostgreSQL stores user passwords in its data volume. The init scripts
in postgres/init/ only run once (on first startup when the volume is empty).
Changing .env does not update the password inside the database.
Fix: Manually update the password in the running database:
kubectl exec -n akko statefulset/akko-postgresql -- psql -U postgres -c \
"ALTER USER akko WITH PASSWORD 'paste-new-password';"
Repeat for any affected user (postgres, akko, keycloak_user).
Warning
This applies to all PostgreSQL users. The init scripts in
postgres/init/ (mounted to docker-entrypoint-initdb.d) only execute
when the data volume is empty. The postgres-init sidecar handles
idempotent schema/extension creation, but password changes still require
ALTER USER.
Init scripts did not run¶
Symptom: Expected databases, extensions, or schemas are missing.
Cause: Docker PostgreSQL init scripts (docker-entrypoint-initdb.d/) only
run on the first startup with an empty data volume.
Fix: For anything that must survive across restarts and rebuilds, AKKO uses
the postgres-init sidecar with ensure.sql. If you need to re-run init scripts
from scratch:
kubectl delete pvc -n akko -l app.kubernetes.io/name=akko-postgres
helm upgrade akko helm/akko/ -n akko
Danger
This destroys all database data. Export anything you need first.
Apache Polaris (Iceberg Catalog)¶
Bootstrap credentials not applied¶
Symptom: POLARIS_BOOTSTRAP_CREDENTIALS (set in Helm values or Kubernetes Secret) are ignored.
Cause: Polaris only reads bootstrap credentials on first startup (empty database).
Fix: If you need to reset Polaris credentials, remove the Polaris data from PostgreSQL and restart:
kubectl delete deploy -n akko -l app.kubernetes.io/name=akko-polaris
kubectl exec -n akko statefulset/akko-postgresql -- psql -U postgres -d akko -c "DROP SCHEMA IF EXISTS polaris CASCADE;"
helm upgrade akko helm/akko/ -n akko
Catalog disappears after database recreation¶
Symptom: polaris-init reports "catalog already exists" but Trino/Spark cannot
find any tables. The management API returns an empty catalog list.
Cause: After a PostgreSQL volume reset, polaris-init may cache stale state.
The catalog metadata is gone but the init script thinks it is still present.
Fix: Always recreate catalogs using the init sidecar:
Never create Polaris catalogs manually
The polaris-init script contains the correct storageConfigInfo format
(with stsUnavailable, endpoint, pathStyleAccess as top-level fields)
and the full RBAC setup (principal role, catalog role, grants). Manual
curl commands frequently get the format wrong -- Polaris silently
ignores dot-notation (s3.endpoint) and nested objects ({"s3": {...}}).
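If polaris-init runs alongside the Polaris deployment, restarting that deployment should re-run it. A hedged sketch, reusing the label selector from the reset example above (the exact resource names depend on the chart):

```shell
# Assumes polaris-init runs with the Polaris deployment; verify the label.
kubectl rollout restart deployment -n akko -l app.kubernetes.io/name=akko-polaris
```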
Trino gets "invalid_scope" from Polaris¶
Symptom: Trino Iceberg connector fails with invalid_scope when authenticating
to Polaris.
Cause: Trino's default OAuth2 scope is rejected by Polaris.
Fix: Set the explicit scope in trino/catalog/iceberg.properties:
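A hedged example based on Trino's Iceberg REST catalog options and Polaris's role-based scopes; verify the exact property key against your Trino version:

```properties
# Polaris expects a PRINCIPAL_ROLE scope rather than Trino's default.
iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL
```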
DROP TABLE forbidden¶
Symptom: DROP TABLE fails with DROP_TABLE_WITH_PURGE denied.
Cause: Polaris RBAC does not grant purge privileges by default.
Fix: Use CALL iceberg.system.unregister_table('schema', 'table') instead
of DROP TABLE.
Keycloak (SSO)¶
Safari login loop with self-signed certs¶
Symptom: Login redirects endlessly in Safari. Other browsers work fine.
Cause: Safari is strict about self-signed TLS certificates and blocks cross-origin cookies for untrusted certs.
Fix: Trust the AKKO certificate in macOS Keychain:
- Open Keychain Access
- Drag traefik/certs/akko.crt into the System keychain
- Double-click it, expand Trust, set Always Trust
- Restart Safari completely
emailVerified error on JupyterHub login¶
Symptom: oauth2-proxy rejects user login with a verification error.
Cause: Keycloak users need emailVerified: true to pass oauth2-proxy validation.
Fix: In Keycloak admin console, edit the user and toggle Email verified on. All test users (alice, bob, carol, dave) must have this flag set.
KC_HOSTNAME_URL silently ignored¶
Symptom: Keycloak issuer mismatch errors. Token validation fails.
Cause: KC_HOSTNAME_URL is a v1 configuration key, silently ignored in
Keycloak 26.x.
Fix: Use KC_HOSTNAME (not KC_HOSTNAME_URL). Set
KC_HOSTNAME_BACKCHANNEL_DYNAMIC: false for a consistent issuer.
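As a hedged sketch, the relevant Keycloak 26.x environment keys (the hostname value is illustrative):

```yaml
KC_HOSTNAME: "https://identity.akko.local"
KC_HOSTNAME_BACKCHANNEL_DYNAMIC: "false"  # keep the issuer consistent
```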
Trino¶
Access denied for user¶
Symptom: Trino queries fail with Access Denied.
Cause: The user is not in the correct RBAC group.
Fix: Check trino/etc/group.txt and ensure the user is listed in the
appropriate group. AKKO uses five roles: akko-admin, akko-engineer, akko-analyst,
akko-user, akko-viewer. See the RBAC guide for details.
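For illustration, Trino's file group provider uses one group:user,user line per group. The usernames below are this guide's test users, but the actual contents of group.txt are an assumption:

```
akko-admin:alice
akko-engineer:bob
akko-analyst:carol,dave
```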
Authentication failed for PostgreSQL connector¶
Symptom: Trino cannot query the PostgreSQL catalog. Error:
password authentication failed for user "akko".
Cause: The akko user password in PostgreSQL does not match what Trino has
in its configuration (from Helm values or Kubernetes Secrets). This happens when secrets were regenerated
but the PostgreSQL volume still has the old password.
Fix: See the PostgreSQL password persistence fix above.
Spark¶
JARs not found at runtime (ClassNotFoundException)¶
Symptom: Spark jobs fail with ClassNotFoundException for Iceberg or AWS
classes.
Cause: Using --jars or --packages flags to load JARs does not work
reliably with Spark Connect due to ClassLoader isolation.
Fix: All JARs must be baked into the Docker image in the akko-spark
Dockerfile. After adding JARs:
# Rebuild the image and push to registry, then upgrade:
bash helm/scripts/build-images.sh
helm upgrade akko helm/akko/ -n akko
SerializedLambda error on .collect()¶
Symptom: SerializedLambda exception when calling .collect() on Iceberg
metadata tables in Spark Connect mode.
Cause: Known Spark Connect serialization issue with Iceberg metadata tables.
Fix: Use .show() instead of .collect() for metadata table queries. For
programmatic access, use .toPandas() which works around the serialization path.
Superset¶
EXPRESSION_NOT_AGGREGATE error on charts¶
Symptom: Superset chart fails with EXPRESSION_NOT_AGGREGATE when using
a virtual dataset.
Cause: Superset metrics on virtual datasets must use aggregate functions.
A bare column reference like amount is not valid as a metric.
Fix: Define metrics using aggregate functions: SUM(amount), COUNT(*),
AVG(score), etc.
Dashboard chart association not working via REST API¶
Symptom: Charts created via the Superset REST API do not appear on the dashboard.
Cause: The REST API POST for dashboards does not create the many-to-many
relationship with slices (charts).
Fix: Use the ORM approach in the bootstrap script:
dash.slices = [slice1, slice2, ...] with a database session commit.
JupyterHub / Notebooks¶
code-server extensions not loading¶
Symptom: VS Code extensions are missing in code-server sessions.
Cause: Extensions installed at user level are lost between container restarts.
Fix: Extensions must be installed system-wide in the Dockerfile:
A before-notebook.d hook symlinks the extensions directory into each user session.
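A hedged sketch of such a Dockerfile step, assuming the code-server CLI and an example extension ID (the actual AKKO extension list and target directory may differ):

```dockerfile
USER root
# Install extensions into a shared, system-wide directory.
RUN mkdir -p /opt/code-server/extensions && \
    code-server --install-extension ms-python.python \
                --extensions-dir /opt/code-server/extensions && \
    chmod -R a+rX /opt/code-server/extensions
```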
CHOWN_HOME fails with read-only volumes¶
Symptom: Notebook container fails to start. Error: chown: read-only file system.
Cause: CHOWN_HOME_OPTS=-R tries to chown all mounted volumes, including
read-only mounts like notebooks/.
Fix: Set CHOWN_HOME=no in the spawner configuration and use a custom
before-notebook.d hook that does chown 2>/dev/null || true (ignores errors
on read-only mounts).
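A minimal sketch of such a hook, assuming the standard jupyter docker-stacks variables NB_UID, NB_GID, and NB_USER (defaults shown are illustrative):

```shell
#!/bin/bash
# before-notebook.d hook: fix home-directory ownership, but deliberately
# ignore failures from read-only mounts so startup never aborts.
chown -R "${NB_UID:-1000}:${NB_GID:-100}" "/home/${NB_USER:-jovyan}" 2>/dev/null || true
```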
Ollama (Local AI)¶
Out of memory with large models¶
Symptom: Ollama container crashes or restarts when loading a large model.
Cause: Large models (7B+ parameters) require more RAM than Docker Desktop typically allocates. AKKO ships with qwen2.5-coder:7b (4.7 GB), qwen2.5:3b, and nomic-embed-text.
Fix: Use a smaller model like qwen2.5:3b for chat, or increase Docker Desktop memory allocation to 16 GB+. The qwen2.5-coder:7b model requires at least 10 GB of available RAM.
Healthcheck fails (no curl in image)¶
Symptom: Ollama healthcheck returns unhealthy even though the service works.
Cause: The Ollama image does not include curl or wget.
Fix: Use a TCP check in the healthcheck:
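A hedged compose-style example using bash's built-in /dev/tcp (11434 is Ollama's default port; intervals are illustrative):

```yaml
healthcheck:
  test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/localhost/11434"]
  interval: 30s
  retries: 3
```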
Note
Use CMD (not CMD-SHELL) because the Ollama image's /bin/sh does not
support /dev/tcp.
OpenMetadata (Governance Profile)¶
Out of memory / restart loop¶
Symptom: OpenMetadata server or OpenSearch crash repeatedly.
Cause: Insufficient Docker Desktop memory. OpenMetadata + OpenSearch need ~2.5 GB combined.
Fix: Allocate at least 16 GB to Docker Desktop. Only start these services with the governance profile:
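A sketch, assuming the stack defines a compose profile named governance (adjust to your deployment mechanism):

```shell
docker compose --profile governance up -d
```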
OpenSearch crashes at 768m heap¶
Symptom: OpenSearch enters a restart loop with OOM errors.
Cause: OpenSearch needs ~1 GB for heap + native memory. A 768 MB limit is insufficient.
Fix: Ensure OpenSearch is configured with at least -Xms512m -Xmx512m and
a container memory limit above 1 GB.
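A hedged compose-style fragment: OPENSEARCH_JAVA_OPTS is the standard heap setting, while the service name and memory limit are illustrative.

```yaml
opensearch:
  environment:
    OPENSEARCH_JAVA_OPTS: "-Xms512m -Xmx512m"
  deploy:
    resources:
      limits:
        memory: 1536M
```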
API serialization errors on PUT¶
Symptom: OpenMetadata REST API returns 400/500 on PUT requests for entities.
Cause: PUT endpoints expect FQN (fully qualified name) strings for fields like
service, testSuite, and domain -- not {id, type} objects.
Fix: Always use string FQN values in PUT payloads. For data products, assets
must be added via a separate PATCH call after creation.
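For illustration only (the field values below are invented), the difference is the string FQN versus object shape:

```
# Rejected: "testSuite": {"id": "1234", "type": "testSuite"}
# Accepted: "testSuite": "akko_trino.orders.testSuite"
```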
Migrate container fails¶
Symptom: OpenMetadata migrate sidecar exits with an error.
Cause: Wrong entrypoint command.
Fix: Run the dedicated migration entrypoint, not ./openmetadata.sh, which starts the server.
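A hedged example; in recent OpenMetadata releases the migration entrypoint is openmetadata-ops.sh, but verify the path and subcommand against your version:

```shell
./bootstrap/openmetadata-ops.sh migrate
```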
Inter-Service Communication¶
Services cannot reach each other using external URLs¶
Symptom: A pod tries to reach identity.akko.local and gets
connection refused or a DNS resolution failure.
Cause: External domain names (e.g., *.akko.local) may not resolve inside the cluster, or may resolve to the ingress IP which cannot route back to pods.
Fix: Always use Kubernetes service DNS names for inter-service communication:
| Instead of | Use |
|---|---|
| identity.akko.local | akko-akko-keycloak:8080 |
| minio.akko.local | akko-minio:9000 |
| federation.akko.local | akko-trino:8080 |
In k3d, KC_HOSTNAME_BACKCHANNEL_DYNAMIC=true enables pod-to-pod OIDC token exchange using internal URLs.
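To spot-check in-cluster resolution, a hedged example (the deployment name follows the service names in the table above):

```shell
kubectl exec -n akko deploy/akko-trino -- getent hosts akko-akko-keycloak
```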
published-reports Volume¶
Quarto render fails with permission denied¶
Symptom: Rendering a Quarto report from JupyterHub fails because the output directory is not writable.
Cause: The published-reports Docker volume is created as root. The notebook
user (uid 1000) cannot write to it.
Fix: Ensure the volume is world-writable:
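A hedged one-off fix using a throwaway container (volume name from this section; the mount path is illustrative):

```shell
docker run --rm -v published-reports:/reports alpine chmod 0777 /reports
```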
This is handled automatically by the AKKO Docs service entrypoint, but may need to be re-applied if the volume is recreated.