Skip to content

ADEN — Autonomous Data Exploration Node

ADEN is AKKO's sovereign open-source answer to Databricks Genie, Snowflake Copilot and Dremio Text-to-SQL. It converts natural-language questions into Trino SQL, enforces per-user authorization, gates runaway queries, redacts PII, and publishes the result as a Streamlit dashboard — entirely inside your cluster.

For the service-reference page (endpoints, env vars, images) see Services / ADEN. This page focuses on the AI capability and how ADEN fits in the stack.

Why ADEN

Feature ADEN Databricks Genie Snowflake Copilot
Self-hosted, OSS Yes No No
RBAC enforcement OPA Unity Catalog Snowflake Roles
PII redaction Yes Yes Yes
EXPLAIN cost gate Yes No No
Publish dashboard Streamlit Genie session None
Conversation memory Yes Yes Yes
100 % offline Yes No No

Architecture

flowchart LR
    USER[Cockpit chat] -->|POST /ask| FE[/api/cockpit/aden/]
    FE --> A[ADEN<br/>FastAPI]
    A --> CACHE{TTL cache<br/>5 min}
    CACHE -->|miss| OM[OpenMetadata<br/>_search]
    OM --> OPA[OPA<br/>/v1/data/akko/aden/allow]
    OPA --> LLM[LiteLLM<br/>qwen2.5-coder:7b]
    LLM --> VALID[sqlglot validate<br/>+ keyword denylist]
    VALID --> GATE[Trino EXPLAIN TYPE IO<br/>AKKO_COST_BYTES_LIMIT]
    GATE -->|ok| EXEC[Trino exec<br/>user OAuth token]
    EXEC --> MASK[PII redaction<br/>via OM tags]
    MASK --> PUB[Streamlit publish<br/>PVC akko-aden-reports]
    PUB -->|URL| USER

Request Pipeline

Every ADEN call is a strict pipeline — each step assumes the previous one succeeded.

  1. Cache lookupsha256(question | role) keyed TTL cache (5 min). Pass force=true to bypass.
  2. Catalog search — OpenMetadata _search returns the top-N candidate tables for the question.
  3. OPA per-table check — for each candidate, POST /v1/data/akko/aden/allow with {user, role, table}. Denied tables are dropped silently from the prompt context.
  4. LLM generationqwen2.5-coder:7b receives the system prompt, the allowed tables, and the last 3 turns of session history (when session_id is provided).
  5. sqlglot validation — the LLM output is parsed in the Trino dialect. Anything that is not a single SELECT / WITH / UNION / EXCEPT / INTERSECT is rejected. A keyword denylist (INSERT, UPDATE, DELETE, DROP, etc.) runs as a second line of defence. LIMIT 10000 is auto-injected when absent.
  6. EXPLAIN (TYPE IO) cost gate — Trino estimates bytes scanned. Over the gate (default 1 GiB, AKKO_COST_BYTES_LIMIT) returns HTTP 413. The user can resend with confirm_cost=true to override.
  7. Trino execution — with the caller's OAuth token, so any row-filter or column-mask upstream still applies.
  8. PII redaction — columns tagged PII in OpenMetadata are masked before the response.
  9. Streamlit dashboard — the result is written to the akko-aden-reports PVC and rendered at https://reports.<domain>/.

Explainability — the reasoning.pipeline_steps trace

Every ADEN response carries a reasoning object with a per-step trace that the cockpit renders as a "Comment ADEN a répondu" panel. Each step records:

Field Meaning
step Machine-readable phase name (e.g. catalog_search, opa_authorisation, semantic_layer_match)
description Human-readable explanation of what ADEN did in this phase
source Backend system consulted: verified_queries_cache, openmetadata, open_policy_agent, dbt_semantic_layer, local_llm, trino_executor
duration_ms Wall-clock time for the step
result Compact JSON payload showing the actual evidence (matched table names, OPA allow/deny per table, selected semantic model…)

Example response fragment (dry-run, execute: false):

{
  "reasoning": {
    "intent": "list 5 customers",
    "pipeline_overview": "1) requête vérifiée ? → 2) recherche catalogue → 3) filtrage OPA par rôle → 4) priorisation tables → 5) couche sémantique dbt → 6) génération SQL par le LLM local → 7) validation syntaxique Trino → 8) exécution avec votre token → 9) réponse",
    "sources_queried": ["verified_queries_cache", "openmetadata_catalog", "open_policy_agent", "dbt_semantic_layer", "local_llm"],
    "semantic_model_used": null,
    "pipeline_steps": [
      {"step": "question_received", "source": "aden", "duration_ms": 0, "result": {"question": "list 5 customers", "length": 16}},
      {"step": "catalog_search", "source": "openmetadata+trino_information_schema", "duration_ms": 335, "result": {"candidates": ["iceberg.banking.customers", "postgresql.banking.customers"], "candidate_count": 2}},
      {"step": "opa_authorisation", "source": "open_policy_agent", "duration_ms": 19, "result": {"decisions": [{"table": "iceberg.banking.customers", "allowed": true}], "allowed_count": 2}},
      {"step": "semantic_layer_match", "source": "dbt_semantic_layer", "duration_ms": 0, "result": {"matched_model": null, "reason": "Aucun modèle métier validé — SQL généré ad-hoc depuis les métadonnées tables/colonnes."}}
    ]
  }
}

This answers the question any prospect will ask: "how did the AI find that table / why did it pick that schema / did it actually consult the catalog before hitting the LLM?" — the evidence is in the response itself, not in a log line the data team has to correlate.

Security guarantees — SELECT-only

ADEN is strictly read-only by construction. Even in a worst-case scenario where a prompt injection persuades the LLM to emit DROP TABLE, four independent layers block the write:

  1. sqlglot parse check — only SELECT, WITH, UNION, EXCEPT, INTERSECT statements are accepted. Anything else → HTTP 400.
  2. Keyword regex with comment strippingINSERT UPDATE DELETE DROP CREATE ALTER TRUNCATE GRANT REVOKE MERGE CALL REPLACE are rejected as standalone words (so column names like deleted_at still work), after stripping -- and /* */ comments so obfuscation fails.
  3. Multi-statement guard — a second ;<alphanum> check catches any payload that bypasses the parser's split.
  4. OPA read_only flag on the ADEN service account — even if the ADEN process itself is compromised, the svc-aden service account has read_only: true in the OPA policy; Trino refuses the write at the server.

The 22 adversarial payloads in tests/integration/aden-select-only.py exercise direct DDL, direct DML, multi-statement injection, comment obfuscation, privilege escalation (GRANT/REVOKE), CALL system.runtime.kill_query, case variations and markdown-fenced LLM wrappers. Run locally before any ADEN release:

python3 -m pytest tests/integration/aden-select-only.py -v

API Endpoints

Method Path Body Purpose
POST /ask {question, session_id?, force?, confirm_cost?} Run a question -> SQL -> dashboard
POST /feedback {query_id, thumbs, comment?} Capture UX feedback
GET /sessions/{id} Fetch conversation history
GET /metrics Prometheus (aden_query_total, aden_query_duration_seconds)
GET /healthz Liveness
GET /readyz Readiness (checks Trino, OPA, OM, LiteLLM)

All mutating routes require a valid Keycloak JWT forwarded by Cockpit via X-User-Id and X-User-Role headers.

Structured Audit Events

ADEN emits one JSON line per lifecycle event to Loki (label app=akko-aden):

  • aden_query_received — entry point
  • aden_cache_hit
  • aden_no_tables, aden_opa_denied
  • aden_sql_generated, aden_cost_gate, aden_trino_executed
  • aden_pii_masked, aden_dashboard_published
  • aden_feedback

Grafana ships an AKKO ADEN SLO dashboard with p50 / p95 / p99 latency, error-budget burn (1 h / 6 h / 1 d), and top denied tables.

Examples

1. Simple question

curl -s https://akko.local/api/cockpit/aden/ask \
  -H "Content-Type: application/json" \
  -H "X-User-Id: alice" \
  -H "X-User-Role: akko-analyst" \
  -d '{"question": "Top 10 regions by transaction volume last week"}' | jq

2. With session memory

curl -s https://akko.local/api/cockpit/aden/ask \
  -H "Content-Type: application/json" \
  -H "X-User-Id: alice" \
  -H "X-User-Role: akko-analyst" \
  -d '{"question": "Break it down by day", "session_id": "abc-123"}' | jq

3. Override cost gate

curl -s https://akko.local/api/cockpit/aden/ask \
  -H "Content-Type: application/json" \
  -H "X-User-Id: bob" \
  -H "X-User-Role: akko-engineer" \
  -d '{"question": "Full year 2025 transactions joined with customers",
       "confirm_cost": true}' | jq

Configuration

Key env vars (see helm/akko/charts/akko-aden/values.yaml):

Variable Default Purpose
AKKO_TRINO_URL https://trino.akko.svc:8443 Trino coordinator
AKKO_OPA_URL http://akko-opa:8181 OPA service
AKKO_OM_URL http://akko-openmetadata:8585 OpenMetadata API
AKKO_LITELLM_URL http://akko-litellm:4000 LiteLLM gateway
AKKO_COST_BYTES_LIMIT 1073741824 EXPLAIN cost gate (1 GiB)
AKKO_CACHE_TTL_SECONDS 300 Cache TTL
AKKO_MAX_TABLES 5 Max candidate tables sent to LLM

Operational Notes

  • ADEN state is stateless except for the PVC of published dashboards. Killing the pod is safe.
  • Session memory lives in Redis when AKKO_REDIS_URL is set; otherwise in-process (lost on restart).
  • The Streamlit sidecar mounts the akko-aden-reports PVC read-only and serves it at reports.<domain>.
  • Tempo traces are emitted with service name akko-aden — follow a request id from Cockpit to SQL execution.