ADEN — Autonomous Data Exploration Node¶

ADEN is AKKO's sovereign open-source answer to Databricks Genie, Snowflake Copilot and Dremio Text-to-SQL. It converts natural-language questions into Trino SQL, enforces per-user authorization, gates runaway queries, redacts PII, and publishes the result as a Streamlit dashboard — entirely inside your cluster.

For the service-reference page (endpoints, env vars, images) see Services / ADEN. This page focuses on the AI capability and how ADEN fits in the stack.

Why ADEN¶

Feature	ADEN	Databricks Genie	Snowflake Copilot
Self-hosted, OSS	Yes	No	No
RBAC enforcement	OPA	Unity Catalog	Snowflake Roles
PII redaction	Yes	Yes	Yes
EXPLAIN cost gate	Yes	No	No
Publish dashboard	Streamlit	Genie session	None
Conversation memory	Yes	Yes	Yes
100 % offline	Yes	No	No

Architecture¶

flowchart LR
    USER[Cockpit chat] -->|POST /ask| FE[/api/cockpit/aden/]
    FE --> A[ADEN<br/>FastAPI]
    A --> CACHE{TTL cache<br/>5 min}
    CACHE -->|miss| OM[OpenMetadata<br/>_search]
    OM --> OPA[OPA<br/>/v1/data/akko/aden/allow]
    OPA --> LLM[LiteLLM<br/>qwen2.5-coder:7b]
    LLM --> VALID[sqlglot validate<br/>+ keyword denylist]
    VALID --> GATE[Trino EXPLAIN TYPE IO<br/>AKKO_COST_BYTES_LIMIT]
    GATE -->|ok| EXEC[Trino exec<br/>user OAuth token]
    EXEC --> MASK[PII redaction<br/>via OM tags]
    MASK --> PUB[Streamlit publish<br/>PVC akko-aden-reports]
    PUB -->|URL| USER

Request Pipeline¶

Every ADEN call is a strict pipeline — each step assumes the previous one succeeded.

Cache lookup — sha256(question | role) keyed TTL cache (5 min). Pass force=true to bypass.
Catalog search — OpenMetadata _search returns the top-N candidate tables for the question.
OPA per-table check — for each candidate, POST /v1/data/akko/aden/allow with {user, role, table}. Denied tables are dropped silently from the prompt context.
LLM generation — qwen2.5-coder:7b receives the system prompt, the allowed tables, and the last 3 turns of session history (when session_id is provided).
sqlglot validation — the LLM output is parsed in the Trino dialect. Anything that is not a single SELECT / WITH / UNION / EXCEPT / INTERSECT is rejected. A keyword denylist (INSERT, UPDATE, DELETE, DROP, etc.) runs as a second line of defence. LIMIT 10000 is auto-injected when absent.
EXPLAIN (TYPE IO) cost gate — Trino estimates bytes scanned. Over the gate (default 1 GiB, AKKO_COST_BYTES_LIMIT) returns HTTP 413. The user can resend with confirm_cost=true to override.
Trino execution — with the caller's OAuth token, so any row-filter or column-mask upstream still applies.
PII redaction — columns tagged PII in OpenMetadata are masked before the response.
Streamlit dashboard — the result is written to the akko-aden-reports PVC and rendered at https://reports.<domain>/.

Explainability — the `reasoning.pipeline_steps` trace¶

Every ADEN response carries a reasoning object with a per-step trace that the cockpit renders as a "Comment ADEN a répondu" panel. Each step records:

Field	Meaning
`step`	Machine-readable phase name (e.g. `catalog_search`, `opa_authorisation`, `semantic_layer_match`)
`description`	Human-readable explanation of what ADEN did in this phase
`source`	Backend system consulted: `verified_queries_cache`, `openmetadata`, `open_policy_agent`, `dbt_semantic_layer`, `local_llm`, `trino_executor`
`duration_ms`	Wall-clock time for the step
`result`	Compact JSON payload showing the actual evidence (matched table names, OPA allow/deny per table, selected semantic model…)

Example response fragment (dry-run, execute: false):

{
  "reasoning": {
    "intent": "list 5 customers",
    "pipeline_overview": "1) requête vérifiée ? → 2) recherche catalogue → 3) filtrage OPA par rôle → 4) priorisation tables → 5) couche sémantique dbt → 6) génération SQL par le LLM local → 7) validation syntaxique Trino → 8) exécution avec votre token → 9) réponse",
    "sources_queried": ["verified_queries_cache", "openmetadata_catalog", "open_policy_agent", "dbt_semantic_layer", "local_llm"],
    "semantic_model_used": null,
    "pipeline_steps": [
      {"step": "question_received", "source": "aden", "duration_ms": 0, "result": {"question": "list 5 customers", "length": 16}},
      {"step": "catalog_search", "source": "openmetadata+trino_information_schema", "duration_ms": 335, "result": {"candidates": ["iceberg.banking.customers", "postgresql.banking.customers"], "candidate_count": 2}},
      {"step": "opa_authorisation", "source": "open_policy_agent", "duration_ms": 19, "result": {"decisions": [{"table": "iceberg.banking.customers", "allowed": true}], "allowed_count": 2}},
      {"step": "semantic_layer_match", "source": "dbt_semantic_layer", "duration_ms": 0, "result": {"matched_model": null, "reason": "Aucun modèle métier validé — SQL généré ad-hoc depuis les métadonnées tables/colonnes."}}
    ]
  }
}

This answers the question any prospect will ask: "how did the AI find that table / why did it pick that schema / did it actually consult the catalog before hitting the LLM?" — the evidence is in the response itself, not in a log line the data team has to correlate.

Security guarantees — SELECT-only¶

ADEN is strictly read-only by construction. Even in a worst-case scenario where a prompt injection persuades the LLM to emit DROP TABLE, four independent layers block the write:

sqlglot parse check — only SELECT, WITH, UNION, EXCEPT, INTERSECT statements are accepted. Anything else → HTTP 400.
Keyword regex with comment stripping — INSERT UPDATE DELETE DROP CREATE ALTER TRUNCATE GRANT REVOKE MERGE CALL REPLACE are rejected as standalone words (so column names like deleted_at still work), after stripping -- and /* */ comments so obfuscation fails.
Multi-statement guard — a second ;<alphanum> check catches any payload that bypasses the parser's split.
OPA read_only flag on the ADEN service account — even if the ADEN process itself is compromised, the svc-aden service account has read_only: true in the OPA policy; Trino refuses the write at the server.

The 22 adversarial payloads in tests/integration/aden-select-only.py exercise direct DDL, direct DML, multi-statement injection, comment obfuscation, privilege escalation (GRANT/REVOKE), CALL system.runtime.kill_query, case variations and markdown-fenced LLM wrappers. Run locally before any ADEN release:

python3 -m pytest tests/integration/aden-select-only.py -v

API Endpoints¶

Method	Path	Body	Purpose
POST	`/ask`	`{question, session_id?, force?, confirm_cost?}`	Run a question -> SQL -> dashboard
POST	`/feedback`	`{query_id, thumbs, comment?}`	Capture UX feedback
GET	`/sessions/{id}`	—	Fetch conversation history
GET	`/metrics`	—	Prometheus (`aden_query_total`, `aden_query_duration_seconds`)
GET	`/healthz`	—	Liveness
GET	`/readyz`	—	Readiness (checks Trino, OPA, OM, LiteLLM)

All mutating routes require a valid Keycloak JWT forwarded by Cockpit via X-User-Id and X-User-Role headers.

Structured Audit Events¶

ADEN emits one JSON line per lifecycle event to Loki (label app=akko-aden):

aden_query_received — entry point
aden_cache_hit
aden_no_tables, aden_opa_denied
aden_sql_generated, aden_cost_gate, aden_trino_executed
aden_pii_masked, aden_dashboard_published
aden_feedback

Grafana ships an AKKO ADEN SLO dashboard with p50 / p95 / p99 latency, error-budget burn (1 h / 6 h / 1 d), and top denied tables.

Examples¶

1. Simple question¶

curl -s https://akko.local/api/cockpit/aden/ask \
  -H "Content-Type: application/json" \
  -H "X-User-Id: alice" \
  -H "X-User-Role: akko-analyst" \
  -d '{"question": "Top 10 regions by transaction volume last week"}' | jq

2. With session memory¶

curl -s https://akko.local/api/cockpit/aden/ask \
  -H "Content-Type: application/json" \
  -H "X-User-Id: alice" \
  -H "X-User-Role: akko-analyst" \
  -d '{"question": "Break it down by day", "session_id": "abc-123"}' | jq

3. Override cost gate¶

curl -s https://akko.local/api/cockpit/aden/ask \
  -H "Content-Type: application/json" \
  -H "X-User-Id: bob" \
  -H "X-User-Role: akko-engineer" \
  -d '{"question": "Full year 2025 transactions joined with customers",
       "confirm_cost": true}' | jq

Configuration¶

Key env vars (see helm/akko/charts/akko-aden/values.yaml):

Variable	Default	Purpose
`AKKO_TRINO_URL`	`https://trino.akko.svc:8443`	Trino coordinator
`AKKO_OPA_URL`	`http://akko-opa:8181`	OPA service
`AKKO_OM_URL`	`http://akko-openmetadata:8585`	OpenMetadata API
`AKKO_LITELLM_URL`	`http://akko-litellm:4000`	LiteLLM gateway
`AKKO_COST_BYTES_LIMIT`	`1073741824`	EXPLAIN cost gate (1 GiB)
`AKKO_CACHE_TTL_SECONDS`	`300`	Cache TTL
`AKKO_MAX_TABLES`	`5`	Max candidate tables sent to LLM

Operational Notes¶

ADEN state is stateless except for the PVC of published dashboards. Killing the pod is safe.
Session memory lives in Redis when AKKO_REDIS_URL is set; otherwise in-process (lost on restart).
The Streamlit sidecar mounts the akko-aden-reports PVC read-only and serves it at reports.<domain>.
Tempo traces are emitted with service name akko-aden — follow a request id from Cockpit to SQL execution.