ADEN — Autonomous Data Exploration Node¶
ADEN is AKKO's sovereign open-source answer to Databricks Genie, Snowflake Copilot and Dremio Text-to-SQL. It converts natural-language questions into Trino SQL, enforces per-user authorization, gates runaway queries, redacts PII, and publishes the result as a Streamlit dashboard — entirely inside your cluster.
For the service-reference page (endpoints, env vars, images) see Services / ADEN. This page focuses on the AI capability and how ADEN fits in the stack.
Why ADEN¶
| Feature | ADEN | Databricks Genie | Snowflake Copilot |
|---|---|---|---|
| Self-hosted, OSS | Yes | No | No |
| RBAC enforcement | OPA | Unity Catalog | Snowflake Roles |
| PII redaction | Yes | Yes | Yes |
| EXPLAIN cost gate | Yes | No | No |
| Publish dashboard | Streamlit | Genie session | None |
| Conversation memory | Yes | Yes | Yes |
| 100 % offline | Yes | No | No |
Architecture¶
flowchart LR
USER[Cockpit chat] -->|POST /ask| FE[/api/cockpit/aden/]
FE --> A[ADEN<br/>FastAPI]
A --> CACHE{TTL cache<br/>5 min}
CACHE -->|miss| OM[OpenMetadata<br/>_search]
OM --> OPA[OPA<br/>/v1/data/akko/aden/allow]
OPA --> LLM[LiteLLM<br/>qwen2.5-coder:7b]
LLM --> VALID[sqlglot validate<br/>+ keyword denylist]
VALID --> GATE[Trino EXPLAIN TYPE IO<br/>AKKO_COST_BYTES_LIMIT]
GATE -->|ok| EXEC[Trino exec<br/>user OAuth token]
EXEC --> MASK[PII redaction<br/>via OM tags]
MASK --> PUB[Streamlit publish<br/>PVC akko-aden-reports]
PUB -->|URL| USER
Request Pipeline¶
Every ADEN call is a strict pipeline — each step assumes the previous one succeeded.
- Cache lookup —
sha256(question | role)keyed TTL cache (5 min). Passforce=trueto bypass. - Catalog search — OpenMetadata
_searchreturns the top-N candidate tables for the question. - OPA per-table check — for each candidate, POST
/v1/data/akko/aden/allowwith{user, role, table}. Denied tables are dropped silently from the prompt context. - LLM generation —
qwen2.5-coder:7breceives the system prompt, the allowed tables, and the last 3 turns of session history (whensession_idis provided). sqlglotvalidation — the LLM output is parsed in the Trino dialect. Anything that is not a singleSELECT / WITH / UNION / EXCEPT / INTERSECTis rejected. A keyword denylist (INSERT,UPDATE,DELETE,DROP, etc.) runs as a second line of defence.LIMIT 10000is auto-injected when absent.EXPLAIN (TYPE IO)cost gate — Trino estimates bytes scanned. Over the gate (default 1 GiB,AKKO_COST_BYTES_LIMIT) returnsHTTP 413. The user can resend withconfirm_cost=trueto override.- Trino execution — with the caller's OAuth token, so any row-filter or column-mask upstream still applies.
- PII redaction — columns tagged
PIIin OpenMetadata are masked before the response. - Streamlit dashboard — the result is written to the
akko-aden-reportsPVC and rendered athttps://reports.<domain>/.
Explainability — the reasoning.pipeline_steps trace¶
Every ADEN response carries a reasoning object with a per-step trace that
the cockpit renders as a "Comment ADEN a répondu" panel. Each step records:
| Field | Meaning |
|---|---|
step |
Machine-readable phase name (e.g. catalog_search, opa_authorisation, semantic_layer_match) |
description |
Human-readable explanation of what ADEN did in this phase |
source |
Backend system consulted: verified_queries_cache, openmetadata, open_policy_agent, dbt_semantic_layer, local_llm, trino_executor |
duration_ms |
Wall-clock time for the step |
result |
Compact JSON payload showing the actual evidence (matched table names, OPA allow/deny per table, selected semantic model…) |
Example response fragment (dry-run, execute: false):
{
"reasoning": {
"intent": "list 5 customers",
"pipeline_overview": "1) requête vérifiée ? → 2) recherche catalogue → 3) filtrage OPA par rôle → 4) priorisation tables → 5) couche sémantique dbt → 6) génération SQL par le LLM local → 7) validation syntaxique Trino → 8) exécution avec votre token → 9) réponse",
"sources_queried": ["verified_queries_cache", "openmetadata_catalog", "open_policy_agent", "dbt_semantic_layer", "local_llm"],
"semantic_model_used": null,
"pipeline_steps": [
{"step": "question_received", "source": "aden", "duration_ms": 0, "result": {"question": "list 5 customers", "length": 16}},
{"step": "catalog_search", "source": "openmetadata+trino_information_schema", "duration_ms": 335, "result": {"candidates": ["iceberg.banking.customers", "postgresql.banking.customers"], "candidate_count": 2}},
{"step": "opa_authorisation", "source": "open_policy_agent", "duration_ms": 19, "result": {"decisions": [{"table": "iceberg.banking.customers", "allowed": true}], "allowed_count": 2}},
{"step": "semantic_layer_match", "source": "dbt_semantic_layer", "duration_ms": 0, "result": {"matched_model": null, "reason": "Aucun modèle métier validé — SQL généré ad-hoc depuis les métadonnées tables/colonnes."}}
]
}
}
This answers the question any prospect will ask: "how did the AI find that table / why did it pick that schema / did it actually consult the catalog before hitting the LLM?" — the evidence is in the response itself, not in a log line the data team has to correlate.
Security guarantees — SELECT-only¶
ADEN is strictly read-only by construction. Even in a worst-case
scenario where a prompt injection persuades the LLM to emit DROP TABLE,
four independent layers block the write:
sqlglotparse check — onlySELECT,WITH,UNION,EXCEPT,INTERSECTstatements are accepted. Anything else → HTTP 400.- Keyword regex with comment stripping —
INSERT UPDATE DELETE DROP CREATE ALTER TRUNCATE GRANT REVOKE MERGE CALL REPLACEare rejected as standalone words (so column names likedeleted_atstill work), after stripping--and/* */comments so obfuscation fails. - Multi-statement guard — a second
;<alphanum>check catches any payload that bypasses the parser's split. - OPA
read_onlyflag on the ADEN service account — even if the ADEN process itself is compromised, thesvc-adenservice account hasread_only: truein the OPA policy; Trino refuses the write at the server.
The 22 adversarial payloads in tests/integration/aden-select-only.py
exercise direct DDL, direct DML, multi-statement injection, comment
obfuscation, privilege escalation (GRANT/REVOKE), CALL
system.runtime.kill_query, case variations and markdown-fenced LLM
wrappers. Run locally before any ADEN release:
API Endpoints¶
| Method | Path | Body | Purpose |
|---|---|---|---|
| POST | /ask |
{question, session_id?, force?, confirm_cost?} |
Run a question -> SQL -> dashboard |
| POST | /feedback |
{query_id, thumbs, comment?} |
Capture UX feedback |
| GET | /sessions/{id} |
— | Fetch conversation history |
| GET | /metrics |
— | Prometheus (aden_query_total, aden_query_duration_seconds) |
| GET | /healthz |
— | Liveness |
| GET | /readyz |
— | Readiness (checks Trino, OPA, OM, LiteLLM) |
All mutating routes require a valid Keycloak JWT forwarded by Cockpit via X-User-Id and X-User-Role headers.
Structured Audit Events¶
ADEN emits one JSON line per lifecycle event to Loki (label app=akko-aden):
aden_query_received— entry pointaden_cache_hitaden_no_tables,aden_opa_deniedaden_sql_generated,aden_cost_gate,aden_trino_executedaden_pii_masked,aden_dashboard_publishedaden_feedback
Grafana ships an AKKO ADEN SLO dashboard with p50 / p95 / p99 latency, error-budget burn (1 h / 6 h / 1 d), and top denied tables.
Examples¶
1. Simple question¶
curl -s https://akko.local/api/cockpit/aden/ask \
-H "Content-Type: application/json" \
-H "X-User-Id: alice" \
-H "X-User-Role: akko-analyst" \
-d '{"question": "Top 10 regions by transaction volume last week"}' | jq
2. With session memory¶
curl -s https://akko.local/api/cockpit/aden/ask \
-H "Content-Type: application/json" \
-H "X-User-Id: alice" \
-H "X-User-Role: akko-analyst" \
-d '{"question": "Break it down by day", "session_id": "abc-123"}' | jq
3. Override cost gate¶
curl -s https://akko.local/api/cockpit/aden/ask \
-H "Content-Type: application/json" \
-H "X-User-Id: bob" \
-H "X-User-Role: akko-engineer" \
-d '{"question": "Full year 2025 transactions joined with customers",
"confirm_cost": true}' | jq
Configuration¶
Key env vars (see helm/akko/charts/akko-aden/values.yaml):
| Variable | Default | Purpose |
|---|---|---|
AKKO_TRINO_URL |
https://trino.akko.svc:8443 |
Trino coordinator |
AKKO_OPA_URL |
http://akko-opa:8181 |
OPA service |
AKKO_OM_URL |
http://akko-openmetadata:8585 |
OpenMetadata API |
AKKO_LITELLM_URL |
http://akko-litellm:4000 |
LiteLLM gateway |
AKKO_COST_BYTES_LIMIT |
1073741824 |
EXPLAIN cost gate (1 GiB) |
AKKO_CACHE_TTL_SECONDS |
300 |
Cache TTL |
AKKO_MAX_TABLES |
5 |
Max candidate tables sent to LLM |
Operational Notes¶
- ADEN state is stateless except for the PVC of published dashboards. Killing the pod is safe.
- Session memory lives in Redis when
AKKO_REDIS_URLis set; otherwise in-process (lost on restart). - The Streamlit sidecar mounts the
akko-aden-reportsPVC read-only and serves it atreports.<domain>. - Tempo traces are emitted with service name
akko-aden— follow a request id from Cockpit to SQL execution.