DPIA inventory — personal-data flows in AKKO

Who owns the DPIA? The AKKO operator (the customer) is the data controller and is responsible for filing the DPIA with its DPO. AKKO, as the platform vendor, provides this inventory so the controller knows exactly what data AKKO components touch and does not have to reverse-engineer it from logs.

Trigger: GDPR Art. 35(1) requires a DPIA where "a type of processing [...] is likely to result in a high risk to the rights and freedoms of natural persons". A platform like AKKO that ingests authenticated user activity, federates queries across multiple data sources, and runs LLM agents on user-supplied content typically clears that bar.

Reading the inventory

Each row is one distinct personal-data flow. Columns:

| Column | Meaning |
| --- | --- |
| Component | The AKKO service that owns the data |
| Data category | What kind of personal data (GDPR Art. 4(1)) |
| Subject | Whose data it is (employee / customer / end-user) |
| Storage | Where it lands physically |
| Retention | How long the platform keeps it by default |
| Lawful basis | Default basis when AKKO is used as designed (controller may override) |
| Security measure | Technical/organisational measure mitigating risk |

The defaults are AKKO's out-of-the-box configuration. The data controller can tighten retention or change the lawful basis via the runbooks listed in the References section at the end.


1 — Authentication & identity

| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
| --- | --- | --- | --- | --- | --- | --- |
| Keycloak (akko-keycloak) | Email, given/family name, group memberships, password hash (bcrypt), OTP secrets | Employee, partner | PostgreSQL keycloak DB | Until account deletion | Contract (Art. 6(1)(b)): required to authenticate | Bcrypt, optional TOTP, SSO, audit log |
| Directory service / 389-DS (akko-directory) | Same as Keycloak when federated | Employee | LDAP backend | Until directory deletes | Contract | TLS 1.3, RBAC |
| oauth2-proxy session cookie | Encrypted session ID, OIDC ID token (claims: sub, email, groups, exp) | Authenticated user | Browser cookie + Redis | Cookie lifetime (default 8h) | Contract | AES-GCM cookie, Secure, HttpOnly, SameSite=Lax |

2 — Activity & audit logs

| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
| --- | --- | --- | --- | --- | --- | --- |
| OPA decision logs | User ID, role, requested resource, decision, timestamp | Authenticated user | logs layer decision_logs stream | 90 days default | Legitimate interest (Art. 6(1)(f)): security audit | Append-only, role-restricted access via cockpit |
| Keycloak admin events | Admin user ID, target user ID, action (create/delete/role change), timestamp | Admin + target | PostgreSQL keycloak.admin_events | 90 days default | Legal obligation (Art. 6(1)(c)): DORA Art. 11 / NIS2 Art. 21 | Tamper-evident, role-restricted |
| Trino query log | User ID, SQL text, source IP, timestamp, bytes scanned | Authenticated user | OPA → logs layer trino_queries | 90 days default | Legitimate interest: audit, capacity planning | Query text may contain personal data if user SELECTs from PII columns; gated by OPA row-filters |
| Airflow audit log | DAG-run user ID, task instance, timestamp | Operator | PostgreSQL airflow.log | 30 days default | Legal obligation | Role-restricted UI |
| Cockpit access log | User ID, page visited, timestamp | Authenticated user | nginx access log → logs layer | 30 days default | Legitimate interest | Aggregated for usage stats |

3 — User-supplied content

| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
| --- | --- | --- | --- | --- | --- | --- |
| akko-rag uploaded documents | Document text + embeddings (pgvector); document filename, uploader user ID | Uploader's customers/contracts (potentially) | PostgreSQL akko_rag.documents + SeaweedFS akko-rag/originals/ | No automatic deletion; operator-driven | Consent (Art. 6(1)(a)) when end-user uploads via UI; otherwise contract | TLS in transit, pgcrypto column-level encryption available, OPA gate on /query |
| ADEN natural-language questions | Question text, user ID, timestamp | Authenticated user | PostgreSQL aden.questions | 30 days default | Legitimate interest: model improvement | Operator can disable persistence via aden.persistQuestions=false |
| ADEN dashboards | Dashboard JSON (may contain query text + user ID) | Authenticated user | SeaweedFS aden/dashboards/ | Until user-deletes | Legitimate interest | OPA-gated, signed URLs |
| Notebook outputs | Cell output text/images (may include personal data the user printed) | Notebook owner | JupyterHub PVC | Operator-controlled | Consent | Per-user PVC, no cross-user read |
| MLflow experiments | Experiment author, run params, metrics, artifact URIs | Authenticated user | PostgreSQL mlflow + SeaweedFS mlflow/ | Operator-controlled | Contract | Role-restricted via Keycloak |

4 — LLM-mediated processing

| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
| --- | --- | --- | --- | --- | --- | --- |
| LiteLLM gateway | Routed prompts + completions | Authenticated user | In-memory only by default; optional log to logs layer | 0 (no persistence) by default | Contract | TLS to upstream Ollama; no external API call by default (sovereign) |
| Ollama (akko-llm) | Same as above | Same | In-memory only | 0 | Contract | Air-gapped: model runs on local GPU/CPU |
| ADEN reasoning trace | Pipeline step inputs/outputs, retrieved context, generated SQL | Authenticated user | PostgreSQL aden.reasoning | 30 days default | Legitimate interest: explainability (Sprint 41) | Operator-controlled retention; user can clear own trace |

5 — Data lake & warehouse

| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
| --- | --- | --- | --- | --- | --- | --- |
| SeaweedFS object storage | Whatever the customer ingests (may include PII) | Customer's data subjects | SeaweedFS volumes | Operator-controlled | Customer-defined | Optional volume-level encryption (-volume.encrypted=true, Sprint 52 P1) |
| PostgreSQL akko_data | Same | Same | PostgreSQL data DB | Operator-controlled | Customer-defined | pgcrypto column-level encryption available |
| Iceberg tables (Polaris) | Same | Same | SeaweedFS iceberg/ | Operator-controlled | Customer-defined | OPA row/column policies via Trino |
| OpenMetadata catalog | Tags + dataset descriptions (no row data) | | PostgreSQL openmetadata + OpenSearch | Operator-controlled | | Metadata only, no PII bodies |

How to handle a Subject Access Request (GDPR Art. 15)

Run the helper from the akko-aden namespace, where the controller's bot account has read access across all subsystems. The dsr.sh helper ships at scripts/dsr.sh and produces a single JSON bundle for the user.

# Identify the user (Keycloak userId, not email; emails can change)
# exact=true avoids partial email matches; -f makes curl fail on HTTP errors
USER_ID=$(curl -sf -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://identity.akko-ai.com/admin/realms/akko/users?email=user@example.com&exact=true" \
  | jq -r '.[0].id // empty')
: "${USER_ID:?no Keycloak user matched that email}"

# Bundle every flow for that user
bash scripts/dsr.sh "$USER_ID" > "dsr-${USER_ID}.json"

The bundle covers Keycloak identity, OPA decisions, Trino queries, Airflow runs, ADEN questions, MLflow experiments, the notebook PVC listing, and uploaded RAG documents. Hand it to the requester within the one-month statutory window (GDPR Art. 12(3)).
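
To keep track of the one-month window, the due date can be computed when the request is logged. The following is a minimal sketch, assuming GNU date (coreutils) is available; dsr_due_date is a hypothetical helper and is not part of scripts/dsr.sh.

```shell
#!/usr/bin/env bash
# Sketch: latest response date for a subject access request.
# Assumes GNU date; BSD/macOS date uses a different option syntax.
# dsr_due_date is a hypothetical name used for illustration only.
dsr_due_date() {
  local received="$1"   # date the request was received, YYYY-MM-DD
  # Add one calendar month to the receipt date and print it as YYYY-MM-DD.
  date -u -d "${received} +1 month" +%F
}
```

For example, `dsr_due_date 2025-01-15` prints 2025-02-15. GNU date rolls over into the following month when the target day does not exist (a request received on the 31st, for instance), so end-of-month dates are worth confirming with the DPO. Art. 12(3) also allows an extension of up to two further months for complex or numerous requests, which this sketch does not model.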

How to handle a Right-to-Erasure (GDPR Art. 17)

Erasure covers the same scope as the SAR bundle, but deletion requires operator acknowledgement because some records are subject to legal retention (e.g. DORA Art. 11 requires audit log retention even after the subject requests erasure). The script emits a deletion plan, the operator signs off, and the script then executes the plan per component, writing a tombstone to the audit log.

bash scripts/erasure.sh "$USER_ID" --dry-run    # plan only
bash scripts/erasure.sh "$USER_ID" --execute    # after sign-off
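
The plan, sign-off, execute sequence can be wrapped so that the execute step cannot run without an explicit acknowledgement. This is a sketch only: run_erasure, ERASURE_SCRIPT, and the plan file name are hypothetical, and erasure.sh may already implement its own sign-off mechanism.

```shell
#!/usr/bin/env bash
# Sketch of a sign-off gate around scripts/erasure.sh.
# ERASURE_SCRIPT and run_erasure are hypothetical names for illustration.
ERASURE_SCRIPT="${ERASURE_SCRIPT:-scripts/erasure.sh}"

run_erasure() {
  local user_id="$1" plan answer
  plan="erasure-plan-${user_id}.txt"

  # Step 1: produce the deletion plan without touching any data.
  bash "$ERASURE_SCRIPT" "$user_id" --dry-run > "$plan" || return 1
  echo "Deletion plan written to ${plan}" >&2

  # Step 2: require the operator to retype the user ID as acknowledgement.
  read -r -p "Type the user ID to confirm erasure: " answer
  if [ "$answer" != "$user_id" ]; then
    echo "Sign-off not given; nothing deleted." >&2
    return 2
  fi

  # Step 3: execute only after sign-off.
  bash "$ERASURE_SCRIPT" "$user_id" --execute
}
```

Calling `run_erasure "$USER_ID"` leaves the plan file on disk, so it can be attached to the erasure record alongside the audit-log tombstone.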

Cross-border transfers

AKKO is sovereign by design: no service makes outbound HTTPS calls in the default deployment. The data controller can verify this by inspecting the NetworkPolicies (kubectl get netpol -n akko): only Traefik ingress and intra-namespace traffic are allowed, and egress to the public Internet is blocked unless explicitly enabled (e.g. for SIEM forwarding, documented in SIEM forwarder).

If the operator activates an external SIEM target (Splunk Cloud, Sentinel, Elastic Cloud), the activation is itself a transfer and the operator must add it to their DPIA addenda with the chosen SCC / adequacy basis.
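
The NetworkPolicy inspection can be narrowed to the policies that matter for transfers. The sketch below assumes jq is installed (it is already used in the SAR snippet); policies_with_egress is a hypothetical helper that reads the JSON output of kubectl on stdin and prints the names of policies containing at least one egress rule.

```shell
#!/usr/bin/env bash
# Sketch: list NetworkPolicies that define at least one egress rule.
# Input: output of `kubectl get netpol -n akko -o json` on stdin.
# policies_with_egress is a hypothetical name used for illustration.
policies_with_egress() {
  # A missing or empty .spec.egress list means the policy allows no egress.
  jq -r '.items[]
         | select((.spec.egress // []) | length > 0)
         | .metadata.name'
}
```

Usage: `kubectl get netpol -n akko -o json | policies_with_egress`. Each name printed is a candidate for review, not proof of an external transfer, since an egress rule may target only in-cluster pods. The reverse also holds: in Kubernetes, pods not selected by any policy with the Egress policy type are unrestricted, so an empty result is only meaningful if a default-deny egress policy is in place.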

Sprint 59 addendum — ADEN ADR-041/042/043 flows

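The Sprint 59 ADEN refactor introduces three new internal flows. ADR-041 and ADR-043 reason only about table metadata and the caller's role and touch no user PII; the ADR-042 cache briefly holds question text, which can contain personal data (see the notes below the table). The controller's DPO must know that all three exist and where the data lives.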

| Flow | Component | Data category | Storage | Retention | Lawful basis |
| --- | --- | --- | --- | --- | --- |
| ADR-041 OPA scope-first batch | OPA + ADEN routes_query.py | Caller's Keycloak role + a list of candidate catalog.schema.table FQNs (no row data) | Process memory only | Per-request, never persisted | Legitimate interest (Art. 6(1)(f)): access enforcement |
| ADR-042 Tier 2 semantic cache | ADEN cache_layers.py (in-process LRU) | Question text → 768-d embedding + 32-char cache key. Same role/model partition. | Process memory only (no PVC, no Postgres) | TTL 1 h, max 512 entries per role/model | Legitimate interest: perf optimisation. Bypass via body.force=true or pod restart. |
| ADR-043 Milvus semantic catalog | akko-milvus (off by default) | Table descriptions + column names + 768-d embeddings (no row data) | StatefulSet PVC, in-cluster only | Until next aden-catalog-indexer CronJob run (default hourly) | Legitimate interest: semantic catalog search |

Notes for the controller:

  • ADR-042 cache content can include user questions (the prompt itself). When a question contains personal data ("show transactions for client X"), the embedding is reversible only with the same nomic-embed-text-v2 model, but the question text is also kept verbatim long enough to build the cache key. Treat the cache as personal data. The right to be forgotten (RTBF, GDPR Art. 17) is honoured by the 1 h TTL, or immediately by kubectl rollout restart deploy/akko-akko-aden.
  • ADR-043 Milvus stores schema metadata, not data rows. Listing the bank's table names and column lists is not personal data per se, but the controller may still want to assess sensitivity (e.g. table named client_terminations_2026 is itself revealing).
  • All three flows respect ADR-041 allowed_tables — a viewer's cache hit can never be served to an admin or vice versa (per-(role, model) bucketing). The cache write path also enforces remember_semantic with the caller's role — proven by tests/test_cache_layers.py::test_semantic_cache_role_partition.
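
The per-(role, model) bucketing in the last note can be pictured as a cache key that mixes the caller's role and the model name into the hash, so identical questions asked under different roles never share an entry. This is a minimal sketch, assuming a SHA-256 digest truncated to 32 hex characters; cache_key is a hypothetical helper, and the actual derivation in cache_layers.py is not shown in this document.

```shell
#!/usr/bin/env bash
# Sketch: role- and model-partitioned cache key (illustrative only).
# cache_key is hypothetical; the real logic lives in ADEN cache_layers.py.
cache_key() {
  local role="$1" model="$2" question="$3"
  # NUL separators keep ("ab","c") and ("a","bc") from hashing identically.
  printf '%s\0%s\0%s' "$role" "$model" "$question" \
    | sha256sum \
    | cut -c1-32   # keep the first 32 hex characters of the digest
}
```

With this scheme, `cache_key viewer "$MODEL" "$Q"` and `cache_key admin "$MODEL" "$Q"` return different keys for the same question, so a lookup under one role cannot hit an entry written under another.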

References