DPIA inventory — personal-data flows in AKKO
Who owns the DPIA? The AKKO operator (the customer) is the data controller and is responsible for filing the DPIA with their DPO. AKKO, the platform vendor, provides this inventory so the controller knows exactly what data AKKO components touch and does not have to reverse-engineer it from logs.
Trigger — GDPR Art. 35(1) requires a DPIA when "a type of processing is likely to result in a high risk to the rights and freedoms of natural persons". A platform like AKKO that ingests authenticated user activity, federates queries across multiple data sources, and runs LLM agents on user-supplied content typically clears that bar.
Reading the inventory
Each row is one distinct personal-data flow. Columns:
| Column | Meaning |
|---|---|
| Component | The AKKO service that owns the data |
| Data category | GDPR Art. 4(1) — what kind of personal data |
| Subject | Whose data it is (employee / customer / end-user) |
| Storage | Where it lands physically |
| Retention | How long the platform keeps it by default |
| Lawful basis | Default basis when AKKO is used as designed (controller may override) |
| Security measure | Technical/organisational measure mitigating risk |
The defaults are AKKO's out-of-the-box configuration. The data controller can tighten retention or change the lawful basis via the runbooks listed in the References section at the end.
1 — Authentication & identity
| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
|---|---|---|---|---|---|---|
| Keycloak (akko-keycloak) | Email, given/family name, group memberships, password hash (bcrypt), OTP secrets | Employee, partner | PostgreSQL keycloak DB | Until account deletion | Contract (Art. 6(1)(b)) — required to authenticate | Bcrypt, optional TOTP, SSO, audit log |
| directory service / 389-DS (akko-directory) | Same as Keycloak when federated | Employee | LDAP backend | Until directory deletes | Contract | TLS 1.3, RBAC |
| oauth2-proxy session cookie | Encrypted session ID, OIDC ID token (claims: sub, email, groups, exp) | Authenticated user | Browser cookie + Redis | Cookie lifetime (default 8h) | Contract | AES-GCM cookie, Secure, HttpOnly, SameSite=Lax |
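
The cookie attributes listed in the last row can be spot-checked from a response header. The following is a minimal sketch, not a shipped AKKO tool: the function name `check_cookie_flags` is hypothetical, and it only inspects the three attributes named in the table (Secure, HttpOnly, SameSite=Lax).

```shell
# Hypothetical helper: reads one Set-Cookie header line on stdin, prints each
# attribute from the table that is absent, and returns non-zero if any is missing.
check_cookie_flags() {
  local header flag missing=0
  header=$(cat)
  for flag in 'Secure' 'HttpOnly' 'SameSite=Lax'; do
    # Attributes follow a semicolon; match case-insensitively.
    if ! printf '%s\n' "$header" | grep -qi -- ";[[:space:]]*$flag"; then
      echo "missing: $flag"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible usage, assuming a placeholder ingress host `akko.example.com`: `curl -sI https://akko.example.com/ | grep -i '^set-cookie:' | head -n 1 | check_cookie_flags`.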
2 — Activity & audit logs
| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
|---|---|---|---|---|---|---|
| OPA decision logs | User ID, role, requested resource, decision, timestamp | Authenticated user | logs layer decision_logs stream | 90 days default | Legitimate interest (Art. 6(1)(f)) — security audit | Append-only, role-restricted access via cockpit |
| Keycloak admin events | Admin user ID, target user ID, action (create/delete/role change), timestamp | Admin + target | PostgreSQL keycloak.admin_events | 90 days default | Legal obligation (Art. 6(1)(c)) — DORA Art. 11 / NIS2 Art. 21 | Tamper-evident, role-restricted |
| Trino query log | User ID, SQL text, source IP, timestamp, bytes scanned | Authenticated user | OPA → logs layer trino_queries | 90 days default | Legitimate interest — audit, capacity planning | Query text may contain personal data if user SELECTs from PII columns; gated by OPA row-filters |
| Airflow audit log | DAG-run user ID, task instance, timestamp | Operator | PostgreSQL airflow.log | 30 days default | Legal obligation | Role-restricted UI |
| Cockpit access log | User ID, page visited, timestamp | Authenticated user | nginx access log → logs layer | 30 days default | Legitimate interest | Aggregated for usage stats |
3 — User-supplied content
| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
|---|---|---|---|---|---|---|
| akko-rag uploaded documents | Document text + embeddings (pgvector); document filename, uploader user ID | Uploader's customers/contracts (potentially) | PostgreSQL akko_rag.documents + SeaweedFS akko-rag/originals/ | No automatic deletion — operator-driven | Consent (Art. 6(1)(a)) when end-user uploads via UI; otherwise contract | TLS in transit, pgcrypto column-level encryption available, OPA gate on /query |
| ADEN natural-language questions | Question text, user ID, timestamp | Authenticated user | PostgreSQL aden.questions | 30 days default | Legitimate interest — model improvement | Operator can disable persistence via aden.persistQuestions=false |
| ADEN dashboards | Dashboard JSON (may contain query text + user ID) | Authenticated user | SeaweedFS aden/dashboards/ | Until the user deletes them | Legitimate interest | OPA-gated, signed URLs |
| Notebook outputs | Cell output text/images (may include personal data the user printed) | Notebook owner | JupyterHub PVC | Operator-controlled | Consent | Per-user PVC, no cross-user read |
| MLflow experiments | Experiment author, run params, metrics, artifact URIs | Authenticated user | PostgreSQL mlflow + SeaweedFS mlflow/ | Operator-controlled | Contract | Role-restricted via Keycloak |
4 — LLM-mediated processing
| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
|---|---|---|---|---|---|---|
| LiteLLM gateway | Routed prompts + completions | Authenticated user | In-memory only by default; optional log to logs layer | 0 (no persistence) by default | Contract | TLS to upstream Ollama; no external API call by default — sovereign |
| Ollama (akko-llm) | Same as above | Same | In-memory only | 0 | Contract | Air-gapped — model runs on local GPU/CPU |
| ADEN reasoning trace | Pipeline step inputs/outputs, retrieved context, generated SQL | Authenticated user | PostgreSQL aden.reasoning | 30 days default | Legitimate interest — explainability (Sprint 41) | Operator-controlled retention; user can clear own trace |
5 — Data lake & warehouse
| Component | Data category | Subject | Storage | Retention | Lawful basis | Security measure |
|---|---|---|---|---|---|---|
| SeaweedFS object storage | Whatever the customer ingests (may include PII) | Customer's data subjects | SeaweedFS volumes | Operator-controlled | Customer-defined | Optional volume-level encryption (-volume.encrypted=true — Sprint 52 P1) |
| PostgreSQL akko_data | Same | Same | PostgreSQL data DB | Operator-controlled | Customer-defined | pgcrypto column-level encryption available |
| Iceberg tables (Polaris) | Same | Same | SeaweedFS iceberg/ | Operator-controlled | Customer-defined | OPA row/column policies via Trino |
| OpenMetadata catalog | Tags + dataset descriptions (no row data) | — | PostgreSQL openmetadata + OpenSearch | Operator-controlled | — | Metadata only, no PII bodies |
How to handle a Subject Access Request (GDPR Art. 15)
Run from the akko-aden namespace, where the controller's bot account has read access across all subsystems. The dsr.sh helper ships at scripts/dsr.sh and produces a single JSON bundle for the user.
```shell
# Identify the user (Keycloak userId, not email; emails can change)
USER_ID=$(curl -s -H "Authorization: Bearer $ADMIN_TOKEN" \
  "https://identity.akko-ai.com/admin/realms/akko/users?email=user@example.com" \
  | jq -r '.[0].id')

# Bundle every flow for that user (output filename quoted for safety)
bash scripts/dsr.sh "$USER_ID" > "dsr-$USER_ID.json"
```
The bundle covers Keycloak identity + OPA decisions + Trino queries + Airflow runs + ADEN questions + MLflow experiments + notebook PVC listing + uploaded RAG documents. Hand it to the requester within the 1-month statutory window.
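
Before handing the bundle over, it can help to confirm that the file is valid JSON and to see which sections it contains. The sketch below assumes the bundle is a single JSON object with one top-level key per subsystem. The actual key names emitted by scripts/dsr.sh are not specified here, and `summarize_bundle` is a hypothetical helper name.

```shell
# Hypothetical helper: list the top-level sections of a DSR bundle.
# Assumes the bundle is one JSON object; fails if the file is not valid JSON
# or is not an object.
summarize_bundle() {
  local bundle="$1"
  jq -r 'if type == "object"
         then keys[]
         else error("bundle is not a JSON object")
         end' "$bundle"
}
```

For example, `summarize_bundle "dsr-$USER_ID.json"` prints one section name per line, which can be compared against the subsystems listed above.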
How to handle a Right-to-Erasure (GDPR Art. 17)
Same bundle scope, but deletion is operator-acknowledged because some records are subject to legal retention (e.g. DORA Art. 11 requires audit log retention even after the subject requests erasure). The script emits a deletion plan, the operator signs off, and the script then executes the plan per component, writing a tombstone to the audit log.
```shell
bash scripts/erasure.sh "$USER_ID" --dry-run   # plan only
bash scripts/erasure.sh "$USER_ID" --execute   # after sign-off
```
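
The plan, sign-off, execute sequence can be wrapped so that the second step cannot run without an explicit acknowledgement. This is a minimal sketch under stated assumptions: the `ERASURE_CMD` override and the function name `erase_with_signoff` are hypothetical conveniences, and typing the user ID back is only one possible form of sign-off.

```shell
# ERASURE_CMD defaults to the script named in this document; the override
# exists only so the wrapper can be exercised without touching real data.
ERASURE_CMD="${ERASURE_CMD:-bash scripts/erasure.sh}"

# Hypothetical wrapper: show the plan, ask for confirmation, then execute.
erase_with_signoff() {
  local user_id="$1" answer
  # Phase 1: print the deletion plan without changing anything.
  $ERASURE_CMD "$user_id" --dry-run || return 1
  # Phase 2: the operator must type the user ID back as acknowledgement.
  printf 'Type the user ID to confirm erasure: ' >&2
  read -r answer
  if [ "$answer" != "$user_id" ]; then
    echo "Sign-off not given; nothing deleted." >&2
    return 2
  fi
  $ERASURE_CMD "$user_id" --execute
}
```

Calling `erase_with_signoff "$USER_ID"` keeps both phases in one terminal session, so the reviewed plan and the executed plan refer to the same user ID.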
Cross-border transfers
AKKO is sovereign by design: no service makes outbound HTTPS calls in the default deployment. The data controller can verify this by inspecting the NetworkPolicies (kubectl get netpol -n akko): only Traefik ingress and intra-namespace traffic are allowed, and egress to the public Internet is blocked unless explicitly enabled (e.g. for SIEM forwarding, documented in SIEM forwarder).
If the operator activates an external SIEM target (Splunk Cloud, Sentinel, Elastic Cloud), the activation is itself a transfer and the operator must add it to their DPIA addenda with the chosen SCC / adequacy basis.
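
To narrow the NetworkPolicy review to the policies that open an outbound path, the JSON output of kubectl can be filtered with jq. The sketch below is illustrative only: `policies_with_egress` is a hypothetical name, and it reports policies that declare at least one egress rule without judging whether the destination is internal or external.

```shell
# Hypothetical helper: reads `kubectl get netpol -o json` output on stdin and
# prints the names of policies that declare at least one egress rule.
policies_with_egress() {
  jq -r '.items[]
         | select((.spec.egress // []) | length > 0)
         | .metadata.name'
}

# Usage (namespace taken from the text above):
#   kubectl get netpol -n akko -o json | policies_with_egress
```

An empty result is not proof on its own that egress is blocked: in Kubernetes, a pod that is not selected by any policy with the Egress policy type has unrestricted egress, so the pod selectors need checking as well.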
Sprint 59 addendum — ADEN ADR-041/042/043 flows
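The Sprint 59 ADEN refactor introduces three new internal flows. They reason about table metadata and the caller's role rather than row data, but they are not all free of PII: the ADR-042 cache holds question text, which can contain personal data (see the notes under the table). The controller's DPO therefore needs to know that these flows exist and where the data lives.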
| Flow | Component | Data category | Storage | Retention | Lawful basis |
|---|---|---|---|---|---|
| ADR-041 OPA scope-first batch | OPA + ADEN routes_query.py | Caller's Keycloak role + a list of candidate catalog.schema.table FQNs (no row data) | Process memory only | Per-request, never persisted | Legitimate interest (Art. 6(1)(f)) — access enforcement |
| ADR-042 Tier 2 semantic cache | ADEN cache_layers.py (in-process LRU) | Question text → 768-d embedding + 32-char cache key. Same role/model partition. | Process memory only (no PVC, no Postgres) | TTL 1 h, max 512 entries per role/model | Legitimate interest — perf optimisation. Bypass via body.force=true or pod restart. |
| ADR-043 Milvus semantic catalog | akko-milvus (off by default) | Table descriptions + column names + 768-d embeddings — no row data | StatefulSet PVC, in-cluster only | Until next aden-catalog-indexer CronJob run (default hourly) | Legitimate interest — semantic catalog search |
Notes for the controller:
- ADR-042 cache content can include user questions (the prompt itself). When questions contain personal data ("show transactions for client X"), the embedding is reversible only with the same nomic-embed-text-v2 model — but the question text is also kept verbatim long enough to build the cache key. Treat as personal data. RTBF is honoured by the TTL and by kubectl rollout restart deploy/akko-akko-aden.
- ADR-043 Milvus stores schema metadata, not data rows. Listing the bank's table names and column lists is not personal data per se, but the controller may still want to assess sensitivity (e.g. table named client_terminations_2026 is itself revealing).
- All three flows respect ADR-041 allowed_tables — a viewer's cache hit can never be served to an admin or vice versa (per-(role, model) bucketing). The cache write path also enforces remember_semantic with the caller's role — proven by tests/test_cache_layers.py::test_semantic_cache_role_partition.
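
The per-(role, model) bucketing in the last note can be pictured as the partition being part of the cache key itself. The sketch below is an illustration only and is not the derivation used in cache_layers.py: the hash (SHA-256 truncated to 32 hex characters), the field order, and the function name `cache_key` are all assumptions.

```shell
# Illustrative only: derive a 32-character key in which role and model are
# part of the hashed input, so two roles never share an entry.
cache_key() {
  local role="$1" model="$2" question="$3"
  # The unit separator (octal 037) keeps adjacent fields from running together.
  printf '%s\037%s\037%s' "$role" "$model" "$question" | sha256sum | cut -c1-32
}
```

With this shape, `cache_key viewer nomic-embed-text-v2 "show transactions"` and the same call with `admin` yield different keys, which is the property the partition test named above is described as checking.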
References
- Compliance matrix — DORA / NIS2 / GDPR mapping
- Audit playbook — extracting the audit trail
- Audit trail — log shape and retention controls
- SIEM forwarder — third-party SIEM integration
- Encryption at rest — pgcrypto + SeaweedFS volume encryption
- DR playbook — recovery procedures (DORA Art. 11)
- GDPR Art. 35 — DPIA trigger
- ENISA "Handbook on Security of Personal Data Processing" — technical guidance