# Data Flow
This page traces a piece of data from an upstream OLTP system all the way to an executive dashboard and an AI agent answering a natural-language question.
## End-to-End Pipeline
```mermaid
flowchart LR
    subgraph Sources
        OLTP[(OLTP / SaaS<br/>PostgreSQL, APIs, files)]
    end
    subgraph Ingestion[Layer 1 — Ingestion]
        DAG[Airflow DAG<br/>akko_banking_fraud_demo]
        SPARKI[Spark Connect<br/>JDBC replicator]
    end
    subgraph Storage[Layer 2 — Storage]
        PG[(PostgreSQL<br/>akko-postgresql-data)]
        MINIO[(SeaweedFS<br/>akko-warehouse bucket)]
        POL[Apache Polaris<br/>Iceberg REST catalog]
    end
    subgraph Compute[Layer 3 — Compute]
        TRINO[Trino 480<br/>federated SQL]
        DBT[dbt Core<br/>semantic layer]
        SPARK[Spark Connect<br/>ETL + ML features]
    end
    subgraph Consume[Layer 5 — Consumption]
        SUP[Superset<br/>dashboards]
        JH[JupyterHub<br/>notebooks]
        ADEN[ADEN<br/>NL to SQL to Streamlit]
    end
    subgraph Govern[Layer 6 — Governance]
        OM[OpenMetadata<br/>lineage + tags]
        KC[Keycloak<br/>OIDC tokens]
    end

    OLTP -->|CDC / batch| DAG
    DAG --> PG
    DAG --> SPARKI
    SPARKI -->|MERGE INTO iceberg.banking.transactions| POL
    POL -.metadata.-> MINIO
    POL --> TRINO
    PG --> TRINO
    TRINO --> DBT
    DBT --> POL
    TRINO --> SUP
    TRINO --> JH
    TRINO --> ADEN
    SPARK --> JH
    OM -.column-level lineage.-> TRINO
    OM -.ingest.-> POL
    KC -.user token.-> TRINO
    KC -.user token.-> SUP
    KC -.user token.-> ADEN
```
## Step-by-Step

### 1. Raw ingestion — Postgres / files / API

Airflow DAGs land raw rows in `akko-postgresql-data` or directly in SeaweedFS (S3). The `akko_banking_fraud_demo` DAG calls `generate_transactions.py`, then hands off to Spark Connect for JDBC replication.
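A minimal sketch of what such an ingestion DAG could look like, assuming two `PythonOperator` tasks. The task ids, schedule, and callable bodies are illustrative placeholders, not the demo's actual code:

```python
# Illustrative sketch in the spirit of akko_banking_fraud_demo.
# Task ids, schedule, and the callables are assumptions, not the real DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def generate_transactions(**_):
    # Placeholder for generate_transactions.py: write synthetic rows into Postgres.
    ...


def replicate_to_iceberg(**_):
    # Placeholder: open a Spark Connect session and MERGE new rows into Iceberg.
    ...


with DAG(
    dag_id="akko_banking_fraud_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # assumed: triggered on demand for the demo
    catchup=False,
) as dag:
    seed = PythonOperator(
        task_id="generate_transactions", python_callable=generate_transactions
    )
    replicate = PythonOperator(
        task_id="replicate_to_iceberg", python_callable=replicate_to_iceberg
    )

    seed >> replicate  # land raw rows first, then hand off to Spark Connect
```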
### 2. Lakehouse write — Iceberg on SeaweedFS

Spark writes partitioned Iceberg tables (`iceberg.banking.transactions`, partitioned by `region`). Polaris stores the schema, snapshot history, and partition spec in its own PostgreSQL backend.
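A sketch of the replication step over Spark Connect. The endpoint, JDBC connection details, and the `transaction_id` merge key are assumptions; the target table comes from the description above:

```python
# Sketch of the Spark Connect replication step. Endpoint, JDBC options, and the
# merge key are assumptions; the target Iceberg table is the one named in the docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://spark-connect:15002")  # assumed in-cluster Spark Connect endpoint
    .getOrCreate()
)

# Read the newly landed rows from Postgres over JDBC (connection details assumed).
new_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://akko-postgresql-data:5432/banking")
    .option("dbtable", "public.transactions")
    .option("user", "replicator")
    .option("password", "***")
    .load()
)
new_rows.createOrReplaceTempView("staged_transactions")

# Upsert into the Iceberg table registered in the Polaris REST catalog.
spark.sql("""
    MERGE INTO iceberg.banking.transactions AS t
    USING staged_transactions AS s
    ON t.transaction_id = s.transaction_id   -- assumed primary key
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```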
### 3. Query — Trino + dbt

Trino federates reads across the `iceberg` and `postgresql` catalogs. dbt models (`dbt-trino` adapter) build curated layers (raw, staging, analytics). Every dbt run emits OpenLineage events that OpenMetadata consumes for column-level lineage.
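For illustration, a federated read through the `trino` Python client that joins the Iceberg table to a hypothetical Postgres-side `customers` table; the host, port, and column names are assumptions:

```python
# Sketch of a federated query across the iceberg and postgresql catalogs.
# Host, port, and the customers table/columns are assumptions for illustration.
import trino

conn = trino.dbapi.connect(
    host="trino",          # assumed in-cluster service name
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="banking",
)

cur = conn.cursor()
cur.execute("""
    SELECT t.region, count(*) AS txn_count
    FROM iceberg.banking.transactions AS t
    JOIN postgresql.public.customers AS c      -- hypothetical Postgres-side table
      ON t.customer_id = c.customer_id
    GROUP BY t.region
    ORDER BY txn_count DESC
""")
for region, txn_count in cur.fetchall():
    print(region, txn_count)
```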
### 4. Consumption — Superset, JupyterHub, ADEN

- Superset runs SQL against Trino using the user's Keycloak JWT — RBAC + masks applied per query (see the token sketch after this list).
- JupyterHub spawns `akko-notebook` pods with `jupyter-ai`, `dbt-trino`, `pyspark` and `duckdb`.
- ADEN (Layer 4) takes a natural-language question, calls OpenMetadata + OPA + LiteLLM, then executes the SQL through Trino with the caller's identity.
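As referenced above, a sketch of how a consumer could pass the caller's Keycloak JWT through to Trino so that RBAC and masking apply per user; the token value and connection details are placeholders:

```python
# Sketch of per-user access: hand the caller's Keycloak access token to Trino.
# Token, host, and catalog/schema are placeholders, not the platform's actual config.
import trino
from trino.auth import JWTAuthentication

keycloak_jwt = "<access token issued by Keycloak for the logged-in user>"

conn = trino.dbapi.connect(
    host="trino",
    port=443,
    http_scheme="https",                  # JWT auth requires TLS
    auth=JWTAuthentication(keycloak_jwt),
    user="analyst",                       # Trino derives the authenticated principal from the JWT
    catalog="iceberg",
    schema="banking",
)

cur = conn.cursor()
cur.execute("SELECT count(*) FROM iceberg.banking.transactions")
print(cur.fetchone()[0])
```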
### 5. Governance — OpenMetadata + OPA + Keycloak

Every table lands in OpenMetadata with tags (`PII`, `GDPR.Personal`), business glossary terms, and lineage. OPA policies gate every Trino query on top of Keycloak role claims.
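As one way to picture that gate, a sketch of a decision request against OPA's standard `POST /v1/data` API; the policy path (`trino/allow`) and the input shape are assumptions about how the check could be wired, not the platform's actual policy:

```python
# Sketch of an OPA decision request before a Trino query is allowed to run.
# The policy package/rule name and the input document are assumptions.
import requests

decision = requests.post(
    "http://opa:8181/v1/data/trino/allow",   # assumed policy package "trino", rule "allow"
    json={
        "input": {
            "user": {"name": "analyst", "roles": ["fraud_analyst"]},  # Keycloak role claims
            "table": "iceberg.banking.transactions",
            "tags": ["PII", "GDPR.Personal"],                         # tags from OpenMetadata
        }
    },
    timeout=5,
).json()

if not decision.get("result", False):
    raise PermissionError("OPA denied access to iceberg.banking.transactions")
```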
## Reference Demo

The banking-fraud demo (`/demos/banking-fraud/`) runs this flow end-to-end in under 10 minutes on k3d:
- Seed synthetic transactions in Postgres.
- Replicate to Iceberg via Spark Connect.
- Train an Isolation Forest in MLflow (a minimal training sketch follows this list).
- Score transactions with `akko_ai_anomaly()` + `akko_ai_sentiment()` in a single MERGE.
- Refresh the Superset dashboard.
- Ask ADEN: "Which regions saw a fraud rate above 5% last week?"
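The Isolation Forest step referenced above, as a minimal sketch: fit the model on a couple of toy features and log it to MLflow. The feature names, tracking URI, and experiment name are assumptions, not the demo's actual code.

```python
# Minimal sketch of the model-training step: fit an Isolation Forest and log it to MLflow.
# Tracking URI, experiment name, and features are assumptions for illustration only.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest

mlflow.set_tracking_uri("http://mlflow:5000")      # assumed in-cluster MLflow service
mlflow.set_experiment("banking-fraud-demo")        # assumed experiment name

# In the demo these rows would come from iceberg.banking.transactions via Trino or Spark.
features = pd.DataFrame(
    {"amount": [12.5, 8400.0, 33.2], "txn_per_hour": [1, 42, 2]}
)

with mlflow.start_run():
    model = IsolationForest(contamination=0.05, random_state=42)
    model.fit(features)
    mlflow.log_param("contamination", 0.05)
    mlflow.sklearn.log_model(model, "isolation_forest")
```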