Data Flow

This page traces a piece of data from an upstream OLTP system all the way to an executive dashboard and an AI agent answering a natural-language question.

End-to-End Pipeline

flowchart LR
    subgraph Sources
        OLTP[(OLTP / SaaS<br/>PostgreSQL, APIs, files)]
    end
    subgraph Ingestion[Layer 1 — Ingestion]
        DAG[Airflow DAG<br/>akko_banking_fraud_demo]
        SPARKI[Spark Connect<br/>JDBC replicator]
    end
    subgraph Storage[Layer 2 — Storage]
        PG[(PostgreSQL<br/>akko-postgresql-data)]
        SFS[(SeaweedFS<br/>akko-warehouse bucket)]
        POL[Apache Polaris<br/>Iceberg REST catalog]
    end
    subgraph Compute[Layer 3 — Compute]
        TRINO[Trino 480<br/>federated SQL]
        DBT[dbt Core<br/>semantic layer]
        SPARK[Spark Connect<br/>ETL + ML features]
    end
    subgraph Consume[Layer 5 — Consumption]
        SUP[Superset<br/>dashboards]
        JH[JupyterHub<br/>notebooks]
        ADEN[ADEN<br/>NL to SQL to Streamlit]
    end
    subgraph Govern[Layer 6 — Governance]
        OM[OpenMetadata<br/>lineage + tags]
        KC[Keycloak<br/>OIDC tokens]
    end

    OLTP -->|CDC / batch| DAG
    DAG --> PG
    DAG --> SPARKI
    SPARKI -->|MERGE INTO iceberg.banking.transactions| POL
    POL -.metadata.-> SFS
    POL --> TRINO
    PG --> TRINO
    TRINO --> DBT
    DBT --> POL
    TRINO --> SUP
    TRINO --> JH
    TRINO --> ADEN
    SPARK --> JH
    OM -.column-level lineage.-> TRINO
    OM -.ingest.-> POL
    KC -.user token.-> TRINO
    KC -.user token.-> SUP
    KC -.user token.-> ADEN

Step-by-Step

1. Raw ingestion — Postgres / files / API

Airflow DAGs land raw rows in akko-postgresql-data or directly in SeaweedFS via its S3 API. The akko_banking_fraud_demo DAG calls generate_transactions.py to seed data, then hands off to Spark Connect for JDBC replication.
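A minimal sketch of that hand-off from the DAG side, assuming a Spark Connect endpoint at sc://spark-connect:15002; the database name, table, and credentials are illustrative, not the demo's actual values:

```python
# Minimal sketch: read the raw rows through Spark Connect over JDBC.
# Endpoint, database, table, and credentials are illustrative.
from pyspark.sql import SparkSession

# Connect to the remote Spark Connect server rather than a local driver.
spark = SparkSession.builder.remote("sc://spark-connect:15002").getOrCreate()

# Read the raw rows the Airflow DAG landed in Postgres.
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://akko-postgresql-data:5432/banking")
    .option("dbtable", "public.transactions")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)
```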

2. Lakehouse write — Iceberg on SeaweedFS

Spark writes the data into Iceberg tables such as iceberg.banking.transactions, partitioned by region. Polaris, the Iceberg REST catalog, tracks each table's schema, snapshot history, and partition spec, persisting catalog state in its own PostgreSQL backend while the Iceberg data and metadata files land in the akko-warehouse bucket.
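The upsert itself can be expressed as a single Iceberg MERGE, matching the edge label in the diagram. A sketch continuing from the `raw` DataFrame above; the transaction_id join key and the staged view name are assumptions for illustration:

```python
# Stage the freshly read rows so SQL can reference them.
raw.createOrReplaceTempView("staged_transactions")

# Upsert into the region-partitioned Iceberg table registered in Polaris.
# The join key transaction_id is an assumption for illustration.
spark.sql("""
    MERGE INTO iceberg.banking.transactions AS t
    USING staged_transactions AS s
    ON t.transaction_id = s.transaction_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```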

3. Query — Trino + dbt

Trino federates reads across the iceberg and postgresql catalogs. dbt models, built with the dbt-trino adapter, materialize the curated layers (raw, staging, analytics). Every dbt run emits OpenLineage events that OpenMetadata consumes for column-level lineage.
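Because both catalogs hang off the same Trino coordinator, a single query can join lakehouse and operational data. A hedged sketch using the trino Python client; the customers table and join key are hypothetical, while the catalog and schema names follow the diagram:

```python
# Illustrative federated query: join the Iceberg fact table with a
# reference table still living in Postgres.
import trino

conn = trino.dbapi.connect(
    host="trino", port=8080, user="analyst",
    catalog="iceberg", schema="banking",
)
cur = conn.cursor()
cur.execute("""
    SELECT t.region, count(*) AS txn_count
    FROM iceberg.banking.transactions AS t
    JOIN postgresql.public.customers AS c
      ON t.customer_id = c.id
    GROUP BY t.region
""")
print(cur.fetchall())
```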

4. Consumption — Superset, JupyterHub, ADEN

  • Superset runs SQL against Trino with the user's Keycloak JWT, so RBAC and column masks apply per query (a sketch of this token hand-off follows the list).
  • JupyterHub spawns akko-notebook pods with jupyter-ai, dbt-trino, pyspark, and duckdb preinstalled.
  • ADEN (Layer 4) takes a natural-language question, calls OpenMetadata, OPA, and LiteLLM, then executes the generated SQL through Trino under the caller's identity.
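What the per-user token hand-off might look like from a notebook or service, using the trino client's JWT support; the hostname, port, and token location are assumptions:

```python
# Sketch: query Trino as the end user by forwarding a Keycloak-issued
# JWT. Note that JWT auth in Trino requires HTTPS.
import trino
from trino.auth import JWTAuthentication

with open("/var/run/secrets/keycloak/token") as f:  # illustrative path
    token = f.read().strip()

conn = trino.dbapi.connect(
    host="trino",
    port=8443,
    user="alice",
    http_scheme="https",
    auth=JWTAuthentication(token),
    catalog="iceberg",
    schema="banking",
)
# RBAC and column masks now apply to every statement on this connection.
```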

5. Governance — OpenMetadata + OPA + Keycloak

Every table lands in OpenMetadata with tags (PII, GDPR.Personal), business glossary terms, and lineage. OPA policies gate every Trino query, evaluating the Keycloak role claims carried in the caller's token.
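A sketch of the kind of allow/deny check such a gate makes against OPA's standard REST data API; the policy package path (akko/trino/allow) and the input shape are assumptions:

```python
# Ask OPA whether this user may run this statement before Trino does.
import requests

decision = requests.post(
    "http://opa:8181/v1/data/akko/trino/allow",
    json={
        "input": {
            "user": "alice",
            "roles": ["analyst"],  # Keycloak role claims from the JWT
            "table": "iceberg.banking.transactions",
            "action": "SELECT",
        }
    },
    timeout=5,
).json()

if not decision.get("result", False):
    raise PermissionError("query blocked by OPA policy")
```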

Reference Demo

The banking-fraud demo (/demos/banking-fraud/) runs this flow end-to-end in under 10 minutes on k3d:

  1. Seed synthetic transactions in Postgres.
  2. Replicate to Iceberg via Spark Connect.
  3. Train an Isolation Forest and track the run in MLflow (see the sketch after this list).
  4. Score transactions with akko_ai_anomaly() + akko_ai_sentiment() in a single MERGE.
  5. Refresh the Superset dashboard.
  6. Ask ADEN: "Which regions saw a fraud rate above 5% last week?"
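A condensed sketch of step 3; the tracking URI, experiment name, feature columns, and input file are illustrative rather than the demo's actual values:

```python
# Train an Isolation Forest on transaction features and log the run
# and model to MLflow. All names here are illustrative.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("banking-fraud")

features = pd.read_parquet("transaction_features.parquet")  # hypothetical export

with mlflow.start_run():
    model = IsolationForest(contamination=0.05, random_state=42)
    model.fit(features[["amount", "merchant_risk", "tx_velocity"]])
    mlflow.log_param("contamination", 0.05)
    mlflow.sklearn.log_model(model, "model")
```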