Data Flow

This page traces a piece of data from an upstream OLTP system all the way to an executive dashboard and an AI agent answering a natural-language question.

End-to-End Pipeline

flowchart LR
    subgraph Sources
        OLTP[(OLTP / SaaS<br/>PostgreSQL, APIs, files)]
    end
    subgraph Ingestion[Layer 1 — Ingestion]
        DAG[Airflow DAG<br/>akko_banking_fraud_demo]
        SPARKI[Spark Connect<br/>JDBC replicator]
    end
    subgraph Storage[Layer 2 — Storage]
        PG[(PostgreSQL<br/>akko-postgresql-data)]
        SFS[(SeaweedFS<br/>akko-warehouse bucket)]
        POL[Apache Polaris<br/>Iceberg REST catalog]
    end
    subgraph Compute[Layer 3 — Compute]
        TRINO[Trino 480<br/>federated SQL]
        DBT[dbt Core<br/>semantic layer]
        SPARK[Spark Connect<br/>ETL + ML features]
    end
    subgraph Consume[Layer 5 — Consumption]
        SUP[Superset<br/>dashboards]
        JH[JupyterHub<br/>notebooks]
        ADEN[ADEN<br/>NL to SQL to Streamlit]
    end
    subgraph Govern[Layer 6 — Governance]
        OM[OpenMetadata<br/>lineage + tags]
        KC[Keycloak<br/>OIDC tokens]
    end

    OLTP -->|CDC / batch| DAG
    DAG --> PG
    DAG --> SPARKI
    SPARKI -->|MERGE INTO iceberg.banking.transactions| POL
    POL -.metadata.-> SFS
    POL --> TRINO
    PG --> TRINO
    TRINO --> DBT
    DBT --> POL
    TRINO --> SUP
    TRINO --> JH
    TRINO --> ADEN
    SPARK --> JH
    OM -.column-level lineage.-> TRINO
    OM -.ingest.-> POL
    KC -.user token.-> TRINO
    KC -.user token.-> SUP
    KC -.user token.-> ADEN

Step-by-Step

1. Raw ingestion — Postgres / files / API

Airflow DAGs land raw rows in akko-postgresql-data or directly in SeaweedFS via its S3 API. The akko_banking_fraud_demo DAG calls generate_transactions.py to seed data, then hands off to Spark Connect for JDBC replication.
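A minimal sketch of that hand-off from the DAG side, assuming a Spark Connect endpoint at sc://spark-connect:15002; the database name, table, and credentials are illustrative, not the demo's actual values:

```python
# Minimal sketch: read the raw rows through Spark Connect over JDBC.
# Endpoint, database, table, and credentials are illustrative.
from pyspark.sql import SparkSession

# Connect to the remote Spark Connect server rather than a local driver.
spark = SparkSession.builder.remote("sc://spark-connect:15002").getOrCreate()

# Read the raw rows the Airflow DAG landed in Postgres.
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://akko-postgresql-data:5432/banking")
    .option("dbtable", "public.transactions")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)
```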

2. Lakehouse write — Iceberg on SeaweedFS

Spark writes the data into Iceberg tables such as iceberg.banking.transactions, partitioned by region. Polaris, the Iceberg REST catalog, tracks each table's schema, snapshot history, and partition spec, persisting catalog state in its own PostgreSQL backend while the Iceberg data and metadata files land in the akko-warehouse bucket.
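The upsert itself can be expressed as a single Iceberg MERGE, matching the edge label in the diagram. A sketch continuing from the `raw` DataFrame above; the transaction_id join key and the staged view name are assumptions for illustration:

```python
# Stage the freshly read rows so SQL can reference them.
raw.createOrReplaceTempView("staged_transactions")

# Upsert into the region-partitioned Iceberg table registered in Polaris.
# The join key transaction_id is an assumption for illustration.
spark.sql("""
    MERGE INTO iceberg.banking.transactions AS t
    USING staged_transactions AS s
    ON t.transaction_id = s.transaction_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```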

3. Query — Trino + dbt

Trino federates reads across the iceberg and postgresql catalogs. dbt models, built with the dbt-trino adapter, materialize the curated layers (raw, staging, analytics). Every dbt run emits OpenLineage events that OpenMetadata consumes for column-level lineage.
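Because both catalogs hang off the same Trino coordinator, a single query can join lakehouse and operational data. A hedged sketch using the trino Python client; the customers table and join key are hypothetical, while the catalog and schema names follow the diagram:

```python
# Illustrative federated query: join the Iceberg fact table with a
# reference table still living in Postgres.
import trino

conn = trino.dbapi.connect(
    host="trino", port=8080, user="analyst",
    catalog="iceberg", schema="banking",
)
cur = conn.cursor()
cur.execute("""
    SELECT t.region, count(*) AS txn_count
    FROM iceberg.banking.transactions AS t
    JOIN postgresql.public.customers AS c
      ON t.customer_id = c.id
    GROUP BY t.region
""")
print(cur.fetchall())
```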

4. Consumption — Superset, JupyterHub, ADEN

  • Superset runs SQL against Trino with the user's Keycloak JWT, so RBAC and column masks apply per query (a sketch of this token hand-off follows the list).
  • JupyterHub spawns akko-notebook pods with jupyter-ai, dbt-trino, pyspark, and duckdb preinstalled.
  • ADEN (Layer 4) takes a natural-language question, calls OpenMetadata, OPA, and LiteLLM, then executes the generated SQL through Trino under the caller's identity.
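What the per-user token hand-off might look like from a notebook or service, using the trino client's JWT support; the hostname, port, and token location are assumptions:

```python
# Sketch: query Trino as the end user by forwarding a Keycloak-issued
# JWT. Note that JWT auth in Trino requires HTTPS.
import trino
from trino.auth import JWTAuthentication

with open("/var/run/secrets/keycloak/token") as f:  # illustrative path
    token = f.read().strip()

conn = trino.dbapi.connect(
    host="trino",
    port=8443,
    user="alice",
    http_scheme="https",
    auth=JWTAuthentication(token),
    catalog="iceberg",
    schema="banking",
)
# RBAC and column masks now apply to every statement on this connection.
```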

5. Governance — OpenMetadata + OPA + Keycloak

Every table lands in OpenMetadata with tags (PII, GDPR.Personal), business glossary terms, and lineage. OPA policies gate every Trino query, evaluating the Keycloak role claims carried in the caller's token.
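A sketch of the kind of allow/deny check such a gate makes against OPA's standard REST data API; the policy package path (akko/trino/allow) and the input shape are assumptions:

```python
# Ask OPA whether this user may run this statement before Trino does.
import requests

decision = requests.post(
    "http://opa:8181/v1/data/akko/trino/allow",
    json={
        "input": {
            "user": "alice",
            "roles": ["analyst"],  # Keycloak role claims from the JWT
            "table": "iceberg.banking.transactions",
            "action": "SELECT",
        }
    },
    timeout=5,
).json()

if not decision.get("result", False):
    raise PermissionError("query blocked by OPA policy")
```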

Reference Demo

The banking-fraud demo (/demos/banking-fraud/) runs this flow end-to-end in under 10 minutes on k3d:

  1. Seed synthetic transactions in Postgres.
  2. Replicate to Iceberg via Spark Connect.
  3. Train an Isolation Forest and track the run in MLflow (see the sketch after this list).
  4. Score transactions with akko_ai_anomaly() + akko_ai_sentiment() in a single MERGE.
  5. Refresh the Superset dashboard.
  6. Ask ADEN: "Which regions saw a fraud rate above 5% last week?"
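A condensed sketch of step 3; the tracking URI, experiment name, feature columns, and input file are illustrative rather than the demo's actual values:

```python
# Train an Isolation Forest on transaction features and log the run
# and model to MLflow. All names here are illustrative.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("banking-fraud")

features = pd.read_parquet("transaction_features.parquet")  # hypothetical export

with mlflow.start_run():
    model = IsolationForest(contamination=0.05, random_state=42)
    model.fit(features[["amount", "merchant_risk", "tx_velocity"]])
    mlflow.log_param("contamination", 0.05)
    mlflow.sklearn.log_model(model, "model")
```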