First Notebook¶
This guide walks you through running the AKKO banking demo notebook, which creates Iceberg tables, generates synthetic data, and demonstrates Trino federation.
Open JupyterHub¶
- Navigate to https://lab.akko.local
- Log in as alice (password alice123)
- Wait for your notebook server to spawn (the first spawn takes ~30 seconds)
Open the Banking Demo¶
In the JupyterLab file browser, navigate to the notebooks/ directory and open the banking demo notebook (notebook 01).
Notebooks are read-only
The notebooks/ directory is mounted read-only from the host. To edit a
notebook, copy it to your home directory first:
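A minimal sketch of one way to do this from a notebook cell, using Python instead of a terminal (the filename is an assumption; check the file browser for the exact name):

```python
import shutil
from pathlib import Path

# Hypothetical filename, based on the notebook list at the end of this
# guide -- adjust to match what the file browser shows.
src = Path("notebooks") / "01-akko-banking-demo.ipynb"

# Copy the read-only notebook into your (writable) home directory.
shutil.copy(src, Path.home())
```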
Then open the copy from your home directory.
What the Notebook Does¶
The banking demo simulates a retail bank with 5 French branches, 200 customers, and 1000 transactions. It exercises the full AKKO stack in a single notebook:
Step 1 -- Spark Connect Session¶
Connects to Spark via the gRPC protocol (sc://spark-connect:15002). This is a remote Spark session -- no local Spark installation needed. The notebook creates the iceberg.analytics namespace.
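In PySpark, the session setup looks roughly like this (a sketch; the endpoint is the one quoted above, the variable name is illustrative):

```python
from pyspark.sql import SparkSession

# Attach to the remote Spark Connect server over gRPC -- no local
# Spark installation or JVM is needed inside the notebook.
spark = SparkSession.builder.remote("sc://spark-connect:15002").getOrCreate()

# Create the namespace that will hold the demo tables.
spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg.analytics")
```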
Step 2 -- Create Iceberg Tables¶
Four tables are created in the iceberg.analytics namespace, stored in Iceberg format on S3-compatible object storage, with the catalog managed by Apache Polaris:
| Table | Rows | Partitioned By | Description |
|---|---|---|---|
| advisors | 15 | -- | Bank advisors with specialty and branch |
| customers | 200 | segment | Customers (retail, business, premium) |
| accounts | ~350 | account_type | Checking, savings, investment accounts |
| transactions | 1000 | months(transaction_date) | 6 months of card, transfer, deposit operations |
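As an illustration, the transactions table could be declared as follows (a sketch: the column types are assumptions, and only the columns used later in the federation query are shown):

```python
# Iceberg DDL through the Spark session from Step 1. The months()
# transform produces the monthly partitioning listed in the table above.
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.analytics.transactions (
        transaction_id   BIGINT,
        account_id       BIGINT,
        amount           DECIMAL(12, 2),   -- type is an assumption
        transaction_date DATE
    )
    USING iceberg
    PARTITIONED BY (months(transaction_date))
""")
```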
Step 3 -- Verify Data¶
The notebook prints row counts for all four tables and displays sample data for each.
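Such a check can be as simple as this sketch (the table names match the expected output shown later in this guide):

```python
# Count rows in each demo table through the Spark Connect session.
for name in ["advisors", "customers", "accounts", "transactions"]:
    n = spark.table(f"iceberg.analytics.{name}").count()
    print(f"{name:<13}: {n} rows")
```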
Step 4 -- Iceberg Time-Travel¶
Demonstrates Iceberg's snapshot history. Each INSERT creates a new snapshot, and you can query any historical version of the data:
SELECT snapshot_id, committed_at, operation
FROM iceberg.analytics.transactions.snapshots
ORDER BY committed_at
Spark Connect limitation
Iceberg metadata tables (like .snapshots) must be queried with .show()
instead of .collect() in Spark Connect mode, due to a SerializedLambda
serialization issue.
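Putting both together, a time-travel round trip might look like this sketch (the snapshot ID is a placeholder you would copy from the snapshots listing):

```python
# List snapshots -- note .show() rather than .collect(), per the
# Spark Connect limitation above.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM iceberg.analytics.transactions.snapshots
    ORDER BY committed_at
""").show(truncate=False)

# Query the table as it existed at an earlier snapshot
# (Iceberg's VERSION AS OF syntax, Spark 3.3+).
snapshot_id = 1234567890  # placeholder: use a real ID printed above
spark.sql(
    f"SELECT COUNT(*) AS n FROM iceberg.analytics.transactions "
    f"VERSION AS OF {snapshot_id}"
).show()
```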
Step 5 -- Trino Federation¶
The flagship query: a federated join across two different data sources in a single SQL statement:
- Iceberg tables (customers, accounts, transactions) stored on object storage
- PostGIS table (branches with geospatial coordinates) stored in PostgreSQL
SELECT
b.name AS branch, b.city,
COUNT(DISTINCT c.customer_id) AS customer_count,
COUNT(t.transaction_id) AS transaction_count,
ROUND(SUM(ABS(t.amount)), 2) AS total_volume
FROM postgresql.geospatial.branches b
JOIN iceberg.analytics.customers c ON c.branch_id = b.id
JOIN iceberg.analytics.accounts a ON a.customer_id = c.customer_id
JOIN iceberg.analytics.transactions t ON t.account_id = a.account_id
WHERE a.status = 'active'
GROUP BY b.name, b.city
ORDER BY total_volume DESC
This query is executed by Trino, which federates across the Iceberg catalog (via Polaris REST) and the PostgreSQL catalog transparently.
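To issue the same kind of federated SQL yourself from Python, a sketch with the trino client package could look like this (host, port, and user are assumptions about this deployment):

```python
import trino

# Connect to the Trino coordinator (connection details are assumptions).
conn = trino.dbapi.connect(host="trino", port=8080, user="alice")
cur = conn.cursor()

# Any statement spanning both catalogs works, e.g. a slimmed-down
# version of the branch-volume query above.
cur.execute("""
    SELECT b.city, COUNT(DISTINCT c.customer_id) AS customer_count
    FROM postgresql.geospatial.branches b
    JOIN iceberg.analytics.customers c ON c.branch_id = b.id
    GROUP BY b.city
""")
for row in cur.fetchall():
    print(row)
```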
Step 6 -- SQL Queries for Superset¶
The notebook prints ready-to-use SQL queries that you can paste into Superset SQL Lab, including KPIs, monthly volume breakdowns, spending categories, and the federated branch revenue query.
Run It Cell by Cell¶
Select the first cell and press Shift+Enter to run it. Continue through each cell in order. The entire notebook takes about 2-3 minutes to complete.
Run cells in order
Each cell depends on the previous ones. Do not skip cells or run them out of order, as later cells reference tables created by earlier cells.
Expected output at the verification step:
========================================
AKKO Banking -- Iceberg Tables
========================================
advisors : 15 rows
customers : 200 rows
accounts : ~350 rows
transactions : 1000 rows
========================================
After Running the Notebook¶
Once the notebook completes successfully, the data is available everywhere in the platform:
- Trino -- Query tables at iceberg.analytics.* via the Trino UI or any SQL client
- Superset -- The auto-provisioned dashboard now displays live data. Navigate to Dashboards > AKKO Banking Overview and refresh
- Other notebooks -- All notebooks that query iceberg.analytics.* will see the data
- OpenMetadata -- If the governance profile is running, the catalog can ingest these tables for metadata management
Architecture Recap¶
Notebook (Spark Connect)         PostgreSQL
    |                        +------------------+
    | gRPC :15002            | geospatial       |
    v                        |  .branches (5)   |
Spark Connect --> Polaris    +---------+--------+
    |             |                    |
    v             v                    |
object storage (S3)                    |
+------------------+                   |
| analytics        |                   |
|  .advisors       |                   |
|  .customers      |                   |
|  .accounts       |                   |
|  .transactions   |                   |
+---------+--------+                   |
          |                            |
+---------v----------------------------v---+
|            TRINO (federation)            |
+--------------------+---------------------+
                     |
              +------v------+
              |  SUPERSET   |
              |  Dashboard  |
              +-------------+
Explore Other Notebooks¶
AKKO ships with 14 notebooks organized by category. After completing the banking demo, try these:
Getting Started¶
| # | Notebook | Description |
|---|---|---|
| 01 | akko-banking-demo | Banking data model, Spark Connect, Trino federation (this guide) |
| 03 | spark-iceberg-demo | Deep dive into Iceberg features (schema evolution, partitioning, time-travel) |
AI¶
| # | Notebook | Description |
|---|---|---|
| 02 | rag-pipeline-demo | RAG pipeline with Ollama, pgvector, and LangChain |
| 13 | akko-jupyter-ai-demo | Jupyter AI integration with local LLMs via Ollama |
Analytics¶
| # | Notebook | Description |
|---|---|---|
| 04 | akko-duckdb-analytics | In-process analytics with DuckDB on Iceberg data |
| 07 | akko-r-analytics | R kernel: tidyverse analytics on banking data |
| 08 | akko-julia-dataframes | Julia kernel: DataFrames.jl on banking data |
| 11 | akko-polars-analytics | Polars DataFrame library for fast analytics |
Engineering¶
| # | Notebook | Description |
|---|---|---|
| 05 | akko-dbt-transforms | dbt transformations on Iceberg tables |
| 06 | akko-data-quality | Data quality checks and validation |
| 10 | akko-polaris-catalog-admin | Polaris catalog administration via REST API |
Visualization¶
| # | Notebook | Description |
|---|---|---|
| 09 | akko-geospatial-analysis | PostGIS geospatial analysis with branch locations |
| 12 | akko-altair-visualization | Interactive Altair/Vega charts |
Reports¶
| File | Description |
|---|---|
| 04-akko-banking-report.qmd | Quarto report rendered to HTML, served at https://docs.akko.local/reports/ |
Notebook numbering
Notebooks are numbered in suggested reading order. Start with 01 (this guide), then try 02 (RAG) or 03 (Iceberg deep dive) depending on your interest.
Next Steps¶
- Explore the Superset dashboard with live data from the banking demo
- Try the RAG pipeline notebook to build a retrieval-augmented generation system with Ollama and pgvector
- Learn about Trino federation and how to add your own data sources