JupyterHub

Overview

JupyterHub provides the multi-user notebook environment for AKKO, accessible at https://lab.akko.local. It authenticates users via Keycloak SSO and spawns one Kubernetes pod per user from the akko-notebook image, each with a rich data engineering and analytics toolkit pre-installed.

Architecture

                Keycloak SSO
                    |
           +--------+--------+
           |   JupyterHub    |  (z2jh k8s-hub image)
           | lab.akko.local  |
           +--------+--------+
                    |  KubeSpawner
        +-----------+-----------+
        |           |           |
   +---------+ +---------+ +---------+
   | alice   | |  bob    | | carol   |  (akko-notebook containers)
   | notebook| | notebook| | notebook|
   +---------+ +---------+ +---------+
  • JupyterHub runs as a central hub service behind Traefik.
  • KubeSpawner creates one akko-notebook pod per authenticated user.
  • Each user pod gets 4 GB RAM and 2 CPUs max (production defaults).
  • User work persists in a dynamic PersistentVolumeClaim (10 Gi default).

Available Kernels

| Kernel | Version | Notes |
|---|---|---|
| Python 3 | scipy-notebook base | Default kernel, full data stack |
| R (IRkernel) | System R | tidyverse, sf, DBI, RPostgres; geospatial-ready |
| Julia (IJulia) | 1.11.3 | DataFrames, CSV packages pre-installed |
| Scala 2.13 (Almond) | 0.14.0-RC15 | For Spark Scala developers, Ammonite REPL |

Pre-installed Python Packages

The akko-notebook image ships with a comprehensive data engineering stack:

Data Engineering & Lakehouse

| Package | Purpose |
|---|---|
| pyspark 3.5.1 | Spark Connect client |
| trino | Trino Python DB-API driver |
| pyiceberg[s3fs] | Apache Iceberg Python client |
| duckdb | Embedded analytical database |
| polars | Fast DataFrame library |
| pandas | DataFrame library (from scipy-notebook base) |
| pyarrow | Columnar in-memory format |
| dbt-core + dbt-trino | Data transformation framework |
| sqlalchemy + sqlalchemy-trino | SQL toolkit with Trino dialect |
| boto3 | AWS/object storage S3 SDK |
| psycopg2-binary | PostgreSQL driver |

Analytics & Visualization

| Package | Purpose |
|---|---|
| altair | Declarative statistical visualization |
| plotly | Interactive charts |
| folium | Leaflet.js maps |
| geopandas | Geospatial DataFrames |
| great-expectations | Data quality validation |

AI & LLM

| Package | Purpose |
|---|---|
| jupyter-ai[all] | AI assistant in JupyterLab |
| langchain + langchain-ollama | LLM orchestration framework |
| langchain-community | Community integrations |
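The LLM endpoint comes from the OLLAMA_HOST variable injected by the spawner (see Environment Variables below). A minimal sketch; the model name in the comments is an assumption, not an AKKO default:

```python
import os

# Spawner-injected endpoint; the fallback mirrors the documented default.
OLLAMA_URL = os.environ.get("OLLAMA_HOST", "http://akko-akko-ollama:11434")

# With langchain-ollama installed (as in the akko-notebook image):
#   from langchain_ollama import ChatOllama
#   llm = ChatOllama(model="llama3", base_url=OLLAMA_URL)  # model name is illustrative
#   reply = llm.invoke("Summarize Apache Iceberg in one sentence.")
#   print(reply.content)
```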

JupyterLab Extensions

| Extension | Purpose |
|---|---|
| jupyterlab-git | Git integration |
| jupyterlab-lsp | Language Server Protocol support |
| jupyterlab-execute-time | Cell execution time display |
| jupyterlab-code-formatter | Black + isort formatting |
| jupyterlab-favorites | Favorite files sidebar |
| jupyterlab-mermaid | Render Mermaid diagrams (architecture, flowcharts, ER diagrams) directly in notebooks |
| jupyter-resource-usage | RAM/CPU usage indicator |
| ipywidgets | Interactive widgets |

code-server (VS Code in Browser)

Each notebook container includes code-server, providing a full VS Code experience in the browser. It is accessible via the JupyterLab launcher or the /code-server proxy path.

Pre-installed extensions:

  • Languages: Python, R, Julia, Scala (Almond kernel), Go
  • Data: SQLTools (PostgreSQL + Trino drivers), Rainbow CSV, Data Table viewer
  • dbt: dbt Power User
  • DevOps: Docker, REST Client, GitLens, Error Lens
  • Quarto: Quarto extension for authoring reports
  • Diagrams: Mermaid (inline diagrams in notebooks and Markdown cells)
  • Theme: Dracula + Material Icon Theme

System-wide Extensions

Extensions are installed at /opt/code-server-extensions (system-wide, immutable) and symlinked into each user's home directory at container startup via a before-notebook.d hook.
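The hook itself is a shell script in before-notebook.d, but its logic amounts to the following Python sketch (the user-side extensions path is an assumption):

```python
from pathlib import Path

def link_extensions(system_dir: Path, user_dir: Path) -> list[Path]:
    """Symlink each system-wide extension into the user's writable
    extensions directory, skipping links that already exist so the
    hook stays idempotent across container restarts."""
    user_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for ext in sorted(system_dir.iterdir()):
        link = user_dir / ext.name
        if not link.is_symlink() and not link.exists():
            link.symlink_to(ext)
            created.append(link)
    return created

# At container startup the hook would run roughly:
#   link_extensions(Path("/opt/code-server-extensions"),
#                   Path.home() / ".local/share/code-server/extensions")
```

Symlinking (rather than copying) keeps the system-wide install immutable while letting every user see the same extension set.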

Quarto Publishing

Quarto (v1.6.43) is pre-installed for authoring and rendering data reports. Rendered HTML output can be written to the published-reports volume, which is shared with the AKKO Docs service at https://docs.akko.local.

# Render a Quarto report from a notebook terminal
quarto render 04-akko-banking-report.qmd --to html \
  --output-dir ~/published-reports/

Shared Notebooks

The host notebooks/ directory is mounted read-only into each user container at /home/{username}/work/notebooks/. This provides shared reference notebooks:

| Notebook | Description |
|---|---|
| akko-banking-demo.ipynb | End-to-end banking demo (Spark Connect + Trino) |
| rag-pipeline-demo.ipynb | RAG pipeline (Ollama + pgvector + LangChain) |
| spark-iceberg-demo.ipynb | Spark Iceberg table operations |

Read-Only Mount

Shared notebooks are read-only. To edit, copy them to your work/ directory first.
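From a notebook cell this copy is a one-liner; the helper below is illustrative, using the mount layout described above:

```python
import shutil
from pathlib import Path

def copy_shared_notebook(name: str, home: Path = Path.home()) -> Path:
    """Copy a read-only shared notebook into the writable work/ directory."""
    src = home / "work" / "notebooks" / name   # read-only shared mount
    dst = home / "work" / name                 # user's writable area
    shutil.copy2(src, dst)
    return dst

# e.g. copy_shared_notebook("akko-banking-demo.ipynb")
```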

Environment Variables

The following environment variables are injected into every user notebook container by the JupyterHub spawner:

| Variable | Default Value | Purpose |
|---|---|---|
| TRINO_HOST | akko-trino | Trino coordinator hostname |
| TRINO_PORT | 8080 | Trino HTTP port |
| SPARK_REMOTE | sc://akko-akko-spark-connect:15002 | Spark Connect gRPC endpoint |
| OLLAMA_HOST | http://akko-akko-ollama:11434 | Ollama LLM API |
| LITELLM_HOST | http://akko-akko-litellm:4000 | LiteLLM AI gateway |
| MLFLOW_TRACKING_URI | http://akko-akko-mlflow:5000 | MLflow tracking server |
| MINIO_ENDPOINT | http://akko-minio:9000 | object storage S3 endpoint |
| POSTGRES_AKKO_PASSWORD | (from Kubernetes Secret) | PostgreSQL password |
| POLARIS_ROOT_SECRET | (from Kubernetes Secret) | Polaris OAuth2 secret |
| MINIO_ROOT_USER | (from Kubernetes Secret) | object storage access key |
| MINIO_ROOT_PASSWORD | (from Kubernetes Secret) | object storage secret key |
| DBT_PROFILES_DIR | /home/{user}/work/dbt | dbt profiles location |
| NB_USER | (dynamic) | Username from Keycloak |
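In practice a notebook reads these variables instead of hard-coding endpoints. A sketch for Trino (the catalog name in the comment is an assumption):

```python
import os

def trino_params() -> dict:
    """Resolve Trino connection parameters from the spawner-injected
    environment, falling back to the documented defaults."""
    return {
        "host": os.environ.get("TRINO_HOST", "akko-trino"),
        "port": int(os.environ.get("TRINO_PORT", "8080")),
        "user": os.environ.get("NB_USER", "jovyan"),
    }

# With the trino driver (pre-installed in akko-notebook):
#   import trino
#   conn = trino.dbapi.connect(**trino_params(), catalog="iceberg")  # catalog is illustrative
#   rows = conn.cursor().execute("SELECT 1").fetchall()
```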

Authentication & RBAC

JupyterHub uses Keycloak SSO via the GenericOAuthenticator (OpenID Connect authorization code flow).

  • Client ID: jupyterhub
  • Scopes: openid, profile, email
  • Username claim: preferred_username
  • Auto-login: enabled (redirects directly to Keycloak)
  • Admin users: configurable via JUPYTERHUB_ADMIN_USER env var
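In jupyterhub_config.py terms, the settings above correspond roughly to the following sketch (z2jh normally generates this from Helm values; the Keycloak realm name in the comment is an assumption):

```python
# jupyterhub_config.py sketch -- illustrative, not the deployed config.
c.JupyterHub.authenticator_class = "generic-oauth"
c.GenericOAuthenticator.client_id = "jupyterhub"
c.GenericOAuthenticator.scope = ["openid", "profile", "email"]
c.GenericOAuthenticator.username_claim = "preferred_username"
c.GenericOAuthenticator.auto_login = True
c.GenericOAuthenticator.allow_all = True
# Endpoint URLs follow the usual Keycloak pattern (realm "akko" assumed):
#   https://<keycloak>/realms/akko/protocol/openid-connect/{auth,token,userinfo}
```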

Keycloak Groups

The 5 AKKO realm roles (akko-admin, akko-engineer, akko-analyst, akko-user, akko-viewer) control access across the AKKO platform. JupyterHub currently allows all authenticated users (allow_all = True), but the Keycloak roles propagate to downstream services like Trino for fine-grained RBAC.

Resource Limits

| Resource | Production Default | Dev Override |
|---|---|---|
| Memory limit | 4 GB per user pod | 2 GB |
| Memory guarantee | 1 GB | 512 MB |
| CPU limit | 2 cores | 1 core |
| CPU guarantee | 0.5 | 0.1 |
| Storage | 10 Gi (dynamic PVC) | 1 Gi |
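In z2jh these are set via the singleuser memory/cpu/storage values; expressed directly as KubeSpawner traits, the production column looks roughly like this (a sketch, not the actual values file):

```python
# jupyterhub_config.py sketch of the production defaults above.
c.KubeSpawner.mem_limit = "4G"
c.KubeSpawner.mem_guarantee = "1G"
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.cpu_guarantee = 0.5
c.KubeSpawner.storage_capacity = "10Gi"   # size of the dynamic PVC
```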

Resource Management

Idle Server Culling

JupyterHub automatically shuts down inactive notebook servers to free cluster resources (CPU, RAM). This is handled by the built-in jupyterhub-idle-culler.

| Parameter | Default | Description |
|---|---|---|
| cull.enabled | true | Enable automatic idle server shutdown |
| cull.timeout | 1800 | Seconds of inactivity before shutdown (30 min) |
| cull.every | 300 | How often to check for idle servers (5 min) |
| cull.maxAge | 28800 | Maximum server lifetime in seconds (8 h), even if active |

To change these values, edit values-dev.yaml (or your production values file):

jupyterhub:
  cull:
    enabled: true
    timeout: 3600       # 1 hour of inactivity
    every: 300          # check every 5 minutes
    maxAge: 43200       # 12 hours max lifetime

Then apply:

helm upgrade akko helm/akko/ -n akko -f <your-values-file>.yaml \
  --set-file akko-keycloak.realm.data=<realm-file>.json

User Data is Preserved

Culling only deletes the pod (compute resources). User files are stored on a PersistentVolumeClaim (claim-<username>) that persists across restarts. When the user logs in again, a new pod is created and the PVC is re-mounted automatically.

Production Recommendations

  • Data Scientists running long ML training: increase timeout to 7200 (2h) and maxAge to 86400 (24h).
  • Cost-sensitive environments: reduce timeout to 600 (10 min).
  • Workshops / demos: set maxAge to 14400 (4h) to ensure cleanup after the session.

Known Issues

Important Gotchas

  • CHOWN_HOME_OPTS=-R breaks with read-only volumes: a recursive chown fails on the read-only shared notebooks mount, so the notebook image sets CHOWN_HOME=no and uses a custom before-notebook.d hook that suppresses chown errors (chown 2>/dev/null || true).
  • jupyterlab-drawio: Never install it -- it forces a downgrade to JupyterLab 3.
  • jupyterlab-lsp: Must be >= 5 for JupyterLab 4 compatibility.
  • Keycloak emailVerified: All test users must have emailVerified=true in Keycloak, otherwise oauth2-proxy rejects them.