# JupyterHub

## Overview
JupyterHub provides the multi-user notebook environment for AKKO, accessible at
https://lab.akko.local. It authenticates users via
Keycloak SSO and spawns individual Kubernetes pods (akko-notebook) per user,
each with a rich data engineering and analytics toolkit pre-installed.
## Architecture

```
                Keycloak SSO
                     |
            +--------+--------+
            |    JupyterHub   |  (z2jh k8s-hub image)
            |  lab.akko.local |
            +--------+--------+
                     |  KubeSpawner
         +-----------+-----------+
         |           |           |
    +---------+ +---------+ +---------+
    |  alice  | |   bob   | |  carol  |  (akko-notebook containers)
    | notebook| | notebook| | notebook|
    +---------+ +---------+ +---------+
```
- JupyterHub runs as a central hub service behind Traefik.
- KubeSpawner creates one `akko-notebook` pod per authenticated user.
- Each user pod gets at most 4 GB RAM and 2 CPUs (production defaults).
- User work persists in a dynamic PersistentVolumeClaim (10 Gi default).
## Available Kernels

| Kernel | Version | Notes |
|---|---|---|
| Python 3 | scipy-notebook base | Default kernel, full data stack |
| R (IRkernel) | System R | tidyverse, sf, DBI, RPostgres; geospatial-ready |
| Julia (IJulia) | 1.11.3 | DataFrames, CSV packages pre-installed |
| Scala 2.13 (Almond) | 0.14.0-RC15 | For Spark Scala developers, Ammonite REPL |
## Pre-installed Python Packages

The akko-notebook image ships with a comprehensive data engineering stack:

### Data Engineering & Lakehouse

| Package | Purpose |
|---|---|
| `pyspark` 3.5.1 | Spark Connect client |
| `trino` | Trino Python DB-API driver |
| `pyiceberg[s3fs]` | Apache Iceberg Python client |
| `duckdb` | Embedded analytical database |
| `polars` | Fast DataFrame library |
| `pandas` | DataFrame library (from the scipy-notebook base) |
| `pyarrow` | Columnar in-memory format |
| `dbt-core` + `dbt-trino` | Data transformation framework |
| `sqlalchemy` + `sqlalchemy-trino` | SQL toolkit with Trino dialect |
| `boto3` | AWS/object storage S3 SDK |
| `psycopg2-binary` | PostgreSQL driver |
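As a quick illustration of the stack above, the `trino` DB-API driver can be pointed at the platform coordinator via the spawner-injected environment variables (see Environment Variables). A minimal sketch; the `jovyan` fallback user and the `iceberg`/`demo` catalog and schema names are illustrative assumptions, not platform settings:

```python
import os

# Trino connection settings from the spawner-injected environment;
# the fallbacks mirror the documented defaults. The "jovyan" user is
# an illustrative assumption.
trino_params = {
    "host": os.environ.get("TRINO_HOST", "akko-trino"),
    "port": int(os.environ.get("TRINO_PORT", "8080")),
    "user": os.environ.get("NB_USER", "jovyan"),
}

# Inside a notebook you would then open a DB-API connection, e.g.:
#   from trino.dbapi import connect
#   conn = connect(**trino_params, catalog="iceberg", schema="demo")
#   rows = conn.cursor().execute("SELECT 1").fetchall()
print(trino_params)
```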
### Analytics & Visualization

| Package | Purpose |
|---|---|
| `altair` | Declarative statistical visualization |
| `plotly` | Interactive charts |
| `folium` | Leaflet.js maps |
| `geopandas` | Geospatial DataFrames |
| `great-expectations` | Data quality validation |
### AI & LLM

| Package | Purpose |
|---|---|
| `jupyter-ai[all]` | AI assistant in JupyterLab |
| `langchain` + `langchain-ollama` | LLM orchestration framework |
| `langchain-community` | Community integrations |
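The LangChain packages are typically pointed at the platform's Ollama service via `OLLAMA_HOST` (documented under Environment Variables). A minimal sketch; the `llama3` model name is an assumption, not a platform guarantee:

```python
import os

# Resolve the Ollama endpoint from the spawner-injected environment;
# the fallback mirrors the documented default.
ollama_base = os.environ.get("OLLAMA_HOST", "http://akko-akko-ollama:11434")

# With langchain-ollama installed, a chat model would be created like:
#   from langchain_ollama import ChatOllama
#   llm = ChatOllama(model="llama3", base_url=ollama_base)  # model name is an assumption
#   print(llm.invoke("Summarize the AKKO lakehouse in one sentence.").content)
print(ollama_base)
```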
## JupyterLab Extensions

| Extension | Purpose |
|---|---|
| `jupyterlab-git` | Git integration |
| `jupyterlab-lsp` | Language Server Protocol |
| `jupyterlab-execute-time` | Cell execution time display |
| `jupyterlab-code-formatter` | Black + isort formatting |
| `jupyterlab-favorites` | Favorite files sidebar |
| `jupyterlab-mermaid` | Render Mermaid diagrams (architecture, flowcharts, ER diagrams) directly in notebooks |
| `jupyter-resource-usage` | RAM/CPU usage indicator |
| `ipywidgets` | Interactive widgets |
## code-server (VS Code in the Browser)
Each notebook container includes code-server,
providing a full VS Code experience in the browser. It is accessible via the
JupyterLab launcher or the /code-server proxy path.
Pre-installed extensions:
- Languages: Python, R, Julia, Scala (Almond kernel), Go
- Data: SQLTools (PostgreSQL + Trino drivers), Rainbow CSV, Data Table viewer
- dbt: dbt Power User
- DevOps: Docker, REST Client, GitLens, Error Lens
- Quarto: Quarto extension for authoring reports
- Diagrams: Mermaid (inline diagrams in notebooks and Markdown cells)
- Theme: Dracula + Material Icon Theme
**System-wide Extensions:** Extensions are installed at `/opt/code-server-extensions` (system-wide, immutable) and symlinked into each user's home directory at container startup via a `before-notebook.d` hook.
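The hook's effect can be sketched in Python (the real hook is a shell script); the per-user target path under `~/.local/share/code-server/extensions` is an assumption about code-server's default layout:

```python
from pathlib import Path

def link_extensions(system_dir: str, home: str) -> Path:
    """Symlink every system-wide extension into the user's extension dir.

    Mirrors what the before-notebook.d hook does at container startup;
    the per-user target path is an assumed code-server default.
    """
    target = Path(home) / ".local" / "share" / "code-server" / "extensions"
    target.mkdir(parents=True, exist_ok=True)
    for ext in Path(system_dir).iterdir():
        link = target / ext.name
        if not link.exists():
            # The immutable system copy stays the single source of truth.
            link.symlink_to(ext)
    return target
```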
## Quarto Publishing

Quarto (v1.6.43) is pre-installed for authoring and rendering data reports. Rendered HTML output can be written to the published-reports volume, which is shared with the AKKO Docs service at https://docs.akko.local.

```
# Render a Quarto report from a notebook terminal
quarto render 04-akko-banking-report.qmd --to html \
  --output-dir ~/published-reports/
```
## Shared Notebooks

The host `notebooks/` directory is mounted read-only into each user container at `/home/{username}/work/notebooks/`. This provides shared reference notebooks:

| Notebook | Description |
|---|---|
| `akko-banking-demo.ipynb` | End-to-end banking demo (Spark Connect + Trino) |
| `rag-pipeline-demo.ipynb` | RAG pipeline (Ollama + pgvector + LangChain) |
| `spark-iceberg-demo.ipynb` | Spark Iceberg table operations |
**Read-Only Mount:** Shared notebooks are read-only. To edit, copy them to your `work/` directory first.
## Environment Variables

The following environment variables are injected into every user notebook container by the JupyterHub spawner:

| Variable | Default Value | Purpose |
|---|---|---|
| `TRINO_HOST` | `akko-trino` | Trino coordinator hostname |
| `TRINO_PORT` | `8080` | Trino HTTP port |
| `SPARK_REMOTE` | `sc://akko-akko-spark-connect:15002` | Spark Connect gRPC endpoint |
| `OLLAMA_HOST` | `http://akko-akko-ollama:11434` | Ollama LLM API |
| `LITELLM_HOST` | `http://akko-akko-litellm:4000` | LiteLLM AI gateway |
| `MLFLOW_TRACKING_URI` | `http://akko-akko-mlflow:5000` | MLflow tracking server |
| `MINIO_ENDPOINT` | `http://akko-minio:9000` | Object storage S3 endpoint |
| `POSTGRES_AKKO_PASSWORD` | (from Kubernetes Secret) | PostgreSQL password |
| `POLARIS_ROOT_SECRET` | (from Kubernetes Secret) | Polaris OAuth2 secret |
| `MINIO_ROOT_USER` | (from Kubernetes Secret) | Object storage access key |
| `MINIO_ROOT_PASSWORD` | (from Kubernetes Secret) | Object storage secret key |
| `DBT_PROFILES_DIR` | `/home/{user}/work/dbt` | dbt profiles location |
| `NB_USER` | (dynamic) | Username from Keycloak |
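For example, the object-storage variables are what you would feed an S3 client; credentials come from Kubernetes Secrets, so the sketch below deliberately has no fallbacks for them:

```python
import os

# Object-storage settings from the spawner-injected environment; the
# endpoint fallback mirrors the documented default.
endpoint = os.environ.get("MINIO_ENDPOINT", "http://akko-minio:9000")
access_key = os.environ.get("MINIO_ROOT_USER")      # from a Kubernetes Secret
secret_key = os.environ.get("MINIO_ROOT_PASSWORD")  # from a Kubernetes Secret

# With boto3 this becomes, e.g.:
#   import boto3
#   s3 = boto3.client("s3", endpoint_url=endpoint,
#                     aws_access_key_id=access_key,
#                     aws_secret_access_key=secret_key)
#   s3.list_buckets()
print(endpoint)
```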
## Authentication & RBAC

JupyterHub uses Keycloak SSO via the GenericOAuthenticator (OpenID Connect authorization code flow).

- Client ID: `jupyterhub`
- Scopes: `openid`, `profile`, `email`
- Username claim: `preferred_username`
- Auto-login: enabled (redirects directly to Keycloak)
- Admin users: configurable via the `JUPYTERHUB_ADMIN_USER` env var
**Keycloak Groups:** The five AKKO realm roles (`akko-admin`, `akko-engineer`, `akko-analyst`, `akko-user`, `akko-viewer`) control access across the AKKO platform. JupyterHub currently allows all authenticated users (`allow_all = True`), but the Keycloak roles propagate to downstream services such as Trino for fine-grained RBAC.
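In `jupyterhub_config.py` terms, the settings above roughly correspond to the following GenericOAuthenticator configuration (a sketch only; the Keycloak realm endpoint URLs are omitted rather than guessed):

```python
# jupyterhub_config.py -- illustrative sketch of the documented settings;
# the issuer/authorize/token URLs for the Keycloak realm are intentionally
# omitted here.
import os

c.JupyterHub.authenticator_class = "generic-oauth"
c.GenericOAuthenticator.client_id = "jupyterhub"
c.GenericOAuthenticator.scope = ["openid", "profile", "email"]
c.GenericOAuthenticator.username_claim = "preferred_username"
c.GenericOAuthenticator.auto_login = True
c.GenericOAuthenticator.allow_all = True

# Admin users come from the JUPYTERHUB_ADMIN_USER env var, as noted above.
c.Authenticator.admin_users = set(
    filter(None, [os.environ.get("JUPYTERHUB_ADMIN_USER")])
)
```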
## Resource Limits

| Resource | Production Default | Dev Override |
|---|---|---|
| Memory limit | 4 GB per user pod | 2 GB |
| Memory guarantee | 1 GB | 512 MB |
| CPU limit | 2 cores | 1 core |
| CPU guarantee | 0.5 cores | 0.1 cores |
| Storage | 10 Gi (dynamic PVC) | 1 Gi |
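Under the hood these limits map to KubeSpawner traits. A sketch of the production column in `jupyterhub_config.py` form; in a z2jh deployment you would normally set the equivalent `singleuser.*` Helm values instead:

```python
# jupyterhub_config.py -- production defaults from the table above,
# expressed as KubeSpawner traits (illustrative; z2jh derives these
# from the singleuser.* Helm values).
c.KubeSpawner.mem_limit = "4G"
c.KubeSpawner.mem_guarantee = "1G"
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.cpu_guarantee = 0.5
c.KubeSpawner.storage_capacity = "10Gi"
```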
## Resource Management

### Idle Server Culling

JupyterHub automatically shuts down inactive notebook servers to free cluster resources (CPU, RAM). This is handled by the built-in jupyterhub-idle-culler.

| Parameter | Default | Description |
|---|---|---|
| `cull.enabled` | `true` | Enable automatic idle server shutdown |
| `cull.timeout` | `1800` | Seconds of inactivity before shutdown (30 min) |
| `cull.every` | `300` | How often to check for idle servers (5 min) |
| `cull.maxAge` | `28800` | Maximum server lifetime in seconds (8 h), even if active |
To change these values, edit `values-dev.yaml` (or your production values file):

```
jupyterhub:
  cull:
    enabled: true
    timeout: 3600   # 1 hour of inactivity
    every: 300      # check every 5 minutes
    maxAge: 43200   # 12 hours max lifetime
```

Then apply:

```
helm upgrade akko helm/akko/ -n akko -f <your-values-file>.yaml \
  --set-file akko-keycloak.realm.data=<realm-file>.json
```
**User Data is Preserved:** Culling only deletes the pod (compute resources). User files are stored on a PersistentVolumeClaim (`claim-<username>`) that persists across restarts. When the user logs in again, a new pod is created and the PVC is re-mounted automatically.
**Production Recommendations**

- Data scientists running long ML training: increase `timeout` to `7200` (2 h) and `maxAge` to `86400` (24 h).
- Cost-sensitive environments: reduce `timeout` to `600` (10 min).
- Workshops / demos: set `maxAge` to `14400` (4 h) to ensure cleanup after the session.
## Known Issues

**Important Gotchas**

- `CHOWN_HOME_OPTS=-R` breaks with read-only volumes: the notebook image uses `CHOWN_HOME=no` and a custom `before-notebook.d` hook that does `chown 2>/dev/null || true`.
- `jupyterlab-drawio`: never install it -- it forces JupyterLab 3.
- `jupyterlab-lsp`: must be >= 5 for JupyterLab 4 compatibility.
- Keycloak `emailVerified`: all test users must have `emailVerified=true` in Keycloak, otherwise oauth2-proxy rejects them.