# Data Scientist Starter Kit

This guide covers the machine learning and AI workflows available on AKKO: notebooks, experiment tracking, local LLM inference, RAG pipelines, and report publishing.
## JupyterHub Environment

AKKO provides a fully-featured notebook environment with multiple language kernels and IDE options.
### Available Kernels

| Kernel | Version | Key Libraries |
|---|---|---|
| Python 3 | 3.11+ | pandas, scikit-learn, pyspark, trino, langchain, matplotlib, seaborn |
| R | 4.x | tidyverse, ggplot2, DBI |
| Julia | 1.x | DataFrames, Plots |
### IDE Options

- JupyterLab -- default notebook interface with extensions
- code-server -- VS Code in the browser (accessible from the JupyterHub launcher)
### Pre-installed ML Libraries

```python
# Machine Learning
import sklearn
import xgboost
import lightgbm

# Deep Learning
import torch

# Data Processing
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
```
### Connecting to Data

```python
# Spark Connect (Iceberg tables)
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect:15002").getOrCreate()
df_spark = spark.sql("SELECT * FROM polaris.banking.transactions")

# Trino (federated SQL)
from trino.dbapi import connect

conn = connect(host="trino", port=8080, user="alice")

# PostgreSQL (direct)
import os
import psycopg2

conn = psycopg2.connect(
    host="postgresql-data",
    dbname="akko_data",
    user="akko",
    password=os.environ["POSTGRES_AKKO_PASSWORD"]
)
```
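Once connected, any of these clients can execute queries. A minimal sketch using the Trino connection from above (the aggregate query itself is illustrative):

```python
import pandas as pd
from trino.dbapi import connect

# Run a federated query via Trino and load the result into pandas
conn = connect(host="trino", port=8080, user="alice")
cur = conn.cursor()
cur.execute("SELECT type, count(*) AS n FROM polaris.banking.transactions GROUP BY type")
df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
print(df)
```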
## MLflow Experiment Tracking

MLflow tracks experiments, models, and artifacts. It is pre-configured with MinIO for artifact storage and PostgreSQL for metadata.
### Logging an Experiment

```python
import mlflow
import xgboost as xgb
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("credit-risk-model")

with mlflow.start_run(run_name="xgboost-v1"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)

    # Train model (X_train, X_test, y_train, y_test prepared beforehand)
    model = xgb.XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log the trained model as an artifact
    mlflow.xgboost.log_model(model, "model")
```
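Figures and tags attach to the same run. A short sketch of logging a ROC curve alongside the metrics (the run name and tag value are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

with mlflow.start_run(run_name="xgboost-v1-with-plot"):
    # Render a ROC curve and store it as a run artifact
    fig, ax = plt.subplots()
    RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax)
    mlflow.log_figure(fig, "roc_curve.png")

    # Tag the run so it is easy to filter in the UI
    mlflow.set_tag("team", "credit-risk")
```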
### Model Registry

```python
# Register a model
mlflow.register_model("runs:/<run_id>/model", "credit-risk-model")

# Load a registered model
model = mlflow.pyfunc.load_model("models:/credit-risk-model/Production")
predictions = model.predict(new_data)
```
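Loading `models:/credit-risk-model/Production` assumes some version has been promoted to that stage. One way to do that is through the tracking client (the version number here is illustrative):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow:5000")

# Promote version 1 of the registered model to the Production stage
client.transition_model_version_stage(
    name="credit-risk-model",
    version=1,
    stage="Production"
)
```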
### MLflow UI

Access the MLflow tracking UI at https://mlflow.<domain> to compare experiments, view metrics, and manage the model registry.
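Runs can also be compared programmatically: `mlflow.search_runs` returns a pandas DataFrame with one column per metric or parameter (assuming MLflow 2.x, where `experiment_names` is accepted):

```python
# Fetch all runs of the experiment, best F1 score first
runs = mlflow.search_runs(
    experiment_names=["credit-risk-model"],
    order_by=["metrics.f1_score DESC"]
)
print(runs[["run_id", "metrics.accuracy", "metrics.f1_score"]].head())
```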
## Local LLM Inference with Ollama + LiteLLM

AKKO runs LLMs locally using Ollama, with LiteLLM as a unified API gateway.
### Available Models

| Model | Size | Use Case |
|---|---|---|
| Qwen 2.5-Coder:7B | ~4.7 GB | Code generation, SQL assistance |
| Qwen 2.5:3B | ~2 GB | General text generation |
| nomic-embed-text | ~274 MB | Text embeddings (RAG) |
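To confirm which models are actually pulled on the Ollama server, its `/api/tags` endpoint lists everything stored locally (a quick sketch, assuming `OLLAMA_HOST` is set as in the examples below):

```python
import os
import requests

# List the models currently available on the Ollama server
resp = requests.get(f"http://{os.environ['OLLAMA_HOST']}/api/tags")
for model in resp.json()["models"]:
    print(model["name"])
```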
### Using LiteLLM (OpenAI-Compatible API)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000/v1",
    api_key="sk-akko"  # configured in LiteLLM
)

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[
        {"role": "system", "content": "You are a helpful SQL assistant."},
        {"role": "user", "content": "Write a query to find the top 10 customers by revenue."}
    ]
)
print(response.choices[0].message.content)
```
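For longer generations, the same endpoint supports streaming, so tokens print as they are produced:

```python
stream = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Explain window functions in SQL."}],
    stream=True
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```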
### Using Ollama Directly

```python
import os
import requests

response = requests.post(
    f"http://{os.environ['OLLAMA_HOST']}/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Explain what Apache Iceberg is in 3 sentences.",
        "stream": False
    }
)
print(response.json()["response"])
```
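Ollama also serves the embeddings that the RAG pipeline below depends on. A sketch against the classic `/api/embeddings` endpoint (newer Ollama versions additionally expose `/api/embed`):

```python
import os
import requests

response = requests.post(
    f"http://{os.environ['OLLAMA_HOST']}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Apache Iceberg is an open table format."}
)
embedding = response.json()["embedding"]
print(len(embedding))  # nomic-embed-text returns 768-dimensional vectors
```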
### Jupyter AI Integration

Jupyter AI is pre-installed. Use the chat sidebar in JupyterLab to interact with local models directly within your notebook environment.
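Besides the chat sidebar, Jupyter AI ships `%ai` cell magics. Assuming the Ollama provider is configured (and the model id below matches one pulled on your server), a prompt can be sent straight from a notebook cell:

```python
# Cell 1: load the magics once per kernel
%load_ext jupyter_ai_magics

# Cell 2: the cell body becomes the prompt
%%ai ollama:qwen2.5:3b
Suggest three sanity checks for a transactions table.
```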
## RAG Pipeline (pgvector + Ollama + LangChain)

AKKO includes a complete Retrieval-Augmented Generation stack, running entirely on-premises.
### Architecture

```
Documents --> Embeddings (nomic-embed-text) --> pgvector (PostgreSQL)
                                                        |
User Query --> Embedding --> Similarity Search <--------+
                                   |
Context + Query --> Ollama (Qwen 2.5) --> Response
```
### Building a RAG Pipeline

```python
import os

from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import PGVector
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Initialize embeddings
embeddings = OllamaEmbeddings(
    base_url=f"http://{os.environ['OLLAMA_HOST']}",
    model="nomic-embed-text"
)

# 2. Connect to pgvector
CONNECTION_STRING = (
    f"postgresql://akko:{os.environ['POSTGRES_AKKO_PASSWORD']}"
    f"@postgresql-data:5432/akko_data"
)
vectorstore = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
    collection_name="documents"
)

# 3. Load, split, and add documents
loader = TextLoader("report.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore.add_documents(chunks)

# 4. Query with RAG
llm = Ollama(
    base_url=f"http://{os.environ['OLLAMA_HOST']}",
    model="qwen2.5:3b"
)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)
result = qa.invoke("What were the key findings in the report?")
print(result["result"])
```
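To sanity-check what the retriever hands the model, query the vector store directly; `similarity_search_with_score` returns each chunk with its distance (lower means closer under the default metric):

```python
hits = vectorstore.similarity_search_with_score(
    "What were the key findings in the report?", k=3
)
for doc, score in hits:
    # Print the distance and the start of each retrieved chunk
    print(f"{score:.3f}  {doc.page_content[:80]}")
```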
### Pre-built Demo

Check `shared/rag-pipeline-demo.ipynb` in JupyterHub for a complete working example.
## Publishing Reports with Quarto

Quarto is pre-installed in the notebook image. Create publication-quality reports from your analyses.
### Creating a Report

Create a `.qmd` file in JupyterLab:
````markdown
---
title: "Credit Risk Analysis"
author: "Data Science Team"
format:
  html:
    theme: cosmo
    toc: true
---

## Summary

This report analyzes credit risk patterns across our portfolio.

```{python}
import pandas as pd

# conn: a database connection opened earlier in the report
df = pd.read_sql("SELECT * FROM transactions", conn)
df.describe()
```
````
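Render the report from a JupyterLab terminal with `quarto render report.qmd`; the HTML output is written next to the source file.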
### Publishing to AKKO Docs

Rendered reports are published to the `published-reports` volume, accessible through the AKKO Docs service at https://docs.<domain>.

See `shared/04-akko-banking-report.qmd` for a complete example.