# Data Scientist Starter Kit

This guide covers the machine learning and AI workflows available on AKKO: notebooks, experiment tracking, local LLM inference, RAG pipelines, and report publishing.
## JupyterHub Environment

AKKO provides a fully-featured notebook environment with multiple language kernels and IDE options.
### Available Kernels

| Kernel | Version | Key Libraries |
|---|---|---|
| Python 3 | 3.11+ | pandas, scikit-learn, pyspark, trino, langchain, matplotlib, seaborn |
| R | 4.x | tidyverse, ggplot2, DBI |
| Julia | 1.x | DataFrames, Plots |
### IDE Options

- JupyterLab -- default notebook interface with extensions
- code-server -- VS Code in the browser (accessible from the JupyterHub launcher)
### Pre-installed ML Libraries

```python
# Machine Learning
import sklearn
import xgboost
import lightgbm

# Deep Learning
import torch

# Data Processing
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
```
### Connecting to Data

```python
# Spark Connect (Iceberg tables)
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-connect:15002").getOrCreate()
df_spark = spark.sql("SELECT * FROM polaris.banking.transactions")

# Trino (federated SQL)
from trino.dbapi import connect

conn = connect(host="trino", port=8080, user="alice")

# PostgreSQL (direct)
import os
import psycopg2

conn = psycopg2.connect(
    host="postgresql-data",
    dbname="akko_data",
    user="akko",
    password=os.environ["POSTGRES_AKKO_PASSWORD"]
)
```
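Once connected, any of these clients can execute queries. A minimal sketch using the Trino connection from above (the aggregate query itself is illustrative):

```python
import pandas as pd
from trino.dbapi import connect

# Run a federated query via Trino and load the result into pandas
conn = connect(host="trino", port=8080, user="alice")
cur = conn.cursor()
cur.execute("SELECT type, count(*) AS n FROM polaris.banking.transactions GROUP BY type")
df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
print(df)
```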
## MLflow Experiment Tracking

MLflow tracks experiments, models, and artifacts. It is pre-configured with MinIO for artifact storage and PostgreSQL for metadata.
### Logging an Experiment

```python
import mlflow
import xgboost as xgb
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("credit-risk-model")

with mlflow.start_run(run_name="xgboost-v1"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("learning_rate", 0.1)

    # Train model (X_train, X_test, y_train, y_test prepared beforehand)
    model = xgb.XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Log metrics
    y_pred = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("f1_score", f1_score(y_test, y_pred))

    # Log the trained model as an artifact
    mlflow.xgboost.log_model(model, "model")
```
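Figures and tags attach to the same run. A short sketch of logging a ROC curve alongside the metrics (the run name and tag value are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

with mlflow.start_run(run_name="xgboost-v1-with-plot"):
    # Render a ROC curve and store it as a run artifact
    fig, ax = plt.subplots()
    RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax)
    mlflow.log_figure(fig, "roc_curve.png")

    # Tag the run so it is easy to filter in the UI
    mlflow.set_tag("team", "credit-risk")
```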
### Model Registry

```python
# Register a model
mlflow.register_model("runs:/<run_id>/model", "credit-risk-model")

# Load a registered model
model = mlflow.pyfunc.load_model("models:/credit-risk-model/Production")
predictions = model.predict(new_data)
```
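Loading `models:/credit-risk-model/Production` assumes some version has been promoted to that stage. One way to do that is through the tracking client (the version number here is illustrative):

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow:5000")

# Promote version 1 of the registered model to the Production stage
client.transition_model_version_stage(
    name="credit-risk-model",
    version=1,
    stage="Production"
)
```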
### MLflow UI

Access the MLflow tracking UI at https://mlflow.<domain> to compare experiments, view metrics, and manage the model registry.
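Runs can also be compared programmatically: `mlflow.search_runs` returns a pandas DataFrame with one column per metric or parameter (assuming MLflow 2.x, where `experiment_names` is accepted):

```python
# Fetch all runs of the experiment, best F1 score first
runs = mlflow.search_runs(
    experiment_names=["credit-risk-model"],
    order_by=["metrics.f1_score DESC"]
)
print(runs[["run_id", "metrics.accuracy", "metrics.f1_score"]].head())
```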
## Local LLM Inference with Ollama + LiteLLM

AKKO runs LLMs locally using Ollama, with LiteLLM as a unified API gateway.
### Available Models

| Model | Size | Use Case |
|---|---|---|
| Qwen 2.5-Coder:7B | ~4.7 GB | Code generation, SQL assistance |
| Qwen 2.5:3B | ~2 GB | General text generation |
| nomic-embed-text | ~274 MB | Text embeddings (RAG) |
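To confirm which models are actually pulled on the Ollama server, its `/api/tags` endpoint lists everything stored locally (a quick sketch, assuming `OLLAMA_HOST` is set as in the examples below):

```python
import os
import requests

# List the models currently available on the Ollama server
resp = requests.get(f"http://{os.environ['OLLAMA_HOST']}/api/tags")
for model in resp.json()["models"]:
    print(model["name"])
```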
### Using LiteLLM (OpenAI-Compatible API)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://litellm:4000/v1",
    api_key="sk-akko"  # configured in LiteLLM
)

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[
        {"role": "system", "content": "You are a helpful SQL assistant."},
        {"role": "user", "content": "Write a query to find the top 10 customers by revenue."}
    ]
)
print(response.choices[0].message.content)
```
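For longer generations, the same endpoint supports streaming, so tokens print as they are produced:

```python
stream = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Explain window functions in SQL."}],
    stream=True
)
for chunk in stream:
    # Each chunk carries an incremental piece of the reply
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```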
### Using Ollama Directly

```python
import os
import requests

response = requests.post(
    f"http://{os.environ['OLLAMA_HOST']}/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Explain what Apache Iceberg is in 3 sentences.",
        "stream": False
    }
)
print(response.json()["response"])
```
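Ollama also serves the embeddings that the RAG pipeline below depends on. A sketch against the classic `/api/embeddings` endpoint (newer Ollama versions additionally expose `/api/embed`):

```python
import os
import requests

response = requests.post(
    f"http://{os.environ['OLLAMA_HOST']}/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "Apache Iceberg is an open table format."}
)
embedding = response.json()["embedding"]
print(len(embedding))  # nomic-embed-text returns 768-dimensional vectors
```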
### Jupyter AI Integration

Jupyter AI is pre-installed. Use the chat sidebar in JupyterLab to interact with local models directly within your notebook environment.
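Besides the chat sidebar, Jupyter AI ships `%ai` cell magics. Assuming the Ollama provider is configured (and the model id below matches one pulled on your server), a prompt can be sent straight from a notebook cell:

```python
# Cell 1: load the magics once per kernel
%load_ext jupyter_ai_magics

# Cell 2: the cell body becomes the prompt
%%ai ollama:qwen2.5:3b
Suggest three sanity checks for a transactions table.
```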
## RAG Pipeline (pgvector + Ollama + LangChain)

AKKO includes a complete Retrieval-Augmented Generation stack, running entirely on-premises.
### Architecture

```
Documents --> Embeddings (nomic-embed-text) --> pgvector (PostgreSQL)
                                                        |
User Query --> Embedding --> Similarity Search <--------+
                                   |
Context + Query --> Ollama (Qwen 2.5) --> Response
```
### Building a RAG Pipeline

```python
import os

from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import PGVector
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# 1. Initialize embeddings
embeddings = OllamaEmbeddings(
    base_url=f"http://{os.environ['OLLAMA_HOST']}",
    model="nomic-embed-text"
)

# 2. Connect to pgvector
CONNECTION_STRING = (
    f"postgresql://akko:{os.environ['POSTGRES_AKKO_PASSWORD']}"
    f"@postgresql-data:5432/akko_data"
)
vectorstore = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
    collection_name="documents"
)

# 3. Load, split, and add documents
loader = TextLoader("report.txt")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore.add_documents(chunks)

# 4. Query with RAG
llm = Ollama(
    base_url=f"http://{os.environ['OLLAMA_HOST']}",
    model="qwen2.5:3b"
)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)
result = qa.invoke("What were the key findings in the report?")
print(result["result"])
```
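To sanity-check what the retriever hands the model, query the vector store directly; `similarity_search_with_score` returns each chunk with its distance (lower means closer under the default metric):

```python
hits = vectorstore.similarity_search_with_score(
    "What were the key findings in the report?", k=3
)
for doc, score in hits:
    # Print the distance and the start of each retrieved chunk
    print(f"{score:.3f}  {doc.page_content[:80]}")
```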
### Pre-built Demo

Check `shared/rag-pipeline-demo.ipynb` in JupyterHub for a complete working example.
## Publishing Reports with Quarto

Quarto is pre-installed in the notebook image. Create publication-quality reports from your analyses.
### Creating a Report

Create a `.qmd` file in JupyterLab:
````markdown
---
title: "Credit Risk Analysis"
author: "Data Science Team"
format:
  html:
    theme: cosmo
    toc: true
---

## Summary

This report analyzes credit risk patterns across our portfolio.

```{python}
import pandas as pd

# conn: a database connection opened earlier in the report
df = pd.read_sql("SELECT * FROM transactions", conn)
df.describe()
```
````
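Render the report from a JupyterLab terminal with `quarto render report.qmd`; the HTML output is written next to the source file.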
### Publishing to AKKO Docs

Rendered reports are published to the `published-reports` volume, accessible through the AKKO Docs service at https://docs.<domain>.

See `shared/04-akko-banking-report.qmd` for a complete example.