Ollama

Local LLM inference server for AI-powered data exploration.

URL https://ollama.akko.local
Helm sub-chart akko-ollama (custom)

Overview

Ollama runs open-weight LLMs locally on your infrastructure. No API keys, no data leaving your network. Combined with LiteLLM (unified AI gateway), jupyter-ai in notebooks, and pgvector in PostgreSQL, it enables fully private AI workflows including RAG (Retrieval-Augmented Generation) pipelines.


Models

AKKO ships with three pre-pulled models:

Model              Size     Purpose
qwen2.5-coder:7b   4.7 GB   Code generation and technical assistance
qwen2.5:3b         2.0 GB   General chat and text generation
nomic-embed-text   274 MB   Text embeddings (768 dimensions) for RAG pipelines

The ollama-init sidecar pulls these models on first startup. This can take several minutes depending on your internet connection.
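To check which models have been pulled, you can query Ollama's /api/tags endpoint from a notebook (the service name below is the in-cluster hostname used in the examples later on this page):

import requests

# /api/tags lists the models available locally on the Ollama server.
response = requests.get("http://akko-akko-ollama:11434/api/tags")
for model in response.json()["models"]:
    print(model["name"], model["size"])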


Architecture

  JupyterHub (jupyter-ai)
        |
  LiteLLM (:4000)  ──→  Ollama (:11434)
  (OpenAI-compatible       (model inference)
   API gateway)
        |
  PostgreSQL (pgvector)
  (embedding storage)

  • Ollama handles model loading and inference
  • LiteLLM provides an OpenAI-compatible API layer, routing requests to the appropriate Ollama model
  • jupyter-ai in notebooks connects to LiteLLM for AI-assisted coding
  • pgvector in PostgreSQL stores embeddings for RAG pipelines
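As a quick sanity check, both layers expose simple HTTP endpoints you can hit from a notebook; a minimal sketch, using the in-cluster hostnames and development API key shown in the examples below:

import requests

# Ollama: /api/version returns the running server version.
print(requests.get("http://akko-akko-ollama:11434/api/version").json())

# LiteLLM: the OpenAI-compatible /v1/models route lists the models it can route to.
print(
    requests.get(
        "http://akko-akko-litellm:4000/v1/models",
        headers={"Authorization": "Bearer akko-dev-litellm-key"},
    ).json()
)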

Using Ollama from Notebooks

Direct API Access

import requests

response = requests.post(
    "http://akko-akko-ollama:11434/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Explain Apache Iceberg in one paragraph.",
        "stream": False,
    },
)
print(response.json()["response"])

Via LiteLLM (OpenAI-Compatible)

from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

response = client.chat.completions.create(
    model="ollama/qwen2.5:3b",
    messages=[{"role": "user", "content": "What is a lakehouse?"}],
)
print(response.choices[0].message.content)

Via jupyter-ai

The %%ai magic command in JupyterLab connects to LiteLLM:

%%ai ollama:qwen2.5:3b
Explain the difference between Spark and Trino.
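If the %%ai magic is not available in a fresh kernel, the jupyter-ai magics extension can be loaded explicitly first:

%load_ext jupyter_ai_magics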

Embeddings for RAG

import requests

response = requests.post(
    "http://akko-akko-ollama:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": "Apache Iceberg is an open table format for analytic datasets.",
    },
)
embedding = response.json()["embeddings"][0]
# Store in PostgreSQL with pgvector
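A minimal sketch of the storage step, assuming psycopg2 is available in the notebook; the PostgreSQL hostname, credentials, and table name below are placeholders for illustration and depend on your deployment:

import psycopg2

# Hypothetical connection details -- substitute your deployment's values.
conn = psycopg2.connect(
    host="akko-akko-postgresql", dbname="akko", user="akko", password="akko"
)
with conn, conn.cursor() as cur:
    # The vector extension may already be enabled by the platform;
    # creating it requires sufficient privileges.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "  id serial PRIMARY KEY, content text, embedding vector(768))"
    )
    # pgvector accepts a '[v1,v2,...]' literal; nomic-embed-text returns 768 dimensions.
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        (
            "Apache Iceberg is an open table format for analytic datasets.",
            "[" + ",".join(str(x) for x in embedding) + "]",
        ),
    )
conn.close()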

Healthcheck

The Ollama container image does not include curl or wget. The healthcheck uses a TCP probe:

healthcheck:
  test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/localhost/11434"]

Note

Use CMD (not CMD-SHELL) because the Ollama image's /bin/sh does not support /dev/tcp.


Resource Requirements

Model              Minimum RAM   Recommended RAM
nomic-embed-text   1 GB          2 GB
qwen2.5:3b         4 GB          6 GB
qwen2.5-coder:7b   8 GB          10 GB

Docker Desktop memory

Loading qwen2.5-coder:7b requires at least 10 GB of available RAM in Docker Desktop. If Ollama crashes or restarts, increase the memory allocation.


Known Issues

No tool calling with small models

Small models like TinyLlama do not support MCP/tool calling. Use qwen2.5-coder:7b or qwen2.5:3b for function calling capabilities.
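As an illustration, tool definitions can be sent through LiteLLM using the standard OpenAI tools parameter; the tool below is hypothetical and only shows the request shape:

from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

# Hypothetical tool definition -- only the request shape matters here.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_row_count",
            "description": "Return the number of rows in a table.",
            "parameters": {
                "type": "object",
                "properties": {"table": {"type": "string"}},
                "required": ["table"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="ollama/qwen2.5:3b",
    messages=[{"role": "user", "content": "How many rows does the orders table have?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)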

Model pull takes time

The qwen2.5-coder:7b model is 4.7 GB, so the ollama-init sidecar can take five minutes or more to download it on first startup. Do not interrupt the process.

GPU acceleration

Ollama supports GPU acceleration via NVIDIA CUDA. On macOS with Apple Silicon, Metal acceleration is used automatically. On Linux, pass --gpus all in the Docker configuration to enable CUDA.
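For Docker Compose deployments, the equivalent of --gpus all can be expressed as a device reservation on the ollama service; a minimal sketch, assuming the NVIDIA driver and NVIDIA Container Toolkit are installed on the host:

ollama:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]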