# LiteLLM — AI Gateway
LiteLLM provides an OpenAI-compatible API gateway that unifies access to all LLM backends in AKKO. Instead of calling Ollama directly, services and notebooks use LiteLLM's standard API — making it trivial to swap backends (Ollama, vLLM, cloud APIs) without changing application code.
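Because the gateway speaks the OpenAI wire protocol, any OpenAI client can talk to it directly. A minimal sketch (the in-cluster service name and dev key match the values shown later on this page; adjust for your deployment):

```python
from openai import OpenAI

# In-cluster service URL and dev master_key — both deployment-specific.
client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

# /v1/models lists whatever backends LiteLLM currently routes to —
# the same call works unchanged if Ollama is swapped for vLLM or a cloud API.
for model in client.models.list():
    print(model.id)
```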
## Architecture

```text
Notebooks / Cockpit / Applications
              |
      +-------v-------+
      |    LiteLLM    |   OpenAI-compatible API (port 4000)
      |  (AI Gateway) |
      +-------+-------+
              |
      +-------v-------+
      |    Ollama     |   Local LLM runtime (port 11434)
      +---------------+
       qwen2.5-coder:7b, qwen2.5:3b, nomic-embed-text
```
## Configured Models

| Model | Size | Use Case |
|---|---|---|
| `qwen2.5-coder:7b` | 4.7 GB | Code generation, text-to-SQL, MCP tool calling |
| `qwen2.5:3b` | 2.0 GB | General chat, jupyter-ai conversations |
| `nomic-embed-text` | 137 MB | RAG embeddings (pgvector) |
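The embedding model is served through the same gateway. A sketch of generating a pgvector-ready embedding via the standard OpenAI embeddings endpoint, assuming LiteLLM routes `nomic-embed-text` to Ollama as configured above (uses the `client` from the Direct API Calls example below):

```python
# nomic-embed-text typically returns 768-dimensional vectors.
resp = client.embeddings.create(
    model="nomic-embed-text",
    input="Total balance by customer, aggregated monthly",
)
vector = resp.data[0].embedding  # list[float], ready to store in pgvector
print(len(vector))
```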
## Usage

### From Notebooks (jupyter-ai)

jupyter-ai is pre-configured to use LiteLLM as its backend:

```python
# jupyter-ai uses JUPYTER_AI_DEFAULT_PROVIDER and JUPYTER_AI_DEFAULT_MODEL
# These are set automatically by the JupyterHub spawner
```
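To confirm what the spawner actually injected into a running kernel, you can inspect the environment directly (variable names from the comment above; the values are deployment-specific):

```python
import os

# Set by the JupyterHub spawner; actual values depend on your deployment.
for var in ("JUPYTER_AI_DEFAULT_PROVIDER", "JUPYTER_AI_DEFAULT_MODEL"):
    print(var, "=", os.environ.get(var))
```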
### Direct API Calls

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",  # master_key from values
)

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Write a SQL query to find top customers"}],
)
print(response.choices[0].message.content)
```
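For longer generations it is often nicer to stream tokens as they arrive; the OpenAI SDK's `stream=True` flag works through LiteLLM as well. A sketch using the same `client` as above:

```python
stream = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Explain window functions in SQL"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content can be None on role/stop chunks.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```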
### Text-to-SQL Example

```python
response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[
        {
            "role": "system",
            "content": "You are a SQL expert. Generate Trino SQL queries for an Iceberg lakehouse.",
        },
        {
            "role": "user",
            "content": "Show me the top 10 customers by total balance",
        },
    ],
)
print(response.choices[0].message.content)
```
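The generated SQL can then be executed against the lakehouse. A minimal sketch using the `trino` Python client — the host, port, catalog, and schema below are illustrative assumptions, not values from this page:

```python
import trino  # pip install trino

# Assumes the model returned bare SQL; add fence-stripping if it wraps
# the query in markdown code fences.
sql = response.choices[0].message.content

# Hypothetical connection details — substitute your Trino service and catalog.
conn = trino.dbapi.connect(
    host="akko-trino",
    port=8080,
    user="notebook",
    catalog="iceberg",
    schema="default",
)
cur = conn.cursor()
cur.execute(sql)
for row in cur.fetchmany(10):
    print(row)
```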
## URLs

| Mode | URL |
|---|---|
| Kubernetes (k3d) | https://llm.akko.local |
## Configuration

### Kubernetes (Helm)

```yaml
akko-litellm:
  enabled: true
  ollamaHost: "akko-akko-ollama"
  config:
    general_settings:
      master_key: "your-secret-key"
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 1Gi  # keep at 1Gi or above: LiteLLM OOMs below 1Gi under load
```
> **Memory Requirement**
>
> LiteLLM requires a memory limit of at least 1Gi. It will be OOMKilled at 512Mi.
## Authentication

LiteLLM is protected by OAuth2-Proxy (ForwardAuth middleware). Users must be authenticated via Keycloak SSO to access the API through the ingress.

For internal service-to-service calls (within the cluster), use the `master_key` for authentication.
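For services without an OpenAI SDK, the `master_key` is passed as a standard bearer token. A sketch with plain `requests`, using the same in-cluster URL and dev key as the examples above:

```python
import requests

resp = requests.post(
    "http://akko-akko-litellm:4000/v1/chat/completions",
    headers={"Authorization": "Bearer akko-dev-litellm-key"},
    json={
        "model": "qwen2.5:3b",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```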
## Troubleshooting

### LiteLLM OOMKilled (CrashLoopBackOff)

**Symptoms:** The LiteLLM pod enters `CrashLoopBackOff` status. `kubectl describe pod` shows `OOMKilled` as the last termination reason.

**Cause:** LiteLLM requires at least 1Gi of memory. If the memory limit is set below this threshold (e.g., 512Mi), the kernel OOM-killer terminates the process.

**Solution:**

```bash
# Check current memory limits
kubectl get pod -n akko -l app.kubernetes.io/name=akko-litellm -o jsonpath='{.items[0].spec.containers[0].resources}'

# Ensure memory limit is at least 1Gi in your values file
# akko-litellm.resources.limits.memory: 1Gi
helm upgrade akko helm/akko/ -n akko -f helm/examples/values-dev.yaml \
  --set akko-litellm.resources.limits.memory=1Gi
```
### Ollama Backend Unreachable

**Symptoms:** API calls to LiteLLM return `Connection refused` or `502 Bad Gateway`. Logs show `Connection error` to the Ollama backend.

**Cause:** The Ollama pod is not running, not ready, or the `ollamaHost` value in the Helm chart does not match the actual Ollama service name.

**Solution:**

```bash
# Verify the Ollama pod is running
kubectl get pods -n akko -l app.kubernetes.io/name=akko-ollama

# Check the Ollama service endpoint
kubectl get svc -n akko | grep ollama

# Check LiteLLM logs for connection errors
kubectl logs -n akko deploy/akko-akko-litellm --tail=50

# Verify the ollamaHost value matches the service name
kubectl get cm -n akko -l app.kubernetes.io/name=akko-litellm -o yaml | grep -i ollama
```
### API Key Validation Failure

**Symptoms:** Requests return `401 Unauthorized` or `Authentication Error: Invalid API Key`. Internal services that previously worked start failing.

**Cause:** The `master_key` configured in LiteLLM does not match the key used by the calling service. This often happens after a Helm upgrade where the key was regenerated or overridden.

**Solution:**

```bash
# Check the current master_key in the LiteLLM config
kubectl get secret -n akko -l app.kubernetes.io/name=akko-litellm -o yaml

# Verify notebook/service environment variables match
kubectl exec -n akko deploy/akko-jupyterhub -- env | grep LITELLM

# Restart LiteLLM after fixing the key
kubectl rollout restart -n akko deploy/akko-akko-litellm
```
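To test a key end-to-end from inside the cluster (e.g., from a notebook), a quick sketch — a 200 means the key is accepted, a 401 means it does not match the configured `master_key`:

```python
import requests

key = "akko-dev-litellm-key"  # the key under test
r = requests.get(
    "http://akko-akko-litellm:4000/v1/models",
    headers={"Authorization": f"Bearer {key}"},
    timeout=10,
)
print(r.status_code)  # 200 = key accepted, 401 = mismatch with master_key
```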
### Model Not Found

**Symptoms:** API calls return `Model not found: <model-name>`. The LiteLLM UI shows no available models.

**Cause:** The model name in the request does not match any model configured in LiteLLM's routing config, or Ollama has not finished pulling the model yet.

**Solution:**

```bash
# List models available in Ollama
kubectl exec -n akko deploy/akko-akko-ollama -- ollama list

# Check LiteLLM logs for registered models
kubectl logs -n akko deploy/akko-akko-litellm --tail=100 | grep -i model

# Verify the LiteLLM config map lists the expected models
kubectl get cm -n akko -l app.kubernetes.io/name=akko-litellm -o yaml
```
### Slow Response Times

**Symptoms:** API calls take 30+ seconds to return. Timeouts occur on larger prompts. Notebooks appear to hang during AI-assisted tasks.

**Cause:** The Ollama backend is running on CPU (expected in dev), the model is too large for available resources, or multiple concurrent requests are queuing because Ollama processes them sequentially.

**Solution:**

```bash
# Check Ollama resource usage
kubectl top pod -n akko -l app.kubernetes.io/name=akko-ollama

# Check if Ollama is CPU-bound (high CPU, no GPU)
kubectl describe pod -n akko -l app.kubernetes.io/name=akko-ollama | grep -A5 "Resources"

# Consider a smaller model (qwen2.5:3b instead of 7b) for faster responses,
# or switch to vLLM with GPU in production for high throughput
```
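To quantify the difference before switching models, a quick latency comparison from a notebook (same `client` as in the usage examples; wall-clock timings on CPU will vary widely):

```python
import time

prompt = [{"role": "user", "content": "Summarize ACID in one sentence."}]
for model in ("qwen2.5:3b", "qwen2.5-coder:7b"):
    start = time.perf_counter()
    client.chat.completions.create(model=model, messages=prompt)
    print(f"{model}: {time.perf_counter() - start:.1f}s")
```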