# vLLM — High-Throughput LLM Inference

vLLM is a high-throughput LLM inference server designed for production GPU environments. It replaces Ollama when GPU hardware is available, serving an OpenAI-compatible API with continuous batching and optimized GPU memory utilization. When vLLM is enabled, LiteLLM routes to it automatically.
## Architecture

```text
Notebooks / Cockpit / Applications
              |
       +------v-------+
       |   LiteLLM    |  OpenAI-compatible API (port 4000)
       | (AI Gateway) |
       +------+-------+
              |
       +------v-------+
       |     vLLM     |  GPU inference (port 8000)
       +--------------+
        Qwen/Qwen2.5-7B-Instruct
```
In development (no GPU), LiteLLM routes to Ollama instead. In production (GPU available), LiteLLM routes to vLLM. The switch is transparent — application code never changes.
- **Dev:** LiteLLM → Ollama (CPU, multiple small models)
- **Prod:** LiteLLM → vLLM (GPU, single high-throughput model)
## URLs

| Mode | URL |
|---|---|
| Kubernetes (production) | Internal only — accessed via LiteLLM gateway |

vLLM does not have its own ingress. All requests go through LiteLLM (`https://llm.akko.local`), which routes to vLLM internally.
## Usage

### From Notebooks (via LiteLLM)

No code changes are needed. The same LiteLLM API works whether the backend is Ollama or vLLM:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

response = client.chat.completions.create(
    model="akko-coder",  # routed to vLLM in production
    messages=[{"role": "user", "content": "Write a SQL query for customer segmentation"}],
)
print(response.choices[0].message.content)
```
### Direct vLLM API (internal)

For services that need to bypass LiteLLM:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-vllm:8000/v1",
    api_key="EMPTY",  # vLLM does not require auth internally
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain Apache Iceberg"}],
)
print(response.choices[0].message.content)
```
## Configuration

### Dev Environment (`values-dev.yaml`)

vLLM is disabled in dev; Ollama handles inference on CPU.
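A minimal sketch of the corresponding override (the field name matches the Helm value used elsewhere in this page; the exact file layout is an assumption):

```yaml
# values-dev.yaml (sketch)
akko-vllm:
  enabled: false   # no GPU in dev; LiteLLM routes to Ollama instead
```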
### Production Environment (`values-production.yaml`)

vLLM is enabled with GPU resources:

```yaml
akko-vllm:
  enabled: true
  replicaCount: 1
  model:
    name: "Qwen/Qwen2.5-7B-Instruct"
    maxModelLen: 4096
    gpuMemoryUtilization: 0.9

akko-litellm:
  vllmBackend:
    enabled: true
    host: "akko-akko-vllm"
    port: 8000
    model: "Qwen/Qwen2.5-7B-Instruct"
```
## Model Configuration

| Parameter | Default | Description |
|---|---|---|
| `model.name` | `Qwen/Qwen2.5-7B-Instruct` | HuggingFace model ID |
| `model.maxModelLen` | `4096` | Maximum sequence length (tokens) |
| `model.gpuMemoryUtilization` | `0.9` | Fraction of GPU memory to use (0.0-1.0) |
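For reference, these settings correspond to vLLM's own server flags; a standalone invocation with the same values would look like this (the flag names are vLLM's, but the claim that the chart forwards these exact values is an assumption):

```bash
# Standalone equivalent of the chart's model settings (sketch)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```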
## Key Features
| Feature | Description |
|---|---|
| Continuous Batching | Dynamically batches incoming requests for maximum throughput |
| PagedAttention | Efficient GPU memory management, serves more concurrent requests |
| OpenAI-Compatible API | Drop-in replacement — same API as OpenAI, GPT4All, Ollama |
| HuggingFace Models | Load any model from HuggingFace Hub by ID |
| GPU Memory Optimization | Configurable utilization ratio to balance throughput and model size |
| Tensor Parallelism | Scale across multiple GPUs for larger models |
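To make the PagedAttention row concrete, here is a toy Python sketch of the core idea (not vLLM's actual code): KV-cache memory is handed out in fixed-size blocks as sequences grow, rather than reserving a full `maxModelLen`-sized slab per sequence up front. The block size of 16 matches vLLM's default; everything else is illustrative.

```python
# Toy sketch of PagedAttention's allocation idea (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per block; 16 is also vLLM's default block size


class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks

    def append_token(self, seq_id: str, pos: int) -> None:
        """Claim a new physical block only at each block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # first token of a new block
            table.append(self.free_blocks.pop())


cache = PagedKVCache(num_blocks=8)
for pos in range(40):  # a 40-token sequence
    cache.append_token("seq-0", pos)

# 40 tokens occupy ceil(40 / 16) = 3 blocks, leaving 5 free for other sequences
print(len(cache.block_tables["seq-0"]), len(cache.free_blocks))  # 3 5
```

Because short sequences only hold the blocks they actually use, far more concurrent requests fit in the same VRAM than with contiguous per-sequence allocation.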
## Dev vs Production

| Aspect | Dev (Ollama) | Production (vLLM) |
|---|---|---|
| Hardware | CPU (Apple Silicon / x86) | NVIDIA GPU (CUDA) |
| Models | Multiple small models (3B-7B) | Single optimized model (7B+) |
| Throughput | Low (1-5 tokens/s) | High (50-200+ tokens/s) |
| Batching | Sequential | Continuous batching |
| Memory | System RAM | GPU VRAM (16 Gi+ recommended) |
| Helm value | `akko-vllm.enabled: false` | `akko-vllm.enabled: true` |
## Resource Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| GPU | 1x NVIDIA (16 GB VRAM) | 1x NVIDIA A100 (40/80 GB) |
| RAM | 8 Gi | 16 Gi |
| GPU Memory Utilization | 0.8 | 0.9 |
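The 16 GB VRAM floor follows from the model weights alone; a quick estimate (the ~7.6B parameter count for Qwen2.5-7B-Instruct is an approximation from its published model card):

```python
# Rough VRAM needed just to hold the weights in fp16/bf16 (2 bytes per parameter)
params = 7.6e9  # Qwen2.5-7B-Instruct, approximate parameter count
weight_gib = params * 2 / 2**30
print(f"{weight_gib:.1f} GiB")  # ~14.2 GiB, close to a 16 GB card's limit
```

Everything beyond the weights (KV cache, activations) must fit in the remaining fraction, which is why `gpuMemoryUtilization` and `maxModelLen` trade off against each other on smaller cards.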
**GPU Required:** vLLM requires an NVIDIA GPU with CUDA support. It will not start on CPU-only nodes. For development without a GPU, use Ollama instead (`akko-vllm.enabled: false`).
**Model Download:** On first startup, vLLM downloads the model from HuggingFace Hub. For Qwen/Qwen2.5-7B-Instruct, this is approximately 15 GB. Ensure sufficient disk space and network bandwidth.
## Healthcheck

vLLM exposes a `/health` endpoint used by Kubernetes probes:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # model loading takes time
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
```
**Slow Startup:** vLLM takes 60-120 seconds to load a 7B model into GPU memory. The `initialDelaySeconds` values are set accordingly to avoid premature restarts.
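An alternative to inflating `initialDelaySeconds` is a Kubernetes `startupProbe`, which holds off the liveness probe until the first success; a sketch (whether the chart exposes this knob is an assumption):

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30   # tolerates up to ~300 s of model loading
```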
## Troubleshooting

### GPU Not Detected

**Symptoms:** The vLLM pod fails to start. Logs show `RuntimeError: No CUDA GPUs are available` or `torch.cuda.is_available()` returns `False`. The pod enters `CrashLoopBackOff`.

**Cause:** The Kubernetes node does not have an NVIDIA GPU, the NVIDIA device plugin is not installed, or the pod spec is missing the `nvidia.com/gpu` resource request.

**Solution:**

```bash
# Check if the NVIDIA device plugin is running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised by nodes
kubectl describe nodes | grep -A5 "nvidia.com/gpu"

# Check vLLM pod events for scheduling failures
kubectl describe pod -n akko -l app.kubernetes.io/name=akko-vllm | grep -A10 "Events"

# Verify the pod requests GPU resources
kubectl get pod -n akko -l app.kubernetes.io/name=akko-vllm -o jsonpath='{.items[0].spec.containers[0].resources}'

# If no GPU is available, disable vLLM and use Ollama instead
# helm upgrade akko helm/akko/ -n akko --set akko-vllm.enabled=false
```
### CUDA Version Mismatch

**Symptoms:** The vLLM pod starts but crashes immediately. Logs show `CUDA driver version is insufficient for CUDA runtime version` or `CUDA error: no kernel image is available for execution on the device`.

**Cause:** The CUDA runtime version in the vLLM container image does not match the CUDA driver version installed on the host node. This typically happens with older GPU drivers or newer container images.

**Solution:**

```bash
# Check the host CUDA driver version
kubectl exec -n akko deploy/akko-akko-vllm -- nvidia-smi 2>/dev/null || echo "nvidia-smi not available"

# Check the CUDA version reported by the vLLM container
kubectl logs -n akko deploy/akko-akko-vllm --tail=20 | grep -i "cuda\|driver"

# Verify GPU compute capability matches the model requirements
kubectl exec -n akko deploy/akko-akko-vllm -- python3 -c "import torch; print(torch.cuda.get_device_capability())" 2>/dev/null

# Fixes:
# 1. Update the NVIDIA driver on the host node
# 2. Use a vLLM image built for your CUDA version
# 3. Pin the vLLM image tag to a version compatible with your driver
```
### Model Download Timeout

**Symptoms:** The vLLM pod stays in `Running` state but never becomes `Ready`. Logs show `Downloading model...` with slow or stalled progress. The readiness probe eventually fails and the pod restarts.

**Cause:** The HuggingFace model download is slow due to network bandwidth limitations, a proxy blocking large downloads, or a missing `HF_TOKEN` for gated models.

**Solution:**

```bash
# Check download progress in the logs
kubectl logs -n akko deploy/akko-akko-vllm --tail=30 | grep -i "download\|model\|huggingface"

# Verify network connectivity from the pod
kubectl exec -n akko deploy/akko-akko-vllm -- curl -sI https://huggingface.co

# Increase the readiness probe timeout to allow for slow downloads
# In values: initialDelaySeconds: 300, failureThreshold: 30

# For gated models, ensure HF_TOKEN is set
kubectl get secret -n akko -l app.kubernetes.io/name=akko-vllm -o yaml | grep HF_TOKEN

# Consider pre-downloading the model to a PVC to avoid repeated downloads
```
### Out of VRAM

**Symptoms:** vLLM crashes with `torch.cuda.OutOfMemoryError: CUDA out of memory` or `ValueError: The model's max seq len is too large`. The pod restarts repeatedly.

**Cause:** The model requires more GPU VRAM than is available. The `gpuMemoryUtilization` ratio is set too high (leaving no room for the KV cache), or `maxModelLen` is too large for the available memory.

**Solution:**

```bash
# Check GPU memory usage
kubectl exec -n akko deploy/akko-akko-vllm -- nvidia-smi 2>/dev/null

# Check the current configuration
kubectl get cm -n akko -l app.kubernetes.io/name=akko-vllm -o yaml | grep -i "memory\|model_len\|gpu"

# Reduce GPU memory utilization and max sequence length, e.g.:
#   akko-vllm.model.gpuMemoryUtilization: 0.8 (down from 0.9)
#   akko-vllm.model.maxModelLen: 2048 (down from 4096)
helm upgrade akko helm/akko/ -n akko -f helm/examples/values-dev.yaml \
  --set akko-vllm.model.gpuMemoryUtilization=0.8 \
  --set akko-vllm.model.maxModelLen=2048

# Or switch to a smaller model (e.g., 3B instead of 7B)
```
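To see why lowering `maxModelLen` helps, estimate the KV-cache footprint per sequence. The architecture numbers below (28 layers, 4 KV heads via grouped-query attention, head dim 128, fp16) are taken from Qwen2.5-7B's published config and should be treated as assumptions:

```python
# Per-token KV-cache size: K and V tensors, per layer, per KV head, in fp16
num_layers, num_kv_heads, head_dim, dtype_bytes = 28, 4, 128, 2
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(kv_per_token // 1024, "KiB per token")        # 56 KiB

# One full-length sequence at maxModelLen 4096 vs 2048
print(kv_per_token * 4096 // 2**20, "MiB at 4096")  # 224 MiB
print(kv_per_token * 2048 // 2**20, "MiB at 2048")  # 112 MiB
```

Halving `maxModelLen` halves the worst-case cache per sequence, so the same VRAM headroom admits more concurrent requests; models without grouped-query attention cache one KV pair per attention head and can need several times more.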
### Slow Startup (Model Loading)

**Symptoms:** The vLLM pod is `Running` but not `Ready` for several minutes. The health endpoint returns errors. LiteLLM routes fail because the vLLM backend is not yet available.

**Cause:** Loading a 7B+ model into GPU memory takes 60-120+ seconds. The default `initialDelaySeconds` may be insufficient for larger models or slower hardware.

**Solution:**

```bash
# Check pod age and readiness status
kubectl get pods -n akko -l app.kubernetes.io/name=akko-vllm -o wide

# Follow startup logs in real time
kubectl logs -n akko deploy/akko-akko-vllm -f --tail=20

# Check if the health endpoint responds
kubectl exec -n akko deploy/akko-akko-vllm -- curl -s http://localhost:8000/health || echo "Not ready yet"

# If the pod keeps restarting due to probe failures, increase the delays
# In values: livenessProbe.initialDelaySeconds: 300
# In values: readinessProbe.initialDelaySeconds: 120, failureThreshold: 30

# Verify LiteLLM handles vLLM startup gracefully
kubectl logs -n akko deploy/akko-akko-litellm --tail=20 | grep -i "vllm\|backend\|error"
```