# Ollama

Ollama makes it easy to run large language models (LLMs) locally on your own hardware. It provides a simple API for running models like Llama, Mistral, and others without sending your data to external services.

## Why Run LLMs Locally?

- **Privacy:** Your conversations never leave your machine
- **Cost:** No API fees or usage limits
- **Customization:** Fine-tune models for your specific needs
- **Offline:** Works without an internet connection
- **Control:** Choose exactly which models to run
## System Requirements

### Minimum Specs

- CPU: Modern multi-core processor
- RAM: 8GB (for 7B-parameter models)
- Storage: 10GB+ for models
- GPU: Optional but highly recommended

### Recommended Specs

- CPU: 8+ cores
- RAM: 16GB+ (32GB for larger models)
- GPU: NVIDIA (12GB+ VRAM) or AMD (16GB+ VRAM)
- Storage: NVMe SSD for faster model loading

### GPU Acceleration

**NVIDIA** (best support, uses CUDA):

- RTX 3060 (12GB): Good for 7B-13B models
- RTX 4070 (12GB): Same VRAM class, faster inference
- RTX 4090 (24GB): Can run 30B+ models

**AMD** (uses ROCm, Linux only):

- RX 7800 XT (16GB): Good for 7B-13B models
- RX 7900 XTX (24GB): Can run larger models

**Apple Silicon** (native Metal support):

- M1/M2 with 16GB+: Good for 7B models
- M1/M2 Max/Ultra: Can run larger models
## Installation with Docker

### Basic Setup (CPU Only)

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-data:
```
### With NVIDIA GPU

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    runtime: nvidia
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    restart: unless-stopped

volumes:
  ollama-data:
```
### With AMD GPU (ROCm)

```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0  # Adjust for your GPU
    restart: unless-stopped

volumes:
  ollama-data:
```
Start the service:

```shell
docker compose up -d
```
## Using Ollama

### Pull a Model

```shell
# Pull a model (default tag)
docker exec ollama ollama pull llama3.2

# Pull a specific size (Llama 3.2 ships in 1B and 3B text variants)
docker exec ollama ollama pull llama3.2:1b
docker exec ollama ollama pull llama3.2:3b
```

### List Models

```shell
docker exec ollama ollama list
```

### Run a Model

```shell
# Interactive chat
docker exec -it ollama ollama run llama3.2

# Single prompt
docker exec ollama ollama run llama3.2 "Explain quantum computing"
```

### Remove a Model

```shell
docker exec ollama ollama rm llama3.2
```
## Popular Models

### General Purpose

**Llama 3.2 (Meta):**

- Sizes: 1B and 3B (text); 11B and 90B (vision)
- Strong all-around performance
- Good for most tasks

**Mistral (Mistral AI):**

- Sizes: 7B; Mixtral adds an 8x7B mixture-of-experts variant
- Excellent reasoning
- Fast inference

**Qwen 2.5 (Alibaba):**

- Sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
- Strong multilingual support
- Good coding abilities

### Specialized Models

**CodeLlama (Meta):**

- Optimized for code generation
- Supports many programming languages
- Good for programming tasks

**Phi-3 (Microsoft):**

- Small but capable (3.8B "mini")
- Runs on modest hardware
- Good for simple tasks

**DeepSeek-Coder (DeepSeek):**

- Excellent code generation
- Supports 80+ programming languages
- Strong debugging capabilities
## Model Sizes Explained

- **1B-3B:** Fast, runs on CPU, good for simple tasks
- **7B:** Sweet spot for most users, good quality/speed balance
- **13B:** Better quality, needs more resources
- **30B+:** Best quality, requires a powerful GPU
## API Usage

Ollama exposes a simple HTTP API on port 11434, plus an OpenAI-compatible endpoint under `/v1`. The examples below use the native API:

### Generate Completion

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

### Chat Completion

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false
}'
```

### With Streaming

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a story",
  "stream": true
}'
```
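With `"stream": true`, the API returns newline-delimited JSON: each line carries a `response` fragment, and the final line has `"done": true`. A minimal sketch of reassembling the text (the sample chunks below are illustrative, not real server output):

```python
import json

def join_stream(lines):
    """Concatenate the 'response' fragments from a stream of
    newline-delimited JSON objects, stopping at 'done': true."""
    text = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Canned chunks in the shape Ollama emits:
sample = [
    '{"response": "Once", "done": false}',
    '{"response": " upon a time", "done": false}',
    '{"response": ".", "done": true}',
]
print(join_stream(sample))  # -> Once upon a time.
```

Against a live server, the same function works on `requests.post(..., stream=True)` followed by `response.iter_lines()`.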
## Integration with Applications

### Open WebUI

See Open WebUI for a ChatGPT-like interface.

### Python

```python
import requests

def chat(prompt):
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': 'llama3.2',
            'prompt': prompt,
            'stream': False,
        },
        timeout=120,  # local inference can be slow on CPU
    )
    response.raise_for_status()
    return response.json()['response']

print(chat("What is the capital of France?"))
```

### JavaScript

```javascript
async function chat(prompt) {
  const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      prompt: prompt,
      stream: false
    })
  });
  const data = await response.json();
  return data.response;
}

chat("What is the capital of France?").then(console.log);
```
## Performance Optimization

### Model Quantization

Quantized models use less memory with minimal quality loss:

- **Q4:** 4-bit quantization, roughly a quarter the size of fp16
- **Q5:** 5-bit quantization, better quality
- **Q8:** 8-bit quantization, half the size of fp16, minimal quality loss

Example: `llama3.1:8b-instruct-q4_0` vs `llama3.1:8b-instruct-fp16`
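The memory math behind these numbers can be sketched as a back-of-envelope estimate: parameter count times bits per weight, plus some runtime overhead. The 20% overhead factor here is an assumption, not a measured value — KV cache growth depends heavily on context length:

```python
def model_size_gb(params_billion, bits_per_weight, overhead=1.2):
    """Rough model memory estimate: parameters x bits per weight,
    plus ~20% (assumed) for KV cache and runtime buffers."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

for bits, label in [(16, "fp16"), (8, "q8"), (4, "q4")]:
    print(f"8B model at {label}: ~{model_size_gb(8, bits):.1f} GB")
```

This is why an 8B model at Q4 fits comfortably in 8GB of VRAM while the fp16 version does not.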
### Context Length

Longer context = more memory. `ollama run` does not take a context-size flag; set the `num_ctx` parameter from inside the interactive session:

```shell
# Default context (2048 tokens in most versions)
ollama run llama3.2

# Extended context: inside the session, run
# >>> /set parameter num_ctx 4096
```
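Context length can also be set per request through the native API's `options` field. A minimal sketch (model name and prompt are placeholders):

```python
import json

def generate_payload(model, prompt, num_ctx=4096):
    """Build a /api/generate request body with an extended
    context window set through the 'options' field."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }

payload = generate_payload("llama3.2", "Summarize this document...", 8192)
print(json.dumps(payload, indent=2))

# To send it (requires a running Ollama instance):
# import requests
# requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
```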
### GPU Layers

Control how much of the model runs on the GPU. There is no `--gpu-layers` CLI flag; Ollama exposes this as the `num_gpu` parameter:

```shell
# All layers on GPU (default if a GPU is available)
ollama run llama3.2

# Partial GPU offload (if VRAM is limited): inside the session, run
# >>> /set parameter num_gpu 20
```
### Concurrent Requests

Ollama handles multiple requests efficiently:

- Serves requests in parallel (tunable via the `OLLAMA_NUM_PARALLEL` environment variable)
- Shares model weights in memory across requests
- Queues requests if resources are limited
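A simple way to exploit this from a client is to fan prompts out over a thread pool. A sketch, with the actual API call left as a commented placeholder so the fan-out logic stands on its own:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompts, worker, max_workers=4):
    """Send several prompts concurrently; 'worker' is any callable
    that maps a prompt to a completion string. Results come back
    in the same order as the input prompts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, prompts))

# Against a running instance, the worker would call the API, e.g.:
# import requests
# def ollama_worker(prompt):
#     r = requests.post("http://localhost:11434/api/generate",
#                       json={"model": "llama3.2", "prompt": prompt,
#                             "stream": False}, timeout=300)
#     return r.json()["response"]

# Demonstration with a stand-in worker:
results = fan_out(["a", "b", "c"], lambda p: p.upper())
print(results)  # -> ['A', 'B', 'C']
```

Keep `max_workers` at or below the server's parallel limit, otherwise the extra requests simply queue.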
## Monitoring

### Check GPU Usage

**NVIDIA:**

```shell
nvidia-smi
```

**AMD:**

```shell
rocm-smi
```

### Check Model Loading

```shell
docker logs ollama
```

### API Health Check

```shell
curl http://localhost:11434/api/tags
```
## Troubleshooting

### Out of Memory

**Symptoms:** Model fails to load or crashes

**Solutions:**

- Use a smaller model (7B instead of 13B)
- Use a quantized version (Q4 instead of full precision)
- Reduce context length
- Close other applications
- Add more RAM/VRAM

### Slow Inference

**Causes:**

- Running on CPU instead of GPU
- Insufficient VRAM (layers spilling to system RAM)
- Large context window

**Solutions:**

- Verify the GPU is being used (`ollama ps` shows the CPU/GPU split)
- Use a smaller model
- Reduce context length
- Use a quantized model
### GPU Not Detected

**NVIDIA:**

```shell
# Check that the NVIDIA container runtime works
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Verify in Ollama
docker logs ollama | grep -i gpu
```

**AMD:**

```shell
# Check ROCm device access
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/rocm-terminal rocm-smi

# Verify in Ollama
docker logs ollama | grep -i rocm
```
## Best Practices

- **Start Small:** Begin with 7B models, scale up as needed
- **Use Quantization:** Q4/Q5 models offer a great quality/performance balance
- **Monitor Resources:** Watch RAM/VRAM usage to avoid OOM errors
- **Cache Models:** Keep frequently used models downloaded
- **Batch Requests:** Send multiple prompts together when possible
- **Set Timeouts:** Implement timeouts for long-running requests
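The timeout practice can be sketched with only the standard library: a hard per-request timeout plus exponential backoff on failure. The model name and retry/timeout values are illustrative defaults, not recommendations from the Ollama docs:

```python
import json
import time
import urllib.error
import urllib.request

def backoff_delays(retries, base=1.0):
    """Exponential backoff schedule: 1s, 2s, 4s, ..."""
    return [base * (2 ** i) for i in range(retries)]

def generate_with_retry(prompt, retries=3, timeout=120):
    """Call /api/generate with a hard timeout, retrying with
    exponential backoff on timeouts and connection errors."""
    body = json.dumps({"model": "llama3.2", "prompt": prompt,
                       "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate", data=body,
        headers={"Content-Type": "application/json"})
    for delay in backoff_delays(retries):
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return json.loads(resp.read())["response"]
        except (urllib.error.URLError, TimeoutError):
            time.sleep(delay)
    raise RuntimeError("Ollama did not respond after retries")

print(backoff_delays(3))  # -> [1.0, 2.0, 4.0]
```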
## Model Selection Guide

- **For Chat/Assistance:** Llama 3.2 3B or Mistral 7B
- **For Coding:** CodeLlama 13B or DeepSeek-Coder 6.7B
- **For Low Resources:** Phi-3 3.8B or Qwen 2.5 3B
- **For Best Quality:** Llama 3.1 70B or Qwen 2.5 72B (requires a powerful GPU)
- **For Multilingual:** Qwen 2.5 or Mistral
## Privacy Considerations

- **Completely Local:** No data leaves your machine
- **No Telemetry:** Ollama doesn't phone home
- **Audit Models:** Open-weight models can be inspected
- **Control Access:** Use a firewall to restrict API access
- **Secure Storage:** Models are stored locally on your disk
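One simple way to restrict API access with the Docker setup above: bind the published port to loopback so only processes on the host can reach it. A sketch of the relevant compose fragment:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      # Bind to loopback only: the API is reachable from this host
      # but not from other machines on the network
      - "127.0.0.1:11434:11434"
```

Drop the `127.0.0.1:` prefix (or add firewall rules) only if other machines legitimately need the API.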
**Related Topics:**

- Open WebUI - Web interface for Ollama
- Self-Hosting a Home Server - Complete homelab guide
- Docker - Container platform