Ollama

Ollama makes it easy to run large language models (LLMs) locally on your own hardware. It provides a simple API for running models like Llama, Mistral, and others without sending your data to external services.

Why Run LLMs Locally?

Privacy: Your conversations never leave your machine
Cost: No API fees or usage limits
Customization: Fine-tune models for your specific needs
Offline: Works without internet connection
Control: Choose exactly which models to run

System Requirements

Minimum Specs

As a rule of thumb: at least 8 GB of RAM for 7B models, 16 GB for 13B models, and 32 GB for 33B models. Plan for several GB of disk space per model as well.

GPU Acceleration

NVIDIA: Best support, use CUDA

AMD: Use ROCm (Linux only)

Apple Silicon: Native Metal support (use the native macOS app; Docker containers cannot access the Apple GPU)

Installation with Docker

Basic Setup (CPU Only)

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-data:

With NVIDIA GPU

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    restart: unless-stopped

volumes:
  ollama-data:

With AMD GPU (ROCm)

services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0  # Adjust for your GPU
    restart: unless-stopped

volumes:
  ollama-data:

Start the service:

docker compose up -d

Using Ollama

Pull a Model

# Pull a model
docker exec ollama ollama pull llama3.2

# Pull a specific size (Llama 3.2 ships in 1B and 3B)
docker exec ollama ollama pull llama3.2:1b
docker exec ollama ollama pull llama3.2:3b

List Models

docker exec ollama ollama list

Run a Model

# Interactive chat
docker exec -it ollama ollama run llama3.2

# Single prompt
docker exec ollama ollama run llama3.2 "Explain quantum computing"

Remove a Model

docker exec ollama ollama rm llama3.2

Popular Models

General Purpose

Llama 3.2 (Meta): Strong general-purpose chat model in 1B and 3B sizes (larger sizes ship under Llama 3.1)

Mistral (Mistral AI): Fast, capable 7B model with a permissive license

Qwen 2.5 (Alibaba): Strong multilingual performance, with sizes from 0.5B up to 72B

Specialized Models

CodeLlama (Meta): Tuned for code generation and completion

Phi-3 (Microsoft): Small (3.8B) but capable; a good fit for low-resource machines

DeepSeek-Coder (DeepSeek): Code-focused model with strong results for its size

Model Sizes Explained

1B-3B: Fast, runs on CPU, good for simple tasks
7B: Sweet spot for most users, good quality/speed balance
13B: Better quality, needs more resources
30B+: Best quality, requires powerful GPU
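The RAM a model needs can be ballparked from its parameter count and quantization width. A rough sketch; the 20% overhead factor for the KV cache and runtime buffers is our assumption, not an Ollama figure:

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: int = 4,
                             overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters * bytes per weight, plus a
    guessed overhead factor for the KV cache and runtime buffers."""
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8)
    return round(bytes_total * overhead / 1e9, 1)

# A 7B model at 4-bit quantization needs roughly 4 GB;
# the same model at fp16 needs roughly 17 GB.
print(estimate_model_memory_gb(7, bits_per_weight=4))   # 4.2
print(estimate_model_memory_gb(7, bits_per_weight=16))  # 16.8
```

This is why 7B quantized models run comfortably on 8 GB machines while fp16 variants do not.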

API Usage

Ollama exposes its own REST API on port 11434, plus an OpenAI-compatible endpoint under /v1 for use with OpenAI client libraries. The examples below use the native API:

Generate Completion

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Chat Completion

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false
}'

With Streaming

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a story",
  "stream": true
}'
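A streaming response arrives as newline-delimited JSON, one object per chunk, each carrying a "response" fragment and a final object with "done": true. A sketch of reassembling the full text; the sample chunks below are illustrative, not real server output:

```python
import json

def assemble_stream(lines):
    """Concatenate the 'response' fragments from an NDJSON stream."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Illustrative chunks in the shape Ollama streams:
sample = [
    '{"response": "Once", "done": false}',
    '{"response": " upon a time", "done": false}',
    '{"response": "", "done": true}',
]
print(assemble_stream(sample))  # Once upon a time
```

Against a live server you would iterate over requests.post(..., stream=True).iter_lines() instead of a list.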

Integration with Applications

Open WebUI

See Open WebUI for a ChatGPT-like interface.

Python

import requests

def chat(prompt):
    response = requests.post('http://localhost:11434/api/generate', 
        json={
            'model': 'llama3.2',
            'prompt': prompt,
            'stream': False
        })
    return response.json()['response']

print(chat("What is the capital of France?"))

JavaScript

async function chat(prompt) {
    const response = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            model: 'llama3.2',
            prompt: prompt,
            stream: false
        })
    });
    const data = await response.json();
    return data.response;
}

chat("What is the capital of France?").then(console.log);

Performance Optimization

Model Quantization

Quantized models use less memory with minimal quality loss:

Example: llama3.2:3b-instruct-q4_K_M vs llama3.2:3b-instruct-fp16

Context Length

Longer context means more memory. There is no command-line flag for this; set the num_ctx option interactively or per API request (the default is 2048 tokens in older releases, 4096 in recent ones):

# Interactive session
ollama run llama3.2
>>> /set parameter num_ctx 4096

# Per API request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "options": {"num_ctx": 4096}
}'

GPU Layers

Control how many layers run on the GPU with the num_gpu option; by default Ollama offloads as many layers as fit in VRAM:

# Partial GPU offload via the API (useful with limited VRAM)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "options": {"num_gpu": 20}
}'
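Context size and GPU offload can also be persisted by building a derived model from a Modelfile (the model name llama3.2-4k is just an example):

```
# Modelfile
FROM llama3.2
PARAMETER num_ctx 4096
PARAMETER num_gpu 20
```

Build it with docker exec ollama ollama create llama3.2-4k -f /path/to/Modelfile, after making the file visible inside the container (e.g. with docker cp or a bind mount).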

Concurrent Requests

Ollama serves multiple requests in parallel and can keep several models loaded at once. Tune this with environment variables in the compose file:

environment:
  - OLLAMA_NUM_PARALLEL=4
  - OLLAMA_MAX_LOADED_MODELS=2
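On the client side, parallel prompts can be fanned out with a thread pool. A sketch where the worker is a stub so the pattern is clear; swap in the real HTTP call from the Python example above:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompts, ask, workers=4):
    """Run ask(prompt) for each prompt in parallel; preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))

# Stub worker; replace with a real call to /api/generate.
results = fan_out(["a", "b", "c"], lambda p: p.upper())
print(results)  # ['A', 'B', 'C']
```

Keep the pool size at or below OLLAMA_NUM_PARALLEL, or the extra requests will simply queue on the server.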

Monitoring

Check GPU Usage

NVIDIA:

nvidia-smi

AMD:

rocm-smi

Check Model Loading

docker logs ollama

API Health Check

curl http://localhost:11434/api/tags
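In scripts, the same check can be wrapped in a helper that returns a boolean instead of raising on connection errors (the function name is ours, not part of any client library):

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """True if the Ollama API answers /api/tags, False otherwise."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

print(ollama_is_up())  # True if the container is running, False otherwise
```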

Troubleshooting

Out of Memory

Symptoms: Model fails to load, the container restarts, or generation crashes partway through

Solutions:

Switch to a smaller model (e.g. 3B instead of 7B)
Use a more aggressive quantization (q4 instead of q8 or fp16)
Reduce the context length (num_ctx)
Close other applications competing for RAM/VRAM

Slow Inference

Causes:

Model running on CPU instead of GPU
Model larger than available VRAM, forcing partial CPU offload
System memory swapping to disk

Solutions:

Verify the GPU is detected (see below)
Use a smaller or more heavily quantized model
Reduce context length so the model fits fully in VRAM

GPU Not Detected

NVIDIA:

# Check NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Verify in Ollama
docker logs ollama | grep -i gpu

AMD:

# Check ROCm
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/rocm-terminal rocm-smi

# Verify in Ollama
docker logs ollama | grep -i rocm

Best Practices

Start Small: Begin with 7B models, scale up as needed
Use Quantization: Q4/Q5 models offer great quality/performance balance
Monitor Resources: Watch RAM/VRAM usage to avoid OOM errors
Cache Models: Keep frequently used models downloaded
Keep Models Warm: Use the keep_alive option so a model stays loaded between requests
Set Timeouts: Implement timeouts for long-running requests
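The timeout advice above, sketched with the standard library; the 60-second default is arbitrary, since generation time scales with model size and response length:

```python
import json
from urllib.request import urlopen, Request
from urllib.error import URLError

def ask_ollama(prompt, model="llama3.2",
               base_url="http://localhost:11434", timeout=60):
    """Call /api/generate with a hard timeout; returns None on failure."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = Request(f"{base_url}/api/generate", data=body,
                  headers={"Content-Type": "application/json"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())["response"]
    except (URLError, OSError, TimeoutError):
        return None
```

Returning None on failure lets callers retry or fall back instead of hanging on a stalled generation.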

Model Selection Guide

For Chat/Assistance: Llama 3.1 8B or Mistral 7B
For Coding: CodeLlama 13B or DeepSeek-Coder 6.7B
For Low Resources: Phi-3 3.8B or Qwen 2.5 3B
For Best Quality: Llama 3.1 70B or Qwen 2.5 72B (requires powerful GPU)
For Multilingual: Qwen 2.5 or Mistral

Privacy Considerations

Completely Local: No data leaves your machine
No Telemetry: Ollama doesn't phone home
Audit Models: Open-source models can be inspected
Control Access: Use firewall to restrict API access
Secure Storage: Models stored locally on your disk
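The access-control point is easiest to enforce at the Docker level: bind the published port to the loopback interface in the compose file so the API is reachable only from the host itself.

```yaml
ports:
  - "127.0.0.1:11434:11434"
```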


Related Topics: