Ollama

Ollama makes it easy to run large language models (LLMs) locally on your own hardware. It provides a simple API for running models like Llama, Mistral, and others without sending your data to external services.

Why Run LLMs Locally?

Privacy: Your conversations never leave your machine
Cost: No API fees or usage limits
Customization: Fine-tune models for your specific needs
Offline: Works without internet connection
Control: Choose exactly which models to run

System Requirements

Minimum Specs

As a rule of thumb: at least 8 GB of RAM for 7B models, 16 GB for 13B models, and 32 GB for 33B models. Plan for several GB of disk space per model as well.

GPU Acceleration

NVIDIA: Best support, use CUDA

AMD: Use ROCm (Linux only)

Apple Silicon: Native Metal support (use the native macOS app; Docker containers cannot access the Apple GPU)

Installation with Docker

Basic Setup (CPU Only)

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    restart: unless-stopped

volumes:
  ollama-data:

With NVIDIA GPU

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
    restart: unless-stopped

volumes:
  ollama-data:

With AMD GPU (ROCm)

services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0  # Adjust for your GPU
    restart: unless-stopped

volumes:
  ollama-data:

Start the service:

docker compose up -d

Using Ollama

Pull a Model

# Pull a model
docker exec ollama ollama pull llama3.2

# Pull a specific size (Llama 3.2 ships in 1B and 3B)
docker exec ollama ollama pull llama3.2:1b
docker exec ollama ollama pull llama3.2:3b

List Models

docker exec ollama ollama list

Run a Model

# Interactive chat
docker exec -it ollama ollama run llama3.2

# Single prompt
docker exec ollama ollama run llama3.2 "Explain quantum computing"

Remove a Model

docker exec ollama ollama rm llama3.2

Popular Models

General Purpose

Llama 3.2 (Meta): Strong general-purpose chat model in 1B and 3B sizes (larger sizes ship under Llama 3.1)

Mistral (Mistral AI): Fast, capable 7B model with a permissive license

Qwen 2.5 (Alibaba): Strong multilingual performance, with sizes from 0.5B up to 72B

Specialized Models

CodeLlama (Meta): Tuned for code generation and completion

Phi-3 (Microsoft): Small (3.8B) but capable; a good fit for low-resource machines

DeepSeek-Coder (DeepSeek): Code-focused model with strong results for its size

Model Sizes Explained

1B-3B: Fast, runs on CPU, good for simple tasks
7B: Sweet spot for most users, good quality/speed balance
13B: Better quality, needs more resources
30B+: Best quality, requires powerful GPU
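The RAM a model needs can be ballparked from its parameter count and quantization width. A rough sketch; the 20% overhead factor for the KV cache and runtime buffers is our assumption, not an Ollama figure:

```python
def estimate_model_memory_gb(params_billions: float, bits_per_weight: int = 4,
                             overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters * bytes per weight, plus a
    guessed overhead factor for the KV cache and runtime buffers."""
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8)
    return round(bytes_total * overhead / 1e9, 1)

# A 7B model at 4-bit quantization needs roughly 4 GB;
# the same model at fp16 needs roughly 17 GB.
print(estimate_model_memory_gb(7, bits_per_weight=4))   # 4.2
print(estimate_model_memory_gb(7, bits_per_weight=16))  # 16.8
```

This is why 7B quantized models run comfortably on 8 GB machines while fp16 variants do not.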

API Usage

Ollama exposes its own REST API on port 11434, plus an OpenAI-compatible endpoint under /v1 for use with OpenAI client libraries. The examples below use the native API:

Generate Completion

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Chat Completion

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": false
}'

With Streaming

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Tell me a story",
  "stream": true
}'
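A streaming response arrives as newline-delimited JSON, one object per chunk, each carrying a "response" fragment and a final object with "done": true. A sketch of reassembling the full text; the sample chunks below are illustrative, not real server output:

```python
import json

def assemble_stream(lines):
    """Concatenate the 'response' fragments from an NDJSON stream."""
    text = []
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Illustrative chunks in the shape Ollama streams:
sample = [
    '{"response": "Once", "done": false}',
    '{"response": " upon a time", "done": false}',
    '{"response": "", "done": true}',
]
print(assemble_stream(sample))  # Once upon a time
```

Against a live server you would iterate over requests.post(..., stream=True).iter_lines() instead of a list.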

Integration with Applications

Open WebUI

See Open WebUI for a ChatGPT-like interface.

Python

import requests

def chat(prompt):
    response = requests.post('http://localhost:11434/api/generate', 
        json={
            'model': 'llama3.2',
            'prompt': prompt,
            'stream': False
        })
    return response.json()['response']

print(chat("What is the capital of France?"))

JavaScript

async function chat(prompt) {
    const response = await fetch('http://localhost:11434/api/generate', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            model: 'llama3.2',
            prompt: prompt,
            stream: false
        })
    });
    const data = await response.json();
    return data.response;
}

chat("What is the capital of France?").then(console.log);

Performance Optimization

Model Quantization

Quantized models use less memory with minimal quality loss:

Example: llama3.2:3b-instruct-q4_K_M vs llama3.2:3b-instruct-fp16

Context Length

Longer context means more memory. There is no command-line flag for this; set the num_ctx option interactively or per API request (the default is 2048 tokens in older releases, 4096 in recent ones):

# Interactive session
ollama run llama3.2
>>> /set parameter num_ctx 4096

# Per API request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "options": {"num_ctx": 4096}
}'

GPU Layers

Control how many layers run on the GPU with the num_gpu option; by default Ollama offloads as many layers as fit in VRAM:

# Partial GPU offload via the API (useful with limited VRAM)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "options": {"num_gpu": 20}
}'
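Context size and GPU offload can also be persisted by building a derived model from a Modelfile (the model name llama3.2-4k is just an example):

```
# Modelfile
FROM llama3.2
PARAMETER num_ctx 4096
PARAMETER num_gpu 20
```

Build it with docker exec ollama ollama create llama3.2-4k -f /path/to/Modelfile, after making the file visible inside the container (e.g. with docker cp or a bind mount).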

Concurrent Requests

Ollama serves multiple requests in parallel and can keep several models loaded at once. Tune this with environment variables in the compose file:

environment:
  - OLLAMA_NUM_PARALLEL=4
  - OLLAMA_MAX_LOADED_MODELS=2
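On the client side, parallel prompts can be fanned out with a thread pool. A sketch where the worker is a stub so the pattern is clear; swap in the real HTTP call from the Python example above:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(prompts, ask, workers=4):
    """Run ask(prompt) for each prompt in parallel; preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ask, prompts))

# Stub worker; replace with a real call to /api/generate.
results = fan_out(["a", "b", "c"], lambda p: p.upper())
print(results)  # ['A', 'B', 'C']
```

Keep the pool size at or below OLLAMA_NUM_PARALLEL, or the extra requests will simply queue on the server.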

Monitoring

Check GPU Usage

NVIDIA:

nvidia-smi

AMD:

rocm-smi

Check Model Loading

docker logs ollama

API Health Check

curl http://localhost:11434/api/tags
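In scripts, the same check can be wrapped in a helper that returns a boolean instead of raising on connection errors (the function name is ours, not part of any client library):

```python
from urllib.request import urlopen
from urllib.error import URLError

def ollama_is_up(base_url="http://localhost:11434", timeout=2.0):
    """True if the Ollama API answers /api/tags, False otherwise."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False

print(ollama_is_up())  # True if the container is running, False otherwise
```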

Troubleshooting

Out of Memory

Symptoms: Model fails to load, the container restarts, or generation crashes partway through

Solutions:

Switch to a smaller model (e.g. 3B instead of 7B)
Use a more aggressive quantization (q4 instead of q8 or fp16)
Reduce the context length (num_ctx)
Close other applications competing for RAM/VRAM

Slow Inference

Causes:

Model running on CPU instead of GPU
Model larger than available VRAM, forcing partial CPU offload
System memory swapping to disk

Solutions:

Verify the GPU is detected (see below)
Use a smaller or more heavily quantized model
Reduce context length so the model fits fully in VRAM

GPU Not Detected

NVIDIA:

# Check NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Verify in Ollama
docker logs ollama | grep -i gpu

AMD:

# Check ROCm
docker run --rm --device=/dev/kfd --device=/dev/dri rocm/rocm-terminal rocm-smi

# Verify in Ollama
docker logs ollama | grep -i rocm

Best Practices

Start Small: Begin with 7B models, scale up as needed
Use Quantization: Q4/Q5 models offer great quality/performance balance
Monitor Resources: Watch RAM/VRAM usage to avoid OOM errors
Cache Models: Keep frequently used models downloaded
Keep Models Warm: Use the keep_alive option so a model stays loaded between requests
Set Timeouts: Implement timeouts for long-running requests
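The timeout advice above, sketched with the standard library; the 60-second default is arbitrary, since generation time scales with model size and response length:

```python
import json
from urllib.request import urlopen, Request
from urllib.error import URLError

def ask_ollama(prompt, model="llama3.2",
               base_url="http://localhost:11434", timeout=60):
    """Call /api/generate with a hard timeout; returns None on failure."""
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = Request(f"{base_url}/api/generate", data=body,
                  headers={"Content-Type": "application/json"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())["response"]
    except (URLError, OSError, TimeoutError):
        return None
```

Returning None on failure lets callers retry or fall back instead of hanging on a stalled generation.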

Model Selection Guide

For Chat/Assistance: Llama 3.1 8B or Mistral 7B
For Coding: CodeLlama 13B or DeepSeek-Coder 6.7B
For Low Resources: Phi-3 3.8B or Qwen 2.5 3B
For Best Quality: Llama 3.1 70B or Qwen 2.5 72B (requires powerful GPU)
For Multilingual: Qwen 2.5 or Mistral

Privacy Considerations

Completely Local: No data leaves your machine
No Telemetry: Ollama doesn't phone home
Audit Models: Open-source models can be inspected
Control Access: Use firewall to restrict API access
Secure Storage: Models stored locally on your disk
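The access-control point is easiest to enforce at the Docker level: bind the published port to the loopback interface in the compose file so the API is reachable only from the host itself.

```yaml
ports:
  - "127.0.0.1:11434:11434"
```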


Related Topics: