
Ollama Complete Guide 2026: Run LLMs Locally — Installation, Models, API & Best Practices

18 min read · by DevToolBox

Ollama is an open-source tool that lets you run large language models (LLMs) locally on your own machine. Whether you care about data privacy, want to eliminate API costs, or need offline AI capabilities, Ollama makes it simple to download, run, and manage models like Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2 with a single command. This guide covers everything from installation to production deployment.

TL;DR

Ollama lets you run LLMs locally with one command. Install it on macOS/Linux/Windows, run "ollama run llama3" to start chatting, use the REST API at localhost:11434 for app integration, and create custom models with Modelfiles. It supports GPU acceleration via CUDA, Metal, and ROCm, and works great with LangChain, LlamaIndex, and Open WebUI.

Key Takeaways
  • Ollama supports 100+ models including Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2
  • Installation is a single command on macOS and Linux; Docker available for all platforms
  • REST API at localhost:11434 provides /api/generate, /api/chat, and /api/embeddings endpoints
  • Custom Modelfiles let you tune parameters, set system prompts, and create specialized models
  • GPU acceleration via CUDA (NVIDIA), Metal (Apple Silicon), and ROCm (AMD) dramatically improves performance
  • 7B models need 8GB RAM, 13B models need 16GB, and 70B models need 64GB minimum

What Is Ollama and Why Run LLMs Locally?

Ollama is a lightweight, open-source framework for running large language models on your local machine. It wraps llama.cpp with an easy-to-use CLI and REST API, handling model downloads, quantization, GPU acceleration, and memory management automatically.

Running LLMs locally offers three major advantages over cloud APIs. First, complete data privacy — your prompts and outputs never leave your machine, making it safe for proprietary code, legal documents, and medical records. Second, zero API costs — run unlimited queries without per-token billing. Third, low-latency inference — no network round trips means faster responses, especially for interactive use cases.

Ollama has become the de facto standard for local LLM inference in 2026, with over 200,000 GitHub stars and integration with every major AI framework. It runs on macOS, Linux, and Windows, and supports NVIDIA, Apple Silicon, and AMD GPUs out of the box.

Installation Guide

macOS (Intel & Apple Silicon)

Ollama has first-class support for macOS with automatic Metal GPU acceleration on Apple Silicon. The install takes under a minute.

# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg

# Option 2: Install via Homebrew
brew install ollama

# Start the Ollama service
ollama serve

# In a new terminal, run your first model
ollama run llama3

# Verify installation
ollama --version
# ollama version 0.6.2

Linux

On Linux, the official install script handles everything including NVIDIA CUDA driver detection. Ollama runs as a systemd service for automatic startup.

# One-line install (detects NVIDIA CUDA automatically)
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama as a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama

# Check service status
sudo systemctl status ollama

# Run a model
ollama run llama3

# View logs for debugging
journalctl -u ollama -f

Windows

Ollama now has a native Windows installer with GPU support for NVIDIA and AMD cards. WSL2 is no longer required.

# Download the Windows installer from ollama.com/download
# Run OllamaSetup.exe — it installs as a Windows service

# After installation, open PowerShell or Command Prompt
ollama run llama3

# The API is available at http://localhost:11434
# Ollama runs in the system tray on Windows

Docker (All Platforms)

Docker is the most portable option and works on macOS, Linux, and Windows. It supports NVIDIA GPU passthrough on Linux with the NVIDIA Container Toolkit.

# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# With NVIDIA GPU support (requires nvidia-container-toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

# Run a model inside the container
docker exec -it ollama ollama run llama3

# Pull a model without interactive session
docker exec ollama ollama pull mistral

Running Models

Ollama provides a model library with 100+ pre-built models. The ollama run command downloads a model (if needed) and starts an interactive chat session.

Popular Models

# General purpose — Meta Llama 3 (8B, fast and capable)
ollama run llama3

# Mistral 7B — excellent reasoning, multilingual
ollama run mistral

# Code Llama — optimized for code generation
ollama run codellama

# Microsoft Phi-3 — small but powerful (3.8B)
ollama run phi3

# Google Gemma 2 — strong general performance (9B)
ollama run gemma2

# DeepSeek Coder V2 — top coding model
ollama run deepseek-coder-v2

# Llama 3 70B — near GPT-4 quality (needs 64GB RAM)
ollama run llama3:70b

# Multimodal — LLaVA (vision + text)
ollama run llava
# Then provide an image: /path/to/image.jpg What is in this image?

Model Performance Comparison

Most models come in multiple sizes. Smaller models are faster but less capable. Larger models produce better output but require more resources. The table below compares performance across popular models.

| Model | Parameters | Size (Q4) | Speed (tok/s)* | Quality | Best For |
|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 2.3 GB | ~65 | Good | Edge, mobile, quick tasks |
| Mistral 7B | 7B | 4.1 GB | ~48 | Very Good | General chat, multilingual |
| Llama 3 8B | 8B | 4.7 GB | ~45 | Very Good | All-around, reasoning |
| Gemma 2 9B | 9B | 5.4 GB | ~38 | Excellent | Instruction following |
| Code Llama 13B | 13B | 7.4 GB | ~28 | Excellent | Code generation, review |
| DeepSeek Coder | 33B | 19 GB | ~14 | Outstanding | Advanced coding tasks |
| Llama 3 70B | 70B | 39 GB | ~8 | Outstanding | Complex reasoning, analysis |

* Approximate tokens/second on Apple M3 Max 64GB with Metal acceleration. Actual speed varies by hardware and quantization.
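Those throughput figures translate directly into wait times. A quick back-of-envelope helper (illustrative Python using the approximate numbers above; prompt-processing time is ignored, so treat results as lower bounds):

```python
def est_response_seconds(tokens: int, tok_per_s: float) -> float:
    """Rough wall-clock time to generate `tokens` at a given speed."""
    return tokens / tok_per_s

# Using the table's approximate figures for a 500-token answer:
print(f"Llama 3 8B:  ~{est_response_seconds(500, 45):.0f}s")   # ~11s
print(f"Llama 3 70B: ~{est_response_seconds(500, 8):.0f}s")    # ~62s
```

In other words, at ~45 tok/s a 500-token answer arrives in about 11 seconds; at ~8 tok/s the same answer takes over a minute.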

Model Management

Ollama provides commands to list, download, remove, and inspect your local models. Efficient model management helps you save disk space and keep your environment organized.

# List all downloaded models
ollama list
# NAME              ID            SIZE    MODIFIED
# llama3:latest     a6990ed6be41  4.7 GB  2 hours ago
# mistral:latest    61e88e884507  4.1 GB  3 days ago

# Download a model without running it
ollama pull codellama:13b

# Pull a specific quantization variant
ollama pull llama3:8b-instruct-q5_K_M

# Remove a model to free disk space
ollama rm mistral

# Show model details (parameters, template, license)
ollama show llama3
ollama show llama3 --modelfile  # view the Modelfile

# Copy a model (useful before customizing)
ollama cp llama3 my-llama3

# Create a custom model from a Modelfile
ollama create my-assistant -f ./Modelfile

# List currently running models and their resource usage
ollama ps
# NAME      ID       SIZE     PROCESSOR  UNTIL
# llama3    a6990e   6.7 GB   100% GPU   4 minutes

Ollama REST API

Ollama exposes a REST API on localhost:11434 that you can use to integrate LLMs into any application. The API is compatible with the OpenAI chat completions format, making it a drop-in replacement for many existing tools. The generation endpoints stream responses by default; set "stream": false to receive a single JSON object instead.

/api/generate — Text Generation

# Simple text generation
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Docker in 3 sentences",
  "stream": false
}'

# With parameters for precise control
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a Python function to merge two sorted lists",
  "stream": false,
  "options": {
    "temperature": 0.2,
    "top_p": 0.9,
    "num_predict": 500
  }
}'

# Response structure
# {
#   "model": "llama3",
#   "response": "Here is a Python function...",
#   "done": true,
#   "total_duration": 1234567890,
#   "eval_count": 142,
#   "eval_duration": 987654321
# }
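The duration fields in the response are reported in nanoseconds, which means you can compute your real-world generation speed from eval_count and eval_duration. A small Python sketch:

```python
def tokens_per_second(resp: dict) -> float:
    """Generation speed from a /api/generate (or /api/chat) response.

    eval_count is the number of generated tokens; eval_duration is
    the time spent generating them, in nanoseconds.
    """
    return resp["eval_count"] / resp["eval_duration"] * 1e9

sample = {"eval_count": 142, "eval_duration": 987_654_321}
print(f"{tokens_per_second(sample):.1f} tok/s")  # 143.8 tok/s
```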

/api/chat — Conversational Chat

# Multi-turn conversation with system prompt
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "system", "content": "You are a senior DevOps engineer." },
    { "role": "user", "content": "How do I set up a CI/CD pipeline?" },
    { "role": "assistant", "content": "A CI/CD pipeline typically..." },
    { "role": "user", "content": "Show me a GitHub Actions example." }
  ],
  "stream": false
}'

// Node.js / TypeScript streaming client
async function chat(prompt: string): Promise<void> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      messages: [{ role: "user", content: prompt }],
    }),
  });

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // A chunk may contain several newline-delimited JSON objects, or
    // end mid-object; buffer and parse only complete lines.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // keep any partial line for the next read
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      if (chunk.message?.content) process.stdout.write(chunk.message.content);
    }
  }
}

chat("Explain async/await in TypeScript");

/api/embeddings — Vector Embeddings

Generate vector embeddings for text, useful for RAG (Retrieval-Augmented Generation), semantic search, and document similarity. Embeddings convert text into numerical vectors that capture semantic meaning.

# Generate embeddings for text
curl http://localhost:11434/api/embeddings -d '{
  "model": "llama3",
  "prompt": "Ollama is a tool for running LLMs locally"
}'
# Response: { "embedding": [0.123, -0.456, 0.789, ...] }

# Python: semantic search with embeddings
import requests
import numpy as np

def get_embedding(text: str) -> np.ndarray:
    resp = requests.post("http://localhost:11434/api/embeddings",
        json={"model": "llama3", "prompt": text})
    return np.array(resp.json()["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare document similarity
doc_emb = get_embedding("Docker containers isolate applications")
query_emb = get_embedding("How to run apps in isolation?")
score = cosine_similarity(doc_emb, query_emb)
print(f"Similarity: {score:.4f}")  # ~0.85

Custom Modelfile Creation

A Modelfile is like a Dockerfile for LLMs. It defines the base model, parameters, system prompt, and template. This lets you create specialized models for specific use cases like code review, SQL generation, or customer support.

Modelfile Directives

The key directives in a Modelfile are FROM (base model), PARAMETER (inference settings), SYSTEM (system prompt), and TEMPLATE (prompt format). You can also use ADAPTER to apply LoRA weights and LICENSE to include model licensing information.

# Modelfile for a code review assistant
FROM codellama:13b

# Set inference parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
PARAMETER repeat_penalty 1.1

# Define the system prompt
SYSTEM """
You are an expert code reviewer. Analyze code for:
- Bugs and potential errors
- Performance issues and optimization opportunities
- Security vulnerabilities (injection, XSS, etc.)
- Code style and best practices
Provide actionable feedback with specific line references.
Rate severity as: Critical, Warning, or Suggestion.
"""

# Custom prompt template (optional)
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """

# Build and run the custom model
ollama create code-reviewer -f ./Modelfile
ollama run code-reviewer

# Another example: a SQL query assistant
# --- sql-helper.Modelfile ---
FROM llama3
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
SYSTEM """
You are a PostgreSQL expert. Generate optimized SQL queries.
Always explain your queries and suggest indexes when beneficial.
Output format: SQL query first, then explanation.
Use CTEs for complex queries. Avoid SELECT *.
"""

# Import a GGUF model from HuggingFace
# --- import.Modelfile ---
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.5
SYSTEM "You are a helpful assistant."

# Apply a LoRA adapter to a base model
# FROM llama3
# ADAPTER ./my-lora-adapter.gguf
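If you maintain many variants (one per team or use case), it can be convenient to render Modelfiles from code. The helper below is a hypothetical sketch covering only the FROM, PARAMETER, and SYSTEM directives shown above; it is not part of Ollama:

```python
def render_modelfile(base: str, params: dict, system: str) -> str:
    """Render Modelfile text from Python values (hypothetical helper)."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    lines += ['SYSTEM """', system.strip(), '"""']
    return "\n".join(lines) + "\n"

modelfile = render_modelfile(
    "llama3",
    {"temperature": 0.1, "num_ctx": 8192},
    "You are a PostgreSQL expert. Generate optimized SQL queries.",
)
print(modelfile)
# Write it to disk, then: ollama create sql-helper -f ./Modelfile
```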

GPU Acceleration

Ollama automatically detects and uses available GPUs. GPU acceleration dramatically reduces inference time — a 7B model generates tokens 5-10x faster on GPU compared to CPU-only. Here is how GPU support works on each platform.

Apple Metal (macOS)

Apple Silicon Macs (M1/M2/M3/M4) automatically use Metal for GPU acceleration. No additional setup is needed. The unified memory architecture means the GPU can access all system RAM, giving Apple Silicon a unique advantage for running larger models.

# Check Metal GPU usage on macOS
ollama ps
# NAME      ID        SIZE     PROCESSOR    UNTIL
# llama3    a6990e    6.7 GB   100% GPU     4 minutes from now

# Apple Silicon performance reference (M3 Max 64GB):
# Llama 3 8B:   ~45 tokens/sec
# Llama 3 70B:  ~8 tokens/sec
# Phi-3 Mini:   ~65 tokens/sec

# Monitor memory pressure in Activity Monitor
# or use: memory_pressure

NVIDIA CUDA (Linux/Windows)

NVIDIA GPUs require CUDA drivers (version 11.7 or higher). Ollama automatically detects CUDA and offloads model layers to the GPU. For GPUs with limited VRAM, Ollama can split the model between GPU and CPU memory.

# Verify NVIDIA GPU detection
nvidia-smi

# Check Ollama GPU usage
ollama ps
# NAME      ID        SIZE    PROCESSOR
# llama3    a6990e    6.7 GB  100% GPU

# Partial GPU offload (when VRAM is limited)
# Offload only 20 layers to GPU, rest stays on CPU
OLLAMA_NUM_GPU=20 ollama run llama3:70b

# Force CPU-only mode
OLLAMA_NUM_GPU=0 ollama run llama3

# Monitor GPU memory during inference
watch -n 1 nvidia-smi
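How many layers should you offload when VRAM is tight? A back-of-envelope estimate divides the quantized model size by its layer count (Llama-style 70B models have 80 transformer layers) and reserves headroom for the KV cache and runtime overhead. Real per-layer sizes vary, so treat the result as a starting point to tune from:

```python
def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Back-of-envelope: how many model layers fit in VRAM?

    Assumes layers are roughly equal in size and reserves some VRAM
    for the KV cache and runtime overhead. Real sizes vary per model.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(usable_gb / per_layer_gb))

# Llama 3 70B at Q4 (~39 GB, 80 layers) on a 24 GB card:
n = layers_that_fit(39, 80, 24)
print(f"Offload about {n} layers, e.g. OLLAMA_NUM_GPU={n}")
```

On a 24 GB card this suggests offloading roughly 46 of the 80 layers and leaving the rest on CPU.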

AMD ROCm (Linux)

AMD GPUs are supported via ROCm 5.7+ on Linux. Supported cards include RX 7900 XTX, RX 7900 XT, RX 6900 XT, RX 6800 XT, and Radeon Pro W6800. Performance is comparable to NVIDIA for most workloads.

# Install ROCm for AMD GPUs (Ubuntu)
# Follow: https://rocm.docs.amd.com/en/latest/deploy/linux/

# Run Ollama with ROCm Docker image
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm

Memory Requirements

The amount of RAM (or VRAM) needed depends on the model size and quantization level. As a rule of thumb, you need roughly 1GB of memory per billion parameters for Q4 quantized models. Here are the practical requirements.

| Model Size | Min RAM | Recommended RAM | GPU VRAM | Use Case |
|---|---|---|---|---|
| 1-3B (Phi-3 Mini, TinyLlama) | 4 GB | 8 GB | 4 GB | Edge devices, quick prototyping |
| 7-8B (Llama 3, Mistral) | 8 GB | 16 GB | 8 GB | General use, coding, chat |
| 13B (Code Llama 13B) | 16 GB | 24 GB | 12 GB | Complex reasoning, code review |
| 33-34B (DeepSeek, Code Llama 34B) | 32 GB | 48 GB | 24 GB | Advanced analysis, long context |
| 70B (Llama 3 70B) | 64 GB | 96 GB | 48 GB | Near GPT-4 quality tasks |

Memory requirements are for Q4_K_M quantization. Higher-precision quantization (Q5, Q8) uses more memory but produces slightly better output. Context window size also adds to memory usage: each 1K tokens of context requires approximately 0.5-1 GB of additional memory for 7B models.
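The rules of thumb above can be collapsed into a rough estimator. This is illustrative only: it applies the ~1 GB per billion parameters (Q4) rule plus the midpoint of the quoted per-1K-token context cost, which was measured for 7B models:

```python
def est_memory_gb(params_b: float, ctx_tokens: int = 2048) -> float:
    """Rough Q4_K_M memory estimate from the rules of thumb above."""
    weights_gb = params_b * 1.0              # ~1 GB per billion params at Q4
    context_gb = (ctx_tokens / 1000) * 0.75  # midpoint of 0.5-1 GB per 1K tokens
    return weights_gb + context_gb

for size_b, ctx in [(7, 4096), (13, 8192), (70, 8192)]:
    print(f"{size_b}B with {ctx} ctx: ~{est_memory_gb(size_b, ctx):.1f} GB")
```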

Environment Variables

Ollama behavior can be customized through environment variables. These are especially useful for server deployments and Docker configurations.

# Key Ollama environment variables

# OLLAMA_HOST — bind address (default: 127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:11434          # listen on all interfaces

# OLLAMA_MODELS — custom model storage directory
OLLAMA_MODELS=/mnt/ssd/ollama-models  # use a fast SSD

# OLLAMA_ORIGINS — allowed CORS origins
OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com"

# OLLAMA_NUM_PARALLEL — concurrent request handling
OLLAMA_NUM_PARALLEL=4              # handle 4 requests at once

# OLLAMA_MAX_LOADED_MODELS — models kept in memory
OLLAMA_MAX_LOADED_MODELS=2         # keep 2 models loaded

# OLLAMA_KEEP_ALIVE — how long models stay loaded
OLLAMA_KEEP_ALIVE=10m              # unload after 10 minutes

# OLLAMA_NUM_GPU — GPU layer count
OLLAMA_NUM_GPU=99                  # all layers on GPU (default)
OLLAMA_NUM_GPU=0                   # CPU only

# Linux systemd: /etc/systemd/system/ollama.service
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_MODELS=/data/models"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama

Integrations

Ollama works seamlessly with popular AI frameworks and tools. Its OpenAI-compatible API means most libraries that work with OpenAI also work with Ollama by changing the base URL.
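To make the base-URL swap concrete, here is a minimal sketch: the request body stays in OpenAI chat-completions format, and only the endpoint (Ollama serves /v1/chat/completions) and the model name change:

```python
def to_ollama(openai_request: dict,
              base_url: str = "http://localhost:11434/v1") -> tuple[str, dict]:
    """Target an OpenAI-format chat request at a local Ollama server.

    The body passes through unchanged; swap the model name to one you
    have pulled (e.g. "llama3") before sending.
    """
    return f"{base_url}/chat/completions", openai_request

url, body = to_ollama({
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello"}],
})
print(url)  # http://localhost:11434/v1/chat/completions
```

Most OpenAI SDKs accept a base_url parameter (plus any placeholder API key), so pointing them at http://localhost:11434/v1 is usually all that is required.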

LangChain

LangChain provides a native Ollama integration for building RAG pipelines, agents, and chains with local models.

# pip install langchain-ollama langchain-chroma langchain-text-splitters
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Basic text generation
llm = OllamaLLM(model="llama3")
response = llm.invoke("Explain Kubernetes in simple terms")
print(response)

# Build a RAG pipeline with local embeddings
embeddings = OllamaEmbeddings(model="llama3")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
docs = splitter.split_documents(my_documents)  # my_documents: your loaded Document list
vectorstore = Chroma.from_documents(docs, embeddings)

# Query the knowledge base
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
results = retriever.invoke("How to deploy with Docker?")

LlamaIndex

LlamaIndex supports Ollama for building knowledge retrieval systems over your own documents.

# pip install llama-index-llms-ollama llama-index-embeddings-ollama
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings

# Configure Ollama as default LLM and embedding model
Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="llama3")

# Load documents and build index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query("What are the main API endpoints?")
print(response)

Open WebUI

Open WebUI provides a ChatGPT-like web interface for Ollama with multi-model support, conversation history, document upload, and web search integration.

# Run Open WebUI with Docker (auto-connects to Ollama)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

# Access the web UI at http://localhost:3000
# Features:
#   - Multi-model chat with model switching
#   - Conversation history and search
#   - Document upload for RAG
#   - Web search integration
#   - User accounts and admin panel
#   - Custom model presets and system prompts

Ollama vs Alternatives

Here is how Ollama compares to other popular tools for running LLMs locally. Each tool has different strengths depending on your use case.

| Feature | Ollama | LM Studio | llama.cpp | GPT4All |
|---|---|---|---|---|
| Ease of Use | Excellent | Excellent | Advanced | Good |
| REST API | Built-in (OpenAI compat) | Built-in | Optional server | Built-in |
| GUI | CLI only* | Full GUI | None | Full GUI |
| Docker Support | Official images | Community | Community | None |
| Model Library | 100+ curated models | HuggingFace browse | Manual GGUF files | Curated list |
| GPU Support | CUDA/Metal/ROCm | CUDA/Metal | CUDA/Metal/ROCm/Vulkan | CUDA/Metal |
| Customization | Modelfile system | UI settings | Full CLI control | Limited |
| Server / Team Use | Native multi-user | Local only | Optional server | Local only |
| License | MIT | Proprietary | MIT | MIT |
| Best For | Developers, DevOps, teams | Beginners, exploration | Power users, custom builds | Desktop users |

* Ollama pairs with Open WebUI for a full graphical experience comparable to LM Studio.

Performance Tuning

Fine-tune inference parameters to balance speed, quality, and resource usage. The right settings depend on your use case — code generation needs low temperature for precision, while creative writing benefits from higher randomness.

Key Parameters

# Temperature controls randomness (0.0 = deterministic, 2.0 = very random)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a haiku about programming",
  "options": {
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 40,
    "num_predict": 200,
    "num_ctx": 4096,
    "repeat_penalty": 1.1,
    "num_gpu": 99,
    "num_thread": 8
  }
}'

# Parameter reference:
# temperature 0.0-0.3 → factual answers, code generation
# temperature 0.4-0.7 → balanced, general conversation
# temperature 0.8-1.5 → creative writing, brainstorming
#
# num_ctx: context window (default 2048, max depends on model)
#   Higher = more context but more memory and slower
#   Llama 3 supports up to 8192 tokens
#
# num_gpu: GPU layer count (99 = all layers, 0 = CPU only)
# num_thread: CPU threads (default = auto-detect)
# top_p: nucleus sampling (0.9 = consider top 90% probability)
# top_k: limits selection to top K tokens (40 is a good default)
# repeat_penalty: penalize repetition (1.0 = off, 1.1 = moderate)
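When calling the API from application code, it helps to capture those ranges as named presets. The dict below is purely a convenience derived from the reference above (not an Ollama feature); pass the result as the "options" field of a request:

```python
# Convenience presets derived from the parameter reference above
# (illustrative values, not an Ollama API).
PRESETS = {
    "code":     {"temperature": 0.2, "top_p": 0.9,  "repeat_penalty": 1.1},
    "chat":     {"temperature": 0.6, "top_p": 0.9,  "repeat_penalty": 1.1},
    "creative": {"temperature": 1.0, "top_p": 0.95, "repeat_penalty": 1.05},
}

def options_for(task: str, num_ctx: int = 4096) -> dict:
    """Build an "options" dict for /api/generate or /api/chat."""
    return {**PRESETS[task], "num_ctx": num_ctx}

print(options_for("code"))
```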

Running Ollama as a Team Server

Ollama can serve multiple users on a network by binding to all interfaces instead of just localhost. This turns a single powerful machine into a shared AI inference server for your entire team.

# Bind Ollama to all interfaces for network access
OLLAMA_HOST=0.0.0.0 ollama serve

# Linux: make it permanent via systemd
sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0"
# Add: Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama

# Team members connect from their machines
curl http://your-server-ip:11434/api/chat -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Hello from remote"}],
  "stream": false
}'

# Nginx reverse proxy with SSL (recommended)
# server {
#     listen 443 ssl;
#     server_name ollama.yourcompany.com;
#     ssl_certificate /etc/letsencrypt/live/ollama.yourcompany.com/fullchain.pem;
#     ssl_certificate_key /etc/letsencrypt/live/ollama.yourcompany.com/privkey.pem;
#     location / {
#         proxy_pass http://localhost:11434;
#         proxy_set_header Host $host;
#         proxy_buffering off;
#         proxy_read_timeout 300s;
#     }
# }

Docker Deployment for Production

For production environments, use Docker Compose to run Ollama with Open WebUI and proper resource management. This setup provides a self-hosted ChatGPT alternative for your organization.

# docker-compose.yml for production
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=15m
      - OLLAMA_NUM_PARALLEL=4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
  webui_data:

# Deploy and pre-pull models
docker compose up -d

# Pull models for the team
docker exec ollama ollama pull llama3
docker exec ollama ollama pull codellama:13b
docker exec ollama ollama pull mistral

# Verify everything is running
docker compose ps
curl http://localhost:11434/api/tags  # list available models

Troubleshooting Common Issues

Here are solutions to the most common problems when running Ollama.

# Problem: "Error: model requires more system memory"
# Solution: Use a smaller model or quantization
ollama run llama3:8b-instruct-q4_0  # smallest variant

# Problem: "connection refused" on localhost:11434
# Solution: Start the Ollama service
ollama serve              # macOS/Linux (foreground)
sudo systemctl start ollama  # Linux (background)

# Problem: Slow generation speed (CPU only)
# Solution: Verify GPU is being used
ollama ps  # check Processor column
# If showing "100% CPU", reinstall GPU drivers

# Problem: Model not found
# Solution: Check available models and pull
ollama list              # see downloaded models
ollama pull llama3       # download if missing

# Problem: CORS errors from web app
# Solution: Set OLLAMA_ORIGINS
OLLAMA_ORIGINS="http://localhost:3000" ollama serve

# Problem: Out of disk space
# Solution: Remove unused models and move storage
ollama rm unused-model
OLLAMA_MODELS=/mnt/large-drive/ollama ollama serve

Best Practices

  • Start with smaller models (7B) for development, scale up for production quality assessment
  • Use quantized models (Q4_K_M) for the best balance of quality and speed
  • Set appropriate context windows — larger contexts use more memory linearly
  • Monitor GPU memory with nvidia-smi or Activity Monitor on macOS
  • Use the keep_alive parameter to control model loading/unloading behavior
  • Create custom Modelfiles for each use case with specific system prompts
  • Pin model versions in production to avoid unexpected behavior changes on updates
  • Use streaming responses in your applications for better perceived latency
  • Implement request queuing for multi-user servers to avoid memory pressure
  • Test with representative workloads before choosing a model for production
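Most of these practices are one-liners, but request queuing deserves a sketch. Below, an asyncio semaphore caps in-flight requests to a shared server; call_ollama is a placeholder for your real HTTP call to /api/chat:

```python
import asyncio

MAX_INFLIGHT = 2  # cap concurrent requests to the shared server

async def call_ollama(prompt: str) -> str:
    # Placeholder for the real HTTP call to /api/chat on your server.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def queued(sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:  # excess requests wait here instead of piling up in memory
        return await call_ollama(prompt)

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_INFLIGHT)
    return await asyncio.gather(*(queued(sem, f"q{i}") for i in range(5)))

results = asyncio.run(main())
print(len(results))  # 5
```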

Frequently Asked Questions

What hardware do I need to run Ollama?

For 7B models, you need at least 8GB RAM. Apple Silicon Macs with 16GB or more are ideal. NVIDIA GPUs with 8GB+ VRAM also work great. For 70B models, you need 64GB RAM or a GPU with 48GB VRAM.

Is Ollama free to use?

Yes, Ollama is completely free and open-source under the MIT license. There are no usage limits, API costs, or subscription fees. You can use it for personal and commercial projects.

How does Ollama compare to ChatGPT?

Ollama runs models locally on your machine, while ChatGPT runs on OpenAI servers. Local models are typically less capable than GPT-4 but offer complete privacy, zero cost, and no rate limits. Llama 3 70B approaches GPT-4 quality for many tasks.

Can I use Ollama for code generation?

Yes. Code Llama, DeepSeek Coder, and StarCoder2 are excellent coding models available through Ollama. They support code completion, explanation, debugging, and generation in dozens of programming languages.

Does Ollama support fine-tuning?

Ollama does not support fine-tuning directly. However, you can import fine-tuned GGUF models created with other tools like Unsloth or Axolotl. Use Modelfiles to customize behavior through system prompts and parameter tuning.

Can Ollama run multiple models simultaneously?

Yes, Ollama can load multiple models in memory if you have enough RAM or VRAM. Each model occupies memory independently. Use the keep_alive parameter to control how long models stay loaded after the last request.

How do I update Ollama and my models?

On macOS, download the latest version from ollama.com. On Linux, re-run the install script: curl -fsSL https://ollama.com/install.sh | sh. For Docker, pull the latest image. Update models with: ollama pull modelname.

Is my data private when using Ollama?

Yes, completely. All inference happens on your local machine. No data is sent to external servers. No telemetry is collected. This makes Ollama ideal for processing sensitive documents, proprietary code, and confidential business data.
