The Complete Ollama Guide 2026: Run LLMs Locally (Installation, Models, API, and Best Practices)

18 min read · by DevToolBox

Ollama is an open-source tool for running large language models (LLMs) directly on your own machine. If you want privacy, zero API costs, or offline operation, Ollama lets you download, run, and manage models such as Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2 with simple commands.

Quick Summary

Install Ollama on macOS, Linux, Windows, or Docker; run ollama run llama3 to get started; use ollama pull to download models without opening a chat; integrate apps through the REST API at localhost:11434; and create custom models with Modelfiles.

Quick Answers About Ollama Commands

Does Ollama support fine-tuning?

Ollama does not train or fine-tune models itself. The practical workflow is to fine-tune with tools such as Unsloth, Axolotl, or llama.cpp, export a GGUF model or adapter, and then import it into Ollama with a Modelfile that uses FROM or ADAPTER, followed by ollama create.

How do I run a model with Ollama?

Use ollama run model-name. For example, ollama run llama3 downloads the model if needed and opens an interactive chat. For automation, run ollama pull model-name first and then call the REST API at localhost:11434.

What is the official command for downloading models?

Use ollama pull model-name to download or update a model without starting a chat session. Use ollama list to see downloaded models, ollama show model-name for details, and ollama rm model-name to delete one.

Key Takeaways
  • Ollama supports more than 100 models, including Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2
  • Installation is straightforward on macOS and Linux; Docker works on every platform
  • The REST API at localhost:11434 offers endpoints such as /api/generate, /api/chat, and /api/embeddings
  • Modelfiles let you tune parameters, set system prompts, and build specialized models
  • GPU acceleration with CUDA, Metal, and ROCm improves performance considerably
  • 7B models typically need 8 GB of RAM; 13B needs 16 GB, and 70B at least 64 GB

What Is Ollama and Why Run LLMs Locally?

Ollama is a lightweight framework for running language models on your local machine. It wraps llama.cpp in an easy-to-use CLI and REST API, and handles downloads, quantization, GPU use, and memory.

Running LLMs locally offers three advantages: your prompts never leave your machine, you do not pay per token, and you cut latency by avoiding network round trips.

For developers, Ollama is a practical way to try out models, build internal assistants, set up local RAG, and deploy private inference without depending on a cloud provider.

Installation Guide

macOS (Intel and Apple Silicon)

Ollama has first-class support on macOS and uses Metal automatically on Apple Silicon. Installation usually takes less than a minute.

# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg

# Option 2: Install via Homebrew
brew install ollama

# Start the Ollama service
ollama serve

# In a new terminal, run your first model
ollama run llama3

# Verify installation
ollama --version
# ollama version 0.6.2

Linux

On Linux, the official script installs Ollama and detects CUDA when an NVIDIA GPU is present. Ollama can also run as a systemd service.

# One-line install (detects NVIDIA CUDA automatically)
curl -fsSL https://ollama.com/install.sh | sh

# Start Ollama as a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama

# Check service status
sudo systemctl status ollama

# Run a model
ollama run llama3

# View logs for debugging
journalctl -u ollama -f

Windows

Ollama ships a native Windows installer with GPU support. WSL2 is no longer required for normal use.

# Download the Windows installer from ollama.com/download
# Run OllamaSetup.exe — it installs as a Windows service

# After installation, open PowerShell or Command Prompt
ollama run llama3

# The API is available at http://localhost:11434
# Ollama runs in the system tray on Windows

Docker (All Platforms)

Docker is the most portable option across macOS, Linux, and Windows. On Linux, containers can use NVIDIA GPUs through the NVIDIA Container Toolkit.

# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# With NVIDIA GPU support (requires nvidia-container-toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama

# Run a model inside the container
docker exec -it ollama ollama run llama3

# Pull a model without interactive session
docker exec ollama ollama pull mistral

Running Models

Ollama's model library includes ready-to-use options. The ollama run command downloads the model if it is missing and opens an interactive session.

Popular Models

# General purpose — Meta Llama 3 (8B, fast and capable)
ollama run llama3

# Mistral 7B — excellent reasoning, multilingual
ollama run mistral

# Code Llama — optimized for code generation
ollama run codellama

# Microsoft Phi-3 — small but powerful (3.8B)
ollama run phi3

# Google Gemma 2 — strong general performance (9B)
ollama run gemma2

# DeepSeek Coder V2 — top coding model
ollama run deepseek-coder-v2

# Llama 3 70B — near GPT-4 quality (needs 64GB RAM)
ollama run llama3:70b

# Multimodal — LLaVA (vision + text)
ollama run llava
# Then provide an image: /path/to/image.jpg What is in this image?

Model Performance Comparison

Small models are fast and use less memory; large models produce better answers but need more RAM or VRAM.

| Model | Parameters | Size (Q4) | Speed (tok/s)* | Quality | Best For |
|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 2.3 GB | ~65 | Good | Edge, mobile, quick tasks |
| Mistral 7B | 7B | 4.1 GB | ~48 | Very Good | General chat, multilingual |
| Llama 3 8B | 8B | 4.7 GB | ~45 | Very Good | All-around, reasoning |
| Gemma 2 9B | 9B | 5.4 GB | ~38 | Excellent | Instruction following |
| Code Llama 13B | 13B | 7.4 GB | ~28 | Excellent | Code generation, review |
| DeepSeek Coder | 33B | 19 GB | ~14 | Outstanding | Advanced coding tasks |
| Llama 3 70B | 70B | 39 GB | ~8 | Outstanding | Complex reasoning, analysis |

* Approximate tokens/second on Apple M3 Max 64GB with Metal acceleration. Actual speed varies by hardware and quantization.

Model Management

Ollama lets you list, download, remove, and inspect local models, which keeps your environment clean and saves disk space.

# List all downloaded models
ollama list
# NAME              ID            SIZE    MODIFIED
# llama3:latest     a6990ed6be41  4.7 GB  2 hours ago
# mistral:latest    61e88e884507  4.1 GB  3 days ago

# Download a model without running it
ollama pull codellama:13b

# Pull a specific quantization variant
ollama pull llama3:8b-instruct-q5_K_M

# Remove a model to free disk space
ollama rm mistral

# Show model details (parameters, template, license)
ollama show llama3
ollama show llama3 --modelfile  # view the Modelfile

# Copy a model (useful before customizing)
ollama cp llama3 my-llama3

# Create a custom model from a Modelfile
ollama create my-assistant -f ./Modelfile

# List currently running models and their resource usage
ollama ps
# NAME      ID       SIZE     PROCESSOR  UNTIL
# llama3    a6990e   6.7 GB   100% GPU   4 minutes

Ollama REST API

Ollama exposes a REST API at localhost:11434 for integrating local models into applications. The endpoints stream responses by default and can stand in for many OpenAI-compatible workflows.

/api/generate: text generation

# Simple text generation
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain Docker in 3 sentences",
  "stream": false
}'

# With parameters for precise control
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a Python function to merge two sorted lists",
  "stream": false,
  "options": {
    "temperature": 0.2,
    "top_p": 0.9,
    "num_predict": 500
  }
}'

# Response structure
# {
#   "model": "llama3",
#   "response": "Here is a Python function...",
#   "done": true,
#   "total_duration": 1234567890,
#   "eval_count": 142,
#   "eval_duration": 987654321
# }
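
The response fields above are enough to measure generation speed. The following is a minimal Python sketch, assuming the default address used in this guide; the helper names generate and tokens_per_second are illustrative, not part of Ollama. Ollama reports durations in nanoseconds, so the speed is eval_count divided by eval_duration, scaled by 1e9.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def generate(prompt: str, model: str = "llama3", url: str = OLLAMA_URL) -> dict:
    """POST a non-streaming request to /api/generate and return the parsed JSON."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_per_second(result: dict) -> float:
    """Generation speed; eval_duration is reported in nanoseconds."""
    return result["eval_count"] / result["eval_duration"] * 1e9


# Offline check using the sample values from the response structure above.
sample = {"eval_count": 142, "eval_duration": 987654321}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

With a running server, tokens_per_second(generate("Explain Docker in 3 sentences")) gives a figure you can compare against the benchmark table for your own hardware.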

/api/chat: conversational chat

# Multi-turn conversation with system prompt
curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    { "role": "system", "content": "You are a senior DevOps engineer." },
    { "role": "user", "content": "How do I set up a CI/CD pipeline?" },
    { "role": "assistant", "content": "A CI/CD pipeline typically..." },
    { "role": "user", "content": "Show me a GitHub Actions example." }
  ],
  "stream": false
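}'

# Node.js / TypeScript streaming client
async function chat(prompt: string): Promise<void> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!response.ok || !response.body) {
    throw new Error(`Ollama request failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  // The stream is newline-delimited JSON; one read may hold several lines or a partial one.
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the incomplete tail for the next read
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      process.stdout.write(chunk.message?.content ?? "");
    }
  }
}

chat("Explain async/await in TypeScript").catch(console.error);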
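
A streamed /api/chat response is newline-delimited JSON, and a network read can end in the middle of a line. The Python sketch below shows one way to buffer partial lines before parsing. The function names and the simulated chunks are illustrative assumptions, so the example runs without a server.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one JSON object per complete line, buffering partial lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # final object without a trailing newline
        yield json.loads(buffer)


def collect_reply(chunks: Iterable[bytes]) -> str:
    """Concatenate message.content from a streamed /api/chat response."""
    parts = []
    for event in iter_ndjson(chunks):
        parts.append(event.get("message", {}).get("content", ""))
        if event.get("done"):
            break
    return "".join(parts)


# Simulated network reads: the second JSON object is split across two chunks.
fake_stream = [
    b'{"message":{"role":"assistant","content":"Hel"},"done":false}\n{"mess',
    b'age":{"role":"assistant","content":"lo"},"done":false}\n',
    b'{"message":{"role":"assistant","content":""},"done":true}\n',
]
print(collect_reply(fake_stream))
```

Buffering at the byte level also avoids decoding errors when a multi-byte UTF-8 character is split between two reads.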

/api/embeddings: vector embeddings

Embeddings convert text into numeric vectors for semantic search, RAG, and document comparison.

# Generate embeddings for text
curl http://localhost:11434/api/embeddings -d '{
  "model": "llama3",
  "prompt": "Ollama is a tool for running LLMs locally"
}'
# Response: { "embedding": [0.123, -0.456, 0.789, ...] }

# Python: semantic search with embeddings
import requests
import numpy as np

def get_embedding(text: str) -> np.ndarray:
    resp = requests.post("http://localhost:11434/api/embeddings",
        json={"model": "llama3", "prompt": text})
    return np.array(resp.json()["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare document similarity
doc_emb = get_embedding("Docker containers isolate applications")
query_emb = get_embedding("How to run apps in isolation?")
score = cosine_similarity(doc_emb, query_emb)
print(f"Similarity: {score:.4f}")  # the exact value depends on the embedding model
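
Semantic search usually means ranking many documents against one query rather than comparing a single pair. The sketch below ranks a small in-memory index by cosine similarity using only the standard library. The toy three-dimensional vectors and the top_k helper are illustrative assumptions; in practice the vectors would come from the embeddings endpoint.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], documents: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy vectors standing in for real embeddings.
index = {
    "docker": [0.9, 0.1, 0.0],
    "kubernetes": [0.7, 0.3, 0.1],
    "baking": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], index))
```

For more than a few thousand documents, a vector store such as the Chroma setup shown in the LangChain section is a better fit than a linear scan.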

Creating Custom Modelfiles

A Modelfile works like a Dockerfile for LLMs: it defines the base model, parameters, system prompt, and template.

Modelfile Directives

The key directives are FROM, PARAMETER, SYSTEM, and TEMPLATE, plus ADAPTER for applying LoRA weights or importing fine-tuned models.

# Modelfile for a code review assistant
FROM codellama:13b

# Set inference parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
PARAMETER repeat_penalty 1.1

# Define the system prompt
SYSTEM """
You are an expert code reviewer. Analyze code for:
- Bugs and potential errors
- Performance issues and optimization opportunities
- Security vulnerabilities (injection, XSS, etc.)
- Code style and best practices
Provide actionable feedback with specific line references.
Rate severity as: Critical, Warning, or Suggestion.
"""

# Custom prompt template (optional)
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """

# Build and run the custom model
ollama create code-reviewer -f ./Modelfile
ollama run code-reviewer

# Another example: a SQL query assistant
# --- sql-helper.Modelfile ---
FROM llama3
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
SYSTEM """
You are a PostgreSQL expert. Generate optimized SQL queries.
Always explain your queries and suggest indexes when beneficial.
Output format: SQL query first, then explanation.
Use CTEs for complex queries. Avoid SELECT *.
"""

# Import a GGUF model from HuggingFace
# --- import.Modelfile ---
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.5
SYSTEM "You are a helpful assistant."

# Apply a LoRA adapter to a base model
# FROM llama3
# ADAPTER ./my-lora-adapter.gguf
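
When several assistants share the same structure, it can be convenient to generate their Modelfiles from code. This is a small sketch using only the FROM, PARAMETER, and SYSTEM directives shown above; render_modelfile is a hypothetical helper, not an Ollama API, and it assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base: str, system: str, parameters: dict) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    # Triple quotes allow a multi-line system prompt.
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3",
    system="You are a PostgreSQL expert. Avoid SELECT *.",
    parameters={"temperature": 0.1, "num_ctx": 8192},
)
print(text)
```

Writing the result to a file and running ollama create sql-helper -f on it follows the same build step as the examples above.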

GPU Acceleration

Ollama automatically detects and uses available GPUs. GPU acceleration dramatically reduces inference time — a 7B model generates tokens 5-10x faster on GPU compared to CPU-only. Here is how GPU support works on each platform.

Apple Metal (macOS)

Apple Silicon Macs (M1/M2/M3/M4) automatically use Metal for GPU acceleration. No additional setup is needed. Because of the unified memory architecture, the GPU can address most of the system RAM, which gives Apple Silicon an advantage for running larger models.

# Check Metal GPU usage on macOS
ollama ps
# NAME      ID        SIZE     PROCESSOR    UNTIL
# llama3    a6990e    6.7 GB   100% GPU     4 minutes from now

# Apple Silicon performance reference (M3 Max 64GB):
# Llama 3 8B:   ~45 tokens/sec
# Llama 3 70B:  ~8 tokens/sec
# Phi-3 Mini:   ~65 tokens/sec

# Monitor memory pressure in Activity Monitor
# or use: memory_pressure

NVIDIA CUDA (Linux/Windows)

NVIDIA GPUs require CUDA drivers (version 11.7 or higher). Ollama automatically detects CUDA and offloads model layers to the GPU. For GPUs with limited VRAM, Ollama can split the model between GPU and CPU memory.

# Verify NVIDIA GPU detection
nvidia-smi

# Check Ollama GPU usage
ollama ps
# NAME      ID        SIZE    PROCESSOR
# llama3    a6990e    6.7 GB  100% GPU

# Partial GPU offload (when VRAM is limited)
# Offload only 20 layers to GPU with the num_gpu option; the rest stays on CPU
ollama run llama3:70b
# >>> /set parameter num_gpu 20

# Force CPU-only mode
ollama run llama3
# >>> /set parameter num_gpu 0

# Monitor GPU memory during inference
watch -n 1 nvidia-smi

AMD ROCm (Linux)

AMD GPUs are supported via ROCm 5.7+ on Linux. Supported cards include RX 7900 XTX, RX 7900 XT, RX 6900 XT, RX 6800 XT, and Radeon Pro W6800. Performance is comparable to NVIDIA for most workloads.

# Install ROCm for AMD GPUs (Ubuntu)
# Follow: https://rocm.docs.amd.com/en/latest/deploy/linux/

# Run Ollama with ROCm Docker image
docker run -d --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:rocm

Memory Requirements

The amount of RAM (or VRAM) needed depends on the model size and quantization level. As a rule of thumb, you need roughly 1GB of memory per billion parameters for Q4 quantized models. Here are the practical requirements.

| Model Size | Min RAM | Recommended RAM | GPU VRAM | Use Case |
|---|---|---|---|---|
| 1-3B (Phi-3 Mini, TinyLlama) | 4 GB | 8 GB | 4 GB | Edge devices, quick prototyping |
| 7-8B (Llama 3, Mistral) | 8 GB | 16 GB | 8 GB | General use, coding, chat |
| 13B (Code Llama 13B) | 16 GB | 24 GB | 12 GB | Complex reasoning, code review |
| 33-34B (DeepSeek, Code Llama 34B) | 32 GB | 48 GB | 24 GB | Advanced analysis, long context |
| 70B (Llama 3 70B) | 64 GB | 96 GB | 48 GB | Near GPT-4 quality tasks |

Memory requirements are for Q4_K_M quantization. Higher-precision quantization (Q5, Q8) uses more memory but produces slightly better output. Context window size also adds to memory usage: each 1K tokens of context requires approximately 0.5-1 GB of additional memory for 7B models.
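
The rule of thumb above can be turned into a quick estimate. This sketch assumes roughly 1 GB per billion parameters at Q4 and uses the lower bound of the context figure (0.5 GB per 1K tokens), which the text states for 7B models only; applying it to other sizes is an assumption. Treat the output as a rough planning number, not a measurement.

```python
def estimate_memory_gb(
    params_billion: float,
    context_tokens: int = 0,
    gb_per_1k_context: float = 0.5,  # assumption: lower bound quoted for 7B models
) -> float:
    """Rough Q4 memory estimate: ~1 GB per billion parameters plus context overhead."""
    weights_gb = params_billion * 1.0
    context_gb = context_tokens / 1000 * gb_per_1k_context
    return weights_gb + context_gb


def fits(params_billion: float, available_gb: float, context_tokens: int = 0) -> bool:
    """True if the rough estimate fits into the available memory."""
    return estimate_memory_gb(params_billion, context_tokens) <= available_gb


for size in (3.8, 8, 13, 70):
    print(f"{size}B -> ~{estimate_memory_gb(size):.0f} GB")
```

For example, fits(8, 16) suggests an 8B model is comfortable on a 16 GB machine, while fits(70, 32) indicates a 70B model is not.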

Environment Variables

Ollama behavior can be customized through environment variables. These are especially useful for server deployments and Docker configurations.

# Key Ollama environment variables

# OLLAMA_HOST — bind address (default: 127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:11434          # listen on all interfaces

# OLLAMA_MODELS — custom model storage directory
OLLAMA_MODELS=/mnt/ssd/ollama-models  # use a fast SSD

# OLLAMA_ORIGINS — allowed CORS origins
OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com"

# OLLAMA_NUM_PARALLEL — concurrent request handling
OLLAMA_NUM_PARALLEL=4              # handle 4 requests at once

# OLLAMA_MAX_LOADED_MODELS — models kept in memory
OLLAMA_MAX_LOADED_MODELS=2         # keep 2 models loaded

# OLLAMA_KEEP_ALIVE — how long models stay loaded
OLLAMA_KEEP_ALIVE=10m              # unload after 10 minutes

# num_gpu: GPU layer count (a model/request option, not an environment variable)
# Modelfile:  PARAMETER num_gpu 0            # CPU only
# API:        "options": { "num_gpu": 0 }    # CPU only

# Linux systemd: /etc/systemd/system/ollama.service
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_MODELS=/data/models"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama
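
For servers managed by scripts, the systemd snippet above can be generated from a dictionary of settings. A minimal sketch follows; render_systemd_override is a hypothetical helper and the values are the examples from this section.

```python
def render_systemd_override(env: dict) -> str:
    """Render a [Service] drop-in with one Environment= line per variable."""
    lines = ["[Service]"]
    for name, value in env.items():
        lines.append(f'Environment="{name}={value}"')
    return "\n".join(lines) + "\n"


override = render_systemd_override(
    {
        "OLLAMA_HOST": "0.0.0.0",
        "OLLAMA_MODELS": "/data/models",
        "OLLAMA_KEEP_ALIVE": "10m",
    }
)
print(override)
```

The output can be pasted into the editor opened by sudo systemctl edit ollama, followed by a daemon reload and a service restart.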

Integrations

Ollama works seamlessly with popular AI frameworks and tools. Its OpenAI-compatible API means most libraries that work with OpenAI also work with Ollama by changing the base URL.
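
To illustrate the base URL change, here is a standard-library sketch that builds a request for the OpenAI-style chat completions route that Ollama serves under /v1. The helper names are illustrative. The bearer token is a placeholder: OpenAI clients expect a key, and Ollama does not check it.

```python
import json
import urllib.request


def build_chat_completion_request(
    prompt: str,
    model: str = "llama3",
    base_url: str = "http://localhost:11434/v1",
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder key, ignored by Ollama
        },
        method="POST",
    )


def extract_text(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return completion["choices"][0]["message"]["content"]


request = build_chat_completion_request("Say hello")
sample_completion = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(request.full_url)
print(extract_text(sample_completion))
```

Sending the request with urllib.request.urlopen(request) requires a running Ollama server; an OpenAI SDK pointed at the same base URL should behave the same way.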

LangChain

LangChain provides a native Ollama integration for building RAG pipelines, agents, and chains with local models.

# pip install langchain-ollama langchain-chroma
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Basic text generation
llm = OllamaLLM(model="llama3")
response = llm.invoke("Explain Kubernetes in simple terms")
print(response)

# Build a RAG pipeline with local embeddings
embeddings = OllamaEmbeddings(model="llama3")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
)
docs = splitter.split_documents(my_documents)
vectorstore = Chroma.from_documents(docs, embeddings)

# Query the knowledge base
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
results = retriever.invoke("How to deploy with Docker?")

LlamaIndex

LlamaIndex supports Ollama for building knowledge retrieval systems over your own documents.

# pip install llama-index-llms-ollama llama-index-embeddings-ollama
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings

# Configure Ollama as default LLM and embedding model
Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="llama3")

# Load documents and build index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query("What are the main API endpoints?")
print(response)

Open WebUI

Open WebUI provides a ChatGPT-like web interface for Ollama with multi-model support, conversation history, document upload, and web search integration.

# Run Open WebUI with Docker (auto-connects to Ollama)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main

# Access the web UI at http://localhost:3000
# Features:
#   - Multi-model chat with model switching
#   - Conversation history and search
#   - Document upload for RAG
#   - Web search integration
#   - User accounts and admin panel
#   - Custom model presets and system prompts

Ollama vs Alternatives

Here is how Ollama compares to other popular tools for running LLMs locally. Each tool has different strengths depending on your use case.

| Feature | Ollama | LM Studio | llama.cpp | GPT4All |
|---|---|---|---|---|
| Ease of Use | Excellent | Excellent | Advanced | Good |
| REST API | Built-in (OpenAI compat) | Built-in | Optional server | Built-in |
| GUI | CLI only* | Full GUI | None | Full GUI |
| Docker Support | Official images | Community | Community | None |
| Model Library | 100+ curated models | HuggingFace browse | Manual GGUF files | Curated list |
| GPU Support | CUDA/Metal/ROCm | CUDA/Metal | CUDA/Metal/ROCm/Vulkan | CUDA/Metal |
| Customization | Modelfile system | UI settings | Full CLI control | Limited |
| Server / Team Use | Native multi-user | Local only | Optional server | Local only |
| License | MIT | Proprietary | MIT | MIT |
| Best For | Developers, DevOps, teams | Beginners, exploration | Power users, custom builds | Desktop users |

* Ollama pairs with Open WebUI for a full graphical experience comparable to LM Studio.

Performance Tuning

Fine-tune inference parameters to balance speed, quality, and resource usage. The right settings depend on your use case — code generation needs low temperature for precision, while creative writing benefits from higher randomness.

Key Parameters

# Temperature controls randomness (0.0 = deterministic, 2.0 = very random)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a haiku about programming",
  "options": {
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 40,
    "num_predict": 200,
    "num_ctx": 4096,
    "repeat_penalty": 1.1,
    "num_gpu": 99,
    "num_thread": 8
  }
}'

# Parameter reference:
# temperature 0.0-0.3 → factual answers, code generation
# temperature 0.4-0.7 → balanced, general conversation
# temperature 0.8-1.5 → creative writing, brainstorming
#
# num_ctx: context window (default 2048, max depends on model)
#   Higher = more context but more memory and slower
#   Llama 3 supports up to 8192 tokens
#
# num_gpu: GPU layer count (99 = all layers, 0 = CPU only)
# num_thread: CPU threads (default = auto-detect)
# top_p: nucleus sampling (0.9 = consider top 90% probability)
# top_k: limits selection to top K tokens (40 is a good default)
# repeat_penalty: penalize repetition (1.0 = off, 1.1 = moderate)
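
The temperature ranges in the reference above map naturally onto per-task presets. This sketch picks one value inside each range; the specific numbers and the options_for helper are illustrative choices, not recommendations from Ollama. The returned dictionary can be passed as the options field of an API request.

```python
PRESETS = {
    "code": {"temperature": 0.2, "top_p": 0.9},        # within the 0.0-0.3 range
    "chat": {"temperature": 0.6, "top_p": 0.9},        # within the 0.4-0.7 range
    "creative": {"temperature": 1.0, "top_p": 0.95},   # within the 0.8-1.5 range
}


def options_for(task: str, **overrides) -> dict:
    """Return sampling options for a task, with optional per-call overrides."""
    if task not in PRESETS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PRESETS)}")
    return {**PRESETS[task], **overrides}


print(options_for("code", num_predict=500))
```

Keeping presets in one place makes it easier to change a setting for every caller at once.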

Running Ollama as a Team Server

Ollama can serve multiple users on a network by binding to all interfaces instead of just localhost. This turns a single powerful machine into a shared AI inference server for your entire team.

# Bind Ollama to all interfaces for network access
OLLAMA_HOST=0.0.0.0 ollama serve

# Linux: make it permanent via systemd
sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0"
# Add: Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama

# Team members connect from their machines
curl http://your-server-ip:11434/api/chat -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Hello from remote"}],
  "stream": false
}'

# Nginx reverse proxy with SSL (recommended)
# server {
#     listen 443 ssl;
#     server_name ollama.yourcompany.com;
#     ssl_certificate /etc/letsencrypt/live/ollama.yourcompany.com/fullchain.pem;
#     ssl_certificate_key /etc/letsencrypt/live/ollama.yourcompany.com/privkey.pem;
#     location / {
#         proxy_pass http://localhost:11434;
#         proxy_set_header Host $host;
#         proxy_buffering off;
#         proxy_read_timeout 300s;
#     }
# }

Docker Deployment for Production

For production environments, use Docker Compose to run Ollama with Open WebUI and proper resource management. This setup provides a self-hosted ChatGPT alternative for your organization.

# docker-compose.yml for production
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=15m
      - OLLAMA_NUM_PARALLEL=4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
  webui_data:

# Deploy and pre-pull models
docker compose up -d

# Pull models for the team
docker exec ollama ollama pull llama3
docker exec ollama ollama pull codellama:13b
docker exec ollama ollama pull mistral

# Verify everything is running
docker compose ps
curl http://localhost:11434/api/tags  # list available models
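
A deployment script can check the /api/tags response to confirm that the expected models were pulled. The sketch below works on an already parsed response, so it runs without a server. It assumes each entry in the models list has a name field such as llama3:latest; missing_models is a hypothetical helper.

```python
from typing import Iterable, List


def normalize(name: str) -> str:
    """Model names without an explicit tag refer to the 'latest' tag."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_response: dict, required: Iterable[str]) -> List[str]:
    """Return the required models that are absent from an /api/tags response."""
    available = {model["name"] for model in tags_response.get("models", [])}
    return [name for name in required if normalize(name) not in available]


# Sample shaped like the /api/tags response, trimmed to the field used here.
sample_tags = {"models": [{"name": "llama3:latest"}, {"name": "codellama:13b"}]}
print(missing_models(sample_tags, ["llama3", "codellama:13b", "mistral"]))
```

An empty list means every required model is present; anything else can be passed to ollama pull.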

Troubleshooting Common Issues

Here are solutions to the most common problems when running Ollama.

# Problem: "Error: model requires more system memory"
# Solution: Use a smaller model or quantization
ollama run llama3:8b-instruct-q4_0  # a smaller 4-bit variant

# Problem: "connection refused" on localhost:11434
# Solution: Start the Ollama service
ollama serve              # macOS/Linux (foreground)
sudo systemctl start ollama  # Linux (background)

# Problem: Slow generation speed (CPU only)
# Solution: Verify GPU is being used
ollama ps  # check Processor column
# If showing "100% CPU", reinstall GPU drivers

# Problem: Model not found
# Solution: Check available models and pull
ollama list              # see downloaded models
ollama pull llama3       # download if missing

# Problem: CORS errors from web app
# Solution: Set OLLAMA_ORIGINS
OLLAMA_ORIGINS="http://localhost:3000" ollama serve

# Problem: Out of disk space
# Solution: Remove unused models and move storage
ollama rm unused-model
OLLAMA_MODELS=/mnt/large-drive/ollama ollama serve

Best Practices

  • Start with smaller models (7B) for development, scale up for production quality assessment
  • Use quantized models (Q4_K_M) for the best balance of quality and speed
  • Set appropriate context windows — larger contexts use more memory linearly
  • Monitor GPU memory with nvidia-smi or Activity Monitor on macOS
  • Use the keep_alive parameter to control model loading/unloading behavior
  • Create custom Modelfiles for each use case with specific system prompts
  • Pin model versions in production to avoid unexpected behavior changes on updates
  • Use streaming responses in your applications for better perceived latency
  • Implement request queuing for multi-user servers to avoid memory pressure
  • Test with representative workloads before choosing a model for production
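
Request queuing on the client side can be as simple as a semaphore that caps the number of in-flight requests. The following sketch uses asyncio with a stand-in coroutine instead of real inference, so it runs without a server; the limit of 4 mirrors the OLLAMA_NUM_PARALLEL example earlier, and run_bounded is an illustrative name.

```python
import asyncio
from typing import Awaitable, Callable, List

active = 0  # requests currently in flight
peak = 0    # highest concurrency observed


async def fake_inference() -> str:
    """Stand-in for a call to the Ollama API; tracks concurrency."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return "ok"


async def run_bounded(jobs: List[Callable[[], Awaitable[str]]], limit: int) -> List[str]:
    """Run all jobs, allowing at most `limit` of them to execute at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:  # extra requests wait here instead of hitting the server
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


results = asyncio.run(run_bounded([fake_inference] * 10, limit=4))
print(len(results), "requests done, peak concurrency", peak)
```

Replacing fake_inference with a coroutine that calls the API gives the same back-pressure for real traffic.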

Frequently Asked Questions

What hardware do I need to run Ollama?

For 7B models, a practical minimum is 8 GB of RAM. An Apple Silicon Mac with 16 GB or an NVIDIA GPU with 8 GB of VRAM works very well. For 70B models you need much more memory, typically 64 GB or more.

Is Ollama free?

Yes. Ollama is free and open-source under the MIT license. There are no usage fees or per-token costs.

How does Ollama compare to ChatGPT?

Ollama runs models on your own machine; ChatGPT uses external servers. Local models are usually less capable than GPT-4, but they offer privacy, zero cost per query, and full control.

Can I use Ollama for programming?

Yes. Models such as Code Llama, DeepSeek Coder, and StarCoder2 can explain code, debug, generate functions, and review changes.

Does Ollama support fine-tuning?

Ollama does not fine-tune models directly, but it can import GGUF models or adapters created with other tools and run them through Modelfiles.

Can Ollama run several models at once?

Yes, as long as you have enough RAM or VRAM. Use keep_alive to control how long models stay loaded.

How do I update Ollama and its models?

Update Ollama from ollama.com or with the installer for your platform. To update a model, run ollama pull model-name.

Is my data private with Ollama?

Yes. Inference runs locally and your prompts are not sent to external servers, which makes Ollama useful for private code and sensitive documents.
