Ollama is an open-source tool for running large language models (LLMs) directly on your own machine. If you want privacy, zero API costs, or offline operation, Ollama lets you download, run, and manage models such as Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2 with simple commands.
Install Ollama on macOS, Linux, Windows, or Docker; run ollama run llama3 to get started; use ollama pull to download models without opening a chat; integrate apps through the REST API at localhost:11434; and create custom models with Modelfiles.
Quick answers about Ollama commands
Does Ollama support fine-tuning?
Ollama does not train or fine-tune models itself. The practical workflow is to fine-tune with tools such as Unsloth, Axolotl, or llama.cpp, export a GGUF model or adapter, and then import it into Ollama with a Modelfile that uses FROM or ADAPTER, followed by ollama create.
How do I run a model with Ollama?
Use ollama run model-name. For example, ollama run llama3 downloads the model if needed and opens an interactive chat. For automation, first run ollama pull model-name and then call the REST API at localhost:11434.
What is the official command for downloading models?
Use ollama pull model-name to download or update a model without starting a chat session. Use ollama list to see downloaded models, ollama show model-name for details, and ollama rm model-name to remove one.
- Ollama supports more than 100 models, including Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2
- Installation is straightforward on macOS and Linux; Docker works on every platform
- The REST API at localhost:11434 offers endpoints such as /api/generate, /api/chat, and /api/embeddings
- Modelfiles let you tune parameters, set system prompts, and build specialized models
- GPU acceleration with CUDA, Metal, and ROCm greatly improves performance
- 7B models typically need 8 GB of RAM; 13B needs 16 GB, and 70B at least 64 GB
What is Ollama, and why run LLMs locally?
Ollama is a lightweight framework for running language models on your local machine. It wraps llama.cpp in an easy-to-use CLI and REST API, handling downloads, quantization, GPU use, and memory.
Running LLMs locally offers three advantages: your prompts never leave your machine, you do not pay per token, and you reduce latency by avoiding network round trips.
For developers, Ollama is a practical way to try out models, build internal assistants, set up local RAG, and deploy private inference without depending on a cloud provider.
Installation guide
macOS (Intel and Apple Silicon)
Ollama has first-class support on macOS and uses Metal automatically on Apple Silicon. Installation usually takes less than a minute.
# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg
# Option 2: Install via Homebrew
brew install ollama
# Start the Ollama service
ollama serve
# In a new terminal, run your first model
ollama run llama3
# Verify installation
ollama --version
# ollama version 0.6.2
Linux
On Linux, the official script installs Ollama and detects CUDA when an NVIDIA GPU is present. It can also run as a systemd service.
# One-line install (detects NVIDIA CUDA automatically)
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama as a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
# Check service status
sudo systemctl status ollama
# Run a model
ollama run llama3
# View logs for debugging
journalctl -u ollama -f
Windows
Ollama ships a native Windows installer with GPU support. WSL2 is no longer required for normal use.
# Download the Windows installer from ollama.com/download
# Run OllamaSetup.exe — it installs as a Windows service
# After installation, open PowerShell or Command Prompt
ollama run llama3
# The API is available at http://localhost:11434
# Ollama runs in the system tray on Windows
Docker (all platforms)
Docker is the most portable option across macOS, Linux, and Windows. On Linux it can use NVIDIA GPUs through the NVIDIA Container Toolkit.
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# With NVIDIA GPU support (requires nvidia-container-toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama \
-p 11434:11434 --name ollama ollama/ollama
# Run a model inside the container
docker exec -it ollama ollama run llama3
# Pull a model without interactive session
docker exec ollama ollama pull mistral
Running models
Ollama's model library includes ready-to-use options. The ollama run command downloads the model if it is missing and opens an interactive session.
Popular models
# General purpose — Meta Llama 3 (8B, fast and capable)
ollama run llama3
# Mistral 7B — excellent reasoning, multilingual
ollama run mistral
# Code Llama — optimized for code generation
ollama run codellama
# Microsoft Phi-3 — small but powerful (3.8B)
ollama run phi3
# Google Gemma 2 — strong general performance (9B)
ollama run gemma2
# DeepSeek Coder V2 — top coding model
ollama run deepseek-coder-v2
# Llama 3 70B — near GPT-4 quality (needs 64GB RAM)
ollama run llama3:70b
# Multimodal — LLaVA (vision + text)
ollama run llava
# Then provide an image: /path/to/image.jpg What is in this image?
Model performance comparison
Smaller models are fast and use less memory; larger ones generally produce better answers but need more RAM or VRAM.
| Model | Parameters | Size (Q4) | Speed (tok/s)* | Quality | Best For |
|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 2.3 GB | ~65 | Good | Edge, mobile, quick tasks |
| Mistral 7B | 7B | 4.1 GB | ~48 | Very Good | General chat, multilingual |
| Llama 3 8B | 8B | 4.7 GB | ~45 | Very Good | All-around, reasoning |
| Gemma 2 9B | 9B | 5.4 GB | ~38 | Excellent | Instruction following |
| Code Llama 13B | 13B | 7.4 GB | ~28 | Excellent | Code generation, review |
| DeepSeek Coder | 33B | 19 GB | ~14 | Outstanding | Advanced coding tasks |
| Llama 3 70B | 70B | 39 GB | ~8 | Outstanding | Complex reasoning, analysis |
* Approximate tokens/second on Apple M3 Max 64GB with Metal acceleration. Actual speed varies by hardware and quantization.
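To measure throughput on your own hardware rather than rely on the reference figures, one option is to derive tokens per second from the timing fields of a non-streaming /api/generate response. The sketch below assumes the eval_count and eval_duration fields shown in this guide's response example, with durations reported in nanoseconds. The helper name tokens_per_second and the sample values are illustrative.

```python
import json

NANOSECONDS_PER_SECOND = 1_000_000_000


def tokens_per_second(response: dict) -> float:
    """Return generation throughput from an /api/generate response.

    Assumes eval_count is the number of generated tokens and
    eval_duration is the generation time in nanoseconds.
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count * NANOSECONDS_PER_SECOND / eval_duration_ns


# Sample payload shaped like a final /api/generate response (values are made up).
sample = json.loads(
    '{"model": "llama3", "done": true, "eval_count": 142, "eval_duration": 987654321}'
)
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Running the same prompt a few times and averaging the result gives a steadier number than a single measurement.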
Model management
Ollama lets you list, download, remove, and inspect local models so you can keep your environment tidy and save disk space.
# List all downloaded models
ollama list
# NAME ID SIZE MODIFIED
# llama3:latest a6990ed6be41 4.7 GB 2 hours ago
# mistral:latest 61e88e884507 4.1 GB 3 days ago
# Download a model without running it
ollama pull codellama:13b
# Pull a specific quantization variant
ollama pull llama3:8b-instruct-q5_K_M
# Remove a model to free disk space
ollama rm mistral
# Show model details (parameters, template, license)
ollama show llama3
ollama show llama3 --modelfile # view the Modelfile
# Copy a model (useful before customizing)
ollama cp llama3 my-llama3
# Create a custom model from a Modelfile
ollama create my-assistant -f ./Modelfile
# List currently running models and their resource usage
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3 a6990e 6.7 GB 100% GPU 4 minutes
Ollama REST API
Ollama exposes a REST API at localhost:11434 for integrating local models into applications. The endpoints stream responses by default and can stand in for many OpenAI-compatible workflows.
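Because streaming is the default, a client has to reassemble the reply from a sequence of newline-delimited JSON objects. The following is a minimal sketch, assuming each line carries a partial response field (for /api/generate) or a message.content field (for /api/chat) and that the final object sets done to true. The function name iter_stream_text and the sample lines are illustrative, not part of Ollama.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        chunk = json.loads(line)
        if "message" in chunk:          # /api/chat shape
            text = chunk["message"].get("content", "")
        else:                           # /api/generate shape
            text = chunk.get("response", "")
        if text:
            yield text
        if chunk.get("done"):
            break


# Simulated stream; a real client would iterate over the HTTP response body,
# for example a urllib response object, which yields one bytes line at a time.
fake_stream = [
    b'{"model": "llama3", "response": "Hello", "done": false}\n',
    b'{"model": "llama3", "response": ", world", "done": false}\n',
    b'{"model": "llama3", "response": "", "done": true}\n',
]
print("".join(iter_stream_text(fake_stream)))
```

Parsing line by line, rather than parsing each network chunk as a whole, matters because a single read can contain several JSON objects or only part of one.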
/api/generate: text generation
# Simple text generation
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain Docker in 3 sentences",
"stream": false
}'
# With parameters for precise control
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a Python function to merge two sorted lists",
"stream": false,
"options": {
"temperature": 0.2,
"top_p": 0.9,
"num_predict": 500
}
}'
# Response structure
# {
# "model": "llama3",
# "response": "Here is a Python function...",
# "done": true,
# "total_duration": 1234567890,
# "eval_count": 142,
# "eval_duration": 987654321
# }
/api/chat: conversational chat
# Multi-turn conversation with system prompt
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{ "role": "system", "content": "You are a senior DevOps engineer." },
{ "role": "user", "content": "How do I set up a CI/CD pipeline?" },
{ "role": "assistant", "content": "A CI/CD pipeline typically..." },
{ "role": "user", "content": "Show me a GitHub Actions example." }
],
"stream": false
}'
// Node.js / TypeScript streaming client
async function chat(prompt: string): Promise<void> {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // A network chunk may hold several JSON lines or end mid-line, so buffer it.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      process.stdout.write(chunk.message?.content ?? "");
    }
  }
}
chat("Explain async/await in TypeScript");
/api/embeddings: vector embeddings
Embeddings turn text into numeric vectors for semantic search, RAG, and document comparison.
# Generate embeddings for text
curl http://localhost:11434/api/embeddings -d '{
"model": "llama3",
"prompt": "Ollama is a tool for running LLMs locally"
}'
# Response: { "embedding": [0.123, -0.456, 0.789, ...] }
# Python: semantic search with embeddings
import requests
import numpy as np
def get_embedding(text: str) -> np.ndarray:
    resp = requests.post("http://localhost:11434/api/embeddings",
                         json={"model": "llama3", "prompt": text})
    return np.array(resp.json()["embedding"])
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Compare document similarity
doc_emb = get_embedding("Docker containers isolate applications")
query_emb = get_embedding("How to run apps in isolation?")
score = cosine_similarity(doc_emb, query_emb)
print(f"Similarity: {score:.4f}") # ~0.85
Creating custom Modelfiles
A Modelfile works like a Dockerfile for LLMs: it defines the base model, parameters, system prompt, and template.
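Since a Modelfile is plain text, it can be generated programmatically when you maintain several variants. The sketch below renders FROM, PARAMETER, and SYSTEM lines from Python values. render_modelfile is a hypothetical helper, and the parameter names are the ones used in this guide.

```python
from typing import Mapping, Optional, Union


def render_modelfile(
    base: str,
    parameters: Mapping[str, Union[int, float, str]],
    system: Optional[str] = None,
) -> str:
    """Render a minimal Modelfile as text."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    if system is not None:
        if '"""' in system:
            raise ValueError("system prompt must not contain triple quotes")
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    "llama3",
    {"temperature": 0.1, "num_ctx": 8192},
    system="You are a PostgreSQL expert. Generate optimized SQL queries.",
)
print(modelfile)
```

The resulting text would be saved to a file and passed to ollama create with -f. This sketch does not quote string values, so parameters such as stop sequences would need extra handling.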
Modelfile directives
The key directives are FROM, PARAMETER, SYSTEM, and TEMPLATE, plus ADAPTER for applying LoRA weights or importing fine-tuned models.
# Modelfile for a code review assistant
FROM codellama:13b
# Set inference parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
PARAMETER repeat_penalty 1.1
# Define the system prompt
SYSTEM """
You are an expert code reviewer. Analyze code for:
- Bugs and potential errors
- Performance issues and optimization opportunities
- Security vulnerabilities (injection, XSS, etc.)
- Code style and best practices
Provide actionable feedback with specific line references.
Rate severity as: Critical, Warning, or Suggestion.
"""
# Custom prompt template (optional)
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
# Build and run the custom model
ollama create code-reviewer -f ./Modelfile
ollama run code-reviewer
# Another example: a SQL query assistant
# --- sql-helper.Modelfile ---
FROM llama3
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
SYSTEM """
You are a PostgreSQL expert. Generate optimized SQL queries.
Always explain your queries and suggest indexes when beneficial.
Output format: SQL query first, then explanation.
Use CTEs for complex queries. Avoid SELECT *.
"""
# Import a GGUF model from HuggingFace
# --- import.Modelfile ---
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.5
SYSTEM "You are a helpful assistant."
# Apply a LoRA adapter to a base model
# FROM llama3
# ADAPTER ./my-lora-adapter.gguf
GPU Acceleration
Ollama automatically detects and uses available GPUs. GPU acceleration dramatically reduces inference time — a 7B model generates tokens 5-10x faster on GPU compared to CPU-only. Here is how GPU support works on each platform.
Apple Metal (macOS)
Apple Silicon Macs (M1/M2/M3/M4) automatically use Metal for GPU acceleration. No additional setup is needed. The unified memory architecture means the GPU can access all system RAM, giving Apple Silicon a unique advantage for running larger models.
# Check Metal GPU usage on macOS
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3 a6990e 6.7 GB 100% GPU 4 minutes from now
# Apple Silicon performance reference (M3 Max 64GB):
# Llama 3 8B: ~45 tokens/sec
# Llama 3 70B: ~8 tokens/sec
# Phi-3 Mini: ~65 tokens/sec
# Monitor memory pressure in Activity Monitor
# or use: memory_pressure
NVIDIA CUDA (Linux/Windows)
NVIDIA GPUs require CUDA drivers (version 11.7 or higher). Ollama automatically detects CUDA and offloads model layers to the GPU. For GPUs with limited VRAM, Ollama can split the model between GPU and CPU memory.
# Verify NVIDIA GPU detection
nvidia-smi
# Check Ollama GPU usage
ollama ps
# NAME ID SIZE PROCESSOR
# llama3 a6990e 6.7 GB 100% GPU
# Partial GPU offload (when VRAM is limited)
# Offload only 20 layers to GPU, rest stays on CPU
OLLAMA_NUM_GPU=20 ollama run llama3:70b
# Force CPU-only mode
OLLAMA_NUM_GPU=0 ollama run llama3
# Monitor GPU memory during inference
watch -n 1 nvidia-smi
AMD ROCm (Linux)
AMD GPUs are supported via ROCm 5.7+ on Linux. Supported cards include RX 7900 XTX, RX 7900 XT, RX 6900 XT, RX 6800 XT, and Radeon Pro W6800. Performance is comparable to NVIDIA for most workloads.
# Install ROCm for AMD GPUs (Ubuntu)
# Follow: https://rocm.docs.amd.com/en/latest/deploy/linux/
# Run Ollama with ROCm Docker image
docker run -d --device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama:rocm
Memory Requirements
The amount of RAM (or VRAM) needed depends on the model size and quantization level. As a rule of thumb, you need roughly 1GB of memory per billion parameters for Q4 quantized models. Here are the practical requirements.
| Model Size | Min RAM | Recommended RAM | GPU VRAM | Use Case |
|---|---|---|---|---|
| 1-3B (Phi-3 Mini, TinyLlama) | 4 GB | 8 GB | 4 GB | Edge devices, quick prototyping |
| 7-8B (Llama 3, Mistral) | 8 GB | 16 GB | 8 GB | General use, coding, chat |
| 13B (Code Llama 13B) | 16 GB | 24 GB | 12 GB | Complex reasoning, code review |
| 33-34B (DeepSeek, Code Llama 34B) | 32 GB | 48 GB | 24 GB | Advanced analysis, long context |
| 70B (Llama 3 70B) | 64 GB | 96 GB | 48 GB | Near GPT-4 quality tasks |
Memory requirements are for Q4_K_M quantization. Higher quantization (Q5, Q8) uses more memory but produces slightly better output. Context window size also adds to memory usage — each 1K tokens of context requires approximately 0.5-1 GB additional memory for 7B models.
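The table above can be turned into a quick lookup for sizing a machine. This is a sketch that simply encodes the rows of that table (Q4_K_M figures). The names MemoryTier, MEMORY_TIERS, and memory_for are illustrative, and the tier boundaries are assumptions based on the largest model named in each row.

```python
from typing import NamedTuple


class MemoryTier(NamedTuple):
    max_params_b: float      # largest model size covered, in billions of parameters
    min_ram_gb: int
    recommended_ram_gb: int
    gpu_vram_gb: int


# Rows copied from the memory requirements table (Q4_K_M quantization).
# The first boundary is set to 4 so that Phi-3 Mini (3.8B) lands in the 1-3B row.
MEMORY_TIERS = [
    MemoryTier(4, 4, 8, 4),
    MemoryTier(8, 8, 16, 8),
    MemoryTier(13, 16, 24, 12),
    MemoryTier(34, 32, 48, 24),
    MemoryTier(70, 64, 96, 48),
]


def memory_for(params_b: float) -> MemoryTier:
    """Return the first table row that covers a model of the given size."""
    for tier in MEMORY_TIERS:
        if params_b <= tier.max_params_b:
            return tier
    raise ValueError(f"no tier listed for {params_b}B parameters")


tier = memory_for(8)
print(f"8B model: at least {tier.min_ram_gb} GB RAM, {tier.gpu_vram_gb} GB VRAM")
```

Sizes that fall between rows, such as a 9B model, resolve to the next larger row, which errs on the side of more memory.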
Environment Variables
Ollama behavior can be customized through environment variables. These are especially useful for server deployments and Docker configurations.
# Key Ollama environment variables
# OLLAMA_HOST — bind address (default: 127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:11434 # listen on all interfaces
# OLLAMA_MODELS — custom model storage directory
OLLAMA_MODELS=/mnt/ssd/ollama-models # use a fast SSD
# OLLAMA_ORIGINS — allowed CORS origins
OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com"
# OLLAMA_NUM_PARALLEL — concurrent request handling
OLLAMA_NUM_PARALLEL=4 # handle 4 requests at once
# OLLAMA_MAX_LOADED_MODELS — models kept in memory
OLLAMA_MAX_LOADED_MODELS=2 # keep 2 models loaded
# OLLAMA_KEEP_ALIVE — how long models stay loaded
OLLAMA_KEEP_ALIVE=10m # unload after 10 minutes
# OLLAMA_NUM_GPU — GPU layer count
OLLAMA_NUM_GPU=99 # all layers on GPU (default)
OLLAMA_NUM_GPU=0 # CPU only
# Linux systemd: /etc/systemd/system/ollama.service
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_MODELS=/data/models"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama
Integrations
Ollama works seamlessly with popular AI frameworks and tools. Its OpenAI-compatible API means most libraries that work with OpenAI also work with Ollama by changing the base URL.
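As a concrete illustration of the base-URL swap, the sketch below builds an OpenAI-style chat completion request aimed at a local Ollama server using only the standard library. It assumes the OpenAI-compatible routes live under /v1 on the default port. build_chat_request is an illustrative helper, and the request is only sent if you uncomment the last lines with a server running.

```python
import json
import urllib.request

OLLAMA_OPENAI_BASE_URL = "http://localhost:11434/v1"  # assumed default local address


def build_chat_request(model: str, user_message: str,
                       base_url: str = OLLAMA_OPENAI_BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("llama3", "Explain Kubernetes in simple terms")
print(request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Keeping the base URL as a parameter means the same code can point at a hosted OpenAI-compatible service or at a shared Ollama server without other changes.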
LangChain
LangChain provides a native Ollama integration for building RAG pipelines, agents, and chains with local models.
# pip install langchain-ollama langchain-chroma langchain-text-splitters
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Basic text generation
llm = OllamaLLM(model="llama3")
response = llm.invoke("Explain Kubernetes in simple terms")
print(response)
# Build a RAG pipeline with local embeddings
embeddings = OllamaEmbeddings(model="llama3")
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200
)
docs = splitter.split_documents(my_documents)
vectorstore = Chroma.from_documents(docs, embeddings)
# Query the knowledge base
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
results = retriever.invoke("How to deploy with Docker?")
LlamaIndex
LlamaIndex supports Ollama for building knowledge retrieval systems over your own documents.
# pip install llama-index-llms-ollama llama-index-embeddings-ollama
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
# Configure Ollama as default LLM and embedding model
Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="llama3")
# Load documents and build index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query("What are the main API endpoints?")
print(response)
Open WebUI
Open WebUI provides a ChatGPT-like web interface for Ollama with multi-model support, conversation history, document upload, and web search integration.
# Run Open WebUI with Docker (auto-connects to Ollama)
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:main
# Access the web UI at http://localhost:3000
# Features:
# - Multi-model chat with model switching
# - Conversation history and search
# - Document upload for RAG
# - Web search integration
# - User accounts and admin panel
# - Custom model presets and system prompts
Ollama vs Alternatives
Here is how Ollama compares to other popular tools for running LLMs locally. Each tool has different strengths depending on your use case.
| Feature | Ollama | LM Studio | llama.cpp | GPT4All |
|---|---|---|---|---|
| Ease of Use | Excellent | Excellent | Advanced | Good |
| REST API | Built-in (OpenAI compat) | Built-in | Optional server | Built-in |
| GUI | CLI only* | Full GUI | None | Full GUI |
| Docker Support | Official images | Community | Community | None |
| Model Library | 100+ curated models | HuggingFace browse | Manual GGUF files | Curated list |
| GPU Support | CUDA/Metal/ROCm | CUDA/Metal | CUDA/Metal/ROCm/Vulkan | CUDA/Metal |
| Customization | Modelfile system | UI settings | Full CLI control | Limited |
| Server / Team Use | Native multi-user | Local only | Optional server | Local only |
| License | MIT | Proprietary | MIT | MIT |
| Best For | Developers, DevOps, teams | Beginners, exploration | Power users, custom builds | Desktop users |
* Ollama pairs with Open WebUI for a full graphical experience comparable to LM Studio.
Performance Tuning
Fine-tune inference parameters to balance speed, quality, and resource usage. The right settings depend on your use case — code generation needs low temperature for precision, while creative writing benefits from higher randomness.
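One way to keep these trade-offs consistent across an application is to store presets per task type and merge them into the options object of each request. The preset names and exact values below are illustrative choices, picked as a low temperature for code and a higher one for creative writing. options_for is a hypothetical helper.

```python
from typing import Dict, Union

Number = Union[int, float]

# Illustrative presets: low temperature for precise output, higher for variety.
PRESETS: Dict[str, Dict[str, Number]] = {
    "code": {"temperature": 0.2, "top_p": 0.9},
    "chat": {"temperature": 0.6, "top_p": 0.9},
    "creative": {"temperature": 1.0, "top_p": 0.95},
}


def options_for(task: str, **overrides: Number) -> Dict[str, Number]:
    """Return request options for a task, with optional per-call overrides."""
    if task not in PRESETS:
        raise KeyError(f"unknown task {task!r}; expected one of {sorted(PRESETS)}")
    options = dict(PRESETS[task])  # copy so the preset itself is never mutated
    options.update(overrides)
    return options


payload = {
    "model": "llama3",
    "prompt": "Write a Python function to merge two sorted lists",
    "stream": False,
    "options": options_for("code", num_predict=500),
}
print(payload["options"])
```

The payload dictionary has the same shape as the /api/generate request bodies used elsewhere in this guide, so it can be serialized with json.dumps and posted as-is.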
Key Parameters
# Temperature controls randomness (0.0 = deterministic, 2.0 = very random)
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a haiku about programming",
"options": {
"temperature": 0.8,
"top_p": 0.95,
"top_k": 40,
"num_predict": 200,
"num_ctx": 4096,
"repeat_penalty": 1.1,
"num_gpu": 99,
"num_thread": 8
}
}'
# Parameter reference:
# temperature 0.0-0.3 → factual answers, code generation
# temperature 0.4-0.7 → balanced, general conversation
# temperature 0.8-1.5 → creative writing, brainstorming
#
# num_ctx: context window (default 2048, max depends on model)
# Higher = more context but more memory and slower
# Llama 3 supports up to 8192 tokens
#
# num_gpu: GPU layer count (99 = all layers, 0 = CPU only)
# num_thread: CPU threads (default = auto-detect)
# top_p: nucleus sampling (0.9 = consider top 90% probability)
# top_k: limits selection to top K tokens (40 is a good default)
# repeat_penalty: penalize repetition (1.0 = off, 1.1 = moderate)
Running Ollama as a Team Server
Ollama can serve multiple users on a network by binding to all interfaces instead of just localhost. This turns a single powerful machine into a shared AI inference server for your entire team.
# Bind Ollama to all interfaces for network access
OLLAMA_HOST=0.0.0.0 ollama serve
# Linux: make it permanent via systemd
sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0"
# Add: Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama
# Team members connect from their machines
curl http://your-server-ip:11434/api/chat -d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Hello from remote"}],
"stream": false
}'
# Nginx reverse proxy with SSL (recommended)
# server {
# listen 443 ssl;
# server_name ollama.yourcompany.com;
# ssl_certificate /etc/letsencrypt/live/ollama.yourcompany.com/fullchain.pem;
# ssl_certificate_key /etc/letsencrypt/live/ollama.yourcompany.com/privkey.pem;
# location / {
# proxy_pass http://localhost:11434;
# proxy_set_header Host $host;
# proxy_buffering off;
# proxy_read_timeout 300s;
# }
# }
Docker Deployment for Production
For production environments, use Docker Compose to run Ollama with Open WebUI and proper resource management. This setup provides a self-hosted ChatGPT alternative for your organization.
# docker-compose.yml for production
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=15m
      - OLLAMA_NUM_PARALLEL=4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
volumes:
  ollama_data:
  webui_data:
# Deploy and pre-pull models
docker compose up -d
# Pull models for the team
docker exec ollama ollama pull llama3
docker exec ollama ollama pull codellama:13b
docker exec ollama ollama pull mistral
# Verify everything is running
docker compose ps
curl http://localhost:11434/api/tags # list available models
Troubleshooting Common Issues
Here are solutions to the most common problems when running Ollama.
# Problem: "Error: model requires more system memory"
# Solution: Use a smaller model or quantization
ollama run llama3:8b-instruct-q4_0 # smallest variant
# Problem: "connection refused" on localhost:11434
# Solution: Start the Ollama service
ollama serve # macOS/Linux (foreground)
sudo systemctl start ollama # Linux (background)
# Problem: Slow generation speed (CPU only)
# Solution: Verify GPU is being used
ollama ps # check Processor column
# If showing "100% CPU", reinstall GPU drivers
# Problem: Model not found
# Solution: Check available models and pull
ollama list # see downloaded models
ollama pull llama3 # download if missing
# Problem: CORS errors from web app
# Solution: Set OLLAMA_ORIGINS
OLLAMA_ORIGINS="http://localhost:3000" ollama serve
# Problem: Out of disk space
# Solution: Remove unused models and move storage
ollama rm unused-model
OLLAMA_MODELS=/mnt/large-drive/ollama ollama serve
Best Practices
- Start with smaller models (7B) for development, scale up for production quality assessment
- Use quantized models (Q4_K_M) for the best balance of quality and speed
- Set appropriate context windows — larger contexts use more memory linearly
- Monitor GPU memory with nvidia-smi or Activity Monitor on macOS
- Use the keep_alive parameter to control model loading/unloading behavior
- Create custom Modelfiles for each use case with specific system prompts
- Pin model versions in production to avoid unexpected behavior changes on updates
- Use streaming responses in your applications for better perceived latency
- Implement request queuing for multi-user servers to avoid memory pressure
- Test with representative workloads before choosing a model for production
Frequently asked questions
What hardware do I need to run Ollama?
For 7B models, plan on at least 8 GB of RAM. An Apple Silicon Mac with 16 GB or an NVIDIA GPU with 8 GB of VRAM works very well. 70B models need far more memory, typically 64 GB or more.
Is Ollama free?
Yes. Ollama is free and open-source under the MIT license. There are no usage fees or per-token costs.
How does Ollama compare with ChatGPT?
Ollama runs models on your own machine; ChatGPT uses external servers. Local models are usually less capable than GPT-4, but they offer privacy, zero cost per query, and full control.
Can I use Ollama for programming?
Yes. Models such as Code Llama, DeepSeek Coder, and StarCoder2 can explain code, debug, generate functions, and review changes.
Does Ollama support fine-tuning?
Ollama does not fine-tune models itself, but it can import GGUF models or adapters created with other tools and run them through Modelfiles.
Can Ollama run several models at once?
Yes, provided you have enough RAM or VRAM. Use keep_alive to control how long models stay loaded.
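As a rough illustration of keep_alive at the request level, the sketch below builds /api/generate bodies that only load or unload a model. It assumes that a request without a prompt loads the model, that keep_alive accepts a duration string such as "10m", and that a value of 0 requests an immediate unload. keep_alive_payload is a hypothetical helper.

```python
import json
from typing import Union


def keep_alive_payload(model: str, keep_alive: Union[str, int]) -> bytes:
    """Build a /api/generate body that only loads (or unloads) a model.

    With no prompt, the request asks the server to load the model;
    keep_alive then sets how long it stays in memory. A value of 0
    requests an immediate unload.
    """
    body = {"model": model, "keep_alive": keep_alive}
    return json.dumps(body).encode("utf-8")


preload = keep_alive_payload("llama3", "10m")   # keep loaded for ten minutes
unload = keep_alive_payload("mistral", 0)       # free its memory right away
print(preload.decode(), unload.decode())
```

Posting these bodies to http://localhost:11434/api/generate lets you warm up one model and release another before switching workloads.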
How do I update Ollama and my models?
Update Ollama from ollama.com or with the installer for your platform. To update models, use ollama pull model-name.
Is my data private with Ollama?
Yes. Inference runs locally and your prompts are not sent to external servers, which makes Ollama useful for private code and sensitive documents.