Ollama is an open-source tool that lets you run large language models (LLMs) on your local machine. Whether you care about data privacy, want to eliminate API costs, or need offline AI capability, Ollama lets you download and run models such as Llama 3, Mistral, and Code Llama with a single command. This guide covers everything from installation to production deployment.
Ollama runs LLMs locally with one command. After installing on macOS/Linux/Windows, run "ollama run llama3" to start chatting, integrate it into applications via the REST API at localhost:11434, and create custom models with a Modelfile. GPU acceleration is supported via CUDA, Metal, and ROCm.
- Ollama supports 100+ models, including Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2
- One-command install on macOS and Linux; Docker works on all platforms
- REST API at localhost:11434 with /api/generate, /api/chat, and /api/embeddings endpoints
- Custom Modelfiles let you tune parameters, set system prompts, and create specialized models
- GPU acceleration via CUDA (NVIDIA), Metal (Apple Silicon), and ROCm (AMD)
- 7B models need 8 GB of RAM, 13B needs 16 GB, and 70B needs at least 64 GB
What Is Ollama, and Why Run LLMs Locally?
Ollama is a lightweight open-source framework for running large language models on your local machine. It wraps llama.cpp with an easy-to-use CLI and REST API, automatically handling model downloads, quantization, GPU acceleration, and memory management.
Running LLMs locally has three major advantages: your data stays completely private and never leaves your machine, there are zero API costs no matter how many queries you make, and inference is low-latency with no network round trips.
Ollama has become the de facto standard for local LLM inference, with well over 100,000 GitHub stars and integrations with every major AI framework. It supports macOS, Linux, and Windows.
Installation Guide
macOS (Intel and Apple Silicon)
Ollama has first-class support for macOS, with Metal GPU acceleration enabled automatically on Apple Silicon. Installation takes under a minute.
# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg
# Option 2: Install via Homebrew
brew install ollama
# Start the Ollama service
ollama serve
# In a new terminal, run your first model
ollama run llama3
# Verify installation
ollama --version
# ollama version 0.6.2
Linux
On Linux, the official install script handles everything, including NVIDIA CUDA driver detection. Ollama runs as a systemd service.
# One-line install (detects NVIDIA CUDA automatically)
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama as a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
# Check service status
sudo systemctl status ollama
# Run a model
ollama run llama3
# View logs for debugging
journalctl -u ollama -f
Windows
Ollama now ships a native Windows installer with GPU acceleration for both NVIDIA and AMD cards. WSL2 is no longer required.
# Download the Windows installer from ollama.com/download
# Run OllamaSetup.exe — it installs as a Windows service
# After installation, open PowerShell or Command Prompt
ollama run llama3
# The API is available at http://localhost:11434
# Ollama runs in the system tray on Windows
Docker (All Platforms)
Docker is the most portable option and works on macOS, Linux, and Windows. NVIDIA GPU passthrough is supported on Linux.
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# With NVIDIA GPU support (requires nvidia-container-toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama \
-p 11434:11434 --name ollama ollama/ollama
# Run a model inside the container
docker exec -it ollama ollama run llama3
# Pull a model without interactive session
docker exec ollama ollama pull mistral
Running Models
Ollama offers a model library with 100+ prebuilt models. The ollama run command downloads the model (if needed) and starts an interactive chat session.
Popular Models
# General purpose — Meta Llama 3 (8B, fast and capable)
ollama run llama3
# Mistral 7B — excellent reasoning, multilingual
ollama run mistral
# Code Llama — optimized for code generation
ollama run codellama
# Microsoft Phi-3 — small but powerful (3.8B)
ollama run phi3
# Google Gemma 2 — strong general performance (9B)
ollama run gemma2
# DeepSeek Coder V2 — top coding model
ollama run deepseek-coder-v2
# Llama 3 70B — near GPT-4 quality (needs 64GB RAM)
ollama run llama3:70b
# Multimodal — LLaVA (vision + text)
ollama run llava
# Then provide an image: /path/to/image.jpg What is in this image?
Model Performance Comparison
Most models come in multiple sizes. Smaller models are faster but less capable; larger models produce better output but need more resources.
| Model | Parameters | Size (Q4) | Speed (tok/s)* | Quality | Best For |
|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 2.3 GB | ~65 | Good | Edge, mobile, quick tasks |
| Mistral 7B | 7B | 4.1 GB | ~48 | Very Good | General chat, multilingual |
| Llama 3 8B | 8B | 4.7 GB | ~45 | Very Good | All-around, reasoning |
| Gemma 2 9B | 9B | 5.4 GB | ~38 | Excellent | Instruction following |
| Code Llama 13B | 13B | 7.4 GB | ~28 | Excellent | Code generation, review |
| DeepSeek Coder | 33B | 19 GB | ~14 | Outstanding | Advanced coding tasks |
| Llama 3 70B | 70B | 39 GB | ~8 | Outstanding | Complex reasoning, analysis |
* Approximate tokens/second on Apple M3 Max 64GB with Metal acceleration. Actual speed varies by hardware and quantization.
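You can reproduce this kind of measurement on your own hardware using the timing fields (eval_count, eval_duration) that /api/generate returns in non-streaming mode. A minimal sketch using only the standard library; it assumes Ollama is running locally with the model already pulled:

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    # Ollama reports eval_duration in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str) -> float:
    """One non-streaming generation; requires a running Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return tokens_per_second(data["eval_count"], data["eval_duration"])
```

Note that this measures decode throughput only; prompt processing time is reported separately in prompt_eval_duration.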
Model Management
Ollama provides commands to list, download, remove, and inspect local models. Managing models deliberately helps conserve disk space.
# List all downloaded models
ollama list
# NAME ID SIZE MODIFIED
# llama3:latest a6990ed6be41 4.7 GB 2 hours ago
# mistral:latest 61e88e884507 4.1 GB 3 days ago
# Download a model without running it
ollama pull codellama:13b
# Pull a specific quantization variant
ollama pull llama3:8b-instruct-q5_K_M
# Remove a model to free disk space
ollama rm mistral
# Show model details (parameters, template, license)
ollama show llama3
ollama show llama3 --modelfile # view the Modelfile
# Copy a model (useful before customizing)
ollama cp llama3 my-llama3
# Create a custom model from a Modelfile
ollama create my-assistant -f ./Modelfile
# List currently running models and their resource usage
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3 a6990e 6.7 GB 100% GPU 4 minutes
Ollama REST API
Ollama exposes a REST API on localhost:11434 for integrating LLMs into any application. Responses stream by default, and an OpenAI-compatible chat endpoint is also available.
/api/generate — text generation
# Simple text generation
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain Docker in 3 sentences",
"stream": false
}'
# With parameters for precise control
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a Python function to merge two sorted lists",
"stream": false,
"options": {
"temperature": 0.2,
"top_p": 0.9,
"num_predict": 500
}
}'
# Response structure
# {
# "model": "llama3",
# "response": "Here is a Python function...",
# "done": true,
# "total_duration": 1234567890,
# "eval_count": 142,
# "eval_duration": 987654321
# }
/api/chat — chat conversations
# Multi-turn conversation with system prompt
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{ "role": "system", "content": "You are a senior DevOps engineer." },
{ "role": "user", "content": "How do I set up a CI/CD pipeline?" },
{ "role": "assistant", "content": "A CI/CD pipeline typically..." },
{ "role": "user", "content": "Show me a GitHub Actions example." }
],
"stream": false
}'
// Node.js / TypeScript streaming client
async function chat(prompt: string): Promise<void> {
const response = await fetch("http://localhost:11434/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama3",
messages: [{ role: "user", content: prompt }],
}),
});
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // The stream is newline-delimited JSON; one network chunk may
    // contain several complete lines plus one partial line.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // keep the trailing partial line for the next read
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      process.stdout.write(chunk.message.content);
    }
  }
}
chat("Explain async/await in TypeScript");
/api/embeddings — vector embeddings
Generate vector embeddings from text for RAG (retrieval-augmented generation), semantic search, and document-similarity scoring.
# Generate embeddings for text
curl http://localhost:11434/api/embeddings -d '{
"model": "llama3",
"prompt": "Ollama is a tool for running LLMs locally"
}'
# Response: { "embedding": [0.123, -0.456, 0.789, ...] }
# Python: semantic search with embeddings
import requests
import numpy as np
def get_embedding(text: str) -> np.ndarray:
resp = requests.post("http://localhost:11434/api/embeddings",
json={"model": "llama3", "prompt": text})
return np.array(resp.json()["embedding"])
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Compare document similarity
doc_emb = get_embedding("Docker containers isolate applications")
query_emb = get_embedding("How to run apps in isolation?")
score = cosine_similarity(doc_emb, query_emb)
print(f"Similarity: {score:.4f}") # ~0.85
Creating Custom Modelfiles
A Modelfile is like a Dockerfile for LLMs. It defines the base model, parameters, system prompt, and template, letting you build specialized models for specific use cases.
Modelfile Directives
The key Modelfile directives are FROM (base model), PARAMETER (inference settings), SYSTEM (system prompt), and TEMPLATE (prompt format).
# Modelfile for a code review assistant
FROM codellama:13b
# Set inference parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
PARAMETER repeat_penalty 1.1
# Define the system prompt
SYSTEM """
You are an expert code reviewer. Analyze code for:
- Bugs and potential errors
- Performance issues and optimization opportunities
- Security vulnerabilities (injection, XSS, etc.)
- Code style and best practices
Provide actionable feedback with specific line references.
Rate severity as: Critical, Warning, or Suggestion.
"""
# Custom prompt template (optional)
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
# Build and run the custom model
ollama create code-reviewer -f ./Modelfile
ollama run code-reviewer
# Another example: a SQL query assistant
# --- sql-helper.Modelfile ---
FROM llama3
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
SYSTEM """
You are a PostgreSQL expert. Generate optimized SQL queries.
Always explain your queries and suggest indexes when beneficial.
Output format: SQL query first, then explanation.
Use CTEs for complex queries. Avoid SELECT *.
"""
# Import a GGUF model from HuggingFace
# --- import.Modelfile ---
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.5
SYSTEM "You are a helpful assistant."
# Apply a LoRA adapter to a base model
# FROM llama3
# ADAPTER ./my-lora-adapter.gguf
GPU Acceleration
Ollama automatically detects and uses available GPUs. GPU acceleration dramatically reduces inference time; a 7B model typically runs 5-10x faster on a GPU than on CPU alone.
Apple Metal (macOS)
Apple Silicon Macs (M1/M2/M3/M4) use Metal for GPU acceleration automatically, with no extra setup. The unified memory architecture lets the GPU access all system RAM.
# Check Metal GPU usage on macOS
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3 a6990e 6.7 GB 100% GPU 4 minutes from now
# Apple Silicon performance reference (M3 Max 64GB):
# Llama 3 8B: ~45 tokens/sec
# Llama 3 70B: ~8 tokens/sec
# Phi-3 Mini: ~65 tokens/sec
# Monitor memory pressure in Activity Monitor
# or use: memory_pressure
NVIDIA CUDA (Linux/Windows)
NVIDIA GPUs require CUDA drivers (version 11.7 or later). Ollama detects CUDA automatically and offloads model layers to the GPU.
# Verify NVIDIA GPU detection
nvidia-smi
# Check Ollama GPU usage
ollama ps
# NAME ID SIZE PROCESSOR
# llama3 a6990e 6.7 GB 100% GPU
# Partial GPU offload (when VRAM is limited)
# Offload only 20 layers to GPU, rest stays on CPU
OLLAMA_NUM_GPU=20 ollama run llama3:70b
# Force CPU-only mode
OLLAMA_NUM_GPU=0 ollama run llama3
# Monitor GPU memory during inference
watch -n 1 nvidia-smi
AMD ROCm (Linux)
AMD GPUs are supported via ROCm 5.7+ on Linux. Supported cards include the RX 7900 XTX, RX 6900 XT, and others.
# Install ROCm for AMD GPUs (Ubuntu)
# Follow: https://rocm.docs.amd.com/en/latest/deploy/linux/
# Run Ollama with ROCm Docker image
docker run -d --device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama:rocm
Memory Requirements
The RAM (or VRAM) you need depends on model size and quantization level. As a rule of thumb, a Q4-quantized model needs roughly 1 GB of memory per billion parameters once runtime overhead is included.
| Model Size | Minimum RAM | Recommended RAM | GPU VRAM | Best For |
|---|---|---|---|---|
| 1-3B (Phi-3 Mini, TinyLlama) | 4 GB | 8 GB | 4 GB | Edge devices, quick prototyping |
| 7-8B (Llama 3, Mistral) | 8 GB | 16 GB | 8 GB | General use, coding, chat |
| 13B (Code Llama 13B) | 16 GB | 24 GB | 12 GB | Complex reasoning, code review |
| 33-34B (DeepSeek, Code Llama 34B) | 32 GB | 48 GB | 24 GB | Advanced analysis, long context |
| 70B (Llama 3 70B) | 64 GB | 96 GB | 48 GB | Near GPT-4 quality tasks |
Memory requirements are for Q4_K_M quantization. Less aggressive quantization (Q5, Q8) uses more memory but produces slightly better output. Context window size also adds to memory usage — each 1K tokens of context requires approximately 0.5-1 GB of additional memory for 7B models.
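These rules of thumb can be combined into a quick back-of-the-envelope estimator. A rough sketch only: the 0.6 GB-per-billion-parameters figure matches the Q4_K_M file sizes in the table above, and the context cost is an approximation scaled from the 7B-class numbers, not an exact formula:

```python
def estimate_memory_gb(params_billion: float, ctx_tokens: int = 2048,
                       gb_per_billion: float = 0.6) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    ~0.6 GB per billion parameters matches the Q4_K_M sizes in the
    table above; context adds very roughly 0.75 GB per 1K tokens for
    7B-class models, scaled here proportionally to model size.
    """
    weights_gb = params_billion * gb_per_billion
    context_gb = (ctx_tokens / 1024) * 0.75 * (params_billion / 7)
    return weights_gb + context_gb

for size in (3.8, 8, 13, 70):
    print(f"{size}B at 2K context: about {estimate_memory_gb(size):.1f} GB")
```

Treat the output as a lower bound for planning; the OS, other processes, and KV-cache growth at long contexts all add on top.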
Environment Variables
Ollama's behavior can be customized through environment variables, which is especially useful for server deployments and Docker setups.
# Key Ollama environment variables
# OLLAMA_HOST — bind address (default: 127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:11434 # listen on all interfaces
# OLLAMA_MODELS — custom model storage directory
OLLAMA_MODELS=/mnt/ssd/ollama-models # use a fast SSD
# OLLAMA_ORIGINS — allowed CORS origins
OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com"
# OLLAMA_NUM_PARALLEL — concurrent request handling
OLLAMA_NUM_PARALLEL=4 # handle 4 requests at once
# OLLAMA_MAX_LOADED_MODELS — models kept in memory
OLLAMA_MAX_LOADED_MODELS=2 # keep 2 models loaded
# OLLAMA_KEEP_ALIVE — how long models stay loaded
OLLAMA_KEEP_ALIVE=10m # unload after 10 minutes
# OLLAMA_NUM_GPU — GPU layer count
OLLAMA_NUM_GPU=99 # all layers on GPU (default)
OLLAMA_NUM_GPU=0 # CPU only
# Linux systemd: /etc/systemd/system/ollama.service
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_MODELS=/data/models"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama
Integrations
Ollama works seamlessly with popular AI frameworks and tools. Its OpenAI-compatible API means most libraries that support OpenAI can also talk to Ollama.
LangChain
LangChain provides native Ollama integrations for building RAG pipelines, agents, and chains.
# pip install langchain-ollama langchain-chroma
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Basic text generation
llm = OllamaLLM(model="llama3")
response = llm.invoke("Explain Kubernetes in simple terms")
print(response)
# Build a RAG pipeline with local embeddings
embeddings = OllamaEmbeddings(model="llama3")
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200
)
docs = splitter.split_documents(my_documents)
vectorstore = Chroma.from_documents(docs, embeddings)
# Query the knowledge base
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
results = retriever.invoke("How to deploy with Docker?")
LlamaIndex
LlamaIndex supports Ollama for building knowledge-retrieval systems over your own documents.
# pip install llama-index-llms-ollama llama-index-embeddings-ollama
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
# Configure Ollama as default LLM and embedding model
Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="llama3")
# Load documents and build index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query("What are the main API endpoints?")
print(response)
Open WebUI
Open WebUI provides a ChatGPT-like web interface for Ollama, with multi-model support, conversation history, document upload, and web search.
# Run Open WebUI with Docker (auto-connects to Ollama)
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:main
# Access the web UI at http://localhost:3000
# Features:
# - Multi-model chat with model switching
# - Conversation history and search
# - Document upload for RAG
# - Web search integration
# - User accounts and admin panel
# - Custom model presets and system prompts
Ollama vs. Alternatives
Here's how Ollama compares with other tools for running LLMs locally. Each tool has different strengths.
| Feature | Ollama | LM Studio | llama.cpp | GPT4All |
|---|---|---|---|---|
| Ease of Use | Excellent | Excellent | Advanced | Good |
| REST API | Built-in (OpenAI compat) | Built-in | Optional server | Built-in |
| GUI | CLI only* | Full GUI | None | Full GUI |
| Docker Support | Official images | Community | Community | None |
| Model Library | 100+ curated models | HuggingFace browse | Manual GGUF files | Curated list |
| GPU Support | CUDA/Metal/ROCm | CUDA/Metal | CUDA/Metal/ROCm/Vulkan | CUDA/Metal |
| Customization | Modelfile system | UI settings | Full CLI control | Limited |
| Server / Team Use | Native multi-user | Local only | Optional server | Local only |
| License | MIT | Proprietary | MIT | MIT |
| Best For | Developers, DevOps, teams | Beginners, exploration | Power users, custom builds | Desktop users |
* Ollama pairs with Open WebUI for a full graphical experience comparable to LM Studio.
Performance Tuning
Tune inference parameters to balance speed, quality, and resource use. The right settings depend on the use case — code generation, for example, calls for a low temperature for precision.
Key Parameters
# Temperature controls randomness (0.0 = deterministic, 2.0 = very random)
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a haiku about programming",
"options": {
"temperature": 0.8,
"top_p": 0.95,
"top_k": 40,
"num_predict": 200,
"num_ctx": 4096,
"repeat_penalty": 1.1,
"num_gpu": 99,
"num_thread": 8
}
}'
# Parameter reference:
# temperature 0.0-0.3 → factual answers, code generation
# temperature 0.4-0.7 → balanced, general conversation
# temperature 0.8-1.5 → creative writing, brainstorming
#
# num_ctx: context window (default 2048, max depends on model)
# Higher = more context but more memory and slower
# Llama 3 supports up to 8192 tokens
#
# num_gpu: GPU layer count (99 = all layers, 0 = CPU only)
# num_thread: CPU threads (default = auto-detect)
# top_p: nucleus sampling (0.9 = consider top 90% probability)
# top_k: limits selection to top K tokens (40 is a good default)
# repeat_penalty: penalize repetition (1.0 = off, 1.1 = moderate)
Running Ollama as a Team Server
Ollama can serve multiple users over the network by binding to all interfaces, turning one powerful machine into a shared AI inference server.
# Bind Ollama to all interfaces for network access
OLLAMA_HOST=0.0.0.0 ollama serve
# Linux: make it permanent via systemd
sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0"
# Add: Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama
# Team members connect from their machines
curl http://your-server-ip:11434/api/chat -d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Hello from remote"}],
"stream": false
}'
# Nginx reverse proxy with SSL (recommended)
# server {
# listen 443 ssl;
# server_name ollama.yourcompany.com;
# ssl_certificate /etc/letsencrypt/live/ollama.yourcompany.com/fullchain.pem;
# ssl_certificate_key /etc/letsencrypt/live/ollama.yourcompany.com/privkey.pem;
# location / {
# proxy_pass http://localhost:11434;
# proxy_set_header Host \$host;
# proxy_buffering off;
# proxy_read_timeout 300s;
# }
# }
Production Docker Deployment
For production, run Ollama together with Open WebUI via Docker Compose for a self-hosted ChatGPT alternative.
# docker-compose.yml for production
version: "3.8"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_KEEP_ALIVE=15m
- OLLAMA_NUM_PARALLEL=4
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
volumes:
- webui_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
volumes:
ollama_data:
webui_data:
# Deploy and pre-pull models
docker compose up -d
# Pull models for the team
docker exec ollama ollama pull llama3
docker exec ollama ollama pull codellama:13b
docker exec ollama ollama pull mistral
# Verify everything is running
docker compose ps
curl http://localhost:11434/api/tags # list available models
Troubleshooting Common Issues
Here are solutions to the most common problems when running Ollama.
# Problem: "Error: model requires more system memory"
# Solution: Use a smaller model or quantization
ollama run llama3:8b-instruct-q4_0 # smallest variant
# Problem: "connection refused" on localhost:11434
# Solution: Start the Ollama service
ollama serve # macOS/Linux (foreground)
sudo systemctl start ollama # Linux (background)
# Problem: Slow generation speed (CPU only)
# Solution: Verify GPU is being used
ollama ps # check Processor column
# If showing "100% CPU", reinstall GPU drivers
# Problem: Model not found
# Solution: Check available models and pull
ollama list # see downloaded models
ollama pull llama3 # download if missing
# Problem: CORS errors from web app
# Solution: Set OLLAMA_ORIGINS
OLLAMA_ORIGINS="http://localhost:3000" ollama serve
# Problem: Out of disk space
# Solution: Remove unused models and move storage
ollama rm unused-model
OLLAMA_MODELS=/mnt/large-drive/ollama ollama serve
Best Practices
- Start with small models (7B) for development; evaluate larger ones for quality before scaling to production
- Use quantized models (Q4_K_M) for the best balance of quality and speed
- Set an appropriate context window — larger contexts increase memory use linearly
- Monitor GPU memory with nvidia-smi, or Activity Monitor on macOS
- Use the keep_alive parameter to control model load/unload behavior
- Create a custom Modelfile with a tailored system prompt for each use case
- Pin model versions in production to avoid unexpected behavior changes on update
- Use streaming responses in applications for better perceived latency
- Implement request queuing on multi-user servers to avoid memory pressure
- Test with representative workloads before committing to a production model
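The streaming advice above looks like this in practice: the native API emits one JSON object per line until a final object with "done": true. A standard-library sketch (assumes a local server with llama3 pulled):

```python
import json
import urllib.request

def parse_stream_line(line: bytes) -> tuple[str, bool]:
    """Each stream line is one JSON object with a token and a done flag."""
    obj = json.loads(line)
    return obj.get("response", ""), obj.get("done", False)

def stream_generate(model: str, prompt: str) -> None:
    """Print tokens as they arrive from /api/generate (streaming is the default)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # HTTPResponse iterates line by line
            token, done = parse_stream_line(line)
            print(token, end="", flush=True)
            if done:
                break
```

Because tokens print as soon as they are decoded, the user sees output within a few hundred milliseconds instead of waiting for the full response.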
Frequently Asked Questions
What hardware do I need to run Ollama?
At least 8 GB of RAM for 7B models. An Apple Silicon Mac with 16 GB+ is ideal, and an NVIDIA GPU with 8 GB+ of VRAM also works well. 70B models need 64 GB of RAM or a GPU with 48 GB of VRAM.
Is Ollama free?
Yes. Ollama is completely free and open source (MIT license). There are no usage limits, API fees, or subscriptions, and it can be used in both personal and commercial projects.
How does Ollama compare to ChatGPT?
Ollama runs models locally, while ChatGPT runs on OpenAI's servers. Local models generally trail GPT-4, but they offer full privacy, zero cost, and no rate limits. Llama 3 70B approaches GPT-4 quality on many tasks.
Can Ollama be used for code generation?
Yes. Code Llama, DeepSeek Coder, and StarCoder2 are excellent coding models available through Ollama, supporting code completion, explanation, debugging, and generation.
Does Ollama support fine-tuning?
Ollama doesn't do fine-tuning itself, but it can import fine-tuned GGUF models created with tools like Unsloth or Axolotl. Customize behavior through Modelfile system prompts and parameter tuning.
Can Ollama run multiple models at once?
Yes. Given enough RAM or VRAM, Ollama can keep several models loaded simultaneously. Use the keep_alive parameter to control how long a model stays loaded after its last request.
How do I update Ollama and its models?
On macOS, download the latest version from ollama.com. On Linux, re-run the install script. Docker users pull the latest image. To update a model, run ollama pull followed by the model name.
Is my data private when using Ollama?
Completely. All inference happens on your local machine; no data is sent to external servers and no telemetry is collected. That makes it well suited to sensitive documents, proprietary code, and confidential business data.