Ollama is an open-source tool that lets you run large language models (LLMs) on your local machine. Whether you care about data privacy, want to eliminate API costs, or need offline AI capability, Ollama lets you download and run models such as Llama 3, Mistral, and Code Llama with a single command. This guide covers everything from installation to production deployment.
Ollama runs LLMs locally with one command. After installing on macOS/Linux/Windows, run "ollama run llama3" to start chatting, integrate it into applications via the REST API at localhost:11434, and create custom models with a Modelfile. GPU acceleration is supported via CUDA, Metal, and ROCm.
- Ollama supports 100+ models, including Llama 3, Mistral, Code Llama, Phi-3, and Gemma 2
- One-command install on macOS and Linux; Docker works on all platforms
- REST API at localhost:11434 with /api/generate, /api/chat, and /api/embeddings endpoints
- Custom Modelfiles let you tune parameters, set system prompts, and create specialized models
- GPU acceleration via CUDA (NVIDIA), Metal (Apple Silicon), and ROCm (AMD)
- 7B models need 8 GB of RAM, 13B needs 16 GB, and 70B needs at least 64 GB
What Is Ollama, and Why Run LLMs Locally?
Ollama is a lightweight open-source framework for running large language models on your local machine. It wraps llama.cpp with an easy-to-use CLI and REST API, automatically handling model downloads, quantization, GPU acceleration, and memory management.
Running LLMs locally has three major advantages: your data stays completely private and never leaves your machine, there are zero API costs no matter how many queries you make, and inference is low-latency with no network round trips.
Ollama has become the de facto standard for local LLM inference, with well over 100,000 GitHub stars and integrations with every major AI framework. It supports macOS, Linux, and Windows.
Installation Guide
macOS (Intel and Apple Silicon)
Ollama has first-class support for macOS, with Metal GPU acceleration enabled automatically on Apple Silicon. Installation takes under a minute.
# Option 1: Download from ollama.com (recommended)
# Visit https://ollama.com/download and install the .dmg
# Option 2: Install via Homebrew
brew install ollama
# Start the Ollama service
ollama serve
# In a new terminal, run your first model
ollama run llama3
# Verify installation
ollama --version
# ollama version 0.6.2
Linux
On Linux, the official install script handles everything, including NVIDIA CUDA driver detection. Ollama runs as a systemd service.
# One-line install (detects NVIDIA CUDA automatically)
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama as a systemd service
sudo systemctl enable ollama
sudo systemctl start ollama
# Check service status
sudo systemctl status ollama
# Run a model
ollama run llama3
# View logs for debugging
journalctl -u ollama -f
Windows
Ollama now ships a native Windows installer with GPU acceleration for both NVIDIA and AMD cards. WSL2 is no longer required.
# Download the Windows installer from ollama.com/download
# Run OllamaSetup.exe — it installs as a Windows service
# After installation, open PowerShell or Command Prompt
ollama run llama3
# The API is available at http://localhost:11434
# Ollama runs in the system tray on Windows
Docker (All Platforms)
Docker is the most portable option and works on macOS, Linux, and Windows. NVIDIA GPU passthrough is supported on Linux.
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama
# With NVIDIA GPU support (requires nvidia-container-toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama \
-p 11434:11434 --name ollama ollama/ollama
# Run a model inside the container
docker exec -it ollama ollama run llama3
# Pull a model without interactive session
docker exec ollama ollama pull mistral
Running Models
Ollama offers a model library with 100+ prebuilt models. The ollama run command downloads the model (if needed) and starts an interactive chat session.
Popular Models
# General purpose — Meta Llama 3 (8B, fast and capable)
ollama run llama3
# Mistral 7B — excellent reasoning, multilingual
ollama run mistral
# Code Llama — optimized for code generation
ollama run codellama
# Microsoft Phi-3 — small but powerful (3.8B)
ollama run phi3
# Google Gemma 2 — strong general performance (9B)
ollama run gemma2
# DeepSeek Coder V2 — top coding model
ollama run deepseek-coder-v2
# Llama 3 70B — near GPT-4 quality (needs 64GB RAM)
ollama run llama3:70b
# Multimodal — LLaVA (vision + text)
ollama run llava
# Then provide an image: /path/to/image.jpg What is in this image?
Model Performance Comparison
Most models come in multiple sizes. Smaller models are faster but less capable; larger models produce better output but need more resources.
| Model | Parameters | Size (Q4) | Speed (tok/s)* | Quality | Best For |
|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 2.3 GB | ~65 | Good | Edge, mobile, quick tasks |
| Mistral 7B | 7B | 4.1 GB | ~48 | Very Good | General chat, multilingual |
| Llama 3 8B | 8B | 4.7 GB | ~45 | Very Good | All-around, reasoning |
| Gemma 2 9B | 9B | 5.4 GB | ~38 | Excellent | Instruction following |
| Code Llama 13B | 13B | 7.4 GB | ~28 | Excellent | Code generation, review |
| DeepSeek Coder | 33B | 19 GB | ~14 | Outstanding | Advanced coding tasks |
| Llama 3 70B | 70B | 39 GB | ~8 | Outstanding | Complex reasoning, analysis |
* Approximate tokens/second on Apple M3 Max 64GB with Metal acceleration. Actual speed varies by hardware and quantization.
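You can reproduce this kind of measurement on your own hardware using the timing fields (eval_count, eval_duration) that /api/generate returns in non-streaming mode. A minimal sketch using only the standard library; it assumes Ollama is running locally with the model already pulled:

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    # Ollama reports eval_duration in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str) -> float:
    """One non-streaming generation; requires a running Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return tokens_per_second(data["eval_count"], data["eval_duration"])
```

Note that this measures decode throughput only; prompt processing time is reported separately in prompt_eval_duration.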
Model Management
Ollama provides commands to list, download, remove, and inspect local models. Managing models deliberately helps conserve disk space.
# List all downloaded models
ollama list
# NAME ID SIZE MODIFIED
# llama3:latest a6990ed6be41 4.7 GB 2 hours ago
# mistral:latest 61e88e884507 4.1 GB 3 days ago
# Download a model without running it
ollama pull codellama:13b
# Pull a specific quantization variant
ollama pull llama3:8b-instruct-q5_K_M
# Remove a model to free disk space
ollama rm mistral
# Show model details (parameters, template, license)
ollama show llama3
ollama show llama3 --modelfile # view the Modelfile
# Copy a model (useful before customizing)
ollama cp llama3 my-llama3
# Create a custom model from a Modelfile
ollama create my-assistant -f ./Modelfile
# List currently running models and their resource usage
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3 a6990e 6.7 GB 100% GPU 4 minutes
Ollama REST API
Ollama exposes a REST API on localhost:11434 for integrating LLMs into any application. Responses stream by default, and an OpenAI-compatible chat endpoint is also available.
/api/generate — text generation
# Simple text generation
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain Docker in 3 sentences",
"stream": false
}'
# With parameters for precise control
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a Python function to merge two sorted lists",
"stream": false,
"options": {
"temperature": 0.2,
"top_p": 0.9,
"num_predict": 500
}
}'
# Response structure
# {
# "model": "llama3",
# "response": "Here is a Python function...",
# "done": true,
# "total_duration": 1234567890,
# "eval_count": 142,
# "eval_duration": 987654321
# }
/api/chat — chat conversations
# Multi-turn conversation with system prompt
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{ "role": "system", "content": "You are a senior DevOps engineer." },
{ "role": "user", "content": "How do I set up a CI/CD pipeline?" },
{ "role": "assistant", "content": "A CI/CD pipeline typically..." },
{ "role": "user", "content": "Show me a GitHub Actions example." }
],
"stream": false
}'
// Node.js / TypeScript streaming client
async function chat(prompt: string): Promise<void> {
const response = await fetch("http://localhost:11434/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama3",
messages: [{ role: "user", content: prompt }],
}),
});
  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // The stream is newline-delimited JSON; one network chunk may
    // contain several complete lines plus one partial line.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop()!; // keep the trailing partial line for the next read
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      process.stdout.write(chunk.message.content);
    }
  }
}
chat("Explain async/await in TypeScript");
/api/embeddings — vector embeddings
Generate vector embeddings from text for RAG (retrieval-augmented generation), semantic search, and document-similarity scoring.
# Generate embeddings for text
curl http://localhost:11434/api/embeddings -d '{
"model": "llama3",
"prompt": "Ollama is a tool for running LLMs locally"
}'
# Response: { "embedding": [0.123, -0.456, 0.789, ...] }
# Python: semantic search with embeddings
import requests
import numpy as np
def get_embedding(text: str) -> np.ndarray:
resp = requests.post("http://localhost:11434/api/embeddings",
json={"model": "llama3", "prompt": text})
return np.array(resp.json()["embedding"])
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# Compare document similarity
doc_emb = get_embedding("Docker containers isolate applications")
query_emb = get_embedding("How to run apps in isolation?")
score = cosine_similarity(doc_emb, query_emb)
print(f"Similarity: {score:.4f}") # ~0.85
Creating Custom Modelfiles
A Modelfile is like a Dockerfile for LLMs. It defines the base model, parameters, system prompt, and template, letting you build specialized models for specific use cases.
Modelfile Directives
The key Modelfile directives are FROM (base model), PARAMETER (inference settings), SYSTEM (system prompt), and TEMPLATE (prompt format).
# Modelfile for a code review assistant
FROM codellama:13b
# Set inference parameters
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
PARAMETER repeat_penalty 1.1
# Define the system prompt
SYSTEM """
You are an expert code reviewer. Analyze code for:
- Bugs and potential errors
- Performance issues and optimization opportunities
- Security vulnerabilities (injection, XSS, etc.)
- Code style and best practices
Provide actionable feedback with specific line references.
Rate severity as: Critical, Warning, or Suggestion.
"""
# Custom prompt template (optional)
TEMPLATE """{{ .System }}
User: {{ .Prompt }}
Assistant: """
# Build and run the custom model
ollama create code-reviewer -f ./Modelfile
ollama run code-reviewer
# Another example: a SQL query assistant
# --- sql-helper.Modelfile ---
FROM llama3
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
SYSTEM """
You are a PostgreSQL expert. Generate optimized SQL queries.
Always explain your queries and suggest indexes when beneficial.
Output format: SQL query first, then explanation.
Use CTEs for complex queries. Avoid SELECT *.
"""
# Import a GGUF model from HuggingFace
# --- import.Modelfile ---
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.5
SYSTEM "You are a helpful assistant."
# Apply a LoRA adapter to a base model
# FROM llama3
# ADAPTER ./my-lora-adapter.gguf
GPU Acceleration
Ollama automatically detects and uses available GPUs. GPU acceleration dramatically reduces inference time; a 7B model typically runs 5-10x faster on a GPU than on CPU alone.
Apple Metal (macOS)
Apple Silicon Macs (M1/M2/M3/M4) use Metal for GPU acceleration automatically, with no extra setup. The unified memory architecture lets the GPU access all system RAM.
# Check Metal GPU usage on macOS
ollama ps
# NAME ID SIZE PROCESSOR UNTIL
# llama3 a6990e 6.7 GB 100% GPU 4 minutes from now
# Apple Silicon performance reference (M3 Max 64GB):
# Llama 3 8B: ~45 tokens/sec
# Llama 3 70B: ~8 tokens/sec
# Phi-3 Mini: ~65 tokens/sec
# Monitor memory pressure in Activity Monitor
# or use: memory_pressure
NVIDIA CUDA (Linux/Windows)
NVIDIA GPUs require CUDA drivers (version 11.7 or later). Ollama detects CUDA automatically and offloads model layers to the GPU.
# Verify NVIDIA GPU detection
nvidia-smi
# Check Ollama GPU usage
ollama ps
# NAME ID SIZE PROCESSOR
# llama3 a6990e 6.7 GB 100% GPU
# Partial GPU offload (when VRAM is limited)
# Offload only 20 layers to GPU, rest stays on CPU
OLLAMA_NUM_GPU=20 ollama run llama3:70b
# Force CPU-only mode
OLLAMA_NUM_GPU=0 ollama run llama3
# Monitor GPU memory during inference
watch -n 1 nvidia-smi
AMD ROCm (Linux)
AMD GPUs are supported via ROCm 5.7+ on Linux. Supported cards include the RX 7900 XTX, RX 6900 XT, and others.
# Install ROCm for AMD GPUs (Ubuntu)
# Follow: https://rocm.docs.amd.com/en/latest/deploy/linux/
# Run Ollama with ROCm Docker image
docker run -d --device /dev/kfd --device /dev/dri \
-v ollama:/root/.ollama -p 11434:11434 \
--name ollama ollama/ollama:rocm
Memory Requirements
The RAM (or VRAM) you need depends on model size and quantization level. As a rule of thumb, a Q4-quantized model needs roughly 1 GB of memory per billion parameters once runtime overhead is included.
| Model Size | Minimum RAM | Recommended RAM | GPU VRAM | Best For |
|---|---|---|---|---|
| 1-3B (Phi-3 Mini, TinyLlama) | 4 GB | 8 GB | 4 GB | Edge devices, quick prototyping |
| 7-8B (Llama 3, Mistral) | 8 GB | 16 GB | 8 GB | General use, coding, chat |
| 13B (Code Llama 13B) | 16 GB | 24 GB | 12 GB | Complex reasoning, code review |
| 33-34B (DeepSeek, Code Llama 34B) | 32 GB | 48 GB | 24 GB | Advanced analysis, long context |
| 70B (Llama 3 70B) | 64 GB | 96 GB | 48 GB | Near GPT-4 quality tasks |
Memory requirements are for Q4_K_M quantization. Less aggressive quantization (Q5, Q8) uses more memory but produces slightly better output. Context window size also adds to memory usage — each 1K tokens of context requires approximately 0.5-1 GB of additional memory for 7B models.
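These rules of thumb can be combined into a quick back-of-the-envelope estimator. A rough sketch only: the 0.6 GB-per-billion-parameters figure matches the Q4_K_M file sizes in the table above, and the context cost is an approximation scaled from the 7B-class numbers, not an exact formula:

```python
def estimate_memory_gb(params_billion: float, ctx_tokens: int = 2048,
                       gb_per_billion: float = 0.6) -> float:
    """Rough RAM estimate for a Q4-quantized model.

    ~0.6 GB per billion parameters matches the Q4_K_M sizes in the
    table above; context adds very roughly 0.75 GB per 1K tokens for
    7B-class models, scaled here proportionally to model size.
    """
    weights_gb = params_billion * gb_per_billion
    context_gb = (ctx_tokens / 1024) * 0.75 * (params_billion / 7)
    return weights_gb + context_gb

for size in (3.8, 8, 13, 70):
    print(f"{size}B at 2K context: about {estimate_memory_gb(size):.1f} GB")
```

Treat the output as a lower bound for planning; the OS, other processes, and KV-cache growth at long contexts all add on top.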
Environment Variables
Ollama's behavior can be customized through environment variables, which is especially useful for server deployments and Docker setups.
# Key Ollama environment variables
# OLLAMA_HOST — bind address (default: 127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:11434 # listen on all interfaces
# OLLAMA_MODELS — custom model storage directory
OLLAMA_MODELS=/mnt/ssd/ollama-models # use a fast SSD
# OLLAMA_ORIGINS — allowed CORS origins
OLLAMA_ORIGINS="http://localhost:3000,https://myapp.com"
# OLLAMA_NUM_PARALLEL — concurrent request handling
OLLAMA_NUM_PARALLEL=4 # handle 4 requests at once
# OLLAMA_MAX_LOADED_MODELS — models kept in memory
OLLAMA_MAX_LOADED_MODELS=2 # keep 2 models loaded
# OLLAMA_KEEP_ALIVE — how long models stay loaded
OLLAMA_KEEP_ALIVE=10m # unload after 10 minutes
# OLLAMA_NUM_GPU — GPU layer count
OLLAMA_NUM_GPU=99 # all layers on GPU (default)
OLLAMA_NUM_GPU=0 # CPU only
# Linux systemd: /etc/systemd/system/ollama.service
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0"
# Environment="OLLAMA_MODELS=/data/models"
# Then: sudo systemctl daemon-reload && sudo systemctl restart ollama
Integrations
Ollama works seamlessly with popular AI frameworks and tools. Its OpenAI-compatible API means most libraries that support OpenAI can also talk to Ollama.
LangChain
LangChain provides native Ollama integrations for building RAG pipelines, agents, and chains.
# pip install langchain-ollama langchain-chroma
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Basic text generation
llm = OllamaLLM(model="llama3")
response = llm.invoke("Explain Kubernetes in simple terms")
print(response)
# Build a RAG pipeline with local embeddings
embeddings = OllamaEmbeddings(model="llama3")
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, chunk_overlap=200
)
docs = splitter.split_documents(my_documents)
vectorstore = Chroma.from_documents(docs, embeddings)
# Query the knowledge base
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
results = retriever.invoke("How to deploy with Docker?")
LlamaIndex
LlamaIndex supports Ollama for building knowledge-retrieval systems over your own documents.
# pip install llama-index-llms-ollama llama-index-embeddings-ollama
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
# Configure Ollama as default LLM and embedding model
Settings.llm = Ollama(model="llama3", request_timeout=120)
Settings.embed_model = OllamaEmbedding(model_name="llama3")
# Load documents and build index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query your documents
query_engine = index.as_query_engine()
response = query_engine.query("What are the main API endpoints?")
print(response)
Open WebUI
Open WebUI provides a ChatGPT-like web interface for Ollama, with multi-model support, conversation history, document upload, and web search.
# Run Open WebUI with Docker (auto-connects to Ollama)
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart unless-stopped \
ghcr.io/open-webui/open-webui:main
# Access the web UI at http://localhost:3000
# Features:
# - Multi-model chat with model switching
# - Conversation history and search
# - Document upload for RAG
# - Web search integration
# - User accounts and admin panel
# - Custom model presets and system prompts
Ollama vs. Alternatives
Here's how Ollama compares with other tools for running LLMs locally. Each tool has different strengths.
| Feature | Ollama | LM Studio | llama.cpp | GPT4All |
|---|---|---|---|---|
| Ease of Use | Excellent | Excellent | Advanced | Good |
| REST API | Built-in (OpenAI compat) | Built-in | Optional server | Built-in |
| GUI | CLI only* | Full GUI | None | Full GUI |
| Docker Support | Official images | Community | Community | None |
| Model Library | 100+ curated models | HuggingFace browse | Manual GGUF files | Curated list |
| GPU Support | CUDA/Metal/ROCm | CUDA/Metal | CUDA/Metal/ROCm/Vulkan | CUDA/Metal |
| Customization | Modelfile system | UI settings | Full CLI control | Limited |
| Server / Team Use | Native multi-user | Local only | Optional server | Local only |
| License | MIT | Proprietary | MIT | MIT |
| Best For | Developers, DevOps, teams | Beginners, exploration | Power users, custom builds | Desktop users |
* Ollama pairs with Open WebUI for a full graphical experience comparable to LM Studio.
Performance Tuning
Tune inference parameters to balance speed, quality, and resource use. The right settings depend on the use case — code generation, for example, calls for a low temperature for precision.
Key Parameters
# Temperature controls randomness (0.0 = deterministic, 2.0 = very random)
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a haiku about programming",
"options": {
"temperature": 0.8,
"top_p": 0.95,
"top_k": 40,
"num_predict": 200,
"num_ctx": 4096,
"repeat_penalty": 1.1,
"num_gpu": 99,
"num_thread": 8
}
}'
# Parameter reference:
# temperature 0.0-0.3 → factual answers, code generation
# temperature 0.4-0.7 → balanced, general conversation
# temperature 0.8-1.5 → creative writing, brainstorming
#
# num_ctx: context window (default 2048, max depends on model)
# Higher = more context but more memory and slower
# Llama 3 supports up to 8192 tokens
#
# num_gpu: GPU layer count (99 = all layers, 0 = CPU only)
# num_thread: CPU threads (default = auto-detect)
# top_p: nucleus sampling (0.9 = consider top 90% probability)
# top_k: limits selection to top K tokens (40 is a good default)
# repeat_penalty: penalize repetition (1.0 = off, 1.1 = moderate)
Running Ollama as a Team Server
Ollama can serve multiple users over the network by binding to all interfaces, turning one powerful machine into a shared AI inference server.
# Bind Ollama to all interfaces for network access
OLLAMA_HOST=0.0.0.0 ollama serve
# Linux: make it permanent via systemd
sudo systemctl edit ollama
# Add: Environment="OLLAMA_HOST=0.0.0.0"
# Add: Environment="OLLAMA_ORIGINS=*"
sudo systemctl restart ollama
# Team members connect from their machines
curl http://your-server-ip:11434/api/chat -d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Hello from remote"}],
"stream": false
}'
# Nginx reverse proxy with SSL (recommended)
# server {
# listen 443 ssl;
# server_name ollama.yourcompany.com;
# ssl_certificate /etc/letsencrypt/live/ollama.yourcompany.com/fullchain.pem;
# ssl_certificate_key /etc/letsencrypt/live/ollama.yourcompany.com/privkey.pem;
# location / {
# proxy_pass http://localhost:11434;
# proxy_set_header Host \$host;
# proxy_buffering off;
# proxy_read_timeout 300s;
# }
# }
Production Docker Deployment
For production, run Ollama together with Open WebUI via Docker Compose for a self-hosted ChatGPT alternative.
# docker-compose.yml for production
version: "3.8"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
environment:
- OLLAMA_KEEP_ALIVE=15m
- OLLAMA_NUM_PARALLEL=4
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
volumes:
- webui_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
volumes:
ollama_data:
webui_data:
# Deploy and pre-pull models
docker compose up -d
# Pull models for the team
docker exec ollama ollama pull llama3
docker exec ollama ollama pull codellama:13b
docker exec ollama ollama pull mistral
# Verify everything is running
docker compose ps
curl http://localhost:11434/api/tags # list available models
Troubleshooting Common Issues
Here are solutions to the most common problems when running Ollama.
# Problem: "Error: model requires more system memory"
# Solution: Use a smaller model or quantization
ollama run llama3:8b-instruct-q4_0 # smallest variant
# Problem: "connection refused" on localhost:11434
# Solution: Start the Ollama service
ollama serve # macOS/Linux (foreground)
sudo systemctl start ollama # Linux (background)
# Problem: Slow generation speed (CPU only)
# Solution: Verify GPU is being used
ollama ps # check Processor column
# If showing "100% CPU", reinstall GPU drivers
# Problem: Model not found
# Solution: Check available models and pull
ollama list # see downloaded models
ollama pull llama3 # download if missing
# Problem: CORS errors from web app
# Solution: Set OLLAMA_ORIGINS
OLLAMA_ORIGINS="http://localhost:3000" ollama serve
# Problem: Out of disk space
# Solution: Remove unused models and move storage
ollama rm unused-model
OLLAMA_MODELS=/mnt/large-drive/ollama ollama serve
Best Practices
- Start with small models (7B) for development; evaluate larger ones for quality before scaling to production
- Use quantized models (Q4_K_M) for the best balance of quality and speed
- Set an appropriate context window — larger contexts increase memory use linearly
- Monitor GPU memory with nvidia-smi, or Activity Monitor on macOS
- Use the keep_alive parameter to control model load/unload behavior
- Create a custom Modelfile with a tailored system prompt for each use case
- Pin model versions in production to avoid unexpected behavior changes on update
- Use streaming responses in applications for better perceived latency
- Implement request queuing on multi-user servers to avoid memory pressure
- Test with representative workloads before committing to a production model
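The streaming advice above looks like this in practice: the native API emits one JSON object per line until a final object with "done": true. A standard-library sketch (assumes a local server with llama3 pulled):

```python
import json
import urllib.request

def parse_stream_line(line: bytes) -> tuple[str, bool]:
    """Each stream line is one JSON object with a token and a done flag."""
    obj = json.loads(line)
    return obj.get("response", ""), obj.get("done", False)

def stream_generate(model: str, prompt: str) -> None:
    """Print tokens as they arrive from /api/generate (streaming is the default)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # HTTPResponse iterates line by line
            token, done = parse_stream_line(line)
            print(token, end="", flush=True)
            if done:
                break
```

Because tokens print as soon as they are decoded, the user sees output within a few hundred milliseconds instead of waiting for the full response.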
Frequently Asked Questions
What hardware do I need to run Ollama?
At least 8 GB of RAM for 7B models. An Apple Silicon Mac with 16 GB+ is ideal, and an NVIDIA GPU with 8 GB+ of VRAM also works well. 70B models need 64 GB of RAM or a GPU with 48 GB of VRAM.
Is Ollama free?
Yes. Ollama is completely free and open source (MIT license). There are no usage limits, API fees, or subscriptions, and it can be used in both personal and commercial projects.
How does Ollama compare to ChatGPT?
Ollama runs models locally, while ChatGPT runs on OpenAI's servers. Local models generally trail GPT-4, but they offer full privacy, zero cost, and no rate limits. Llama 3 70B approaches GPT-4 quality on many tasks.
Can Ollama be used for code generation?
Yes. Code Llama, DeepSeek Coder, and StarCoder2 are excellent coding models available through Ollama, supporting code completion, explanation, debugging, and generation.
Does Ollama support fine-tuning?
Ollama doesn't do fine-tuning itself, but it can import fine-tuned GGUF models created with tools like Unsloth or Axolotl. Customize behavior through Modelfile system prompts and parameter tuning.
Can Ollama run multiple models at once?
Yes. Given enough RAM or VRAM, Ollama can keep several models loaded simultaneously. Use the keep_alive parameter to control how long a model stays loaded after its last request.
How do I update Ollama and its models?
On macOS, download the latest version from ollama.com. On Linux, re-run the install script. Docker users pull the latest image. To update a model, run ollama pull followed by the model name.
Is my data private when using Ollama?
Completely. All inference happens on your local machine; no data is sent to external servers and no telemetry is collected. That makes it well suited to sensitive documents, proprietary code, and confidential business data.