TL;DR
AI engineering is the discipline of building production-grade applications on top of large language models. Master prompt engineering first (system prompts, few-shot learning, chain-of-thought), add RAG with a vector database when the model needs domain knowledge, and fine-tune only when prompting plus RAG falls short. Use LangChain or LlamaIndex to orchestrate chains and agents. Choose between OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro, or open-source Llama 3 based on cost, latency, context window, and task requirements. Always put guardrails, an evaluation pipeline, and cost monitoring in place before shipping.
Key Takeaways
- AI engineering focuses on building applications on top of pretrained LLMs rather than training models from scratch; core skills include prompt engineering, RAG, API integration, and orchestration frameworks.
- Decision order: optimize prompts first, add RAG if that is not enough, and only then fine-tune; roughly 90% of applications are covered by prompt engineering plus RAG.
- Vector databases are the core component of RAG; choose by scale: pgvector (under a few million vectors), Pinecone (fully managed), Weaviate/Qdrant (open source at large scale).
- AI agents execute autonomous multi-step tasks through the ReAct loop (reasoning + acting) and function calling.
- Guardrails are non-negotiable: content filtering, hallucination detection, output validation, rate limiting, and cost monitoring should all be in place before launch.
- Cost optimization levers: semantic caching, model routing (small models for simple tasks), prompt compression, and batch APIs can cut spend by 60-80%.
1. What Is AI Engineering? (vs. ML Engineering and Data Science)
AI engineering is an emerging discipline focused on building production-grade applications with pretrained large language models (LLMs). It sits at the intersection of software engineering and machine learning, but differs clearly from traditional ML engineering and data science. AI engineers do not train models from scratch; instead they integrate LLM capabilities into products through API calls, prompt engineering, retrieval-augmented generation (RAG), and orchestration frameworks.
The explosive growth of LLMs in 2023-2024 created this role. As GPT-4, Claude 3, Gemini, and similar models became more capable, many use cases no longer required training your own model; they required engineers who know how to use these models effectively. The core value of an AI engineer lies in turning LLM capability into reliable, scalable product features.
AI Engineering vs ML Engineering vs Data Science
=================================================
Role Focus Key Skills Output
---- ----- ---------- ------
Data Scientist Analysis & insights Statistics, SQL, Reports, dashboards,
Python, visualization predictive models
ML Engineer Model training PyTorch, TensorFlow, Trained models,
& deployment MLOps, feature eng. ML pipelines
AI Engineer LLM application Prompt eng., RAG, AI-powered products,
development LangChain, APIs chatbots, agents
Key Differences:
- Data Scientist: "What does the data tell us?"
- ML Engineer: "How do we train and serve a model?"
- AI Engineer: "How do we build a product with an LLM?"

2. LLM APIs: OpenAI, Anthropic Claude, and Google Gemini
LLM APIs are the foundation of AI engineering. The three major providers each have strengths: OpenAI's GPT-4o family has the most mature ecosystem; Anthropic's Claude 3.5 Sonnet leads in long-context understanding and safety; Google's Gemini 1.5 Pro offers an extremely long context window (1M tokens) and deep integration with Google Cloud.
// OpenAI API — Chat Completion
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain async/await in JavaScript" }
],
temperature: 0.7,
max_tokens: 1000,
});
console.log(response.choices[0].message.content);
console.log("Tokens used:", response.usage.total_tokens);

// Anthropic Claude API
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const message = await anthropic.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
system: "You are a senior software architect.",
messages: [
{ role: "user", content: "Design a rate limiter for a REST API" }
],
});
console.log(message.content[0].text);
console.log("Input tokens:", message.usage.input_tokens);
console.log("Output tokens:", message.usage.output_tokens);

// Google Gemini API
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });
const result = await model.generateContent({
contents: [{
role: "user",
parts: [{ text: "Compare microservices vs monolith architecture" }]
}],
generationConfig: {
temperature: 0.7,
maxOutputTokens: 1000,
},
});
console.log(result.response.text());

LLM API Comparison (2026)
==========================
Provider Model Context Input Cost Output Cost Strengths
-------- ----- ------- ---------- ----------- ---------
OpenAI GPT-4o 128K $2.50/1M $10.00/1M Best ecosystem, multimodal
OpenAI GPT-4o-mini 128K $0.15/1M $0.60/1M Cheapest smart model
Anthropic Claude 3.5 Sonnet 200K $3.00/1M $15.00/1M Long context, safety, coding
Anthropic Claude 3.5 Haiku 200K $0.25/1M $1.25/1M Fast and affordable
Google Gemini 1.5 Pro 1M $1.25/1M $5.00/1M Longest context window
Google Gemini 1.5 Flash 1M $0.075/1M $0.30/1M Ultra fast and cheap
Meta Llama 3.1 405B 128K Self-host Self-host Open source, no API cost
Meta Llama 3.1 70B 128K Self-host Self-host Best open-source mid-tier
Mistral Mixtral 8x22B 64K $2.00/1M $6.00/1M Strong EU-based option

3. Prompt Engineering Techniques: Few-Shot Learning, Chain-of-Thought, and System Prompts
Prompt engineering is the most fundamental skill of an AI engineer. A carefully designed prompt can dramatically change output quality without touching the model itself. Mastering system prompts (setting roles and constraints), few-shot learning (guiding with examples), and chain-of-thought (eliciting step-by-step reasoning) is essential for every AI engineer.
// 1. System Prompt — Setting Role and Constraints
const messages = [
{
role: "system",
content: `You are a senior TypeScript developer.
Rules:
- Always use strict TypeScript types (no "any")
- Include error handling for all async operations
- Add JSDoc comments for public functions
- If you are unsure, say "I am not certain" rather than guessing
- Format output as markdown code blocks`
},
{ role: "user", content: "Write a function to fetch user data from an API" }
];

// 2. Few-Shot Learning — Providing Examples
const fewShotMessages = [
{
role: "system",
content: "You classify customer support tickets into categories."
},
// Example 1
{ role: "user", content: "My order hasn't arrived yet, it's been 2 weeks" },
{ role: "assistant", content: "Category: SHIPPING\nPriority: HIGH\nSentiment: FRUSTRATED" },
// Example 2
{ role: "user", content: "How do I change my password?" },
{ role: "assistant", content: "Category: ACCOUNT\nPriority: LOW\nSentiment: NEUTRAL" },
// Example 3
{ role: "user", content: "Your product is amazing, saved me hours!" },
{ role: "assistant", content: "Category: FEEDBACK\nPriority: LOW\nSentiment: POSITIVE" },
// Actual query
{ role: "user", content: "I was charged twice for my subscription" }
];

// 3. Chain-of-Thought (CoT) — Step-by-Step Reasoning
const cotPrompt = {
role: "user",
content: `Analyze whether this API design follows REST best practices.
API Endpoint: POST /api/users/123/delete
Think step by step:
1. Check the HTTP method appropriateness
2. Evaluate the URL structure
3. Assess resource naming conventions
4. Check for idempotency considerations
5. Provide your final assessment with improvements`
};
// CoT Variations:
// - "Let's think step by step" (zero-shot CoT)
// - "Think through this carefully before answering" (implicit CoT)
// - Provide worked examples with reasoning (few-shot CoT)
// - "First analyze X, then consider Y, finally conclude Z" (structured CoT)

Prompt Engineering Techniques — Quick Reference
================================================
Technique When to Use Example
--------- ----------- -------
Zero-shot Simple, well-defined tasks "Translate to French: Hello"
Few-shot Classification, formatting Provide 3-5 input/output pairs
Chain-of-Thought Math, logic, complex reasoning "Think step by step..."
System Prompt Role, tone, constraints "You are a legal expert..."
Output Format Structured data needed "Respond in JSON format..."
Self-consistency High-stakes decisions Generate 5 answers, take majority
Tree of Thought Complex problem solving Explore multiple solution paths
ReAct Tool use, agents Think -> Act -> Observe loop
Retrieval-Augmented Domain-specific knowledge Inject relevant docs into context
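Self-consistency from the table above can be sketched without any framework: sample the same prompt several times at a non-zero temperature and take the majority answer. In this sketch, `generate` is a placeholder for whatever LLM call you use; only the voting logic is shown.

```typescript
// Self-consistency (sketch): ask the same question several times, then
// return the majority answer. `generate` stands in for an LLM call that
// yields a short, comparable answer.
async function selfConsistent(
  generate: () => Promise<string>,
  samples = 5
): Promise<string> {
  const votes = new Map<string, number>();
  for (let i = 0; i < samples; i++) {
    // Normalize so trivially different strings don't split the vote
    const answer = (await generate()).trim().toLowerCase();
    votes.set(answer, (votes.get(answer) ?? 0) + 1);
  }
  let best = "";
  let bestCount = 0;
  for (const [answer, count] of votes) {
    if (count > bestCount) {
      best = answer;
      bestCount = count;
    }
  }
  return best;
}
```

For classification-style outputs, normalizing answers before voting (trim, lowercase) keeps equivalent answers from splitting the vote; for free-form text, majority voting needs a similarity measure instead of exact matching.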
Anti-patterns to Avoid:
- Vague instructions ("make it better")
- Overloading context (irrelevant information)
- No output format specification
- Not handling edge cases in prompt
- Using negation ("don't do X") instead of affirmation ("do Y")

4. RAG Architecture: Retrieval-Augmented Generation
RAG is one of the most important architectural patterns in AI engineering today. By retrieving relevant documents from an external knowledge base before generating an answer, it addresses two core LLM problems: the knowledge cutoff and hallucination. RAG lets the model answer from up-to-date, domain-specific data while providing traceable source citations.
RAG Architecture — Data Flow
============================
INDEXING PHASE (offline, one-time):
┌──────────┐ ┌──────────┐ ┌───────────┐ ┌───────────────┐
│ Documents │ -> │ Chunking │ -> │ Embedding │ -> │ Vector Store │
│ (PDF,Web, │ │ (split │ │ Model │ │ (Pinecone, │
│ Notion) │ │ text) │ │ (ada-002) │ │ pgvector) │
└──────────┘ └──────────┘ └───────────┘ └───────────────┘
QUERY PHASE (online, per-request):
┌──────────┐ ┌───────────┐ ┌───────────────┐
│ User │ -> │ Embed │ -> │ Vector Search │
│ Question │ │ Query │ │ (top-K docs) │
└──────────┘ └───────────┘ └───────┬───────┘
│
v
┌──────────┐ ┌───────────────────────────────┐
│ Answer │ <- │ LLM (query + retrieved docs) │
│ + Sources│ │ "Based on context, answer..." │
└──────────┘ └───────────────────────────────┘

// Complete RAG Pipeline in TypeScript
import { OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Step 1: Document Chunking
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // characters per chunk
chunkOverlap: 200, // overlap between chunks
separators: ["\n\n", "\n", ". ", " "], // split priority
});
const chunks = await splitter.splitDocuments(documents);
// Step 2: Generate Embeddings & Store
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small", // 1536 dimensions
});
const pinecone = new Pinecone();
const index = pinecone.Index("my-rag-index");
const vectorStore = await PineconeStore.fromDocuments(
chunks,
embeddings,
{ pineconeIndex: index }
);
// Step 3: Query — Retrieve + Generate
async function ragQuery(question: string) {
// Retrieve top 5 most relevant chunks
const relevantDocs = await vectorStore.similaritySearch(question, 5);
// Build context from retrieved documents
const context = relevantDocs
.map((doc, i) => "Source " + (i + 1) + ": " + doc.pageContent)
.join("\n\n");
// Generate answer with context
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: "Answer the question based ONLY on the provided context. " +
"If the context does not contain the answer, say so. " +
"Cite your sources using [Source N] format."
},
{
role: "user",
content: "Context:\n" + context + "\n\nQuestion: " + question
}
],
temperature: 0.2, // low temp for factual answers
});
return {
answer: response.choices[0].message.content,
sources: relevantDocs.map(d => d.metadata),
};
}

Chunking Strategies
====================
Strategy Chunk Size Overlap Best For
-------- ---------- ------- --------
Fixed-size 500-1000 100-200 General purpose, simple
Recursive character 500-1500 100-300 Most text documents
Sentence-based 1-5 sent. 1 sent. FAQ, precise retrieval
Semantic Varies N/A Topic-coherent chunks
Parent-child Small+Large N/A Retrieve small, pass large
Markdown header By section N/A Technical documentation
Code-aware By function N/A Source code files
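The fixed-size strategy in the table can be implemented in a few lines. A framework-free sketch follows; real splitters such as RecursiveCharacterTextSplitter additionally respect separator boundaries, while this version only slides a character window.

```typescript
// Fixed-size chunking with overlap (sketch). Each chunk is `chunkSize`
// characters; consecutive chunks share `overlap` characters so that
// information at boundaries is not lost.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
    start += chunkSize - overlap; // step forward, keeping `overlap` chars
  }
  return chunks;
}
```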
Tips:
- Smaller chunks = more precise retrieval but less context
- Larger chunks = more context but noisier retrieval
- Overlap prevents information loss at chunk boundaries
- Test with your actual data to find the optimal size

5. Vector Databases: Pinecone, Weaviate, Qdrant, and pgvector
Vector databases are the storage layer at the heart of a RAG architecture. They are purpose-built for similarity search over high-dimensional vectors and can find the closest matches among millions of vectors in milliseconds. The right choice depends on your scale, infrastructure preferences, and feature requirements.
// Pinecone — Fully Managed Vector Database
import { Pinecone } from "@pinecone-database/pinecone";
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index("my-index");
// Upsert vectors
await index.upsert([
{
id: "doc-1",
values: embedding, // float array [0.1, -0.2, ...]
metadata: { source: "docs/api.md", section: "auth" },
},
]);
// Query with metadata filter
const results = await index.query({
vector: queryEmbedding,
topK: 5,
filter: { source: { "$eq": "docs/api.md" } },
includeMetadata: true,
});

// pgvector — PostgreSQL Extension (great for existing PG users)
// SQL setup:
// CREATE EXTENSION vector;
// CREATE TABLE documents (
// id SERIAL PRIMARY KEY,
// content TEXT,
// metadata JSONB,
// embedding vector(1536) -- dimension matches your model
// );
// CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
import pg from "pg";
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
// Insert document with embedding
await pool.query(
"INSERT INTO documents (content, metadata, embedding) VALUES ($1, $2, $3)",
[content, JSON.stringify(metadata), JSON.stringify(embedding)]
);
// Similarity search (cosine distance)
const result = await pool.query(
`SELECT content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5`,
[JSON.stringify(queryEmbedding)]
);

Vector Database Comparison
===========================
Database Type Best For Pricing Max Scale
-------- ---- -------- ------- ---------
Pinecone Managed Zero-ops, quick start Pay-per-use Billions
Weaviate OSS/Cloud Hybrid search, modules Free/Managed Billions
Qdrant OSS/Cloud Filtering, performance Free/Managed Billions
pgvector PG Ext. Existing PG infra Free (self) Millions
Chroma OSS Local dev, prototyping Free Millions
Milvus OSS Ultra-large scale Free/Managed Trillions
FAISS Library In-memory, research Free Billions
Distance Metrics:
- Cosine similarity: normalized, most common for text embeddings
- Euclidean (L2): absolute distance, good for image features
- Dot product: fastest, works when vectors are normalized

6. LangChain and LlamaIndex Frameworks
LangChain and LlamaIndex are the two most popular frameworks for LLM application development. LangChain is a general-purpose orchestration framework, well suited to complex multi-step workflows and agents; LlamaIndex focuses on data connection and retrieval and is the first choice for building RAG applications. The two can be used together.
// LangChain — Chain with Prompt Template
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
// Simple chain: prompt -> model -> parser
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a technical writer. Write clear, concise explanations."],
["user", "Explain {topic} for {audience}"],
]);
const chain = RunnableSequence.from([
prompt,
model,
new StringOutputParser(),
]);
const result = await chain.invoke({
topic: "Kubernetes pods",
audience: "junior developers",
});
console.log(result);

// LangChain — RAG Chain with Retriever
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
const retriever = vectorStore.asRetriever({ k: 5 });
const ragPrompt = ChatPromptTemplate.fromMessages([
["system",
"Answer based on the following context only.\n" +
"If the context does not help, say you do not know.\n" +
"Context: {context}"],
["user", "{input}"],
]);
const documentChain = await createStuffDocumentsChain({
llm: model,
prompt: ragPrompt,
});
const ragChain = await createRetrievalChain({
retriever,
combineDocsChain: documentChain,
});
const answer = await ragChain.invoke({
input: "How do I configure SSL in Nginx?",
});
console.log(answer.answer);
console.log("Sources:", answer.context.map(d => d.metadata.source));

# LlamaIndex — RAG Pipeline (Python)
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Load documents and build index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query with RAG
query_engine = index.as_query_engine(
similarity_top_k=5,
response_mode="compact", # "tree_summarize", "refine", "compact"
)
response = query_engine.query("How do I set up a CI/CD pipeline?")
print(response.response)
print("Sources:", [n.metadata for n in response.source_nodes])
# Chat engine with memory
chat_engine = index.as_chat_engine(
chat_mode="context", # "condense_question", "react"
system_prompt="You are a DevOps expert.",
)
response = chat_engine.chat("What is Kubernetes?")
response = chat_engine.chat("How does it compare to Docker Swarm?")

LangChain vs LlamaIndex — When to Use Which
=============================================
Feature LangChain LlamaIndex
------- --------- ----------
Primary Focus Chain orchestration Data indexing & retrieval
Best For Complex workflows, RAG applications,
multi-step agents knowledge bases
Data Connectors Via integrations 150+ built-in loaders
Agent Support Excellent (LCEL, tools) Good (ReAct, tools)
RAG Quality Good Excellent (advanced chunking)
Learning Curve Moderate Lower for RAG tasks
Language Support Python, TypeScript/JS Python (primary), TS
Community Very large Large
Use LangChain when:
- Building multi-step agent workflows
- Need complex chain composition (LCEL)
- Building chatbots with tool calling
Use LlamaIndex when:
- Primary need is RAG over your data
- Need advanced indexing strategies
- Have diverse data sources to connect

7. Fine-Tuning vs RAG vs Prompt Engineering: A Decision Tree
These three techniques are the main levers an AI engineer has for customizing LLM behavior. Choosing the right one can significantly reduce cost and development time. Follow one simple principle: start with the simplest approach and escalate to more complex methods only when necessary.
Decision Tree: How to Customize LLM Behavior
=============================================
Start here: "What does the model need to learn?"
|
|-- Nothing new, just better outputs?
| --> PROMPT ENGINEERING
| - System prompts, few-shot examples, output formats
| - Cost: $0, Time: hours, Skill: low
|
|-- Needs specific/recent facts or your data?
| --> RAG (Retrieval-Augmented Generation)
| - Vector DB + retrieval pipeline + prompt
| - Cost: $100-$1K setup, Time: days, Skill: medium
|
|-- Needs to change behavior/style/format?
| |-- Is the change simple (e.g., always respond in JSON)?
| | --> PROMPT ENGINEERING (with structured output)
| |
| |-- Is the change complex (domain-specific reasoning)?
| --> FINE-TUNING
| - Prepare training data, train, evaluate
| - Cost: $500-$10K+, Time: weeks, Skill: high
|
|-- Needs both facts AND behavior change?
--> FINE-TUNING + RAG (combined)
- Fine-tune for domain language understanding
- RAG for real-time factual grounding
- Cost: highest, Time: weeks-months

// OpenAI Fine-Tuning API Example
// Step 1: Prepare training data (JSONL format)
// training_data.jsonl:
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Show all users created today"},
// {"role":"assistant","content":"SELECT * FROM users WHERE created_at >= CURRENT_DATE;"}]}
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Count active premium users"},
// {"role":"assistant","content":"SELECT COUNT(*) FROM users WHERE status = 'active' AND plan = 'premium';"}]}
// Step 2: Upload training file
const file = await openai.files.create({
file: fs.createReadStream("training_data.jsonl"),
purpose: "fine-tune",
});
// Step 3: Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
training_file: file.id,
model: "gpt-4o-mini-2024-07-18",
hyperparameters: {
n_epochs: 3,
batch_size: "auto",
learning_rate_multiplier: "auto",
},
});
// Step 4: Use fine-tuned model
// const response = await openai.chat.completions.create({
// model: "ft:gpt-4o-mini-2024-07-18:my-org::abc123",
// messages: [...]
// });

Prompt Engineering vs RAG vs Fine-Tuning
=========================================
Dimension Prompt Eng. RAG Fine-Tuning
--------- ----------- --- -----------
Setup Cost Free Low-Medium High
Time to Implement Hours Days Weeks
Per-Query Cost Base model cost +Retrieval cost -20-50% savings
Knowledge Update Change prompt Update vector DB Retrain model
Factual Accuracy Model knowledge High (grounded) Model knowledge
Style/Format Good (examples) Limited Excellent
Source Citations No Yes No
Data Privacy Sent to API Chunks sent Trained into model
Maintenance Easy Medium Complex
Best Starting Point YES Second choice Last resort

8. Embedding Models and Semantic Search
Embedding models convert text into high-dimensional vectors (typically 768-3072 dimensions) such that semantically similar texts end up close together in vector space. Embeddings underpin RAG, semantic search, text classification, clustering, and more. The embedding model you choose directly affects retrieval quality.
// Generating Embeddings with OpenAI
const response = await openai.embeddings.create({
model: "text-embedding-3-small", // or "text-embedding-3-large"
input: "How do I deploy a Next.js application?",
dimensions: 1536, // can reduce for cost savings (e.g., 512)
});
const embedding = response.data[0].embedding;
// embedding = [0.0123, -0.0456, 0.0789, ...] (1536 floats)
// Batch embeddings (more efficient)
const batchResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: [
"How to set up Docker containers",
"Kubernetes pod configuration guide",
"CI/CD pipeline with GitHub Actions",
"AWS Lambda serverless functions",
],
});
// batchResponse.data[0].embedding, batchResponse.data[1].embedding, ...

// Semantic Search Implementation
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Compare semantic similarity
const queries = [
"How to deploy to production", // semantically similar
"Production deployment guide", // semantically similar
"Best pizza recipe", // semantically different
];
// Result: queries[0] and queries[1] will have high similarity (~0.92)
// queries[0] and queries[2] will have low similarity (~0.15)
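Putting the similarity function to work, here is a self-contained ranking sketch. The toy 3-dimensional vectors stand in for real embeddings, and `cosSim` is a compact restatement of the `cosineSimilarity` helper above so the snippet runs on its own.

```typescript
// Rank candidate documents by cosine similarity to a query vector.
function cosSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function rank(queryVec: number[], docs: { id: string; vec: number[] }[]) {
  return docs
    .map(d => ({ id: d.id, score: cosSim(queryVec, d.vec) }))
    .sort((x, y) => y.score - x.score); // highest similarity first
}

// Toy vectors: doc-a points the same way as the query, doc-b is orthogonal
const ranked = rank([1, 0, 0], [
  { id: "doc-a", vec: [0.9, 0.1, 0] },
  { id: "doc-b", vec: [0, 1, 0] },
]);
// ranked[0].id === "doc-a"
```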
// Hybrid Search: combine semantic + keyword for best results
// score = alpha * semantic_score + (1 - alpha) * bm25_score
// alpha = 0.7 is a good starting point

Embedding Model Comparison
===========================
Model Provider Dims Cost/1M tokens MTEB Score
----- -------- ---- -------------- ----------
text-embedding-3-large OpenAI 3072 $0.13 64.6
text-embedding-3-small OpenAI 1536 $0.02 62.3
voyage-3 Voyage AI 1024 $0.06 67.1
embed-v3.0 Cohere 1024 $0.10 64.8
BGE-large-en-v1.5 BAAI 1024 Free (OSS) 63.5
GTE-Qwen2-7B Alibaba 3584 Free (OSS) 70.2
nomic-embed-text Nomic 768 Free (OSS) 62.4
Tips:
- Start with text-embedding-3-small (best cost/quality ratio)
- Use dimension reduction for cost savings (3072 -> 1024)
- Benchmark on YOUR data, not just MTEB leaderboard
- Open-source models (BGE, GTE) are competitive and free
- Use the same model for indexing and querying

9. AI Agents and Tool Use (Function Calling)
AI agents are LLM applications that can autonomously plan, reason, and use tools to complete complex tasks. Unlike simple Q&A, an agent can decompose a task, call external APIs, execute code, query databases, and adjust its strategy based on intermediate results. Function calling is the core API mechanism that lets agents invoke tools.
// OpenAI Function Calling — Tool Use
const tools = [
{
type: "function",
function: {
name: "search_documentation",
description: "Search technical documentation for a given query",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search query" },
language: {
type: "string",
enum: ["javascript", "python", "rust", "go"],
description: "Programming language filter",
},
},
required: ["query"],
},
},
},
{
type: "function",
function: {
name: "execute_code",
description: "Execute a code snippet and return the output",
parameters: {
type: "object",
properties: {
code: { type: "string", description: "Code to execute" },
language: { type: "string", enum: ["javascript", "python"] },
},
required: ["code", "language"],
},
},
},
];
// Agent loop: Think -> Act -> Observe -> Think
async function agentLoop(userMessage: string) {
const messages: any[] = [
{
role: "system",
content: "You are a helpful coding assistant. Use tools when needed."
},
{ role: "user", content: userMessage }
];
while (true) { // in production, cap iterations to avoid runaway tool loops
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages,
tools,
tool_choice: "auto",
});
const message = response.choices[0].message;
messages.push(message);
// If no tool calls, the agent is done
if (!message.tool_calls || message.tool_calls.length === 0) {
return message.content;
}
// Execute each tool call
for (const toolCall of message.tool_calls) {
const args = JSON.parse(toolCall.function.arguments);
let result: string;
if (toolCall.function.name === "search_documentation") {
result = await searchDocs(args.query, args.language);
} else if (toolCall.function.name === "execute_code") {
result = await executeCode(args.code, args.language);
} else {
result = "Unknown tool: " + toolCall.function.name;
}
messages.push({
role: "tool",
tool_call_id: toolCall.id,
content: result,
});
}
}
}

Agent Architecture Patterns
============================
1. ReAct (Reasoning + Acting):
Thought: "I need to find the API rate limits"
Action: search_documentation("rate limits REST API")
Observe: "Rate limits: 100 req/min for free tier..."
Thought: "Now I have the info, I can answer"
Answer: "The API allows 100 requests per minute..."
2. Plan-and-Execute:
Plan: ["Search docs", "Write code", "Test code", "Review"]
Execute: Run each step, re-plan if needed
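The Plan-and-Execute pattern above can be sketched as a loop over a generated plan. In this sketch, `planner` and `executor` are placeholders for an LLM planning call and a tool-executing step respectively; re-planning is omitted for brevity.

```typescript
// Plan-and-Execute (sketch). `planner` turns a goal into ordered steps,
// `executor` runs one step with the results of previous steps as context.
async function planAndExecute(
  goal: string,
  planner: (goal: string) => Promise<string[]>,
  executor: (step: string, context: string[]) => Promise<string>
): Promise<string[]> {
  const plan = await planner(goal);
  const results: string[] = [];
  for (const step of plan) {
    // Each step sees the accumulated results so far
    results.push(await executor(step, results));
  }
  return results;
}
```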
3. Multi-Agent (Crew/Swarm):
Agent 1 (Researcher): Gathers information
Agent 2 (Developer): Writes code based on research
Agent 3 (Reviewer): Reviews and suggests improvements
Orchestrator: Coordinates agent communication
4. Tool-Augmented Generation:
LLM decides WHEN and WHICH tool to call
Tools: calculator, web search, code exec, DB query, API calls
LLM synthesizes tool results into final answer

10. Guardrails: Content Filtering, Hallucination Detection, and Output Validation
Production-grade LLM applications must have guardrails. Guardrails ensure model output is safe, accurate, correctly formatted, and compliant with business rules. An LLM application without guardrails is like an API without validation: it will break sooner or later.
// Guardrails Implementation Pattern
// 1. Input Validation — Filter malicious/inappropriate inputs
async function validateInput(userMessage: string): Promise<boolean> {
// Check message length
if (userMessage.length > 10000) {
throw new Error("Message too long");
}
// Prompt injection detection
const injectionPatterns = [
/ignore (all |previous |above )?instructions/i,
/you are now/i,
/system prompt/i,
/reveal your/i,
];
if (injectionPatterns.some(p => p.test(userMessage))) {
throw new Error("Potential prompt injection detected");
}
// Content moderation (OpenAI Moderation API)
const moderation = await openai.moderations.create({
input: userMessage,
});
if (moderation.results[0].flagged) {
throw new Error("Content flagged: " +
Object.entries(moderation.results[0].categories)
.filter(([, v]) => v)
.map(([k]) => k)
.join(", ")
);
}
return true;
}

// 2. Output Validation — Ensure correct format and content
import { z } from "zod";
// Define expected output schema
const ProductRecommendation = z.object({
products: z.array(z.object({
name: z.string(),
reason: z.string().max(200),
confidence: z.number().min(0).max(1),
price_range: z.enum(["budget", "mid-range", "premium"]),
})).min(1).max(5),
disclaimer: z.string(),
});
// Parse and validate LLM output
function validateOutput(llmOutput: string) {
try {
const parsed = JSON.parse(llmOutput);
const validated = ProductRecommendation.parse(parsed);
return { success: true, data: validated };
} catch (error) {
// Retry with corrective prompt or return fallback
return { success: false, error: String(error) };
}
}
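When validation fails, a common follow-up is a single corrective retry: feed the validation error back to the model and ask again. Here is a sketch parameterized over the model call (`complete`) and the schema check (`validate`, e.g. the Zod parse above), so it assumes no particular SDK.

```typescript
// Corrective retry (sketch): if the model output fails validation, send
// the error back and ask once more before giving up.
async function completeWithRetry<T>(
  complete: (prompt: string) => Promise<string>,
  validate: (raw: string) => T, // throws on invalid output
  prompt: string,
  maxRetries = 1
): Promise<T> {
  let lastError = "";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await complete(
      attempt === 0
        ? prompt
        : prompt + "\n\nYour previous output was invalid: " + lastError +
          "\nRespond again with ONLY valid output."
    );
    try {
      return validate(raw);
    } catch (e) {
      lastError = String(e); // keep the error for the corrective prompt
    }
  }
  throw new Error("Output still invalid after retries: " + lastError);
}
```

Keeping `maxRetries` low matters: each retry is a full, billable model call, and repeated failures usually indicate the prompt or schema needs fixing rather than more retries.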
// 3. Hallucination Detection for RAG
async function checkFaithfulness(
answer: string,
sources: string[]
): Promise<{ faithful: boolean; issues: string[] }> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content:
"You are a fact-checker. Compare the answer against the source " +
"documents. Identify any claims NOT supported by the sources. " +
"Respond in JSON: {faithful: boolean, issues: string[]}"
},
{
role: "user",
content: "Sources:\n" + sources.join("\n---\n") +
"\n\nAnswer:\n" + answer
}
],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}

Production Guardrails Checklist
================================
Layer Check Priority
----- ----- --------
Input Message length limit CRITICAL
Input Prompt injection detection CRITICAL
Input Content moderation (toxicity) CRITICAL
Input PII detection and redaction HIGH
Input Rate limiting per user HIGH
Output JSON schema validation HIGH
Output Hallucination / faithfulness check HIGH
Output Content safety filter CRITICAL
Output Max output length enforcement MEDIUM
Output Source citation verification MEDIUM
System Cost per request monitoring HIGH
System Latency tracking (P50, P95, P99) HIGH
System Error rate alerting CRITICAL
System Fallback responses for failures HIGH
System Audit logging for compliance MEDIUM

11. LLM API Cost Optimization
LLM API spend is one of the main operating costs of a production application. An unoptimized LLM app can burn thousands to tens of thousands of dollars per month. Semantic caching, model routing, prompt compression, and batching can cut costs by 60-80% while preserving output quality.
// 1. Semantic Caching — Avoid redundant API calls
import { createHash } from "crypto";
class SemanticCache {
private cache: Map<string, { response: string; embedding: number[] }> = new Map();
private similarityThreshold = 0.95;
async get(query: string): Promise<string | null> {
const queryEmbedding = await getEmbedding(query);
for (const [, entry] of this.cache) {
const similarity = cosineSimilarity(queryEmbedding, entry.embedding);
if (similarity >= this.similarityThreshold) {
return entry.response; // Cache hit
}
}
return null; // Cache miss
}
async set(query: string, response: string): Promise<void> {
const embedding = await getEmbedding(query);
const key = createHash("sha256").update(query).digest("hex");
this.cache.set(key, { response, embedding });
}
}

// 2. Model Routing — Use the cheapest model that works
async function routeToModel(query: string): Promise<string> {
// Classify query complexity
const classification = await openai.chat.completions.create({
model: "gpt-4o-mini", // cheap classifier
messages: [{
role: "system",
content:
"Classify the query complexity as SIMPLE, MEDIUM, or COMPLEX. " +
"SIMPLE: factual lookup, formatting, translation. " +
"MEDIUM: summarization, code generation, analysis. " +
"COMPLEX: multi-step reasoning, creative writing, architecture design. " +
"Respond with only the classification word."
}, {
role: "user", content: query
}],
max_tokens: 10,
});
const complexity = classification.choices[0].message.content?.trim();
const modelMap: Record<string, string> = {
SIMPLE: "gpt-4o-mini", // $0.15/1M input
MEDIUM: "gpt-4o-mini", // $0.15/1M input
COMPLEX: "gpt-4o", // $2.50/1M input
};
const model = modelMap[complexity || "MEDIUM"];
console.log("Routing " + query.slice(0, 50) + "... to " + model);
return model;
}

LLM Cost Optimization Strategies
==================================
Strategy Savings Effort Impact on Quality
-------- ------- ------ -----------------
Semantic caching 30-60% Medium None (exact/similar)
Model routing 40-70% Medium Minimal (smart routing)
Prompt compression 10-30% Low Minimal
Batch API (OpenAI) 50% Low None
Reduce max_tokens 5-20% Low None (if set correctly)
Shorter system prompts 5-15% Low Minimal
Streaming (early stop) 10-30% Medium Variable
Open-source models 80-100% High Depends on model
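Prompt compression from the table can start very simply: collapse redundant whitespace and cap the retrieved context at a character budget. This sketch only shows the idea; production systems use smarter, model-aware compression (e.g. LLMLingua-style token pruning).

```typescript
// Naive prompt compression (sketch): collapse whitespace runs, then
// truncate the context to a character budget.
function compressPrompt(context: string, maxChars = 4000): string {
  const collapsed = context
    .replace(/[ \t]+/g, " ")    // collapse runs of spaces/tabs
    .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
    .trim();
  if (collapsed.length <= maxChars) return collapsed;
  // Keep the beginning, where retrieved chunks are usually ranked best
  return collapsed.slice(0, maxChars) + "\n[...context truncated...]";
}
```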
Example Monthly Cost Breakdown (100K queries/month):
Unoptimized (GPT-4o for everything): $5,000
+ Model routing (80% to mini): $1,200 (-76%)
+ Semantic caching (40% hit rate): $720 (-86%)
+ Prompt compression: $580 (-88%)
+ Batch API for async tasks: $450 (-91%)

12. Model Comparison: GPT-4 vs Claude 3 vs Gemini vs Llama 3
Choosing the right model is one of the most consequential decisions in AI engineering. Models differ in reasoning ability, context window, cost, latency, and task-specific performance. A detailed comparison of the mainstream models follows.
Model Comparison — Detailed Breakdown (2026)
==============================================
Category GPT-4o Claude 3.5 Gemini 1.5 Llama 3.1
Sonnet Pro 405B
-------- ------ ----------- ----------- ----------
Provider OpenAI Anthropic Google Meta (OSS)
Context Window 128K 200K 1M 128K
Multimodal Text+Image+ Text+Image+ Text+Image+ Text+Image
Audio PDF Video+Audio
Coding Excellent Best Very Good Very Good
Reasoning Excellent Excellent Good Good
Long Context Good Excellent Best Good
Safety Good Best Good Good
Speed (TTFT) Fast Fast Fast Varies
API Ecosystem Best Good Good N/A
Fine-tuning Yes No (yet) Yes Full control
Self-hosting No No No Yes
Data Privacy API only API only API only Full control
Best For:
GPT-4o: General purpose, largest ecosystem, best tooling
Claude 3.5 Sonnet: Coding, long documents, safety-critical apps
Gemini 1.5 Pro: Ultra-long context (books, codebases), multimodal
Llama 3.1 405B: Self-hosting, data privacy, customization
Budget Models:
GPT-4o-mini: Best cheap commercial model
Claude 3.5 Haiku: Fast and affordable, good quality
Gemini 1.5 Flash: Ultra cheap, very fast
Llama 3.1 70B: Best open-source mid-tier, self-hostable

// Multi-Model Strategy: Use the Right Model for Each Task
const MODEL_CONFIG = {
// High-stakes, complex reasoning
complex: {
model: "gpt-4o",
temperature: 0.3,
useCases: ["architecture design", "code review", "legal analysis"],
},
// Long document processing
longContext: {
model: "gemini-1.5-pro",
temperature: 0.2,
useCases: ["codebase analysis", "book summarization", "log analysis"],
},
// Code generation and debugging
coding: {
model: "claude-3-5-sonnet-20241022",
temperature: 0,
useCases: ["code generation", "debugging", "refactoring"],
},
// Simple tasks, high volume
simple: {
model: "gpt-4o-mini",
temperature: 0.5,
useCases: ["classification", "extraction", "formatting"],
},
// Privacy-sensitive, self-hosted
private: {
model: "meta-llama/Llama-3.1-70B-Instruct",
temperature: 0.3,
useCases: ["medical records", "financial data", "internal tools"],
},
};
13. Evaluating and Testing LLM Applications
Evaluation and testing are the most overlooked yet most important parts of AI engineering. Because LLM outputs are non-deterministic, traditional testing methods do not fully apply; you need a dedicated evaluation framework to measure output quality, accuracy, and safety, and to drive continuous improvement across iterations.
// LLM Evaluation Framework
// 1. LLM-as-Judge — Use a strong model to evaluate outputs
async function llmJudge(
question: string,
answer: string,
criteria: string
): Promise<{ score: number; reasoning: string }> {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "system",
content:
"You are an expert evaluator. Score the answer on a scale of " +
"1-5 based on the given criteria. " +
"Respond in JSON: {score: number, reasoning: string}"
}, {
role: "user",
content: "Question: " + question + "\n" +
"Answer: " + answer + "\n" +
"Criteria: " + criteria
}],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}
// Usage:
// await llmJudge(
// "What is Kubernetes?",
// generatedAnswer,
// "Accuracy, completeness, clarity, and conciseness"
// );
// 2. RAG Evaluation Metrics
// Using Ragas framework concepts
interface RAGEvalResult {
faithfulness: number; // Is the answer grounded in sources?
answerRelevancy: number; // Does it actually answer the question?
contextPrecision: number; // Are retrieved docs relevant?
contextRecall: number; // Did we retrieve all relevant docs?
}
async function evaluateRAG(
question: string,
answer: string,
contexts: string[],
groundTruth: string
): Promise<RAGEvalResult> {
// Faithfulness: does the answer use only info from contexts?
const faithfulness = await llmJudge(
question,
answer,
"Score 1-5: Is every claim in the answer supported by the " +
"provided context? Penalize any unsupported claims heavily."
);
// Answer relevancy: does it address the question?
const relevancy = await llmJudge(
question,
answer,
"Score 1-5: Does the answer directly address the question? " +
"Penalize off-topic content and missing key points."
);
return {
faithfulness: faithfulness.score / 5,
answerRelevancy: relevancy.score / 5,
contextPrecision: 0, // computed via retrieval metrics
contextRecall: 0, // computed against ground truth
};
}
LLM Testing Strategy
=====================
Test Type What It Tests Tools
--------- ------------- -----
Unit Tests Prompt template outputs pytest, vitest
Integration Tests Full RAG pipeline Ragas, DeepEval
Regression Tests Output consistency Golden datasets
A/B Tests Model/prompt comparison LangSmith, Braintrust
Red Team Tests Safety, edge cases Garak, manual
Load Tests Latency under load k6, Locust
Cost Tests Budget compliance Custom monitoring
Evaluation Tools:
Ragas - Open-source RAG evaluation framework
DeepEval - LLM evaluation with multiple metrics
LangSmith - Tracing, debugging, evaluation (LangChain)
Phoenix - LLM observability (Arize AI)
Braintrust - LLM evaluation and monitoring platform
Promptfoo - Open-source prompt testing CLI
Golden Rule: Never deploy without:
1. A golden evaluation dataset (50-200 examples)
2. Automated regression tests in CI
3. Production monitoring (latency, cost, errors)
4. Human review for high-stakes outputs
Frequently Asked Questions
What is the difference between an AI engineer and an ML engineer?
ML engineers focus on training and optimizing machine learning models from scratch (data collection, feature engineering, model training, hyperparameter tuning). AI engineers focus on building applications on top of pretrained large language models (LLMs), integrating them into products via API calls, prompt engineering, RAG, fine-tuning, and orchestration frameworks. AI engineers work closer to the application layer; ML engineers work closer to the model layer.
How do I choose between RAG and fine-tuning?
Prefer RAG when you need up-to-date information, source citations, frequently changing data, or explainability. Choose fine-tuning when you need to change the model's tone, style, or output format, teach it domain-specific reasoning patterns, or when RAG retrieval quality is not good enough. The two can be combined: fine-tune first so the model understands domain language better, then use RAG to supply concrete facts. On cost, RAG has lower upfront cost but adds retrieval overhead to every query, while fine-tuning requires a higher upfront training cost but simpler inference.
How do I choose a vector database?
It depends on scale and infrastructure. Pinecone: fully managed, zero ops, good for rapid prototyping and medium scale. Weaviate: open source, hybrid search (vector + keyword), rich module ecosystem. Qdrant: open source, excellent performance (implemented in Rust), powerful filtering. pgvector: a PostgreSQL extension, ideal for teams already running Postgres, with good performance below roughly a million vectors. Chroma: lightweight, good for local development and prototyping. At very large scale (billions of vectors), consider Milvus.
How do I reduce LLM API costs?
Key strategies: 1) Semantic caching — cache responses for similar queries to avoid repeated calls; 2) Model routing — use small models (GPT-4o-mini / Claude Haiku) for simple tasks and reserve large models for complex ones; 3) Prompt optimization — cut unnecessary tokens (trim system prompts, compress context); 4) Batch API — for non-real-time workloads, batch endpoints cut costs by 50%; 5) Open-source models — self-host Llama 3 or Mistral when latency or privacy requirements are strict; 6) Output length limits — set max_tokens to avoid verbose replies.
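The semantic-caching idea from strategy 1 can be sketched in a few lines of TypeScript. The `SemanticCache` class, the 0.92 default threshold, and the in-memory linear scan are illustrative assumptions; a production setup would use a real embedding model (e.g. text-embedding-3-small) and a vector store instead.

```typescript
// Minimal semantic cache sketch. Query vectors come from an embedding
// model; here they are passed in directly so the logic stays self-contained.
type CacheEntry = { vector: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.92) {}

  // Return a cached response if a semantically similar query was seen before.
  lookup(queryVector: number[]): string | null {
    let best: CacheEntry | null = null;
    let bestScore = -1;
    for (const e of this.entries) {
      const s = cosineSimilarity(queryVector, e.vector);
      if (s > bestScore) { bestScore = s; best = e; }
    }
    return best !== null && bestScore >= this.threshold ? best.response : null;
  }

  store(queryVector: number[], response: string): void {
    this.entries.push({ vector: queryVector, response });
  }
}
```

On a cache hit the LLM call is skipped entirely; raising the threshold trades hit rate for answer fidelity.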
What is the difference between LangChain and LlamaIndex?
LangChain is a general-purpose LLM application orchestration framework, strong at building complex chains and agents: multi-step workflows, tool calling, conversation management. LlamaIndex (formerly GPT Index) specializes in data indexing and retrieval, excelling at connecting data sources and building high-quality RAG pipelines. In practice they are often combined: LlamaIndex handles data ingestion and retrieval while LangChain handles chain orchestration and agent logic. If your core need is RAG, start with LlamaIndex; if you need complex multi-step workflows, start with LangChain.
What is an AI agent, and how does it differ from a plain LLM call?
An AI agent is an LLM application that can autonomously plan, use tools, and execute multi-step tasks. A plain LLM call is a single-turn input-output exchange, whereas an agent can: 1) break a complex task into sub-steps; 2) call external tools (search engines, databases, APIs, code executors); 3) observe tool results and decide the next action; 4) iterate until the task is done. The core pattern is the ReAct (Reasoning + Acting) loop: think → act → observe → think. Function Calling is the main API mechanism for implementing tool use.
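The ReAct loop just described can be sketched as follows. `callLLM`, the `ACTION tool:input` / `FINAL answer` text protocol, and the toy tools are hypothetical stand-ins; a real agent would use a chat-completion API with function calling rather than string parsing.

```typescript
// Minimal ReAct loop sketch: Think -> Act -> Observe, repeated until
// the model emits a final answer or the step budget runs out.
type Tool = (input: string) => string;
type LLM = (transcript: string) => string; // returns "ACTION tool:input" or "FINAL answer"

function runReActLoop(
  question: string,
  tools: Record<string, Tool>,
  callLLM: LLM,
  maxSteps = 5
): string {
  let transcript = `Question: ${question}`;
  for (let step = 0; step < maxSteps; step++) {
    const decision = callLLM(transcript); // Think
    if (decision.startsWith("FINAL ")) {
      return decision.slice("FINAL ".length); // task complete
    }
    // Act: parse "ACTION toolName:toolInput"
    const match = decision.match(/^ACTION (\w+):(.*)$/);
    if (!match || !tools[match[1]]) {
      transcript += "\nObservation: unknown action";
      continue;
    }
    const observation = tools[match[1]](match[2]); // Observe
    transcript += `\n${decision}\nObservation: ${observation}`;
  }
  return "Stopped: step limit reached";
}
```

The step limit is the essential safety valve: without it, a confused agent can loop (and bill) indefinitely.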
How do I evaluate and test LLM applications?
Use a multi-layer approach: 1) Offline evaluation — compute accuracy, F1, or BLEU/ROUGE on labeled datasets; 2) LLM-as-Judge — use a strong LLM (e.g. GPT-4) to grade another model's outputs; 3) RAG-specific metrics — retrieval relevance (Precision@K), answer faithfulness (is the answer grounded in retrieved content?), answer relevancy; 4) Human evaluation — manual review and A/B tests for critical scenarios; 5) Online monitoring — track latency, cost, user feedback, and error rates. Recommended tools: Ragas (RAG evaluation), DeepEval, LangSmith (tracing and debugging), Phoenix (observability).
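As a sketch of the offline-evaluation layer, here is a toy golden-dataset regression check. The token-overlap metric and the `regressionReport` helper are illustrative assumptions; real pipelines would score with embedding similarity or the LLM-as-Judge pattern instead.

```typescript
// Toy regression check: run each golden input through the system and
// score the output against the stored reference answer.
interface GoldenExample { input: string; expected: string }

// Jaccard-style token overlap: crude, but deterministic and free.
function tokenOverlap(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  let common = 0;
  for (const t of ta) if (tb.has(t)) common++;
  return common / Math.max(ta.size, tb.size);
}

function regressionReport(
  golden: GoldenExample[],
  generate: (input: string) => string,
  threshold = 0.5
): { passed: number; failed: string[] } {
  const failed: string[] = [];
  let passed = 0;
  for (const ex of golden) {
    const output = generate(ex.input);
    if (tokenOverlap(output, ex.expected) >= threshold) passed++;
    else failed.push(ex.input);
  }
  return { passed, failed };
}
```

Wired into CI, a non-empty `failed` list blocks the deploy, which is exactly the "automated regression tests in CI" rule from the checklist above.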
How do I handle LLM hallucinations?
Strategies to reduce hallucination: 1) RAG — ground answers in retrieved documents rather than the model's memory; 2) System prompt constraints — explicitly instruct "answer only from the provided context; if unsure, say you don't know"; 3) Temperature — lower the temperature (0.0-0.3) to reduce randomness; 4) Structured output — constrain the output format with JSON Schema or function calling; 5) Self-check — have the model verify that its own answer is supported by evidence after generating it; 6) Fact-checking pipelines — run automated fact verification on outputs; 7) Source citations — require the model to cite specific document passages.
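Strategies 4 and 7 can be combined into a simple citation-grounding check: ask the model for JSON containing verbatim quotes, then verify each quote actually appears in the retrieved context. The `GroundedAnswer` shape is an assumed output contract for illustration, not a provider API.

```typescript
// Sketch of a citation-grounding guardrail for RAG answers.
interface GroundedAnswer {
  answer: string;
  quotes: string[]; // verbatim snippets the model claims to cite
}

function verifyGrounding(parsed: GroundedAnswer, context: string): boolean {
  // Every cited quote must appear verbatim in the retrieved context;
  // normalization keeps the check robust to whitespace and case.
  const normalize = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  const haystack = normalize(context);
  return parsed.quotes.length > 0 &&
    parsed.quotes.every((q) => haystack.includes(normalize(q)));
}
```

If verification fails, retry with a stricter prompt or fall back to "I don't know" rather than returning an unsupported answer.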
AI Engineering Core Concepts Cheat Sheet
AI Engineering — Quick Reference
=================================
Concept Description
------- -----------
Prompt Engineering Designing inputs to get desired LLM outputs
System Prompt Instructions that set model role and constraints
Few-Shot Learning Providing examples in the prompt for guidance
Chain-of-Thought (CoT) Asking the model to reason step-by-step
RAG Retrieve relevant docs, then generate answers
Vector Database Storage optimized for similarity search
Embedding Dense vector representation of text meaning
Chunking Splitting documents into smaller pieces
Fine-Tuning Training a model further on custom data
Function Calling LLM deciding when and how to call tools
AI Agent LLM that can plan, use tools, and iterate
ReAct Pattern Think -> Act -> Observe -> Think loop
Guardrails Input/output validation and safety filters
Hallucination Model generating false or unsupported info
Semantic Caching Cache responses for semantically similar queries
Model Routing Directing queries to the best-fit model
LLM-as-Judge Using a strong LLM to evaluate outputs
LCEL (LangChain) LangChain Expression Language for chains
Temperature Controls randomness (0=deterministic, 1=creative)
Token Basic unit of text processed by LLMs
Context Window Maximum tokens a model can process at once
Structured Output Constraining LLM output to JSON/schema
Batch API Processing multiple requests at reduced cost
Multimodal Models that process text, image, audio, video