TL;DR
AI engineering is the discipline of building production-grade applications on top of large language models. Master prompt engineering first (system prompts, few-shot learning, chain-of-thought), add RAG with a vector database when the model needs domain knowledge, and fine-tune only when prompting plus RAG falls short. Use LangChain or LlamaIndex to orchestrate chains and agents. Choose between OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro, or open-source Llama 3 based on cost, latency, context window, and task requirements. Always put guardrails, an evaluation pipeline, and cost monitoring in place before shipping.
Key Takeaways
- AI engineering focuses on building applications on top of pretrained LLMs rather than training models from scratch; core skills include prompt engineering, RAG, API integration, and orchestration frameworks.
- Decision order: optimize prompts first, add RAG if that is not enough, and only then fine-tune; roughly 90% of applications are covered by prompt engineering plus RAG.
- Vector databases are the core component of RAG; choose by scale: pgvector (under a few million vectors), Pinecone (fully managed), Weaviate/Qdrant (open source at large scale).
- AI agents execute autonomous multi-step tasks through the ReAct loop (reasoning + acting) and function calling.
- Guardrails are non-negotiable: content filtering, hallucination detection, output validation, rate limiting, and cost monitoring should all be in place before launch.
- Cost optimization levers: semantic caching, model routing (small models for simple tasks), prompt compression, and batch APIs can cut spend by 60-80%.
1. What Is AI Engineering? (vs. ML Engineering and Data Science)
AI engineering is an emerging discipline focused on building production-grade applications with pretrained large language models (LLMs). It sits at the intersection of software engineering and machine learning, but differs clearly from traditional ML engineering and data science. AI engineers do not train models from scratch; instead they integrate LLM capabilities into products through API calls, prompt engineering, retrieval-augmented generation (RAG), and orchestration frameworks.
The explosive growth of LLMs in 2023-2024 created this role. As GPT-4, Claude 3, Gemini, and similar models became more capable, many use cases no longer required training your own model; they required engineers who know how to use these models effectively. The core value of an AI engineer lies in turning LLM capability into reliable, scalable product features.
AI Engineering vs ML Engineering vs Data Science
=================================================
Role Focus Key Skills Output
---- ----- ---------- ------
Data Scientist Analysis & insights Statistics, SQL, Reports, dashboards,
Python, visualization predictive models
ML Engineer Model training PyTorch, TensorFlow, Trained models,
& deployment MLOps, feature eng. ML pipelines
AI Engineer LLM application Prompt eng., RAG, AI-powered products,
development LangChain, APIs chatbots, agents
Key Differences:
- Data Scientist: "What does the data tell us?"
- ML Engineer: "How do we train and serve a model?"
- AI Engineer: "How do we build a product with an LLM?"

2. LLM APIs: OpenAI, Anthropic Claude, and Google Gemini
LLM APIs are the foundation of AI engineering. The three major providers each have strengths: OpenAI's GPT-4o family has the most mature ecosystem; Anthropic's Claude 3.5 Sonnet leads in long-context understanding and safety; Google's Gemini 1.5 Pro offers an extremely long context window (1M tokens) and deep integration with Google Cloud.
// OpenAI API — Chat Completion
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain async/await in JavaScript" }
],
temperature: 0.7,
max_tokens: 1000,
});
console.log(response.choices[0].message.content);
console.log("Tokens used:", response.usage.total_tokens);

// Anthropic Claude API
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const message = await anthropic.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
system: "You are a senior software architect.",
messages: [
{ role: "user", content: "Design a rate limiter for a REST API" }
],
});
console.log(message.content[0].text);
console.log("Input tokens:", message.usage.input_tokens);
console.log("Output tokens:", message.usage.output_tokens);

// Google Gemini API
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });
const result = await model.generateContent({
contents: [{
role: "user",
parts: [{ text: "Compare microservices vs monolith architecture" }]
}],
generationConfig: {
temperature: 0.7,
maxOutputTokens: 1000,
},
});
console.log(result.response.text());

LLM API Comparison (2026)
==========================
Provider Model Context Input Cost Output Cost Strengths
-------- ----- ------- ---------- ----------- ---------
OpenAI GPT-4o 128K $2.50/1M $10.00/1M Best ecosystem, multimodal
OpenAI GPT-4o-mini 128K $0.15/1M $0.60/1M Cheapest smart model
Anthropic Claude 3.5 Sonnet 200K $3.00/1M $15.00/1M Long context, safety, coding
Anthropic Claude 3.5 Haiku 200K $0.25/1M $1.25/1M Fast and affordable
Google Gemini 1.5 Pro 1M $1.25/1M $5.00/1M Longest context window
Google Gemini 1.5 Flash 1M $0.075/1M $0.30/1M Ultra fast and cheap
Meta Llama 3.1 405B 128K Self-host Self-host Open source, no API cost
Meta Llama 3.1 70B 128K Self-host Self-host Best open-source mid-tier
Mistral Mixtral 8x22B 64K $2.00/1M $6.00/1M Strong EU-based option

3. Prompt Engineering Techniques: Few-Shot Learning, Chain-of-Thought, and System Prompts
Prompt engineering is the most fundamental skill of an AI engineer. A carefully designed prompt can dramatically change output quality without touching the model itself. Mastering system prompts (setting roles and constraints), few-shot learning (guiding with examples), and chain-of-thought (eliciting step-by-step reasoning) is essential for every AI engineer.
// 1. System Prompt — Setting Role and Constraints
const messages = [
{
role: "system",
content: `You are a senior TypeScript developer.
Rules:
- Always use strict TypeScript types (no "any")
- Include error handling for all async operations
- Add JSDoc comments for public functions
- If you are unsure, say "I am not certain" rather than guessing
- Format output as markdown code blocks`
},
{ role: "user", content: "Write a function to fetch user data from an API" }
];

// 2. Few-Shot Learning — Providing Examples
const fewShotMessages = [
{
role: "system",
content: "You classify customer support tickets into categories."
},
// Example 1
{ role: "user", content: "My order hasn't arrived yet, it's been 2 weeks" },
{ role: "assistant", content: "Category: SHIPPING\nPriority: HIGH\nSentiment: FRUSTRATED" },
// Example 2
{ role: "user", content: "How do I change my password?" },
{ role: "assistant", content: "Category: ACCOUNT\nPriority: LOW\nSentiment: NEUTRAL" },
// Example 3
{ role: "user", content: "Your product is amazing, saved me hours!" },
{ role: "assistant", content: "Category: FEEDBACK\nPriority: LOW\nSentiment: POSITIVE" },
// Actual query
{ role: "user", content: "I was charged twice for my subscription" }
];

// 3. Chain-of-Thought (CoT) — Step-by-Step Reasoning
const cotPrompt = {
role: "user",
content: `Analyze whether this API design follows REST best practices.
API Endpoint: POST /api/users/123/delete
Think step by step:
1. Check the HTTP method appropriateness
2. Evaluate the URL structure
3. Assess resource naming conventions
4. Check for idempotency considerations
5. Provide your final assessment with improvements`
};
// CoT Variations:
// - "Let's think step by step" (zero-shot CoT)
// - "Think through this carefully before answering" (implicit CoT)
// - Provide worked examples with reasoning (few-shot CoT)
// - "First analyze X, then consider Y, finally conclude Z" (structured CoT)

Prompt Engineering Techniques — Quick Reference
================================================
Technique When to Use Example
--------- ----------- -------
Zero-shot Simple, well-defined tasks "Translate to French: Hello"
Few-shot Classification, formatting Provide 3-5 input/output pairs
Chain-of-Thought Math, logic, complex reasoning "Think step by step..."
System Prompt Role, tone, constraints "You are a legal expert..."
Output Format Structured data needed "Respond in JSON format..."
Self-consistency High-stakes decisions Generate 5 answers, take majority
Tree of Thought Complex problem solving Explore multiple solution paths
ReAct Tool use, agents Think -> Act -> Observe loop
Retrieval-Augmented Domain-specific knowledge Inject relevant docs into context
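Self-consistency from the table above can be sketched without any framework: sample the same prompt several times at a non-zero temperature and take the majority answer. In this sketch, `generate` is a placeholder for whatever LLM call you use; only the voting logic is shown.

```typescript
// Self-consistency (sketch): ask the same question several times, then
// return the majority answer. `generate` stands in for an LLM call that
// yields a short, comparable answer.
async function selfConsistent(
  generate: () => Promise<string>,
  samples = 5
): Promise<string> {
  const votes = new Map<string, number>();
  for (let i = 0; i < samples; i++) {
    // Normalize so trivially different strings don't split the vote
    const answer = (await generate()).trim().toLowerCase();
    votes.set(answer, (votes.get(answer) ?? 0) + 1);
  }
  let best = "";
  let bestCount = 0;
  for (const [answer, count] of votes) {
    if (count > bestCount) {
      best = answer;
      bestCount = count;
    }
  }
  return best;
}
```

For classification-style outputs, normalizing answers before voting (trim, lowercase) keeps equivalent answers from splitting the vote; for free-form text, majority voting needs a similarity measure instead of exact matching.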
Anti-patterns to Avoid:
- Vague instructions ("make it better")
- Overloading context (irrelevant information)
- No output format specification
- Not handling edge cases in prompt
- Using negation ("don't do X") instead of affirmation ("do Y")

4. RAG Architecture: Retrieval-Augmented Generation
RAG is one of the most important architectural patterns in AI engineering today. By retrieving relevant documents from an external knowledge base before generating an answer, it addresses two core LLM problems: the knowledge cutoff and hallucination. RAG lets the model answer from up-to-date, domain-specific data while providing traceable source citations.
RAG Architecture — Data Flow
============================
INDEXING PHASE (offline, one-time):
┌──────────┐ ┌──────────┐ ┌───────────┐ ┌───────────────┐
│ Documents │ -> │ Chunking │ -> │ Embedding │ -> │ Vector Store │
│ (PDF,Web, │ │ (split │ │ Model │ │ (Pinecone, │
│ Notion) │ │ text) │ │ (ada-002) │ │ pgvector) │
└──────────┘ └──────────┘ └───────────┘ └───────────────┘
QUERY PHASE (online, per-request):
┌──────────┐ ┌───────────┐ ┌───────────────┐
│ User │ -> │ Embed │ -> │ Vector Search │
│ Question │ │ Query │ │ (top-K docs) │
└──────────┘ └───────────┘ └───────┬───────┘
│
v
┌──────────┐ ┌───────────────────────────────┐
│ Answer │ <- │ LLM (query + retrieved docs) │
│ + Sources│ │ "Based on context, answer..." │
└──────────┘ └───────────────────────────────┘

// Complete RAG Pipeline in TypeScript
import { OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Step 1: Document Chunking
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // characters per chunk
chunkOverlap: 200, // overlap between chunks
separators: ["\n\n", "\n", ". ", " "], // split priority
});
const chunks = await splitter.splitDocuments(documents);
// Step 2: Generate Embeddings & Store
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small", // 1536 dimensions
});
const pinecone = new Pinecone();
const index = pinecone.Index("my-rag-index");
const vectorStore = await PineconeStore.fromDocuments(
chunks,
embeddings,
{ pineconeIndex: index }
);
// Step 3: Query — Retrieve + Generate
async function ragQuery(question: string) {
// Retrieve top 5 most relevant chunks
const relevantDocs = await vectorStore.similaritySearch(question, 5);
// Build context from retrieved documents
const context = relevantDocs
.map((doc, i) => "Source " + (i + 1) + ": " + doc.pageContent)
.join("\n\n");
// Generate answer with context
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: "Answer the question based ONLY on the provided context. " +
"If the context does not contain the answer, say so. " +
"Cite your sources using [Source N] format."
},
{
role: "user",
content: "Context:\n" + context + "\n\nQuestion: " + question
}
],
temperature: 0.2, // low temp for factual answers
});
return {
answer: response.choices[0].message.content,
sources: relevantDocs.map(d => d.metadata),
};
}

Chunking Strategies
====================
Strategy Chunk Size Overlap Best For
-------- ---------- ------- --------
Fixed-size 500-1000 100-200 General purpose, simple
Recursive character 500-1500 100-300 Most text documents
Sentence-based 1-5 sent. 1 sent. FAQ, precise retrieval
Semantic Varies N/A Topic-coherent chunks
Parent-child Small+Large N/A Retrieve small, pass large
Markdown header By section N/A Technical documentation
Code-aware By function N/A Source code files
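The fixed-size strategy in the table can be implemented in a few lines. A framework-free sketch follows; real splitters such as RecursiveCharacterTextSplitter additionally respect separator boundaries, while this version only slides a character window.

```typescript
// Fixed-size chunking with overlap (sketch). Each chunk is `chunkSize`
// characters; consecutive chunks share `overlap` characters so that
// information at boundaries is not lost.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
    start += chunkSize - overlap; // step forward, keeping `overlap` chars
  }
  return chunks;
}
```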
Tips:
- Smaller chunks = more precise retrieval but less context
- Larger chunks = more context but noisier retrieval
- Overlap prevents information loss at chunk boundaries
- Test with your actual data to find the optimal size

5. Vector Databases: Pinecone, Weaviate, Qdrant, and pgvector
Vector databases are the storage layer at the heart of a RAG architecture. They are purpose-built for similarity search over high-dimensional vectors and can find the closest matches among millions of vectors in milliseconds. The right choice depends on your scale, infrastructure preferences, and feature requirements.
// Pinecone — Fully Managed Vector Database
import { Pinecone } from "@pinecone-database/pinecone";
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index("my-index");
// Upsert vectors
await index.upsert([
{
id: "doc-1",
values: embedding, // float array [0.1, -0.2, ...]
metadata: { source: "docs/api.md", section: "auth" },
},
]);
// Query with metadata filter
const results = await index.query({
vector: queryEmbedding,
topK: 5,
filter: { source: { "$eq": "docs/api.md" } },
includeMetadata: true,
});

// pgvector — PostgreSQL Extension (great for existing PG users)
// SQL setup:
// CREATE EXTENSION vector;
// CREATE TABLE documents (
// id SERIAL PRIMARY KEY,
// content TEXT,
// metadata JSONB,
// embedding vector(1536) -- dimension matches your model
// );
// CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
import pg from "pg";
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
// Insert document with embedding
await pool.query(
"INSERT INTO documents (content, metadata, embedding) VALUES ($1, $2, $3)",
[content, JSON.stringify(metadata), JSON.stringify(embedding)]
);
// Similarity search (cosine distance)
const result = await pool.query(
`SELECT content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5`,
[JSON.stringify(queryEmbedding)]
);

Vector Database Comparison
===========================
Database Type Best For Pricing Max Scale
-------- ---- -------- ------- ---------
Pinecone Managed Zero-ops, quick start Pay-per-use Billions
Weaviate OSS/Cloud Hybrid search, modules Free/Managed Billions
Qdrant OSS/Cloud Filtering, performance Free/Managed Billions
pgvector PG Ext. Existing PG infra Free (self) Millions
Chroma OSS Local dev, prototyping Free Millions
Milvus OSS Ultra-large scale Free/Managed Trillions
FAISS Library In-memory, research Free Billions
Distance Metrics:
- Cosine similarity: normalized, most common for text embeddings
- Euclidean (L2): absolute distance, good for image features
- Dot product: fastest, works when vectors are normalized

6. LangChain and LlamaIndex Frameworks
LangChain and LlamaIndex are the two most popular frameworks for LLM application development. LangChain is a general-purpose orchestration framework, well suited to complex multi-step workflows and agents; LlamaIndex focuses on data connection and retrieval and is the first choice for building RAG applications. The two can be used together.
// LangChain — Chain with Prompt Template
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
// Simple chain: prompt -> model -> parser
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a technical writer. Write clear, concise explanations."],
["user", "Explain {topic} for {audience}"],
]);
const chain = RunnableSequence.from([
prompt,
model,
new StringOutputParser(),
]);
const result = await chain.invoke({
topic: "Kubernetes pods",
audience: "junior developers",
});
console.log(result);

// LangChain — RAG Chain with Retriever
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
const retriever = vectorStore.asRetriever({ k: 5 });
const ragPrompt = ChatPromptTemplate.fromMessages([
["system",
"Answer based on the following context only.\n" +
"If the context does not help, say you do not know.\n" +
"Context: {context}"],
["user", "{input}"],
]);
const documentChain = await createStuffDocumentsChain({
llm: model,
prompt: ragPrompt,
});
const ragChain = await createRetrievalChain({
retriever,
combineDocsChain: documentChain,
});
const answer = await ragChain.invoke({
input: "How do I configure SSL in Nginx?",
});
console.log(answer.answer);
console.log("Sources:", answer.context.map(d => d.metadata.source));

# LlamaIndex — RAG Pipeline (Python)
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Load documents and build index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query with RAG
query_engine = index.as_query_engine(
similarity_top_k=5,
response_mode="compact", # "tree_summarize", "refine", "compact"
)
response = query_engine.query("How do I set up a CI/CD pipeline?")
print(response.response)
print("Sources:", [n.metadata for n in response.source_nodes])
# Chat engine with memory
chat_engine = index.as_chat_engine(
chat_mode="context", # "condense_question", "react"
system_prompt="You are a DevOps expert.",
)
response = chat_engine.chat("What is Kubernetes?")
response = chat_engine.chat("How does it compare to Docker Swarm?")

LangChain vs LlamaIndex — When to Use Which
=============================================
Feature LangChain LlamaIndex
------- --------- ----------
Primary Focus Chain orchestration Data indexing & retrieval
Best For Complex workflows, RAG applications,
multi-step agents knowledge bases
Data Connectors Via integrations 150+ built-in loaders
Agent Support Excellent (LCEL, tools) Good (ReAct, tools)
RAG Quality Good Excellent (advanced chunking)
Learning Curve Moderate Lower for RAG tasks
Language Support Python, TypeScript/JS Python (primary), TS
Community Very large Large
Use LangChain when:
- Building multi-step agent workflows
- Need complex chain composition (LCEL)
- Building chatbots with tool calling
Use LlamaIndex when:
- Primary need is RAG over your data
- Need advanced indexing strategies
- Have diverse data sources to connect

7. Fine-Tuning vs RAG vs Prompt Engineering: A Decision Tree
These three techniques are the main levers an AI engineer has for customizing LLM behavior. Choosing the right one can significantly reduce cost and development time. Follow one simple principle: start with the simplest approach and escalate to more complex methods only when necessary.
Decision Tree: How to Customize LLM Behavior
=============================================
Start here: "What does the model need to learn?"
|
|-- Nothing new, just better outputs?
| --> PROMPT ENGINEERING
| - System prompts, few-shot examples, output formats
| - Cost: $0, Time: hours, Skill: low
|
|-- Needs specific/recent facts or your data?
| --> RAG (Retrieval-Augmented Generation)
| - Vector DB + retrieval pipeline + prompt
| - Cost: $100-$1K setup, Time: days, Skill: medium
|
|-- Needs to change behavior/style/format?
| |-- Is the change simple (e.g., always respond in JSON)?
| | --> PROMPT ENGINEERING (with structured output)
| |
| |-- Is the change complex (domain-specific reasoning)?
| --> FINE-TUNING
| - Prepare training data, train, evaluate
| - Cost: $500-$10K+, Time: weeks, Skill: high
|
|-- Needs both facts AND behavior change?
--> FINE-TUNING + RAG (combined)
- Fine-tune for domain language understanding
- RAG for real-time factual grounding
- Cost: highest, Time: weeks-months

// OpenAI Fine-Tuning API Example
// Step 1: Prepare training data (JSONL format)
// training_data.jsonl:
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Show all users created today"},
// {"role":"assistant","content":"SELECT * FROM users WHERE created_at >= CURRENT_DATE;"}]}
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Count active premium users"},
// {"role":"assistant","content":"SELECT COUNT(*) FROM users WHERE status = 'active' AND plan = 'premium';"}]}
// Step 2: Upload training file
const file = await openai.files.create({
file: fs.createReadStream("training_data.jsonl"),
purpose: "fine-tune",
});
// Step 3: Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
training_file: file.id,
model: "gpt-4o-mini-2024-07-18",
hyperparameters: {
n_epochs: 3,
batch_size: "auto",
learning_rate_multiplier: "auto",
},
});
// Step 4: Use fine-tuned model
// const response = await openai.chat.completions.create({
// model: "ft:gpt-4o-mini-2024-07-18:my-org::abc123",
// messages: [...]
// });

Prompt Engineering vs RAG vs Fine-Tuning
=========================================
Dimension Prompt Eng. RAG Fine-Tuning
--------- ----------- --- -----------
Setup Cost Free Low-Medium High
Time to Implement Hours Days Weeks
Per-Query Cost Base model cost +Retrieval cost -20-50% savings
Knowledge Update Change prompt Update vector DB Retrain model
Factual Accuracy Model knowledge High (grounded) Model knowledge
Style/Format Good (examples) Limited Excellent
Source Citations No Yes No
Data Privacy Sent to API Chunks sent Trained into model
Maintenance Easy Medium Complex
Best Starting Point YES Second choice Last resort

8. Embedding Models and Semantic Search
Embedding models convert text into high-dimensional vectors (typically 768-3072 dimensions) such that semantically similar texts end up close together in vector space. Embeddings underpin RAG, semantic search, text classification, clustering, and more. The embedding model you choose directly affects retrieval quality.
// Generating Embeddings with OpenAI
const response = await openai.embeddings.create({
model: "text-embedding-3-small", // or "text-embedding-3-large"
input: "How do I deploy a Next.js application?",
dimensions: 1536, // can reduce for cost savings (e.g., 512)
});
const embedding = response.data[0].embedding;
// embedding = [0.0123, -0.0456, 0.0789, ...] (1536 floats)
// Batch embeddings (more efficient)
const batchResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: [
"How to set up Docker containers",
"Kubernetes pod configuration guide",
"CI/CD pipeline with GitHub Actions",
"AWS Lambda serverless functions",
],
});
// batchResponse.data[0].embedding, batchResponse.data[1].embedding, ...

// Semantic Search Implementation
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Compare semantic similarity
const queries = [
"How to deploy to production", // semantically similar
"Production deployment guide", // semantically similar
"Best pizza recipe", // semantically different
];
// Result: queries[0] and queries[1] will have high similarity (~0.92)
// queries[0] and queries[2] will have low similarity (~0.15)
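Putting the similarity function to work, here is a self-contained ranking sketch. The toy 3-dimensional vectors stand in for real embeddings, and `cosSim` is a compact restatement of the `cosineSimilarity` helper above so the snippet runs on its own.

```typescript
// Rank candidate documents by cosine similarity to a query vector.
function cosSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function rank(queryVec: number[], docs: { id: string; vec: number[] }[]) {
  return docs
    .map(d => ({ id: d.id, score: cosSim(queryVec, d.vec) }))
    .sort((x, y) => y.score - x.score); // highest similarity first
}

// Toy vectors: doc-a points the same way as the query, doc-b is orthogonal
const ranked = rank([1, 0, 0], [
  { id: "doc-a", vec: [0.9, 0.1, 0] },
  { id: "doc-b", vec: [0, 1, 0] },
]);
// ranked[0].id === "doc-a"
```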
// Hybrid Search: combine semantic + keyword for best results
// score = alpha * semantic_score + (1 - alpha) * bm25_score
// alpha = 0.7 is a good starting point

Embedding Model Comparison
===========================
Model Provider Dims Cost/1M tokens MTEB Score
----- -------- ---- -------------- ----------
text-embedding-3-large OpenAI 3072 $0.13 64.6
text-embedding-3-small OpenAI 1536 $0.02 62.3
voyage-3 Voyage AI 1024 $0.06 67.1
embed-v3.0 Cohere 1024 $0.10 64.8
BGE-large-en-v1.5 BAAI 1024 Free (OSS) 63.5
GTE-Qwen2-7B Alibaba 3584 Free (OSS) 70.2
nomic-embed-text Nomic 768 Free (OSS) 62.4
Tips:
- Start with text-embedding-3-small (best cost/quality ratio)
- Use dimension reduction for cost savings (3072 -> 1024)
- Benchmark on YOUR data, not just MTEB leaderboard
- Open-source models (BGE, GTE) are competitive and free
- Use the same model for indexing and querying

9. AI Agents and Tool Use (Function Calling)
AI agents are LLM applications that can autonomously plan, reason, and use tools to complete complex tasks. Unlike simple Q&A, an agent can decompose a task, call external APIs, execute code, query databases, and adjust its strategy based on intermediate results. Function calling is the core API mechanism that lets agents invoke tools.
// OpenAI Function Calling — Tool Use
const tools = [
{
type: "function",
function: {
name: "search_documentation",
description: "Search technical documentation for a given query",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search query" },
language: {
type: "string",
enum: ["javascript", "python", "rust", "go"],
description: "Programming language filter",
},
},
required: ["query"],
},
},
},
{
type: "function",
function: {
name: "execute_code",
description: "Execute a code snippet and return the output",
parameters: {
type: "object",
properties: {
code: { type: "string", description: "Code to execute" },
language: { type: "string", enum: ["javascript", "python"] },
},
required: ["code", "language"],
},
},
},
];
// Agent loop: Think -> Act -> Observe -> Think
async function agentLoop(userMessage: string) {
const messages: any[] = [
{
role: "system",
content: "You are a helpful coding assistant. Use tools when needed."
},
{ role: "user", content: userMessage }
];
while (true) { // in production, cap iterations to avoid runaway tool loops
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages,
tools,
tool_choice: "auto",
});
const message = response.choices[0].message;
messages.push(message);
// If no tool calls, the agent is done
if (!message.tool_calls || message.tool_calls.length === 0) {
return message.content;
}
// Execute each tool call
for (const toolCall of message.tool_calls) {
const args = JSON.parse(toolCall.function.arguments);
let result: string;
if (toolCall.function.name === "search_documentation") {
result = await searchDocs(args.query, args.language);
} else if (toolCall.function.name === "execute_code") {
result = await executeCode(args.code, args.language);
} else {
result = "Unknown tool: " + toolCall.function.name;
}
messages.push({
role: "tool",
tool_call_id: toolCall.id,
content: result,
});
}
}
}

Agent Architecture Patterns
============================
1. ReAct (Reasoning + Acting):
Thought: "I need to find the API rate limits"
Action: search_documentation("rate limits REST API")
Observe: "Rate limits: 100 req/min for free tier..."
Thought: "Now I have the info, I can answer"
Answer: "The API allows 100 requests per minute..."
2. Plan-and-Execute:
Plan: ["Search docs", "Write code", "Test code", "Review"]
Execute: Run each step, re-plan if needed
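The Plan-and-Execute pattern above can be sketched as a loop over a generated plan. In this sketch, `planner` and `executor` are placeholders for an LLM planning call and a tool-executing step respectively; re-planning is omitted for brevity.

```typescript
// Plan-and-Execute (sketch). `planner` turns a goal into ordered steps,
// `executor` runs one step with the results of previous steps as context.
async function planAndExecute(
  goal: string,
  planner: (goal: string) => Promise<string[]>,
  executor: (step: string, context: string[]) => Promise<string>
): Promise<string[]> {
  const plan = await planner(goal);
  const results: string[] = [];
  for (const step of plan) {
    // Each step sees the accumulated results so far
    results.push(await executor(step, results));
  }
  return results;
}
```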
3. Multi-Agent (Crew/Swarm):
Agent 1 (Researcher): Gathers information
Agent 2 (Developer): Writes code based on research
Agent 3 (Reviewer): Reviews and suggests improvements
Orchestrator: Coordinates agent communication
4. Tool-Augmented Generation:
LLM decides WHEN and WHICH tool to call
Tools: calculator, web search, code exec, DB query, API calls
LLM synthesizes tool results into final answer

10. Guardrails: Content Filtering, Hallucination Detection, and Output Validation
Production-grade LLM applications must have guardrails. Guardrails ensure model output is safe, accurate, correctly formatted, and compliant with business rules. An LLM application without guardrails is like an API without validation: it will break sooner or later.
// Guardrails Implementation Pattern
// 1. Input Validation — Filter malicious/inappropriate inputs
async function validateInput(userMessage: string): Promise<boolean> {
// Check message length
if (userMessage.length > 10000) {
throw new Error("Message too long");
}
// Prompt injection detection
const injectionPatterns = [
/ignore (all |previous |above )?instructions/i,
/you are now/i,
/system prompt/i,
/reveal your/i,
];
if (injectionPatterns.some(p => p.test(userMessage))) {
throw new Error("Potential prompt injection detected");
}
// Content moderation (OpenAI Moderation API)
const moderation = await openai.moderations.create({
input: userMessage,
});
if (moderation.results[0].flagged) {
throw new Error("Content flagged: " +
Object.entries(moderation.results[0].categories)
.filter(([, v]) => v)
.map(([k]) => k)
.join(", ")
);
}
return true;
}

// 2. Output Validation — Ensure correct format and content
import { z } from "zod";
// Define expected output schema
const ProductRecommendation = z.object({
products: z.array(z.object({
name: z.string(),
reason: z.string().max(200),
confidence: z.number().min(0).max(1),
price_range: z.enum(["budget", "mid-range", "premium"]),
})).min(1).max(5),
disclaimer: z.string(),
});
// Parse and validate LLM output
function validateOutput(llmOutput: string) {
try {
const parsed = JSON.parse(llmOutput);
const validated = ProductRecommendation.parse(parsed);
return { success: true, data: validated };
} catch (error) {
// Retry with corrective prompt or return fallback
return { success: false, error: String(error) };
}
}
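When validation fails, a common follow-up is a single corrective retry: feed the validation error back to the model and ask again. Here is a sketch parameterized over the model call (`complete`) and the schema check (`validate`, e.g. the Zod parse above), so it assumes no particular SDK.

```typescript
// Corrective retry (sketch): if the model output fails validation, send
// the error back and ask once more before giving up.
async function completeWithRetry<T>(
  complete: (prompt: string) => Promise<string>,
  validate: (raw: string) => T, // throws on invalid output
  prompt: string,
  maxRetries = 1
): Promise<T> {
  let lastError = "";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const raw = await complete(
      attempt === 0
        ? prompt
        : prompt + "\n\nYour previous output was invalid: " + lastError +
          "\nRespond again with ONLY valid output."
    );
    try {
      return validate(raw);
    } catch (e) {
      lastError = String(e); // keep the error for the corrective prompt
    }
  }
  throw new Error("Output still invalid after retries: " + lastError);
}
```

Keeping `maxRetries` low matters: each retry is a full, billable model call, and repeated failures usually indicate the prompt or schema needs fixing rather than more retries.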
// 3. Hallucination Detection for RAG
async function checkFaithfulness(
answer: string,
sources: string[]
): Promise<{ faithful: boolean; issues: string[] }> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content:
"You are a fact-checker. Compare the answer against the source " +
"documents. Identify any claims NOT supported by the sources. " +
"Respond in JSON: {faithful: boolean, issues: string[]}"
},
{
role: "user",
content: "Sources:\n" + sources.join("\n---\n") +
"\n\nAnswer:\n" + answer
}
],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}

Production Guardrails Checklist
================================
Layer Check Priority
----- ----- --------
Input Message length limit CRITICAL
Input Prompt injection detection CRITICAL
Input Content moderation (toxicity) CRITICAL
Input PII detection and redaction HIGH
Input Rate limiting per user HIGH
Output JSON schema validation HIGH
Output Hallucination / faithfulness check HIGH
Output Content safety filter CRITICAL
Output Max output length enforcement MEDIUM
Output Source citation verification MEDIUM
System Cost per request monitoring HIGH
System Latency tracking (P50, P95, P99) HIGH
System Error rate alerting CRITICAL
System Fallback responses for failures HIGH
System Audit logging for compliance MEDIUM

11. LLM API Cost Optimization
LLM API spend is one of the main operating costs of a production application. An unoptimized LLM app can burn thousands to tens of thousands of dollars per month. Semantic caching, model routing, prompt compression, and batching can cut costs by 60-80% while preserving output quality.
// 1. Semantic Caching — Avoid redundant API calls
import { createHash } from "crypto";
class SemanticCache {
private cache: Map<string, { response: string; embedding: number[] }> = new Map();
private similarityThreshold = 0.95;
async get(query: string): Promise<string | null> {
const queryEmbedding = await getEmbedding(query);
for (const [, entry] of this.cache) {
const similarity = cosineSimilarity(queryEmbedding, entry.embedding);
if (similarity >= this.similarityThreshold) {
return entry.response; // Cache hit
}
}
return null; // Cache miss
}
async set(query: string, response: string): Promise<void> {
const embedding = await getEmbedding(query);
const key = createHash("sha256").update(query).digest("hex");
this.cache.set(key, { response, embedding });
}
}

// 2. Model Routing — Use the cheapest model that works
async function routeToModel(query: string): Promise<string> {
// Classify query complexity
const classification = await openai.chat.completions.create({
model: "gpt-4o-mini", // cheap classifier
messages: [{
role: "system",
content:
"Classify the query complexity as SIMPLE, MEDIUM, or COMPLEX. " +
"SIMPLE: factual lookup, formatting, translation. " +
"MEDIUM: summarization, code generation, analysis. " +
"COMPLEX: multi-step reasoning, creative writing, architecture design. " +
"Respond with only the classification word."
}, {
role: "user", content: query
}],
max_tokens: 10,
});
const complexity = classification.choices[0].message.content?.trim();
const modelMap: Record<string, string> = {
SIMPLE: "gpt-4o-mini", // $0.15/1M input
MEDIUM: "gpt-4o-mini", // $0.15/1M input
COMPLEX: "gpt-4o", // $2.50/1M input
};
const model = modelMap[complexity || "MEDIUM"];
console.log("Routing " + query.slice(0, 50) + "... to " + model);
return model;
}

LLM Cost Optimization Strategies
==================================
Strategy Savings Effort Impact on Quality
-------- ------- ------ -----------------
Semantic caching 30-60% Medium None (exact/similar)
Model routing 40-70% Medium Minimal (smart routing)
Prompt compression 10-30% Low Minimal
Batch API (OpenAI) 50% Low None
Reduce max_tokens 5-20% Low None (if set correctly)
Shorter system prompts 5-15% Low Minimal
Streaming (early stop) 10-30% Medium Variable
Open-source models 80-100% High Depends on model
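Prompt compression from the table can start very simply: collapse redundant whitespace and cap the retrieved context at a character budget. This sketch only shows the idea; production systems use smarter, model-aware compression (e.g. LLMLingua-style token pruning).

```typescript
// Naive prompt compression (sketch): collapse whitespace runs, then
// truncate the context to a character budget.
function compressPrompt(context: string, maxChars = 4000): string {
  const collapsed = context
    .replace(/[ \t]+/g, " ")    // collapse runs of spaces/tabs
    .replace(/\n{3,}/g, "\n\n") // collapse runs of blank lines
    .trim();
  if (collapsed.length <= maxChars) return collapsed;
  // Keep the beginning, where retrieved chunks are usually ranked best
  return collapsed.slice(0, maxChars) + "\n[...context truncated...]";
}
```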
Example Monthly Cost Breakdown (100K queries/month):
Unoptimized (GPT-4o for everything): $5,000
+ Model routing (80% to mini): $1,200 (-76%)
+ Semantic caching (40% hit rate): $720 (-86%)
+ Prompt compression: $580 (-88%)
+ Batch API for async tasks: $450 (-91%)

12. Model Comparison: GPT-4 vs Claude 3 vs Gemini vs Llama 3
Choosing the right model is one of the most consequential decisions in AI engineering. Models differ in reasoning ability, context window, cost, latency, and task-specific performance. A detailed comparison of the mainstream models follows.
Model Comparison — Detailed Breakdown (2026)
==============================================
Category GPT-4o Claude 3.5 Gemini 1.5 Llama 3.1
Sonnet Pro 405B
-------- ------ ----------- ----------- ----------
Provider OpenAI Anthropic Google Meta (OSS)
Context Window 128K 200K 1M 128K
Multimodal Text+Image+ Text+Image+ Text+Image+ Text+Image
Audio PDF Video+Audio
Coding Excellent Best Very Good Very Good
Reasoning Excellent Excellent Good Good
Long Context Good Excellent Best Good
Safety Good Best Good Good
Speed (TTFT) Fast Fast Fast Varies
API Ecosystem Best Good Good N/A
Fine-tuning Yes No (yet) Yes Full control
Self-hosting No No No Yes
Data Privacy API only API only API only Full control
Best For:
GPT-4o: General purpose, largest ecosystem, best tooling
Claude 3.5 Sonnet: Coding, long documents, safety-critical apps
Gemini 1.5 Pro: Ultra-long context (books, codebases), multimodal
Llama 3.1 405B: Self-hosting, data privacy, customization
Budget Models:
GPT-4o-mini: Best cheap commercial model
Claude 3.5 Haiku: Fast and affordable, good quality
Gemini 1.5 Flash: Ultra cheap, very fast
Llama 3.1 70B: Best open-source mid-tier, self-hostable

// Multi-Model Strategy: Use the Right Model for Each Task
const MODEL_CONFIG = {
// High-stakes, complex reasoning
complex: {
model: "gpt-4o",
temperature: 0.3,
useCases: ["architecture design", "code review", "legal analysis"],
},
// Long document processing
longContext: {
model: "gemini-1.5-pro",
temperature: 0.2,
useCases: ["codebase analysis", "book summarization", "log analysis"],
},
// Code generation and debugging
coding: {
model: "claude-3-5-sonnet-20241022",
temperature: 0,
useCases: ["code generation", "debugging", "refactoring"],
},
// Simple tasks, high volume
simple: {
model: "gpt-4o-mini",
temperature: 0.5,
useCases: ["classification", "extraction", "formatting"],
},
// Privacy-sensitive, self-hosted
private: {
model: "meta-llama/Llama-3.1-70B-Instruct",
temperature: 0.3,
useCases: ["medical records", "financial data", "internal tools"],
},
};
13. Evaluating and Testing LLM Applications
Evaluation and testing are the most overlooked yet most important parts of AI engineering. Because LLM outputs are non-deterministic, traditional testing methods do not fully apply; you need a dedicated evaluation framework to measure output quality, accuracy, and safety, and to drive continuous improvement across iterations.
// LLM Evaluation Framework
// 1. LLM-as-Judge — Use a strong model to evaluate outputs
async function llmJudge(
question: string,
answer: string,
criteria: string
): Promise<{ score: number; reasoning: string }> {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "system",
content:
"You are an expert evaluator. Score the answer on a scale of " +
"1-5 based on the given criteria. " +
"Respond in JSON: {score: number, reasoning: string}"
}, {
role: "user",
content: "Question: " + question + "\n" +
"Answer: " + answer + "\n" +
"Criteria: " + criteria
}],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}
// Usage:
// await llmJudge(
// "What is Kubernetes?",
// generatedAnswer,
// "Accuracy, completeness, clarity, and conciseness"
// );
// 2. RAG Evaluation Metrics
// Using Ragas framework concepts
interface RAGEvalResult {
faithfulness: number; // Is the answer grounded in sources?
answerRelevancy: number; // Does it actually answer the question?
contextPrecision: number; // Are retrieved docs relevant?
contextRecall: number; // Did we retrieve all relevant docs?
}
async function evaluateRAG(
question: string,
answer: string,
contexts: string[],
groundTruth: string
): Promise<RAGEvalResult> {
// Faithfulness: does the answer use only info from contexts?
const faithfulness = await llmJudge(
question,
answer,
"Score 1-5: Is every claim in the answer supported by the " +
"provided context? Penalize any unsupported claims heavily."
);
// Answer relevancy: does it address the question?
const relevancy = await llmJudge(
question,
answer,
"Score 1-5: Does the answer directly address the question? " +
"Penalize off-topic content and missing key points."
);
return {
faithfulness: faithfulness.score / 5,
answerRelevancy: relevancy.score / 5,
contextPrecision: 0, // computed via retrieval metrics
contextRecall: 0, // computed against ground truth
};
}
LLM Testing Strategy
=====================
Test Type What It Tests Tools
--------- ------------- -----
Unit Tests Prompt template outputs pytest, vitest
Integration Tests Full RAG pipeline Ragas, DeepEval
Regression Tests Output consistency Golden datasets
A/B Tests Model/prompt comparison LangSmith, Braintrust
Red Team Tests Safety, edge cases Garak, manual
Load Tests Latency under load k6, Locust
Cost Tests Budget compliance Custom monitoring
Evaluation Tools:
Ragas - Open-source RAG evaluation framework
DeepEval - LLM evaluation with multiple metrics
LangSmith - Tracing, debugging, evaluation (LangChain)
Phoenix - LLM observability (Arize AI)
Braintrust - LLM evaluation and monitoring platform
Promptfoo - Open-source prompt testing CLI
Golden Rule: Never deploy without:
1. A golden evaluation dataset (50-200 examples)
2. Automated regression tests in CI
3. Production monitoring (latency, cost, errors)
4. Human review for high-stakes outputs
Frequently Asked Questions
What is the difference between an AI engineer and an ML engineer?
ML engineers focus on training and optimizing machine learning models from scratch (data collection, feature engineering, model training, hyperparameter tuning). AI engineers focus on building applications on top of pretrained large language models (LLMs), integrating them into products via API calls, prompt engineering, RAG, fine-tuning, and orchestration frameworks. AI engineers work closer to the application layer; ML engineers work closer to the model layer.
How do I choose between RAG and fine-tuning?
Prefer RAG when you need up-to-date information, source citations, frequently changing data, or explainability. Choose fine-tuning when you need to change the model's tone, style, or output format, teach it domain-specific reasoning patterns, or when RAG retrieval quality is not good enough. The two can be combined: fine-tune first so the model understands domain language better, then use RAG to supply concrete facts. On cost, RAG has lower upfront cost but adds retrieval overhead to every query, while fine-tuning requires a higher upfront training cost but simpler inference.
How do I choose a vector database?
It depends on scale and infrastructure. Pinecone: fully managed, zero ops, good for rapid prototyping and medium scale. Weaviate: open source, hybrid search (vector + keyword), rich module ecosystem. Qdrant: open source, excellent performance (implemented in Rust), powerful filtering. pgvector: a PostgreSQL extension, ideal for teams already running Postgres, with good performance below roughly a million vectors. Chroma: lightweight, good for local development and prototyping. At very large scale (billions of vectors), consider Milvus.
How do I reduce LLM API costs?
Key strategies: 1) Semantic caching — cache responses for similar queries to avoid repeated calls; 2) Model routing — use small models (GPT-4o-mini / Claude Haiku) for simple tasks and reserve large models for complex ones; 3) Prompt optimization — cut unnecessary tokens (trim system prompts, compress context); 4) Batch API — for non-real-time workloads, batch endpoints cut costs by 50%; 5) Open-source models — self-host Llama 3 or Mistral when latency or privacy requirements are strict; 6) Output length limits — set max_tokens to avoid verbose replies.
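The semantic-caching idea from strategy 1 can be sketched in a few lines of TypeScript. The `SemanticCache` class, the 0.92 default threshold, and the in-memory linear scan are illustrative assumptions; a production setup would use a real embedding model (e.g. text-embedding-3-small) and a vector store instead.

```typescript
// Minimal semantic cache sketch. Query vectors come from an embedding
// model; here they are passed in directly so the logic stays self-contained.
type CacheEntry = { vector: number[]; response: string };

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.92) {}

  // Return a cached response if a semantically similar query was seen before.
  lookup(queryVector: number[]): string | null {
    let best: CacheEntry | null = null;
    let bestScore = -1;
    for (const e of this.entries) {
      const s = cosineSimilarity(queryVector, e.vector);
      if (s > bestScore) { bestScore = s; best = e; }
    }
    return best !== null && bestScore >= this.threshold ? best.response : null;
  }

  store(queryVector: number[], response: string): void {
    this.entries.push({ vector: queryVector, response });
  }
}
```

On a cache hit the LLM call is skipped entirely; raising the threshold trades hit rate for answer fidelity.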
What is the difference between LangChain and LlamaIndex?
LangChain is a general-purpose LLM application orchestration framework, strong at building complex chains and agents: multi-step workflows, tool calling, conversation management. LlamaIndex (formerly GPT Index) specializes in data indexing and retrieval, excelling at connecting data sources and building high-quality RAG pipelines. In practice they are often combined: LlamaIndex handles data ingestion and retrieval while LangChain handles chain orchestration and agent logic. If your core need is RAG, start with LlamaIndex; if you need complex multi-step workflows, start with LangChain.
What is an AI agent, and how does it differ from a plain LLM call?
An AI agent is an LLM application that can autonomously plan, use tools, and execute multi-step tasks. A plain LLM call is a single-turn input-output exchange, whereas an agent can: 1) break a complex task into sub-steps; 2) call external tools (search engines, databases, APIs, code executors); 3) observe tool results and decide the next action; 4) iterate until the task is done. The core pattern is the ReAct (Reasoning + Acting) loop: think → act → observe → think. Function Calling is the main API mechanism for implementing tool use.
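The ReAct loop just described can be sketched as follows. `callLLM`, the `ACTION tool:input` / `FINAL answer` text protocol, and the toy tools are hypothetical stand-ins; a real agent would use a chat-completion API with function calling rather than string parsing.

```typescript
// Minimal ReAct loop sketch: Think -> Act -> Observe, repeated until
// the model emits a final answer or the step budget runs out.
type Tool = (input: string) => string;
type LLM = (transcript: string) => string; // returns "ACTION tool:input" or "FINAL answer"

function runReActLoop(
  question: string,
  tools: Record<string, Tool>,
  callLLM: LLM,
  maxSteps = 5
): string {
  let transcript = `Question: ${question}`;
  for (let step = 0; step < maxSteps; step++) {
    const decision = callLLM(transcript); // Think
    if (decision.startsWith("FINAL ")) {
      return decision.slice("FINAL ".length); // task complete
    }
    // Act: parse "ACTION toolName:toolInput"
    const match = decision.match(/^ACTION (\w+):(.*)$/);
    if (!match || !tools[match[1]]) {
      transcript += "\nObservation: unknown action";
      continue;
    }
    const observation = tools[match[1]](match[2]); // Observe
    transcript += `\n${decision}\nObservation: ${observation}`;
  }
  return "Stopped: step limit reached";
}
```

The step limit is the essential safety valve: without it, a confused agent can loop (and bill) indefinitely.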
How do I evaluate and test LLM applications?
Use a multi-layer approach: 1) Offline evaluation — compute accuracy, F1, or BLEU/ROUGE on labeled datasets; 2) LLM-as-Judge — use a strong LLM (e.g. GPT-4) to grade another model's outputs; 3) RAG-specific metrics — retrieval relevance (Precision@K), answer faithfulness (is the answer grounded in retrieved content?), answer relevancy; 4) Human evaluation — manual review and A/B tests for critical scenarios; 5) Online monitoring — track latency, cost, user feedback, and error rates. Recommended tools: Ragas (RAG evaluation), DeepEval, LangSmith (tracing and debugging), Phoenix (observability).
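As a sketch of the offline-evaluation layer, here is a toy golden-dataset regression check. The token-overlap metric and the `regressionReport` helper are illustrative assumptions; real pipelines would score with embedding similarity or the LLM-as-Judge pattern instead.

```typescript
// Toy regression check: run each golden input through the system and
// score the output against the stored reference answer.
interface GoldenExample { input: string; expected: string }

// Jaccard-style token overlap: crude, but deterministic and free.
function tokenOverlap(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  let common = 0;
  for (const t of ta) if (tb.has(t)) common++;
  return common / Math.max(ta.size, tb.size);
}

function regressionReport(
  golden: GoldenExample[],
  generate: (input: string) => string,
  threshold = 0.5
): { passed: number; failed: string[] } {
  const failed: string[] = [];
  let passed = 0;
  for (const ex of golden) {
    const output = generate(ex.input);
    if (tokenOverlap(output, ex.expected) >= threshold) passed++;
    else failed.push(ex.input);
  }
  return { passed, failed };
}
```

Wired into CI, a non-empty `failed` list blocks the deploy, which is exactly the "automated regression tests in CI" rule from the checklist above.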
How do I handle LLM hallucinations?
Strategies to reduce hallucination: 1) RAG — ground answers in retrieved documents rather than the model's memory; 2) System prompt constraints — explicitly instruct "answer only from the provided context; if unsure, say you don't know"; 3) Temperature — lower the temperature (0.0-0.3) to reduce randomness; 4) Structured output — constrain the output format with JSON Schema or function calling; 5) Self-check — have the model verify that its own answer is supported by evidence after generating it; 6) Fact-checking pipelines — run automated fact verification on outputs; 7) Source citations — require the model to cite specific document passages.
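Strategies 4 and 7 can be combined into a simple citation-grounding check: ask the model for JSON containing verbatim quotes, then verify each quote actually appears in the retrieved context. The `GroundedAnswer` shape is an assumed output contract for illustration, not a provider API.

```typescript
// Sketch of a citation-grounding guardrail for RAG answers.
interface GroundedAnswer {
  answer: string;
  quotes: string[]; // verbatim snippets the model claims to cite
}

function verifyGrounding(parsed: GroundedAnswer, context: string): boolean {
  // Every cited quote must appear verbatim in the retrieved context;
  // normalization keeps the check robust to whitespace and case.
  const normalize = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  const haystack = normalize(context);
  return parsed.quotes.length > 0 &&
    parsed.quotes.every((q) => haystack.includes(normalize(q)));
}
```

If verification fails, retry with a stricter prompt or fall back to "I don't know" rather than returning an unsupported answer.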
AI Engineering Core Concepts Cheat Sheet
AI Engineering — Quick Reference
=================================
Concept Description
------- -----------
Prompt Engineering Designing inputs to get desired LLM outputs
System Prompt Instructions that set model role and constraints
Few-Shot Learning Providing examples in the prompt for guidance
Chain-of-Thought (CoT) Asking the model to reason step-by-step
RAG Retrieve relevant docs, then generate answers
Vector Database Storage optimized for similarity search
Embedding Dense vector representation of text meaning
Chunking Splitting documents into smaller pieces
Fine-Tuning Training a model further on custom data
Function Calling LLM deciding when and how to call tools
AI Agent LLM that can plan, use tools, and iterate
ReAct Pattern Think -> Act -> Observe -> Think loop
Guardrails Input/output validation and safety filters
Hallucination Model generating false or unsupported info
Semantic Caching Cache responses for semantically similar queries
Model Routing Directing queries to the best-fit model
LLM-as-Judge Using a strong LLM to evaluate outputs
LCEL (LangChain) LangChain Expression Language for chains
Temperature Controls randomness (0=deterministic, 1=creative)
Token Basic unit of text processed by LLMs
Context Window Maximum tokens a model can process at once
Structured Output Constraining LLM output to JSON/schema
Batch API Processing multiple requests at reduced cost
Multimodal Models that process text, image, audio, video