TL;DR
AI Engineering is the discipline of building production applications on top of large language models. Master prompt engineering first (system prompts, few-shot, chain-of-thought), then add RAG with vector databases when the model needs domain knowledge, and only fine-tune when prompt engineering plus RAG are insufficient. Use LangChain or LlamaIndex to orchestrate chains and agents. Choose between OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro, or open-source Llama 3 based on cost, latency, context window, and task requirements. Always implement guardrails, evaluation pipelines, and cost monitoring before going to production.
Key Takeaways
- AI Engineering focuses on building applications on top of pre-trained LLMs, not training models from scratch; core skills include prompt engineering, RAG, API integration, and orchestration frameworks.
- Decision order: optimize prompts first, add RAG if that is insufficient, fine-tune only as a last resort — the large majority of applications can be served with prompt engineering plus RAG alone.
- Vector databases are the core component of RAG; choose by scale: pgvector (up to a few million vectors), Pinecone (fully managed), Weaviate/Qdrant (open-source at larger scale).
- AI agents achieve autonomous multi-step task execution through ReAct loops (Reasoning + Acting) and Function Calling.
- Guardrails are essential: content filtering, hallucination detection, output validation, rate limiting, and cost monitoring must be in place before production.
- Cost optimization keys: semantic caching, model routing (small models for simple tasks), prompt compression, and batch APIs can reduce costs by 60-80%.
1. What Is AI Engineering? (vs ML Engineering vs Data Science)
AI Engineering is an emerging discipline focused on building production applications using pre-trained large language models (LLMs). It sits at the intersection of software engineering and machine learning, but differs significantly from traditional ML Engineering and Data Science. AI Engineers do not train models from scratch — instead they integrate LLM capabilities into products through API calls, prompt engineering, Retrieval-Augmented Generation (RAG), and orchestration frameworks.
The explosive growth of LLMs in 2023-2024 gave rise to this role. As models like GPT-4, Claude 3, and Gemini became increasingly powerful, many application scenarios no longer required custom-trained models — what they needed instead was engineers who know how to use these models effectively. The core value of an AI Engineer is transforming LLM capabilities into reliable, scalable product features.
AI Engineering vs ML Engineering vs Data Science
=================================================
Role Focus Key Skills Output
---- ----- ---------- ------
Data Scientist Analysis & insights Statistics, SQL, Reports, dashboards,
Python, visualization predictive models
ML Engineer Model training PyTorch, TensorFlow, Trained models,
& deployment MLOps, feature eng. ML pipelines
AI Engineer LLM application Prompt eng., RAG, AI-powered products,
development LangChain, APIs chatbots, agents
Key Differences:
- Data Scientist: "What does the data tell us?"
- ML Engineer: "How do we train and serve a model?"
- AI Engineer: "How do we build a product with an LLM?"2. LLM APIs: OpenAI, Anthropic Claude, Google Gemini
LLM APIs are the foundation of AI Engineering. The three major providers each have strengths: OpenAI GPT-4o has the most mature ecosystem; Anthropic Claude 3.5 Sonnet leads in long-context understanding and safety; Google Gemini 1.5 Pro offers an ultra-long context window (1M tokens) with deep Google Cloud integration.
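Whichever provider you choose, production calls should tolerate transient failures (429 rate limits, 5xx errors, timeouts). A minimal retry sketch; `callModel` is a hypothetical stand-in for any of the SDK calls shown below:

```typescript
// Exponential backoff with full jitter: delays grow 1s, 2s, 4s, ... capped at 30s.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2); // jitter within [exp/2, exp)
}

// Retry a model call, waiting between attempts; rethrows the last error.
async function withRetries<T>(
  callModel: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callModel();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // retries exhausted
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

In real code you would retry only retryable failures (429/5xx) and respect any retry-after hint the API returns.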
// OpenAI API — Chat Completion
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain async/await in JavaScript" }
],
temperature: 0.7,
max_tokens: 1000,
});
console.log(response.choices[0].message.content);
console.log("Tokens used:", response.usage.total_tokens);// Anthropic Claude API
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const message = await anthropic.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
system: "You are a senior software architect.",
messages: [
{ role: "user", content: "Design a rate limiter for a REST API" }
],
});
console.log(message.content[0].text);
console.log("Input tokens:", message.usage.input_tokens);
console.log("Output tokens:", message.usage.output_tokens);// Google Gemini API
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });
const result = await model.generateContent({
contents: [{
role: "user",
parts: [{ text: "Compare microservices vs monolith architecture" }]
}],
generationConfig: {
temperature: 0.7,
maxOutputTokens: 1000,
},
});
console.log(result.response.text());
LLM API Comparison (2026)
==========================
Provider Model Context Input Cost Output Cost Strengths
-------- ----- ------- ---------- ----------- ---------
OpenAI GPT-4o 128K $2.50/1M $10.00/1M Best ecosystem, multimodal
OpenAI GPT-4o-mini 128K $0.15/1M $0.60/1M Cheapest smart model
Anthropic Claude 3.5 Sonnet 200K $3.00/1M $15.00/1M Long context, safety, coding
Anthropic Claude 3.5 Haiku 200K $0.25/1M $1.25/1M Fast and affordable
Google Gemini 1.5 Pro 1M $1.25/1M $5.00/1M Longest context window
Google Gemini 1.5 Flash 1M $0.075/1M $0.30/1M Ultra fast and cheap
Meta Llama 3.1 405B 128K Self-host Self-host Open source, no API cost
Meta Llama 3.1 70B 128K Self-host Self-host Best open-source mid-tier
Mistral Mixtral 8x22B 64K $2.00/1M $6.00/1M Strong EU-based option
3. Prompt Engineering: Few-Shot, Chain-of-Thought, and System Prompts
Prompt engineering is the most essential skill for AI Engineers. By carefully designing input prompts, you can dramatically improve model output quality without modifying the model itself. Mastering system prompts (setting roles and constraints), few-shot learning (providing examples for guidance), and chain-of-thought (guiding step-by-step reasoning) is a required course for every AI Engineer.
// 1. System Prompt — Setting Role and Constraints
const messages = [
{
role: "system",
content: `You are a senior TypeScript developer.
Rules:
- Always use strict TypeScript types (no "any")
- Include error handling for all async operations
- Add JSDoc comments for public functions
- If you are unsure, say "I am not certain" rather than guessing
- Format output as markdown code blocks`
},
{ role: "user", content: "Write a function to fetch user data from an API" }
];
// 2. Few-Shot Learning — Providing Examples
const fewShotMessages = [
{
role: "system",
content: "You classify customer support tickets into categories."
},
// Example 1
{ role: "user", content: "My order hasn't arrived yet, it's been 2 weeks" },
{ role: "assistant", content: "Category: SHIPPING\nPriority: HIGH\nSentiment: FRUSTRATED" },
// Example 2
{ role: "user", content: "How do I change my password?" },
{ role: "assistant", content: "Category: ACCOUNT\nPriority: LOW\nSentiment: NEUTRAL" },
// Example 3
{ role: "user", content: "Your product is amazing, saved me hours!" },
{ role: "assistant", content: "Category: FEEDBACK\nPriority: LOW\nSentiment: POSITIVE" },
// Actual query
{ role: "user", content: "I was charged twice for my subscription" }
];
// 3. Chain-of-Thought (CoT) — Step-by-Step Reasoning
const cotPrompt = {
role: "user",
content: `Analyze whether this API design follows REST best practices.
API Endpoint: POST /api/users/123/delete
Think step by step:
1. Check the HTTP method appropriateness
2. Evaluate the URL structure
3. Assess resource naming conventions
4. Check for idempotency considerations
5. Provide your final assessment with improvements`
};
// CoT Variations:
// - "Let's think step by step" (zero-shot CoT)
// - "Think through this carefully before answering" (implicit CoT)
// - Provide worked examples with reasoning (few-shot CoT)
// - "First analyze X, then consider Y, finally conclude Z" (structured CoT)Prompt Engineering Techniques — Quick Reference
================================================
Technique When to Use Example
--------- ----------- -------
Zero-shot Simple, well-defined tasks "Translate to French: Hello"
Few-shot Classification, formatting Provide 3-5 input/output pairs
Chain-of-Thought Math, logic, complex reasoning "Think step by step..."
System Prompt Role, tone, constraints "You are a legal expert..."
Output Format Structured data needed "Respond in JSON format..."
Self-consistency High-stakes decisions Generate 5 answers, take majority
Tree of Thought Complex problem solving Explore multiple solution paths
ReAct Tool use, agents Think -> Act -> Observe loop
Retrieval-Augmented Domain-specific knowledge Inject relevant docs into context
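Of the rows above, self-consistency is simple to implement once you have N sampled answers (e.g. the same prompt at temperature ~0.7, five times): normalize and take the majority. A sketch, with the sampling calls themselves omitted:

```typescript
// Self-consistency: majority vote over several sampled answers.
// Normalizing (trim + lowercase) collapses trivial formatting differences.
function majorityAnswer(answers: string[]): string {
  const counts = new Map<string, { raw: string; n: number }>();
  for (const a of answers) {
    const key = a.trim().toLowerCase();
    const entry = counts.get(key) ?? { raw: a.trim(), n: 0 };
    entry.n += 1;
    counts.set(key, entry);
  }
  let best = { raw: "", n: 0 };
  for (const entry of counts.values()) {
    if (entry.n > best.n) best = entry; // keep the most frequent answer
  }
  return best.raw;
}
```

This works best for tasks with a short, canonical answer (math results, labels); for free-form text you would vote on an extracted final answer instead.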
Anti-patterns to Avoid:
- Vague instructions ("make it better")
- Overloading context (irrelevant information)
- No output format specification
- Not handling edge cases in prompt
- Using negation ("don't do X") instead of affirmation ("do Y")
4. RAG (Retrieval-Augmented Generation) Architecture
RAG is one of the most important architectural patterns in AI Engineering today. By retrieving relevant documents from an external knowledge base before generating answers, it solves two core LLM problems: knowledge cutoff dates and hallucinations. RAG enables models to answer questions grounded in up-to-date, domain-specific real data while providing traceable source citations.
RAG Architecture — Data Flow
============================
INDEXING PHASE (offline, one-time):
┌──────────┐ ┌──────────┐ ┌───────────┐ ┌───────────────┐
│ Documents │ -> │ Chunking │ -> │ Embedding │ -> │ Vector Store │
│ (PDF,Web, │ │ (split │ │ Model │ │ (Pinecone, │
│ Notion) │ │ text) │ │ (ada-002) │ │ pgvector) │
└──────────┘ └──────────┘ └───────────┘ └───────────────┘
QUERY PHASE (online, per-request):
┌──────────┐ ┌───────────┐ ┌───────────────┐
│ User │ -> │ Embed │ -> │ Vector Search │
│ Question │ │ Query │ │ (top-K docs) │
└──────────┘ └───────────┘ └───────┬───────┘
│
v
┌──────────┐ ┌───────────────────────────────┐
│ Answer │ <- │ LLM (query + retrieved docs) │
│ + Sources│ │ "Based on context, answer..." │
└──────────┘ └───────────────────────────────┘
// Complete RAG Pipeline in TypeScript
import { OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Step 1: Document Chunking
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // characters per chunk
chunkOverlap: 200, // overlap between chunks
separators: ["\n\n", "\n", ". ", " "], // split priority
});
const chunks = await splitter.splitDocuments(documents);
// Step 2: Generate Embeddings & Store
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small", // 1536 dimensions
});
const pinecone = new Pinecone();
const index = pinecone.Index("my-rag-index");
const vectorStore = await PineconeStore.fromDocuments(
chunks,
embeddings,
{ pineconeIndex: index }
);
// Step 3: Query — Retrieve + Generate
async function ragQuery(question: string) {
// Retrieve top 5 most relevant chunks
const relevantDocs = await vectorStore.similaritySearch(question, 5);
// Build context from retrieved documents
const context = relevantDocs
.map((doc, i) => "Source " + (i + 1) + ": " + doc.pageContent)
.join("\n\n");
// Generate answer with context
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: "Answer the question based ONLY on the provided context. " +
"If the context does not contain the answer, say so. " +
"Cite your sources using [Source N] format."
},
{
role: "user",
content: "Context:\n" + context + "\n\nQuestion: " + question
}
],
temperature: 0.2, // low temp for factual answers
});
return {
answer: response.choices[0].message.content,
sources: relevantDocs.map(d => d.metadata),
};
}
Chunking Strategies
====================
Strategy Chunk Size Overlap Best For
-------- ---------- ------- --------
Fixed-size 500-1000 100-200 General purpose, simple
Recursive character 500-1500 100-300 Most text documents
Sentence-based 1-5 sent. 1 sent. FAQ, precise retrieval
Semantic Varies N/A Topic-coherent chunks
Parent-child Small+Large N/A Retrieve small, pass large
Markdown header By section N/A Technical documentation
Code-aware By function N/A Source code files
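The fixed-size strategy in the table is simple enough to hand-roll when you don't want a framework dependency; a sketch using character windows with overlap:

```typescript
// Fixed-size chunking: slide a window of `size` characters, stepping by
// `size - overlap` so consecutive chunks share `overlap` characters.
function chunkFixed(text: string, size = 1000, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than chunk size");
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

A production chunker would additionally snap boundaries to the nearest sentence or paragraph break, which is exactly what the recursive splitter above does.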
Tips:
- Smaller chunks = more precise retrieval but less context
- Larger chunks = more context but noisier retrieval
- Overlap prevents information loss at chunk boundaries
- Test with your actual data to find the optimal size
5. Vector Databases: Pinecone, Weaviate, Qdrant, and pgvector
Vector databases are the core storage layer in RAG architectures. They are purpose-built for similarity search over high-dimensional vectors, capable of finding the most similar results from millions of vectors in milliseconds. Choosing the right vector database depends on your scale, infrastructure preferences, and feature requirements.
// Pinecone — Fully Managed Vector Database
import { Pinecone } from "@pinecone-database/pinecone";
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index("my-index");
// Upsert vectors
await index.upsert([
{
id: "doc-1",
values: embedding, // float array [0.1, -0.2, ...]
metadata: { source: "docs/api.md", section: "auth" },
},
]);
// Query with metadata filter
const results = await index.query({
vector: queryEmbedding,
topK: 5,
filter: { source: { "$eq": "docs/api.md" } },
includeMetadata: true,
});
// pgvector — PostgreSQL Extension (great for existing PG users)
// SQL setup:
// CREATE EXTENSION vector;
// CREATE TABLE documents (
// id SERIAL PRIMARY KEY,
// content TEXT,
// metadata JSONB,
// embedding vector(1536) -- dimension matches your model
// );
// CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
import pg from "pg";
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
// Insert document with embedding
await pool.query(
"INSERT INTO documents (content, metadata, embedding) VALUES (\$1, \$2, \$3)",
[content, JSON.stringify(metadata), JSON.stringify(embedding)]
);
// Similarity search (cosine distance)
const result = await pool.query(
`SELECT content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5`,
[JSON.stringify(queryEmbedding)]
);
Vector Database Comparison
===========================
Database Type Best For Pricing Max Scale
-------- ---- -------- ------- ---------
Pinecone Managed Zero-ops, quick start Pay-per-use Billions
Weaviate OSS/Cloud Hybrid search, modules Free/Managed Billions
Qdrant OSS/Cloud Filtering, performance Free/Managed Billions
pgvector PG Ext. Existing PG infra Free (self) Millions
Chroma OSS Local dev, prototyping Free Millions
Milvus OSS Ultra-large scale Free/Managed Trillions
FAISS Library In-memory, research Free Billions
Distance Metrics:
- Cosine similarity: normalized, most common for text embeddings
- Euclidean (L2): absolute distance, good for image features
- Dot product: fastest, works when vectors are normalized
6. LangChain and LlamaIndex Frameworks
LangChain and LlamaIndex are the two most popular LLM application development frameworks. LangChain is a general-purpose orchestration framework ideal for complex multi-step workflows and agents; LlamaIndex specializes in data connection and retrieval, making it the top choice for RAG applications. They can be used complementarily.
// LangChain — Chain with Prompt Template
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
// Simple chain: prompt -> model -> parser
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a technical writer. Write clear, concise explanations."],
["user", "Explain {topic} for {audience}"],
]);
const chain = RunnableSequence.from([
prompt,
model,
new StringOutputParser(),
]);
const result = await chain.invoke({
topic: "Kubernetes pods",
audience: "junior developers",
});
console.log(result);
// LangChain — RAG Chain with Retriever
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
const retriever = vectorStore.asRetriever({ k: 5 });
const ragPrompt = ChatPromptTemplate.fromMessages([
["system",
"Answer based on the following context only.\n" +
"If the context does not help, say you do not know.\n" +
"Context: {context}"],
["user", "{input}"],
]);
const documentChain = await createStuffDocumentsChain({
llm: model,
prompt: ragPrompt,
});
const ragChain = await createRetrievalChain({
retriever,
combineDocsChain: documentChain,
});
const answer = await ragChain.invoke({
input: "How do I configure SSL in Nginx?",
});
console.log(answer.answer);
console.log("Sources:", answer.context.map(d => d.metadata.source));# LlamaIndex — RAG Pipeline (Python)
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Load documents and build index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query with RAG
query_engine = index.as_query_engine(
similarity_top_k=5,
response_mode="compact", # "tree_summarize", "refine", "compact"
)
response = query_engine.query("How do I set up a CI/CD pipeline?")
print(response.response)
print("Sources:", [n.metadata for n in response.source_nodes])
# Chat engine with memory
chat_engine = index.as_chat_engine(
chat_mode="context", # "condense_question", "react"
system_prompt="You are a DevOps expert.",
)
response = chat_engine.chat("What is Kubernetes?")
response = chat_engine.chat("How does it compare to Docker Swarm?")LangChain vs LlamaIndex — When to Use Which
=============================================
Feature LangChain LlamaIndex
------- --------- ----------
Primary Focus Chain orchestration Data indexing & retrieval
Best For Complex workflows, RAG applications,
multi-step agents knowledge bases
Data Connectors Via integrations 150+ built-in loaders
Agent Support Excellent (LCEL, tools) Good (ReAct, tools)
RAG Quality Good Excellent (advanced chunking)
Learning Curve Moderate Lower for RAG tasks
Language Support Python, TypeScript/JS Python (primary), TS
Community Very large Large
Use LangChain when:
- Building multi-step agent workflows
- Need complex chain composition (LCEL)
- Building chatbots with tool calling
Use LlamaIndex when:
- Primary need is RAG over your data
- Need advanced indexing strategies
- Have diverse data sources to connect
7. Fine-Tuning vs RAG vs Prompt Engineering: Decision Tree
These three techniques are the primary means for AI Engineers to customize LLM behavior. Choosing the right approach can dramatically reduce cost and development time. Follow a simple principle: start with the simplest method and upgrade to more complex ones only when necessary.
Decision Tree: How to Customize LLM Behavior
=============================================
Start here: "What does the model need to learn?"
|
|-- Nothing new, just better outputs?
| --> PROMPT ENGINEERING
| - System prompts, few-shot examples, output formats
| - Cost: $0, Time: hours, Skill: low
|
|-- Needs specific/recent facts or your data?
| --> RAG (Retrieval-Augmented Generation)
| - Vector DB + retrieval pipeline + prompt
| - Cost: $100-$1K setup, Time: days, Skill: medium
|
|-- Needs to change behavior/style/format?
| |-- Is the change simple (e.g., always respond in JSON)?
| | --> PROMPT ENGINEERING (with structured output)
| |
| |-- Is the change complex (domain-specific reasoning)?
| --> FINE-TUNING
| - Prepare training data, train, evaluate
| - Cost: $500-$10K+, Time: weeks, Skill: high
|
|-- Needs both facts AND behavior change?
--> FINE-TUNING + RAG (combined)
- Fine-tune for domain language understanding
- RAG for real-time factual grounding
- Cost: highest, Time: weeks-months
// OpenAI Fine-Tuning API Example
// Step 1: Prepare training data (JSONL format)
// training_data.jsonl:
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Show all users created today"},
// {"role":"assistant","content":"SELECT * FROM users WHERE created_at >= CURRENT_DATE;"}]}
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Count active premium users"},
// {"role":"assistant","content":"SELECT COUNT(*) FROM users WHERE status = 'active' AND plan = 'premium';"}]}
// Step 2: Upload training file
const file = await openai.files.create({
file: fs.createReadStream("training_data.jsonl"),
purpose: "fine-tune",
});
// Step 3: Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
training_file: file.id,
model: "gpt-4o-mini-2024-07-18",
hyperparameters: {
n_epochs: 3,
batch_size: "auto",
learning_rate_multiplier: "auto",
},
});
// Step 4: Use fine-tuned model
// const response = await openai.chat.completions.create({
// model: "ft:gpt-4o-mini-2024-07-18:my-org::abc123",
// messages: [...]
// });
Prompt Engineering vs RAG vs Fine-Tuning
=========================================
Dimension Prompt Eng. RAG Fine-Tuning
--------- ----------- --- -----------
Setup Cost Free Low-Medium High
Time to Implement Hours Days Weeks
Per-Query Cost Base model cost +Retrieval cost 20-50% lower (shorter prompts)
Knowledge Update Change prompt Update vector DB Retrain model
Factual Accuracy Model knowledge High (grounded) Model knowledge
Style/Format Good (examples) Limited Excellent
Source Citations No Yes No
Data Privacy Sent to API Chunks sent Trained into model
Maintenance Easy Medium Complex
Best Starting Point YES Second choice Last resort
8. Embedding Models and Semantic Search
Embedding models convert text into high-dimensional vectors (typically 768-3072 dimensions) such that semantically similar texts are closer together in vector space. Embeddings are the foundation of RAG, semantic search, text classification, and clustering applications. Choosing the right embedding model directly impacts retrieval quality.
// Generating Embeddings with OpenAI
const response = await openai.embeddings.create({
model: "text-embedding-3-small", // or "text-embedding-3-large"
input: "How do I deploy a Next.js application?",
dimensions: 1536, // can reduce for cost savings (e.g., 512)
});
const embedding = response.data[0].embedding;
// embedding = [0.0123, -0.0456, 0.0789, ...] (1536 floats)
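The text-embedding-3 models are trained so their vectors can be shortened; if you truncate client-side instead of passing the `dimensions` parameter, re-normalize to unit length so cosine and dot-product scores stay comparable. A sketch:

```typescript
// Truncate an embedding to `dims` dimensions and rescale to unit (L2) length,
// so dot product and cosine similarity remain meaningful after shortening.
function truncateAndNormalize(embedding: number[], dims: number): number[] {
  const cut = embedding.slice(0, dims);
  const norm = Math.sqrt(cut.reduce((s, x) => s + x * x, 0));
  if (norm === 0) return cut; // degenerate all-zero vector
  return cut.map((x) => x / norm);
}
```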
// Batch embeddings (more efficient)
const batchResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: [
"How to set up Docker containers",
"Kubernetes pod configuration guide",
"CI/CD pipeline with GitHub Actions",
"AWS Lambda serverless functions",
],
});
// batchResponse.data[0].embedding, batchResponse.data[1].embedding, ...
// Semantic Search Implementation
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
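When embeddings are unit-normalized (as most APIs return them), cosine similarity reduces to a dot product, and a brute-force top-K search is a few lines (fine for small corpora; beyond that, use a vector database):

```typescript
// Brute-force top-K: score every document by dot product with the query
// (equivalent to cosine similarity for unit-normalized vectors).
function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function topK(
  queryEmbedding: number[],
  docs: { id: string; embedding: number[] }[],
  k: number,
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: dot(queryEmbedding, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```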
// Compare semantic similarity
const queries = [
"How to deploy to production", // semantically similar
"Production deployment guide", // semantically similar
"Best pizza recipe", // semantically different
];
// Result: queries[0] and queries[1] will have high similarity (~0.92)
// queries[0] and queries[2] will have low similarity (~0.15)
// Hybrid Search: combine semantic + keyword for best results
// score = alpha * semantic_score + (1 - alpha) * bm25_score
// alpha = 0.7 is a good starting point
Embedding Model Comparison
===========================
Model Provider Dims Cost/1M tokens MTEB Score
----- -------- ---- -------------- ----------
text-embedding-3-large OpenAI 3072 $0.13 64.6
text-embedding-3-small OpenAI 1536 $0.02 62.3
voyage-3 Voyage AI 1024 $0.06 67.1
embed-v3.0 Cohere 1024 $0.10 64.8
BGE-large-en-v1.5 BAAI 1024 Free (OSS) 63.5
GTE-Qwen2-7B Alibaba 3584 Free (OSS) 70.2
nomic-embed-text Nomic 768 Free (OSS) 62.4
Tips:
- Start with text-embedding-3-small (best cost/quality ratio)
- Use dimension reduction for cost savings (3072 -> 1024)
- Benchmark on YOUR data, not just MTEB leaderboard
- Open-source models (BGE, GTE) are competitive and free
- Use the same model for indexing and querying
9. AI Agents and Tool Use (Function Calling)
AI agents are LLM applications capable of autonomous planning, reasoning, and using tools to complete complex tasks. Unlike simple Q&A, agents can decompose tasks, call external APIs, execute code, query databases, and dynamically adjust strategies based on intermediate results. Function Calling is the core API mechanism for implementing agent tool use.
// OpenAI Function Calling — Tool Use
const tools = [
{
type: "function",
function: {
name: "search_documentation",
description: "Search technical documentation for a given query",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search query" },
language: {
type: "string",
enum: ["javascript", "python", "rust", "go"],
description: "Programming language filter",
},
},
required: ["query"],
},
},
},
{
type: "function",
function: {
name: "execute_code",
description: "Execute a code snippet and return the output",
parameters: {
type: "object",
properties: {
code: { type: "string", description: "Code to execute" },
language: { type: "string", enum: ["javascript", "python"] },
},
required: ["code", "language"],
},
},
},
];
// Agent loop: Think -> Act -> Observe -> Think
async function agentLoop(userMessage: string) {
const messages: any[] = [
{
role: "system",
content: "You are a helpful coding assistant. Use tools when needed."
},
{ role: "user", content: userMessage }
];
for (let turn = 0; turn < 10; turn++) { // bound the loop so a stuck agent cannot run forever
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages,
tools,
tool_choice: "auto",
});
const message = response.choices[0].message;
messages.push(message);
// If no tool calls, the agent is done
if (!message.tool_calls || message.tool_calls.length === 0) {
return message.content;
}
// Execute each tool call
for (const toolCall of message.tool_calls) {
const args = JSON.parse(toolCall.function.arguments);
let result: string;
if (toolCall.function.name === "search_documentation") {
result = await searchDocs(args.query, args.language);
} else if (toolCall.function.name === "execute_code") {
result = await executeCode(args.code, args.language);
} else {
result = "Unknown tool: " + toolCall.function.name;
}
messages.push({
role: "tool",
tool_call_id: toolCall.id,
content: result,
});
}
}
}
Agent Architecture Patterns
============================
1. ReAct (Reasoning + Acting):
Thought: "I need to find the API rate limits"
Action: search_documentation("rate limits REST API")
Observe: "Rate limits: 100 req/min for free tier..."
Thought: "Now I have the info, I can answer"
Answer: "The API allows 100 requests per minute..."
2. Plan-and-Execute:
Plan: ["Search docs", "Write code", "Test code", "Review"]
Execute: Run each step, re-plan if needed
3. Multi-Agent (Crew/Swarm):
Agent 1 (Researcher): Gathers information
Agent 2 (Developer): Writes code based on research
Agent 3 (Reviewer): Reviews and suggests improvements
Orchestrator: Coordinates agent communication
4. Tool-Augmented Generation:
LLM decides WHEN and WHICH tool to call
Tools: calculator, web search, code exec, DB query, API calls
LLM synthesizes tool results into final answer
10. Guardrails and Safety: Content Filtering, Hallucination Detection
Production LLM applications must have guardrails. Guardrails ensure model outputs are safe, accurate, properly formatted, and comply with business rules. An LLM application without guardrails is like an API without validation — it will eventually break.
// Guardrails Implementation Pattern
// 1. Input Validation — Filter malicious/inappropriate inputs
async function validateInput(userMessage: string): Promise<boolean> {
// Check message length
if (userMessage.length > 10000) {
throw new Error("Message too long");
}
// Prompt injection detection
const injectionPatterns = [
/ignore (all |previous |above )?instructions/i,
/you are now/i,
/system prompt/i,
/reveal your/i,
];
if (injectionPatterns.some(p => p.test(userMessage))) {
throw new Error("Potential prompt injection detected");
}
// Content moderation (OpenAI Moderation API)
const moderation = await openai.moderations.create({
input: userMessage,
});
if (moderation.results[0].flagged) {
throw new Error("Content flagged: " +
Object.entries(moderation.results[0].categories)
.filter(([, v]) => v)
.map(([k]) => k)
.join(", ")
);
}
return true;
}
// 2. Output Validation — Ensure correct format and content
import { z } from "zod";
// Define expected output schema
const ProductRecommendation = z.object({
products: z.array(z.object({
name: z.string(),
reason: z.string().max(200),
confidence: z.number().min(0).max(1),
price_range: z.enum(["budget", "mid-range", "premium"]),
})).min(1).max(5),
disclaimer: z.string(),
});
// Parse and validate LLM output
function validateOutput(llmOutput: string) {
try {
const parsed = JSON.parse(llmOutput);
const validated = ProductRecommendation.parse(parsed);
return { success: true, data: validated };
} catch (error) {
// Retry with corrective prompt or return fallback
return { success: false, error: String(error) };
}
}
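The "retry with corrective prompt" fallback noted in the catch block generalizes to a small wrapper: generate, validate, and on failure feed the validation error back so the model can fix its own output. A sketch; `generate` is a hypothetical stand-in for your actual LLM call:

```typescript
// Generate -> validate -> on failure, append the validation error to the
// prompt and retry, letting the model repair its own output.
async function generateWithRepair<T>(
  generate: (prompt: string) => Promise<string>,
  validate: (raw: string) => { success: boolean; data?: T; error?: string },
  prompt: string,
  maxAttempts = 3,
): Promise<T> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await generate(currentPrompt);
    const result = validate(raw);
    if (result.success && result.data !== undefined) return result.data;
    currentPrompt = prompt +
      "\n\nYour previous output was invalid: " + (result.error ?? "unknown error") +
      "\nReturn ONLY corrected output.";
  }
  throw new Error("Output still invalid after " + maxAttempts + " attempts");
}
```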
// 3. Hallucination Detection for RAG
async function checkFaithfulness(
answer: string,
sources: string[]
): Promise<{ faithful: boolean; issues: string[] }> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content:
"You are a fact-checker. Compare the answer against the source " +
"documents. Identify any claims NOT supported by the sources. " +
"Respond in JSON: {faithful: boolean, issues: string[]}"
},
{
role: "user",
content: "Sources:\n" + sources.join("\n---\n") +
"\n\nAnswer:\n" + answer
}
],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}
Production Guardrails Checklist
================================
Layer Check Priority
----- ----- --------
Input Message length limit CRITICAL
Input Prompt injection detection CRITICAL
Input Content moderation (toxicity) CRITICAL
Input PII detection and redaction HIGH
Input Rate limiting per user HIGH
Output JSON schema validation HIGH
Output Hallucination / faithfulness check HIGH
Output Content safety filter CRITICAL
Output Max output length enforcement MEDIUM
Output Source citation verification MEDIUM
System Cost per request monitoring HIGH
System Latency tracking (P50, P95, P99) HIGH
System Error rate alerting CRITICAL
System Fallback responses for failures HIGH
System Audit logging for compliance MEDIUM
11. Cost Optimization for LLM APIs
LLM API costs are one of the primary operational expenses for production applications. Unoptimized LLM applications can consume thousands to tens of thousands of dollars monthly. Through intelligent caching, model routing, prompt compression, and batch processing, costs can be reduced by 60-80% while maintaining output quality.
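Before optimizing, measure: per-request cost is just token counts multiplied by per-million-token rates. A sketch (prices are parameters here, since they change; the figures in the comment are illustrative):

```typescript
// Per-request cost in USD given token usage and per-1M-token prices.
// Example: 2,000 input + 500 output tokens at $2.50 / $10.00 per 1M
// comes to $0.005 + $0.005 = $0.01.
function requestCostUsd(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,
  outputPricePerM: number,
): number {
  return (inputTokens / 1e6) * inputPricePerM +
         (outputTokens / 1e6) * outputPricePerM;
}
```

Logging this per request (the usage fields shown in section 2 supply the token counts) gives the baseline that every strategy below is measured against.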
// 1. Semantic Caching — Avoid redundant API calls
import { createHash } from "crypto";
// Assumes a getEmbedding(text: string): Promise<number[]> helper that
// calls an embedding API (e.g. text-embedding-3-small).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
class SemanticCache {
private cache: Map<string, { response: string; embedding: number[] }> = new Map();
private similarityThreshold = 0.95;
async get(query: string): Promise<string | null> {
const queryEmbedding = await getEmbedding(query);
for (const [, entry] of this.cache) {
const similarity = cosineSimilarity(queryEmbedding, entry.embedding);
if (similarity >= this.similarityThreshold) {
return entry.response; // Cache hit
}
}
return null; // Cache miss
}
async set(query: string, response: string): Promise<void> {
const embedding = await getEmbedding(query);
const key = createHash("sha256").update(query).digest("hex");
this.cache.set(key, { response, embedding });
}
}

// 2. Model Routing — Use the cheapest model that works
async function routeToModel(query: string): Promise<string> {
// Classify query complexity
const classification = await openai.chat.completions.create({
model: "gpt-4o-mini", // cheap classifier
messages: [{
role: "system",
content:
"Classify the query complexity as SIMPLE, MEDIUM, or COMPLEX. " +
"SIMPLE: factual lookup, formatting, translation. " +
"MEDIUM: summarization, code generation, analysis. " +
"COMPLEX: multi-step reasoning, creative writing, architecture design. " +
"Respond with only the classification word."
}, {
role: "user", content: query
}],
max_tokens: 10,
});
const complexity = classification.choices[0].message.content?.trim();
const modelMap: Record<string, string> = {
SIMPLE: "gpt-4o-mini", // $0.15/1M input
MEDIUM: "gpt-4o-mini", // $0.15/1M input
COMPLEX: "gpt-4o", // $2.50/1M input
};
// Fall back to the cheap model if the classifier returns an unexpected label
const model = modelMap[complexity || "MEDIUM"] ?? "gpt-4o-mini";
console.log("Routing " + query.slice(0, 50) + "... to " + model);
return model;
}

LLM Cost Optimization Strategies
==================================
Strategy Savings Effort Impact on Quality
-------- ------- ------ -----------------
Semantic caching 30-60% Medium None (exact/similar)
Model routing 40-70% Medium Minimal (smart routing)
Prompt compression 10-30% Low Minimal
Batch API (OpenAI) 50% Low None
Reduce max_tokens 5-20% Low None (if set correctly)
Shorter system prompts 5-15% Low Minimal
Streaming (early stop) 10-30% Medium Variable
Open-source models 80-100% High Depends on model
Example Monthly Cost Breakdown (100K queries/month):
Unoptimized (GPT-4o for everything): $5,000
+ Model routing (80% to mini): $1,200 (-76%)
+ Semantic caching (40% hit rate): $720 (-86%)
+ Prompt compression: $580 (-88%)
+ Batch API for async tasks: $450 (-91%)

12. Model Comparison: GPT-4 vs Claude 3 vs Gemini vs Llama 3
Choosing the right model is one of the most critical decisions in AI Engineering. Models differ in reasoning capability, context window, cost, latency, and task-specific performance. Below is a detailed comparison of the leading models.
Model Comparison — Detailed Breakdown (2026)
==============================================
Category GPT-4o Claude 3.5 Gemini 1.5 Llama 3.1
Sonnet Pro 405B
-------- ------ ----------- ----------- ----------
Provider OpenAI Anthropic Google Meta (OSS)
Context Window 128K 200K 1M 128K
Multimodal Text+Image+ Text+Image+ Text+Image+ Text+Image
Audio PDF Video+Audio
Coding Excellent Best Very Good Very Good
Reasoning Excellent Excellent Good Good
Long Context Good Excellent Best Good
Safety Good Best Good Good
Speed (TTFT) Fast Fast Fast Varies
API Ecosystem Best Good Good N/A
Fine-tuning Yes No (yet) Yes Full control
Self-hosting No No No Yes
Data Privacy API only API only API only Full control
Best For:
GPT-4o: General purpose, largest ecosystem, best tooling
Claude 3.5 Sonnet: Coding, long documents, safety-critical apps
Gemini 1.5 Pro: Ultra-long context (books, codebases), multimodal
Llama 3.1 405B: Self-hosting, data privacy, customization
Budget Models:
GPT-4o-mini: Best cheap commercial model
Claude 3.5 Haiku: Fast and affordable, good quality
Gemini 1.5 Flash: Ultra cheap, very fast
Llama 3.1 70B: Best open-source mid-tier, self-hostable

// Multi-Model Strategy: Use the Right Model for Each Task
const MODEL_CONFIG = {
// High-stakes, complex reasoning
complex: {
model: "gpt-4o",
temperature: 0.3,
useCases: ["architecture design", "code review", "legal analysis"],
},
// Long document processing
longContext: {
model: "gemini-1.5-pro",
temperature: 0.2,
useCases: ["codebase analysis", "book summarization", "log analysis"],
},
// Code generation and debugging
coding: {
model: "claude-3-5-sonnet-20241022",
temperature: 0,
useCases: ["code generation", "debugging", "refactoring"],
},
// Simple tasks, high volume
simple: {
model: "gpt-4o-mini",
temperature: 0.5,
useCases: ["classification", "extraction", "formatting"],
},
// Privacy-sensitive, self-hosted
private: {
model: "meta-llama/Llama-3.1-70B-Instruct",
temperature: 0.3,
useCases: ["medical records", "financial data", "internal tools"],
},
};

13. Evaluation and Testing LLM Applications
Evaluation and testing are the most overlooked yet most important aspects of AI Engineering. Because LLM outputs are non-deterministic, traditional testing methods apply only partially: you need a specialized evaluation framework to measure output quality, accuracy, and safety, and to improve continuously through iteration.
// LLM Evaluation Framework
// 1. LLM-as-Judge — Use a strong model to evaluate outputs
async function llmJudge(
question: string,
answer: string,
criteria: string
): Promise<{ score: number; reasoning: string }> {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "system",
content:
"You are an expert evaluator. Score the answer on a scale of " +
"1-5 based on the given criteria. " +
"Respond in JSON: {score: number, reasoning: string}"
}, {
role: "user",
content: "Question: " + question + "\n" +
"Answer: " + answer + "\n" +
"Criteria: " + criteria
}],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}
// Usage:
// await llmJudge(
// "What is Kubernetes?",
// generatedAnswer,
// "Accuracy, completeness, clarity, and conciseness"
// );

// 2. RAG Evaluation Metrics
// Using Ragas framework concepts
interface RAGEvalResult {
faithfulness: number; // Is the answer grounded in sources?
answerRelevancy: number; // Does it actually answer the question?
contextPrecision: number; // Are retrieved docs relevant?
contextRecall: number; // Did we retrieve all relevant docs?
}
async function evaluateRAG(
question: string,
answer: string,
contexts: string[],
groundTruth: string
): Promise<RAGEvalResult> {
// Faithfulness: does the answer use only info from contexts?
const faithfulness = await llmJudge(
question,
answer,
"Score 1-5: Is every claim in the answer supported by the " +
"provided context? Penalize any unsupported claims heavily."
);
// Answer relevancy: does it address the question?
const relevancy = await llmJudge(
question,
answer,
"Score 1-5: Does the answer directly address the question? " +
"Penalize off-topic content and missing key points."
);
return {
faithfulness: faithfulness.score / 5,
answerRelevancy: relevancy.score / 5,
contextPrecision: 0, // computed via retrieval metrics
contextRecall: 0, // computed against ground truth
};
}

LLM Testing Strategy
=====================
Test Type What It Tests Tools
--------- ------------- -----
Unit Tests Prompt template outputs pytest, vitest
Integration Tests Full RAG pipeline Ragas, DeepEval
Regression Tests Output consistency Golden datasets
A/B Tests Model/prompt comparison LangSmith, Braintrust
Red Team Tests Safety, edge cases Garak, manual
Load Tests Latency under load k6, Locust
Cost Tests Budget compliance Custom monitoring
Evaluation Tools:
Ragas - Open-source RAG evaluation framework
DeepEval - LLM evaluation with multiple metrics
LangSmith - Tracing, debugging, evaluation (LangChain)
Phoenix - LLM observability (Arize AI)
Braintrust - LLM evaluation and monitoring platform
Promptfoo - Open-source prompt testing CLI
Golden Rule: Never deploy without:
1. A golden evaluation dataset (50-200 examples)
2. Automated regression tests in CI
3. Production monitoring (latency, cost, errors)
4. Human review for high-stakes outputs

Frequently Asked Questions
What is the difference between an AI Engineer and an ML Engineer?
ML Engineers focus on training and optimizing machine learning models from scratch — data collection, feature engineering, model training, hyperparameter tuning. AI Engineers focus on building applications using pre-trained LLMs — integrating models via APIs, prompt engineering, RAG, fine-tuning, and orchestration frameworks. AI Engineering is more application-layer; ML Engineering is more model-layer.
When should I use RAG vs fine-tuning?
Prefer RAG when: you need up-to-date information, need source citations, data changes frequently, or need explainability. Choose fine-tuning when: you need to change the model tone/style/format, need to learn domain-specific reasoning patterns, or RAG retrieval quality is insufficient. They can be combined — fine-tune to help the model understand domain language better, then use RAG to supply specific facts. Cost-wise, RAG has lower upfront cost but per-query retrieval overhead; fine-tuning requires higher upfront training cost but simpler inference.
How do I choose a vector database?
The choice depends on scale and infrastructure. Pinecone: fully managed, zero ops, good for prototypes and medium scale. Weaviate: open-source, hybrid search (vector + keyword), rich module ecosystem. Qdrant: open-source, Rust-based with excellent performance, powerful filtering. pgvector: PostgreSQL extension, ideal for teams with existing PG infrastructure, performs well under a few million vectors. Chroma: lightweight, great for local development and prototyping. For billion-scale vectors, consider Milvus.
How can I reduce LLM API costs?
Key strategies: 1) Semantic caching — cache responses for similar queries to avoid redundant calls; 2) Model routing — use smaller models (GPT-4o-mini/Claude Haiku) for simple tasks, large models only for complex ones; 3) Prompt optimization — reduce unnecessary tokens (trim system prompts, compress context); 4) Batch APIs — use batch endpoints for non-realtime scenarios, saving up to 50%; 5) Open-source models — self-host Llama 3 or Mistral for latency-sensitive and privacy-critical workloads; 6) Output length limits — set max_tokens to prevent verbose responses.
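The savings from caching can be sanity-checked with a back-of-envelope cost model. This helper and its numbers are illustrative only, not real pricing data:

```typescript
// Rough monthly cost estimate: billable queries × tokens × price-per-token,
// where cache hits are free. All numbers illustrative.
function monthlyCost(
  queries: number,
  avgTokensPerQuery: number,
  pricePerMillionTokens: number,
  cacheHitRate = 0
): number {
  const billable = queries * (1 - cacheHitRate);
  return (billable * avgTokensPerQuery * pricePerMillionTokens) / 1_000_000;
}
// 100K queries/month at 2K tokens each on a $2.50/1M-token model:
// monthlyCost(100_000, 2_000, 2.5)      // ≈ $500 (no caching)
// monthlyCost(100_000, 2_000, 2.5, 0.4) // ≈ $300 (40% cache hit rate)
```

The same shape of calculation applies to model routing: replace the single price with a weighted average across the models traffic is routed to.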
What is the difference between LangChain and LlamaIndex?
LangChain is a general-purpose LLM application orchestration framework, excelling at building complex Chains and Agents — ideal for multi-step workflows, tool calling, and conversation management. LlamaIndex (formerly GPT Index) specializes in data indexing and retrieval, connecting diverse data sources and building high-quality RAG pipelines. In practice they are often used together: LlamaIndex handles data ingestion and retrieval while LangChain handles chain orchestration and agent logic. If your core need is RAG, start with LlamaIndex; if you need complex multi-step workflows, start with LangChain.
What are AI agents and how do they differ from regular LLM calls?
AI agents are LLM applications that can autonomously plan, use tools, and execute multi-step tasks. A regular LLM call is single-turn input-output, whereas agents can: 1) decompose complex tasks into sub-steps; 2) call external tools (search engines, databases, APIs, code executors); 3) observe tool results and decide the next action; 4) iterate until the task is complete. The core pattern is the ReAct (Reasoning + Acting) loop: Think, Act, Observe, Think. Function Calling is the primary API mechanism for implementing tool use.
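The Think → Act → Observe loop can be sketched without any API at all. In this sketch the "LLM" is a plain function so the control flow stays visible; all names are hypothetical:

```typescript
// Minimal ReAct-style control loop. In production, policy would be a
// chat-model call that returns either a tool action or a final answer.
type AgentStep =
  | { type: "act"; tool: string; input: string }
  | { type: "finish"; answer: string };

function runAgent(
  policy: (observations: string[]) => AgentStep, // stands in for the LLM
  tools: Record<string, (input: string) => string>,
  maxSteps = 5
): string {
  const observations: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = policy(observations);           // Think
    if (step.type === "finish") return step.answer;
    const result = tools[step.tool](step.input); // Act
    observations.push(result);                   // Observe
  }
  return "max steps reached";
}
```

A real agent replaces `policy` with a Function Calling request: the model's tool_calls become the "act" branch, and tool results are appended to the message history as the observations.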
How do I evaluate and test LLM applications?
Multi-layer evaluation approach: 1) Offline evaluation — use labeled datasets to compute accuracy, F1, BLEU/ROUGE metrics; 2) LLM-as-Judge — use a powerful LLM (e.g., GPT-4) to evaluate another model output quality; 3) RAG-specific metrics — retrieval relevance (Precision@K), answer faithfulness (grounded in retrieved content), answer relevance; 4) Human evaluation — manual review and A/B testing for critical scenarios; 5) Online monitoring — track latency, cost, user feedback, error rates. Recommended tools: Ragas (RAG evaluation), DeepEval, LangSmith (tracing and debugging), Phoenix (observability).
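Of the RAG metrics listed, Precision@K is simple enough to compute directly. A hypothetical helper (frameworks like Ragas compute this and more for you):

```typescript
// Precision@K: of the top-K retrieved document ids, what fraction is
// actually relevant? 1.0 means every retrieved doc was useful.
function precisionAtK(
  retrievedIds: string[],
  relevantIds: Set<string>,
  k: number
): number {
  const topK = retrievedIds.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevantIds.has(id)).length;
  return hits / topK.length;
}
// precisionAtK(["d1", "d2", "d3"], new Set(["d1", "d3"]), 2) // => 0.5
```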
How do I handle LLM hallucinations?
Strategies to reduce hallucinations: 1) RAG — ground the model in retrieved real documents rather than relying on memory; 2) System prompt constraints — explicitly instruct "answer only based on the provided context, say you do not know if unsure"; 3) Temperature parameter — lower temperature (0.0-0.3) to reduce randomness; 4) Structured output — use JSON Schema or function calling to constrain output format; 5) Self-verification — have the model verify its own answer is supported by evidence after generation; 6) Fact-checking pipeline — automated factual verification of outputs; 7) Source citation — require the model to cite specific document passages.
AI Engineering Core Concepts Quick Reference
AI Engineering — Quick Reference
=================================
Concept Description
------- -----------
Prompt Engineering Designing inputs to get desired LLM outputs
System Prompt Instructions that set model role and constraints
Few-Shot Learning Providing examples in the prompt for guidance
Chain-of-Thought (CoT) Asking the model to reason step-by-step
RAG Retrieve relevant docs, then generate answers
Vector Database Storage optimized for similarity search
Embedding Dense vector representation of text meaning
Chunking Splitting documents into smaller pieces
Fine-Tuning Training a model further on custom data
Function Calling LLM deciding when and how to call tools
AI Agent LLM that can plan, use tools, and iterate
ReAct Pattern Think -> Act -> Observe -> Think loop
Guardrails Input/output validation and safety filters
Hallucination Model generating false or unsupported info
Semantic Caching Cache responses for semantically similar queries
Model Routing Directing queries to the best-fit model
LLM-as-Judge Using a strong LLM to evaluate outputs
LCEL (LangChain) LangChain Expression Language for chains
Temperature Controls randomness (0=deterministic, 1=creative)
Token Basic unit of text processed by LLMs
Context Window Maximum tokens a model can process at once
Structured Output Constraining LLM output to JSON/schema
Batch API Processing multiple requests at reduced cost
Multimodal Models that process text, image, audio, video