TL;DR
AI Engineering is the discipline of building production applications on top of large language models. Master prompt engineering first (system prompts, few-shot, chain-of-thought), then add RAG with vector databases when the model needs domain knowledge, and only fine-tune when prompt engineering plus RAG are insufficient. Use LangChain or LlamaIndex to orchestrate chains and agents. Choose between OpenAI GPT-4o, Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro, or open-source Llama 3 based on cost, latency, context window, and task requirements. Always implement guardrails, evaluation pipelines, and cost monitoring before going to production.
Key Takeaways
- AI Engineering focuses on building applications on top of pre-trained LLMs, not training models from scratch; core skills include prompt engineering, RAG, API integration, and orchestration frameworks.
- Decision order: optimize prompts first, add RAG if that is insufficient, fine-tune only as a last resort — the large majority of applications can be served with prompt engineering plus RAG alone.
- Vector databases are the core component of RAG; choose by scale: pgvector (up to a few million vectors), Pinecone (fully managed), Weaviate/Qdrant (open-source at larger scale).
- AI agents achieve autonomous multi-step task execution through ReAct loops (Reasoning + Acting) and Function Calling.
- Guardrails are essential: content filtering, hallucination detection, output validation, rate limiting, and cost monitoring must be in place before production.
- Cost optimization keys: semantic caching, model routing (small models for simple tasks), prompt compression, and batch APIs can reduce costs by 60-80%.
1. What Is AI Engineering? (vs ML Engineering vs Data Science)
AI Engineering is an emerging discipline focused on building production applications using pre-trained large language models (LLMs). It sits at the intersection of software engineering and machine learning, but differs significantly from traditional ML Engineering and Data Science. AI Engineers do not train models from scratch — instead they integrate LLM capabilities into products through API calls, prompt engineering, Retrieval-Augmented Generation (RAG), and orchestration frameworks.
The explosive growth of LLMs in 2023-2024 gave rise to this role. As models like GPT-4, Claude 3, and Gemini became increasingly powerful, many application scenarios no longer required custom-trained models — what they needed instead was engineers who know how to use these models effectively. The core value of an AI Engineer is transforming LLM capabilities into reliable, scalable product features.
AI Engineering vs ML Engineering vs Data Science
=================================================
Role Focus Key Skills Output
---- ----- ---------- ------
Data Scientist Analysis & insights Statistics, SQL, Reports, dashboards,
Python, visualization predictive models
ML Engineer Model training PyTorch, TensorFlow, Trained models,
& deployment MLOps, feature eng. ML pipelines
AI Engineer LLM application Prompt eng., RAG, AI-powered products,
development LangChain, APIs chatbots, agents
Key Differences:
- Data Scientist: "What does the data tell us?"
- ML Engineer: "How do we train and serve a model?"
- AI Engineer: "How do we build a product with an LLM?"2. LLM APIs: OpenAI, Anthropic Claude, Google Gemini
LLM APIs are the foundation of AI Engineering. The three major providers each have strengths: OpenAI GPT-4o has the most mature ecosystem; Anthropic Claude 3.5 Sonnet leads in long-context understanding and safety; Google Gemini 1.5 Pro offers an ultra-long context window (1M tokens) with deep Google Cloud integration.
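Whichever provider you choose, production calls should tolerate transient failures (429 rate limits, 5xx errors, timeouts). A minimal retry sketch; `callModel` is a hypothetical stand-in for any of the SDK calls shown below:

```typescript
// Exponential backoff with full jitter: delays grow 1s, 2s, 4s, ... capped at 30s.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp / 2 + Math.random() * (exp / 2); // jitter within [exp/2, exp)
}

// Retry a model call, waiting between attempts; rethrows the last error.
async function withRetries<T>(
  callModel: () => Promise<T>,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await callModel();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // retries exhausted
      await new Promise((r) => setTimeout(r, backoffDelayMs(attempt)));
    }
  }
}
```

In real code you would retry only retryable failures (429/5xx) and respect any retry-after hint the API returns.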
// OpenAI API — Chat Completion
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain async/await in JavaScript" }
],
temperature: 0.7,
max_tokens: 1000,
});
console.log(response.choices[0].message.content);
console.log("Tokens used:", response.usage.total_tokens);// Anthropic Claude API
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
const message = await anthropic.messages.create({
model: "claude-3-5-sonnet-20241022",
max_tokens: 1024,
system: "You are a senior software architect.",
messages: [
{ role: "user", content: "Design a rate limiter for a REST API" }
],
});
console.log(message.content[0].text);
console.log("Input tokens:", message.usage.input_tokens);
console.log("Output tokens:", message.usage.output_tokens);// Google Gemini API
import { GoogleGenerativeAI } from "@google/generative-ai";
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });
const result = await model.generateContent({
contents: [{
role: "user",
parts: [{ text: "Compare microservices vs monolith architecture" }]
}],
generationConfig: {
temperature: 0.7,
maxOutputTokens: 1000,
},
});
console.log(result.response.text());
LLM API Comparison (2026)
==========================
Provider Model Context Input Cost Output Cost Strengths
-------- ----- ------- ---------- ----------- ---------
OpenAI GPT-4o 128K $2.50/1M $10.00/1M Best ecosystem, multimodal
OpenAI GPT-4o-mini 128K $0.15/1M $0.60/1M Cheapest smart model
Anthropic Claude 3.5 Sonnet 200K $3.00/1M $15.00/1M Long context, safety, coding
Anthropic Claude 3.5 Haiku 200K $0.25/1M $1.25/1M Fast and affordable
Google Gemini 1.5 Pro 1M $1.25/1M $5.00/1M Longest context window
Google Gemini 1.5 Flash 1M $0.075/1M $0.30/1M Ultra fast and cheap
Meta Llama 3.1 405B 128K Self-host Self-host Open source, no API cost
Meta Llama 3.1 70B 128K Self-host Self-host Best open-source mid-tier
Mistral Mixtral 8x22B 64K $2.00/1M $6.00/1M Strong EU-based option
3. Prompt Engineering: Few-Shot, Chain-of-Thought, and System Prompts
Prompt engineering is the most essential skill for AI Engineers. By carefully designing input prompts, you can dramatically improve model output quality without modifying the model itself. Mastering system prompts (setting roles and constraints), few-shot learning (providing examples for guidance), and chain-of-thought (guiding step-by-step reasoning) is a required course for every AI Engineer.
// 1. System Prompt — Setting Role and Constraints
const messages = [
{
role: "system",
content: `You are a senior TypeScript developer.
Rules:
- Always use strict TypeScript types (no "any")
- Include error handling for all async operations
- Add JSDoc comments for public functions
- If you are unsure, say "I am not certain" rather than guessing
- Format output as markdown code blocks`
},
{ role: "user", content: "Write a function to fetch user data from an API" }
];
// 2. Few-Shot Learning — Providing Examples
const fewShotMessages = [
{
role: "system",
content: "You classify customer support tickets into categories."
},
// Example 1
{ role: "user", content: "My order hasn't arrived yet, it's been 2 weeks" },
{ role: "assistant", content: "Category: SHIPPING\nPriority: HIGH\nSentiment: FRUSTRATED" },
// Example 2
{ role: "user", content: "How do I change my password?" },
{ role: "assistant", content: "Category: ACCOUNT\nPriority: LOW\nSentiment: NEUTRAL" },
// Example 3
{ role: "user", content: "Your product is amazing, saved me hours!" },
{ role: "assistant", content: "Category: FEEDBACK\nPriority: LOW\nSentiment: POSITIVE" },
// Actual query
{ role: "user", content: "I was charged twice for my subscription" }
];
// 3. Chain-of-Thought (CoT) — Step-by-Step Reasoning
const cotPrompt = {
role: "user",
content: `Analyze whether this API design follows REST best practices.
API Endpoint: POST /api/users/123/delete
Think step by step:
1. Check the HTTP method appropriateness
2. Evaluate the URL structure
3. Assess resource naming conventions
4. Check for idempotency considerations
5. Provide your final assessment with improvements`
};
// CoT Variations:
// - "Let's think step by step" (zero-shot CoT)
// - "Think through this carefully before answering" (implicit CoT)
// - Provide worked examples with reasoning (few-shot CoT)
// - "First analyze X, then consider Y, finally conclude Z" (structured CoT)Prompt Engineering Techniques — Quick Reference
================================================
Technique When to Use Example
--------- ----------- -------
Zero-shot Simple, well-defined tasks "Translate to French: Hello"
Few-shot Classification, formatting Provide 3-5 input/output pairs
Chain-of-Thought Math, logic, complex reasoning "Think step by step..."
System Prompt Role, tone, constraints "You are a legal expert..."
Output Format Structured data needed "Respond in JSON format..."
Self-consistency High-stakes decisions Generate 5 answers, take majority
Tree of Thought Complex problem solving Explore multiple solution paths
ReAct Tool use, agents Think -> Act -> Observe loop
Retrieval-Augmented Domain-specific knowledge Inject relevant docs into context
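Of the rows above, self-consistency is simple to implement once you have N sampled answers (e.g. the same prompt at temperature ~0.7, five times): normalize and take the majority. A sketch, with the sampling calls themselves omitted:

```typescript
// Self-consistency: majority vote over several sampled answers.
// Normalizing (trim + lowercase) collapses trivial formatting differences.
function majorityAnswer(answers: string[]): string {
  const counts = new Map<string, { raw: string; n: number }>();
  for (const a of answers) {
    const key = a.trim().toLowerCase();
    const entry = counts.get(key) ?? { raw: a.trim(), n: 0 };
    entry.n += 1;
    counts.set(key, entry);
  }
  let best = { raw: "", n: 0 };
  for (const entry of counts.values()) {
    if (entry.n > best.n) best = entry; // keep the most frequent answer
  }
  return best.raw;
}
```

This works best for tasks with a short, canonical answer (math results, labels); for free-form text you would vote on an extracted final answer instead.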
Anti-patterns to Avoid:
- Vague instructions ("make it better")
- Overloading context (irrelevant information)
- No output format specification
- Not handling edge cases in prompt
- Using negation ("don't do X") instead of affirmation ("do Y")
4. RAG (Retrieval-Augmented Generation) Architecture
RAG is one of the most important architectural patterns in AI Engineering today. By retrieving relevant documents from an external knowledge base before generating answers, it solves two core LLM problems: knowledge cutoff dates and hallucinations. RAG enables models to answer questions grounded in up-to-date, domain-specific real data while providing traceable source citations.
RAG Architecture — Data Flow
============================
INDEXING PHASE (offline, one-time):
┌──────────┐ ┌──────────┐ ┌───────────┐ ┌───────────────┐
│ Documents │ -> │ Chunking │ -> │ Embedding │ -> │ Vector Store │
│ (PDF,Web, │ │ (split │ │ Model │ │ (Pinecone, │
│ Notion) │ │ text) │ │ (ada-002) │ │ pgvector) │
└──────────┘ └──────────┘ └───────────┘ └───────────────┘
QUERY PHASE (online, per-request):
┌──────────┐ ┌───────────┐ ┌───────────────┐
│ User │ -> │ Embed │ -> │ Vector Search │
│ Question │ │ Query │ │ (top-K docs) │
└──────────┘ └───────────┘ └───────┬───────┘
│
v
┌──────────┐ ┌───────────────────────────────┐
│ Answer │ <- │ LLM (query + retrieved docs) │
│ + Sources│ │ "Based on context, answer..." │
└──────────┘ └───────────────────────────────┘
// Complete RAG Pipeline in TypeScript
import { OpenAIEmbeddings } from "@langchain/openai";
import { PineconeStore } from "@langchain/pinecone";
import { Pinecone } from "@pinecone-database/pinecone";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Step 1: Document Chunking
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1000, // characters per chunk
chunkOverlap: 200, // overlap between chunks
separators: ["\n\n", "\n", ". ", " "], // split priority
});
const chunks = await splitter.splitDocuments(documents);
// Step 2: Generate Embeddings & Store
const embeddings = new OpenAIEmbeddings({
model: "text-embedding-3-small", // 1536 dimensions
});
const pinecone = new Pinecone();
const index = pinecone.Index("my-rag-index");
const vectorStore = await PineconeStore.fromDocuments(
chunks,
embeddings,
{ pineconeIndex: index }
);
// Step 3: Query — Retrieve + Generate
async function ragQuery(question: string) {
// Retrieve top 5 most relevant chunks
const relevantDocs = await vectorStore.similaritySearch(question, 5);
// Build context from retrieved documents
const context = relevantDocs
.map((doc, i) => "Source " + (i + 1) + ": " + doc.pageContent)
.join("\n\n");
// Generate answer with context
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [
{
role: "system",
content: "Answer the question based ONLY on the provided context. " +
"If the context does not contain the answer, say so. " +
"Cite your sources using [Source N] format."
},
{
role: "user",
content: "Context:\n" + context + "\n\nQuestion: " + question
}
],
temperature: 0.2, // low temp for factual answers
});
return {
answer: response.choices[0].message.content,
sources: relevantDocs.map(d => d.metadata),
};
}
Chunking Strategies
====================
Strategy Chunk Size Overlap Best For
-------- ---------- ------- --------
Fixed-size 500-1000 100-200 General purpose, simple
Recursive character 500-1500 100-300 Most text documents
Sentence-based 1-5 sent. 1 sent. FAQ, precise retrieval
Semantic Varies N/A Topic-coherent chunks
Parent-child Small+Large N/A Retrieve small, pass large
Markdown header By section N/A Technical documentation
Code-aware By function N/A Source code files
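The fixed-size strategy in the table is simple enough to hand-roll when you don't want a framework dependency; a sketch using character windows with overlap:

```typescript
// Fixed-size chunking: slide a window of `size` characters, stepping by
// `size - overlap` so consecutive chunks share `overlap` characters.
function chunkFixed(text: string, size = 1000, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than chunk size");
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

A production chunker would additionally snap boundaries to the nearest sentence or paragraph break, which is exactly what the recursive splitter above does.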
Tips:
- Smaller chunks = more precise retrieval but less context
- Larger chunks = more context but noisier retrieval
- Overlap prevents information loss at chunk boundaries
- Test with your actual data to find the optimal size
5. Vector Databases: Pinecone, Weaviate, Qdrant, and pgvector
Vector databases are the core storage layer in RAG architectures. They are purpose-built for similarity search over high-dimensional vectors, capable of finding the most similar results from millions of vectors in milliseconds. Choosing the right vector database depends on your scale, infrastructure preferences, and feature requirements.
// Pinecone — Fully Managed Vector Database
import { Pinecone } from "@pinecone-database/pinecone";
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index("my-index");
// Upsert vectors
await index.upsert([
{
id: "doc-1",
values: embedding, // float array [0.1, -0.2, ...]
metadata: { source: "docs/api.md", section: "auth" },
},
]);
// Query with metadata filter
const results = await index.query({
vector: queryEmbedding,
topK: 5,
filter: { source: { "$eq": "docs/api.md" } },
includeMetadata: true,
});
// pgvector — PostgreSQL Extension (great for existing PG users)
// SQL setup:
// CREATE EXTENSION vector;
// CREATE TABLE documents (
// id SERIAL PRIMARY KEY,
// content TEXT,
// metadata JSONB,
// embedding vector(1536) -- dimension matches your model
// );
// CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);
import pg from "pg";
const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
// Insert document with embedding
await pool.query(
"INSERT INTO documents (content, metadata, embedding) VALUES (\$1, \$2, \$3)",
[content, JSON.stringify(metadata), JSON.stringify(embedding)]
);
// Similarity search (cosine distance)
const result = await pool.query(
`SELECT content, metadata,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5`,
[JSON.stringify(queryEmbedding)]
);
Vector Database Comparison
===========================
Database Type Best For Pricing Max Scale
-------- ---- -------- ------- ---------
Pinecone Managed Zero-ops, quick start Pay-per-use Billions
Weaviate OSS/Cloud Hybrid search, modules Free/Managed Billions
Qdrant OSS/Cloud Filtering, performance Free/Managed Billions
pgvector PG Ext. Existing PG infra Free (self) Millions
Chroma OSS Local dev, prototyping Free Millions
Milvus OSS Ultra-large scale Free/Managed Trillions
FAISS Library In-memory, research Free Billions
Distance Metrics:
- Cosine similarity: normalized, most common for text embeddings
- Euclidean (L2): absolute distance, good for image features
- Dot product: fastest, works when vectors are normalized
6. LangChain and LlamaIndex Frameworks
LangChain and LlamaIndex are the two most popular LLM application development frameworks. LangChain is a general-purpose orchestration framework ideal for complex multi-step workflows and agents; LlamaIndex specializes in data connection and retrieval, making it the top choice for RAG applications. They can be used complementarily.
// LangChain — Chain with Prompt Template
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
const model = new ChatOpenAI({ model: "gpt-4o", temperature: 0 });
// Simple chain: prompt -> model -> parser
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a technical writer. Write clear, concise explanations."],
["user", "Explain {topic} for {audience}"],
]);
const chain = RunnableSequence.from([
prompt,
model,
new StringOutputParser(),
]);
const result = await chain.invoke({
topic: "Kubernetes pods",
audience: "junior developers",
});
console.log(result);
// LangChain — RAG Chain with Retriever
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
const retriever = vectorStore.asRetriever({ k: 5 });
const ragPrompt = ChatPromptTemplate.fromMessages([
["system",
"Answer based on the following context only.\n" +
"If the context does not help, say you do not know.\n" +
"Context: {context}"],
["user", "{input}"],
]);
const documentChain = await createStuffDocumentsChain({
llm: model,
prompt: ragPrompt,
});
const ragChain = await createRetrievalChain({
retriever,
combineDocsChain: documentChain,
});
const answer = await ragChain.invoke({
input: "How do I configure SSL in Nginx?",
});
console.log(answer.answer);
console.log("Sources:", answer.context.map(d => d.metadata.source));# LlamaIndex — RAG Pipeline (Python)
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings,
)
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Load documents and build index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query with RAG
query_engine = index.as_query_engine(
similarity_top_k=5,
response_mode="compact", # "tree_summarize", "refine", "compact"
)
response = query_engine.query("How do I set up a CI/CD pipeline?")
print(response.response)
print("Sources:", [n.metadata for n in response.source_nodes])
# Chat engine with memory
chat_engine = index.as_chat_engine(
chat_mode="context", # "condense_question", "react"
system_prompt="You are a DevOps expert.",
)
response = chat_engine.chat("What is Kubernetes?")
response = chat_engine.chat("How does it compare to Docker Swarm?")LangChain vs LlamaIndex — When to Use Which
=============================================
Feature LangChain LlamaIndex
------- --------- ----------
Primary Focus Chain orchestration Data indexing & retrieval
Best For Complex workflows, RAG applications,
multi-step agents knowledge bases
Data Connectors Via integrations 150+ built-in loaders
Agent Support Excellent (LCEL, tools) Good (ReAct, tools)
RAG Quality Good Excellent (advanced chunking)
Learning Curve Moderate Lower for RAG tasks
Language Support Python, TypeScript/JS Python (primary), TS
Community Very large Large
Use LangChain when:
- Building multi-step agent workflows
- Need complex chain composition (LCEL)
- Building chatbots with tool calling
Use LlamaIndex when:
- Primary need is RAG over your data
- Need advanced indexing strategies
- Have diverse data sources to connect
7. Fine-Tuning vs RAG vs Prompt Engineering: Decision Tree
These three techniques are the primary means for AI Engineers to customize LLM behavior. Choosing the right approach can dramatically reduce cost and development time. Follow a simple principle: start with the simplest method and upgrade to more complex ones only when necessary.
Decision Tree: How to Customize LLM Behavior
=============================================
Start here: "What does the model need to learn?"
|
|-- Nothing new, just better outputs?
| --> PROMPT ENGINEERING
| - System prompts, few-shot examples, output formats
| - Cost: $0, Time: hours, Skill: low
|
|-- Needs specific/recent facts or your data?
| --> RAG (Retrieval-Augmented Generation)
| - Vector DB + retrieval pipeline + prompt
| - Cost: $100-$1K setup, Time: days, Skill: medium
|
|-- Needs to change behavior/style/format?
| |-- Is the change simple (e.g., always respond in JSON)?
| | --> PROMPT ENGINEERING (with structured output)
| |
| |-- Is the change complex (domain-specific reasoning)?
| --> FINE-TUNING
| - Prepare training data, train, evaluate
| - Cost: $500-$10K+, Time: weeks, Skill: high
|
|-- Needs both facts AND behavior change?
--> FINE-TUNING + RAG (combined)
- Fine-tune for domain language understanding
- RAG for real-time factual grounding
- Cost: highest, Time: weeks-months
// OpenAI Fine-Tuning API Example
// Step 1: Prepare training data (JSONL format)
// training_data.jsonl:
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Show all users created today"},
// {"role":"assistant","content":"SELECT * FROM users WHERE created_at >= CURRENT_DATE;"}]}
// {"messages": [{"role":"system","content":"You are a SQL expert"},
// {"role":"user","content":"Count active premium users"},
// {"role":"assistant","content":"SELECT COUNT(*) FROM users WHERE status = 'active' AND plan = 'premium';"}]}
// Step 2: Upload training file
const file = await openai.files.create({
file: fs.createReadStream("training_data.jsonl"),
purpose: "fine-tune",
});
// Step 3: Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
training_file: file.id,
model: "gpt-4o-mini-2024-07-18",
hyperparameters: {
n_epochs: 3,
batch_size: "auto",
learning_rate_multiplier: "auto",
},
});
// Step 4: Use fine-tuned model
// const response = await openai.chat.completions.create({
// model: "ft:gpt-4o-mini-2024-07-18:my-org::abc123",
// messages: [...]
// });
Prompt Engineering vs RAG vs Fine-Tuning
=========================================
Dimension Prompt Eng. RAG Fine-Tuning
--------- ----------- --- -----------
Setup Cost Free Low-Medium High
Time to Implement Hours Days Weeks
Per-Query Cost Base model cost +Retrieval cost 20-50% lower (shorter prompts)
Knowledge Update Change prompt Update vector DB Retrain model
Factual Accuracy Model knowledge High (grounded) Model knowledge
Style/Format Good (examples) Limited Excellent
Source Citations No Yes No
Data Privacy Sent to API Chunks sent Trained into model
Maintenance Easy Medium Complex
Best Starting Point YES Second choice Last resort
8. Embedding Models and Semantic Search
Embedding models convert text into high-dimensional vectors (typically 768-3072 dimensions) such that semantically similar texts are closer together in vector space. Embeddings are the foundation of RAG, semantic search, text classification, and clustering applications. Choosing the right embedding model directly impacts retrieval quality.
// Generating Embeddings with OpenAI
const response = await openai.embeddings.create({
model: "text-embedding-3-small", // or "text-embedding-3-large"
input: "How do I deploy a Next.js application?",
dimensions: 1536, // can reduce for cost savings (e.g., 512)
});
const embedding = response.data[0].embedding;
// embedding = [0.0123, -0.0456, 0.0789, ...] (1536 floats)
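The text-embedding-3 models are trained so their vectors can be shortened; if you truncate client-side instead of passing the `dimensions` parameter, re-normalize to unit length so cosine and dot-product scores stay comparable. A sketch:

```typescript
// Truncate an embedding to `dims` dimensions and rescale to unit (L2) length,
// so dot product and cosine similarity remain meaningful after shortening.
function truncateAndNormalize(embedding: number[], dims: number): number[] {
  const cut = embedding.slice(0, dims);
  const norm = Math.sqrt(cut.reduce((s, x) => s + x * x, 0));
  if (norm === 0) return cut; // degenerate all-zero vector
  return cut.map((x) => x / norm);
}
```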
// Batch embeddings (more efficient)
const batchResponse = await openai.embeddings.create({
model: "text-embedding-3-small",
input: [
"How to set up Docker containers",
"Kubernetes pod configuration guide",
"CI/CD pipeline with GitHub Actions",
"AWS Lambda serverless functions",
],
});
// batchResponse.data[0].embedding, batchResponse.data[1].embedding, ...
// Semantic Search Implementation
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
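When embeddings are unit-normalized (as most APIs return them), cosine similarity reduces to a dot product, and a brute-force top-K search is a few lines (fine for small corpora; beyond that, use a vector database):

```typescript
// Brute-force top-K: score every document by dot product with the query
// (equivalent to cosine similarity for unit-normalized vectors).
function dot(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function topK(
  queryEmbedding: number[],
  docs: { id: string; embedding: number[] }[],
  k: number,
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: dot(queryEmbedding, d.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```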
// Compare semantic similarity
const queries = [
"How to deploy to production", // semantically similar
"Production deployment guide", // semantically similar
"Best pizza recipe", // semantically different
];
// Result: queries[0] and queries[1] will have high similarity (~0.92)
// queries[0] and queries[2] will have low similarity (~0.15)
// Hybrid Search: combine semantic + keyword for best results
// score = alpha * semantic_score + (1 - alpha) * bm25_score
// alpha = 0.7 is a good starting point
Embedding Model Comparison
===========================
Model Provider Dims Cost/1M tokens MTEB Score
----- -------- ---- -------------- ----------
text-embedding-3-large OpenAI 3072 $0.13 64.6
text-embedding-3-small OpenAI 1536 $0.02 62.3
voyage-3 Voyage AI 1024 $0.06 67.1
embed-v3.0 Cohere 1024 $0.10 64.8
BGE-large-en-v1.5 BAAI 1024 Free (OSS) 63.5
GTE-Qwen2-7B Alibaba 3584 Free (OSS) 70.2
nomic-embed-text Nomic 768 Free (OSS) 62.4
Tips:
- Start with text-embedding-3-small (best cost/quality ratio)
- Use dimension reduction for cost savings (3072 -> 1024)
- Benchmark on YOUR data, not just MTEB leaderboard
- Open-source models (BGE, GTE) are competitive and free
- Use the same model for indexing and querying
9. AI Agents and Tool Use (Function Calling)
AI agents are LLM applications capable of autonomous planning, reasoning, and using tools to complete complex tasks. Unlike simple Q&A, agents can decompose tasks, call external APIs, execute code, query databases, and dynamically adjust strategies based on intermediate results. Function Calling is the core API mechanism for implementing agent tool use.
// OpenAI Function Calling — Tool Use
const tools = [
{
type: "function",
function: {
name: "search_documentation",
description: "Search technical documentation for a given query",
parameters: {
type: "object",
properties: {
query: { type: "string", description: "Search query" },
language: {
type: "string",
enum: ["javascript", "python", "rust", "go"],
description: "Programming language filter",
},
},
required: ["query"],
},
},
},
{
type: "function",
function: {
name: "execute_code",
description: "Execute a code snippet and return the output",
parameters: {
type: "object",
properties: {
code: { type: "string", description: "Code to execute" },
language: { type: "string", enum: ["javascript", "python"] },
},
required: ["code", "language"],
},
},
},
];
// Agent loop: Think -> Act -> Observe -> Think
async function agentLoop(userMessage: string) {
const messages: any[] = [
{
role: "system",
content: "You are a helpful coding assistant. Use tools when needed."
},
{ role: "user", content: userMessage }
];
for (let turn = 0; turn < 10; turn++) { // bound the loop so a stuck agent cannot run forever
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages,
tools,
tool_choice: "auto",
});
const message = response.choices[0].message;
messages.push(message);
// If no tool calls, the agent is done
if (!message.tool_calls || message.tool_calls.length === 0) {
return message.content;
}
// Execute each tool call
for (const toolCall of message.tool_calls) {
const args = JSON.parse(toolCall.function.arguments);
let result: string;
if (toolCall.function.name === "search_documentation") {
result = await searchDocs(args.query, args.language);
} else if (toolCall.function.name === "execute_code") {
result = await executeCode(args.code, args.language);
} else {
result = "Unknown tool: " + toolCall.function.name;
}
messages.push({
role: "tool",
tool_call_id: toolCall.id,
content: result,
});
}
}
}
Agent Architecture Patterns
============================
1. ReAct (Reasoning + Acting):
Thought: "I need to find the API rate limits"
Action: search_documentation("rate limits REST API")
Observe: "Rate limits: 100 req/min for free tier..."
Thought: "Now I have the info, I can answer"
Answer: "The API allows 100 requests per minute..."
2. Plan-and-Execute:
Plan: ["Search docs", "Write code", "Test code", "Review"]
Execute: Run each step, re-plan if needed
3. Multi-Agent (Crew/Swarm):
Agent 1 (Researcher): Gathers information
Agent 2 (Developer): Writes code based on research
Agent 3 (Reviewer): Reviews and suggests improvements
Orchestrator: Coordinates agent communication
4. Tool-Augmented Generation:
LLM decides WHEN and WHICH tool to call
Tools: calculator, web search, code exec, DB query, API calls
LLM synthesizes tool results into final answer
10. Guardrails and Safety: Content Filtering, Hallucination Detection
Production LLM applications must have guardrails. Guardrails ensure model outputs are safe, accurate, properly formatted, and comply with business rules. An LLM application without guardrails is like an API without validation — it will eventually break.
// Guardrails Implementation Pattern
// 1. Input Validation — Filter malicious/inappropriate inputs
async function validateInput(userMessage: string): Promise<boolean> {
// Check message length
if (userMessage.length > 10000) {
throw new Error("Message too long");
}
// Prompt injection detection
const injectionPatterns = [
/ignore (all |previous |above )?instructions/i,
/you are now/i,
/system prompt/i,
/reveal your/i,
];
if (injectionPatterns.some(p => p.test(userMessage))) {
throw new Error("Potential prompt injection detected");
}
// Content moderation (OpenAI Moderation API)
const moderation = await openai.moderations.create({
input: userMessage,
});
if (moderation.results[0].flagged) {
throw new Error("Content flagged: " +
Object.entries(moderation.results[0].categories)
.filter(([, v]) => v)
.map(([k]) => k)
.join(", ")
);
}
return true;
}
// 2. Output Validation — Ensure correct format and content
import { z } from "zod";
// Define expected output schema
const ProductRecommendation = z.object({
products: z.array(z.object({
name: z.string(),
reason: z.string().max(200),
confidence: z.number().min(0).max(1),
price_range: z.enum(["budget", "mid-range", "premium"]),
})).min(1).max(5),
disclaimer: z.string(),
});
// Parse and validate LLM output
function validateOutput(llmOutput: string) {
try {
const parsed = JSON.parse(llmOutput);
const validated = ProductRecommendation.parse(parsed);
return { success: true, data: validated };
} catch (error) {
// Retry with corrective prompt or return fallback
return { success: false, error: String(error) };
}
}
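The "retry with corrective prompt" fallback noted in the catch block generalizes to a small wrapper: generate, validate, and on failure feed the validation error back so the model can fix its own output. A sketch; `generate` is a hypothetical stand-in for your actual LLM call:

```typescript
// Generate -> validate -> on failure, append the validation error to the
// prompt and retry, letting the model repair its own output.
async function generateWithRepair<T>(
  generate: (prompt: string) => Promise<string>,
  validate: (raw: string) => { success: boolean; data?: T; error?: string },
  prompt: string,
  maxAttempts = 3,
): Promise<T> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await generate(currentPrompt);
    const result = validate(raw);
    if (result.success && result.data !== undefined) return result.data;
    currentPrompt = prompt +
      "\n\nYour previous output was invalid: " + (result.error ?? "unknown error") +
      "\nReturn ONLY corrected output.";
  }
  throw new Error("Output still invalid after " + maxAttempts + " attempts");
}
```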
// 3. Hallucination Detection for RAG
async function checkFaithfulness(
answer: string,
sources: string[]
): Promise<{ faithful: boolean; issues: string[] }> {
const response = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [
{
role: "system",
content:
"You are a fact-checker. Compare the answer against the source " +
"documents. Identify any claims NOT supported by the sources. " +
"Respond in JSON: {faithful: boolean, issues: string[]}"
},
{
role: "user",
content: "Sources:\n" + sources.join("\n---\n") +
"\n\nAnswer:\n" + answer
}
],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}
Production Guardrails Checklist
================================
Layer Check Priority
----- ----- --------
Input Message length limit CRITICAL
Input Prompt injection detection CRITICAL
Input Content moderation (toxicity) CRITICAL
Input PII detection and redaction HIGH
Input Rate limiting per user HIGH
Output JSON schema validation HIGH
Output Hallucination / faithfulness check HIGH
Output Content safety filter CRITICAL
Output Max output length enforcement MEDIUM
Output Source citation verification MEDIUM
System Cost per request monitoring HIGH
System Latency tracking (P50, P95, P99) HIGH
System Error rate alerting CRITICAL
System Fallback responses for failures HIGH
System Audit logging for compliance MEDIUM
11. Cost Optimization for LLM APIs
LLM API costs are one of the primary operational expenses for production applications. Unoptimized LLM applications can consume thousands to tens of thousands of dollars monthly. Through intelligent caching, model routing, prompt compression, and batch processing, costs can be reduced by 60-80% while maintaining output quality.
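Before optimizing, measure: per-request cost is just token counts multiplied by per-million-token rates. A sketch (prices are parameters here, since they change; the figures in the comment are illustrative):

```typescript
// Per-request cost in USD given token usage and per-1M-token prices.
// Example: 2,000 input + 500 output tokens at $2.50 / $10.00 per 1M
// comes to $0.005 + $0.005 = $0.01.
function requestCostUsd(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,
  outputPricePerM: number,
): number {
  return (inputTokens / 1e6) * inputPricePerM +
         (outputTokens / 1e6) * outputPricePerM;
}
```

Logging this per request (the usage fields shown in section 2 supply the token counts) gives the baseline that every strategy below is measured against.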
// 1. Semantic Caching — Avoid redundant API calls
import { createHash } from "crypto";
// Assumes a getEmbedding(text: string): Promise<number[]> helper that
// calls an embedding API (e.g. text-embedding-3-small).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
class SemanticCache {
private cache: Map<string, { response: string; embedding: number[] }> = new Map();
private similarityThreshold = 0.95;
async get(query: string): Promise<string | null> {
const queryEmbedding = await getEmbedding(query);
for (const [, entry] of this.cache) {
const similarity = cosineSimilarity(queryEmbedding, entry.embedding);
if (similarity >= this.similarityThreshold) {
return entry.response; // Cache hit
}
}
return null; // Cache miss
}
async set(query: string, response: string): Promise<void> {
const embedding = await getEmbedding(query);
const key = createHash("sha256").update(query).digest("hex");
this.cache.set(key, { response, embedding });
}
}

// 2. Model Routing — Use the cheapest model that works
async function routeToModel(query: string): Promise<string> {
// Classify query complexity
const classification = await openai.chat.completions.create({
model: "gpt-4o-mini", // cheap classifier
messages: [{
role: "system",
content:
"Classify the query complexity as SIMPLE, MEDIUM, or COMPLEX. " +
"SIMPLE: factual lookup, formatting, translation. " +
"MEDIUM: summarization, code generation, analysis. " +
"COMPLEX: multi-step reasoning, creative writing, architecture design. " +
"Respond with only the classification word."
}, {
role: "user", content: query
}],
max_tokens: 10,
});
const complexity = classification.choices[0].message.content?.trim();
const modelMap: Record<string, string> = {
SIMPLE: "gpt-4o-mini", // $0.15/1M input
MEDIUM: "gpt-4o-mini", // $0.15/1M input
COMPLEX: "gpt-4o", // $2.50/1M input
};
// Fall back to the cheap model if the classifier returns an unexpected label
const model = modelMap[complexity || "MEDIUM"] ?? "gpt-4o-mini";
console.log("Routing " + query.slice(0, 50) + "... to " + model);
return model;
}

LLM Cost Optimization Strategies
==================================
Strategy Savings Effort Impact on Quality
-------- ------- ------ -----------------
Semantic caching 30-60% Medium None (exact/similar)
Model routing 40-70% Medium Minimal (smart routing)
Prompt compression 10-30% Low Minimal
Batch API (OpenAI) 50% Low None
Reduce max_tokens 5-20% Low None (if set correctly)
Shorter system prompts 5-15% Low Minimal
Streaming (early stop) 10-30% Medium Variable
Open-source models 80-100% High Depends on model
Example Monthly Cost Breakdown (100K queries/month):
Unoptimized (GPT-4o for everything): $5,000
+ Model routing (80% to mini): $1,200 (-76%)
+ Semantic caching (40% hit rate): $720 (-86%)
+ Prompt compression: $580 (-88%)
+ Batch API for async tasks: $450 (-91%)

12. Model Comparison: GPT-4 vs Claude 3 vs Gemini vs Llama 3
Choosing the right model is one of the most critical decisions in AI Engineering. Models differ in reasoning capability, context window, cost, latency, and task-specific performance. Below is a detailed comparison of the leading models.
Model Comparison — Detailed Breakdown (2026)
==============================================
Category GPT-4o Claude 3.5 Gemini 1.5 Llama 3.1
Sonnet Pro 405B
-------- ------ ----------- ----------- ----------
Provider OpenAI Anthropic Google Meta (OSS)
Context Window 128K 200K 1M 128K
Multimodal Text+Image+ Text+Image+ Text+Image+ Text+Image
Audio PDF Video+Audio
Coding Excellent Best Very Good Very Good
Reasoning Excellent Excellent Good Good
Long Context Good Excellent Best Good
Safety Good Best Good Good
Speed (TTFT) Fast Fast Fast Varies
API Ecosystem Best Good Good N/A
Fine-tuning Yes No (yet) Yes Full control
Self-hosting No No No Yes
Data Privacy API only API only API only Full control
Best For:
GPT-4o: General purpose, largest ecosystem, best tooling
Claude 3.5 Sonnet: Coding, long documents, safety-critical apps
Gemini 1.5 Pro: Ultra-long context (books, codebases), multimodal
Llama 3.1 405B: Self-hosting, data privacy, customization
Budget Models:
GPT-4o-mini: Best cheap commercial model
Claude 3.5 Haiku: Fast and affordable, good quality
Gemini 1.5 Flash: Ultra cheap, very fast
Llama 3.1 70B: Best open-source mid-tier, self-hostable

// Multi-Model Strategy: Use the Right Model for Each Task
const MODEL_CONFIG = {
// High-stakes, complex reasoning
complex: {
model: "gpt-4o",
temperature: 0.3,
useCases: ["architecture design", "code review", "legal analysis"],
},
// Long document processing
longContext: {
model: "gemini-1.5-pro",
temperature: 0.2,
useCases: ["codebase analysis", "book summarization", "log analysis"],
},
// Code generation and debugging
coding: {
model: "claude-3-5-sonnet-20241022",
temperature: 0,
useCases: ["code generation", "debugging", "refactoring"],
},
// Simple tasks, high volume
simple: {
model: "gpt-4o-mini",
temperature: 0.5,
useCases: ["classification", "extraction", "formatting"],
},
// Privacy-sensitive, self-hosted
private: {
model: "meta-llama/Llama-3.1-70B-Instruct",
temperature: 0.3,
useCases: ["medical records", "financial data", "internal tools"],
},
};

13. Evaluation and Testing LLM Applications
Evaluation and testing are the most overlooked yet most important aspects of AI Engineering. Because LLM outputs are non-deterministic, traditional testing methods apply only partially: you need a specialized evaluation framework to measure output quality, accuracy, and safety, and to improve continuously through iteration.
// LLM Evaluation Framework
// 1. LLM-as-Judge — Use a strong model to evaluate outputs
async function llmJudge(
question: string,
answer: string,
criteria: string
): Promise<{ score: number; reasoning: string }> {
const response = await openai.chat.completions.create({
model: "gpt-4o",
messages: [{
role: "system",
content:
"You are an expert evaluator. Score the answer on a scale of " +
"1-5 based on the given criteria. " +
"Respond in JSON: {score: number, reasoning: string}"
}, {
role: "user",
content: "Question: " + question + "\n" +
"Answer: " + answer + "\n" +
"Criteria: " + criteria
}],
response_format: { type: "json_object" },
temperature: 0,
});
return JSON.parse(response.choices[0].message.content || "{}");
}
// Usage:
// await llmJudge(
// "What is Kubernetes?",
// generatedAnswer,
// "Accuracy, completeness, clarity, and conciseness"
// );

// 2. RAG Evaluation Metrics
// Using Ragas framework concepts
interface RAGEvalResult {
faithfulness: number; // Is the answer grounded in sources?
answerRelevancy: number; // Does it actually answer the question?
contextPrecision: number; // Are retrieved docs relevant?
contextRecall: number; // Did we retrieve all relevant docs?
}
async function evaluateRAG(
question: string,
answer: string,
contexts: string[],
groundTruth: string
): Promise<RAGEvalResult> {
// Faithfulness: does the answer use only info from contexts?
const faithfulness = await llmJudge(
question,
answer,
"Score 1-5: Is every claim in the answer supported by the " +
"provided context? Penalize any unsupported claims heavily."
);
// Answer relevancy: does it address the question?
const relevancy = await llmJudge(
question,
answer,
"Score 1-5: Does the answer directly address the question? " +
"Penalize off-topic content and missing key points."
);
return {
faithfulness: faithfulness.score / 5,
answerRelevancy: relevancy.score / 5,
contextPrecision: 0, // computed via retrieval metrics
contextRecall: 0, // computed against ground truth
};
}

LLM Testing Strategy
=====================
Test Type What It Tests Tools
--------- ------------- -----
Unit Tests Prompt template outputs pytest, vitest
Integration Tests Full RAG pipeline Ragas, DeepEval
Regression Tests Output consistency Golden datasets
A/B Tests Model/prompt comparison LangSmith, Braintrust
Red Team Tests Safety, edge cases Garak, manual
Load Tests Latency under load k6, Locust
Cost Tests Budget compliance Custom monitoring
Evaluation Tools:
Ragas - Open-source RAG evaluation framework
DeepEval - LLM evaluation with multiple metrics
LangSmith - Tracing, debugging, evaluation (LangChain)
Phoenix - LLM observability (Arize AI)
Braintrust - LLM evaluation and monitoring platform
Promptfoo - Open-source prompt testing CLI
Golden Rule: Never deploy without:
1. A golden evaluation dataset (50-200 examples)
2. Automated regression tests in CI
3. Production monitoring (latency, cost, errors)
4. Human review for high-stakes outputs

Frequently Asked Questions
What is the difference between an AI Engineer and an ML Engineer?
ML Engineers focus on training and optimizing machine learning models from scratch — data collection, feature engineering, model training, hyperparameter tuning. AI Engineers focus on building applications using pre-trained LLMs — integrating models via APIs, prompt engineering, RAG, fine-tuning, and orchestration frameworks. AI Engineering is more application-layer; ML Engineering is more model-layer.
When should I use RAG vs fine-tuning?
Prefer RAG when: you need up-to-date information, need source citations, data changes frequently, or need explainability. Choose fine-tuning when: you need to change the model tone/style/format, need to learn domain-specific reasoning patterns, or RAG retrieval quality is insufficient. They can be combined — fine-tune to help the model understand domain language better, then use RAG to supply specific facts. Cost-wise, RAG has lower upfront cost but per-query retrieval overhead; fine-tuning requires higher upfront training cost but simpler inference.
How do I choose a vector database?
The choice depends on scale and infrastructure. Pinecone: fully managed, zero ops, good for prototypes and medium scale. Weaviate: open-source, hybrid search (vector + keyword), rich module ecosystem. Qdrant: open-source, Rust-based with excellent performance, powerful filtering. pgvector: PostgreSQL extension, ideal for teams with existing PG infrastructure, performs well under a few million vectors. Chroma: lightweight, great for local development and prototyping. For billion-scale vectors, consider Milvus.
How can I reduce LLM API costs?
Key strategies: 1) Semantic caching — cache responses for similar queries to avoid redundant calls; 2) Model routing — use smaller models (GPT-4o-mini/Claude Haiku) for simple tasks, large models only for complex ones; 3) Prompt optimization — reduce unnecessary tokens (trim system prompts, compress context); 4) Batch APIs — use batch endpoints for non-realtime scenarios, saving up to 50%; 5) Open-source models — self-host Llama 3 or Mistral for latency-sensitive and privacy-critical workloads; 6) Output length limits — set max_tokens to prevent verbose responses.
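The savings from caching can be sanity-checked with a back-of-envelope cost model. This helper and its numbers are illustrative only, not real pricing data:

```typescript
// Rough monthly cost estimate: billable queries × tokens × price-per-token,
// where cache hits are free. All numbers illustrative.
function monthlyCost(
  queries: number,
  avgTokensPerQuery: number,
  pricePerMillionTokens: number,
  cacheHitRate = 0
): number {
  const billable = queries * (1 - cacheHitRate);
  return (billable * avgTokensPerQuery * pricePerMillionTokens) / 1_000_000;
}
// 100K queries/month at 2K tokens each on a $2.50/1M-token model:
// monthlyCost(100_000, 2_000, 2.5)      // ≈ $500 (no caching)
// monthlyCost(100_000, 2_000, 2.5, 0.4) // ≈ $300 (40% cache hit rate)
```

The same shape of calculation applies to model routing: replace the single price with a weighted average across the models traffic is routed to.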
What is the difference between LangChain and LlamaIndex?
LangChain is a general-purpose LLM application orchestration framework, excelling at building complex Chains and Agents — ideal for multi-step workflows, tool calling, and conversation management. LlamaIndex (formerly GPT Index) specializes in data indexing and retrieval, connecting diverse data sources and building high-quality RAG pipelines. In practice they are often used together: LlamaIndex handles data ingestion and retrieval while LangChain handles chain orchestration and agent logic. If your core need is RAG, start with LlamaIndex; if you need complex multi-step workflows, start with LangChain.
What are AI agents and how do they differ from regular LLM calls?
AI agents are LLM applications that can autonomously plan, use tools, and execute multi-step tasks. A regular LLM call is single-turn input-output, whereas agents can: 1) decompose complex tasks into sub-steps; 2) call external tools (search engines, databases, APIs, code executors); 3) observe tool results and decide the next action; 4) iterate until the task is complete. The core pattern is the ReAct (Reasoning + Acting) loop: Think, Act, Observe, Think. Function Calling is the primary API mechanism for implementing tool use.
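The Think → Act → Observe loop can be sketched without any API at all. In this sketch the "LLM" is a plain function so the control flow stays visible; all names are hypothetical:

```typescript
// Minimal ReAct-style control loop. In production, policy would be a
// chat-model call that returns either a tool action or a final answer.
type AgentStep =
  | { type: "act"; tool: string; input: string }
  | { type: "finish"; answer: string };

function runAgent(
  policy: (observations: string[]) => AgentStep, // stands in for the LLM
  tools: Record<string, (input: string) => string>,
  maxSteps = 5
): string {
  const observations: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = policy(observations);           // Think
    if (step.type === "finish") return step.answer;
    const result = tools[step.tool](step.input); // Act
    observations.push(result);                   // Observe
  }
  return "max steps reached";
}
```

A real agent replaces `policy` with a Function Calling request: the model's tool_calls become the "act" branch, and tool results are appended to the message history as the observations.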
How do I evaluate and test LLM applications?
Multi-layer evaluation approach: 1) Offline evaluation — use labeled datasets to compute accuracy, F1, BLEU/ROUGE metrics; 2) LLM-as-Judge — use a powerful LLM (e.g., GPT-4) to evaluate another model output quality; 3) RAG-specific metrics — retrieval relevance (Precision@K), answer faithfulness (grounded in retrieved content), answer relevance; 4) Human evaluation — manual review and A/B testing for critical scenarios; 5) Online monitoring — track latency, cost, user feedback, error rates. Recommended tools: Ragas (RAG evaluation), DeepEval, LangSmith (tracing and debugging), Phoenix (observability).
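Of the RAG metrics listed, Precision@K is simple enough to compute directly. A hypothetical helper (frameworks like Ragas compute this and more for you):

```typescript
// Precision@K: of the top-K retrieved document ids, what fraction is
// actually relevant? 1.0 means every retrieved doc was useful.
function precisionAtK(
  retrievedIds: string[],
  relevantIds: Set<string>,
  k: number
): number {
  const topK = retrievedIds.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevantIds.has(id)).length;
  return hits / topK.length;
}
// precisionAtK(["d1", "d2", "d3"], new Set(["d1", "d3"]), 2) // => 0.5
```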
How do I handle LLM hallucinations?
Strategies to reduce hallucinations: 1) RAG — ground the model in retrieved real documents rather than relying on memory; 2) System prompt constraints — explicitly instruct "answer only based on the provided context, say you do not know if unsure"; 3) Temperature parameter — lower temperature (0.0-0.3) to reduce randomness; 4) Structured output — use JSON Schema or function calling to constrain output format; 5) Self-verification — have the model verify its own answer is supported by evidence after generation; 6) Fact-checking pipeline — automated factual verification of outputs; 7) Source citation — require the model to cite specific document passages.
AI Engineering Core Concepts Quick Reference
AI Engineering — Quick Reference
=================================
Concept Description
------- -----------
Prompt Engineering Designing inputs to get desired LLM outputs
System Prompt Instructions that set model role and constraints
Few-Shot Learning Providing examples in the prompt for guidance
Chain-of-Thought (CoT) Asking the model to reason step-by-step
RAG Retrieve relevant docs, then generate answers
Vector Database Storage optimized for similarity search
Embedding Dense vector representation of text meaning
Chunking Splitting documents into smaller pieces
Fine-Tuning Training a model further on custom data
Function Calling LLM deciding when and how to call tools
AI Agent LLM that can plan, use tools, and iterate
ReAct Pattern Think -> Act -> Observe -> Think loop
Guardrails Input/output validation and safety filters
Hallucination Model generating false or unsupported info
Semantic Caching Cache responses for semantically similar queries
Model Routing Directing queries to the best-fit model
LLM-as-Judge Using a strong LLM to evaluate outputs
LCEL (LangChain) LangChain Expression Language for chains
Temperature Controls randomness (0=deterministic, 1=creative)
Token Basic unit of text processed by LLMs
Context Window Maximum tokens a model can process at once
Structured Output Constraining LLM output to JSON/schema
Batch API Processing multiple requests at reduced cost
Multimodal Models that process text, image, audio, video