- Well-chosen indexes are the single highest-impact optimization — composite indexes must follow the left-prefix rule
- Always use EXPLAIN ANALYZE when diagnosing slow queries; never guess
- Eliminate N+1 queries with JOINs, eager loading, or DataLoader batching
- Connection pooling (PgBouncer, HikariCP) prevents connection exhaustion under high load
- Cache hot data in Redis/Memcached — the cache-aside pattern covers 90% of cases
- Normalize for write-heavy workloads; denormalize for read-heavy analytics
- Partition large tables by range or hash; shard only when single-node limits are genuinely reached
- Monitor continuously with pg_stat_statements, slow query logs, and connection pool metrics
- B-tree indexes handle 90% of queries; use GIN for JSONB/full-text and BRIN for time-series data
- Column order matters in composite indexes: equality columns first, then range columns, then sort columns
- Covering indexes (INCLUDE) enable index-only scans that avoid heap lookups entirely
- Partial indexes can shrink index size dramatically when queries touch only a small subset of rows
- VACUUM and ANALYZE are critical PostgreSQL maintenance operations — never disable autovacuum
- Read replicas scale reads linearly; asynchronous streaming replication suits analytics
- The WAL (write-ahead log) guarantees durability — tune wal_buffers and checkpoint_completion_target
- Schema migrations should be backward compatible: adding columns is fine; never rename columns in production
1. Indexing Strategy: B-Tree, Hash, Composite, Covering, and Partial Indexes
Indexes are the most important performance tool in relational databases. Choosing the right index type and column order can cut query time from seconds to milliseconds. Understanding B-tree traversal, the composite-index prefix rule, and when to reach for specialized index types is essential for database performance.
B-Tree, Hash, GIN, BRIN, Composite, Covering, and Partial Indexes
The B-tree index is the workhorse of relational databases — it maintains a balanced tree structure with O(log n) lookup time and supports equality, range, sorting, and prefix LIKE queries. Composite indexes extend this by indexing multiple columns but follow the strict left-prefix rule: the index on (status, created_at) can serve queries filtering on status alone, or status AND created_at, but not created_at alone.
Covering indexes use the INCLUDE clause to store additional columns in the index leaf pages, enabling index-only scans that never touch the heap table. Partial indexes only index rows matching a WHERE condition, dramatically reducing index size when you frequently query a small subset. Hash indexes provide O(1) equality lookups. GIN handles multi-valued data like JSONB and full-text search. BRIN is ideal for naturally ordered time-series data with minimal storage overhead.
-- B-Tree Index: the default and most versatile index type
-- Handles: =, <, >, <=, >=, BETWEEN, IN, LIKE prefix%, IS NULL
-- Structure: balanced tree with O(log n) lookup
-- Simple single-column index
CREATE INDEX idx_users_email ON users(email);
-- Composite index: column order follows the left-prefix rule
-- This index covers:
-- WHERE status = ? (uses 1st column)
-- WHERE status = ? AND created_at > ? (uses both columns)
-- WHERE status = ? ORDER BY created_at DESC (uses both columns)
-- But NOT:
-- WHERE created_at > ? (cannot skip leading column)
CREATE INDEX idx_orders_status_date
ON orders(status, created_at DESC);
-- Covering index with INCLUDE (PostgreSQL 11+, SQL Server)
-- Stores extra columns in leaf pages for index-only scans
-- The included columns are NOT part of the search key
CREATE INDEX idx_orders_covering
ON orders(customer_id, status)
INCLUDE (total, created_at);
-- This query uses index-only scan (never touches heap):
SELECT total, created_at FROM orders
WHERE customer_id = 42 AND status = 'completed';
-- Partial index: only index rows matching a WHERE condition
-- Dramatically smaller index when querying a subset
CREATE INDEX idx_orders_pending
ON orders(created_at)
WHERE status = 'pending'; -- only ~5% of rows indexed
-- Hash index: O(1) lookup, equality only (no range queries)
-- PostgreSQL 10+ makes hash indexes WAL-logged and crash-safe
CREATE INDEX idx_sessions_token ON sessions USING hash(token);
-- GIN index: for JSONB, arrays, full-text search
CREATE INDEX idx_products_attrs ON products USING gin(attributes);
SELECT * FROM products WHERE attributes @> '{"color": "red"}';
-- BRIN index: for naturally ordered data (timestamps, IDs)
-- Tiny index size — stores min/max per block range
CREATE INDEX idx_logs_created ON logs USING brin(created_at)
WITH (pages_per_range = 32);
-- Expression index: index on computed values
CREATE INDEX idx_users_email_lower ON users(lower(email));
SELECT * FROM users WHERE lower(email) = 'user@example.com';
Index Type Comparison
| Index Type | Best For | Write Overhead | Notes |
|---|---|---|---|
| B-Tree | Equality, range, sort, LIKE prefix% | Low | Default; handles 90% of queries |
| Hash | Equality only (=) | Low | O(1) lookup, no range/sort support |
| GIN | JSONB, arrays, full-text search | High write | Multi-valued data, containment queries |
| GiST | Geometric, range types, full-text | Medium | Nearest-neighbor, overlap queries |
| BRIN | Time-series, sequential data | Very low | Tiny size; stores min/max per block range |
| Composite | Multi-column queries | Medium | Left-prefix rule; column order matters |
| Covering | Index-only scans | Medium | INCLUDE extra columns; avoids heap lookup |
| Partial | Querying subset of rows | Low | WHERE clause in index definition |
2. Query Optimization: EXPLAIN Plans, the N+1 Problem, and Query Rewriting
Even with perfect indexes, poorly written queries can be a performance disaster. Reading EXPLAIN ANALYZE output, spotting N+1 patterns, and rewriting subqueries as JOINs are core skills for every backend developer.
EXPLAIN ANALYZE, N+1 Detection, and Query Rewriting
EXPLAIN ANALYZE is the most important diagnostic tool for query performance. Unlike plain EXPLAIN, it actually executes the query and reports real timing, actual row counts, and buffer usage. When the estimated rows differ significantly from actual rows, it indicates stale statistics that need ANALYZE to refresh. Watch for sequential scans on large tables, external merge sorts (indicating insufficient work_mem), and nested loops with large inner sets.
The N+1 problem is the most common ORM performance trap. It occurs when code loads a list with one query, then fires a separate query for each item to fetch related data. The fix is always to batch: use JOINs in raw SQL, eager loading in ORMs (Prisma include, TypeORM relations), IN clause batching, or DataLoader for GraphQL resolvers. Rewriting correlated subqueries as JOINs, using window functions instead of self-joins, and using EXISTS over COUNT for existence checks can yield order-of-magnitude improvements.
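The DataLoader batching mentioned above can be sketched without the library itself: a minimal loader that collects all keys requested in the same tick and resolves them with a single batched fetch. The `TinyLoader` name and the `batchFetch` callback are illustrative assumptions, not an API from this document.

```typescript
// Minimal DataLoader-style batcher: load() calls made in the same tick
// are queued, then satisfied by ONE batched fetch instead of N queries.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;

class TinyLoader<K, V> {
  private queue: { key: K; resolve: (v: V) => void }[] = [];
  private scheduled = false;

  constructor(private batchFetch: BatchFn<K, V>) {}

  load(key: K): Promise<V> {
    return new Promise((resolve) => {
      this.queue.push({ key, resolve });
      if (!this.scheduled) {
        this.scheduled = true;
        // Flush once the current tick's load() calls have all queued up.
        Promise.resolve().then(() => this.flush());
      }
    });
  }

  private async flush(): Promise<void> {
    const batch = this.queue;
    this.queue = [];
    this.scheduled = false;
    // batchFetch must return values in the same order as the keys.
    const values = await this.batchFetch(batch.map((b) => b.key));
    batch.forEach((b, i) => b.resolve(values[i]));
  }
}
```

In a GraphQL resolver, `batchFetch` would run one `WHERE customer_id = ANY($1)` query for all collected IDs; the production `dataloader` package adds per-request caching and error handling on top of this idea.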
-- EXPLAIN ANALYZE: run the query and show actual execution stats
-- Always use ANALYZE in development to see real numbers
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT c.name, COUNT(o.id) AS order_count, SUM(o.total) AS revenue
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at >= '2025-01-01'
AND o.status = 'completed'
GROUP BY c.id, c.name
ORDER BY revenue DESC
LIMIT 20;
-- Key plan nodes to understand:
-- Seq Scan -> full table scan, no index used (RED FLAG on large tables)
-- Index Scan -> B-tree traversal, fetches from heap (good)
-- Index Only Scan -> reads only from index, no heap (best)
-- Bitmap Index Scan-> collects row pointers, then fetches (good for many rows)
-- Hash Join -> builds hash table for inner relation (good for large sets)
-- Nested Loop -> row-by-row join (good for small inner)
-- Sort -> explicit sort operation (check work_mem)
-- Red flags in EXPLAIN output:
-- 1. "rows=1" estimate but "actual rows=50000" -> stale stats, run ANALYZE
-- 2. Seq Scan on a table with millions of rows -> missing index
-- 3. "Sort Method: external merge" -> work_mem too small
-- 4. Nested Loop with large inner set -> consider Hash Join
-- N+1 query detection and fix:
-- BAD: 101 queries for 100 users
-- const users = await db.query('SELECT * FROM users LIMIT 100');
-- for (const u of users) {
-- const orders = await db.query(
-- 'SELECT * FROM orders WHERE customer_id = $1', [u.id]
-- );
-- }
-- GOOD: single query with JOIN
SELECT u.id, u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.customer_id = u.id
GROUP BY u.id, u.name
LIMIT 100;
-- Query rewrite: EXISTS instead of COUNT for existence check
-- SLOW: scans all matching rows to count
SELECT * FROM products p
WHERE (SELECT COUNT(*) FROM reviews r WHERE r.product_id = p.id) > 0;
-- FAST: stops at first match
SELECT * FROM products p
WHERE EXISTS (SELECT 1 FROM reviews r WHERE r.product_id = p.id);
3. Schema Design: Normalization vs. Denormalization, Partitioning, and Sharding
Schema design decisions made early in a project have a lasting impact on performance. Knowing when to normalize for consistency, when to denormalize for speed, and how to scale horizontally with partitioning and sharding avoids expensive refactoring later.
Normalization, Denormalization, Partitioning, and Materialized Views
Third Normal Form (3NF) eliminates data redundancy by ensuring each fact is stored exactly once. This is ideal for OLTP workloads where data consistency and write efficiency matter. Denormalization duplicates data for read performance, using summary tables, materialized views, or embedded aggregates. The trade-off is write complexity for read speed.
Table partitioning splits large tables into smaller, more manageable pieces. Range partitioning by date is the most common pattern — queries that filter on the partition key automatically skip irrelevant partitions (partition pruning). PostgreSQL supports declarative partitioning with RANGE, LIST, and HASH strategies. Sharding extends this concept across multiple database servers but adds significant operational complexity and should only be considered when single-node limits are genuinely reached.
-- Normalization: 3NF eliminates redundancy, ensures consistency
-- Each fact is stored once; updates happen in one place
-- Normalized schema (3NF)
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(id),
status VARCHAR(20) DEFAULT 'pending',
total DECIMAL(10,2) NOT NULL,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(id),
product_id INT REFERENCES products(id),
quantity INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL
);
-- Denormalized: embed aggregates for fast reads (analytics)
-- Trade-off: write complexity for read speed
CREATE TABLE customer_summary (
customer_id INT PRIMARY KEY REFERENCES customers(id),
total_orders INT DEFAULT 0,
total_revenue DECIMAL(12,2) DEFAULT 0,
last_order_at TIMESTAMPTZ,
-- Updated via trigger or application code
updated_at TIMESTAMPTZ DEFAULT now()
);
-- Materialized view: precomputed query results, refreshed on demand
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
COUNT(*) AS order_count,
SUM(total) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY 1;
CREATE UNIQUE INDEX ON mv_daily_revenue(day);
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue;
-- Table partitioning: split large tables by range
CREATE TABLE events (
id BIGSERIAL,
event_type VARCHAR(50),
payload JSONB,
created_at TIMESTAMPTZ NOT NULL
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2025_q1 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-04-01');
CREATE TABLE events_2025_q2 PARTITION OF events
FOR VALUES FROM ('2025-04-01') TO ('2025-07-01');
-- Queries automatically prune irrelevant partitions:
SELECT * FROM events WHERE created_at >= '2025-03-01'
AND created_at < '2025-04-01'; -- only scans Q1 partition
4. Connection Pooling and Resource Management
Database connections are expensive resources. Each connection consumes server memory, and establishing a new one carries significant overhead. A connection pool reuses a fixed set of connections across application requests, dramatically improving throughput and reducing latency under concurrent load.
pg Pool, PgBouncer, and Prepared Statements
Every PostgreSQL connection consumes approximately 10MB of memory and takes 50-100ms to establish. Without pooling, a web application serving 1000 concurrent users would need 1000 connections, consuming 10GB of memory just for connections. Connection pooling solves this by maintaining a fixed pool of reusable connections. PgBouncer in transaction mode is the gold standard for PostgreSQL — it can multiplex thousands of application connections through just 20-50 database connections. Prepared statements further reduce overhead by parsing the SQL once and reusing the execution plan.
// Node.js: pg Pool with proper configuration
import { Pool } from 'pg';
const pool = new Pool({
host: process.env.DB_HOST,
port: 5432,
database: process.env.DB_NAME,
user: process.env.DB_USER,
password: process.env.DB_PASSWORD,
max: 20, // max connections in pool
idleTimeoutMillis: 30000, // close idle connections after 30s
connectionTimeoutMillis: 5000, // fail if can't connect in 5s
keepAlive: true, // TCP keepalive
keepAliveInitialDelayMillis: 10000,
// statement_timeout prevents runaway queries
statement_timeout: 30000, // 30s max per query
});
// Monitor pool health
setInterval(() => {
console.log({
totalCount: pool.totalCount, // total connections
idleCount: pool.idleCount, // available connections
waitingCount: pool.waitingCount, // queued requests
});
}, 10000);
// Use prepared statements for repeated queries
const result = await pool.query({
name: 'get-user-orders',
text: 'SELECT * FROM orders WHERE customer_id = $1 AND status = $2',
values: [customerId, 'completed'],
});
// PgBouncer configuration (pgbouncer.ini)
// [databases]
// mydb = host=127.0.0.1 port=5432 dbname=mydb
//
// [pgbouncer]
// pool_mode = transaction # most flexible
// max_client_conn = 1000 # app-facing connections
// default_pool_size = 25 # actual DB connections per pool
// reserve_pool_size = 5 # extra connections for bursts
// server_idle_timeout = 600 # close idle server connections
// query_timeout = 30 # kill queries exceeding 30s
// log_connections = 1
// log_disconnections = 1
5. Caching Strategies: Redis, Memcached, and Query Caching
Caching is the most effective way to reduce database load in read-heavy applications. Understanding caching patterns, invalidation strategies, and the choice between Redis and Memcached prevents stale-data bugs and cache stampedes.
Redis Cache-Aside, Stampede Prevention, and Invalidation
The cache-aside pattern is the most widely used caching strategy: check the cache first, on miss fetch from the database, then populate the cache. This lazy-loading approach means only data that is actually requested gets cached. Cache invalidation on write (delete the cache key) ensures eventual consistency. The cache stampede problem occurs when a popular key expires and hundreds of concurrent requests all hit the database simultaneously. Preventing this with a mutex/lock pattern ensures only one request rebuilds the cache while others wait.
Caching Pattern Comparison
| Pattern | Data Flow | Pros | Cons |
|---|---|---|---|
| Cache-Aside | App checks cache -> miss -> query DB -> write cache | Simple, lazy loading | Cache miss penalty, possible stale data |
| Write-Through | App writes DB + cache atomically | Cache always fresh | Write latency, caches unused data |
| Write-Behind | App writes cache -> async flush to DB | Fastest writes | Risk of data loss, complex |
| Read-Through | Cache auto-fetches from DB on miss | Transparent to app | Cold start latency |
| Refresh-Ahead | Cache pre-refreshes before expiry | No miss penalty | Wasted refreshes if data not read |
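The write-through row in the table can be made concrete with a short sketch. The in-memory Maps below stand in for the real database and Redis, purely for illustration; the point is that the write path updates both stores, so reads never observe stale cache entries.

```typescript
// Write-through sketch: every write goes to the database AND the cache
// in the same operation. The Maps are stand-ins for the real DB/Redis.
const dbStore = new Map<string, string>();
const cacheStore = new Map<string, string>();

async function writeThrough(key: string, value: string): Promise<void> {
  dbStore.set(key, value);    // 1. durable write to the database
  cacheStore.set(key, value); // 2. synchronously refresh the cache
}

async function readWithFallback(key: string): Promise<string | undefined> {
  // Under write-through the cache is always fresh; the DB fallback only
  // matters for a cold cache (e.g. after a restart).
  return cacheStore.get(key) ?? dbStore.get(key);
}
```

The trade-off the table names is visible here: the write touches two systems (higher write latency) in exchange for a cache that is never stale.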
// Redis cache-aside pattern (most common)
import Redis from 'ioredis';
const redis = new Redis({
host: process.env.REDIS_HOST,
port: 6379,
maxRetriesPerRequest: 3,
retryDelayOnFailover: 100,
lazyConnect: true,
});
async function getUserWithCache(userId: string) {
const cacheKey = 'user:' + userId;
// 1. Check cache first
const cached = await redis.get(cacheKey);
if (cached) {
return JSON.parse(cached);
}
// 2. Cache miss: fetch from database
const user = await db.query(
'SELECT * FROM users WHERE id = $1', [userId]
);
// 3. Store in cache with TTL (seconds)
await redis.set(cacheKey, JSON.stringify(user), 'EX', 3600);
return user;
}
// Cache invalidation on update
async function updateUser(userId: string, data: any) {
await db.query('UPDATE users SET name = $1 WHERE id = $2',
[data.name, userId]);
// Delete cache entry (next read will repopulate)
await redis.del('user:' + userId);
}
// Cache stampede prevention with singleflight/mutex
async function getWithLock(key: string, fetchFn: () => Promise<any>) {
const cached = await redis.get(key);
if (cached) return JSON.parse(cached);
// Try to acquire lock (NX = only if not exists)
const lockKey = 'lock:' + key;
const acquired = await redis.set(lockKey, '1', 'EX', 10, 'NX');
if (acquired) {
try {
const data = await fetchFn();
await redis.set(key, JSON.stringify(data), 'EX', 3600);
return data;
} finally {
await redis.del(lockKey);
}
}
// Another process is loading; wait and retry
await new Promise(r => setTimeout(r, 100));
return getWithLock(key, fetchFn);
}
6. PostgreSQL-Specific Optimization
PostgreSQL offers unique optimization features, including VACUUM, ANALYZE, advisory locks, parallel queries, and the pg_stat_statements extension. Tuning these PostgreSQL-specific settings correctly can improve performance by orders of magnitude.
postgresql.conf Tuning, VACUUM, ANALYZE, and WAL Settings
PostgreSQL performance tuning starts with shared_buffers (set to 25% of RAM), effective_cache_size (75% of RAM), and work_mem (memory for sorts and hash operations). For SSD storage, lower random_page_cost to 1.1 (from the HDD default of 4.0) so the planner correctly favors index scans. VACUUM reclaims dead tuple space from updates and deletes — the MVCC architecture means deleted rows are not immediately removed. Autovacuum handles this automatically, but high-write tables may need more aggressive settings. Write-ahead logging (WAL) ensures crash recovery by writing changes to a sequential log before applying them to data files. Tuning wal_buffers and checkpoint_completion_target balances durability with write throughput.
-- PostgreSQL configuration tuning (postgresql.conf)
-- Adjust based on available RAM and workload type
-- Memory settings (for 16GB RAM server)
shared_buffers = '4GB' -- 25% of total RAM
effective_cache_size = '12GB' -- 75% of total RAM
work_mem = '64MB' -- per-operation sort/hash memory
maintenance_work_mem = '1GB' -- VACUUM, CREATE INDEX memory
-- WAL (Write-Ahead Logging) settings
wal_buffers = '64MB' -- WAL write buffer
checkpoint_completion_target = 0.9 -- spread checkpoint I/O
max_wal_size = '4GB' -- trigger checkpoint
min_wal_size = '1GB'
-- Planner settings
random_page_cost = 1.1 -- SSD (default 4.0 for HDD)
effective_io_concurrency = 200 -- SSD concurrent I/O
default_statistics_target = 200 -- more accurate planner stats
-- Parallel query
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_worker_processes = 8
-- Autovacuum tuning for high-write tables
autovacuum_vacuum_scale_factor = 0.02 -- trigger at 2% dead rows
autovacuum_analyze_scale_factor = 0.01 -- analyze at 1% changes
autovacuum_max_workers = 4
autovacuum_naptime = '30s'
-- Monitor table bloat and dead tuples
SELECT schemaname, relname, n_live_tup, n_dead_tup,
round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_pct,
last_vacuum, last_autovacuum, last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 20;
-- Find missing indexes with pg_stat_user_tables
SELECT relname, seq_scan, idx_scan,
round(seq_scan::numeric / NULLIF(seq_scan + idx_scan, 0) * 100, 2)
AS seq_scan_pct
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_scan_pct DESC;
7. MySQL Tuning and InnoDB Optimization
MySQL's InnoDB engine has its own tuning parameters and optimization patterns. Buffer pool sizing, slow-query-log configuration, and InnoDB-specific settings directly determine the performance of MySQL applications.
InnoDB Buffer Pool, Redo Log, and Slow Query Log
The InnoDB buffer pool is the single most important MySQL setting — it caches both data and indexes in memory. Set it to 70-80% of available RAM on a dedicated database server. Monitor the hit ratio (should be above 99%) and increase the size if reads frequently go to disk. The redo log (InnoDB equivalent of WAL) should be sized large enough to avoid frequent checkpoints. Enabling the slow query log with log_queries_not_using_indexes catches both slow queries and those missing indexes. Every MySQL table should have an explicit primary key — InnoDB uses the primary key as the clustered index, and without one, it generates a hidden row ID that wastes space.
-- MySQL / InnoDB tuning (my.cnf)
-- InnoDB buffer pool: the single most important setting
-- Store frequently accessed data and indexes in memory
innodb_buffer_pool_size = 12G -- 70-80% of total RAM
innodb_buffer_pool_instances = 8 -- parallel buffer pools
innodb_buffer_pool_chunk_size = 128M
-- Redo log (InnoDB WAL equivalent)
innodb_log_file_size = 2G -- larger = fewer checkpoints
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1 -- 1=durable, 2=fast
innodb_flush_method = O_DIRECT -- bypass OS page cache
-- I/O settings for SSD
innodb_io_capacity = 2000 -- background I/O ops/sec
innodb_io_capacity_max = 4000
innodb_read_io_threads = 8
innodb_write_io_threads = 8
-- Query optimization
join_buffer_size = 256K
sort_buffer_size = 2M
read_rnd_buffer_size = 1M
tmp_table_size = 64M
max_heap_table_size = 64M
-- Slow query log (essential for optimization)
slow_query_log = ON
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5 -- log queries > 500ms
log_queries_not_using_indexes = ON
-- Check buffer pool hit ratio (should be > 99%)
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Hit ratio = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
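The hit-ratio formula in the comment above can be expressed as a small helper; the function name and inputs are assumptions that map onto the two counters `SHOW GLOBAL STATUS` returns.

```typescript
// InnoDB buffer pool hit ratio from the two SHOW GLOBAL STATUS counters:
//   reads        = Innodb_buffer_pool_reads         (had to go to disk)
//   readRequests = Innodb_buffer_pool_read_requests (all logical reads)
function bufferPoolHitRatio(reads: number, readRequests: number): number {
  if (readRequests === 0) return 1; // no traffic yet; treat as perfect
  return 1 - reads / readRequests;
}
```

A healthy dedicated server should stay above 0.99; a lower ratio suggests the buffer pool is too small for the working set.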
-- Find tables without a primary key (InnoDB generates a hidden row ID otherwise)
SELECT t.TABLE_SCHEMA, t.TABLE_NAME
FROM information_schema.TABLES t
LEFT JOIN information_schema.TABLE_CONSTRAINTS tc
ON t.TABLE_SCHEMA = tc.TABLE_SCHEMA
AND t.TABLE_NAME = tc.TABLE_NAME
AND tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
WHERE t.TABLE_SCHEMA NOT IN ('mysql', 'information_schema',
'performance_schema', 'sys')
AND tc.TABLE_NAME IS NULL;
8. MongoDB Optimization: Indexes, Aggregation, and Schema Design
MongoDB's optimization patterns differ from those of relational databases. Document schema design, compound indexes following the ESR (Equality-Sort-Range) rule, and aggregation pipeline optimization are the keys to high-performance MongoDB applications.
ESR Index Rule, Aggregation Pipeline, and Schema Design
MongoDB compound indexes follow the ESR (Equality-Sort-Range) rule: place equality match fields first in the index definition, sort fields next, and range query fields last. This order ensures MongoDB can use the index efficiently for all three operations in a single query. Covered queries — where the projection matches index fields exactly — return results directly from the index without reading full documents, similar to covering indexes in PostgreSQL.
The aggregation pipeline should filter and project as early as possible to reduce the number of documents flowing through later stages. Schema design in MongoDB involves choosing between embedding related data (access together, 1:few relationship) and referencing (unbounded growth, many:many, accessed separately). Embedding reduces query count but increases document size; referencing is more normalized but requires additional lookups or $lookup stages.
// MongoDB optimization: indexes, aggregation, schema design
// Compound index: follow ESR rule (Equality, Sort, Range)
// E: exact match fields first
// S: sort fields next
// R: range fields last
db.orders.createIndex(
{ status: 1, created_at: -1, total: 1 },
{ name: "idx_orders_esr" }
);
// This query uses the ESR index efficiently:
db.orders.find({
status: "completed", // E: equality
total: { $gte: 100 } // R: range
}).sort({ created_at: -1 }); // S: sort
// Covered query: projection matches index fields exactly
// MongoDB returns results from index without touching documents
db.orders.find(
{ status: "completed" },
{ _id: 0, status: 1, created_at: 1, total: 1 }
);
// Aggregation pipeline optimization
// Rule: filter ($match) and project ($project) as early as possible
db.orders.aggregate([
// Stage 1: filter first to reduce documents in pipeline
{ $match: {
status: "completed",
created_at: { $gte: new Date("2025-01-01") }
}},
// Stage 2: group after filtering
{ $group: {
_id: { $dateToString: { format: "%Y-%m-%d", date: "$created_at" } },
revenue: { $sum: "$total" },
count: { $sum: 1 }
}},
// Stage 3: sort the smaller result set
{ $sort: { _id: -1 } },
{ $limit: 30 }
]);
// Schema design: embed vs reference
// Embed when: data is always accessed together, 1:few relationship
// Reference when: data grows unbounded, many:many, accessed separately
// Good embed: order with its items
{
_id: ObjectId("..."),
customer_id: ObjectId("..."),
items: [
{ product_id: ObjectId("..."), name: "Widget", qty: 2, price: 9.99 },
{ product_id: ObjectId("..."), name: "Gadget", qty: 1, price: 29.99 }
],
total: 49.97
}
// explain() to analyze query performance
db.orders.find({ status: "completed" }).explain("executionStats");
9. Database Monitoring, Profiling, and Maintenance
Continuous monitoring and proactive maintenance prevent performance regressions. Setting up slow query logs, analyzing execution statistics, and performing routine maintenance tasks are essential operational practices.
pg_stat_statements, Active Query Monitoring, and Index Health
The pg_stat_statements extension is essential for PostgreSQL monitoring — it tracks execution statistics for every query, including total and average execution time, call count, and rows returned. Focus optimization efforts on queries with the highest total_time (calls multiplied by mean_time) rather than just the single slowest query. Monitor active connections and their states to detect connection leaks, and identify long-running queries that may be holding locks. Regularly check for unused indexes — each unused index wastes disk space and slows down writes without providing any benefit. Table and index size monitoring helps plan capacity and identify tables that may benefit from partitioning or archiving.
-- PostgreSQL monitoring with pg_stat_statements
-- Enable in postgresql.conf:
-- shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.track = all
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Top 10 slowest queries by total execution time
SELECT
round(total_exec_time::numeric, 2) AS total_time_ms,
calls,
round(mean_exec_time::numeric, 2) AS avg_time_ms,
round((100 * total_exec_time /
SUM(total_exec_time) OVER ())::numeric, 2) AS pct,
left(query, 80) AS short_query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Active connections and their state
SELECT state, count(*), max(now() - state_change) AS max_duration
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state;
-- Find long-running queries (> 5 minutes)
SELECT pid, now() - pg_stat_activity.query_start AS duration,
state, left(query, 100) AS query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes'
AND state != 'idle'
ORDER BY duration DESC;
-- Table and index sizes
SELECT relname,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
pg_size_pretty(pg_relation_size(relid)) AS table_size,
pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 15;
-- Unused indexes (candidates for removal)
SELECT schemaname, relname, indexrelname,
idx_scan, idx_tup_read,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelname NOT LIKE '%pkey'
ORDER BY pg_relation_size(indexrelid) DESC;
10. Scaling: Read Replicas, WAL, and Migration Strategies
When a single database server reaches its limits, scaling strategies become critical. Read replicas spread read load, write-ahead logging guarantees durability, and careful migration planning prevents downtime during schema changes.
Read Replicas, WAL Configuration, and Migration Patterns
Read replicas scale read capacity linearly — each replica can serve the same read throughput as the primary. PostgreSQL streaming replication continuously ships WAL records to replicas, maintaining near-real-time consistency. Synchronous replication guarantees zero data loss but adds write latency; asynchronous replication is faster but may have slight replication lag. Application-level read/write splitting routes read queries to replicas and writes to the primary.
Database migrations in production must follow the expand-contract pattern to avoid downtime. Never rename columns, drop columns, or change types in a single deployment. Instead, add the new column first, deploy code that writes to both old and new schemas, backfill data in batches, switch reads to the new schema, and only remove the old column in a later release. Creating indexes with CONCURRENTLY in PostgreSQL prevents table-level write locks during index creation. Always test migrations on a production-sized dataset copy before executing in production.
-- Read replicas: distribute read queries for linear scaling
-- PostgreSQL streaming replication setup (primary)
-- postgresql.conf on primary:
wal_level = replica
max_wal_senders = 5
wal_keep_size = '1GB'
synchronous_commit = on -- or "off" for async
-- pg_hba.conf on primary:
-- host replication replicator replica_ip/32 md5
-- On replica:
-- pg_basebackup -h primary_ip -D /var/lib/postgresql/16/main
-- -U replicator -P -Xs -R
-- Application routing: send writes to primary, reads to replica
// Node.js example with read/write splitting
// const primary = new Pool({ host: 'primary.db.internal' });
// const replica = new Pool({ host: 'replica.db.internal' });
//
// async function query(sql, params, readOnly = false) {
// const pool = readOnly ? replica : primary;
// return pool.query(sql, params);
// }
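The commented routing sketch above can be made runnable. A minimal version, assuming a `Queryable` interface standing in for pg's Pool and an illustrative round-robin policy across replicas:

```typescript
// Read/write splitting: writes go to the primary, read-only queries are
// distributed round-robin across replicas. `Queryable` abstracts the
// driver (e.g. pg's Pool) so the router itself has no dependencies.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

class SplitRouter {
  private next = 0;

  constructor(private primary: Queryable, private replicas: Queryable[]) {}

  query(sql: string, params: unknown[] = [], readOnly = false): Promise<unknown> {
    if (!readOnly || this.replicas.length === 0) {
      return this.primary.query(sql, params); // writes (and fallback) hit primary
    }
    // Round-robin across replicas for read traffic.
    const replica = this.replicas[this.next++ % this.replicas.length];
    return replica.query(sql, params);
  }
}
```

Note that replication lag means a read routed to a replica immediately after a write may miss it; read-your-own-writes flows should pass `readOnly = false`.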
-- Database migration: expand-contract pattern
-- Step 1: ADD new column (backward compatible)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Step 2: Backfill data in batches (not one massive UPDATE)
UPDATE users SET display_name = name
WHERE id BETWEEN 1 AND 10000 AND display_name IS NULL;
-- Repeat for next batch...
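The batch loop in Step 2 is usually driven from application code. A hedged sketch, where `exec` is an assumed callback that runs one SQL statement and resolves to the number of rows it touched:

```typescript
// Batched backfill driver: small UPDATEs in a loop instead of one massive
// UPDATE, pausing between batches so autovacuum and replication keep up.
async function backfillInBatches(
  exec: (sql: string, params: unknown[]) => Promise<number>,
  batchSize = 10_000,
  pauseMs = 100,
): Promise<number> {
  let total = 0;
  for (;;) {
    // A LIMITed subquery keeps each transaction short and lock-friendly.
    const touched = await exec(
      `UPDATE users SET display_name = name
       WHERE id IN (SELECT id FROM users
                    WHERE display_name IS NULL LIMIT $1)`,
      [batchSize],
    );
    total += touched;
    if (touched === 0) break; // nothing left to backfill
    await new Promise((r) => setTimeout(r, pauseMs));
  }
  return total;
}
```

Each iteration commits independently, so an interrupted backfill can simply be restarted; the `display_name IS NULL` predicate makes it idempotent.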
-- Step 3: Application reads from new column
-- Step 4: Application writes to new column
-- Step 5: (Later release) Drop old column
ALTER TABLE users DROP COLUMN name; -- only after full migration
-- Zero-downtime index creation
CREATE INDEX CONCURRENTLY idx_users_display_name
ON users(display_name);
-- CONCURRENTLY: does not lock table for writes (PostgreSQL)
Frequently Asked Questions
When should I add a database index?
Add an index when queries filter, join, or sort on a column and the table has more than a few thousand rows. Check EXPLAIN output for sequential scans on large tables. Avoid over-indexing, though — every index slows down INSERT, UPDATE, and DELETE. Focus on the columns used in WHERE clauses, JOIN conditions, and ORDER BY in your most frequent, slowest queries.
What is the difference between horizontal and vertical partitioning?
Vertical partitioning splits a table by columns — moving rarely accessed or large columns (such as BLOBs) into a separate table. Horizontal partitioning splits by rows (called sharding when spread across servers) — for example, partitioning orders by date range so queries scan only the relevant partitions. PostgreSQL natively supports declarative partitioning with RANGE, LIST, and HASH strategies.
How do I choose between Redis and Memcached?
Choose Redis when you need data structures (sorted sets, lists, hashes), persistence, pub/sub, or Lua scripting. Choose Memcached for simple key-value caching where multithreaded performance and lower memory overhead matter. Redis is more feature-rich and fits most use cases; Memcached shines for high-throughput simple caching where memory efficiency is critical. Most modern applications pick Redis.
What is the N+1 query problem and how do I fix it?
The N+1 problem occurs when code fetches a list of N items with one query, then runs N additional queries to fetch related data for each item. For example, loading 100 users and then querying each user's orders separately produces 101 queries. Fix it with JOINs, ORM eager loading (Prisma include, TypeORM relations), IN-clause batching, or DataLoader for GraphQL.
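The IN-clause batching option can be sketched as one batched query plus in-memory grouping. The `ordersByCustomer` helper and the `queryFn` callback are illustrative assumptions standing in for your driver:

```typescript
// IN-clause batching: one query for ALL related rows, grouped by key in
// memory — a single round trip replaces N per-item queries.
type OrderRow = { customer_id: number };

async function ordersByCustomer(
  queryFn: (sql: string, params: unknown[]) => Promise<OrderRow[]>,
  customerIds: number[],
): Promise<Map<number, OrderRow[]>> {
  // PostgreSQL: ANY($1) with an array parameter acts as an IN clause.
  const rows = await queryFn(
    "SELECT * FROM orders WHERE customer_id = ANY($1)",
    [customerIds],
  );
  const grouped = new Map<number, OrderRow[]>();
  for (const id of customerIds) grouped.set(id, []); // customers with no orders
  for (const row of rows) grouped.get(row.customer_id)?.push(row);
  return grouped;
}
```

The same shape underlies ORM eager loading and DataLoader: collect keys, fetch once, fan results back out.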
How do I decide between normalization and denormalization?
Use normalization (3NF) for write-heavy OLTP workloads where data consistency is critical — it prevents update anomalies and reduces storage. Use denormalization for read-heavy OLAP/analytics workloads where query speed matters more than write efficiency. In practice, most applications use a hybrid: normalized core tables plus strategic denormalization (materialized views, summary tables, cached aggregates) for hot read paths.
What is connection pooling and why does it matter?
A connection pool maintains a set of reusable database connections instead of opening a new one per request. Establishing a PostgreSQL connection takes 50-100ms and consumes roughly 10MB of memory. With pooling, connections are borrowed from the pool and returned after use. PgBouncer (external), HikariCP (Java), and the pg Pool module (Node.js) all implement this. PgBouncer in transaction mode can serve thousands of application connections with only 20-50 database connections.
How often should I run VACUUM and ANALYZE in PostgreSQL?
Autovacuum handles this automatically in most cases — never disable it. The default settings trigger autovacuum when 20% of a table's rows are dead tuples. For high-write tables, lower autovacuum_vacuum_scale_factor to 0.01-0.05. Run ANALYZE manually after bulk data loads or major schema changes. Monitor pg_stat_user_tables for tables with high n_dead_tup. For very large tables, consider partitioning to make VACUUM more efficient.
What is the best way to run database migrations in production?
Follow the expand-contract pattern: (1) add new columns/tables without removing old ones; (2) deploy application code that writes to both old and new schemas; (3) migrate existing data in batches; (4) switch reads to the new schema; (5) drop old columns in a later release. Never rename columns, drop columns, or change types in a single deployment. Use tools such as Flyway, Liquibase, or Prisma Migrate, and always test migrations against a copy of a production-sized dataset first.