Database Design Guide: Normalization, ERDs, Indexing & SQL vs NoSQL (2026)
Good database design starts with normalization (1NF→3NF), clear primary/foreign key relationships, and appropriate indexing on columns used in WHERE and JOIN clauses. Use SQL for transactional data requiring ACID guarantees; use NoSQL for scale-out, unstructured data, or caching. Always define your entities and relationships in an ERD before writing DDL. Connection pooling is essential in production. PostgreSQL is the recommended default for new projects in 2026.
Introduction: Why Database Design Matters
The database is the foundation of virtually every application. A well-designed schema makes queries fast, code simple, and future changes manageable. A poorly designed schema creates data integrity issues, performance bottlenecks, and cascading technical debt that is extremely expensive to fix later.
Database design is both a science and an art. The science involves formal normalization theory, relational algebra, and query optimization. The art involves understanding your access patterns, anticipating future requirements, and knowing when to break the rules for practical reasons.
This guide covers the full spectrum of database design knowledge you need to build robust, scalable applications in 2026 — from first principles through production patterns used by companies at scale.
- Normalize to 3NF by default; selectively denormalize for read performance where measured.
- Always define entities, attributes, and relationships in an ERD before writing SQL.
- Primary keys should be immutable; prefer surrogate keys (UUIDs or auto-increment integers).
- Index every foreign key column and any column appearing frequently in WHERE/ORDER BY clauses.
- Use SQL for transactional, relational data; NoSQL for scale-out, caching, or unstructured data.
- ACID transactions protect data integrity — understand isolation levels to avoid anomalies.
- Connection pooling is mandatory in production; size your pool based on CPU cores and workload.
- PostgreSQL is the recommended default relational database for new projects in 2026.
Normalization: From Raw Data to Clean Schema
Normalization is the process of structuring a relational database according to a series of formal rules (normal forms) to reduce data redundancy and prevent update anomalies. Each normal form builds on the previous one.
First Normal Form (1NF)
A table is in 1NF when every column contains only atomic (indivisible) values, and each row is uniquely identifiable. There must be no repeating groups or arrays stored in a single column.
Violation example — storing multiple phone numbers in one column:
-- BAD: violates 1NF — multiple values in one column
CREATE TABLE contacts (
id INT PRIMARY KEY,
name VARCHAR(100),
phones VARCHAR(200) -- "555-1234, 555-5678, 555-9012"
);
-- GOOD: separate table for phone numbers (1NF compliant)
CREATE TABLE contacts (
id INT PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE contact_phones (
id INT PRIMARY KEY,
contact_id INT REFERENCES contacts(id),
phone VARCHAR(20),
type VARCHAR(20) -- 'mobile', 'home', 'work'
);
Second Normal Form (2NF)
A table is in 2NF when it is in 1NF and every non-key attribute is fully functionally dependent on the entire primary key (not just part of it). This only applies to tables with composite primary keys.
-- BAD: violates 2NF — product_name depends only on product_id, not the full composite key
CREATE TABLE order_items (
order_id INT,
product_id INT,
product_name VARCHAR(100), -- depends on product_id only!
quantity INT,
unit_price DECIMAL(10,2),
PRIMARY KEY (order_id, product_id)
);
-- GOOD: 2NF compliant — separate products table
CREATE TABLE products (
id INT PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10,2)
);
CREATE TABLE order_items (
order_id INT REFERENCES orders(id),
product_id INT REFERENCES products(id),
quantity INT,
unit_price DECIMAL(10,2), -- snapshot price at time of order
PRIMARY KEY (order_id, product_id)
);
Third Normal Form (3NF)
A table is in 3NF when it is in 2NF and every non-key attribute is directly dependent on the primary key — not on another non-key attribute (no transitive dependencies).
-- BAD: violates 3NF — zip_code determines city and state (transitive dependency)
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100),
zip_code VARCHAR(10),
city VARCHAR(100), -- depends on zip_code, not id!
state VARCHAR(2) -- depends on zip_code, not id!
);
-- GOOD: 3NF compliant — extract zip code lookup table
CREATE TABLE zip_codes (
zip_code VARCHAR(10) PRIMARY KEY,
city VARCHAR(100),
state VARCHAR(2)
);
CREATE TABLE employees (
id INT PRIMARY KEY,
name VARCHAR(100),
zip_code VARCHAR(10) REFERENCES zip_codes(zip_code)
);
Boyce-Codd Normal Form (BCNF)
BCNF is a stricter version of 3NF. A table is in BCNF when for every functional dependency X → Y, X is a superkey (a column or set of columns that uniquely identifies rows). BCNF handles edge cases that 3NF misses involving multiple overlapping candidate keys. In practice, 3NF is sufficient for most production databases.
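To make the overlapping-candidate-key case concrete, here is a runnable sketch of the classic enrollment example, using SQLite via Python's sqlite3 (table and column names are illustrative, not from any schema above). The candidate keys are (student, course) and (student, instructor); instructor → course holds, but instructor is not a superkey, so the table satisfies 3NF yet violates BCNF:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Candidate keys: (student, course) and (student, instructor).
# instructor -> course also holds (each instructor teaches one course),
# but instructor is not a superkey, so BCNF is violated. 3NF is not,
# because course is a prime attribute (part of a candidate key).
conn.execute("""
    CREATE TABLE enrollments (
        student TEXT NOT NULL,
        course TEXT NOT NULL,
        instructor TEXT NOT NULL,
        PRIMARY KEY (student, course)
    )""")
conn.executemany("INSERT INTO enrollments VALUES (?, ?, ?)", [
    ("alice", "databases", "dr_codd"),
    ("bob", "databases", "dr_codd"),  # instructor/course pair stored twice
])
# Update anomaly: renaming the course in only one row silently breaks
# the instructor -> course dependency.
conn.execute("""UPDATE enrollments SET course = 'advanced databases'
                WHERE student = 'alice'""")
rows = conn.execute("""SELECT DISTINCT course FROM enrollments
                       WHERE instructor = 'dr_codd'""").fetchall()
print(len(rows))  # 2 -- dr_codd now appears to teach two courses
```

The BCNF decomposition splits this into instructor_courses(instructor, course) and enrollments(student, instructor), so the dependency is stored exactly once.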
When to Denormalize
Normalization optimizes for write consistency. Denormalization trades some redundancy for read performance. Consider denormalization when:
- A query requires joining 5+ tables and runs frequently (dashboard, API endpoint)
- You are building a read-heavy reporting or analytics system
- Latency requirements cannot be met with normalized queries even with indexes
- You are using a data warehouse (star/snowflake schema is intentionally denormalized)
Common denormalization techniques include storing computed counts (e.g., a post's comment_count), caching aggregate values (e.g., a user's total_spent), and duplicating frequently joined columns (e.g., storing category_name directly on products instead of always joining to categories).
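As a sketch of the first technique, a trigger can keep a denormalized comment_count in sync automatically. SQLite (via Python's sqlite3) is used here only because it runs anywhere; in PostgreSQL the equivalent would be a plpgsql trigger function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    comment_count INTEGER NOT NULL DEFAULT 0  -- denormalized counter
);
CREATE TABLE comments (
    id INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES posts(id),
    body TEXT NOT NULL
);
-- Keep the counter in sync on insert and delete
CREATE TRIGGER comments_ins AFTER INSERT ON comments BEGIN
    UPDATE posts SET comment_count = comment_count + 1 WHERE id = NEW.post_id;
END;
CREATE TRIGGER comments_del AFTER DELETE ON comments BEGIN
    UPDATE posts SET comment_count = comment_count - 1 WHERE id = OLD.post_id;
END;
""")
conn.execute("INSERT INTO posts (id, title) VALUES (1, 'Hello')")
conn.executemany("INSERT INTO comments (post_id, body) VALUES (?, ?)",
                 [(1, "first"), (1, "second"), (1, "third")])
conn.execute("DELETE FROM comments WHERE body = 'second'")
count = conn.execute("SELECT comment_count FROM posts WHERE id = 1").fetchone()[0]
print(count)  # 2
```

The trade-off is typical of denormalization: reads of the count become a single-row lookup, while every comment write pays for the extra update.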
ERD Design: Entity-Relationship Diagrams
An Entity-Relationship Diagram (ERD) is a visual blueprint of your database schema. Creating an ERD before writing SQL prevents design mistakes and provides a shared language between developers, DBAs, and stakeholders.
Core ERD Concepts
ERDs consist of three fundamental building blocks:
- Entities — Real-world objects represented as tables (e.g., User, Product, Order)
- Attributes — Properties of entities represented as columns (e.g., user.email, product.price)
- Relationships — Associations between entities, characterized by cardinality (one-to-one, one-to-many, many-to-many)
Cardinality Notation
| Relationship | Example | Implementation |
|---|---|---|
| One-to-One (1:1) | User ↔ UserProfile | FK with UNIQUE constraint |
| One-to-Many (1:N) | User → Orders | FK on the "many" side |
| Many-to-Many (M:N) | Students ↔ Courses | Junction table with two FKs |
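The one-to-one row of the table deserves a concrete illustration, since it is the least obvious: a plain foreign key enforces "at most many", and adding UNIQUE tightens it to "at most one". A minimal sketch with SQLite via Python's sqlite3 (illustrative table names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    email TEXT UNIQUE NOT NULL
);
-- 1:1 -- the UNIQUE constraint on the FK means each user
-- can have at most one profile row.
CREATE TABLE user_profiles (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL UNIQUE REFERENCES users(id),
    bio TEXT
);
""")
conn.execute("INSERT INTO users (id, email) VALUES (1, 'alice@example.com')")
conn.execute("INSERT INTO user_profiles (user_id, bio) VALUES (1, 'first profile')")
try:
    # A second profile for the same user violates the UNIQUE constraint
    conn.execute("INSERT INTO user_profiles (user_id, bio) VALUES (1, 'second profile')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```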
ERD Design Process
- Identify entities — List all the nouns in your system (User, Product, Order, Category, Review)
- Define attributes — For each entity, list its properties and data types
- Identify relationships — Determine how entities relate and the cardinality
- Define primary keys — Choose a unique identifier for each entity
- Add foreign keys — Implement relationships with foreign key constraints
- Apply normalization — Check each table against normal form rules
- Review with stakeholders — Validate the design reflects real business rules
Primary and Foreign Keys
Keys are the cornerstone of relational database integrity. Choosing the right key strategy significantly impacts performance, simplicity, and scalability.
Primary Key Strategies
-- Strategy 1: Auto-increment integer (simplest, most common)
-- Pros: compact, fast to index, easy to type/debug
-- Cons: sequential IDs expose business data (user count), not globally unique
CREATE TABLE users (
id SERIAL PRIMARY KEY, -- PostgreSQL: auto-increment integer
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Strategy 2: UUID v4 (random, globally unique)
-- Pros: non-sequential (no ID enumeration), works across distributed systems
-- Cons: larger (16 bytes vs 4), random inserts cause B-tree page fragmentation
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(), -- PostgreSQL 13+
user_id INT REFERENCES users(id),
total DECIMAL(10,2),
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Strategy 3: UUID v7 (time-ordered UUID), the recommended default in 2026
-- Pros: globally unique AND sequential (no fragmentation), sortable by time
-- Best of both worlds for most modern applications
CREATE TABLE events (
id UUID PRIMARY KEY DEFAULT uuidv7(), -- built in since PostgreSQL 18; use an extension's uuid_generate_v7() on older versions
type VARCHAR(100),
payload JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Strategy 4: Natural key (rare, use carefully)
-- Pros: no surrogate key needed, meaningful column
-- Cons: natural keys can change (emails change, ISO codes change)
CREATE TABLE countries (
iso_code VARCHAR(2) PRIMARY KEY, -- 'US', 'GB', 'DE'
name VARCHAR(100) NOT NULL
);
Foreign Key Constraints and Referential Actions
-- Foreign key with referential integrity actions
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL,
title VARCHAR(255) NOT NULL,
body TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
-- ON DELETE CASCADE: delete posts when user is deleted
-- ON DELETE SET NULL: set user_id to NULL when user is deleted
-- ON DELETE RESTRICT: prevent user deletion if posts exist (NO ACTION, the default, behaves similarly)
CONSTRAINT fk_posts_user
FOREIGN KEY (user_id)
REFERENCES users(id)
ON DELETE CASCADE
ON UPDATE CASCADE
);
-- Always index foreign key columns (they are NOT auto-indexed!)
CREATE INDEX idx_posts_user_id ON posts(user_id);
Indexing Strategies
Indexes are the single most impactful tool for query performance. Understanding how different index types work and when to apply them separates good database design from great database design.
B-Tree Indexes (Default)
B-tree (balanced tree) indexes are the default index type in PostgreSQL and MySQL. They support equality checks (=), range queries (>, <, BETWEEN), and ORDER BY operations efficiently.
-- Single column index — most common
CREATE INDEX idx_users_email ON users(email);
-- Verify index usage with EXPLAIN ANALYZE (PostgreSQL)
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
-- Output shows: "Index Scan using idx_users_email on users"
-- Look for: Index Scan (good), Seq Scan on large table (investigate)
-- Partial index — index only a subset of rows (smaller, faster)
-- Index only active users (status = 'active') for a users table
CREATE INDEX idx_users_active ON users(email) WHERE status = 'active';
-- Expression index — index computed values
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
-- Now this query can use the index:
-- SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
Composite Indexes
Composite indexes cover multiple columns. The column order matters enormously — a composite index on (a, b) can be used for queries on column a alone, but NOT for queries on column b alone (the leading column rule).
-- Composite index for common query pattern
-- Query: SELECT * FROM orders WHERE user_id = 42 AND status = 'pending'
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- This index also helps: WHERE user_id = 42 (first column used)
-- This index does NOT help: WHERE status = 'pending' (missing first column)
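You can watch the leading-column rule in any engine's plan output. Here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 (SQLite's wording differs from PostgreSQL's EXPLAIN, but the rule is identical; a total column is included so a covering-index scan cannot mask the effect):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INT, status TEXT, total REAL)")
conn.execute("CREATE INDEX idx_orders_user_status ON orders(user_id, status)")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

by_user = plan("SELECT * FROM orders WHERE user_id = 42")
by_status = plan("SELECT * FROM orders WHERE status = 'pending'")
print("USING INDEX" in by_user)    # True  -- leading column present, index usable
print("USING INDEX" in by_status)  # False -- leading column missing, full scan
```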
-- Index for range queries: put the equality column first
-- Query: SELECT * FROM events WHERE user_id = 42 AND created_at > '2026-01-01'
CREATE INDEX idx_events_user_created ON events(user_id, created_at);
-- Covering index: include all columns needed by the query
-- The database can answer entirely from the index without touching the table
CREATE INDEX idx_orders_covering ON orders(user_id, status)
INCLUDE (id, total, created_at); -- PostgreSQL syntax
Other Index Types (PostgreSQL)
-- GIN index: for full-text search and JSONB containment queries
CREATE INDEX idx_articles_fts ON articles USING GIN(to_tsvector('english', title || ' ' || body));
-- Query: SELECT * FROM articles WHERE to_tsvector('english', title || ' ' || body) @@ plainto_tsquery('database design');
CREATE INDEX idx_products_attributes ON products USING GIN(attributes); -- JSONB column
-- Query: SELECT * FROM products WHERE attributes @> '{"color": "red"}';
-- GiST index: for geometric data, IP ranges, exclusion constraints
CREATE INDEX idx_reservations_period ON reservations USING GiST(tstzrange(start_time, end_time));
-- BRIN index: for very large tables with naturally ordered data (logs, time-series)
-- Much smaller than B-tree, faster to build, less accurate (block-level)
CREATE INDEX idx_logs_created_brin ON logs USING BRIN(created_at);
-- Hash index: for equality-only lookups (rarely better than B-tree in practice)
CREATE INDEX idx_sessions_token ON sessions USING HASH(token);
Index Best Practices
- Always index foreign key columns — PostgreSQL does NOT create them automatically
- Use EXPLAIN ANALYZE to confirm index usage before and after adding indexes
- Remove unused indexes; they slow down writes without helping reads
- Consider partial indexes for tables with heavily skewed data (e.g., 95% inactive records)
- In PostgreSQL, use CREATE INDEX CONCURRENTLY to add indexes without locking the table
- Keep index column count to 3 or fewer for most use cases
SQL vs NoSQL: Choosing the Right Database
One of the most consequential decisions in system design is choosing between relational (SQL) and non-relational (NoSQL) databases. Both have legitimate use cases, and modern applications often use both.
| Dimension | SQL (Relational) | NoSQL (Non-Relational) |
|---|---|---|
| Schema | Fixed schema, DDL migrations | Flexible/schemaless |
| Scaling | Vertical (scale-up); horizontal via read replicas | Horizontal (scale-out) native |
| Transactions | Full ACID transactions | BASE (eventually consistent); some offer ACID |
| Query Language | Standardized SQL | Database-specific APIs |
| Joins | Efficient JOINs | Limited or no native joins (denormalize, or join in the application) |
| Best for | Banking, e-commerce, CRM, ERP | Caching, sessions, IoT, catalogs, feeds |
| Examples | PostgreSQL, MySQL, SQLite, SQL Server | MongoDB, Redis, DynamoDB, Cassandra |
NoSQL Database Types
| Type | Database | Best Use Case |
|---|---|---|
| Document Store | MongoDB, CouchDB, Firestore | Catalogs, CMS, user profiles with variable attributes |
| Key-Value Store | Redis, DynamoDB, Memcached | Sessions, caching, rate limiting, leaderboards |
| Wide-Column Store | Cassandra, HBase, Bigtable | IoT time-series, write-heavy analytics at scale |
| Graph Database | Neo4j, Amazon Neptune, ArangoDB | Social graphs, recommendation engines, fraud detection |
| Search Engine | Elasticsearch, Typesense, Meilisearch | Full-text search, log analytics, auto-complete |
Relationships: One-to-Many and Many-to-Many
Understanding how to model relationships correctly is the most practically important skill in database design. Most real-world data involves complex webs of relationships.
One-to-Many Relationships
The most common relationship type. The foreign key always lives on the "many" side of the relationship.
-- One user has many posts (1:N)
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
title VARCHAR(255) NOT NULL,
slug VARCHAR(255) UNIQUE NOT NULL,
body TEXT,
published BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Index the foreign key
CREATE INDEX idx_posts_user_id ON posts(user_id);
-- Query: get all posts by a user with author info
SELECT p.id, p.title, p.created_at, u.username
FROM posts p
JOIN users u ON u.id = p.user_id
WHERE u.username = 'alice'
ORDER BY p.created_at DESC;
Many-to-Many Relationships
Many-to-many relationships require a junction table (also called a bridge, pivot, or associative table). The junction table has foreign keys to both related tables and can carry additional data about the relationship itself.
-- Many-to-many: posts can have many tags; tags can belong to many posts
CREATE TABLE tags (
id SERIAL PRIMARY KEY,
name VARCHAR(50) UNIQUE NOT NULL,
slug VARCHAR(50) UNIQUE NOT NULL
);
-- Junction table: post_tags
CREATE TABLE post_tags (
post_id INT REFERENCES posts(id) ON DELETE CASCADE,
tag_id INT REFERENCES tags(id) ON DELETE CASCADE,
tagged_at TIMESTAMPTZ DEFAULT NOW(), -- extra relationship data
PRIMARY KEY (post_id, tag_id) -- composite PK prevents duplicates
);
CREATE INDEX idx_post_tags_tag_id ON post_tags(tag_id);
-- Query: get all posts for a given tag
SELECT p.title, p.slug, p.created_at
FROM posts p
JOIN post_tags pt ON pt.post_id = p.id
JOIN tags t ON t.id = pt.tag_id
WHERE t.slug = 'javascript'
ORDER BY p.created_at DESC;
-- Query: get all tags for a given post
SELECT t.name, t.slug
FROM tags t
JOIN post_tags pt ON pt.tag_id = t.id
WHERE pt.post_id = 42;
Self-Referencing Relationships
-- Hierarchical data: categories with subcategories
CREATE TABLE categories (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
parent_id INT REFERENCES categories(id) ON DELETE SET NULL, -- self-reference
depth INT DEFAULT 0 -- for fast depth queries
);
-- Example data
INSERT INTO categories (id, name, parent_id) VALUES
(1, 'Electronics', NULL),
(2, 'Computers', 1),
(3, 'Laptops', 2),
(4, 'Gaming Laptops', 3);
-- Recursive CTE to get full hierarchy
WITH RECURSIVE category_tree AS (
SELECT id, name, parent_id, 0 AS level, name::TEXT AS path
FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.id, c.name, c.parent_id, ct.level + 1,
ct.path || ' > ' || c.name
FROM categories c
JOIN category_tree ct ON ct.id = c.parent_id
)
SELECT * FROM category_tree ORDER BY path;
Performance Optimization
Database performance optimization is an iterative process. Start with good schema design, then measure, then optimize. Never optimize prematurely without measuring first.
Query Optimization with EXPLAIN ANALYZE
-- Always use EXPLAIN ANALYZE to understand query execution plans
EXPLAIN ANALYZE
SELECT u.username, COUNT(p.id) AS post_count
FROM users u
LEFT JOIN posts p ON p.user_id = u.id
WHERE u.created_at > '2026-01-01'
GROUP BY u.id, u.username
ORDER BY post_count DESC
LIMIT 20;
-- Key things to look for in the output:
-- "Seq Scan" on large tables → add an index
-- "Rows Removed by Filter: 999999" → bad index selectivity
-- "Nested Loop" with large row counts → may need a Hash Join
-- Actual time much higher than estimated → stale statistics, run ANALYZE
-- "cost=0.00..1234.56" → higher cost = more expensive
-- Update table statistics if estimates are wrong
ANALYZE users;
ANALYZE posts;
Common Performance Patterns
-- Pattern 1: Pagination with keyset (cursor) instead of OFFSET
-- BAD: OFFSET becomes slower as offset grows (scans and discards rows)
SELECT * FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 10000;
-- GOOD: keyset pagination (constant time regardless of page number)
-- "Give me 20 posts created before the last one I saw"
SELECT * FROM posts
WHERE created_at < '2026-01-15T12:00:00Z'
ORDER BY created_at DESC
LIMIT 20;
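To check that keyset pagination really pages correctly, here is a small end-to-end sketch with SQLite via Python's sqlite3 (synthetic data; in production, order by a tie-breaker pair such as (created_at, id) so rows with identical timestamps are not skipped):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)")
# 30 posts, one per day; zero-padded ISO strings sort chronologically
conn.executemany("INSERT INTO posts (id, created_at) VALUES (?, ?)",
                 [(i, f"2026-01-{i:02d}T00:00:00Z") for i in range(1, 31)])

page_size = 5
# Page 1: newest first
page1 = conn.execute("SELECT id, created_at FROM posts ORDER BY created_at DESC LIMIT ?",
                     (page_size,)).fetchall()
cursor = page1[-1][1]  # remember the last created_at we saw
# Page 2: keyset -- everything strictly older than the cursor
page2 = conn.execute("""SELECT id, created_at FROM posts
                        WHERE created_at < ? ORDER BY created_at DESC LIMIT ?""",
                     (cursor, page_size)).fetchall()
print([r[0] for r in page1])  # [30, 29, 28, 27, 26]
print([r[0] for r in page2])  # [25, 24, 23, 22, 21]
```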
-- Pattern 2: Avoid SELECT * in production queries
-- BAD: fetches all columns including large TEXT/BYTEA columns
SELECT * FROM posts WHERE id = 42;
-- GOOD: fetch only what you need
SELECT id, title, slug, created_at FROM posts WHERE id = 42;
-- Pattern 3: Use EXISTS instead of COUNT for existence checks
-- BAD: scans entire result set to count
SELECT COUNT(*) > 0 FROM posts WHERE user_id = 42;
-- GOOD: stops at first match
SELECT EXISTS(SELECT 1 FROM posts WHERE user_id = 42);
-- Pattern 4: Batch inserts for bulk data loading
-- BAD: 1000 individual INSERT statements
INSERT INTO events (type, payload) VALUES ('click', '{}');
-- ...repeated 1000 times
-- GOOD: single multi-row INSERT (or COPY for PostgreSQL bulk loading)
INSERT INTO events (type, payload) VALUES
('click', '{"x":100, "y":200}'),
('scroll', '{"delta":300}'),
('keypress', '{"key":"Enter"}');
-- ... up to a few thousand rows per statement
-- PostgreSQL COPY for maximum throughput (millions of rows)
COPY events(type, payload) FROM '/tmp/events.csv' CSV HEADER;
Materialized Views for Expensive Aggregations
-- Create a materialized view for an expensive report
CREATE MATERIALIZED VIEW monthly_revenue AS
SELECT
DATE_TRUNC('month', created_at) AS month,
SUM(total) AS revenue,
COUNT(*) AS order_count,
COUNT(DISTINCT user_id) AS unique_customers
FROM orders
WHERE status = 'completed'
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month DESC;
-- Index the materialized view (UNIQUE is required for REFRESH ... CONCURRENTLY)
CREATE UNIQUE INDEX idx_monthly_revenue_month ON monthly_revenue(month);
-- Query is now instant (reads pre-computed data)
SELECT * FROM monthly_revenue WHERE month >= '2026-01-01';
-- Refresh when underlying data changes (schedule this)
REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_revenue;
Transactions and ACID Properties
Transactions allow multiple SQL statements to be executed as a single atomic unit. ACID properties guarantee that your database remains in a consistent state even when things go wrong.
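Atomicity, the A in ACID, is easy to demonstrate from client code: if any statement in the transaction fails, every statement is rolled back. A minimal sketch with SQLite via Python's sqlite3 (any PostgreSQL driver behaves the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions manually
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 50), (2, 0)])

def transfer(from_id, to_id, amount):
    conn.execute("BEGIN")
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id))
        balance = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                               (from_id,)).fetchone()[0]
        if balance < 0:
            raise ValueError("Insufficient funds")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # both updates are undone together
        raise

try:
    transfer(1, 2, 100)  # would overdraw account 1
except ValueError:
    pass
balances = [r[0] for r in conn.execute("SELECT balance FROM accounts ORDER BY id")]
print(balances)  # [50, 0] -- unchanged; no partial transfer survives
```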
Using Transactions
-- Transfer $100 from account A to account B atomically
-- Both updates succeed or both roll back — never a partial state
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
-- Check for negative balance before committing
DO $$
BEGIN
IF (SELECT balance FROM accounts WHERE id = 1) < 0 THEN
RAISE EXCEPTION 'Insufficient funds';
END IF;
END $$;
COMMIT;
-- ROLLBACK; if anything goes wrong
// Transaction in Node.js with pg (node-postgres)
const { Pool } = require('pg');
const pool = new Pool();
async function transferFunds(fromId, toId, amount) {
const client = await pool.connect();
try {
await client.query('BEGIN');
const { rows } = await client.query(
'SELECT balance FROM accounts WHERE id = $1 FOR UPDATE',
[fromId] // FOR UPDATE locks the row during the transaction
);
if (rows[0].balance < amount) {
throw new Error('Insufficient funds');
}
await client.query(
'UPDATE accounts SET balance = balance - $1 WHERE id = $2',
[amount, fromId]
);
await client.query(
'UPDATE accounts SET balance = balance + $1 WHERE id = $2',
[amount, toId]
);
await client.query('COMMIT');
console.log('Transfer successful');
} catch (err) {
await client.query('ROLLBACK');
throw err;
} finally {
client.release(); // return connection to pool
}
}
Transaction Isolation Levels
| Isolation Level | Dirty Read | Non-Repeatable Read | Phantom Read | Use When |
|---|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible | Almost never (highest throughput) |
| READ COMMITTED | Prevented | Possible | Possible | Default in PostgreSQL; most OLTP apps |
| REPEATABLE READ | Prevented | Prevented | Possible | Reports, consistent snapshots |
| SERIALIZABLE | Prevented | Prevented | Prevented | Financial systems, maximum correctness |
Real-World Schema Examples
The best way to learn database design is to study real schemas. Here are production-quality schemas for three common application types.
E-Commerce Schema
-- E-Commerce: Users, Products, Orders, Reviews
CREATE TABLE users (
id SERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
first_name VARCHAR(100),
last_name VARCHAR(100),
role VARCHAR(20) DEFAULT 'customer' CHECK (role IN ('customer', 'admin')),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE categories (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
slug VARCHAR(100) UNIQUE NOT NULL,
parent_id INT REFERENCES categories(id)
);
CREATE TABLE products (
id SERIAL PRIMARY KEY,
category_id INT REFERENCES categories(id),
name VARCHAR(255) NOT NULL,
slug VARCHAR(255) UNIQUE NOT NULL,
description TEXT,
price DECIMAL(10,2) NOT NULL CHECK (price >= 0),
stock_qty INT NOT NULL DEFAULT 0 CHECK (stock_qty >= 0),
is_active BOOLEAN DEFAULT TRUE,
attributes JSONB DEFAULT '{}', -- flexible: {"color":"red","size":"XL"}
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_products_active ON products(is_active) WHERE is_active = TRUE;
CREATE INDEX idx_products_attrs ON products USING GIN(attributes);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL REFERENCES users(id),
status VARCHAR(20) DEFAULT 'pending'
CHECK (status IN ('pending','confirmed','shipped','delivered','cancelled')),
subtotal DECIMAL(10,2) NOT NULL,
tax DECIMAL(10,2) DEFAULT 0,
shipping_cost DECIMAL(10,2) DEFAULT 0,
total DECIMAL(10,2) NOT NULL,
shipping_addr JSONB,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status, created_at DESC);
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
product_id INT NOT NULL REFERENCES products(id),
quantity INT NOT NULL CHECK (quantity > 0),
unit_price DECIMAL(10,2) NOT NULL, -- snapshot at time of purchase
total DECIMAL(10,2) NOT NULL
);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
CREATE TABLE reviews (
id SERIAL PRIMARY KEY,
product_id INT NOT NULL REFERENCES products(id) ON DELETE CASCADE,
user_id INT NOT NULL REFERENCES users(id),
rating SMALLINT NOT NULL CHECK (rating BETWEEN 1 AND 5),
title VARCHAR(255),
body TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE (product_id, user_id) -- one review per user per product
);
CREATE INDEX idx_reviews_product ON reviews(product_id, rating);
Blog/CMS Schema
-- Blog: Authors, Posts, Categories, Tags, Comments
CREATE TABLE authors (
id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
bio TEXT,
avatar_url VARCHAR(500),
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
author_id INT NOT NULL REFERENCES authors(id),
title VARCHAR(255) NOT NULL,
slug VARCHAR(255) UNIQUE NOT NULL,
excerpt VARCHAR(500),
body TEXT NOT NULL,
status VARCHAR(20) DEFAULT 'draft' CHECK (status IN ('draft','published','archived')),
published_at TIMESTAMPTZ,
view_count INT DEFAULT 0,
search_vector TSVECTOR, -- for full-text search
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_posts_author ON posts(author_id);
CREATE INDEX idx_posts_status ON posts(status, published_at DESC);
CREATE INDEX idx_posts_search ON posts USING GIN(search_vector);
-- Auto-update search_vector on insert/update
CREATE OR REPLACE FUNCTION posts_search_vector_update() RETURNS TRIGGER AS $$
BEGIN
NEW.search_vector := to_tsvector('english', NEW.title || ' ' || COALESCE(NEW.body, ''));
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER posts_search_vector_trigger
BEFORE INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE FUNCTION posts_search_vector_update();
-- Full-text search query
SELECT title, slug, excerpt, published_at
FROM posts
WHERE search_vector @@ plainto_tsquery('english', 'database design normalization')
AND status = 'published'
ORDER BY ts_rank(search_vector, plainto_tsquery('english', 'database design normalization')) DESC
LIMIT 10;
Social Media Schema
-- Social Media: Users, Follows, Posts, Likes, Comments
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
display_name VARCHAR(100),
bio VARCHAR(500),
avatar_url VARCHAR(500),
follower_count INT DEFAULT 0, -- denormalized counter (updated by trigger)
following_count INT DEFAULT 0,
post_count INT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Follows: many-to-many self-referencing
CREATE TABLE follows (
follower_id INT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
following_id INT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (follower_id, following_id),
CHECK (follower_id != following_id) -- can't follow yourself
);
CREATE INDEX idx_follows_following ON follows(following_id);
CREATE TABLE posts (
id SERIAL PRIMARY KEY,
user_id INT NOT NULL REFERENCES users(id) ON DELETE CASCADE,
body VARCHAR(280), -- Twitter-like character limit
media_urls TEXT[], -- PostgreSQL array for multiple images
like_count INT DEFAULT 0,
comment_count INT DEFAULT 0,
is_reply_to INT REFERENCES posts(id), -- threaded replies
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_posts_user_id ON posts(user_id, created_at DESC);
CREATE INDEX idx_posts_is_reply ON posts(is_reply_to) WHERE is_reply_to IS NOT NULL;
-- Likes: many-to-many between users and posts
CREATE TABLE likes (
user_id INT REFERENCES users(id) ON DELETE CASCADE,
post_id INT REFERENCES posts(id) ON DELETE CASCADE,
created_at TIMESTAMPTZ DEFAULT NOW(),
PRIMARY KEY (user_id, post_id)
);
CREATE INDEX idx_likes_post_id ON likes(post_id);
-- User feed query: posts from users I follow, most recent first
SELECT p.*, u.username, u.avatar_url
FROM posts p
JOIN users u ON u.id = p.user_id
WHERE p.user_id IN (
SELECT following_id FROM follows WHERE follower_id = 42
)
ORDER BY p.created_at DESC
LIMIT 50;
Connection Pooling
Connection pooling is not optional in production — it is essential. Without it, each HTTP request would need to establish a fresh database connection, incurring significant overhead and potentially overwhelming the database.
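The mechanics are simple to sketch: a pool pre-opens N connections and hands them out and takes them back instead of opening a new one per request. The toy Python version below (sqlite3 plus a queue; real pools such as pg-pool or PgBouncer add health checks, timeouts, and reconnection) shows the checkout/return cycle:

```python
import queue
import sqlite3

# A toy pool: pre-open N connections, check out / return via a thread-safe queue.
class TinyPool:
    def __init__(self, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self, timeout=2.0):
        # Blocks while the pool is exhausted; raises queue.Empty after timeout
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)  # return the connection instead of closing it

pool = TinyPool(size=3)
conn = pool.acquire()
try:
    one = conn.execute("SELECT 1").fetchone()[0]
finally:
    pool.release(conn)  # always release, even on error (cf. client.release() below)
print(one)                 # 1
print(pool._pool.qsize())  # 3 -- all connections back in the pool
```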
Connection Pooling with Node.js (pg-pool)
// connection-pool.js
const { Pool } = require('pg');
const pool = new Pool({
host: process.env.DB_HOST || 'localhost',
port: parseInt(process.env.DB_PORT || '5432'),
database: process.env.DB_NAME || 'myapp',
user: process.env.DB_USER || 'postgres',
password: process.env.DB_PASSWORD,
// Pool configuration
max: 20, // Maximum connections in pool (default: 10)
min: 5, // Minimum idle connections to keep open
idleTimeoutMillis: 30000, // Close idle clients after 30s
connectionTimeoutMillis: 2000, // Timeout if no connection available in 2s
maxUses: 7500, // Close connection after 7500 uses (prevents memory leaks)
// SSL for production
ssl: process.env.NODE_ENV === 'production'
? { rejectUnauthorized: true, ca: process.env.DB_SSL_CA }
: false,
});
// Monitor pool health
pool.on('connect', (client) => {
console.log('New client connected to pool');
});
pool.on('error', (err, client) => {
console.error('Unexpected error on idle client', err);
process.exit(-1);
});
// Graceful shutdown
process.on('SIGTERM', async () => {
await pool.end();
console.log('Pool has ended');
});
module.exports = pool;
// Usage in application code
async function getUserById(id) {
// pool.query() automatically checks out and returns a client
const result = await pool.query(
'SELECT id, username, email FROM users WHERE id = $1',
[id]
);
return result.rows[0] || null;
}
PgBouncer: External Connection Pooling
For high-traffic applications, PgBouncer is a lightweight connection pooler that sits between your application and PostgreSQL. It supports three pooling modes:
- Session pooling — one server connection per client session (similar to no pooling)
- Transaction pooling — server connection held only during a transaction (recommended)
- Statement pooling — server connection held per statement (breaks multi-statement transactions)
; /etc/pgbouncer/pgbouncer.ini
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp
[pgbouncer]
listen_port = 6432
listen_addr = 127.0.0.1
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction ; recommended for most apps
max_client_conn = 1000 ; max app connections to PgBouncer
default_pool_size = 25 ; max connections to PostgreSQL per database
min_pool_size = 5
reserve_pool_size = 5
log_connections = 0 ; disable in production for performance
server_idle_timeout = 600
Best Practices and Production Tips
These are the lessons learned from operating databases at scale — patterns that prevent common mistakes and operational pain.
PostgreSQL vs MySQL: 2026 Recommendations
| Feature | PostgreSQL 16 | MySQL 8.4 |
|---|---|---|
| JSON support | JSONB (binary, indexed, fast) | JSON (text, limited indexing) |
| Full-text search | Built-in tsvector/tsquery | FULLTEXT index (limited) |
| Window functions | Full support since PostgreSQL 8.4 | Support since MySQL 8.0 |
| CTE (WITH) | Full recursive CTE support | Support since MySQL 8.0 |
| Concurrency | MVCC (readers don't block writers) | MVCC (InnoDB engine) |
| Hosting support | All major cloud providers | All major cloud providers |
| Extensions | PostGIS, pgvector, TimescaleDB | Limited extension ecosystem |
| Verdict (2026) | Recommended default | Good for existing MySQL stacks |
Schema Migration Best Practices
-- Safe schema changes in production (zero-downtime patterns)
-- 1. Adding a column: safe (no lock) with a default in PostgreSQL 11+
ALTER TABLE users ADD COLUMN last_login TIMESTAMPTZ DEFAULT NULL;
-- 2. Adding NOT NULL column safely (add nullable, backfill, then constrain)
-- UNSAFE on PostgreSQL < 11: one-step NOT NULL with DEFAULT rewrites the whole
-- table under an exclusive lock (on 11+ the default is stored in the catalog
-- and this is fast, but the three-step pattern remains the portable choice)
ALTER TABLE users ADD COLUMN preferences JSONB NOT NULL DEFAULT '{}';
-- SAFE: three-step approach
ALTER TABLE users ADD COLUMN preferences JSONB; -- Step 1: add nullable
UPDATE users SET preferences = '{}' WHERE preferences IS NULL; -- Step 2: backfill (in batches on large tables)
ALTER TABLE users ALTER COLUMN preferences SET NOT NULL; -- Step 3: add constraint
-- 3. Adding an index without blocking reads/writes (PostgreSQL)
CREATE INDEX CONCURRENTLY idx_users_last_login ON users(last_login);
-- 4. Renaming a column (dangerous — use a view or application-side rename)
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN user_name VARCHAR(50);
-- Step 2: Sync old to new via trigger or backfill
-- Step 3: Update application to use new name
-- Step 4: Drop old column after full deployment
ALTER TABLE users DROP COLUMN username;
-- 5. Partitioning a large table (range partitioning by date)
CREATE TABLE events (
id SERIAL,
type VARCHAR(100),
created_at TIMESTAMPTZ DEFAULT NOW()
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2026_01 PARTITION OF events
FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');
CREATE TABLE events_2026_02 PARTITION OF events
FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

Security Best Practices
-- 1. Use parameterized queries ALWAYS (prevents SQL injection)
-- In Node.js with pg:
const result = await pool.query(
'SELECT * FROM users WHERE email = $1 AND status = $2',
[email, 'active']
// Never: 'SELECT * FROM users WHERE email = ' + email
);
-- 2. Principle of least privilege: create role-specific database users
CREATE USER app_readonly WITH PASSWORD 'secret';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_readonly;
CREATE USER app_readwrite WITH PASSWORD 'secret2';
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_readwrite;
-- 3. Row-level security (RLS) for multi-tenant apps (PostgreSQL)
ALTER TABLE posts ENABLE ROW LEVEL SECURITY;
-- Users can only see their own posts
CREATE POLICY posts_user_isolation ON posts
USING (user_id = current_setting('app.user_id')::INT);
-- Set the user context at connection/session time
SET app.user_id = '42';
-- 4. Always store password hashes, never plaintext
-- Use bcrypt or argon2 from application layer, not database
-- 5. Enable TLS for all database connections in production

Monitoring and Maintenance
-- Find slow queries (PostgreSQL — enable pg_stat_statements extension)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
SELECT
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
rows,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- Find missing indexes (tables with many sequential scans)
SELECT
schemaname,
tablename,
seq_scan,
seq_tup_read,
idx_scan,
seq_tup_read / seq_scan AS avg_rows_per_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
-- Find unused indexes (wasteful, slow down writes)
SELECT
indexrelname AS index_name,
relname AS table_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS times_used
FROM pg_stat_user_indexes
JOIN pg_index USING (indexrelid)
WHERE idx_scan = 0 AND NOT indisprimary AND NOT indisunique
ORDER BY pg_relation_size(indexrelid) DESC;
-- Table bloat check (VACUUM keeps tables healthy)
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric / NULLIF(n_live_tup, 0) * 100, 2) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
-- Configure autovacuum for busy tables
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.01, -- vacuum when 1% rows are dead
autovacuum_analyze_scale_factor = 0.005
);

Summary: Database Design Checklist
- Draw an ERD before writing any SQL — validate with stakeholders
- Normalize to 3NF; denormalize only where measured performance requires it
- Every table has a primary key; prefer auto-increment or UUID v7
- Every foreign key column has an index (not created automatically)
- Use parameterized queries everywhere — never string-concatenate SQL
- Apply appropriate NOT NULL, UNIQUE, and CHECK constraints at the database level
- Use TIMESTAMPTZ (not TIMESTAMP) for all datetime columns
- Add created_at and updated_at to every table
- Use EXPLAIN ANALYZE before and after adding indexes
- Configure connection pooling in all production services
- Use transactions for all multi-step operations
- Test schema migrations on a production-sized dataset before deploying
- Monitor slow queries, index usage, and table bloat in production
- Keep pg_stat_statements enabled in PostgreSQL for query analytics
- Enable SSL/TLS for all database connections