System design interviews test your ability to architect large-scale distributed systems. Start by clarifying requirements, then estimate scale (QPS, storage), propose a high-level design with load balancers, caching (Redis), databases (SQL + sharding or NoSQL), and message queues (Kafka). Discuss CAP theorem trade-offs, rate limiting algorithms, and walk through a concrete example like URL shortener or social media feed. Always identify bottlenecks and propose solutions.
System design interviews are among the most challenging technical interviews: they are open-ended, require broad knowledge, and have no single correct answer. Companies like Google, Meta, Amazon, and Uber use system design interviews to evaluate senior engineers on their ability to architect scalable, reliable, and maintainable systems. This comprehensive guide covers every topic you need: from fundamental scaling concepts to designing real-world systems like URL shorteners and social media feeds, with practical patterns and back-of-envelope estimation techniques.
- Use a structured framework: requirements → estimation → high-level design → deep dive → bottlenecks
- Horizontal scaling is preferred; design stateless services from the start
- Cache aggressively at every layer: CDN, application cache (Redis), database query cache
- CAP theorem forces a trade-off: choose CP or AP based on your use case
- Shard databases by user ID or consistent hashing; use read replicas for read-heavy workloads
- Kafka for event streaming and audit logs; RabbitMQ for task queues
- Token bucket is the most practical rate limiting algorithm for most APIs
- Back-of-envelope math: 1M DAU * 10 requests = 10M req/day = ~115 QPS
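The arithmetic behind that last takeaway generalizes into a formula worth having ready; a minimal TypeScript sketch (function names and the 2x peak factor are illustrative, not a standard):

```typescript
const SECONDS_PER_DAY = 86_400;

// Average QPS from daily active users and requests per user per day
function estimateQps(dau: number, requestsPerUser: number): number {
  return (dau * requestsPerUser) / SECONDS_PER_DAY;
}

// Peak traffic is commonly estimated at a multiple of the average
function estimatePeakQps(dau: number, requestsPerUser: number, peakFactor = 2): number {
  return estimateQps(dau, requestsPerUser) * peakFactor;
}

estimateQps(1_000_000, 10); // ~115.7 QPS, matching the takeaway above
```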
How to Approach System Design Interviews
System design interviews succeed or fail based on how well you structure your thinking. Interviewers are not looking for the perfect answer; they want to see how you break down ambiguous problems, ask the right clarifying questions, reason about trade-offs, and communicate your thinking clearly. A structured framework makes the difference.
The Five-Step Framework
STEP 1: Clarify Requirements (5 minutes)
Functional requirements:
- What does the system need to do?
- Who are the users? What actions do they take?
- Any specific features in scope / out of scope?
Non-functional requirements:
- Scale: How many daily active users (DAU)?
- Availability: 99.9% (8.7h/year downtime) vs 99.99% (52min/year)?
- Latency: P99 latency target? Read-heavy or write-heavy?
- Consistency: Strong consistency or eventual consistency acceptable?
- Durability: Can we lose any data?
STEP 2: Estimate Scale (5 minutes)
- DAU → QPS (queries per second)
- Storage requirements per year
- Bandwidth: inbound and outbound
- Number of servers needed
STEP 3: High-Level Design (10-15 minutes)
- Draw the major components: clients, DNS, CDN, load balancers,
web servers, app servers, cache, database, message queues
- Explain the data flow for the main use case (e.g., write a tweet)
STEP 4: Deep Dive (15-20 minutes)
- Go deep on 2-3 critical components the interviewer asks about
- Justify technology choices (why PostgreSQL vs Cassandra?)
- Discuss the database schema
- Explain the API design
STEP 5: Identify Bottlenecks and Trade-offs (5 minutes)
- Single points of failure?
- How does the system handle 10x traffic?
- What fails first? How do you monitor it?
- What would you do differently given more time?

Common Mistakes to Avoid
| Mistake | What to Do Instead |
|---|---|
| Jumping into design immediately | Spend 5 minutes clarifying requirements first |
| No back-of-envelope estimation | Always estimate QPS, storage, and bandwidth before designing |
| Over-engineering from the start | Start simple, then add complexity to solve identified bottlenecks |
| Ignoring the interviewer | Ask clarifying questions, look for hints and feedback |
| Forgetting failure scenarios | Discuss what happens when each component fails |
| Vague technology choices | Justify every choice: "I choose Redis because..." |
Scalability Fundamentals: Horizontal vs Vertical Scaling
Scalability is the system's ability to handle increased load. The two fundamental approaches, vertical scaling and horizontal scaling, have very different trade-offs. Almost every large-scale system relies on horizontal scaling, but understanding when and why matters for system design interviews.
Vertical Scaling (Scale Up)
Vertical scaling means upgrading a single server to a more powerful machine: more CPU cores, RAM, faster storage, or higher network bandwidth. It is the simplest approach: no application changes required. However, vertical scaling has hard physical limits (you cannot add infinite RAM to a single machine), is expensive at high specs, and creates a single point of failure. It also requires downtime for hardware upgrades in many cases.
Vertical Scaling Example:
t3.micro → 2 vCPU, 1 GB RAM ($0.01/hr)
t3.large → 2 vCPU, 8 GB RAM ($0.08/hr)
m5.4xlarge → 16 vCPU, 64 GB RAM ($0.77/hr)
x1e.32xlarge → 128 vCPU, 3,904 GB RAM ($26.69/hr)
Limits:
- Single machine ceiling (AWS largest: 448 vCPU, 24 TB RAM)
- Single point of failure (machine dies = outage)
- Hardware upgrades often require downtime
- Non-linear cost increase at high specs
When to use vertical scaling:
- Databases (PostgreSQL, MySQL) โ easier to scale vertically first
- Monolithic applications with shared state
- Development/staging environments
- When the bottleneck is CPU-bound on a single process

Horizontal Scaling (Scale Out)
Horizontal scaling adds more servers of the same type, distributing load across all instances behind a load balancer. It is theoretically unlimited, naturally fault-tolerant (losing one server is acceptable), and allows for cost-effective scaling using commodity hardware. The requirement is that your services must be stateless: session data, user context, and ephemeral state must be stored externally (Redis, database) rather than in the server process.
Horizontal Scaling Example:
1 server (10K req/sec)
→ 10 servers behind load balancer (100K req/sec)
→ 100 servers behind load balancer (1M req/sec)
→ Add servers automatically with auto-scaling groups
Requirements for horizontal scaling:
✓ Stateless services (no in-process session storage)
✓ Shared external session store (Redis, Memcached)
✓ Load balancer to distribute traffic
✓ Shared file storage (S3, NFS); no local disk state
✓ Distributed cache consistency strategy
✓ Database can handle connections from N servers
Auto-scaling policy (AWS example):
- Scale out: add instance when CPU > 70% for 5 minutes
- Scale in: remove instance when CPU < 30% for 15 minutes
- Minimum: 2 instances (HA), Maximum: 50 instances
- Warm-up time: 3 minutes (let new instance start before routing)

Stateless vs Stateful Services
Stateful (hard to scale horizontally):
// Session stored in server memory: only works with 1 server
// or sticky sessions (a specific user always goes to same server)
app.post('/login', (req, res) => {
req.session.userId = user.id; // in-memory session
req.session.cart = []; // lost if server restarts
});
Stateless (scales horizontally):
// All state in Redis or JWT: any server can handle any request
app.post('/login', async (req, res) => {
const token = jwt.sign({ userId: user.id }, SECRET, { expiresIn: '1h' });
// OR store session in Redis
await redis.setex('session:' + sessionId, 3600, JSON.stringify({ userId: user.id }));
res.json({ token });
});
// Any of 100 servers can authenticate this token
app.get('/profile', authenticate, (req, res) => {
res.json({ userId: req.user.id });
});

Load Balancers: L4 vs L7 and Routing Algorithms
A load balancer distributes incoming traffic across a pool of backend servers, preventing any single server from becoming a bottleneck. Load balancers also provide health checking (automatically removing unhealthy servers), SSL termination, and connection pooling. Understanding the difference between Layer 4 and Layer 7 load balancers is critical for system design interviews.
Layer 4 (Transport) vs Layer 7 (Application) Load Balancers
| Feature | L4 Load Balancer | L7 Load Balancer |
|---|---|---|
| Operates at | TCP/UDP transport layer | HTTP/HTTPS application layer |
| Routing based on | IP address, port | URL path, headers, cookies, content |
| Performance | Very fast, minimal processing | Slower (inspects packet content) |
| SSL termination | Pass-through or terminate | Always terminates SSL |
| Content-based routing | Not possible | Yes (/api/* → API servers, /* → web) |
| Examples | AWS NLB, HAProxy TCP mode | AWS ALB, Nginx, HAProxy HTTP mode |
| Best for | Non-HTTP protocols, ultra-low latency | HTTP microservices, A/B testing, auth |
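To make the content-based routing row concrete, here is a toy sketch of L7 prefix routing in TypeScript (the route table and pool names are illustrative):

```typescript
// L7 routing: inspect the URL path and pick a backend pool.
// An L4 balancer cannot do this; it only sees IP addresses and ports.
const routes: Array<{ prefix: string; pool: string }> = [
  { prefix: '/api/', pool: 'api-servers' },
  { prefix: '/static/', pool: 'cdn-origin' },
];

function routeRequest(path: string): string {
  const match = routes.find((r) => path.startsWith(r.prefix));
  return match ? match.pool : 'web-servers'; // default pool
}
```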
Load Balancing Algorithms
1. Round Robin (default for most LBs)
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)
Best for: servers with identical specs and similar request costs
Problem: ignores current server load; a heavy request on Server A
still sends the next request to Server A in rotation
2. Weighted Round Robin
Server A weight=3, Server B weight=1, Server C weight=2
→ A, A, A, B, C, C, A, A, A, B, C, C ...
Best for: servers with different hardware capacities
3. Least Connections
Always routes to the server with the fewest active connections
Server A: 100 connections
Server B: 50 connections ← new request goes here
Server C: 80 connections
Best for: long-lived connections (WebSocket, streaming)
where some requests take much longer than others
4. Least Response Time
Routes to the server with the lowest combination of
active connections AND average response time
Best for: maximizing end-user performance
5. IP Hash (Sticky Sessions)
hash(client_ip) % num_servers = always same server for that IP
Best for: stateful applications that cannot use external session store
Problem: uneven distribution if IP addresses cluster; breaks if a
server is removed (all its users get reassigned)
6. Random
Routes to a randomly selected server
Good for: simple setups, nearly identical to round robin at scale
7. Consistent Hashing
Used by distributed caches and databases to minimize remapping
when servers are added/removed. Only K/N keys need to be remapped
(K = number of keys, N = number of servers).

Health Checks and Failover
# Nginx upstream with health checks
upstream api_servers {
server api1.internal:8080 weight=3;
server api2.internal:8080 weight=3;
server api3.internal:8080 weight=2;
# Passive health check: mark a server down after 3 failed requests
# within 30 seconds; retry it after 30 seconds (fail_timeout is both
# the failure window and the down duration)
server api4.internal:8080 max_fails=3 fail_timeout=30s backup;  # backup: used only when all other servers are down
}
# Active health check (Nginx Plus / HAProxy)
# Sends GET /health every 5 seconds, expects HTTP 200
# Removes server from pool if 2 consecutive checks fail
# Re-adds server after 3 consecutive successful checks

Caching Strategies: CDN, Redis, and Cache Invalidation
Caching is one of the most powerful tools for scaling read-heavy systems. A well-designed caching layer can reduce database load by 90% or more, dramatically lowering latency and infrastructure costs. However, caching introduces complexity: stale data, cache invalidation, and cache stampedes. Understanding caching at every layer of the stack is essential for system design.
Caching Layers in a Web Application
Client Request Flow (with all cache layers):
Browser Cache (private)
  ↓ miss
CDN Cache (shared, edge location, ~10ms)
  ↓ miss
Load Balancer / Reverse Proxy Cache (Nginx, Varnish, ~1ms)
  ↓ miss
Application Cache (Redis, Memcached, ~0.5ms)
  ↓ miss
Database Query Cache (MySQL query cache, removed in MySQL 8.0; ~5ms)
  ↓ miss
Database Storage (HDD: ~10ms, SSD: ~1ms, RAM: ~0.1ms)
Cache Hit Ratios (typical):
CDN: 90-99% for static assets (images, JS, CSS)
Redis: 70-90% for frequently accessed data
DB cache: 30-60% depending on query patterns
Browser: 80-95% for repeat visitors (with proper Cache-Control)

Cache-Aside (Lazy Loading)
// Most common pattern: the application manages the cache explicitly
async function getUserById(userId: string) {
// 1. Check cache first
const cached = await redis.get('user:' + userId);
if (cached) {
return JSON.parse(cached); // Cache hit → return immediately
}
// 2. Cache miss → fetch from database
const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
// 3. Populate cache for next request (TTL = 1 hour)
await redis.setex('user:' + userId, 3600, JSON.stringify(user));
return user;
}
// On user update → invalidate cache
async function updateUser(userId: string, data: Partial<User>) {
await db.query('UPDATE users SET ... WHERE id = $1', [userId]);
await redis.del('user:' + userId); // Force cache refresh on next read
}
Pros: Only caches requested data; resilient to cache failure (falls through to DB)
Cons: First request always hits DB; data can be stale between invalidations

Write-Through Cache
// Write to cache AND database synchronously on every write
async function updateUser(userId: string, data: Partial<User>) {
// Write to database first
const user = await db.query('UPDATE users SET ... WHERE id = $1 RETURNING *', [userId]);
// Immediately update cache → cache is always consistent
await redis.setex('user:' + userId, 3600, JSON.stringify(user));
return user;
}
Pros: Cache always up-to-date; no stale reads after write
Cons: Write latency increases (must wait for both DB and cache);
cache may contain data that is never read (wasted memory)
Best for: user profiles, product catalog (data read frequently after update)

Write-Back (Write-Behind) Cache
// Write to cache immediately; flush to DB asynchronously
async function updateViewCount(postId: string) {
// Instant write to cache → returns immediately
await redis.incr('post:' + postId + ':views');
// Background worker flushes to DB every 30 seconds
// (or on cache eviction, on shutdown)
}
// Background flush job
setInterval(async () => {
const keys = await redis.keys('post:*:views'); // KEYS blocks Redis; prefer SCAN in production
for (const key of keys) {
const count = await redis.get(key);
const postId = key.split(':')[1];
await db.query('UPDATE posts SET views = $1 WHERE id = $2', [count, postId]);
}
}, 30_000);
Pros: Extremely fast writes; reduces DB write pressure; good for counters
Cons: Data loss risk if cache fails before flush; complex consistency management
Best for: analytics counters, like counts, view counts, leaderboards

Cache Invalidation Strategies
1. TTL (Time-to-Live): the simplest approach
redis.setex('key', 300, value) // Expires in 5 minutes
Problem: stale data for up to TTL duration
2. Event-Driven Invalidation
On database write → publish invalidation event
Cache service subscribes → deletes affected keys
// Using Redis Pub/Sub or Kafka
publisher.publish('cache-invalidate', { key: 'user:' + userId });
3. Cache Versioning (best for complex invalidation)
// Store version number in cache key
const version = await redis.incr('user:' + userId + ':version');
await redis.set('user:' + userId + ':v' + version, userData);
// Old versions automatically become unreachable
4. Cache Stampede Prevention
// Problem: Many requests miss cache simultaneously, all hit DB
// Solution: Use mutex lock or probabilistic early expiration
async function getCachedUser(userId: string) {
const cached = await redis.get('user:' + userId);
if (cached) return JSON.parse(cached);
// Only one request fetches from DB; others wait
const lock = await redis.set('lock:user:' + userId, '1',
'EX', 5, 'NX'); // SET if Not eXists, expire 5s
if (lock) {
const user = await db.fetchUser(userId);
await redis.setex('user:' + userId, 3600, JSON.stringify(user));
await redis.del('lock:user:' + userId);
return user;
} else {
// Wait for lock holder to populate cache
// (sleep = promisified setTimeout)
await sleep(100);
return getCachedUser(userId);
}
}

Database Scaling: Read Replicas, Sharding, and Partitioning
The database is typically the first bottleneck in a scaling system. A single database server can handle millions of queries per day, but beyond that, you need read replicas, sharding, or a purpose-built NoSQL database. Understanding when and how to scale your database is one of the most important system design skills.
Read Replicas
Read Replica Setup:
Primary (writes) → Replica 1 (reads)
                 → Replica 2 (reads)
                 → Replica 3 (reads)
Replication lag: typically 10ms-100ms (async replication)
Use case: 80% reads, 20% writes (typical web app)
Application code:
const primaryDB = new Pool({ host: 'primary.db.internal' });
const replicaDB = new Pool({ host: 'replica.db.internal' });
// Writes always go to primary
async function createUser(user: User) {
return primaryDB.query('INSERT INTO users ...', [user]);
}
// Reads use replica (with eventual consistency accepted)
async function getUser(id: string) {
return replicaDB.query('SELECT * FROM users WHERE id = $1', [id]);
}
// Critical reads (e.g., after just writing) use primary
async function getUserAfterUpdate(id: string) {
return primaryDB.query('SELECT * FROM users WHERE id = $1', [id]);
}
Scaling read replicas:
- Add replicas to handle more read traffic
- Use a read load balancer (ProxySQL for MySQL, Pgpool-II for PostgreSQL) to distribute reads
- Hot data: cache in Redis to avoid even hitting replicas
- RDS Aurora: up to 15 replicas, auto-failover typically under 30 seconds

Sharding (Horizontal Partitioning)
Sharding splits data across multiple database instances (shards), each responsible for a subset of rows. Unlike read replicas (which copy all data), sharding partitions data so no single shard holds everything. This allows each shard to be a smaller, more manageable database. Sharding is necessary when a single database (even with replicas) cannot handle write throughput or when the dataset is too large for one server.
Sharding Strategies:
1. Range-Based Sharding
user_id 1-1M   → Shard 1
user_id 1M-2M  → Shard 2
user_id 2M-3M  → Shard 3
Pros: Simple to reason about; easy range queries
Cons: Hotspot problem; new users concentrate on the latest shard
2. Hash-Based Sharding
shard = hash(user_id) % num_shards
user_id=1001: hash(1001) % 4 = Shard 2
user_id=1002: hash(1002) % 4 = Shard 0
user_id=1003: hash(1003) % 4 = Shard 3
Pros: Even distribution; no hotspots
Cons: Range queries require hitting all shards;
adding shards requires rehashing (most data moves)
3. Consistent Hashing (best for dynamic shard count)
Place shards on a virtual ring (0 to 2^32)
Each shard owns a range of the ring
Adding a shard: only the adjacent shard's data moves
// Virtual nodes (vnodes) improve balance
// Each physical shard has 150 virtual positions on the ring
4. Directory-Based Sharding
Lookup table: user_id → shard_id
Most flexible but adds lookup overhead;
directory itself can become a bottleneck
Routing:
Application → Shard Router → Correct Shard
// Router must know the sharding key to route correctly
function getShardForUser(userId: string): Database {
const shardIndex = hashCode(userId) % NUM_SHARDS;
return shards[shardIndex];
}

SQL vs NoSQL for Scaling
| Aspect | SQL (PostgreSQL, MySQL) | NoSQL (DynamoDB, Cassandra, MongoDB) |
|---|---|---|
| Schema | Rigid, defined upfront | Flexible, schema-less or schema-on-read |
| ACID transactions | Full ACID support | Limited or eventual consistency |
| Scaling | Vertical primary; sharding complex | Horizontal scaling built-in |
| Query flexibility | Arbitrary SQL queries, JOINs | Limited queries; must know access patterns |
| Write throughput | ~10K-50K writes/sec per node | ~100K-1M writes/sec (Cassandra) |
| Best for | Financial data, user accounts, complex queries | Time-series, social graphs, logs, catalogs |
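The consistent hashing strategy from the sharding list above can be sketched as a sorted ring; this toy TypeScript version uses a simple string hash and omits virtual nodes, both of which a production system would replace:

```typescript
// Toy consistent hash ring: each shard sits at hash(name) on the ring;
// a key belongs to the first shard clockwise from hash(key).
function hash32(s: string): number {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0; // keep unsigned 32-bit
  }
  return h;
}

class HashRing {
  private ring: Array<{ pos: number; shard: string }> = [];

  addShard(shard: string) {
    this.ring.push({ pos: hash32(shard), shard });
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  getShard(key: string): string {
    const pos = hash32(key);
    // First node at or after the key's position; wrap around to ring[0]
    const node = this.ring.find((n) => n.pos >= pos) ?? this.ring[0];
    return node.shard;
  }
}
```

Adding a shard reassigns only the keys on the arc between it and its predecessor, roughly K/N keys rather than a full rehash.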
CAP Theorem and Consistency Models
CAP theorem (Brewer's theorem) is a fundamental principle of distributed systems. It states that a distributed data store can only guarantee two of three properties simultaneously: Consistency, Availability, and Partition Tolerance. Since network partitions are inevitable in any distributed system, you must choose between consistency and availability during a partition event.
The Three Properties Explained
C โ Consistency:
Every read receives the most recent write or an error.
All nodes see the same data at the same time.
Example: After writing balance=$100, all reads return $100.
A โ Availability:
Every request receives a response (not an error),
but the response might not be the most recent data.
System stays responsive even if some nodes are down.
P โ Partition Tolerance:
System continues to operate despite network partitions
(messages being dropped or delayed between nodes).
You CANNOT avoid partitions in a distributed system.
Therefore P is always required; during a partition you choose C or A.
CP Systems (sacrifice availability):
- Choose consistency over availability during partitions
- May return errors or refuse writes to avoid stale reads
- Examples: HBase, Zookeeper, Redis (with certain configs)
- Use case: banking, financial ledgers, inventory management
AP Systems (sacrifice consistency):
- Choose availability over consistency during partitions
- Returns potentially stale data rather than errors
- Examples: DynamoDB (default), Cassandra, CouchDB
- Use case: social media, product catalogs, analytics
CA Systems: impossible in a real distributed system
(can only exist if you never have network partitions, i.e., a single node)

Consistency Models Beyond CAP
1. Strong Consistency (Linearizability)
All reads see the most recent committed write.
After write completes, all nodes return new value immediately.
Cost: Higher latency (writes must wait for a quorum of replicas to acknowledge).
Example: Google Spanner (uses TrueTime for global consistency)
2. Eventual Consistency
All nodes will eventually converge to the same value.
Reads may return stale data during convergence window.
Cost: Potential for stale reads.
Example: Amazon DynamoDB (default), Cassandra
3. Read-Your-Writes Consistency
After you write data, you will always read your own writes.
Other users may still see stale data.
Implementation: Route your reads to the primary for 1 second
after a write, then switch to replica.
Example: Facebook "reading your own posts" guarantee
4. Monotonic Read Consistency
Once you've read a value, you will never read an older value.
Example: Sticky sessions to same replica
5. Causal Consistency
Causally related operations are seen in order.
Example: Reply to a post always seen after the original post.
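Read-your-writes (model 3 above) is often implemented exactly as described: pin a user's reads to the primary briefly after a write. A minimal TypeScript sketch (the 1-second window and names are illustrative):

```typescript
const PIN_WINDOW_MS = 1_000;
const lastWriteAt = new Map<string, number>(); // userId -> last write timestamp

function recordWrite(userId: string, now = Date.now()) {
  lastWriteAt.set(userId, now);
}

// Route to the primary if the user wrote recently; otherwise a replica is fine
function pickReadTarget(userId: string, now = Date.now()): 'primary' | 'replica' {
  const last = lastWriteAt.get(userId);
  return last !== undefined && now - last < PIN_WINDOW_MS ? 'primary' : 'replica';
}
```

Other users read from replicas throughout, which is exactly why they may briefly see stale data while the writer always sees their own write.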
PACELC Model (extends CAP):
When no Partition: trade-off between Latency and Consistency
DynamoDB: PA/EL → prefers availability and low latency
Spanner: PC/EC → prefers consistency even at the cost of latency

Message Queues and Event Streaming: Kafka, RabbitMQ, SQS
Message queues decouple services, enabling asynchronous communication, load leveling, and fault tolerance. Instead of Service A calling Service B directly (tight coupling), A publishes a message to a queue that B consumes at its own pace. This pattern is essential for building resilient, scalable systems. Understanding Kafka vs RabbitMQ vs SQS trade-offs is a common system design interview topic.
When to Use Message Queues
Synchronous (direct) call โ use when:
- You need the response immediately
- The operation is fast (< 100ms)
- You need transactional guarantees
- Example: Check user authentication, query product price
Asynchronous (message queue) โ use when:
- The operation is slow (email, video encoding, reports)
- You want to decouple producer from consumer
- Consumer is unreliable (can fail and retry)
- You need to fan out to multiple consumers
- Load leveling: absorb traffic spikes without dropping requests
Real-world examples:
User registers → publish to queue → email welcome (async)
Order placed → publish to queue → inventory, shipping, billing (fan-out)
Video uploaded → publish to queue → transcoding workers (1-N relationship)
Payment processed → publish to audit log → compliance system (event streaming)

Kafka vs RabbitMQ vs AWS SQS
| Feature | Apache Kafka | RabbitMQ | AWS SQS |
|---|---|---|---|
| Model | Distributed log (pull) | Message broker (push) | Managed queue (pull) |
| Message retention | Days to forever (configurable) | Until acknowledged | Up to 14 days |
| Throughput | Millions/sec per cluster | Thousands/sec | Thousands/sec (auto-scales) |
| Ordering | Per partition (strict) | Per queue | FIFO (optional), best-effort standard |
| Consumer groups | Yes (independent cursors) | Competing consumers | Multiple consumers (competing) |
| Replay | Yes (seek to any offset) | No (once consumed, gone) | No |
| Best for | Event sourcing, analytics, audit | Task queues, RPC, routing | AWS-native, simple task queues |
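The per-partition ordering row is the crux of Kafka's model: ordering holds only within a partition, so producers route related messages by key. A toy TypeScript sketch of key-based partitioning (Kafka's default partitioner actually uses murmur2; the hash here is illustrative):

```typescript
function hashKey(key: string): number {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep unsigned 32-bit
  }
  return h;
}

// Same key -> same partition -> strict ordering for that key
function partitionFor(key: string, numPartitions: number): number {
  return hashKey(key) % numPartitions;
}
```

All events keyed by userId=123 land on one partition, so a consumer sees them in publish order; events for different users may interleave freely.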
Kafka Architecture Essentials
Kafka Concepts:
Topic: Named stream of messages (e.g., "user-events")
Partition: Ordered, append-only log (a topic has N partitions)
Offset: Position of a message within a partition
Producer: Writes messages to topics
Consumer: Reads messages from topics
Consumer Group: Multiple consumers sharing partitions for parallel processing
Broker: Kafka server (typically 3+ for HA)
Replication Factor: Each partition replicated to R brokers (typically 3)
Ordering guarantee:
Messages in the same partition are strictly ordered.
To ensure related messages are processed in order:
→ Publish with the same partition key (e.g., userId)
→ All messages for userId=123 always go to the same partition
Consumer Group parallelism:
Topic with 4 partitions + Consumer Group with 4 consumers
→ each consumer reads exactly 1 partition in parallel
Topic with 4 partitions + Consumer Group with 2 consumers
→ each consumer reads 2 partitions
Rule: consumers beyond the partition count sit idle (at most one active consumer per partition)

Rate Limiting Algorithms
Rate limiting protects APIs from abuse, ensures fair usage, and maintains service stability. Four main algorithms are used in production systems, each with different properties around burst handling, memory usage, and accuracy.
Token Bucket Algorithm
Token Bucket:
- Bucket holds up to N tokens (burst capacity)
- Tokens added at fixed rate R (refill rate)
- Each request consumes 1 token
- Request rejected if bucket is empty
Example: 100 tokens/min refill, burst up to 100
At 9:00:00 → bucket full (100 tokens)
9:00:01 → 10 requests → 90 tokens remain
9:00:30 → ~50 tokens added → ~140, capped at 100
9:00:31 → 100 burst requests → bucket empty
9:01:31 → 100 tokens refilled (60s at 100/min) → full burst allowed again
Pros: Allows controlled bursting; widely used (AWS, Stripe, GitHub)
Cons: Requires atomic check-and-decrement in a distributed setup (e.g., a Redis Lua script)
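Before reaching for Redis, the refill logic is easy to see in a single process; a minimal TypeScript sketch (class shape and the injectable clock are illustrative):

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,     // burst size N
    private refillPerSec: number, // refill rate R
    now = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  allow(now = Date.now()): boolean {
    // Refill based on elapsed time, capped at capacity
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

Passing `now` explicitly keeps the bucket testable, and mirrors how a Redis script would read the server clock.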
Redis implementation (Lua for atomicity; simplified: the key expires
after the window, refilling the bucket all at once):
local tokens = redis.call('get', KEYS[1])
if tokens == false then tokens = tonumber(ARGV[1]) end -- new window: full bucket
tokens = tonumber(tokens)
if tokens >= 1 then
  redis.call('set', KEYS[1], tokens - 1)
  redis.call('expire', KEYS[1], tonumber(ARGV[2]))
  return 1 -- allowed
end
return 0 -- rate limited

Leaky Bucket Algorithm
Leaky Bucket:
- Requests enter a queue (the "bucket")
- Requests processed at a fixed rate (the "leak")
- If queue is full, reject new requests
Acts like a queue with a constant drain rate.
Guarantees a smooth, constant output rate.
No bursting: requests always processed at rate R.
Analogy: A bucket with a hole. Water (requests) enters the top;
drains out the bottom at a constant rate.
If more water enters than drains, bucket overflows.
Pros: Smooth output rate; good for protecting backend services
Cons: Requests can be delayed (queued); no burst handling
Best for: API gateway protecting a downstream service with
a fixed processing capacity

Sliding Window Counter
Sliding Window:
Tracks request timestamps in a moving time window.
No boundary burst problem (unlike fixed window).
Implementation with Redis sorted sets:
async function isAllowed(userId: string, limit: number, windowMs: number) {
const now = Date.now();
const windowStart = now - windowMs;
const key = 'ratelimit:' + userId;
// Remove expired timestamps
await redis.zremrangebyscore(key, 0, windowStart);
// Count requests in window
const count = await redis.zcard(key);
if (count < limit) {
// Add current request; use a unique member so same-millisecond
// requests don't overwrite each other in the sorted set
await redis.zadd(key, now, now + '-' + Math.random());
await redis.expire(key, Math.ceil(windowMs / 1000));
return true; // allowed
}
return false; // rate limited
}
Memory usage: O(requests_per_window); higher than fixed window
Best for: APIs needing accurate rate limiting without boundary bursts

Rate Limiting at Scale
Distributed Rate Limiting:
Problem: 10 API servers, each with a local rate limiter
→ a user can make 10x the limit by hitting different servers
Solution 1: Centralized Redis
All servers check the same Redis key → perfectly accurate
Risk: Redis becomes a bottleneck / single point of failure
Solution 2: Redis Cluster with local approximation
Each server has local counter (fast)
Periodically syncs with Redis (every 100ms)
Slightly less accurate but much more scalable
Solution 3: Approximate counting
Each server allows limit/N requests locally (N = server count)
Simple, no shared state; throttles early when traffic is uneven across servers
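Solution 3 is simple enough to sketch end-to-end; a toy TypeScript version (names are illustrative, and resetting at the window boundary is left to a timer the caller owns):

```typescript
// Each server enforces only its share of the global limit.
// No coordination needed; conservative when load is unevenly balanced.
function localQuota(globalLimit: number, serverCount: number): number {
  return Math.floor(globalLimit / serverCount);
}

class LocalLimiter {
  private count = 0;
  constructor(private quota: number) {}

  allow(): boolean {
    if (this.count < this.quota) {
      this.count++;
      return true;
    }
    return false;
  }

  resetWindow() {
    this.count = 0; // call at each window boundary
  }
}
```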
Rate limiting headers (RFC 6585 defines the 429 status; X-RateLimit-* is a de facto convention popularized by APIs like Twitter's and GitHub's):
X-RateLimit-Limit: 100 # max requests per window
X-RateLimit-Remaining: 74 # remaining requests in current window
X-RateLimit-Reset: 1709482800 # Unix timestamp when window resets
Retry-After: 30 # seconds until retry (only on 429)
HTTP 429 Too Many Requests
{ "error": "Rate limit exceeded", "retryAfter": 30 }

System Design Example: URL Shortener (bit.ly)
The URL shortener is a classic system design interview question. It appears simple but requires reasoning about ID generation, database choice, caching, analytics, and scale. Here is a complete walkthrough.
Step 1: Requirements Clarification
Functional Requirements:
- User pastes a long URL, receives a short URL (e.g., bit.ly/abc123)
- Clicking the short URL redirects to the original URL
- Optional: custom aliases (bit.ly/my-custom-link)
- Optional: link expiration
- Optional: analytics (click count, geographic distribution)
Non-Functional Requirements:
- 100M new URLs created per day
- 10:1 read/write ratio → 1B redirects per day
- URL records stored for 5 years
- Availability: 99.99% (links must always resolve)
- Redirect latency: < 50ms P99
- Short URLs are NOT predictable (no sequential IDs visible)

Step 2: Back-of-Envelope Estimation
Writes:
100M URLs/day = 100M / 86,400s ≈ 1,160 writes/sec
Reads (10:1 ratio):
1B redirects/day = 1B / 86,400s ≈ 11,600 reads/sec
Storage:
Each URL record: ~500 bytes (short code + long URL + metadata)
100M URLs/day × 365 days × 5 years = 182.5B records
182.5B × 500 bytes = ~91.25 TB over 5 years
With compression: ~30-40 TB
Short code length:
Base62 (a-z, A-Z, 0-9): 62 characters
6 characters: 62^6 = 56.8 billion unique codes
7 characters: 62^7 = 3.5 trillion unique codes
→ 7 characters is safe for 5 years at 100M URLs/day
Cache:
20% of URLs generate 80% of traffic (Zipf's law / Pareto)
Cache top 20% = 100M × 20% = 20M URLs
20M × 500B = 10 GB → easily fits in Redis

Step 3: High-Level Design
URL Shortening Service Architecture:
Client
  ↓
CDN (cache popular redirects at edge)
  ↓
Load Balancer
  ↓
Web Servers (stateless, horizontally scaled)
  ↓
Redis Cache: short_code → long_url (hot short codes, ~20M entries, ~10 GB)
  ↓ (cache miss)
Primary DB (writes) + 3 read replicas (MySQL or Cassandra)
  ↓
Analytics Service (async via Kafka)
  ↓
ClickHouse / BigQuery (analytics DB)

Step 4: Short Code Generation
Option 1: MD5/SHA256 Hash + Take First 7 chars
md5("https://verylong.url/path?query=1") = "d41d8cd98f00b204..."
shortCode = "d41d8cd" (first 7 chars)
Problem: 7 hex chars give only 16^7 ≈ 268M combinations, so collisions
are common at 100M URLs (birthday paradox); must check DB and rehash
Option 2: Auto-Increment ID + Base62 Encoding (recommended)
id = 1000000001 (auto-increment from DB)
shortCode = toBase62(1000000001) = "15FTGh"
function toBase62(num: number): string {
const chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
let result = '';
if (num === 0) return '0'; // edge case: ID 0
while (num > 0) {
result = chars[num % 62] + result;
num = Math.floor(num / 62);
}
return result;
}
Problem: Sequential IDs expose creation order, volume is guessable.
Fix: Shuffle the base62 alphabet deterministically using a secret seed.
Option 3: Key Generation Service (KGS)
Pre-generate millions of unique random 7-char codes
Store in a "keys_not_used" table in Redis
Application requests a key from KGS → immediate, no collision check
After use, move to "keys_used" table
Pros: Fast, no collision, no counter
Cons: KGS is another service to maintain; single point of failure (mitigate with HA pair)

Step 5: Redirect and Analytics
Redirect implementation:
GET /abc1234
1. Check Redis cache for short code
2. If hit: return 301 (permanent) or 302 (temporary redirect)
3. If miss: query DB, populate Redis, redirect
301 vs 302:
301 (Permanent): browser caches the redirect; future requests go directly
to the destination without hitting your server.
Better performance; bad for analytics (lose click data).
302 (Temporary): browser always requests your server first.
Enables analytics; slight extra latency.
bit.ly uses 301 to minimize server load;
analytics are captured by embedding a tracking pixel on destination.
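The three-step redirect flow above, sketched with the cache and database injected as plain interfaces so the cache-aside logic is visible (`resolveRedirect` and `KeyValueStore` are illustrative names, not from a specific framework):

```typescript
// Minimal cache-aside redirect resolver. In the real system the
// KeyValueStore is Redis and findLongUrl hits a read replica.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function resolveRedirect(
  code: string,
  cache: KeyValueStore,
  db: { findLongUrl(code: string): Promise<string | null> },
): Promise<{ status: number; location?: string }> {
  // 1. Check the cache first
  let longUrl = await cache.get('url:' + code);
  if (longUrl === null) {
    // 2. Cache miss: query the DB, then populate the cache
    longUrl = await db.findLongUrl(code);
    if (longUrl === null) return { status: 404 };
    await cache.set('url:' + code, longUrl, 86400); // 1-day TTL
  }
  // 3. 302 keeps every click on our servers, so analytics still work
  return { status: 302, location: longUrl };
}
```

Returning 302 rather than 301 here reflects the analytics trade-off discussed above; swap to 301 if minimizing server load matters more.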
Analytics (async, does not block redirect):
On each click โ publish to Kafka topic "link-clicks"
{
shortCode: "abc1234",
timestamp: 1709482800,
userAgent: "Mozilla/5.0...",
ip: "203.0.113.42",
country: "US"
}
→ Consumer aggregates and writes to ClickHouse (analytics DB)
→ Dashboard queries: SELECT count(*) FROM clicks WHERE shortCode='abc1234'

System Design Example: Social Media Feed (Twitter/Instagram)
Designing a social media feed is a common and challenging system design question. It requires reasoning about the fan-out problem, timeline generation strategies, ranking and personalization, and handling celebrity users with millions of followers.
The Fan-Out Problem
Problem:
User A (1M followers) posts a tweet.
How do you update 1M followers' feeds efficiently?
Naive approach: On each user's feed request, query all their followees' posts.
SELECT p.* FROM posts p
JOIN follows f ON f.followee_id = p.user_id
WHERE f.follower_id = :userId
ORDER BY p.created_at DESC LIMIT 20;
At scale: If User B follows 500 accounts, this query joins millions of rows.
With 10K concurrent users: system becomes unusable.
Two Main Strategies:
Fan-out on Write (Push Model):
When a user posts → immediately push to all followers' feed caches.
User A posts → write tweet to 1M pre-computed feed caches (fan-out)
Feed read = O(1) (just read from cache)
Write amplification: 1 post × 1M followers = 1M cache writes
Fan-out on Read (Pull Model):
No precomputation on write.
When user requests feed → query followees' recent posts on the fly.
Feed read = O(followees) → can be slow for users following 1000 accounts.
Hybrid (used by Twitter/X, Instagram):
Regular users (< 10K followers): Fan-out on write
Celebrity users (> 1M followers): Fan-out on read (skip precomputation)
On read: merge precomputed feeds of regular followees + on-the-fly
query for celebrity followees → merge and rank

Feed Storage Architecture
Data Models:
posts table:
id, user_id, content, media_urls[], created_at, like_count, reply_count
follows table:
follower_id, followee_id, created_at
Index: (follower_id, followee_id)
feed_cache (Redis):
Key: "feed:user:{userId}"
Value: Sorted Set, score=timestamp, member=postId
Size: Keep last 800 posts per user (older posts: query DB on scroll)
Timeline generation (fan-out on write):
// On new post published
async function fanOutPost(postId: string, authorId: string) {
const followers = await getFollowers(authorId); // paginated
const chunks = chunk(followers, 100);
// Parallel fan-out via Kafka; include the timestamp the consumer
// uses as the sorted-set score
for (const batch of chunks) {
  await kafka.publish('fanout-queue', { postId, timestamp: Date.now(), followers: batch });
}
}
// Fanout consumer (multiple workers)
kafka.subscribe('fanout-queue', async ({ postId, followers, timestamp }) => {
const pipeline = redis.pipeline();
for (const followerId of followers) {
pipeline.zadd('feed:user:' + followerId, timestamp, postId);
pipeline.zremrangebyrank('feed:user:' + followerId, 0, -801); // keep 800
}
await pipeline.exec();
});
Feed read:
async function getFeed(userId: string, cursor?: number) {
const postIds = await redis.zrevrangebyscore(
'feed:user:' + userId,
cursor || '+inf',
'-inf',
'LIMIT', 0, 20 // 20 posts per page
);
const posts = await batchGetPosts(postIds); // Redis pipeline or DB IN query
return posts;
}

CDN and Content Delivery
A Content Delivery Network (CDN) is a globally distributed network of edge servers that cache content close to end users. CDNs dramatically reduce latency for static assets (images, JS, CSS, videos) and can also accelerate dynamic content via optimized routing.
How CDNs Work
Without CDN:
User in Tokyo → Origin server in Virginia → 200ms round trip
With CDN:
User in Tokyo → CDN edge in Tokyo (cached) → 5ms round trip
First request (cache miss):
Tokyo edge → Origin in Virginia (200ms, but only once)
CDN caches response with TTL
Subsequent requests (cache hit):
Tokyo edge returns cached content → 5ms
CDN Caching by Content Type:
Static assets (images, fonts, CSS, JS bundles):
Cache-Control: public, max-age=31536000, immutable
→ Cache for 1 year; use content hashing for cache busting
→ main.abc123.js (hash changes when content changes)
HTML pages:
Cache-Control: no-cache (or short TTL like 5 minutes)
→ Always validate with origin; stale HTML is risky
API responses:
Cache-Control: no-store (for personalized or sensitive data)
→ Or public, max-age=60 for public, non-personalized endpoints
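The per-content-type policy above can be centralized in one helper, sketched here (the extension list and TTL defaults are illustrative assumptions, not a standard):

```typescript
// Map a request path to the Cache-Control policy described above.
function cacheControlFor(path: string): string {
  if (/\.(js|css|png|jpg|webp|woff2)$/.test(path)) {
    // Content-hashed static assets: safe to cache for a year
    return 'public, max-age=31536000, immutable';
  }
  if (path.endsWith('.html') || path === '/') {
    // HTML: always revalidate with the origin
    return 'no-cache';
  }
  // Default for API responses: assume personalized, never cache
  return 'no-store';
}
```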
CDN Providers:
Cloudflare: Global anycast network, DDoS protection, free tier
AWS CloudFront: Tight AWS integration (S3, Lambda@Edge)
Fastly: Real-time purging API, streaming support
Akamai: Enterprise, most PoPs worldwide

Microservices vs Monolith Trade-offs
The choice between microservices and a monolith is one of the most consequential architectural decisions. Most companies should start with a monolith and extract services only when specific pain points justify the overhead. Understanding the trade-offs helps you give nuanced answers in system design interviews.
Decision Framework
| Factor | Monolith | Microservices |
|---|---|---|
| Team size | Small (< 10 engineers) | Large (multiple teams, 50+ engineers) |
| Development speed | Faster initial development | Slower (service contracts, deployment pipelines) |
| Deployment complexity | Single deployment unit | Independent deployments per service |
| Scaling granularity | Scale everything or nothing | Scale individual services independently |
| Fault isolation | One bug can crash everything | Failure in one service is isolated |
| Technology diversity | One tech stack | Best tool for each service |
| Network latency | In-process function calls (~microseconds) | Network calls (~1-10ms per hop) |
| Distributed transactions | Simple ACID transactions | Complex (Saga pattern, eventual consistency) |
| Observability | Simple log files | Distributed tracing required (Jaeger, Zipkin) |
The Strangler Fig Pattern
How to migrate from monolith to microservices incrementally:
Phase 1: Identify boundaries
- Find bounded contexts (DDD) or high-churn modules
- Start with services that have clear, stable interfaces
- Extract: Authentication, Notification, File Upload are common first extractions
Phase 2: Strangler Fig migration
Monolith <-- API Gateway --> New Auth Service
All other endpoints still handled by monolith
1. Introduce API Gateway in front of monolith
2. Extract one service (e.g., auth)
3. Route /auth/* to new service; everything else to monolith
4. Repeat for next service
5. Eventually the monolith is "strangled": just a legacy facade remains
Phase 3: Database decomposition
Hardest part: shared DB becomes per-service DB
Strategy:
1. Create new schema/table for the new service
2. Dual-write to both old and new tables
3. Backfill historical data
4. Cut over reads to new table
5. Remove old table from monolith
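Step 2 (dual-write) might look like this sketch, assuming simple repository interfaces (`UserRepo` and `saveUserDualWrite` are illustrative names):

```typescript
interface UserRecord { id: string; email: string; }
interface UserRepo { save(user: UserRecord): Promise<void>; }

async function saveUserDualWrite(
  user: UserRecord,
  monolithDb: UserRepo,    // old shared table: still the source of truth
  authServiceDb: UserRepo, // new per-service table
): Promise<void> {
  // The old store must succeed; it remains authoritative until cutover
  await monolithDb.save(user);
  try {
    await authServiceDb.save(user);
  } catch (err) {
    // Best-effort during migration: log and rely on the backfill
    // (step 3) to reconcile any missed writes
    console.error('dual-write to auth service failed', err);
  }
}
```

The old table stays the source of truth until the read cutover in step 4, which is why a failed write to the new store is logged rather than propagated.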
Tools: Debezium (CDC), Flyway (migrations), Liquibase

Back-of-Envelope Calculations
Back-of-envelope estimation is a required skill in system design interviews. Interviewers use these calculations to verify you understand the scale of the problem and to drive architecture decisions. You do not need exact numbers; order-of-magnitude estimates are sufficient.
Essential Numbers to Memorize
Time:
1 second = 1,000 milliseconds = 1,000,000 microseconds
1 day = 86,400 seconds ≈ 100,000 seconds (useful approximation)
1 year = 365 days ≈ 3.15 × 10^7 seconds ≈ 30M seconds
Latency Numbers (approximate):
L1 cache: ~0.5 ns
L2 cache: ~7 ns
RAM access: ~100 ns
SSD read: ~100 μs (100,000 ns)
HDD read: ~10 ms (10,000,000 ns)
Network within DC: ~0.5 ms
Network cross-DC: ~50–150 ms
Packet US–Europe: ~75 ms
Data sizes:
1 char = 1 byte
1 UUID = 16 bytes
1 tweet (280 chars) = ~280 bytes
1 medium image = ~300 KB
1 HD video minute = ~100 MB
1 TB = 10^12 bytes = 1,000 GB = 1,000,000 MB
Powers of 10:
10^3 = 1 thousand (1K)
10^6 = 1 million (1M)
10^9 = 1 billion (1B)
10^12 = 1 trillion (1T)
Throughput:
1 Gbps network = 125 MB/s
SSD sequential read: ~500 MB/s
SSD random IOPS: ~100K IOPS
HDD sequential read: ~100 MB/s
HDD random IOPS: ~200 IOPS

QPS Estimation Template
Template: DAU → requests/day → QPS
Example: Twitter-like system with 500M DAU
Daily Active Users: 500M
Average tweets read per user per day: 20
Read QPS = 500M × 20 / 86,400 ≈ 115,740 → ~116K reads/sec
Average tweets written per user per day: 1 (most users read more than write)
Write QPS = 500M × 1 / 86,400 ≈ 5,787 → ~6K writes/sec
Peak QPS (typically 2-3x average):
Peak read QPS: ~250K reads/sec
Peak write QPS: ~15K writes/sec
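The template above reduces to one line of arithmetic; a small helper makes the numbers reproducible (`qps` is an illustrative name; peak is just a 2-3x multiplier on the result):

```typescript
// Average QPS from DAU and per-user daily actions (86,400 seconds/day).
function qps(dau: number, actionsPerUserPerDay: number): number {
  return (dau * actionsPerUserPerDay) / 86_400;
}

const readQps = qps(500e6, 20); // ~115,740 reads/sec
const writeQps = qps(500e6, 1); // ~5,787 writes/sec
// Peak: multiply by 2-3x, e.g. readQps * 2.5
```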
Storage per day:
6K writes/sec × 86,400s = ~518M tweets/day
Per tweet: 280 chars + metadata = ~500 bytes
Daily storage: 518M × 500 bytes = ~259 GB/day
Per year: 259 GB × 365 = ~94.5 TB/year
Media (20% of tweets have an image, avg 300KB):
518M × 20% × 300KB = ~31 TB/day for images
Yearly: ~11.3 PB/year → use CDN + object storage (S3)

Common Estimation Scenarios
WhatsApp: 2B users, 100B messages/day
Write QPS = 100B / 86,400 ≈ 1.16M messages/sec
Average message size = 200 bytes
Daily storage = 100B × 200B = 20 TB/day
YouTube: 1B hours watched per day
1B hours × 3600s × 5 Mbps (HD bitrate) = 1.8 × 10^19 bits ≈ 2,250 PB/day served
→ Most served by CDN (not origin servers)
Upload: 500 hours of video uploaded per minute
500 hours × 60 min/hr × 100 MB/min = 3 TB/min raw video
Uber: 25M rides/day
Ride writes: 25M / 86,400 ≈ 290 rides/sec... but
Driver location updates: every 5 seconds × 5M active drivers
= 1M location updates/sec
→ Use time-series DB (InfluxDB) or Redis geospatial indexes
Instagram: 100M photos uploaded/day
Write QPS: 100M / 86,400 ≈ 1,157 photo uploads/sec
Storage per photo (with 3 resolutions): 3 × 300KB = 900KB
Daily storage: 100M × 900KB = 90 TB/day
→ S3 + CDN; generate thumbnails via Lambda

System Design Best Practices and Common Patterns
Beyond specific technologies, experienced system designers internalize a set of patterns that apply across many different problems. Recognizing which patterns apply to a given scenario is what separates senior engineers from junior ones in system design interviews.
Reliability Patterns
1. Circuit Breaker
Prevent cascade failures: if Service B is slow/failing,
stop calling it for 30 seconds instead of waiting for timeouts.
States: Closed → Open (after N failures) → Half-Open (probe) → Closed
Libraries: Resilience4j (Java), opossum (Node.js)
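A minimal circuit breaker can be sketched as a small state machine (timers, metrics, and probe limits omitted; the default threshold and cooldown are illustrative):

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('circuit open: failing fast');
      }
      this.state = 'half-open'; // cooldown elapsed: allow one probe
    }
    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed'; // probe (or normal call) succeeded
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.threshold) {
        this.state = 'open'; // trip: fail fast for the cooldown window
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Libraries like the ones named above add what this sketch omits: sliding failure windows, metrics, and fallbacks.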
2. Retry with Exponential Backoff + Jitter
// Avoid thundering herd: all clients retrying at the same time
const delay = Math.min(initialDelay * 2 ** attempt, maxDelay);
const jitter = Math.random() * delay * 0.1; // 10% jitter
await sleep(delay + jitter);
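Wrapped into a full helper, the loop looks like this sketch (the defaults are illustrative; "full jitter", randomizing over the whole delay, is a common alternative to the 10% shown here):

```typescript
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  initialDelayMs = 100,
  maxDelayMs = 10_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err; // out of retries
      // Exponential backoff, capped, plus 10% jitter to spread clients out
      const delay = Math.min(initialDelayMs * 2 ** attempt, maxDelayMs);
      const jitter = Math.random() * delay * 0.1;
      await new Promise((r) => setTimeout(r, delay + jitter));
    }
  }
}
```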
3. Bulkhead Pattern
Isolate resources per consumer to prevent one slow consumer
from starving all others.
Example: Separate thread pools / connection pools per downstream service.
Netflix uses bulkheads in Hystrix to isolate service dependencies.
4. Saga Pattern (distributed transactions)
For multi-service transactions (no 2-phase commit):
Choreography: Each service emits events; others react.
Orchestration: Central saga orchestrator calls each service.
Compensation: If step N fails, run compensating transactions
for steps 1..N-1 to undo changes.
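The orchestration variant with compensation can be sketched as follows (`SagaStep` and the field names are illustrative):

```typescript
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

async function runSaga(steps: SagaStep[]): Promise<void> {
  const done: SagaStep[] = [];
  try {
    for (const step of steps) {
      await step.action();
      done.push(step);
    }
  } catch (err) {
    // Step N failed: undo steps 1..N-1 in reverse order
    for (const step of done.reverse()) {
      await step.compensate();
    }
    throw err;
  }
}
```

Each `compensate` should itself be idempotent, since an orchestrator that crashed mid-rollback may re-run it on recovery.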
5. Idempotent Operations
Design APIs so retrying a request has the same effect as calling once.
Use idempotency keys: POST /payments with Idempotency-Key: uuid
Server deduplicates by key → safe to retry on network failure.

Data Patterns
1. CQRS (Command Query Responsibility Segregation)
Separate write model (commands) from read model (queries).
Write side: normalized DB optimized for writes
Read side: denormalized read models optimized for queries
Sync via events or CDC (Change Data Capture)
Use when: read and write workloads have very different shapes
Example: Writes go to PostgreSQL; reads come from Elasticsearch
2. Event Sourcing
Store the full history of state changes as events.
Current state = replay all events from the beginning.
Benefits: Audit log, time travel, event replay for new features.
Challenges: Eventual consistency, event schema evolution.
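The "replay" idea is just a fold over the event history, sketched here (the account domain and event names are illustrative):

```typescript
type AccountEvent =
  | { type: 'deposited'; amount: number }
  | { type: 'withdrawn'; amount: number };

// Current state = reduce over all events from the beginning.
function replay(events: AccountEvent[]): { balance: number } {
  return events.reduce(
    (state, e) => ({
      balance: e.type === 'deposited'
        ? state.balance + e.amount
        : state.balance - e.amount,
    }),
    { balance: 0 },
  );
}
```

In practice systems snapshot periodically so replay starts from the latest snapshot rather than from event zero.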
3. Denormalization for Read Performance
Store redundant data to avoid expensive JOINs at read time.
Example: Store author name in each post row instead of joining users table.
Trade-off: Write complexity (update all copies on change) vs read speed.
4. Two-Phase Commit (2PC) โ use sparingly
Coordinator asks all participants: "Can you commit?"
If all say yes → coordinator sends commit to all.
Problems: Blocking protocol, coordinator is SPOF, slow.
Prefer: Saga pattern for most distributed transaction needs.

Quick-Reference Architecture Checklist
| Component | Technology Options | Use Case |
|---|---|---|
| Load Balancer | AWS ALB, Nginx, HAProxy | Distribute traffic, SSL termination, health checks |
| Cache | Redis, Memcached | Session store, rate limiting, hot data caching |
| Relational DB | PostgreSQL, MySQL, Aurora | Structured data, ACID transactions, complex queries |
| NoSQL (wide-column) | Cassandra, DynamoDB, HBase | Time-series, write-heavy, huge scale |
| Search | Elasticsearch, OpenSearch | Full-text search, faceted filtering |
| Message Queue | Kafka, RabbitMQ, SQS | Async processing, event streaming, decoupling |
| Object Storage | S3, GCS, Azure Blob | Images, videos, backups, static assets |
| CDN | Cloudflare, CloudFront, Fastly | Static assets, global latency reduction |
| API Gateway | Kong, AWS API GW, Nginx | Auth, rate limiting, routing, observability |
| Monitoring | Prometheus + Grafana, Datadog | Metrics, alerting, dashboards |
| Distributed Tracing | Jaeger, Zipkin, OpenTelemetry | Request tracing across microservices |
Conclusion
Mastering system design requires understanding not just individual components but how they fit together to form scalable, reliable, and maintainable systems. The most important skills are: asking the right clarifying questions, estimating scale before designing, knowing when to use SQL vs NoSQL vs cache, understanding the CAP theorem trade-offs in practice, and communicating your reasoning clearly.
Start every design with a simple, working system and only add complexity when you can point to a specific bottleneck that justifies it. A single PostgreSQL database with Redis caching serves millions of users when properly indexed and optimized โ you do not always need Kafka and Cassandra from day one. The best system designers know which tools to apply and, equally importantly, when not to add more complexity.
Practice with common interview questions: design a URL shortener, a social media feed, a ride-sharing service, a distributed key-value store, or a video streaming platform. Work through each using the five-step framework: requirements, estimation, high-level design, deep dive, and bottlenecks. With enough practice, the patterns become second nature.