微服务模式完全指南:服务设计、通信、Saga、CQRS、事件溯源与可观测性
全面掌握微服务架构模式 — 服务分解(DDD 限界上下文、绞杀者模式)、服务间通信(REST、gRPC、异步消息)、API 网关、Saga 编排与协调、CQRS、事件溯源、熔断器、隔舱、分布式追踪(OpenTelemetry)、服务网格及数据管理策略,配合生产级代码示例。
- 使用 DDD 限界上下文分解服务,用绞杀者模式渐进迁移
- 同步通信(REST/gRPC)用于查询,异步消息用于事件驱动解耦
- API 网关统一入口:路由、认证、限流、协议转换
- Saga 管理分布式事务 — 编排式(集中控制)或协调式(事件驱动)
- CQRS 分离读写模型,事件溯源保留完整状态变更历史
- 熔断器 + 隔舱 + 指数退避重试 = 弹性三剑客
- OpenTelemetry 分布式追踪 + 服务网格(Istio/Linkerd)实现全链路可观测性
- 每个服务独立数据库,杜绝共享数据库反模式
- 微服务不是银弹 — 只在团队规模和业务复杂度需要时采用
- 服务边界应与业务能力(而非技术层)对齐
- 拥抱最终一致性,放弃分布式强一致性的幻想
- 可观测性(日志、指标、追踪)是微服务的生命线,而非可选项
- 先用绞杀者模式渐进迁移,避免大爆炸重写
- 每个模式都有代价 — 评估收益是否超过引入的复杂性
1. 服务分解策略
DDD 限界上下文
Domain-Driven Design(领域驱动设计)的限界上下文是划分微服务边界最可靠的方法。每个限界上下文封装一个完整的业务领域,拥有独立的领域模型、通用语言和数据存储。
// E-commerce bounded contexts example
// Each context becomes a candidate microservice
// Order Context — owns order lifecycle
interface Order {
orderId: string;
customerId: string;
items: OrderItem[];
status: "pending" | "confirmed" | "shipped" | "delivered";
totalAmount: number;
}
// Inventory Context — owns stock management
interface InventoryItem {
sku: string;
warehouseId: string;
quantityAvailable: number;
reservedQuantity: number;
}
// Payment Context — owns payment processing
interface Payment {
paymentId: string;
orderId: string; // reference, not a foreign key
amount: number;
method: "credit_card" | "paypal" | "bank_transfer";
status: "pending" | "authorized" | "captured" | "refunded";
}
// Shipping Context — owns delivery logistics
interface Shipment {
shipmentId: string;
orderId: string;
carrier: string;
trackingNumber: string;
estimatedDelivery: Date;
}
绞杀者无花果模式(Strangler Fig)
绞杀者模式允许你渐进式地从单体迁移到微服务。通过 API 网关或反向代理,将流量逐步从单体路由到新的微服务,直到单体被完全替换。
# Nginx configuration for Strangler Fig migration
# Route new endpoints to microservices, legacy to monolith
upstream monolith {
server monolith-app:8080;
}
upstream order-service {
server order-service:3001;
}
upstream inventory-service {
server inventory-service:3002;
}
server {
listen 80;
# Migrated: orders now served by microservice
location /api/v2/orders {
proxy_pass http://order-service;
}
# Migrated: inventory now served by microservice
location /api/v2/inventory {
proxy_pass http://inventory-service;
}
# Everything else still goes to the monolith
location / {
proxy_pass http://monolith;
}
}
2. 服务间通信
同步通信:REST 与 gRPC
REST 使用 HTTP/JSON 实现简单的请求-响应通信,适合外部 API 和简单查询。gRPC 使用 Protocol Buffers 和 HTTP/2,提供更高性能、类型安全和双向流,适合服务间内部通信。
| 特性 | REST | gRPC |
|---|---|---|
| 协议 | HTTP/1.1 or HTTP/2 | HTTP/2 |
| 序列化 | JSON (text) | Protobuf (binary) |
| 类型安全 | 无(需 OpenAPI 生成) | 内置(proto 文件生成) |
| 流式传输 | 有限(SSE/WebSocket) | 原生双向流 |
| 浏览器支持 | 原生支持 | 需要 grpc-web 代理 |
| 最佳场景 | 公开 API、CRUD 操作 | 内部服务通信、高性能 |
// gRPC service definition (order.proto)
syntax = "proto3";
package order;
service OrderService {
rpc CreateOrder (CreateOrderRequest) returns (OrderResponse);
rpc GetOrder (GetOrderRequest) returns (OrderResponse);
rpc StreamOrderUpdates (GetOrderRequest) returns (stream OrderEvent);
}
message CreateOrderRequest {
string customer_id = 1;
repeated OrderItem items = 2;
}
message OrderItem {
string sku = 1;
int32 quantity = 2;
double unit_price = 3;
}
message OrderResponse {
string order_id = 1;
string status = 2;
double total_amount = 3;
}
异步消息传递
异步消息通过消息代理(Kafka、RabbitMQ、NATS)实现时间解耦。发送方不需要等待接收方处理完成,适合事件驱动架构和最终一致性场景。
// Publishing domain events with Kafka (Node.js + kafkajs)
import { Kafka, Partitioners } from "kafkajs";
const kafka = new Kafka({
clientId: "order-service",
brokers: ["kafka-1:9092", "kafka-2:9092"],
});
const producer = kafka.producer({
createPartitioner: Partitioners.DefaultPartitioner,
});
interface OrderCreatedEvent {
eventType: "OrderCreated";
orderId: string;
customerId: string;
items: Array<{ sku: string; quantity: number }>; // 消费方需要 items 才能预留库存
totalAmount: number;
timestamp: string;
}
async function publishOrderCreated(order: OrderCreatedEvent) {
await producer.connect();
await producer.send({
topic: "order-events",
messages: [
{
key: order.orderId,
value: JSON.stringify(order),
headers: {
"event-type": "OrderCreated",
"correlation-id": crypto.randomUUID(),
},
},
],
});
}
// Consuming events in inventory-service
const consumer = kafka.consumer({ groupId: "inventory-group" });
async function startConsumer() {
await consumer.connect();
await consumer.subscribe({ topic: "order-events" });
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
const event = JSON.parse(message.value!.toString());
if (event.eventType === "OrderCreated") {
await reserveInventory(event.orderId, event.items);
}
},
});
}
3. API 网关模式
API 网关是所有客户端请求的单一入口点,负责路由、认证、限流、协议转换和请求聚合。它隐藏了内部微服务拓扑,简化了客户端交互。
// Express.js API Gateway with routing + auth + rate limiting
import express from "express";
import jwt from "jsonwebtoken"; // 下方 authenticate 中间件依赖此导入
import { createProxyMiddleware } from "http-proxy-middleware";
import rateLimit from "express-rate-limit";
const app = express();
// Global rate limiting
app.use(rateLimit({
windowMs: 60 * 1000, // 1 minute
max: 100, // 100 requests per window
standardHeaders: true,
}));
// JWT authentication middleware
function authenticate(req, res, next) {
const token = req.headers.authorization?.split(" ")[1];
if (!token) return res.status(401).json({ error: "Unauthorized" });
try {
req.user = jwt.verify(token, process.env.JWT_SECRET);
next();
} catch {
res.status(403).json({ error: "Invalid token" });
}
}
// Route to microservices
app.use("/api/orders", authenticate, createProxyMiddleware({
target: "http://order-service:3001",
changeOrigin: true,
pathRewrite: { "^/api/orders": "/orders" },
}));
app.use("/api/inventory", authenticate, createProxyMiddleware({
target: "http://inventory-service:3002",
changeOrigin: true,
pathRewrite: { "^/api/inventory": "/inventory" },
}));
app.use("/api/payments", authenticate, createProxyMiddleware({
target: "http://payment-service:3003",
changeOrigin: true,
}));
app.listen(8080, () => console.log("API Gateway on :8080"));
服务发现
服务发现允许微服务在运行时动态定位其他服务,无需硬编码地址。客户端发现(服务自己查询注册中心)和服务端发现(负载均衡器查询注册中心)是两种主要模式。
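下面的 Consul 配置只展示了服务注册;客户端发现这一侧可以用如下草图示意(假设性示例,`resolveService` 和 `pickInstance` 均为示意用的假设名称):服务查询 Consul 的健康检查 HTTP API(`/v1/health/service/<name>?passing=true`)获取健康实例列表,再随机选取一个实现简单的客户端负载均衡。

```typescript
// 客户端发现草图(假设性实现,非生产代码)
// Consul 健康检查 API 返回的条目中,Service 字段包含实例地址和端口
interface ConsulEntry {
  Service: { Address: string; Port: number };
}

// 从健康实例列表中随机选取一个(最简单的客户端负载均衡)
function pickInstance(entries: ConsulEntry[]): string {
  if (entries.length === 0) throw new Error("no healthy instance");
  const e = entries[Math.floor(Math.random() * entries.length)];
  return e.Service.Address + ":" + e.Service.Port;
}

// 运行时解析服务地址(需要可访问的 Consul agent,此处仅示意)
async function resolveService(name: string): Promise<string> {
  const res = await fetch(
    "http://consul:8500/v1/health/service/" + name + "?passing=true"
  );
  return pickInstance(await res.json());
}
```

生产环境通常还会在客户端缓存实例列表并监听变更,避免每次调用都查询注册中心。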
# docker-compose.yml with Consul for service discovery
version: "3.8"
services:
consul:
image: hashicorp/consul:1.17
ports:
- "8500:8500"
command: agent -server -bootstrap-expect=1 -ui -client=0.0.0.0
order-service:
build: ./order-service
environment:
CONSUL_HOST: consul
SERVICE_NAME: order-service
SERVICE_PORT: 3001
depends_on:
- consul
inventory-service:
build: ./inventory-service
environment:
CONSUL_HOST: consul
SERVICE_NAME: inventory-service
SERVICE_PORT: 3002
depends_on:
- consul
4. Saga 模式:分布式事务管理
Saga 模式将分布式事务分解为一系列本地事务,每个本地事务都有对应的补偿操作。当某个步骤失败时,已完成的步骤会被逆序补偿,实现最终一致性。
编排式 Saga(集中控制器)
编排式 Saga 使用一个中心协调器(Saga Orchestrator)来管理整个事务流程。协调器知道所有步骤的执行顺序,并在失败时触发补偿操作。
// Saga Orchestrator for Order Processing
interface SagaStep {
name: string;
execute: (context: SagaContext) => Promise<void>;
compensate: (context: SagaContext) => Promise<void>;
}
interface SagaContext {
orderId: string;
customerId: string;
items: Array<{ sku: string; qty: number }>;
paymentId?: string;
shipmentId?: string;
}
class SagaOrchestrator {
private steps: SagaStep[] = [];
private completedSteps: SagaStep[] = [];
addStep(step: SagaStep) {
this.steps.push(step);
return this;
}
async execute(context: SagaContext): Promise<void> {
for (const step of this.steps) {
try {
console.log("Executing: " + step.name);
await step.execute(context);
this.completedSteps.push(step);
} catch (error) {
console.error("Failed at: " + step.name + ", compensating...");
await this.compensate(context);
throw error;
}
}
}
private async compensate(context: SagaContext): Promise<void> {
// 复制后再反转,避免原地 reverse() 修改 completedSteps
for (const step of [...this.completedSteps].reverse()) {
try {
console.log("Compensating: " + step.name);
await step.compensate(context);
} catch (err) {
console.error("Compensation failed for: " + step.name);
// Log for manual intervention
}
}
}
}
// Usage
const orderSaga = new SagaOrchestrator()
.addStep({
name: "ReserveInventory",
execute: async (ctx) => { /* call inventory-service */ },
compensate: async (ctx) => { /* release reserved stock */ },
})
.addStep({
name: "ProcessPayment",
execute: async (ctx) => { /* call payment-service */ },
compensate: async (ctx) => { /* refund payment */ },
})
.addStep({
name: "CreateShipment",
execute: async (ctx) => { /* call shipping-service */ },
compensate: async (ctx) => { /* cancel shipment */ },
});
协调式 Saga(事件驱动)
协调式 Saga 没有中央控制器。每个服务监听事件并发布新事件,形成事件链。服务之间完全解耦,但事务流程更难追踪和调试。
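可以用一个内存事件总线勾勒协调式 Saga 的事件链(假设性草图:`InMemoryEventBus` 及各处理器均为示意,生产中应使用 Kafka/RabbitMQ 等消息代理;此处模拟支付失败触发补偿):

```typescript
// 协调式 Saga 草图:没有中央协调器,流程由各服务对事件的响应自然形成
type DomainEvent = { type: string; orderId: string };
type Handler = (e: DomainEvent) => void;

class InMemoryEventBus {
  private handlers = new Map<string, Handler[]>();
  subscribe(type: string, h: Handler) {
    this.handlers.set(type, [...(this.handlers.get(type) ?? []), h]);
  }
  publish(e: DomainEvent) {
    for (const h of this.handlers.get(e.type) ?? []) h(e);
  }
}

const bus = new InMemoryEventBus();
const log: string[] = [];

// inventory-service:监听 OrderCreated,预留库存
bus.subscribe("OrderCreated", (e) => {
  log.push("reserve:" + e.orderId);
  bus.publish({ type: "InventoryReserved", orderId: e.orderId });
});
// payment-service:监听 InventoryReserved,扣款(此处模拟扣款失败)
bus.subscribe("InventoryReserved", (e) => {
  log.push("charge-failed:" + e.orderId);
  bus.publish({ type: "PaymentFailed", orderId: e.orderId });
});
// inventory-service:监听 PaymentFailed,执行补偿(释放已预留库存)
bus.subscribe("PaymentFailed", (e) => {
  log.push("release:" + e.orderId);
  bus.publish({ type: "InventoryReleased", orderId: e.orderId });
});
// order-service:监听 InventoryReleased,取消订单
bus.subscribe("InventoryReleased", (e) => {
  log.push("cancel:" + e.orderId);
});

bus.publish({ type: "OrderCreated", orderId: "o-1" });
// 事件链:预留库存 → 扣款失败 → 补偿释放 → 取消订单
```

注意补偿逻辑分散在各个服务中,这正是协调式 Saga 难以追踪的原因:要理解完整流程,必须通读所有订阅关系。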
5. CQRS — 命令查询职责分离
CQRS 将应用的读(Query)和写(Command)模型分离为不同的数据模型甚至不同的数据库。写模型针对一致性和业务规则优化,读模型针对查询性能优化。
// CQRS implementation with separate read/write models
// ── Command Side (Write Model) ──
interface CreateOrderCommand {
type: "CreateOrder";
customerId: string;
items: Array<{ sku: string; quantity: number; price: number }>;
}
class OrderCommandHandler {
constructor(
private writeRepo: OrderWriteRepository,
private eventBus: EventBus
) {}
async handle(cmd: CreateOrderCommand): Promise<string> {
// Business validation
if (cmd.items.length === 0) throw new Error("Order must have items");
const order = {
id: crypto.randomUUID(),
customerId: cmd.customerId,
items: cmd.items,
status: "pending" as const,
total: cmd.items.reduce((s, i) => s + i.price * i.quantity, 0),
createdAt: new Date(),
};
await this.writeRepo.save(order);
// Publish event to sync read model
await this.eventBus.publish({
type: "OrderCreated",
payload: order,
timestamp: new Date().toISOString(),
});
return order.id;
}
}
// ── Query Side (Read Model) ──
interface OrderReadModel {
orderId: string;
customerName: string; // denormalized
itemCount: number; // precomputed
totalFormatted: string; // preformatted
status: string;
createdAt: string;
}
// Read model projection — listens to events and updates view
class OrderProjection {
constructor(private readRepo: OrderReadRepository) {}
async onOrderCreated(event: OrderCreatedEvent) {
const customer = await lookupCustomer(event.payload.customerId);
await this.readRepo.upsert({
orderId: event.payload.id,
customerName: customer.name,
itemCount: event.payload.items.length,
totalFormatted: "$" + event.payload.total.toFixed(2),
status: event.payload.status,
createdAt: event.timestamp,
});
}
}
6. 事件溯源
事件溯源将每个状态变更存储为不可变事件,而不是只保存当前状态。当前状态通过重放事件序列来重建。这提供了完整的审计日志,支持时间旅行查询,并与 CQRS 天然配合。
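配合下文的 OrderAggregate,这里先给出事件存储的一个内存版草图(假设性实现,仅用于演示追加与重放;生产中可选用 EventStoreDB 或基于 Postgres 的追加表):

```typescript
// 内存事件存储草图(假设性实现):append-only 流 + 全量读取用于重放
type StoredEvent = { type: string; data: Record<string, unknown> };

class InMemoryEventStore {
  private streams = new Map<string, StoredEvent[]>();

  // 追加事件到指定流(只追加,从不修改历史)
  async append(streamId: string, events: StoredEvent[]): Promise<void> {
    const stream = this.streams.get(streamId) ?? [];
    this.streams.set(streamId, [...stream, ...events]);
  }

  // 读取流的全部事件,调用方重放这些事件以重建聚合状态
  async readStream(streamId: string): Promise<StoredEvent[]> {
    return [...(this.streams.get(streamId) ?? [])];
  }
}
```

使用方式:聚合保存时将未提交的事件 append 到流;加载时 readStream 取出全部事件,交给类似下文 fromHistory 的方法重放。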
// Event Sourcing for an Order aggregate
type OrderEvent =
| { type: "OrderCreated"; data: { id: string; customerId: string; items: any[] } }
| { type: "OrderConfirmed"; data: { confirmedAt: string } }
| { type: "ItemAdded"; data: { sku: string; quantity: number } }
| { type: "OrderShipped"; data: { trackingNumber: string } }
| { type: "OrderCancelled"; data: { reason: string } };
interface EventStore {
append(streamId: string, events: OrderEvent[]): Promise<void>;
readStream(streamId: string): Promise<OrderEvent[]>;
}
class OrderAggregate {
private id = "";
private status = "draft";
private items: any[] = [];
private changes: OrderEvent[] = [];
// Rebuild state from event history
static fromHistory(events: OrderEvent[]): OrderAggregate {
const order = new OrderAggregate();
for (const event of events) {
order.apply(event);
}
return order;
}
private apply(event: OrderEvent) {
switch (event.type) {
case "OrderCreated":
this.id = event.data.id;
this.items = event.data.items;
this.status = "pending";
break;
case "OrderConfirmed":
this.status = "confirmed";
break;
case "OrderShipped":
this.status = "shipped";
break;
case "OrderCancelled":
this.status = "cancelled";
break;
}
}
// Command that produces new events
confirm() {
if (this.status !== "pending") throw new Error("Only pending orders");
const event: OrderEvent = {
type: "OrderConfirmed",
data: { confirmedAt: new Date().toISOString() },
};
this.apply(event);
this.changes.push(event);
}
getUncommittedChanges(): OrderEvent[] {
return [...this.changes];
}
}
7. 熔断器模式
熔断器模式防止级联故障。当下游服务调用失败率超过阈值时,熔断器打开,直接返回错误而不发起请求,保护上游服务的资源。经过超时后进入半开状态,允许少量探测请求通过。
// Circuit Breaker implementation in TypeScript
type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";
class CircuitBreaker {
private state: CircuitState = "CLOSED";
private failureCount = 0;
private successCount = 0;
private lastFailureTime = 0;
constructor(
private failureThreshold: number = 5,
private resetTimeoutMs: number = 30000,
private halfOpenMaxCalls: number = 3
) {}
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === "OPEN") {
if (Date.now() - this.lastFailureTime > this.resetTimeoutMs) {
this.state = "HALF_OPEN";
this.successCount = 0;
} else {
throw new Error("Circuit is OPEN — request rejected");
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
if (this.state === "HALF_OPEN") {
this.successCount++;
if (this.successCount >= this.halfOpenMaxCalls) {
this.state = "CLOSED";
this.failureCount = 0;
}
} else {
this.failureCount = 0;
}
}
private onFailure() {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.failureCount >= this.failureThreshold) {
this.state = "OPEN";
}
}
}
// Usage
const breaker = new CircuitBreaker(5, 30000, 3);
async function getOrderFromService(orderId: string) {
return breaker.call(async () => {
const res = await fetch(
"http://order-service:3001/orders/" + orderId
);
if (!res.ok) throw new Error("Service error: " + res.status);
return res.json();
});
}
Resilience4j 配置示例(Java/Spring Boot)
# application.yml — Resilience4j circuit breaker config
resilience4j:
circuitbreaker:
instances:
orderService:
registerHealthIndicator: true
slidingWindowType: COUNT_BASED
slidingWindowSize: 10
failureRateThreshold: 50
waitDurationInOpenState: 30s
permittedNumberOfCallsInHalfOpenState: 3
automaticTransitionFromOpenToHalfOpenEnabled: true
recordExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
retry:
instances:
orderService:
maxAttempts: 3
waitDuration: 1s
enableExponentialBackoff: true
exponentialBackoffMultiplier: 2
bulkhead:
instances:
orderService:
maxConcurrentCalls: 25
maxWaitDuration: 500ms
8. 隔舱模式
隔舱模式借鉴了船舶设计中的水密隔舱概念。通过隔离资源池(线程池、连接池、内存),确保一个下游服务的故障不会耗尽整个系统的资源,限制故障的影响范围。
// Bulkhead pattern — isolated resource pools per service
class Bulkhead {
private activeCount = 0;
private queue: Array<{ resolve: Function; reject: Function }> = [];
constructor(
private maxConcurrent: number,
private maxQueueSize: number,
private timeoutMs: number
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.activeCount >= this.maxConcurrent) {
if (this.queue.length >= this.maxQueueSize) {
throw new Error("Bulkhead full — request rejected");
}
await new Promise<void>((resolve, reject) => {
const timer = setTimeout(
() => reject(new Error("Bulkhead queue timeout")),
this.timeoutMs
);
this.queue.push({
resolve: () => { clearTimeout(timer); resolve(); },
reject,
});
});
}
this.activeCount++;
try {
return await fn();
} finally {
this.activeCount--;
if (this.queue.length > 0) {
const next = this.queue.shift()!;
next.resolve();
}
}
}
}
// Separate bulkheads per downstream dependency
const orderBulkhead = new Bulkhead(10, 20, 5000);
const paymentBulkhead = new Bulkhead(5, 10, 3000);
const inventoryBulkhead = new Bulkhead(15, 30, 5000);
9. 指数退避重试
瞬时故障(网络抖动、暂时过载)可以通过重试解决,但固定间隔重试可能造成惊群效应(thundering herd)。指数退避加随机抖动可以有效分散重试请求,避免大量客户端同时向下游服务发起重试。
// Retry with exponential backoff + jitter
interface RetryConfig {
maxAttempts: number;
baseDelayMs: number;
maxDelayMs: number;
jitter: boolean;
}
async function retryWithBackoff<T>(
fn: () => Promise<T>,
config: RetryConfig
): Promise<T> {
let lastError: Error | undefined;
for (let attempt = 0; attempt < config.maxAttempts; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
if (attempt === config.maxAttempts - 1) break;
// Calculate delay: base * 2^attempt
let delay = Math.min(
config.baseDelayMs * Math.pow(2, attempt),
config.maxDelayMs
);
// Add jitter: random value between 0 and delay
if (config.jitter) {
delay = Math.random() * delay;
}
console.log(
"Attempt " + (attempt + 1) + " failed. " +
"Retrying in " + Math.round(delay) + "ms..."
);
await new Promise(r => setTimeout(r, delay));
}
}
throw lastError;
}
// Usage
const result = await retryWithBackoff(
() => fetch("http://payment-service:3003/charge"),
{ maxAttempts: 4, baseDelayMs: 500, maxDelayMs: 8000, jitter: true }
);
10. 分布式追踪与 OpenTelemetry
OpenTelemetry 是 CNCF 的可观测性标准。它通过在服务间传播 trace context(trace ID + span ID),为每个请求提供端到端的链路追踪、延迟分析和错误定位。
// OpenTelemetry setup for a Node.js microservice
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import {
SEMRESATTRS_SERVICE_NAME,
SEMRESATTRS_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: "order-service",
[SEMRESATTRS_SERVICE_VERSION]: "1.2.0",
}),
traceExporter: new OTLPTraceExporter({
url: "http://jaeger:4318/v1/traces",
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
console.log("OpenTelemetry tracing initialized");
// Custom span for business logic
import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("order-service");
async function processOrder(orderId: string) {
return tracer.startActiveSpan(
"processOrder",
async (span) => {
try {
span.setAttribute("order.id", orderId);
const order = await fetchOrder(orderId);
span.setAttribute("order.total", order.total);
span.setAttribute("order.items_count", order.items.length);
await validateOrder(order);
await chargePayment(order);
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
}
);
}
11. 服务网格(Istio / Linkerd)
服务网格通过 sidecar 代理透明地处理服务间通信,提供流量管理、安全(mTLS)、可观测性和弹性功能,无需修改应用代码。Istio 和 Linkerd 是两个最流行的实现。
| 特性 | Istio | Linkerd |
|---|---|---|
| 代理 | Envoy | linkerd2-proxy (Rust) |
| 复杂度 | 高(功能丰富) | 低(轻量简洁) |
| mTLS | 内置,自动 | 内置,默认开启 |
| 流量管理 | 高级(金丝雀、故障注入、镜像) | 基础(流量分割) |
| 资源开销 | 较高 | 较低 |
| 最佳场景 | 大规模企业级部署 | 中小规模、快速上手 |
# Istio VirtualService for canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: order-service
spec:
hosts:
- order-service
http:
- route:
- destination:
host: order-service
subset: stable
weight: 90
- destination:
host: order-service
subset: canary
weight: 10
retries:
attempts: 3
perTryTimeout: 2s
timeout: 10s
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: order-service
spec:
host: order-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
h2UpgradePolicy: DEFAULT
maxRequestsPerConnection: 10
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 60s
subsets:
- name: stable
labels:
version: v1
- name: canary
labels:
version: v2
12. 数据管理策略
每服务独立数据库
每个微服务拥有自己的数据库实例或 schema,确保数据隔离和独立部署。服务之间通过 API 或事件进行数据交换,而非共享数据库。
# docker-compose.yml — database per service
services:
order-db:
image: postgres:16
environment:
POSTGRES_DB: orders
POSTGRES_USER: order_svc
POSTGRES_PASSWORD_FILE: /run/secrets/order_db_pw
volumes:
- order-data:/var/lib/postgresql/data
networks:
- order-net # isolated network
inventory-db:
image: mongo:7
environment:
MONGO_INITDB_DATABASE: inventory
volumes:
- inventory-data:/data/db
networks:
- inventory-net
payment-db:
image: postgres:16
environment:
POSTGRES_DB: payments
POSTGRES_USER: payment_svc
POSTGRES_PASSWORD_FILE: /run/secrets/payment_db_pw
volumes:
- payment-data:/var/lib/postgresql/data
networks:
- payment-net
volumes:
order-data:
inventory-data:
payment-data:
networks:
order-net:
inventory-net:
payment-net:
共享数据库反模式
多个服务直接读写同一个数据库是最常见的反模式:schema 变更需要跨团队协调,服务无法独立部署和扩展,数据所有权变得模糊。当一个服务确实需要另一个服务的数据时,应通过其公开 API 查询,或订阅其领域事件在本地维护只读副本,而不是越过边界直接访问对方的表。
13. 健康检查与就绪探针
健康检查(liveness probe)确认服务进程是否存活;就绪探针(readiness probe)确认服务是否准备好接收流量。在 Kubernetes 中,这两个探针决定了 Pod 的生命周期管理和流量路由。
// Health check endpoints in Express.js
import express from "express";
const app = express();
let isReady = false;
// Liveness — is the process alive?
app.get("/health/live", (req, res) => {
res.status(200).json({ status: "alive" });
});
// Readiness — can it accept traffic?
app.get("/health/ready", async (req, res) => {
if (!isReady) {
return res.status(503).json({ status: "not ready" });
}
try {
// Check critical dependencies
await checkDatabaseConnection();
await checkCacheConnection();
res.status(200).json({
status: "ready",
checks: { database: "ok", cache: "ok" },
});
} catch (error) {
res.status(503).json({
status: "not ready",
error: (error as Error).message,
});
}
});
// Startup sequence
async function bootstrap() {
await connectToDatabase();
await connectToCache();
await warmUpCaches();
isReady = true;
console.log("Service is ready to accept traffic");
}
bootstrap();
app.listen(3001);
# Kubernetes deployment with health probes
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
spec:
replicas: 3
selector:
matchLabels:
app: order-service
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: order-service
image: order-service:1.2.0
ports:
- containerPort: 3001
livenessProbe:
httpGet:
path: /health/live
port: 3001
initialDelaySeconds: 10
periodSeconds: 15
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 3001
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 3
startupProbe:
httpGet:
path: /health/live
port: 3001
initialDelaySeconds: 0
periodSeconds: 5
failureThreshold: 30
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
14. 模式决策矩阵
下表帮助你根据具体场景选择合适的微服务模式。
| 场景 | 推荐模式 | 关键考量 |
|---|---|---|
| 从单体迁移 | Strangler Fig | 渐进式,低风险 |
| 跨服务事务 | Saga (Orchestration) | 需要补偿逻辑 |
| 高频读低频写 | CQRS | 读写模型独立扩展 |
| 完整审计日志 | Event Sourcing | 事件存储增长需管理 |
| 防止级联故障 | Circuit Breaker + Bulkhead | 配合重试和降级 |
| 请求链路追踪 | OpenTelemetry | 所有服务必须集成 |
| 零信任安全 | Service Mesh (Istio) | 自动 mTLS,运维开销较大 |
| 事件驱动解耦 | Async Messaging (Kafka) | 接受最终一致性 |
总结
微服务架构不是目标,而是实现业务敏捷性和可扩展性的手段。成功的关键在于:理解每个模式的适用场景和代价,以 DDD 限界上下文驱动服务分解,用 Saga 管理分布式事务,用 CQRS/事件溯源解决复杂查询和审计需求,用熔断器和隔舱保障弹性,用 OpenTelemetry 实现可观测性。从小处着手,按需引入模式,避免过度设计。
记住:分布式系统的复杂性是真实的代价。如果一个模块化的单体能满足需求,那就是最好的选择。只有当团队规模、部署频率和业务域复杂度真正需要时,才逐步拆分为微服务。