- OTel unifies the three signals: traces, metrics, and logs, correlated through shared context
- The architecture splits into the API (interfaces), SDK (implementation), and Collector (data pipeline)
- Auto-instrumentation generates telemetry with zero code changes
- OTLP is the standard protocol, supported by every major backend
- Tail sampling cuts cost while keeping error traces
- The Kubernetes Operator simplifies in-cluster deployment and management
What is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework for generating, collecting, and exporting telemetry data. Hosted by the CNCF and born from the merger of OpenTracing and OpenCensus, it provides a unified API, SDK, and tooling covering the three signals: traces, metrics, and logs.
OpenTelemetry Architecture
The OTel architecture has three layers: the API (defines interfaces), the SDK (provides implementations), and the Collector (data pipeline).
Application Layer         Collector Layer          Backend Layer
+--------------------+    +--------------------+   +------------+
| OTel API           |    | Receivers          |   | Jaeger     |
|  TracerProvider    |--->|  otlp, jaeger,     |-->| Tempo      |
|  MeterProvider     |    |  prometheus, zipkin|   | Zipkin     |
|  LoggerProvider    |    +--------------------+   +------------+
+--------------------+    | Processors         |   | Prometheus |
| OTel SDK           |    |  batch, filter,    |-->| Mimir      |
|  SpanProcessor     |    |  attributes, sample|   | Datadog    |
|  MetricReader      |    +--------------------+   +------------+
|  LogRecordProcessor|    | Exporters          |   | New Relic  |
|  OTLP Exporter     |    |  otlp, prometheus, |-->| Grafana    |
+--------------------+    |  datadog, debug    |   | Loki       |
                          +--------------------+   +------------+
API Layer
The API defines zero-dependency interfaces (TracerProvider, MeterProvider, LoggerProvider). Library authors can safely instrument against the API without pulling in a specific implementation.
SDK Layer
The SDK provides concrete implementations of the API, including span processors, metric aggregators, and exporters. Application developers configure the SDK at the application's entry point.
Collector Layer
The Collector is a standalone service that receives, processes, and exports telemetry. It decouples applications from backends and supports batching, retries, and fan-out to multiple destinations.
The Three Signals
Distributed Tracing (Traces)
A trace records a request's complete path through a distributed system. A trace consists of multiple spans linked by parent-child relationships into a tree.
Trace: [trace_id: abc123]
|
+-- Span A: API Gateway (root span, 250ms)
| attributes: http.method=GET, http.url=/api/orders
| +-- Span B: Order Service (200ms)
| | +-- Span C: DB Query (45ms, db.system=postgresql)
| | +-- Span D: Cache Lookup (3ms, db.system=redis)
|   +-- Span E: Payment Service (35ms, status=ERROR)
Metrics
OTel metrics define four instrument kinds: Counter (monotonically increasing), Histogram (distribution statistics), Gauge (instantaneous value), and UpDownCounter (can increase or decrease).
Logs
OTel logs integrate with existing logging frameworks (Log4j, SLF4J, Python logging) through a Bridge API, correlating log records with trace context.
Installation and Setup
Node.js
npm install @opentelemetry/api @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http
// tracing.ts - Initialize OpenTelemetry
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from
'@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from
'@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from
'@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from
'@opentelemetry/sdk-metrics';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'http://localhost:4318/v1/traces',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: 'http://localhost:4318/v1/metrics',
}),
}),
instrumentations: [getNodeAutoInstrumentations()],
serviceName: 'my-node-service',
});
sdk.start();
process.on('SIGTERM', () => sdk.shutdown());
Python
pip install opentelemetry-api opentelemetry-sdk \
opentelemetry-exporter-otlp opentelemetry-instrumentation
# Auto-instrument - no code changes needed:
opentelemetry-instrument \
--service_name my-python-service \
--traces_exporter otlp \
--metrics_exporter otlp \
--exporter_otlp_endpoint http://localhost:4317 \
  python app.py
Go
go get go.opentelemetry.io/otel \
go.opentelemetry.io/otel/sdk \
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
// main.go
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initTracer() (func(context.Context) error, error) {
    exporter, err := otlptracehttp.New(
        context.Background(),
        otlptracehttp.WithEndpoint("localhost:4318"),
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-go-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp.Shutdown, nil
}
Java
# Download the Java agent and run with your app
curl -L -o opentelemetry-javaagent.jar \
https://github.com/open-telemetry/\
opentelemetry-java-instrumentation/releases/latest/\
download/opentelemetry-javaagent.jar
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=my-java-service \
-Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -jar myapp.jar
Auto-Instrumentation
Auto-instrumentation intercepts calls to popular libraries via monkey-patching or an agent, generating spans and propagating context with no changes to business code. Commonly supported libraries per language:
- Node.js: Express, Fastify, HTTP, gRPC, pg, mysql2, Redis, MongoDB, AWS SDK
- Python: Flask, Django, FastAPI, requests, psycopg2, SQLAlchemy, Redis, Celery
- Go: net/http, gRPC, database/sql, Gin, Echo
- Java: Spring Boot, Servlet, JDBC, Hibernate, Kafka, gRPC, OkHttp
Manual Instrumentation
Manual instrumentation gives you full control over telemetry, and suits business-specific spans, custom attributes, and metrics.
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service', '1.0.0');
async function processOrder(orderId: string) {
return tracer.startActiveSpan('processOrder', async (span) => {
try {
span.setAttribute('order.id', orderId);
span.addEvent('order.validation.started');
const order = await validateOrder(orderId);
span.addEvent('order.validation.completed', {
'order.items_count': order.items.length,
});
// Nested span for payment
await tracer.startActiveSpan('processPayment',
async (paymentSpan) => {
paymentSpan.setAttribute(
'payment.method', order.paymentMethod);
await chargePayment(order);
paymentSpan.end();
});
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
Custom Metrics
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service', '1.0.0');
// Counter
const orderCounter = meter.createCounter('orders.processed.total',
{ description: 'Total orders processed', unit: 'orders' });
// Histogram
const durationHist = meter.createHistogram(
'orders.processing.duration',
{ description: 'Processing time', unit: 'ms' });
// Observable Gauge
const activeGauge = meter.createObservableGauge(
'orders.active.count',
{ description: 'Active orders' });
activeGauge.addCallback((r) => r.observe(getActiveCount()));
orderCounter.add(1, { 'order.type': 'standard' });
durationHist.record(245, { 'order.type': 'standard' });
Context Propagation
Context propagation links spans across services into a complete trace. W3C Trace Context injects the trace ID, span ID, and sampling flags via the traceparent header.
// W3C Trace Context header:
// traceparent: 00-<trace-id>-<parent-span-id>-<flags>
import { context, propagation } from '@opentelemetry/api';
// Inject context into outgoing request
function makeRequest(url: string) {
const headers: Record<string, string> = {};
propagation.inject(context.active(), headers);
return fetch(url, { headers });
}
// Extract context from incoming request
function handleRequest(req: Request) {
const ctx = propagation.extract(
context.active(), req.headers);
return context.with(ctx, () => {
return tracer.startActiveSpan('handle', (span) => {
// child of caller span
span.end();
});
});
}
Span Attributes and Events
Attributes are key-value metadata; events are timestamped records within a span. OTel semantic conventions standardize common attribute names.
// Semantic Conventions examples:
// HTTP: http.request.method, http.response.status_code, url.full
// DB: db.system, db.statement, db.operation.name
// RPC: rpc.system, rpc.service, rpc.method
span.setAttribute('http.request.method', 'POST');
span.setAttribute('http.response.status_code', 200);
span.setAttribute('db.system', 'postgresql');
span.setAttribute('db.statement', 'SELECT * FROM orders WHERE id=?');
span.addEvent('cache.miss', {
'cache.key': 'user:1234',
'cache.backend': 'redis',
});
span.recordException(new Error('Connection timeout'));
Exporters
Exporters send telemetry to a backend. OTLP is OTel's native protocol and is supported by every major backend.
- OTLP (gRPC / HTTP): the recommended standard protocol; carries all three signals
- Jaeger: modern Jaeger ingests OTLP directly; the dedicated Jaeger exporters are deprecated
- Zipkin: for Zipkin-compatible backends
- Prometheus: exposes a /metrics endpoint for scraping
- Console/Debug: for development and debugging
Collector Configuration
The Collector is configured in YAML with receivers, processors, exporters, and pipelines. A full production-style configuration:
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc: { endpoint: 0.0.0.0:4317 }
http: { endpoint: 0.0.0.0:4318 }
prometheus:
config:
scrape_configs:
- job_name: app-metrics
scrape_interval: 15s
static_configs:
- targets: ['app:9090']
jaeger:
protocols:
thrift_http: { endpoint: 0.0.0.0:14268 }
processors:
batch:
timeout: 5s
send_batch_size: 1024
resource:
attributes:
- key: environment
value: production
action: upsert
filter:
error_mode: ignore
traces:
span:
- 'attributes["http.target"] == "/health"'
memory_limiter:
check_interval: 1s
limit_mib: 2048
spike_limit_mib: 512
exporters:
otlp/tempo:
endpoint: tempo:4317
tls: { insecure: true }
otlp/mimir:
endpoint: mimir:4317
tls: { insecure: true }
debug:
verbosity: detailed
service:
pipelines:
traces:
receivers: [otlp, jaeger]
      processors: [memory_limiter, filter, resource, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
exporters: [otlp/mimir]
logs:
receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [debug]
Running the Collector
# Docker
docker run -d --name otel-collector \
-p 4317:4317 -p 4318:4318 \
-v ./otel-collector-config.yaml:/etc/otelcol/config.yaml \
otel/opentelemetry-collector-contrib:latest
# Docker Compose
services:
otel-collector:
image: otel/opentelemetry-collector-contrib:latest
command: ["--config=/etc/otelcol/config.yaml"]
volumes:
- ./otel-collector-config.yaml:/etc/otelcol/config.yaml
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
      - "8888:8888"  # Collector metrics
Sampling Strategies
Collecting every trace in a high-traffic system is impractical. Sampling strategies balance observability against cost.
Head Sampling (SDK-side)
import {
TraceIdRatioBasedSampler,
ParentBasedSampler,
} from '@opentelemetry/sdk-trace-base';
// ParentBased: respect parent decision, sample 10% of roots
const sampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1),
});
const sdk = new NodeSDK({ sampler, /* ... */ });
Tail Sampling (Collector-side)
Tail sampling decides in the Collector after seeing the complete trace, which makes it well suited to keeping every error and high-latency trace.
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
- name: error-policy
type: status_code
status_code: { status_codes: [ERROR] }
- name: latency-policy
type: latency
latency: { threshold_ms: 2000 }
- name: probabilistic-policy
type: probabilistic
probabilistic: { sampling_percentage: 5 }
- name: string-attr-policy
type: string_attribute
string_attribute:
key: priority
        values: [high, critical]
Integrating Observability Backends
Grafana Stack (Tempo + Mimir + Loki)
exporters:
otlp/tempo:
endpoint: tempo:4317
tls: { insecure: true }
otlphttp/mimir:
endpoint: http://mimir:9009/otlp
otlphttp/loki:
endpoint: http://loki:3100/otlp
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [otlphttp/mimir]
logs:
receivers: [otlp]
processors: [batch]
      exporters: [otlphttp/loki]
Datadog
exporters:
datadog:
api:
    key: "${DD_API_KEY}"
site: datadoghq.com
traces:
span_name_as_resource_name: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
      exporters: [datadog]
New Relic
exporters:
otlp/newrelic:
endpoint: otlp.nr-data.net:4317
headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
      exporters: [otlp/newrelic]
Kubernetes Deployment
The OpenTelemetry Operator is the recommended way to run OTel on Kubernetes, providing CRDs for managing Collectors and injecting auto-instrumentation.
# Install cert-manager + OTel Operator
kubectl apply -f https://github.com/cert-manager/cert-manager/\
releases/download/v1.14.0/cert-manager.yaml
helm repo add open-telemetry \
https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-operator \
open-telemetry/opentelemetry-operator \
  --namespace otel-system --create-namespace
Collector CRD (DaemonSet)
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: otel
namespace: otel-system
spec:
mode: daemonset
config:
receivers:
otlp:
protocols:
grpc: { endpoint: 0.0.0.0:4317 }
http: { endpoint: 0.0.0.0:4318 }
processors:
batch: { timeout: 5s }
k8sattributes:
extract:
metadata:
- k8s.pod.name
- k8s.namespace.name
- k8s.deployment.name
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
otlp:
endpoint: tempo.observability:4317
tls: { insecure: true }
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]
Auto-Instrumentation Injection
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
name: otel-instrumentation
spec:
exporter:
endpoint: http://otel-collector.otel-system:4317
propagators: [tracecontext, baggage]
sampler:
type: parentbased_traceidratio
argument: "0.25"
nodejs:
image: ghcr.io/open-telemetry/opentelemetry-operator/\
autoinstrumentation-nodejs:latest
python:
image: ghcr.io/open-telemetry/opentelemetry-operator/\
autoinstrumentation-python:latest
---
# Annotate Deployment for auto-instrumentation
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-node-app
spec:
template:
metadata:
annotations:
instrumentation.opentelemetry.io/inject-nodejs: "true"
spec:
containers:
- name: app
        image: my-node-app:latest
Best Practices
1. Follow semantic conventions
Use standard attribute names (e.g. http.request.method rather than method) so data stays consistent across services and backends can parse it automatically.
2. Set resource attributes
import { Resource } from '@opentelemetry/resources';
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
ATTR_DEPLOYMENT_ENVIRONMENT_NAME,
} from '@opentelemetry/semantic-conventions';
const resource = new Resource({
[ATTR_SERVICE_NAME]: 'order-service',
[ATTR_SERVICE_VERSION]: '2.1.0',
[ATTR_DEPLOYMENT_ENVIRONMENT_NAME]: 'production',
});
3. Control span granularity
Create spans for cross-network calls, database operations, and key business operations. Avoid creating spans inside tight loops.
4. Manage span lifecycles correctly
Always end spans in a finally block. Use startActiveSpan for async work so the context stays attached.
5. Sample in production
Don't use AlwaysOn. Start with ParentBased + TraceIdRatio(0.1), combined with tail sampling to keep error traces.
6. Use the Collector
The Collector provides buffering, retries, batching, and fan-out to multiple targets, reducing network connections and resource usage on the application side.
7. Correlate the three signals
// Inject trace context into logs (Node.js + Winston)
import { trace, context } from '@opentelemetry/api';
import winston from 'winston';
const logger = winston.createLogger({
format: winston.format.combine(
winston.format((info) => {
const span = trace.getSpan(context.active());
if (span) {
const ctx = span.spanContext();
info.trace_id = ctx.traceId;
info.span_id = ctx.spanId;
}
return info;
})(),
winston.format.json()
),
transports: [new winston.transports.Console()],
});
// Output: {"message":"Order processed",
//          "trace_id":"abc...","span_id":"def..."}
8. Set Collector resource limits
Configure memory_limiter to prevent OOMs, set resources.limits in Kubernetes, and monitor the Collector's own metrics.
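As a sketch, in the Operator CRD this means keeping limit_mib below the container's memory limit (values here are illustrative):

```yaml
# Inside the OpenTelemetryCollector spec
spec:
  resources:
    requests: { cpu: 200m, memory: 400Mi }
    limits: { memory: 512Mi }   # memory_limiter's limit_mib should stay below this
```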
Summary
OpenTelemetry is becoming the de facto standard for observability. Its vendor-neutral design lets you instrument once and export to any backend; unified context across the three signals makes debugging distributed systems far less painful; and the Collector's flexible pipelines make data processing straightforward. Start with auto-instrumentation for quick wins, then add manual instrumentation, tune sampling, and deploy a Collector on the way to a production-ready, full-stack observability platform.