The Complete Guide to Prometheus: Monitoring and Alerting for Modern Infrastructure

22 min read · By DevToolBox Team
TL;DR

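Prometheus is an open-source monitoring and alerting toolkit that uses a pull model to collect time series data from /metrics endpoints. It ships with the PromQL query language, a multidimensional data model, and native alerting. Pair it with Alertmanager for alert routing, Grafana for dashboards, and Thanos or Cortex for long-term storage. Prometheus is the de facto standard for Kubernetes monitoring.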

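Key Takeaways
  • Prometheus uses a pull model: it actively scrapes metrics from each target's /metrics endpoint
  • There are four metric types: Counter, Gauge, Histogram, and Summary
  • PromQL is a functional query language for selecting and aggregating time series in real time
  • Alerting has two parts: Prometheus defines the rules, and Alertmanager handles routing and notification
  • A rich exporter ecosystem covers databases, hardware, message queues, and more
  • Thanos and Cortex address long-term storage and a global query view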

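What Is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. In 2016 it became the second project, after Kubernetes, to join the Cloud Native Computing Foundation (CNCF), and it graduated in 2018. It uses a multidimensional data model in which each time series is identified by a metric name and a set of key-value labels.

Its core features are a pull-based HTTP collection model, the PromQL query language, a local time series database with no dependency on distributed storage, target discovery through service discovery or static configuration, multiple graphing and dashboarding modes, and built-in support for alerting rules.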
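To make the pull model concrete, here is a minimal sketch using only the Python standard library. It starts a tiny target that exposes /metrics, then plays the role of the scraper by fetching it once over HTTP. The metric name demo_requests_total, its value, and the function scrape_once are assumptions made for illustration; a real Prometheus server repeats this fetch on every scrape interval and stores the samples.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state: one counter value to expose.
REQUESTS_SERVED = 3


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values in the text exposition format."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP demo_requests_total Requests served by the demo target\n"
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUESTS_SERVED}\n"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the example quiet.


def scrape_once():
    """Start a target on a free local port, pull /metrics once, then stop it."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # Ignore any proxy settings so the request goes straight to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with opener.open(url, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(scrape_once())
```

The target never needs to know where the monitoring server lives; it only has to answer HTTP requests, which is the main operational appeal of pulling.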

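Architecture and Components

The Prometheus ecosystem consists of several components, most of them optional. Understanding how they fit together is the basis for operating Prometheus effectively.

| Component | Responsibility |
| --- | --- |
| Prometheus Server | Scrapes and stores time series data |
| Alertmanager | Deduplicates, groups, routes, and sends alert notifications |
| Pushgateway | Lets short-lived jobs push their metrics |
| Exporters | Translate third-party system metrics into the Prometheus format |
| Client Libraries | Instrument application code and expose metrics |
| Service Discovery | Discovers scrape targets automatically |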

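Installing Prometheus

Installing with Docker

Docker is the fastest way to get started. Mount the configuration file and a data volume so that both persist across container restarts.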

# Pull and run Prometheus with Docker
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest

# Verify it is running
curl http://localhost:9090/-/healthy

Installing from the Binary Release

# Download Prometheus binary (Linux amd64)
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xvfz prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64

# Start Prometheus
./prometheus --config.file=prometheus.yml

# Prepare a dedicated user and directories for a production systemd service
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo cp prometheus.yml /etc/prometheus/

Full Stack with Docker Compose

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command: ["--config.file=/etc/prometheus/prometheus.yml",
              "--storage.tsdb.retention.time=30d"]
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
volumes:
  prometheus-data:

Configuring prometheus.yml

prometheus.yml is the main configuration file. It defines global settings, scrape configurations, rule file paths, and the Alertmanager address.

# prometheus.yml - complete example
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Timeout per scrape request

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert-rules.yml"
  - "recording-rules.yml"

scrape_configs:
  # Monitor Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Monitor node exporter
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
    scrape_interval: 10s

  # Monitor an application over HTTPS with a static label
  - job_name: "webapp"
    metrics_path: "/metrics"
    scheme: "https"
    static_configs:
      - targets: ["app1:8080", "app2:8080"]
        labels:
          env: "production"

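Metric Types

Prometheus defines four core metric types, each suited to a different kind of measurement. Choosing the right type is essential for effective monitoring.

| Type | Behavior | Example |
| --- | --- | --- |
| Counter | Monotonically increasing; resets only on restart | http_requests_total |
| Gauge | A value that can go up or down | node_memory_available_bytes |
| Histogram | Counts observations into configurable buckets | http_request_duration_seconds |
| Summary | Computes quantiles over a sliding time window | rpc_duration_seconds |

The following sample /metrics output shows a counter, a gauge, and a histogram.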

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="201"} 56

# HELP node_memory_available_bytes Available memory in bytes
# TYPE node_memory_available_bytes gauge
node_memory_available_bytes 4.294967296e+09

# HELP http_request_duration_seconds Request duration histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 2400
http_request_duration_seconds_bucket{le="0.1"} 2650
http_request_duration_seconds_bucket{le="0.5"} 2800
http_request_duration_seconds_bucket{le="+Inf"} 2834
http_request_duration_seconds_sum 150.72
http_request_duration_seconds_count 2834
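The cumulative buckets in this output are the raw material for quantile estimates. As a rough sketch of the idea behind PromQL's histogram_quantile(), the code below finds the bucket that contains the requested rank and interpolates linearly inside it. The function name estimate_quantile is an assumption, and the sketch works on the raw cumulative counts shown above rather than on per-second bucket rates, which is what you would normally feed to the real function.

```python
import math

# Cumulative bucket counts copied from the exposition example above:
# (upper bound "le", observations less than or equal to that bound).
BUCKETS = [(0.05, 2400), (0.1, 2650), (0.5, 2800), (math.inf, 2834)]


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative buckets sorted by upper bound."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is that bucket's upper bound.
                return lower_bound
            if count == lower_count:
                return upper_bound  # Empty bucket: nothing to interpolate.
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    print(estimate_quantile(0.95, BUCKETS))
```

For the sample data, the 95th percentile rank is 0.95 × 2834 = 2692.3, which falls in the (0.1, 0.5] bucket, so the estimate is 0.1 + 0.4 × (2692.3 − 2650) / 150 ≈ 0.213 seconds. The accuracy depends entirely on how well the bucket boundaries bracket the latencies you care about.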

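PromQL Basics

PromQL is Prometheus's functional query language for selecting and aggregating time series data in real time. Dashboards and alerting rules are both written in it.

Selectors and Matchers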

# Instant vector - select all time series for a metric
http_requests_total

# Label matching - exact match
http_requests_total{method="GET"}

# Regex matching
http_requests_total{status=~"5.."}

# Negative matching
http_requests_total{method!="DELETE"}

# Range vector - select 5 minutes of data
http_requests_total{method="GET"}[5m]

# Offset - query data from 1 hour ago
http_requests_total offset 1h

Common Functions

# rate() - per-second average rate of increase (for counters)
rate(http_requests_total[5m])

# irate() - instant rate based on last two data points
irate(http_requests_total[5m])

# increase() - total increase over a range
increase(http_requests_total[1h])

# histogram_quantile() - calculate percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# predict_linear() - predict value N seconds from now
predict_linear(node_filesystem_avail_bytes[6h], 24*3600)

# delta() - difference between first and last value
delta(process_resident_memory_bytes[1h])
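To show why rate() is safe to use on counters that restart, here is a simplified sketch of the calculation. It sums the increases between consecutive samples, treats any drop in value as a counter reset, and divides by the elapsed time. The function name simple_rate and the sample data are assumptions, and the real rate() additionally extrapolates to the edges of the range window, which this sketch leaves out.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset, so the value observed
    after the drop counts as new increase since the restart.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # Counter restarted from zero.
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # Hypothetical samples scraped every 15 seconds; the process restarts
    # between the second and third scrape.
    samples = [(0, 100), (15, 130), (30, 10), (45, 40)]
    print(simple_rate(samples))
```

With these samples the increases are 30, 10 (after the reset), and 30, giving 70 over 45 seconds. Applying delta() to the same data would instead report a misleading negative change, which is why counters call for rate() or increase().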

Aggregation Operators

# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum by specific label
sum by (method) (rate(http_requests_total[5m]))

# Average per instance (across CPUs and modes)
avg by (instance) (node_cpu_seconds_total)

# Top 5 by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))

# Count of targets with >80% CPU
count(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80)
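The by clause in these aggregations decides which labels survive: series that agree on the listed labels are merged, and every other label is dropped. The sketch below imitates sum by (...) over a small in-memory list of series. The function name sum_by and the sample rates are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical per-second request rates, one entry per time series.
RATES = [
    ({"method": "GET", "instance": "app1:8080"}, 12.0),
    ({"method": "GET", "instance": "app2:8080"}, 8.0),
    ({"method": "POST", "instance": "app1:8080"}, 1.5),
]


def sum_by(series, labels):
    """Group series by the given label names and add up their values."""
    totals = defaultdict(float)
    for series_labels, value in series:
        # Only the grouping labels form the identity of the output series.
        key = tuple((name, series_labels.get(name, "")) for name in labels)
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    print(sum_by(RATES, ["method"]))  # like: sum by (method) (...)
    print(sum_by(RATES, []))          # like: sum(...)
```

Grouping by no labels at all collapses everything into a single series, which matches what a bare sum(...) does.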

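Recording Rules

Recording rules precompute frequently used or expensive PromQL expressions and store the results as new time series. This speeds up dashboard queries and simplifies alerting rules.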

# recording-rules.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      # Request rate per service
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # 95th percentile latency
      - record: job:http_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

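Alerting Rules and Alertmanager

Alerting in Prometheus happens in two stages. The Prometheus server evaluates alerting rules and sends firing alerts to Alertmanager, which then deduplicates, groups, silences, inhibits, and routes them to the right receivers.

Alerting Rule Examples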

# alert-rules.yml
groups:
  - name: critical_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 min."

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
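The for clause in these rules turns each alert into a small state machine: an alert whose expression becomes true is first pending, and it only starts firing once the expression has stayed true for the whole duration. The sketch below replays a sequence of evaluation results to show the transitions. The function name alert_states and the input sequence are assumptions for illustration.

```python
def alert_states(condition_results, eval_interval, hold_duration):
    """Return the alert state after each evaluation of the rule expression.

    condition_results: booleans, one per evaluation, True when the expression
    returned a result. eval_interval and hold_duration are in seconds.
    """
    states = []
    active_since = None
    for step, condition_true in enumerate(condition_results):
        now = step * eval_interval
        if not condition_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_long_enough = now - active_since >= hold_duration
        states.append("firing" if held_long_enough else "pending")
    return states


if __name__ == "__main__":
    # An InstanceDown-style rule with "for: 2m", evaluated once a minute.
    results = [False, True, True, True, False, True]
    print(alert_states(results, eval_interval=60, hold_duration=120))
```

In this run the alert is pending for two evaluations, fires on the third, and drops back to inactive as soon as the condition clears, so a brief blip never reaches Alertmanager.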

Configuring Alertmanager

# alertmanager.yml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default-email"
  routes:
    - matchers: ['severity="critical"']
      receiver: "pagerduty-critical"
    - matchers: ['severity="warning"']
      receiver: "slack-warnings"

receivers:
  - name: "default-email"
    email_configs:
      - to: "team@example.com"
  - name: "slack-warnings"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XXXX"
        channel: "#alerts"
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-key"

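Service Discovery

Prometheus supports many service discovery mechanisms that find scrape targets automatically, so you do not have to maintain static configuration by hand.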

scrape_configs:
  # File-based service discovery
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
        refresh_interval: 30s

  # Consul service discovery
  - job_name: "consul"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        services: ["webapp", "api"]

  # DNS-based discovery
  - job_name: "dns"
    dns_sd_configs:
      - names: ["_prometheus._tcp.example.com"]
        type: SRV
        refresh_interval: 30s

  # EC2 discovery
  - job_name: "ec2"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: env

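Exporters

Exporters translate metrics from third-party systems into the Prometheus format. These are some of the most commonly used ones.

| Exporter | Default port | Purpose |
| --- | --- | --- |
| node_exporter | 9100 | Linux hardware and operating system metrics |
| blackbox_exporter | 9115 | HTTP/TCP/ICMP/DNS probing |
| mysqld_exporter | 9104 | MySQL server metrics |
| postgres_exporter | 9187 | PostgreSQL server metrics |
| redis_exporter | 9121 | Redis server metrics |
| nginx-exporter | 9113 | Nginx connection and request metrics |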

Deploying node_exporter

# Run node_exporter with Docker
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

# Verify metrics endpoint
curl http://localhost:9100/metrics | head -20

Configuring blackbox_exporter

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      follow_redirects: true

# prometheus.yml - scrape config for blackbox
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
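The three relabeling steps above are easier to follow when traced on a single target. The sketch below applies them in order to a label set and then builds the URL that ends up being scraped. The function names relabel_for_blackbox and probe_url are assumptions for illustration; they imitate the effect of this specific configuration and are not a general relabeling engine.

```python
from urllib.parse import urlencode


def relabel_for_blackbox(labels, exporter_address="blackbox-exporter:9115"):
    """Apply the three relabel steps from the scrape config, in order."""
    labels = dict(labels)
    # 1. Copy the original target into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Keep the probed URL as the human-readable instance label.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the exporter instead of the target.
    labels["__address__"] = exporter_address
    return labels


def probe_url(labels, module="http_2xx"):
    """Build the request that is sent to the exporter for this target."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


if __name__ == "__main__":
    relabeled = relabel_for_blackbox({"__address__": "https://example.com"})
    print(relabeled["instance"])
    print(probe_url(relabeled))
```

The net effect is that Prometheus always talks to the exporter, while the instance label still tells you which site was probed.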

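Instrumenting Applications

Prometheus client libraries let you define and expose custom metrics in application code. Go and Python have official libraries, and the Node.js example uses the community-maintained prom-client package.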

Go

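package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequests counts handled requests, partitioned by method and status.
var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total HTTP requests.",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

func main() {
    // Count each request to the root handler before responding.
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        httpRequests.WithLabelValues(r.Method, "200").Inc()
        w.Write([]byte("ok"))
    })
    // Expose all registered metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}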

Python

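# pip install prometheus-client
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter("myapp_requests_total", "Total requests", ["method", "endpoint"])
REQUEST_LATENCY = Histogram(
    "myapp_request_duration_seconds", "Request latency",
    ["endpoint"], buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

def process_request():
    """Placeholder for the application's real work (hypothetical)."""
    time.sleep(0.05)

def handle_request(method, endpoint):
    # Count the request, then time the work with the histogram.
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        process_request()

if __name__ == "__main__":
    start_http_server(8000)  # Expose metrics on :8000/metrics
    while True:  # Simulated traffic keeps the process and the endpoint alive
        handle_request("GET", "/")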

Node.js

// npm install prom-client express
const client = require("prom-client");
const express = require("express");
const app = express();

client.collectDefaultMetrics();

const httpRequests = new client.Counter({
  name: "myapp_http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});

app.use((req, res, next) => {
  res.on("finish", () => {
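    // Use the matched route pattern, not the raw path, to keep label cardinality low
    const route = req.route ? req.route.path : "unmatched";
    httpRequests.inc({ method: req.method, route, status: res.statusCode });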
  });
  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);

Grafana Integration

Grafana is the most widely used visualization tool for Prometheus. Once Prometheus is added as a data source, you can build dashboards with PromQL.

# Grafana data source provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

Recommended community dashboards include Node Exporter Full (ID: 1860), Prometheus Stats (ID: 2), and Kubernetes Cluster (ID: 6417). The following PromQL queries are common in dashboard panels.

# CPU Usage per Instance (percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk I/O (reads per second)
rate(node_disk_reads_completed_total[5m])

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total{device!="lo"}[5m])

# HTTP request rate by status code
sum by (status) (rate(http_requests_total[5m]))

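Federation

Federation lets one Prometheus server scrape selected time series from another. It is useful for multi-datacenter deployments and for aggregating metrics hierarchically.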

# Global Prometheus scraping from datacenter instances
scrape_configs:
  - job_name: "federate-dc1"
    scrape_interval: 30s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'
        - '{job="webapp"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-dc1.example.com:9090"]
        labels:
          datacenter: "dc1"

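Long-Term Storage: Thanos and Cortex

Prometheus local storage is suited to short-term retention, typically 15 to 30 days. For long-term storage and a global query view, Thanos and Cortex are two widely used options.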
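A quick capacity estimate helps decide whether local retention is enough. The Prometheus storage documentation gives the rule of thumb needed_disk_space = retention_time_seconds × ingested_samples_per_second × bytes_per_sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies it; the function name needed_disk_bytes and the ingestion rate of 100,000 samples per second are assumptions for illustration.

```python
SECONDS_PER_DAY = 86400


def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate local TSDB disk usage for a given retention period."""
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical workload: 100,000 samples/s kept for 15 and for 365 days.
    for days in (15, 365):
        gigabytes = needed_disk_bytes(days, 100_000) / 1e9
        print(f"{days} days: about {gigabytes:.0f} GB")
```

At the assumed rate, 15 days needs about 259 GB, while a full year would need over 6 TB on a single node, which is where object-storage-backed systems become attractive.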

Thanos Architecture

# Thanos Sidecar - runs alongside Prometheus
docker run -d --name thanos-sidecar \
  quay.io/thanos/thanos:latest sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://prometheus:9090 \
  --objstore.config-file=/etc/thanos/bucket.yml

# bucket.yml - S3 object storage
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.amazonaws.com"

# Thanos Querier - global query view
# (--endpoint supersedes the older --store flag in recent Thanos releases)
docker run -d --name thanos-querier \
  quay.io/thanos/thanos:latest query \
  --endpoint=thanos-sidecar-dc1:10901 \
  --endpoint=thanos-sidecar-dc2:10901

Remote Write to Cortex

# prometheus.yml - remote write to Cortex
remote_write:
  - url: http://cortex-distributor:9009/api/v1/push
    queue_config:
      max_shards: 30
      max_samples_per_send: 1000
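
| Feature | Thanos | Cortex |
| --- | --- | --- |
| Data ingestion | Sidecar uploads TSDB blocks | Receives samples via remote_write |
| Multi-tenancy | Limited support | Native support |
| Deployment complexity | Simpler; attaches to an existing Prometheus | More complex; runs as a separate service |
| Downsampling | Built in | Relies on external tooling |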

Kubernetes Monitoring

Prometheus is the de facto standard for Kubernetes monitoring. The kube-prometheus-stack Helm chart provides a complete monitoring setup out of the box.

# Install kube-prometheus-stack with Helm
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=admin

The ServiceMonitor Custom Resource

# ServiceMonitor for auto-discovering app metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
      - staging

Kubernetes Service Discovery

# prometheus.yml - Kubernetes SD (without Operator)
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port from annotation, keeping the pod's host part
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Use custom path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
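When a port annotation is used to override the scrape port, a common relabeling pattern joins __address__ and the annotation value with a semicolon, then uses the regex ([^:]+)(?::\d+)?;(\d+) with the replacement $1:$2 to keep the host and swap in the new port. The sketch below reproduces that rewrite so you can check its behavior on sample values. The function name rewrite_address and the addresses are assumptions for illustration, and IPv6 addresses are not handled.

```python
import re

# Host, an optional existing port, a semicolon separator, then the new port.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Return host:port using the annotated port, or the address unchanged."""
    # Prometheus joins the source label values with ";" before matching.
    joined = f"{address};{port_annotation}"
    # Relabel regexes are fully anchored, hence fullmatch.
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        return address  # A replace action that does not match changes nothing.
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.0.0.5:8080", "9102"))
    print(rewrite_address("10.0.0.5:8080", ""))
```

Using only the annotation as the source label would overwrite __address__ with the bare port number, so the host has to be carried through the regex as shown.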

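Best Practices

The following practices help you avoid common pitfalls when operating Prometheus.

  • Follow metric naming conventions: use snake_case and include a unit or type suffix (such as _seconds, _bytes, _total)
  • Limit label cardinality: avoid high-cardinality labels such as user IDs or request IDs
  • Use recording rules to precompute frequent queries instead of evaluating them on every request
  • Set a reasonable for duration on alerts to avoid noise from flapping
  • Monitor Prometheus itself: up, prometheus_tsdb_head_series, prometheus_tsdb_compaction_duration_seconds
  • Use relabel_configs and metric_relabel_configs to filter targets and drop or rewrite labels at scrape time, reducing storage overhead
  • Choose an appropriate retention policy: the default is 15 days; adjust it to your storage capacity and query needs
  • For queries that span multiple Prometheus instances, use federation or Thanos/Cortex

Metric Naming Examples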

# Good metric names
http_requests_total              # counter with _total suffix
http_request_duration_seconds    # histogram with unit suffix
node_memory_available_bytes      # gauge with unit suffix
process_open_fds                 # gauge, no unit needed

# Bad metric names - avoid these
httpRequests                     # no camelCase
request_duration                 # missing unit suffix
http_request_duration_ms         # use base units (seconds not ms)
requests{user_id="12345"}        # high-cardinality label
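Conventions like these are easy to check mechanically, for example in a code review hook. The sketch below flags the naming problems listed above. The function name lint_metric_name and the list of discouraged suffixes are assumptions for illustration and cover only a small part of the official naming guidance; label cardinality cannot be judged from the name alone and is not checked.

```python
import re

SNAKE_CASE = re.compile(r"[a-z][a-z0-9_]*")
# A few suffixes that suggest a non-base unit (illustrative, not exhaustive).
NON_BASE_UNIT_SUFFIXES = ("_ms", "_milliseconds", "_kb", "_mb", "_minutes")


def lint_metric_name(name, metric_type):
    """Return a list of naming problems; an empty list means no findings."""
    problems = []
    if not SNAKE_CASE.fullmatch(name):
        problems.append("use lowercase snake_case")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counters should end in _total")
    if name.endswith(NON_BASE_UNIT_SUFFIXES):
        problems.append("use base units such as seconds or bytes")
    return problems


if __name__ == "__main__":
    for name, metric_type in [
        ("http_requests_total", "counter"),
        ("httpRequests", "counter"),
        ("http_request_duration_ms", "histogram"),
    ]:
        print(name, lint_metric_name(name, metric_type))
```

A missing unit, as in request_duration, is harder to detect automatically because some metrics, such as process_open_fds, legitimately have none.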

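Prometheus Compared with Other Solutions

| Feature | Prometheus | Datadog | InfluxDB | VictoriaMetrics |
| --- | --- | --- | --- | --- |
| Type | Open source, self-hosted | Commercial SaaS | Open source / commercial | Open source, self-hosted |
| Data model | Pull-based, multidimensional labels | Push-based, tags + hosts | Push-based, measurements + tags | Prometheus-compatible |
| Query language | PromQL | Proprietary | InfluxQL / Flux | MetricsQL (PromQL superset) |
| Long-term storage | Requires Thanos/Cortex | Built in | Built in | Built in, high compression |
| Cost | Free (operational cost only) | Billed per host/metric | Free open-source edition / paid enterprise edition | Free (operational cost only) |
| Kubernetes integration | Native, de facto standard | Via Agent | Via Telegraf | Compatible with the Prometheus ecosystem |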

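Summary

Prometheus is a cornerstone of modern infrastructure monitoring. Its pull model, multidimensional data model, and PromQL make it a leading choice for cloud-native environments. From a single-node Docker deployment to large Kubernetes clusters, it provides dependable metric collection and alerting. Combined with Grafana for dashboards, Alertmanager for alert routing, and Thanos or Cortex for long-term storage, it forms a complete, production-ready monitoring platform. Whether you are new to monitoring or scaling an existing setup, Prometheus is worth the investment.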
