A DevOps Pipeline Guide: CI/CD, GitHub Actions, Docker, Infrastructure as Code, and Deployment Strategies
A comprehensive DevOps pipeline guide covering CI/CD fundamentals, GitHub Actions workflows, GitLab CI/CD, Docker multi-stage builds, infrastructure as code with Terraform and Pulumi, blue-green and canary deployments, secrets management with Vault, GitOps with ArgoCD and Flux, SAST/DAST pipeline security, and monitoring strategies.
CI/CD Fundamentals
Continuous integration (CI) and continuous delivery/deployment (CD) form the backbone of modern DevOps pipelines. CI ensures every code change is automatically built and tested, catching defects early. CD goes further, automatically deploying verified changes to staging or production environments.
The Build, Test, Deploy Pipeline
A standard pipeline has three core stages. The build stage compiles code, resolves dependencies, and produces artifacts. The test stage runs unit tests, integration tests, and linting. The deploy stage pushes artifacts to the target environment. Each stage acts as a quality gate: if any stage fails, the pipeline stops and the team is notified.
# Typical CI/CD Pipeline Stages
#
# 1. Source - Code commit triggers pipeline
# 2. Build - Compile code, install dependencies
# 3. Test - Unit tests, integration tests, linting
# 4. Security - SAST, dependency scanning
# 5. Package - Build Docker image, push to registry
# 6. Deploy - Deploy to staging environment
# 7. Verify - Smoke tests, health checks
# 8. Promote - Deploy to production (manual gate or auto)
# 9. Monitor - Track metrics, error rates, performance
# Each stage acts as a quality gate:
# If tests fail -> pipeline stops, team notified
# If scan finds critical CVE -> pipeline blocks merge
# If health check fails -> automatic rollback
GitHub Actions Workflows
GitHub Actions is the CI/CD platform built into GitHub. Workflows are defined in YAML files under the .github/workflows/ directory and are triggered by events such as push, pull_request, or schedule. Key features include matrix builds for testing across multiple operating systems and language versions, caching to speed up builds, encrypted secrets management, and reusable workflows for DRY pipeline definitions.
# .github/workflows/ci.yml
name: CI Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [18, 20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: "npm"
      - run: npm ci
      - run: npm test
      - run: npm run lint
  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
GitLab CI/CD
GitLab CI/CD uses a .gitlab-ci.yml file in the repository root. It offers a built-in container registry, zero-configuration Auto DevOps pipelines, environment management with review apps, and DAG (directed acyclic graph) pipelines for complex dependency management. GitLab Runners execute pipeline jobs and can be shared or project-specific.
# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy
variables:
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
test:
  stage: test
  image: node:20-alpine
  cache:
    key: $CI_COMMIT_REF_SLUG
    paths:
      - node_modules/
  script:
    - npm ci
    - npm test
    - npm run lint
  artifacts:
    reports:
      junit: test-results.xml
build:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # --password-stdin avoids leaking the password in process listings
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t $DOCKER_IMAGE .
    - docker push $DOCKER_IMAGE
deploy_staging:
  stage: deploy
  environment:
    name: staging
    url: https://staging.example.com
  script:
    - kubectl set image deployment/app app=$DOCKER_IMAGE
  only:
    - main
Docker Multi-Stage Builds
Multi-stage builds use multiple FROM statements in a single Dockerfile. The build stage contains all development dependencies and compilers; the final stage copies only the compiled artifacts into a minimal base image. This drastically reduces image size and attack surface. Always use specific version tags in production, never latest.
# Dockerfile — Multi-stage build
# Stage 1: Build
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev && \
    cp -R node_modules /prod_modules && \
    npm ci
COPY . .
RUN npm run build
# Stage 2: Production
FROM node:20-alpine AS production
WORKDIR /app
RUN addgroup -g 1001 appgroup && \
    adduser -u 1001 -G appgroup -s /bin/sh -D appuser
COPY --from=builder /prod_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package.json ./
USER appuser
EXPOSE 3000
HEALTHCHECK --interval=30s --timeout=3s \
    CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
# Result: ~150MB instead of ~1.2GB
# No dev dependencies, no source code, non-root user
Container Registries
Container registries store and distribute Docker images. Options include Docker Hub (public and private repositories), GitHub Container Registry (ghcr.io, integrated with GitHub Actions), Amazon ECR (integrated with AWS services), Google Artifact Registry, and Azure Container Registry. Choose based on your cloud provider and access-control needs. Always scan images for vulnerabilities before pushing to a production registry.
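Whichever registry you pick, prefer immutable references over mutable tags like latest. A minimal sketch of one common tagging scheme (the helper and the version-plus-SHA format are illustrative assumptions, not part of any registry SDK):

```python
# Build immutable image references: version plus short Git SHA,
# so every push is uniquely addressable and rollbacks are unambiguous.

def image_ref(registry: str, repo: str, git_sha: str, version: str = "") -> str:
    """Return a fully qualified image reference, e.g. ghcr.io/org/app:1.4.2-3f9d2c1."""
    short_sha = git_sha[:7]
    tag = f"{version}-{short_sha}" if version else short_sha
    return f"{registry}/{repo}:{tag}"

print(image_ref("ghcr.io", "org/app", "3f9d2c1a8b", "1.4.2"))
# ghcr.io/org/app:1.4.2-3f9d2c1
print(image_ref("123456789012.dkr.ecr.us-east-1.amazonaws.com", "app", "3f9d2c1a8b"))
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/app:3f9d2c1
```

The same reference string can then be passed to docker build, docker push, and the deployment manifest, so all three always agree on exactly which image is running.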
Terraform Fundamentals
HashiCorp's Terraform is the most widely adopted infrastructure-as-code tool. It defines infrastructure declaratively in HCL (HashiCorp Configuration Language). The core workflow is terraform init (initialize providers), terraform plan (preview changes), and terraform apply (execute changes). State is stored in a backend (S3, Azure Blob, GCS) to track resource mappings. Use modules for reusable infrastructure components and workspaces for environment isolation.
# main.tf — Terraform AWS ECS Service
terraform {
  required_version = ">= 1.7"
  backend "s3" {
    bucket = "myapp-terraform-state"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}
provider "aws" {
  region = var.aws_region
}
resource "aws_ecs_cluster" "main" {
  name = "${var.project}-${var.environment}"
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
resource "aws_ecs_service" "app" {
  name            = "${var.project}-svc"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.app_count
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = var.private_subnets
    security_groups = [aws_security_group.ecs.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = var.project
    container_port   = 3000
  }
}
# terraform init  -> download providers
# terraform plan  -> preview changes
# terraform apply -> execute changes
Pulumi: Infrastructure as Code with Programming Languages
Pulumi takes a different approach to IaC, using real programming languages (TypeScript, Python, Go, C#) instead of a domain-specific language. This gives you loops, conditionals, type checking, IDE support, and testing frameworks. Pulumi's state management is similar to Terraform's, and it supports all major cloud providers. It is especially attractive to teams that prefer writing infrastructure code in their application language.
// index.ts — Pulumi AWS ECS with TypeScript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";
const config = new pulumi.Config();
const environment = config.require("environment");
const desiredCount = config.getNumber("desiredCount") || 2;
// Create a VPC with best-practice defaults
const vpc = new awsx.ec2.Vpc("app-vpc", {
  numberOfAvailabilityZones: 2,
  natGateways: { strategy: "Single" },
});
// Create an ECS cluster
const cluster = new aws.ecs.Cluster("app-cluster", {
  settings: [{
    name: "containerInsights",
    value: "enabled",
  }],
});
// Create an ECR repository, then build and publish the Docker image to it
const repo = new awsx.ecr.Repository("app-repo", { forceDelete: true });
const image = new awsx.ecr.Image("app-image", {
  repositoryUrl: repo.url,
  context: "./app",
  platform: "linux/amd64",
});
// Create an ALB and a Fargate service behind it
const lb = new awsx.lb.ApplicationLoadBalancer("app-lb", {
  defaultTargetGroup: { port: 3000 },
});
const service = new awsx.ecs.FargateService("app-svc", {
  cluster: cluster.arn,
  desiredCount: desiredCount,
  networkConfiguration: {
    subnets: vpc.privateSubnetIds,
  },
  taskDefinitionArgs: {
    container: {
      name: "app",
      image: image.imageUri,
      cpu: 256,
      memory: 512,
      portMappings: [{ containerPort: 3000, targetGroup: lb.defaultTargetGroup }],
    },
  },
});
export const url = lb.loadBalancer.dnsName;
Deployment Strategies: Blue-Green and Canary
Blue-green deployment maintains two identical production environments. Blue is the current live version; green is the new one. After deploying and testing green, traffic is switched from blue to green. Rollback is instant: just switch back to blue. Canary deployment gradually routes a small share of traffic (say 5%) to the new version, monitors metrics, and increases the share if everything looks healthy. Rolling deployment updates instances one at a time, maintaining availability throughout.
# Blue-Green Deployment with Kubernetes
# Step 1: Deploy green (new version) alongside blue (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:2.0.0
          ports:
            - containerPort: 3000
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
---
# Step 2: Switch service selector to green
apiVersion: v1
kind: Service
metadata:
  name: app-service
spec:
  selector:
    app: myapp
    version: green # Change from "blue" to "green"
  ports:
    - port: 80
      targetPort: 3000
Environment Management
Proper environment management requires at least three tiers: development, staging, and production. Staging should mirror production as closely as possible in configuration, data shape, and infrastructure. Use environment-specific configuration files, feature flags for progressive rollouts, and database migration strategies that are compatible across environments. Never share secrets between environments.
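The layering idea can be sketched in a few lines (a hypothetical illustration; real projects typically use a configuration library or the deployment platform's own mechanism, and the keys below are invented):

```python
# Layered configuration: a shared base plus a per-environment override.
# Secrets are intentionally absent; they come from a secrets manager at runtime.

BASE = {"log_level": "info", "db_pool_size": 5, "feature_new_checkout": False}

OVERRIDES = {
    "development": {"log_level": "debug"},
    "staging": {"feature_new_checkout": True},  # flag goes on in staging first
    "production": {"db_pool_size": 20},
}

def load_config(environment: str) -> dict:
    """Merge the base config with the override for the given environment."""
    if environment not in OVERRIDES:
        raise ValueError(f"unknown environment: {environment}")
    return {**BASE, **OVERRIDES[environment]}

print(load_config("staging"))
# {'log_level': 'info', 'db_pool_size': 5, 'feature_new_checkout': True}
```

Because every environment inherits the same base, staging and production differ only in the explicitly listed overrides, which keeps drift visible in code review.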
Secrets Management
Never store secrets in code, in environment variables on disk, or in pipeline configuration. Use a dedicated secrets-management tool. HashiCorp Vault provides dynamic secrets, encryption as a service, and fine-grained access policies. AWS Secrets Manager integrates with IAM for access control and supports automatic rotation. For Kubernetes, use the External Secrets Operator to sync secrets from Vault or a cloud provider into Kubernetes Secrets.
# Vault integration in GitHub Actions
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Import Secrets from Vault
        uses: hashicorp/vault-action@v3
        with:
          url: https://vault.example.com
          method: jwt
          role: github-deploy
          secrets: |
            secret/data/prod/db DB_PASSWORD ;
            secret/data/prod/api API_KEY
      - name: Deploy with secrets
        run: |
          echo "Deploying with injected secrets..."
          # Secrets are available as env vars
          # DB_PASSWORD and API_KEY injected by vault-action
          helm upgrade app ./chart \
            --set db.password=$DB_PASSWORD \
            --set api.key=$API_KEY
# External Secrets Operator for Kubernetes
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: app-secrets
  data:
    - secretKey: db-password
      remoteRef:
        key: secret/data/prod/db
        property: password
Monitoring the Pipeline
Pipeline monitoring goes beyond application monitoring. Track build-duration trends, test flakiness rates, deployment frequency, lead time for changes, mean time to recovery (MTTR), and change failure rate. These are the DORA metrics for measuring DevOps performance. Use Grafana dashboards, Datadog CI Visibility, or the built-in analytics in GitHub Actions/GitLab to visualize pipeline health.
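As a back-of-the-envelope sketch of the four DORA calculations (the record format below is invented for illustration; in practice the raw events come from your CI system's API):

```python
from datetime import datetime

# Each record: (deploy time, commit time of the change, failed?, minutes to restore)
deploys = [
    (datetime(2024, 5, 1, 10), datetime(2024, 4, 30, 16), False, 0),
    (datetime(2024, 5, 2, 11), datetime(2024, 5, 1, 9),  True,  45),
    (datetime(2024, 5, 3, 9),  datetime(2024, 5, 2, 17), False, 0),
    (datetime(2024, 5, 4, 15), datetime(2024, 5, 4, 10), False, 0),
]
days_observed = 4

deployment_frequency = len(deploys) / days_observed  # deploys per day
lead_times = [(dep - commit).total_seconds() / 3600 for dep, commit, _, _ in deploys]
avg_lead_time_hours = sum(lead_times) / len(lead_times)
failures = [r for r in deploys if r[2]]
change_failure_rate = len(failures) / len(deploys)
mttr_minutes = sum(r[3] for r in failures) / len(failures)

print(f"freq={deployment_frequency}/day  lead={avg_lead_time_hours:.1f}h  "
      f"cfr={change_failure_rate:.0%}  mttr={mttr_minutes:.0f}min")
```

For this sample the script reports one deploy per day, a 25% change failure rate, and a 45-minute MTTR; the same arithmetic scales to real deployment histories.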
Rollback Strategies
Every deployment needs a rollback plan. Strategies include: reverting the Git commit and re-running the pipeline, redeploying the previous version by container image tag, kubectl rollout undo for instant Kubernetes rollbacks, database migration rollback scripts (always write backward-compatible migrations), and feature flags to disable a broken feature without redeploying. Automate rollbacks based on failed health checks and error-rate thresholds.
# Kubernetes rollback commands
# View rollout history
kubectl rollout history deployment/app
# Rollback to previous version
kubectl rollout undo deployment/app
# Rollback to specific revision
kubectl rollout undo deployment/app --to-revision=3
# Check rollout status
kubectl rollout status deployment/app
# -------------------------------------------
# Automated rollback with health checks
# -------------------------------------------
#!/bin/bash
# deploy.sh
set -e
DEPLOY_NAME="app"
NAMESPACE="production"
TIMEOUT="300s"
# Apply new deployment
kubectl apply -f deployment.yaml -n $NAMESPACE
# Wait for rollout with timeout
if ! kubectl rollout status deployment/$DEPLOY_NAME \
    -n $NAMESPACE --timeout=$TIMEOUT; then
  echo "Rollout failed! Initiating rollback..."
  kubectl rollout undo deployment/$DEPLOY_NAME -n $NAMESPACE
  kubectl rollout status deployment/$DEPLOY_NAME -n $NAMESPACE
  echo "Rollback complete."
  exit 1
fi
echo "Deployment successful!"
GitOps with ArgoCD and Flux
GitOps uses a Git repository as the single source of truth for declarative infrastructure and applications. ArgoCD and Flux are Kubernetes-native GitOps operators that continuously reconcile the desired state in Git with the actual state in the cluster. Changes go through pull requests, not direct kubectl apply. This provides an audit trail, easy rollbacks (git revert), and consistent environments. ArgoCD offers a web UI for visualization, while Flux is lighter-weight and more composable.
# ArgoCD Application manifest
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/k8s-manifests.git
    targetRevision: main
    path: environments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true    # Delete resources removed from Git
      selfHeal: true # Revert manual cluster changes
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
# Flux Kustomization
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: myapp
  namespace: flux-system
spec:
  interval: 5m
  path: ./environments/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: k8s-manifests
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: myapp
      namespace: production
Pipeline Security: SAST, DAST, and Dependency Scanning
Shift security left by integrating it into the pipeline. SAST (static application security testing) analyzes source code for vulnerabilities without running it; tools include Semgrep, SonarQube, and CodeQL. DAST (dynamic application security testing) probes a running application for vulnerabilities such as XSS and SQL injection; tools include OWASP ZAP and Burp Suite. Dependency scanning checks third-party packages for known CVEs; tools include Snyk, Dependabot, and Trivy for container images. Run all three in every pipeline.
# Security scanning in GitHub Actions
name: Security Pipeline
on: [push, pull_request]
jobs:
  sast:
    name: Static Analysis (SAST)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Semgrep
        uses: returntocorp/semgrep-action@v1
        with:
          config: >-
            p/security-audit
            p/owasp-top-ten
            p/nodejs
  dependency-scan:
    name: Dependency Scanning
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high
  container-scan:
    name: Container Image Scan
    runs-on: ubuntu-latest
    needs: [sast]
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        run: docker build -t myapp:scan .
      - name: Run Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:scan
          format: table
          exit-code: 1
          severity: CRITICAL,HIGH
  dast:
    name: Dynamic Analysis (DAST)
    runs-on: ubuntu-latest
    needs: [container-scan]
    steps:
      - uses: actions/checkout@v4
      - name: Build image
        # Each job runs on a fresh runner, so the image must be rebuilt here
        run: docker build -t myapp:scan .
      - name: Deploy to test environment
        run: |
          docker run -d -p 3000:3000 myapp:scan
          sleep 10
      - name: OWASP ZAP Baseline Scan
        uses: zaproxy/action-baseline@v0.12.0
        with:
          target: http://localhost:3000