DevOps connects development and operations through automation and shared responsibility. The core pillars: CI/CD pipelines (GitHub Actions, GitLab CI) for fast, safe delivery; Docker for portable containerization; Kubernetes + Helm for orchestration and GitOps; Terraform for infrastructure as code; Prometheus + Grafana for observability; and DevSecOps (SAST, secrets management, container scanning) to shift security left. DORA metrics (deployment frequency, lead time for changes, MTTR, and change failure rate) are the gold standard for measuring DevOps maturity.
- CI/CD pipelines are the backbone of DevOps — automate build, test, and deploy on every commit.
- Docker multi-stage builds shrink image sizes by 60–90% and eliminate build-time secrets from final images.
- Terraform state management and workspaces enable safe, repeatable infrastructure across environments.
- GitOps (ArgoCD/Flux) makes Kubernetes deployments auditable, reversible, and pull-based.
- Prometheus + Grafana is the open-source observability stack for metrics; OpenTelemetry unifies traces and logs.
- DORA metrics — deployment frequency, lead time, MTTR, change failure rate — quantify DevOps maturity.
- DevSecOps shifts security left: SAST, container scanning, and secrets detection run in the CI pipeline.
1. DevOps Culture and Principles
DevOps is a set of practices, cultural philosophies, and tools that increase an organization's ability to deliver applications and services at high velocity. The word is a portmanteau of development and operations, but DevOps is fundamentally about breaking organizational silos — the wall between the teams that write code and the teams that run it in production.
The five CALMS pillars define what DevOps looks like in practice:
- Culture: Shared ownership of the full software lifecycle. Developers care about production reliability; operations engineers collaborate on software design. Blameless postmortems replace blame culture.
- Automation: Automate everything repetitive — builds, tests, deployments, infrastructure provisioning, and compliance checks. If a human does it twice, automate it.
- Lean: Apply lean manufacturing principles — minimize work in progress (WIP), eliminate waste (toil, manual steps), and optimize for flow through the value stream.
- Measurement: You cannot improve what you do not measure. DORA metrics quantify delivery performance; SLOs and error budgets quantify reliability.
- Sharing: Knowledge sharing through documentation, runbooks, internal wikis, and post-incident reviews. What one team learns, the whole organization benefits from.
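The Measurement pillar is concrete enough to compute. An SLO implies an error budget, the amount of unreliability you are allowed to spend before halting risky changes; a minimal sketch (the 99.9%/30-day figures are illustrative):

```typescript
// Error budget: the downtime an SLO permits over its window.
// A 99.9% availability SLO over 30 days allows ~43 minutes down;
// incidents and risky deployments "spend" from this budget.
function errorBudgetMinutes(slo: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - slo);
}

function budgetRemaining(
  slo: number,
  windowDays: number,
  downtimeMinutes: number,
): number {
  return errorBudgetMinutes(slo, windowDays) - downtimeMinutes;
}

console.log(errorBudgetMinutes(0.999, 30).toFixed(1)); // ≈ 43.2 minutes
```

When the remaining budget hits zero, elite teams freeze feature deployments and spend engineering time on reliability instead.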
DORA Metrics: Measuring DevOps Maturity
The DevOps Research and Assessment (DORA) team identified four metrics that reliably predict organizational performance:
DORA Metrics — Performance Tiers
| Metric | Elite | High | Medium | Low |
|---|---|---|---|---|
| Deployment Frequency | Multiple per day | 1/day – 1/week | 1/week – 1/month | < 1/month |
| Lead Time for Changes | < 1 hour | 1 day – 1 week | 1 week – 1 month | > 6 months |
| Change Failure Rate | 0–15% | 0–15% | 16–30% | 16–30% |
| MTTR (recovery time) | < 1 hour | < 1 day | < 1 day | > 6 months |
Per the 2019 Accelerate State of DevOps Report, elite performers deploy 208x more frequently than low performers, have 106x shorter lead times, and recover from failures 2,604x faster.
Key DevOps Practices
- Continuous Integration (CI): Developers merge code to a shared branch multiple times per day. Automated builds and tests verify every change within minutes.
- Continuous Delivery (CD): Every passing build is a release candidate. Deployment to production is a business decision, not a technical one.
- Shift-Left Testing: Move testing earlier in the development lifecycle. Unit tests in the IDE, integration tests in CI, security scans before merge — not after deployment.
- Infrastructure as Code (IaC): Manage servers, networks, and databases through version-controlled configuration files, not manual console clicks.
- Observability: Instrument systems so you can ask arbitrary questions about their behavior in production using metrics, logs, and distributed traces.
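The DORA metrics introduced above can be computed directly from deployment records; a minimal sketch (the record shape and sample data are hypothetical; in practice the events come from your CD platform's API):

```typescript
// Hypothetical deployment record; real data would come from e.g.
// the GitHub Deployments API or your CD tool's audit log.
interface Deploy {
  at: Date;
  failed: boolean; // required a rollback or hotfix
}

// Deployment frequency: deploys per day over the window.
function deploymentsPerDay(deploys: Deploy[], windowDays: number): number {
  return deploys.length / windowDays;
}

// Change failure rate: fraction of deploys that caused a failure.
function changeFailureRate(deploys: Deploy[]): number {
  if (deploys.length === 0) return 0;
  return deploys.filter((d) => d.failed).length / deploys.length;
}

const deploys: Deploy[] = [
  { at: new Date("2026-01-01"), failed: false },
  { at: new Date("2026-01-02"), failed: true },
  { at: new Date("2026-01-03"), failed: false },
  { at: new Date("2026-01-04"), failed: false },
];

console.log(changeFailureRate(deploys)); // 0.25
console.log(deploymentsPerDay(deploys, 7)); // ≈ 0.57 per day
```

Lead time and MTTR are computed the same way, as deltas between commit/incident timestamps and deploy/recovery timestamps.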
2. Git Workflows for DevOps Teams
The git workflow a team adopts directly affects its deployment frequency and integration pain. Two dominant models exist: GitFlow and Trunk-Based Development.
Trunk-Based Development (TBD)
TBD is the workflow associated with high-performing teams. Developers integrate into main (the trunk) at least once per day via short-lived feature branches (ideally less than 2 days old). Feature flags hide incomplete functionality from users. TBD is a prerequisite for true CI — if code is not merged daily, it is not continuous integration.
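The feature-flag gating that makes daily trunk merges safe can be sketched in a few lines (the flag name and in-memory store here are illustrative; real teams use a flag service such as LaunchDarkly or Unleash, or config):

```typescript
// Minimal feature-flag gate: incomplete code merges to trunk daily
// but stays dark until the flag is flipped on.
const flags: Record<string, boolean> = {
  "payments.stripe-webhook": false, // merged, not yet live
};

function isEnabled(flag: string): boolean {
  return flags[flag] ?? false; // unknown flags default to off
}

function handleEvent(event: string): string {
  if (!isEnabled("payments.stripe-webhook")) {
    return "ignored"; // dark-launched path: code ships, stays off
  }
  return `processed:${event}`;
}
```

Flipping the flag in production is then a config change, decoupled from the deploy itself.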
# Trunk-Based Development: typical developer workflow
# 1. Start fresh from trunk
git checkout main && git pull origin main
# 2. Create a short-lived feature branch
git checkout -b feature/add-payment-webhook
# 3. Work in small, focused commits
git add src/webhooks/payment.ts
git commit -m "feat(webhook): add Stripe payment event handler"
# 4. Push and open a PR — CI runs immediately
git push origin feature/add-payment-webhook
# 5. After CI passes and review is done, squash-merge to main
# 6. Delete the feature branch (it lived < 48 hours)
git branch -d feature/add-payment-webhook
Conventional Commits
Conventional Commits is a specification for adding human and machine-readable meaning to commit messages. It enables automated changelog generation, semantic versioning, and filtered git log.
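Release tooling such as semantic-release applies a mapping along these lines to choose the next version; a simplified sketch (real tools also honor BREAKING CHANGE footers in the commit body):

```typescript
// Map a Conventional Commit header to a semver bump.
type Bump = "major" | "minor" | "patch" | "none";

function bumpFor(message: string): Bump {
  const header = message.split("\n")[0];
  // <type>(<scope>)?!?: <description>
  const match = /^(\w+)(\([^)]*\))?(!)?:/.exec(header);
  if (!match) return "none";
  const [, type, , bang] = match;
  if (bang) return "major"; // "feat!:" signals a breaking change
  if (type === "feat") return "minor";
  if (type === "fix" || type === "perf") return "patch";
  return "none"; // docs, chore, ci, style, test, ...
}
```

The highest bump across all commits since the last tag becomes the release: one `feat!` among fifty `fix` commits still means a major version.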
# Conventional Commits format:
# <type>(<scope>): <description>
# [optional body]
# [optional footer(s)]
# Types: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert
# Examples:
git commit -m "feat(auth): add OAuth2 login with GitHub provider"
git commit -m "fix(api): handle null response from payment gateway"
git commit -m "perf(query): add index on users.email column"
git commit -m "ci(github): add matrix build for Node 20 and 22"
git commit -m "feat!: remove deprecated v1 API endpoints"
# The ! signals a breaking change — triggers a major version bump
Branch Protection Rules
# GitHub branch protection configuration (via GitHub CLI or API)
# Enforce on: main, release/*
Branch Protection Rules for "main":
require_status_checks_to_pass: true
required_checks:
- ci / lint
- ci / unit-tests
- ci / build
- security / trivy-scan
require_pull_request_reviews:
required_approving_review_count: 1
dismiss_stale_reviews: true
require_code_owner_review: true
require_linear_history: true # No merge commits
require_signed_commits: false # Optional GPG signing
allow_force_pushes: false
allow_deletions: false
3. CI/CD with GitHub Actions
GitHub Actions is the most widely used CI/CD platform for open-source and cloud-native projects. Workflows are YAML files stored in .github/workflows/. For cron scheduling patterns, see our Cron Expression Examples guide.
Complete CI Pipeline with Caching and Matrix Builds
# .github/workflows/ci.yml
name: CI
on:
push:
branches: [main]
pull_request:
branches: [main]
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
# ── Stage 1: Code Quality (runs in parallel) ──────────────────────
lint:
name: Lint & Type Check
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '22'
cache: 'npm'
- run: npm ci
- run: npm run lint
- run: npx tsc --noEmit
- run: npx prettier --check .
# ── Stage 2: Test Matrix ──────────────────────────────────────────
test:
name: Test (Node ${{ matrix.node }})
runs-on: ubuntu-latest
needs: lint
strategy:
matrix:
node: ['20', '22']
fail-fast: false
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node }}
cache: 'npm'
- run: npm ci
- name: Run tests with coverage
run: npm run test -- --coverage --ci --reporters=junit
- uses: actions/upload-artifact@v4
with:
name: coverage-node-${{ matrix.node }}
path: coverage/
# ── Stage 3: Build Docker image ───────────────────────────────────
build:
name: Build & Push Image
runs-on: ubuntu-latest
needs: test
if: github.ref == 'refs/heads/main'
permissions:
contents: read
packages: write
outputs:
image-tag: ${{ steps.meta.outputs.tags }}
image-digest: ${{ steps.push.outputs.digest }}
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- name: Log in to GHCR
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=sha,prefix=sha-
type=raw,value=latest,enable=${{ github.ref == 'refs/heads/main' }}
- name: Build and push with layer caching
id: push
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
platforms: linux/amd64,linux/arm64
# ── Stage 4: Security Scanning ────────────────────────────────────
security:
name: Security Scan
runs-on: ubuntu-latest
needs: lint
steps:
- uses: actions/checkout@v4
- name: Trivy filesystem scan
uses: aquasecurity/trivy-action@master
with:
scan-type: 'fs'
scan-ref: '.'
severity: 'CRITICAL,HIGH'
format: 'sarif'
output: 'trivy-results.sarif'
- uses: github/codeql-action/upload-sarif@v3
with:
sarif_file: trivy-results.sarif
- name: Detect leaked secrets
uses: trufflesecurity/trufflehog@main
with:
extra_args: --only-verified
Reusable Workflows and Secrets
# .github/workflows/deploy.yml — CD pipeline using OIDC (no long-lived credentials)
name: Deploy
on:
workflow_run:
workflows: [CI]
types: [completed]
branches: [main]
jobs:
deploy-production:
if: ${{ github.event.workflow_run.conclusion == 'success' }}
runs-on: ubuntu-latest
environment:
name: production
url: https://example.com
permissions:
id-token: write # Required for OIDC
contents: read
steps:
- uses: actions/checkout@v4
# OIDC — no AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY needed
- name: Authenticate to AWS via OIDC
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789:role/github-deploy
aws-region: us-east-1
- name: Deploy to ECS
run: |
aws ecs update-service \
--cluster production \
--service myapp \
--force-new-deployment
- name: Wait for deployment to stabilize
run: |
aws ecs wait services-stable \
--cluster production \
--services myapp
4. Docker in DevOps
Docker is the foundation of modern CI/CD. Containers provide reproducible, isolated environments from development through production. See our Docker Multi-Stage Builds guide for deep-dive optimization techniques.
Production Multi-Stage Dockerfile
# Multi-stage build for a Node.js application
# Stage 1: Install all dependencies (including devDependencies)
FROM node:22-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
# Stage 2: Build the application
FROM node:22-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build
# Stage 3: Production runtime (smallest possible image)
FROM node:22-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
ENV PORT=3000
# Run as non-root user for security
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
# Copy only what is needed for production
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json ./package.json
USER nextjs
EXPOSE 3000
# Health check for container orchestrators
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
CMD wget -qO- http://localhost:3000/health || exit 1
CMD ["node", "dist/server.js"]
Docker Compose for Local Development
# docker-compose.yml — Full local development stack
version: '3.9'
services:
app:
build:
context: .
target: deps # Stop at deps stage for hot reload
volumes:
- .:/app
- /app/node_modules # Prevent host node_modules override
ports:
- "3000:3000"
environment:
- NODE_ENV=development
- DATABASE_URL=postgres://dev:dev@db:5432/myapp
- REDIS_URL=redis://redis:6379
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
command: npm run dev
db:
image: postgres:16-alpine
environment:
POSTGRES_USER: dev
POSTGRES_PASSWORD: dev
POSTGRES_DB: myapp
volumes:
- postgres-data:/var/lib/postgresql/data
- ./scripts/init.sql:/docker-entrypoint-initdb.d/init.sql
ports:
- "5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U dev -d myapp"]
interval: 10s
timeout: 5s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 5s
retries: 5
# Observability stack in development
prometheus:
image: prom/prometheus:latest
volumes:
- ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
ports:
- "3001:3000"
volumes:
- grafana-data:/var/lib/grafana
volumes:
postgres-data:
grafana-data:
Container Registry and Image Tagging Strategy
# Image tagging best practices
# Never use :latest in production — it is mutable and ambiguous
# Tag by git SHA (immutable, traceable)
docker build -t registry.example.com/myapp:sha-a1b2c3d .
# Tag by semantic version for releases
docker build -t registry.example.com/myapp:v2.4.1 .
# Multi-arch build for AMD64 and ARM64
docker buildx build \
--platform linux/amd64,linux/arm64 \
--tag registry.example.com/myapp:sha-a1b2c3d \
--push .
# Scan image for vulnerabilities before pushing
trivy image registry.example.com/myapp:sha-a1b2c3d \
--severity CRITICAL,HIGH \
--exit-code 1
5. Infrastructure as Code (IaC)
Infrastructure as Code eliminates configuration drift, enables disaster recovery, and makes infrastructure changes auditable through pull requests. Terraform is the dominant tool; Pulumi offers a general-purpose programming language alternative.
Terraform: VPC and ECS Cluster
# main.tf — AWS infrastructure for a containerized application
terraform {
required_version = ">= 1.7"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "my-terraform-state"
key = "production/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock" # State locking prevents conflicts
}
}
provider "aws" {
region = var.aws_region
}
# ── VPC ─────────────────────────────────────────────────────────────
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "5.5.2"
name = "${var.app_name}-${var.environment}"
cidr = "10.0.0.0/16"
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
public_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
private_subnets = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
enable_nat_gateway = true
single_nat_gateway = var.environment != "production" # Cost optimization
}
# ── ECS Cluster ─────────────────────────────────────────────────────
resource "aws_ecs_cluster" "main" {
name = "${var.app_name}-${var.environment}"
setting {
name = "containerInsights"
value = "enabled"
}
}
resource "aws_ecs_task_definition" "app" {
family = var.app_name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.task_cpu
memory = var.task_memory
execution_role_arn = aws_iam_role.ecs_execution.arn
task_role_arn = aws_iam_role.ecs_task.arn
container_definitions = jsonencode([{
name = var.app_name
image = "${var.ecr_repository_url}:${var.image_tag}"
portMappings = [{ containerPort = 3000, protocol = "tcp" }]
environment = [{ name = "NODE_ENV", value = var.environment }]
secrets = [
{ name = "DATABASE_URL", valueFrom = aws_ssm_parameter.db_url.arn }
]
logConfiguration = {
logDriver = "awslogs"
options = {
awslogs-group = "/ecs/${var.app_name}"
awslogs-region = var.aws_region
awslogs-stream-prefix = "ecs"
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}])
}
# ── variables.tf ─────────────────────────────────────────────────────
variable "environment" {
description = "Deployment environment"
type = string
validation {
condition = contains(["staging", "production"], var.environment)
error_message = "Environment must be staging or production."
}
}
Terraform Workspace Strategy
# Use workspaces for environment isolation
terraform workspace new staging
terraform workspace new production
# Deploy to staging
terraform workspace select staging
terraform plan -var="environment=staging" -out=staging.plan
terraform apply staging.plan
# Deploy to production (requires separate approval)
terraform workspace select production
terraform plan -var="environment=production" -out=prod.plan
terraform apply prod.plan
# Check for drift (what's changed outside Terraform)
terraform plan -refresh-only
Ansible for Configuration Management
# playbook.yml — Configure application servers
---
- name: Configure application servers
hosts: app_servers
become: true
vars:
app_version: "{{ lookup('env', 'APP_VERSION') }}"
node_version: "22"
tasks:
- name: Install Node.js via NVM
shell: |
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash
. ~/.nvm/nvm.sh && nvm install {{ node_version }}
args:
creates: "/root/.nvm/versions/node/v{{ node_version }}"
- name: Deploy application
git:
repo: "https://github.com/org/myapp.git"
dest: /app
version: "{{ app_version }}"
force: true
- name: Install dependencies
npm:
path: /app
ci: true
production: true
- name: Ensure app service is running
systemd:
name: myapp
state: restarted
enabled: true
daemon_reload: true
6. Monitoring and Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars are metrics (numeric time-series), logs (structured event records), and distributed traces (request flows across services). The fourth pillar — profiles — enables continuous CPU and memory profiling in production.
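Metrics ultimately reach Prometheus as plain text in its exposition format, which is simple enough to sketch by hand (production apps use a client library such as prom-client; this is purely illustrative):

```typescript
// Render one counter metric in the Prometheus text exposition
// format: a HELP/TYPE header, then one line per labeled sample.
function renderCounter(
  name: string,
  help: string,
  samples: { labels: Record<string, string>; value: number }[],
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) {
    const labels = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(",");
    lines.push(`${name}{${labels}} ${s.value}`);
  }
  return lines.join("\n");
}

console.log(
  renderCounter("http_requests_total", "Total HTTP requests", [
    { labels: { method: "GET", status: "200" }, value: 1027 },
  ]),
);
// # HELP http_requests_total Total HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="GET",status="200"} 1027
```

An app serves this text at `/metrics`, and the scrape configuration below tells Prometheus where to fetch it.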
Prometheus Configuration
# prometheus.yml — Scrape configuration
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
environment: production
region: us-east-1
rule_files:
- "alerts/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
scrape_configs:
- job_name: 'myapp'
static_configs:
- targets: ['myapp:3000']
metrics_path: /metrics
scrape_interval: 10s
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter:9187']
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
Alerting Rules
# alerts/application.yml — Prometheus alerting rules
groups:
- name: application
interval: 30s
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High HTTP error rate: {{ $value | humanizePercentage }}"
description: "Error rate above 5% for 2 minutes. Investigate immediately."
runbook: "https://runbooks.example.com/high-error-rate"
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "P99 latency above 2s for {{ $labels.handler }}"
- alert: ContainerOOMKilled
expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
for: 0m
labels:
severity: warning
annotations:
summary: "Container OOMKilled: {{ $labels.container }} in {{ $labels.pod }}"
description: "Increase memory limit or investigate memory leak."
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[15m]) > 3
for: 0m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} is crash-looping"
description: "Container {{ $labels.container }} restarted 3+ times in 15 minutes."
- alert: DiskSpaceRunningLow
expr: |
(node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
for: 10m
labels:
severity: warning
annotations:
summary: "Disk space below 15% on {{ $labels.instance }}"
description: "Filesystem {{ $labels.mountpoint }} is {{ $value | humanize }}% free."
Structured Logging and OpenTelemetry
// Structured logging with Pino (Node.js)
import pino from 'pino';
import { trace, context } from '@opentelemetry/api';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label }),
},
// Automatically inject trace context into every log line
mixin() {
const span = trace.getActiveSpan();
if (!span) return {};
const { traceId, spanId } = span.spanContext();
return { traceId, spanId };
},
timestamp: pino.stdTimeFunctions.isoTime,
});
// Usage: structured fields, not string interpolation
logger.info({ userId: '123', action: 'checkout', items: 3 }, 'Cart checkout initiated');
logger.error({ err, orderId: '456', retryCount: 2 }, 'Payment processing failed');
// OpenTelemetry tracing setup (SDK initialization)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'myapp',
'deployment.environment': process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTLP_ENDPOINT || 'http://otel-collector:4317',
}),
instrumentations: [
new HttpInstrumentation(),
new ExpressInstrumentation(),
new PgInstrumentation(), // Auto-instrument all DB queries
],
});
sdk.start();
7. Kubernetes in Production
Kubernetes is the industry-standard container orchestration platform. For a complete Kubernetes reference, see our Kubernetes Complete Guide. This section focuses on production patterns: Helm, GitOps, and horizontal autoscaling.
Helm Chart Structure
myapp/
├── Chart.yaml # Chart metadata (name, version, appVersion)
├── values.yaml # Default configuration values
├── values-staging.yaml # Staging overrides
├── values-production.yaml # Production overrides
└── templates/
├── _helpers.tpl # Named templates (labels, selectors)
├── deployment.yaml
├── service.yaml
├── ingress.yaml
├── hpa.yaml
├── configmap.yaml
├── secret.yaml # External-secrets or sealed-secrets
├── serviceaccount.yaml
└── NOTES.txt # Post-install instructions
# values.yaml — Default Helm chart values
replicaCount: 2
image:
repository: registry.example.com/myapp
tag: latest
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 80
targetPort: 3000
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/rate-limit: "100"
hosts:
- host: example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: example-com-tls
hosts:
- example.com
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
readinessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 3
GitOps with ArgoCD
# argocd-application.yaml — Deploy via GitOps
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp-production
namespace: argocd
annotations:
# Notify Slack on sync status changes
notifications.argoproj.io/subscribe.on-sync-failed.slack: devops-alerts
notifications.argoproj.io/subscribe.on-deployed.slack: deployments
spec:
project: default
source:
repoURL: https://github.com/org/myapp
targetRevision: main
path: helm/myapp
helm:
valueFiles:
- values-production.yaml
parameters:
- name: image.tag
value: sha-a1b2c3d # Set by CI pipeline after push
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true # Delete resources removed from Git
selfHeal: true # Revert manual kubectl changes automatically
syncOptions:
- CreateNamespace=true
- PruneLast=true
retry:
limit: 5
backoff:
duration: 5s
maxDuration: 3m
factor: 2
Horizontal Pod Autoscaler with Custom Metrics
# hpa.yaml — Scale on CPU and custom Prometheus metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: myapp
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: myapp
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: http_requests_per_second # Custom metric via Prometheus Adapter
target:
type: AverageValue
averageValue: 1000
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # Scale up quickly
policies:
- type: Pods
value: 4
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # Scale down slowly (5 minutes)
policies:
- type: Percent
value: 10
periodSeconds: 60
8. Security in DevOps (DevSecOps)
DevSecOps integrates security into every phase of the software development lifecycle. The principle is simple: security issues caught in a developer's IDE cost orders of magnitude less to fix than issues discovered in production. See our Docker Security Best Practices for container-specific hardening.
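At their core, secrets scanners match known credential shapes against text; a simplified sketch (the rules mirror well-known public patterns such as AWS access key IDs; real tools like Gitleaks and TruffleHog add entropy checks and, for TruffleHog, live credential verification):

```typescript
// Regex-rule core of a secrets scanner. Patterns are the widely
// documented shapes of each credential type.
const rules: { id: string; pattern: RegExp }[] = [
  { id: "aws-access-key-id", pattern: /\bAKIA[0-9A-Z]{16}\b/g },
  { id: "github-pat", pattern: /\bghp_[A-Za-z0-9]{36}\b/g },
  { id: "private-key", pattern: /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----/g },
];

function scan(text: string): { id: string; match: string }[] {
  const findings: { id: string; match: string }[] = [];
  for (const rule of rules) {
    for (const m of text.matchAll(rule.pattern)) {
      findings.push({ id: rule.id, match: m[0] });
    }
  }
  return findings;
}
```

Running rules like these as a pre-commit hook and in CI catches leaked credentials before they ever reach git history, which is far cheaper than rotating them after a leak.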
SAST and Dependency Scanning in CI
# .github/workflows/security.yml
name: Security
on:
push:
branches: [main]
pull_request:
schedule:
- cron: '0 6 * * 1' # Weekly full scan every Monday at 6am
jobs:
sast:
name: Static Analysis
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
# CodeQL — GitHub's SAST engine (free for public repos)
- uses: github/codeql-action/init@v3
with:
languages: javascript, typescript
queries: +security-extended
- uses: github/codeql-action/autobuild@v3
- uses: github/codeql-action/analyze@v3
with:
category: /language:javascript
# Semgrep — fast, rule-based SAST
- uses: returntocorp/semgrep-action@v1
with:
config: >-
p/typescript
p/nodejs
p/owasp-top-ten
dependency-scan:
name: Dependency Vulnerabilities
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
# Snyk for npm dependency CVEs
- name: Snyk dependency check
uses: snyk/actions/node@master
env:
SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
with:
args: --severity-threshold=high
# Check for known vulnerable packages
- name: npm audit
run: npm audit --audit-level=high
container-scan:
name: Container Image Scan
runs-on: ubuntu-latest
# No "needs" key — runs in parallel with the other jobs by default
steps:
- uses: actions/checkout@v4
- name: Build image for scanning
run: docker build -t scan-target:latest .
# Trivy — comprehensive container scanner
- name: Scan with Trivy
uses: aquasecurity/trivy-action@master
with:
image-ref: 'scan-target:latest'
format: 'sarif'
output: 'trivy-container.sarif'
severity: 'CRITICAL,HIGH'
exit-code: '1'
- uses: github/codeql-action/upload-sarif@v3
with:
sarif_file: trivy-container.sarif
iac-scan:
name: IaC Security Scan
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
# Checkov — Terraform/Kubernetes/Dockerfile security scanning
- uses: bridgecrewio/checkov-action@master
with:
directory: terraform/
framework: terraform
output_format: sarif
output_file_path: checkov.sarif
- uses: github/codeql-action/upload-sarif@v3
with:
sarif_file: checkov.sarif
Secrets Management with HashiCorp Vault
# HashiCorp Vault: dynamic database credentials
# Vault issues short-lived, auto-rotating Postgres credentials
# Configure the database secrets engine
vault secrets enable database
vault write database/config/myapp-postgres \
plugin_name=postgresql-database-plugin \
allowed_roles="myapp-role" \
connection_url="postgresql://{{username}}:{{password}}@postgres:5432/myapp" \
username="vault-root" \
password="vault-root-password"
# Define a role that creates read/write credentials
vault write database/roles/myapp-role \
db_name=myapp-postgres \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  revocation_statements="DROP ROLE IF EXISTS \"{{name}}\";" \
default_ttl="1h" \
max_ttl="24h"
# Application reads dynamic credentials at startup (rotated automatically)
vault read database/creds/myapp-role
# Key Value
# lease_id database/creds/myapp-role/abc123
# lease_duration 1h0m0s
# lease_renewable true
# password A1a-8k9j2... (temporary, auto-expires)
# username v-myapp-role-xyz123 (temporary)
9. CI/CD Platform Comparison
Choosing the right CI/CD platform depends on your hosting model, language ecosystem, and existing tool investments. Here is how the four major platforms compare:
| Feature | GitHub Actions | GitLab CI/CD | Jenkins | CircleCI |
|---|---|---|---|---|
| Hosting | Cloud (GitHub.com) or self-hosted runners | Cloud or fully self-hosted | Self-hosted only | Cloud or self-hosted |
| Config Format | YAML (.github/workflows/) | YAML (.gitlab-ci.yml) | Groovy DSL (Jenkinsfile) | YAML (.circleci/config.yml) |
| Free Tier | 2,000 min/month (public: unlimited) | 400 min/month (SaaS) | Free (self-host costs) | 6,000 min/month |
| Ecosystem | 21,000+ Actions on Marketplace | Built-in templates, CI Components | 1,800+ plugins | Orbs (reusable packages) |
| Docker Support | Native, BuildKit, QEMU | Docker-in-Docker (DinD) built-in | Plugin required | Native Docker layer caching |
| Secrets Management | Org/repo/environment secrets + OIDC | Protected CI variables + Vault integration | Credentials plugin + Vault | Contexts, environment variables |
| Matrix Builds | Native strategy.matrix | parallel keyword | Matrix plugin | Native matrix |
| Reusable Pipelines | Reusable workflows, composite actions | include, extends, CI components | Shared libraries | Orbs |
| Best For | GitHub-hosted projects, open source | Self-hosted, regulated industries | Legacy enterprise, max customization | Optimized Docker build speed |
| Weaknesses | Vendor lock-in to GitHub | Complex YAML for large pipelines | High maintenance overhead | Smaller ecosystem than Actions |
10. DevOps Toolchain Overview
A mature DevOps toolchain covers each phase of the software delivery lifecycle. The following represents the industry-standard toolkit in 2026:
DevOps Toolchain by Phase
Phase Tools (most popular first)
----------- ----------------------------------------------
Source Control Git, GitHub, GitLab, Bitbucket
CI/CD GitHub Actions, GitLab CI, Jenkins, CircleCI, Buildkite
Containerization Docker, Podman, containerd
Orchestration Kubernetes, Docker Swarm, HashiCorp Nomad
Package Management Helm, Kustomize, ArgoCD
IaC Terraform, Pulumi, Ansible, AWS CDK, Crossplane
Secrets Management HashiCorp Vault, AWS Secrets Manager, Doppler, Infisical
Artifact Registry GitHub GHCR, AWS ECR, Google Artifact Registry, JFrog Artifactory
Metrics Prometheus, Grafana, Datadog, New Relic, VictoriaMetrics
Logging Loki, ELK Stack, Datadog, Splunk, CloudWatch
Tracing Jaeger, Tempo, Honeycomb, Datadog APM
Alerting PagerDuty, OpsGenie, Grafana Alertmanager, Signoz
SAST CodeQL, Semgrep, SonarQube, Snyk Code
Container Scanning Trivy, Grype, Snyk Container, Anchore
SCA / Deps Dependabot, Snyk, OWASP Dependency-Check
IaC Security Checkov, tfsec, Terrascan, Regula
Incident Management PagerDuty, Incident.io, Grafana Incident
Feature Flags LaunchDarkly, Unleash, Flagsmith, OpenFeature
Frequently Asked Questions
What is the difference between DevOps and SRE?
DevOps is a cultural and organizational philosophy focused on breaking silos between development and operations teams through automation, shared ownership, and continuous delivery. SRE (Site Reliability Engineering), coined by Google, is a specific implementation of DevOps principles using software engineering to solve operations problems. SREs define SLOs, manage error budgets, and own reliability at scale. DevOps is the "what and why"; SRE is one "how".
What are DORA metrics and why do they matter?
DORA metrics — Deployment Frequency, Lead Time for Changes, Mean Time to Restore (MTTR), and Change Failure Rate — are the most evidence-based way to measure software delivery performance. Elite performers deploy 208x more frequently than low performers and recover from failures 2,604x faster. They provide objective benchmarks for DevOps maturity conversations with leadership and guide where to invest improvement efforts.
What is Infrastructure as Code and should I use Terraform or Pulumi?
IaC means managing cloud resources through version-controlled configuration files. Terraform (HCL language) has the largest provider ecosystem and community. Pulumi uses TypeScript, Python, or Go — better for teams wanting loops, conditionals, and type safety. Choose Terraform for maximum community resources; choose Pulumi if your team prefers general-purpose programming languages.
What is GitOps and how does ArgoCD implement it?
GitOps makes Git the single source of truth for both application code and infrastructure configuration. ArgoCD continuously reconciles the actual Kubernetes cluster state to match the desired state in Git. All changes go through pull requests, providing a full audit trail and easy rollback (just revert a commit).
Trunk-Based Development vs GitFlow: which should I use?
Use Trunk-Based Development if you want high deployment frequency — it is the workflow of elite performers. Developers integrate into main at least once per day via short-lived branches; feature flags hide incomplete work. Use GitFlow only if your software has infrequent, versioned releases (firmware, mobile apps with app store cycles).
How do I manage secrets in a DevOps pipeline securely?
Never store secrets in Git. Use: CI/CD platform secrets (GitHub Actions Secrets, GitLab CI Variables) for simplicity; HashiCorp Vault for dynamic secrets with auto-rotation; cloud-native solutions (AWS Secrets Manager, GCP Secret Manager) for IAM-integrated access; and OIDC authentication to eliminate long-lived credentials from pipelines entirely.
What is DevSecOps and which tools implement it?
DevSecOps integrates security into every SDLC phase. Key tools: CodeQL and Semgrep for SAST; Trivy and Grype for container image scanning; Snyk and Dependabot for dependency CVEs; Checkov and tfsec for IaC security; Gitleaks and TruffleHog for secrets detection in git history. The goal: catch security issues during development, not in production.
What is the role of a DevOps engineer vs a platform engineer?
A DevOps engineer embeds within product teams, owns CI/CD, and bridges dev and ops. A Platform Engineer builds Internal Developer Platforms (IDPs) — the tools, pipelines, and abstractions that every development team uses. Platform engineering is DevOps at scale: instead of each team building their own pipeline, the platform team builds it once and everyone benefits.
Conclusion
DevOps is not a tool or a job title — it is a continuous journey of improving how software is delivered and operated. Start with the fundamentals: version control everything, automate your build and test pipeline, containerize your applications, and measure your DORA metrics to understand where you stand today. From there, incrementally add sophistication: adopt IaC for your infrastructure, implement GitOps for Kubernetes deployments, build out your observability stack, and shift security left with DevSecOps practices.
The compounding returns of DevOps investment are extraordinary — elite teams deploy hundreds of times per day with lower change failure rates than teams deploying monthly. The journey starts with a single automated pipeline.
Explore these related guides: our Docker Compose Tutorial, GitHub Actions CI/CD Guide, and Kubernetes Complete Guide. Use our Cron Expression Generator for scheduling CI jobs and our JSON Formatter for working with API responses and configuration files.