End-to-End Monitoring of Spring Boot 3.x Microservices with Prometheus + Grafana: A Hands-On Guide

Administrator
2026-03-28

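1. Why Do Microservices Need End-to-End Monitoring?

In a microservice architecture, a single request may pass through ten or more services, and a performance hiccup in any one of them can trigger cascading failures. Troubleshooting from logs alone often means reading through hundreds of log lines by hand, which is very slow.

The Three Pillars of Observability

Pillar | Representative tools | Problem addressed
Metrics | Prometheus + Grafana | Real-time performance dashboards, trend analysis, alerting
Logs | ELK / Loki | Root-cause tracing, business auditing
Traces | Zipkin / Jaeger | Latency analysis of cross-service calls

This article focuses on the Metrics pillar and walks step by step through building a Prometheus + Grafana monitoring stack, covering:

  • Exposing metrics from Spring Boot 3.x with Micrometer
  • Prometheus scrape configuration
  • Building Grafana dashboards
  • Alert rules (AlertManager)
  • One-command deployment with Docker Compose
  • ServiceMonitor integration on K8s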

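2. Tech Stack and Versions

Component | Version
Spring Boot | 3.2.x
Micrometer | 1.12.x (managed by Boot 3.2.x)
Prometheus | 2.51.x
Grafana | 10.4.x
AlertManager | 0.27.x
Docker Compose | v2.x

Note: Spring Boot 3.x manages the Micrometer version through its dependency management, so you do not declare a Micrometer version yourself. You still need spring-boot-starter-actuator and micrometer-registry-prometheus (section 3.1); there is no need to add spring-boot-actuator-autoconfigure separately, because the starter pulls it in.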


3. Integrating Spring Boot 3.x with Prometheus

3.1 Add dependencies

<!-- pom.xml -->
<dependencies>
    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>

    <!-- Micrometer Prometheus Registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>

    <!-- Web starter: with Spring MVC on the classpath, Actuator records HTTP request metrics automatically -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
</dependencies>

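3.2 Configure Actuator endpoints

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: "*"           # in production expose only what you need, e.g. health,info,prometheus
      base-path: /actuator
  endpoint:
    health:
      show-details: always     # consider when-authorized in production
    prometheus:
      enabled: true
  prometheus:
    metrics:
      export:
        enabled: true          # Boot 3.x key; management.metrics.export.prometheus.* is the Boot 2.x form
  metrics:
    tags:
      application: ${spring.application.name}   # every metric is tagged with the application name
      env: ${APP_ENV:dev}                        # environment tag (dev/test/prod)
    distribution:
      percentiles-histogram:
        http.server.requests: true               # emit _bucket series, needed by histogram_quantile() in sections 5 and 6

Once configured, open http://localhost:8080/actuator/prometheus and you should see output in the following format:

# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{application="order-service",area="heap",env="prod",...} 1.34217728E8

# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_count{application="order-service",exception="None",method="GET",...} 1024
http_server_requests_seconds_sum{...} 5.12

With percentiles-histogram enabled, the endpoint also emits http_server_requests_seconds_bucket{...,le="..."} series. Without that setting the type is reported as summary and no _bucket series exist.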
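
The _count and _sum pair in the output above is already enough to derive a mean latency. As a quick worked example, the sketch below divides the two sample values shown (5.12 seconds over 1024 requests). The class and method names are illustrative only, and the result is an average over the whole process lifetime, not over a recent time window.

```java
class AverageLatency {

    /** Mean request duration in seconds: total observed time divided by the request count. */
    static double meanSeconds(double sumSeconds, long count) {
        if (count == 0) {
            return 0.0; // no requests observed yet (a choice made for this sketch)
        }
        return sumSeconds / count;
    }

    public static void main(String[] args) {
        // Sample values taken from the endpoint output: _sum = 5.12, _count = 1024
        double mean = meanSeconds(5.12, 1024);
        System.out.println("mean latency in ms: " + mean * 1000);
    }
}
```

The result is about 5 ms per request. In PromQL the windowed equivalent is rate(http_server_requests_seconds_sum[5m]) / rate(http_server_requests_seconds_count[5m]).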

3.3 Custom business metrics

Beyond the default JVM and HTTP metrics, you can instrument business metrics as well:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

// OrderDTO is the application's own request type; its import is omitted here.
@Service
public class OrderService {

    private final Counter orderCreatedCounter;
    private final Counter orderFailedCounter;
    private final Timer orderProcessTimer;

    public OrderService(MeterRegistry registry) {
        // Counter of created orders. No env tag is set here: management.metrics.tags
        // already adds application/env to every meter, so hard-coding "prod" would be
        // redundant and risks mislabeling dev/test instances.
        this.orderCreatedCounter = Counter.builder("order.created.total")
                .description("Total number of created orders")
                .register(registry);

        // Counter of failed orders
        this.orderFailedCounter = Counter.builder("order.failed.total")
                .description("Total number of failed orders")
                .register(registry);

        // Order processing time (exported to Prometheus with a _seconds suffix)
        this.orderProcessTimer = Timer.builder("order.process.duration")
                .description("Order processing duration")
                .register(registry);
    }

    public void createOrder(OrderDTO dto) {
        // record() times the lambda and still records the sample if it throws
        orderProcessTimer.record(() -> {
            try {
                // Business logic...
                doCreateOrder(dto);
                orderCreatedCounter.increment();
            } catch (Exception e) {
                orderFailedCounter.increment();
                throw e;
            }
        });
    }

    private void doCreateOrder(OrderDTO dto) {
        // Actual business processing
    }
}

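Common Micrometer meter types

Type | Purpose | Examples
Counter | Monotonically increasing count | Total requests, total orders
Gauge | Instantaneous value | Thread count, queue length, cache size
Timer | Duration plus count | API response time
DistributionSummary | Distribution of values | Request body size, payment amount distribution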

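4. One-Command Deployment of the Monitoring Stack with Docker Compose

4.1 Directory layout

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml          # main Prometheus configuration
│   ├── rules/
│   │   └── springboot.yml      # alert rules
│   └── targets/                # file_sd target files (*.json), used in section 4.3
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml  # datasource provisioning
│       └── dashboards/
│           └── dashboard.yml   # dashboard auto-loading configuration
└── alertmanager/
    └── alertmanager.yml        # AlertManager configuration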

4.2 docker-compose.yml

version: "3.9"

networks:
  monitoring:
    driver: bridge

services:
  # ─── Prometheus ───────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.51.2
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
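    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - ./prometheus/targets:/etc/prometheus/targets:ro   # file_sd target files referenced in prometheus.yml
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"     # keep 15 days of data
      - "--web.enable-lifecycle"                 # allow hot reload via POST /-/reload
      - "--web.enable-admin-api"                 # exposes delete/snapshot APIs; drop this flag unless you need them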
    networks:
      - monitoring

  # ─── Grafana ──────────────────────────────────────────────
  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      # Demo credentials only: change the password (or inject it from a secret) before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=Admin@123456
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-clock-panel
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - monitoring

  # ─── AlertManager ─────────────────────────────────────────
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    networks:
      - monitoring

  # ─── Node Exporter (host metrics) ─────────────────────────
  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"                  # matches the /:/rootfs mount above
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

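4.3 Prometheus scrape configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s          # scrape every 15 seconds
  evaluation_interval: 15s      # evaluate alert rules every 15 seconds
  external_labels:
    cluster: "prod-cluster"
    region: "cn-shanghai"

# Alert rule files
rule_files:
  - "/etc/prometheus/rules/*.yml"

# AlertManager address
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Scrape jobs
scrape_configs:
  # ── Prometheus self-monitoring ──
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # ── Node Exporter host metrics ──
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  # ── Spring Boot microservices (static configuration) ──
  # The host names below must be resolvable from the Prometheus container,
  # for example services attached to the same Docker network.
  - job_name: "springboot-services"
    metrics_path: "/actuator/prometheus"
    scrape_interval: 10s
    static_configs:
      - targets:
          - "order-service:8080"
          - "user-service:8081"
          - "payment-service:8082"
          - "gateway:8888"
        labels:
          # Target labels win over same-named labels in the scraped data (honor_labels
          # defaults to false); the application's own env tag is then kept as exported_env.
          env: "prod"
          region: "cn-shanghai"

  # ── Spring Boot microservices (file-based service discovery) ──
  # Suited to setups where service instances come and go. The directory
  # /etc/prometheus/targets must be mounted into the Prometheus container.
  - job_name: "springboot-file-sd"
    metrics_path: "/actuator/prometheus"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
        refresh_interval: 30s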
5. Alert Rules

5.1 Spring Boot alert rules

# prometheus/rules/springboot.yml
groups:
  - name: springboot_alerts
    rules:
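      # ── JVM heap usage above 85% ──────────────────────────
      - alert: JvmHeapMemoryHigh
        # Summed per instance: individual memory pools may report max = -1,
        # which makes a per-pool ratio meaningless
        expr: |
          (
            sum by (application, instance) (jvm_memory_used_bytes{area="heap"})
          /
            sum by (application, instance) (jvm_memory_max_bytes{area="heap"})
          ) * 100 > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage is high ({{ $labels.application }})"
          description: |
            Application {{ $labels.application }}, instance {{ $labels.instance }}:
            heap usage is {{ $value | printf "%.1f" }}%, above the 85% threshold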

      # ── GC pauses too long ─────────────────────────────────
      - alert: JvmGcPauseTooLong
        expr: |
          rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average GC pause is too long ({{ $labels.application }})"
          description: "Average GC pause over the last 5 minutes is {{ $value | printf \"%.3f\" }}s, above 500ms"

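      # ── HTTP error rate above 5% ────────────────────────────
      - alert: HttpErrorRateHigh
        # Both sides are aggregated to the same labels. Dividing the raw series would
        # match each 5xx series only with itself and always yield 100%.
        expr: |
          (
            sum by (application, uri) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          /
            sum by (application, uri) (rate(http_server_requests_seconds_count[5m]))
          ) * 100 > 5
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate is high ({{ $labels.application }})"
          description: |
            Application {{ $labels.application }}, endpoint {{ $labels.uri }}:
            5xx error rate is {{ $value | printf "%.2f" }}%, above the 5% threshold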

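      # ── Endpoint P99 latency above 2s ───────────────────────
      - alert: HttpLatencyHigh
        # Needs histogram buckets from the application:
        # management.metrics.distribution.percentiles-histogram.http.server.requests=true
        expr: |
          histogram_quantile(0.99,
            sum by (le, application, uri) (rate(http_server_requests_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoint P99 latency is high ({{ $labels.application }})"
          description: "Endpoint {{ $labels.uri }} P99 latency is {{ $value | printf \"%.2f\" }}s"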

      # ── Service down ────────────────────────────────────────
      - alert: ServiceDown
        expr: up{job="springboot-services"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down ({{ $labels.instance }})"
          description: "Instance {{ $labels.instance }} has been unreachable for 1 minute"

      # ── Thread pool queue nearly full ──────────────────────
      - alert: ThreadPoolQueueFull
        expr: executor_queue_remaining_tasks < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Thread pool queue is almost exhausted ({{ $labels.application }})"
          description: "Thread pool {{ $labels.name }} has only {{ $value }} queue slots left"
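
To make the HttpErrorRateHigh expression more concrete, here is a rough sketch of the arithmetic behind it: a per-second rate taken from two counter samples, then the ratio of the error rate to the total rate. The class, method names, and sample numbers are invented for illustration. Real rate() also handles counter resets and extrapolates to the window edges, which this sketch does not.

```java
class ErrorRate {

    /** Per-second increase between two counter samples (no counter-reset handling). */
    static double perSecondRate(double earlier, double later, double seconds) {
        return (later - earlier) / seconds;
    }

    /** Error share in percent; returns 0 when there was no traffic (a choice made for this sketch). */
    static double errorPercent(double errorRate, double totalRate) {
        return totalRate == 0 ? 0.0 : errorRate / totalRate * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical samples taken 300 seconds (5 minutes) apart
        double total = perSecondRate(1000, 4000, 300);   // all requests
        double errors = perSecondRate(10, 190, 300);     // 5xx responses only
        double percent = errorPercent(errors, total);
        System.out.println("5xx share in percent: " + percent);
        System.out.println("alert condition met: " + (percent > 5));
    }
}
```

With these numbers the service handles 10 requests per second, 0.6 of them failing, so the error share is about 6% and the 5% threshold is exceeded.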

5.2 AlertManager configuration (DingTalk / WeCom notifications)

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.163.com:465"
  smtp_from: "alert@yourcompany.com"
  smtp_auth_username: "alert@yourcompany.com"
  smtp_auth_password: "your-smtp-password"

route:
  group_by: ["alertname", "application", "env"]
  group_wait: 30s        # wait 30s so alerts of a new group are batched into one notification
  group_interval: 5m     # wait 5m before notifying about new alerts added to an existing group
  repeat_interval: 4h    # re-send a still-firing alert every 4h
  receiver: "default"
  routes:
    - matchers:
        - severity="critical"
      receiver: "critical-receiver"
      continue: true
    - matchers:
        - severity="warning"
      receiver: "warning-receiver"

receivers:
  - name: "default"
    email_configs:
      - to: "ops@yourcompany.com"

  # The webhook URLs below assume a DingTalk webhook bridge reachable as dingtalk-webhook:8060.
  # It is not part of the docker-compose.yml above and must be deployed separately.
  # For WeCom, use a similar bridge or AlertManager's wechat_configs.
  - name: "critical-receiver"
    webhook_configs:
      - url: "http://dingtalk-webhook:8060/dingtalk/webhook1/send"
        send_resolved: true

  - name: "warning-receiver"
    webhook_configs:
      - url: "http://dingtalk-webhook:8060/dingtalk/webhook2/send"
        send_resolved: true

inhibit_rules:
  # While a critical alert is firing, suppress warning alerts for the same application and instance
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["application", "instance"]

6. Building Grafana Dashboards

6.1 Datasource provisioning

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: "15s"
      queryTimeout: "60s"

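6.2 Recommended Grafana Dashboard IDs to import

Dashboard | ID | Purpose
Spring Boot 3.x Statistics | 19004 | Overview of JVM, HTTP, and database connection pool
JVM Micrometer | 4701 | Detailed JVM metrics (heap / non-heap / GC / threads)
Spring Boot Actuator | 12685 | Actuator endpoint monitoring
Node Exporter Full | 1860 | Host CPU / memory / disk / network
Kubernetes Cluster | 7249 | K8s cluster resource overview

Import steps: Grafana → Dashboards → Import → enter the Dashboard ID → Load → select the Prometheus datasource → Import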

6.3 Core PromQL query reference

# ── JVM heap usage (%) ───────────────────────────────────
sum by (application, instance) (jvm_memory_used_bytes{area="heap"})
/ sum by (application, instance) (jvm_memory_max_bytes{area="heap"}) * 100

# ── HTTP request QPS (grouped by endpoint) ──────────────
sum by (uri, method) (
  rate(http_server_requests_seconds_count[1m])
)

# ── HTTP P95 latency (use 0.99 for P99; needs histogram buckets enabled) ──
histogram_quantile(0.95, sum by (le, uri) (
  rate(http_server_requests_seconds_bucket[5m])
))

# ── HTTP 5xx error rate (%) ──────────────────────────────
(
  sum by (application)(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
/
  sum by (application)(rate(http_server_requests_seconds_count[5m]))
) * 100

# ── Live thread count ────────────────────────────────────
jvm_threads_live_threads

# ── GC frequency (collections per minute) ───────────────
rate(jvm_gc_pause_seconds_count[1m]) * 60

# ── Database connection pool usage (%) ───────────────────
hikaricp_connections_active / hikaricp_connections_max * 100

# ── Order success rate (custom business metric, %) ──────
(
  rate(order_created_total[5m])
/
  (rate(order_created_total[5m]) + rate(order_failed_total[5m]))
) * 100
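
The P95/P99 queries above rely on histogram_quantile(), which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reproduces that idea in plain Java under simplifying assumptions: the bucket bounds and counts are invented sample data, the last bucket is +Inf, and edge cases that Prometheus handles (negative bounds, invalid q, missing buckets) are ignored.

```java
class HistogramQuantile {

    /**
     * Estimates the q-quantile (0 < q <= 1) from cumulative bucket counts.
     * upperBounds must be ascending and end with +Infinity; cumulativeCounts[i]
     * is the number of observations less than or equal to upperBounds[i].
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++; // find the first bucket whose cumulative count reaches the rank
        }
        if (b == n - 1) {
            return upperBounds[n - 2]; // rank falls in the +Inf bucket: report the highest finite bound
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double countBelow = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double inBucket = cumulativeCounts[b] - countBelow;
        // Linear interpolation between the bucket's lower and upper bound
        return lower + (upperBounds[b] - lower) * (rank - countBelow) / inBucket;
    }

    public static void main(String[] args) {
        // Hypothetical buckets: le = 0.1s, 0.5s, 1s, 2s, +Inf
        double[] le = {0.1, 0.5, 1.0, 2.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 80, 90, 99, 100};
        System.out.println("estimated P95 in seconds: " + quantile(0.95, le, counts));
    }
}
```

With this sample data the 95th observation lies in the 1s to 2s bucket, so the estimate lands a little above 1.5 seconds. The accuracy of the result depends entirely on how finely the buckets are spaced around the latencies you care about.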

7. ServiceMonitor Integration on K8s

If your Spring Boot microservices run on Kubernetes (see the previous article on K8s in practice), kube-prometheus-stack with ServiceMonitor is the recommended way to get automatic service discovery.

7.1 Install kube-prometheus-stack

# Add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install into the monitoring namespace
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="Admin@123456" \
  --set prometheus.prometheusSpec.retention="15d" \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage="50Gi"

7.2 Configure a ServiceMonitor for the Spring Boot service

# k8s/service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service-monitor
  namespace: monitoring          # the ServiceMonitor lives in the monitoring namespace
  labels:
    release: kube-prometheus-stack   # must match the serviceMonitorSelector of the Prometheus resource
spec:
  namespaceSelector:
    matchNames:
      - default                  # namespace of the monitored Service
  selector:
    matchLabels:
      app: order-service         # matches the Service's label
  endpoints:
    - port: http                 # name of the port exposed by the Service
      path: /actuator/prometheus
      interval: 15s
      scrapeTimeout: 10s

The Spring Boot Service must carry the matching label:

# k8s/order-service.yaml (Service part)
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: default
  labels:
    app: order-service           # matches ServiceMonitor.selector
spec:
  selector:
    app: order-service
  ports:
    - name: http                 # port name must equal ServiceMonitor.endpoints.port
      port: 8080
      targetPort: 8080

# Apply the configuration
kubectl apply -f k8s/service-monitor.yaml

# Verify that the ServiceMonitor was created
kubectl get servicemonitor -n monitoring

# Verify in the Prometheus UI that the targets appear
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring
# Open http://localhost:9090/targets
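
When a target does not show up, the usual culprit is a label mismatch between the ServiceMonitor selector and the Service. A matchLabels selector matches only if every key/value pair it lists is present on the object with the same value; extra labels on the object are ignored. The sketch below states that rule in plain Java. The class name and the tier label are illustrative and not part of any Kubernetes client API.

```java
import java.util.Map;

class LabelSelector {

    /** True when every selector pair appears, with an equal value, in the object's labels. */
    static boolean matches(Map<String, String> matchLabels, Map<String, String> objectLabels) {
        return matchLabels.entrySet().stream()
                .allMatch(e -> e.getValue().equals(objectLabels.get(e.getKey())));
    }

    public static void main(String[] args) {
        Map<String, String> selector = Map.of("app", "order-service");
        // Extra labels on the Service do not prevent a match
        System.out.println(matches(selector, Map.of("app", "order-service", "tier", "backend"))); // true
        // A different value for the same key does
        System.out.println(matches(selector, Map.of("app", "user-service")));                     // false
    }
}
```

The same rule applies one level up: the release label on the ServiceMonitor has to satisfy the selector that the Prometheus resource uses to pick up ServiceMonitors.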

7.3 Accessing Grafana on K8s

# Port-forward to Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

# Read the admin password from the secret
kubectl get secret kube-prometheus-stack-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 --decode

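8. Full Startup Procedure

8.1 Docker Compose deployment

# 1. Create the monitoring directory (or clone your own copy of it)
mkdir -p monitoring && cd monitoring

# 2. Create all the configuration files described above

# 3. Start the monitoring stack
docker compose up -d

# 4. Check the status of each service
docker compose ps

# 5. Open each component
# Prometheus: http://localhost:9090
# Grafana:    http://localhost:3000  (admin / Admin@123456)
# AlertManager: http://localhost:9093

# 6. Hot-reload the Prometheus configuration (no restart needed; requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

8.2 Spring Boot service verification checklist

# Check that the metrics endpoint is reachable
curl http://localhost:8080/actuator/prometheus | head -20

# Check in Prometheus that the targets are online
# Open http://localhost:9090/targets and confirm every target shows UP

# A PromQL sanity query
# http://localhost:9090/graph
# Query: up{job="springboot-services"}  → should return 1

# Import the Grafana dashboard (ID: 19004)
# Grafana → Dashboards → Import → 19004 → select the Prometheus datasource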

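9. Production Best Practices

9.1 Metric naming conventions

// ✅ Recommended: lowercase, dot-separated names in namespace.subsystem.name order.
//    The Prometheus registry turns dots into underscores; _total is the counter suffix, not a unit.
Counter.builder("order.payment.success.total")    // exported as order_payment_success_total
Counter.builder("http.requests.total")

// ❌ Avoid: camelCase, abbreviations, vague names
Counter.builder("orderPaySuccCnt")
Counter.builder("cnt")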

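9.2 Label design principles

// ✅ Low-cardinality labels (a small, fixed set of values)
.tag("status", "success")       // success/failed
.tag("region", "cn-shanghai")   // a handful of fixed regions
.tag("env", "prod")             // dev/test/prod

// ❌ High-cardinality labels (cause a time-series explosion that badly hurts Prometheus performance)
.tag("user_id", userId)         // millions of user IDs
.tag("order_id", orderId)       // unique per order
.tag("request_id", requestId)   // unique per request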
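
The reason high-cardinality labels hurt is simple multiplication: in the worst case a metric produces one time series per combination of label values. The sketch below works through that arithmetic. The per-label counts (2 statuses, 3 regions, 3 environments, one million users) are assumptions chosen to mirror the comments above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SeriesCardinality {

    /** Upper bound on time series for one metric: the product of distinct values per label. */
    static long maxSeries(Map<String, Long> distinctValuesPerLabel) {
        long series = 1;
        for (long distinct : distinctValuesPerLabel.values()) {
            series = Math.multiplyExact(series, distinct); // throws instead of silently overflowing
        }
        return series;
    }

    public static void main(String[] args) {
        Map<String, Long> lowCardinality = new LinkedHashMap<>();
        lowCardinality.put("status", 2L);
        lowCardinality.put("region", 3L);
        lowCardinality.put("env", 3L);
        System.out.println(maxSeries(lowCardinality));   // 18

        Map<String, Long> withUserId = new LinkedHashMap<>(lowCardinality);
        withUserId.put("user_id", 1_000_000L);
        System.out.println(maxSeries(withUserId));       // 18000000
    }
}
```

Adding a single user_id label turns 18 possible series into 18 million, and every one of them costs memory and index space in Prometheus.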

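9.3 Prometheus storage tuning

# Prometheus command-line flags (set under command: in docker-compose.yml, not in prometheus.yml)
--storage.tsdb.retention.time=30d          # keep 30 days of data
--storage.tsdb.retention.size=50GB         # cap storage at 50GB; whichever limit is reached first applies
--storage.tsdb.wal-compression             # compress the WAL
--query.max-concurrency=20                 # maximum number of concurrent queries
--query.timeout=2m                         # query timeout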
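
To pick sensible retention values, it helps to estimate disk usage first. The Prometheus documentation gives a rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample. The sketch below applies it to an assumed workload of 100,000 active series scraped every 15 seconds; those workload numbers are placeholders, not measurements from this setup.

```java
class DiskEstimate {

    /** Rule-of-thumb disk usage in bytes for a given retention and ingestion rate. */
    static double bytesNeeded(long retentionSeconds, double samplesPerSecond, double bytesPerSample) {
        return retentionSeconds * samplesPerSecond * bytesPerSample;
    }

    public static void main(String[] args) {
        long retentionSeconds = 30L * 24 * 3600;      // 30 days, as in the flags above
        double samplesPerSecond = 100_000 / 15.0;     // assumed: 100k series, 15s scrape interval
        double bytesPerSample = 2.0;                  // upper end of the 1-2 byte rule of thumb
        double bytes = bytesNeeded(retentionSeconds, samplesPerSecond, bytesPerSample);
        System.out.println("estimated disk usage in GB: " + bytes / 1e9);
    }
}
```

This workload comes out at roughly 34.6 GB, which fits under the 50GB size cap with some headroom. Replace the assumed series count with the value of prometheus_tsdb_head_series from your own instance for a realistic figure.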

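9.4 Grafana dashboard design suggestions

  • Home dashboard: health status of all services (RED method: Rate, Errors, Duration)
  • Service detail dashboard: JVM details + per-endpoint latency + business metrics
  • Alert dashboard: summary of currently firing alerts
  • Capacity planning dashboard: resource usage trends + forecasts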

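10. Troubleshooting Common Problems

Q1: Prometheus gets a 401 when scraping Spring Boot metrics

Cause: the Actuator endpoints are protected by Spring Security.

Solution

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class SecurityConfig {
    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
            .requestMatchers("/actuator/prometheus").permitAll()  // open the Prometheus endpoint
            .requestMatchers("/actuator/health").permitAll()
            .anyRequest().authenticated()
        );
        return http.build();
    }
}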

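Q2: A custom metric cannot be found in Grafana

Troubleshooting steps

  1. Confirm the metric appears in /actuator/prometheus (curl ... | grep order_created)
  2. Confirm the service is UP on the Prometheus Targets page
  3. Query the metric by hand on the Prometheus Graph page
  4. Check whether the metric name contains . (Micrometer converts . to _ automatically)

Example: order.created.total → query order_created_total in PromQL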
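
The name translation in step 4 can be approximated in a few lines. The sketch below is a simplified stand-in for Micrometer's Prometheus naming convention, not its actual implementation: it replaces characters that are not valid in a Prometheus metric name with underscores, then adds the _total suffix for counters and the _seconds suffix for timers when they are missing. Treat the suffix rules as my understanding of the convention and verify the real names on /actuator/prometheus.

```java
class PrometheusName {

    /** Replaces every character outside [a-zA-Z0-9_:] with an underscore. */
    static String sanitize(String meterName) {
        return meterName.replaceAll("[^a-zA-Z0-9_:]", "_");
    }

    /** Counter names end in _total; the suffix is added only when missing. */
    static String counterName(String meterName) {
        String name = sanitize(meterName);
        return name.endsWith("_total") ? name : name + "_total";
    }

    /** Timer names carry the base unit as a _seconds suffix. */
    static String timerName(String meterName) {
        String name = sanitize(meterName);
        return name.endsWith("_seconds") ? name : name + "_seconds";
    }

    public static void main(String[] args) {
        System.out.println(counterName("order.created.total"));   // order_created_total
        System.out.println(counterName("order.created"));         // order_created_total
        System.out.println(timerName("order.process.duration"));  // order_process_duration_seconds
    }
}
```

This is why a timer registered as order.process.duration has to be queried as order_process_duration_seconds_count or order_process_duration_seconds_sum in PromQL.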

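Q3: GC alerts fire frequently but the application is healthy

Cause: a high Young GC rate is normal. Alert only on major collections (Full GC / old generation).

# Count major collections only. Micrometer tags each pause with action and cause;
# "end of major GC" is the action reported for old-generation and Full GCs.
# Add a cause matcher (for example cause="G1 Humongous Allocation") to narrow further.
rate(jvm_gc_pause_seconds_count{action="end of major GC"}[5m]) * 60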

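Summary

This article covered the full workflow for monitoring Spring Boot 3.x microservices with Prometheus + Grafana:

  1. Spring Boot 3.x integration: exposing metrics with Micrometer + Actuator and instrumenting custom business metrics
  2. One-command Docker Compose deployment: Prometheus + Grafana + AlertManager + Node Exporter
  3. Alert rules: JVM memory, HTTP error rate, P99 latency, and service liveness
  4. Grafana dashboards: recommended Dashboard IDs and a core PromQL reference
  5. K8s integration: automatic service discovery with kube-prometheus-stack + ServiceMonitor
  6. Production best practices: metric naming, label design, storage tuning

Together with the previous K8s deployment article, this closes the "deploy → monitor" loop. A natural next step is distributed tracing (SkyWalking / Zipkin) to complete the observability picture.


Published 2026-03-28 | Tags: SpringBoot, Docker, Distributed Systems, Prometheus, Grafana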
