## 1. Why Do Microservices Need End-to-End Monitoring?

In a microservice architecture, a single request may traverse a dozen or more services, and a performance hiccup in any one of them can cascade into a wider outage. Traditional log-based troubleshooting often means manually combing through hundreds of log lines, which is extremely inefficient.
The three pillars of observability:

| Pillar | Representative tools | Problems addressed |
|---|---|---|
| Metrics | Prometheus + Grafana | real-time dashboards, trend analysis, alerting |
| Logs | ELK / Loki | root-cause tracing, business auditing |
| Traces | Zipkin / Jaeger | cross-service latency analysis |
This article focuses on the Metrics pillar and walks through building a Prometheus + Grafana monitoring stack step by step, covering:

- Exposing metrics from Spring Boot 3.x via Micrometer
- Prometheus scrape configuration
- Building Grafana dashboards
- Alerting rules (AlertManager)
- One-command deployment with Docker Compose
- ServiceMonitor integration on Kubernetes
## 2. Tech Stack and Versions

| Component | Version |
|---|---|
| Spring Boot | 3.2.x |
| Micrometer | 1.12.x (bundled with Boot 3.x) |
| Prometheus | 2.51.x |
| Grafana | 10.4.x |
| AlertManager | 0.27.x |
| Docker Compose | v2.x |

Note: Spring Boot 3.x already bundles Micrometer; there is no need to pull in the legacy `spring-boot-actuator-autoconfigure` separately.
## 3. Wiring Spring Boot 3.x to Prometheus

### 3.1 Add the dependencies

```xml
<!-- pom.xml -->
<dependencies>
    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Micrometer Prometheus registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
    <!-- Optional: auto-configured HTTP request metrics -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
</dependencies>
```
### 3.2 Configure the Actuator endpoints

```yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: "*"          # in production expose only what you need, e.g. health,info,prometheus
      base-path: /actuator
  endpoint:
    health:
      show-details: always
    prometheus:
      enabled: true
  prometheus:
    metrics:
      export:
        enabled: true         # Boot 3.x location (moved from management.metrics.export.prometheus.enabled)
  metrics:
    tags:
      application: ${spring.application.name}  # tag every metric with the application name
      env: ${APP_ENV:dev}                      # environment tag (dev/test/prod)
```
Once configured, open http://localhost:8080/actuator/prometheus and you should see metrics in the following format:

```text
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{application="order-service",area="heap",env="prod",...} 1.34217728E8
# HELP http_server_requests_seconds Duration of HTTP server request handling
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{application="order-service",exception="None",method="GET",...} 1024
http_server_requests_seconds_sum{...} 5.12
```
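Note that `http_server_requests_seconds` is exported as a summary (count + sum) by default, so the `histogram_quantile` queries used for P95/P99 later in this article would have no `_bucket` series to work with. To emit histogram buckets, enable percentile histograms for the HTTP server metrics; a minimal sketch (bracket notation keeps the dotted map key intact in YAML):

```yaml
# application.yml (addition)
management:
  metrics:
    distribution:
      percentiles-histogram:
        "[http.server.requests]": true   # emit http_server_requests_seconds_bucket series
```

Enabling buckets increases the number of time series per endpoint, so turn this on only for the metrics you actually query quantiles over.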
### 3.3 Custom business metrics

Beyond the default JVM and HTTP metrics, you can instrument your own business metrics:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    private final Counter orderCreatedCounter;
    private final Counter orderFailedCounter;
    private final Timer orderProcessTimer;

    public OrderService(MeterRegistry registry) {
        // counter for created orders
        this.orderCreatedCounter = Counter.builder("order.created.total")
                .description("Total number of created orders")
                .tag("env", "prod")
                .register(registry);
        // counter for failed orders
        this.orderFailedCounter = Counter.builder("order.failed.total")
                .description("Total number of failed orders")
                .register(registry);
        // order processing latency
        this.orderProcessTimer = Timer.builder("order.process.duration")
                .description("Order processing duration")
                .register(registry);
    }

    public void createOrder(OrderDTO dto) {
        orderProcessTimer.record(() -> {
            try {
                // business logic...
                doCreateOrder(dto);
                orderCreatedCounter.increment();
            } catch (Exception e) {
                orderFailedCounter.increment();
                throw e;
            }
        });
    }

    private void doCreateOrder(OrderDTO dto) {
        // actual business processing
    }
}
```
Commonly used Micrometer meter types:

| Type | Use case | Examples |
|---|---|---|
| Counter | monotonically increasing count | total requests, total orders |
| Gauge | instantaneous value | thread count, queue length, cache size |
| Timer | latency + count | endpoint response time |
| DistributionSummary | value distribution | request body size, payment amount distribution |
## 4. One-Command Monitoring Stack with Docker Compose

### 4.1 Directory layout

```text
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml           # main Prometheus config
│   └── rules/
│       └── springboot.yml       # alerting rules
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml   # auto-provisioned data source
│       └── dashboards/
│           └── dashboard.yml    # auto-loaded dashboards
└── alertmanager/
    └── alertmanager.yml         # AlertManager config
```
### 4.2 docker-compose.yml

```yaml
# Note: the top-level "version" key is obsolete in Compose v2 and can be omitted.
networks:
  monitoring:
    driver: bridge

services:
  # ─── Prometheus ───────────────────────────────────────────
  prometheus:
    image: prom/prometheus:v2.51.2
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"  # keep 15 days of data
      - "--web.enable-lifecycle"             # allow hot reload
      - "--web.enable-admin-api"
    networks:
      - monitoring

  # ─── Grafana ──────────────────────────────────────────────
  grafana:
    image: grafana/grafana:10.4.2
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=Admin@123456
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-clock-panel
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - monitoring

  # ─── AlertManager ─────────────────────────────────────────
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    networks:
      - monitoring

  # ─── Node Exporter (host metrics) ─────────────────────────
  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
```
### 4.3 Prometheus scrape configuration

```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s       # scrape every 15 seconds
  evaluation_interval: 15s   # evaluate alerting rules every 15 seconds
  external_labels:
    cluster: "prod-cluster"
    region: "cn-shanghai"

# alerting rule files
rule_files:
  - "/etc/prometheus/rules/*.yml"

# AlertManager address
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# scrape jobs
scrape_configs:
  # ── Prometheus self-monitoring ──
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # ── Node Exporter host metrics ──
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  # ── Spring Boot microservices (static config) ──
  - job_name: "springboot-services"
    metrics_path: "/actuator/prometheus"
    scrape_interval: 10s
    static_configs:
      - targets:
          - "order-service:8080"
          - "user-service:8081"
          - "payment-service:8082"
          - "gateway:8888"
        labels:
          env: "prod"
          region: "cn-shanghai"

  # ── Spring Boot microservices (file-based service discovery) ──
  # suited to environments where instances come and go
  - job_name: "springboot-file-sd"
    metrics_path: "/actuator/prometheus"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
        refresh_interval: 30s
```
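For the file-based discovery job above, each JSON file under `/etc/prometheus/targets/` lists target groups in the standard `file_sd` format. A minimal sketch (the hostnames are placeholders; only `targets` is required, `labels` is optional):

```json
[
  {
    "targets": ["order-service-1:8080", "order-service-2:8080"],
    "labels": {
      "env": "prod",
      "application": "order-service"
    }
  }
]
```

Prometheus re-reads these files on the configured `refresh_interval` (and on file change), so a deployment script can add or remove instances without touching `prometheus.yml`.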
## 5. Alerting Rules

### 5.1 Spring Boot alerting rules

```yaml
# prometheus/rules/springboot.yml
groups:
  - name: springboot_alerts
    rules:
      # ── JVM heap usage above 85% ──────────────────────────
      - alert: JvmHeapMemoryHigh
        expr: |
          (jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) * 100 > 85
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap usage too high ({{ $labels.application }})"
          description: |
            Application {{ $labels.application }}, instance {{ $labels.instance }}:
            heap usage is {{ $value | printf "%.1f" }}%, above the 85% threshold
      # ── GC pauses too long ────────────────────────────────
      - alert: JvmGcPauseTooLong
        expr: |
          rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average GC pause too long ({{ $labels.application }})"
          description: "Average GC pause over the last 5 minutes is {{ $value | printf \"%.3f\" }}s, above 500ms"
      # ── HTTP error rate above 5% ──────────────────────────
      # Note: aggregate both sides; without sum(), the status label on the
      # numerator makes the division match each 5xx series against itself.
      - alert: HttpErrorRateHigh
        expr: |
          (
            sum by (application, uri) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            /
            sum by (application, uri) (rate(http_server_requests_seconds_count[5m]))
          ) * 100 > 5
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate too high ({{ $labels.application }})"
          description: |
            Application {{ $labels.application }}, endpoint {{ $labels.uri }}:
            5xx error rate is {{ $value | printf "%.2f" }}%, above the 5% threshold
      # ── Endpoint P99 latency above 2s ─────────────────────
      # Requires percentile histograms to be enabled, otherwise
      # the _bucket series do not exist.
      - alert: HttpLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le, application, uri) (rate(http_server_requests_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoint P99 latency too high ({{ $labels.application }})"
          description: "Endpoint {{ $labels.uri }} P99 latency is {{ $value | printf \"%.2f\" }}s"
      # ── Service down ──────────────────────────────────────
      - alert: ServiceDown
        expr: up{job="springboot-services"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service down ({{ $labels.instance }})"
          description: "Instance {{ $labels.instance }} has been unreachable for 1 minute"
      # ── Thread pool queue backlog ─────────────────────────
      - alert: ThreadPoolQueueFull
        expr: executor_queue_remaining_tasks < 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Thread pool queue nearly exhausted ({{ $labels.application }})"
          description: "Thread pool {{ $labels.name }} has only {{ $value }} queue slots left"
```
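The dashboard section later queries HikariCP connection pool usage; if your services use HikariCP, a matching alert rule is a natural companion. A sketch of one more rule to append under the same `rules:` list (the 90% threshold is illustrative, not a recommendation from this article):

```yaml
# appended under groups[0].rules in prometheus/rules/springboot.yml
- alert: HikariPoolNearSaturation
  expr: hikaricp_connections_active / hikaricp_connections_max * 100 > 90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "DB connection pool nearly saturated ({{ $labels.application }})"
    description: "Pool {{ $labels.pool }} usage is {{ $value | printf \"%.1f\" }}%"
```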
### 5.2 AlertManager configuration (DingTalk / WeCom notifications)

```yaml
# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.163.com:465"
  smtp_from: "alert@yourcompany.com"
  smtp_auth_username: "alert@yourcompany.com"
  smtp_auth_password: "your-smtp-password"

route:
  group_by: ["alertname", "application", "env"]
  group_wait: 30s       # wait 30s to batch alerts in the same group
  group_interval: 5m    # at most one notification per group every 5 minutes
  repeat_interval: 4h   # re-notify every 4h while an alert keeps firing
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "critical-receiver"
      continue: true
    - match:
        severity: warning
      receiver: "warning-receiver"

receivers:
  - name: "default"
    email_configs:
      - to: "ops@yourcompany.com"
  - name: "critical-receiver"
    webhook_configs:
      - url: "http://dingtalk-webhook:8060/dingtalk/webhook1/send"
        send_resolved: true
  - name: "warning-receiver"
    webhook_configs:
      - url: "http://dingtalk-webhook:8060/dingtalk/webhook2/send"
        send_resolved: true

inhibit_rules:
  # while a critical alert fires, suppress warnings for the same app/instance
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["application", "instance"]
```
## 6. Building Grafana Dashboards

### 6.1 Auto-provisioned data source

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: "15s"
      queryTimeout: "60s"
```
### 6.2 Recommended Grafana dashboards to import

| Dashboard | ID | Purpose |
|---|---|---|
| Spring Boot 3.x Statistics | 19004 | JVM, HTTP, and connection pool overview |
| JVM Micrometer | 4701 | detailed JVM metrics (heap/non-heap/GC/threads) |
| Spring Boot Actuator | 12685 | Actuator endpoint monitoring |
| Node Exporter Full | 1860 | host CPU/memory/disk/network |
| Kubernetes Cluster | 7249 | K8s cluster resource overview |

Import steps: Grafana → Dashboards → Import → enter the dashboard ID → Load → select the Prometheus data source → Import.
### 6.3 Core PromQL queries for reference

```promql
# ── JVM heap usage (%) ──────────────────────────────────
(jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}) * 100

# ── HTTP request QPS (grouped by endpoint) ──────────────
sum by (uri, method) (
  rate(http_server_requests_seconds_count[1m])
)

# ── HTTP P95 / P99 latency (swap 0.95 for 0.99) ─────────
histogram_quantile(0.95, sum by (le, uri) (
  rate(http_server_requests_seconds_bucket[5m])
))

# ── HTTP 5xx error rate ─────────────────────────────────
(
  sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  /
  sum by (application) (rate(http_server_requests_seconds_count[5m]))
) * 100

# ── Live thread count ───────────────────────────────────
jvm_threads_live_threads

# ── GC frequency (collections per minute) ───────────────
rate(jvm_gc_pause_seconds_count[1m]) * 60

# ── Database connection pool usage (%) ──────────────────
hikaricp_connections_active / hikaricp_connections_max * 100

# ── Order success rate (custom business metric) ─────────
(
  rate(order_created_total[5m])
  /
  (rate(order_created_total[5m]) + rate(order_failed_total[5m]))
) * 100
```
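Expressions like the 5xx error rate tend to get duplicated across dashboards and alerts; Prometheus recording rules let you precompute them once under a stable name. A sketch, assuming the file sits next to the alert rules so the existing `rule_files` glob picks it up (the `job:...:percent` naming is a convention, not a requirement):

```yaml
# prometheus/rules/recording.yml
groups:
  - name: springboot_recording
    interval: 30s
    rules:
      - record: job:http_5xx_error_rate:percent
        expr: |
          (
            sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            /
            sum by (application) (rate(http_server_requests_seconds_count[5m]))
          ) * 100
```

Dashboards and alerts can then query `job:http_5xx_error_rate:percent` directly, which is faster to evaluate and keeps the expression in one place.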
## 7. ServiceMonitor Integration on Kubernetes

If your Spring Boot microservices run on Kubernetes (see the previous K8s article), the recommended approach is kube-prometheus-stack plus ServiceMonitor for automatic service discovery.

### 7.1 Install kube-prometheus-stack

```shell
# add the Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# install into the monitoring namespace
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword="Admin@123456" \
  --set prometheus.prometheusSpec.retention="15d" \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage="50Gi"
```
### 7.2 A ServiceMonitor for a Spring Boot service

```yaml
# k8s/service-monitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: order-service-monitor
  namespace: monitoring             # the ServiceMonitor lives in the monitoring namespace
  labels:
    release: kube-prometheus-stack  # must match the Prometheus CRD's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - default                     # namespace of the monitored service
  selector:
    matchLabels:
      app: order-service            # matches the Service's labels
  endpoints:
    - port: http                    # named port exposed by the Service
      path: /actuator/prometheus
      interval: 15s
      scrapeTimeout: 10s
```
The corresponding Spring Boot Service must carry the matching label:

```yaml
# k8s/order-service.yaml (Service part)
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: default
  labels:
    app: order-service    # matches ServiceMonitor.spec.selector
spec:
  selector:
    app: order-service
  ports:
    - name: http          # the port name must match ServiceMonitor.endpoints.port
      port: 8080
      targetPort: 8080
```
```shell
# apply the manifests
kubectl apply -f k8s/service-monitor.yaml

# verify the ServiceMonitor was created
kubectl get servicemonitor -n monitoring

# verify the target shows up in the Prometheus UI
kubectl port-forward svc/kube-prometheus-stack-prometheus 9090:9090 -n monitoring
# then open http://localhost:9090/targets
```
### 7.3 Accessing Grafana on K8s

```shell
# port-forward to reach Grafana
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring

# fetch the default password
kubectl get secret kube-prometheus-stack-grafana -n monitoring \
  -o jsonpath="{.data.admin-password}" | base64 --decode
```
## 8. End-to-End Startup Walkthrough

### 8.1 Docker Compose deployment

```shell
# 1. create the monitoring directory
mkdir -p monitoring && cd monitoring

# 2. create all the config files shown above

# 3. start the monitoring stack
docker compose up -d

# 4. check service status
docker compose ps

# 5. open each component
#    Prometheus:   http://localhost:9090
#    Grafana:      http://localhost:3000  (admin / Admin@123456)
#    AlertManager: http://localhost:9093

# 6. hot-reload Prometheus after config changes (no restart needed)
curl -X POST http://localhost:9090/-/reload
```

### 8.2 Spring Boot verification checklist

```shell
# verify the metrics endpoint is reachable
curl http://localhost:8080/actuator/prometheus | head -20

# verify the target is online in Prometheus:
#   open http://localhost:9090/targets and confirm every target shows UP

# quick PromQL sanity check at http://localhost:9090/graph
#   query: up{job="springboot-services"}  → should return 1

# import the Grafana dashboard (ID: 19004)
#   Grafana → Dashboards → Import → 19004 → select the Prometheus data source
```
## 9. Production Best Practices

### 9.1 Metric naming conventions

```java
// ✅ Recommended: follow the Prometheus convention namespace_subsystem_name_unit
Counter.builder("order.payment.success.total")  // unit/suffix goes last
Counter.builder("http.requests.total")

// ❌ Avoid: camelCase, cryptic abbreviations, vague names
Counter.builder("orderPaySuccCnt")
Counter.builder("cnt")
```
### 9.2 Label design principles

```java
// ✅ Low-cardinality labels (small, bounded sets of values)
.tag("status", "success")       // success/failed
.tag("region", "cn-shanghai")   // a handful of fixed regions
.tag("env", "prod")             // dev/test/prod

// ❌ High-cardinality labels (cause a time-series explosion and cripple Prometheus)
.tag("user_id", userId)         // millions of user IDs
.tag("order_id", orderId)       // unique per order
.tag("request_id", requestId)   // unique per request
```
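When you suspect a cardinality problem has already crept in, Prometheus itself can tell you which metric names own the most series. Two ad-hoc queries you can paste into the Graph page (the `topk` limit of 10 is arbitrary):

```promql
# metric names ranked by number of series
topk(10, count by (__name__) ({__name__=~".+"}))

# series count for one suspect metric, broken down by a label
count by (uri) (http_server_requests_seconds_count)
```

Note that `{__name__=~".+"}` touches every series in the TSDB and is expensive; run it ad hoc, not on a dashboard.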
### 9.3 Prometheus storage tuning

```shell
# Prometheus launch flags (command-line arguments, not prometheus.yml settings)
--storage.tsdb.retention.time=30d   # keep 30 days of data
--storage.tsdb.retention.size=50GB  # cap storage at 50GB
--storage.tsdb.wal-compression      # enable WAL compression
--query.max-concurrency=20          # max concurrent queries
--query.timeout=2m                  # query timeout
```
### 9.4 Grafana dashboard design tips

- Landing dashboard: health of all services at a glance (the RED method: Rate, Errors, Duration)
- Per-service dashboard: JVM details + per-endpoint latency + business metrics
- Alerting dashboard: summary of currently firing alerts
- Capacity-planning dashboard: resource usage trends and forecasts
## 10. Troubleshooting FAQ

Q1: Prometheus gets a 401 when scraping Spring Boot metrics

Cause: the Actuator endpoints are protected by Spring Security.

Fix:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http.authorizeHttpRequests(auth -> auth
                .requestMatchers("/actuator/prometheus").permitAll()  // open the Prometheus endpoint
                .requestMatchers("/actuator/health").permitAll()
                .anyRequest().authenticated()
        );
        return http.build();
    }
}
```
Q2: A custom metric does not show up in Grafana

Checklist:

- Confirm the metric appears at /actuator/prometheus (`curl ... | grep order_created`)
- Confirm the service shows UP on the Prometheus Targets page
- Query the metric manually on the Prometheus Graph page
- Check whether the metric name contains `.`: Micrometer converts `.` to `_`, e.g. `order.created.total` becomes `order_created_total` in PromQL
Q3: Frequent GC alerts while the service is healthy

Cause: a high young-GC rate is normal; alert only on Full GC or G1 mixed-collection pauses.

```promql
# count only Full GC / G1 mixed collections (per minute)
rate(jvm_gc_pause_seconds_count{
  cause=~"G1 Mixed GC|G1 Humongous Allocation|Ergonomics"
}[5m]) * 60
```
## Summary

This article covered the full workflow of monitoring Spring Boot 3.x microservices with Prometheus + Grafana:

- ✅ Spring Boot 3.x integration: exposing metrics via Micrometer + Actuator, plus custom business metrics
- ✅ One-command Docker Compose deployment: Prometheus + Grafana + AlertManager + Node Exporter
- ✅ Alerting rules: JVM memory, HTTP error rate, P99 latency, service liveness
- ✅ Grafana dashboards: recommended dashboard IDs and a core PromQL reference
- ✅ K8s integration: kube-prometheus-stack + ServiceMonitor auto-discovery
- ✅ Production best practices: metric naming, label design, storage tuning

Together with the previous K8s deployment article, this closes the "deploy → monitor" loop. A natural next step is distributed tracing (SkyWalking / Zipkin) to complete the observability picture.

Published 2026-03-28 | Tags: SpringBoot, Docker, Distributed Systems, Prometheus, Grafana