ELK Stack End-to-End Logging in Practice: Collecting, Analyzing, and Alerting on Spring Boot 3.x Microservice Logs

Administrator · 2026-03-29

In a microservice architecture with dozens or hundreds of service instances, watching logs with tail -f is a thing of the past. This post builds a production-grade, end-to-end logging platform from scratch on Spring Boot 3.x microservices plus the ELK Stack (Elasticsearch 8.x + Logstash 8.x + Kibana 8.x) and Filebeat, covering structured log output, log collection, index management, visual analysis, and alerting that complements Prometheus. Together with the earlier post [Prometheus + Grafana end-to-end monitoring], it completes the three pillars of microservice observability (Metrics + Logs + Traces).


1. Why Do Microservices Need ELK?

Pain point | Traditional approach | ELK approach
Logs scattered across containers | kubectl logs one container at a time | Centralized aggregation, one-stop search
Cross-service troubleshooting | Comparing multiple windows by hand | One-click full-chain search by TraceId
Unbounded log storage | Forced deletion when the disk fills up | Automatic rollover and archiving via ILM
Real-time alerting | — | ElastAlert2, minute-level triggering
Inconsistent log formats | No unified analysis possible | Structuring via Logstash Grok/JSON

The three pillars of observability: Metrics (Prometheus + Grafana), Logs (ELK), and Traces (SkyWalking/Jaeger). All three are essential; this post fills in the Logs pillar.


2. Overall Architecture

┌─────────────────────────────────────────────────────────────────┐
│                Microservice cluster (Docker / K8s)              │
│                                                                 │
│  [order-service]  [user-service]  [gateway]  ...                │
│       │                │               │                        │
│   log/*.log        log/*.log       log/*.log                    │
│       └────────────────┴───────────────┘                        │
│                          │  Filebeat (lightweight shipper)      │
└─────────────────────────┼───────────────────────────────────────┘
                          │
              ┌───────────▼─────────────┐
              │ Logstash (filter/parse) │
              └───────────┬─────────────┘
                          │
              ┌───────────▼─────────────┐
              │  Elasticsearch 8.x      │
              │  (store/index/search)   │
              └───────────┬─────────────┘
                          │
              ┌───────────▼─────────────┐
              │ Kibana (viz & alerting) │
              └─────────────────────────┘

Why add Filebeat? Logstash is a heavyweight JVM process, so running it next to every container wastes resources. Filebeat is a lightweight shipper written in Go: it only reads log files and forwards them to Logstash, which performs all parsing and filtering centrally. A clean separation of duties.


3. Structured Logging in Spring Boot 3.x

3.1 Dependencies

<!-- pom.xml -->
<dependencies>
    <!-- Spring Boot starter (includes Logback) -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- logstash-logback-encoder: JSON-formatted log output -->
    <dependency>
        <groupId>net.logstash.logback</groupId>
        <artifactId>logstash-logback-encoder</artifactId>
        <version>7.4</version>
    </dependency>

    <!-- Automatic traceId/spanId in the MDC via Micrometer Tracing
         (the Boot 3.x successor to Spring Cloud Sleuth) -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-tracing-bridge-brave</artifactId>
    </dependency>
</dependencies>

3.2 logback-spring.xml

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <!-- Read application properties from application.yml -->
    <springProperty scope="context" name="APP_NAME" source="spring.application.name" defaultValue="app"/>
    <springProperty scope="context" name="LOG_PATH" source="logging.file.path" defaultValue="./logs"/>
    <!-- logback cannot resolve spring.profiles.active directly; expose it as a property -->
    <springProperty scope="context" name="ENV" source="spring.profiles.active" defaultValue="dev"/>

    <!-- Console: human-readable format for development -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} [%X{traceId}] - %msg%n</pattern>
        </encoder>
    </appender>

    <!-- File: JSON format, picked up by Filebeat -->
    <appender name="FILE_JSON" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOG_PATH}/${APP_NAME}.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <!-- Roll daily, 200MB per file, keep 7 days, 2GB total cap -->
            <fileNamePattern>${LOG_PATH}/${APP_NAME}-%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>200MB</maxFileSize>
            <maxHistory>7</maxHistory>
            <totalSizeCap>2GB</totalSizeCap>
        </rollingPolicy>
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <!-- Extra custom fields -->
            <customFields>{"app":"${APP_NAME}","env":"${ENV}"}</customFields>
            <!-- Include MDC keys (traceId, spanId, ...) -->
            <includeMdcKeyName>traceId</includeMdcKeyName>
            <includeMdcKeyName>spanId</includeMdcKeyName>
            <includeMdcKeyName>userId</includeMdcKeyName>
        </encoder>
    </appender>

    <!-- Async wrapper for higher throughput -->
    <appender name="ASYNC_FILE" class="ch.qos.logback.classic.AsyncAppender">
        <discardingThreshold>0</discardingThreshold> <!-- never drop events, even under pressure -->
        <queueSize>512</queueSize>
        <appender-ref ref="FILE_JSON"/>
    </appender>

    <root level="INFO">
        <appender-ref ref="CONSOLE"/>
        <appender-ref ref="ASYNC_FILE"/>
    </root>

    <!-- Dedicated logger for business events; additivity=false prevents duplicates via root -->
    <logger name="biz" level="INFO" additivity="false">
        <appender-ref ref="ASYNC_FILE"/>
    </logger>
</configuration>

3.3 MDC Trace Filter

/**
 * Injects a traceId into the MDC for every request, so that all log lines
 * carry it automatically and the whole call chain can be searched at once.
 */
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class TraceIdFilter implements Filter {

    private static final String TRACE_ID = "traceId";
    private static final String USER_ID  = "userId";

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        try {
            // Prefer the header passed through by the gateway; otherwise create a new one
            String traceId = Optional.ofNullable(req.getHeader("X-Trace-Id"))
                    .filter(s -> !s.isBlank())
                    .orElse(UUID.randomUUID().toString().replace("-", "").substring(0, 16));

            MDC.put(TRACE_ID, traceId);
            // User ID from the SecurityContext populated after JWT parsing
            // (String.valueOf avoids a ClassCastException if the principal is not a String)
            Optional.ofNullable(SecurityContextHolder.getContext().getAuthentication())
                    .map(auth -> String.valueOf(auth.getPrincipal()))
                    .ifPresent(uid -> MDC.put(USER_ID, uid));

            // Echo the traceId in the response so the frontend can report it
            ((HttpServletResponse) response).setHeader("X-Trace-Id", traceId);
            chain.doFilter(request, response);
        } finally {
            MDC.clear(); // prevent leakage across pooled, reused threads
        }
    }
}
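The filter's two key behaviors, the 16-hex-character traceId derivation and clearing the context in a finally block, can be exercised with JDK classes alone. In this sketch the ThreadLocal map is a hypothetical stand-in for SLF4J's MDC, not the real class:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TraceIds {

    // Same derivation as the filter: strip dashes, keep the first 16 hex chars.
    static String newTraceId() {
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }

    // Minimal stand-in for MDC: a per-thread map that MUST be cleared after
    // each request, because pooled threads are reused across requests.
    static final ThreadLocal<Map<String, String>> CTX =
            ThreadLocal.withInitial(HashMap::new);

    static void handleRequest(Runnable body) {
        try {
            CTX.get().put("traceId", newTraceId());
            body.run();
        } finally {
            CTX.remove(); // without this, the next request on this thread inherits stale keys
        }
    }

    public static void main(String[] args) {
        handleRequest(() -> System.out.println("in-request traceId=" + CTX.get().get("traceId")));
        System.out.println("after request, context is empty: " + CTX.get().isEmpty()); // true
    }
}
```

The finally-based cleanup is the part worth copying: skipping it is the classic cause of one user's logs showing another user's traceId under load.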

Sample JSON log output

{
  "@timestamp": "2026-03-29T08:15:32.456+08:00",
  "level": "ERROR",
  "logger_name": "com.example.order.service.OrderService",
  "message": "Failed to create order: insufficient stock",
  "app": "order-service",
  "env": "prod",
  "traceId": "a3f5c8d2e1b40967",
  "spanId":  "b1c2d3e4",
  "userId":  "10086",
  "stack_trace": "java.lang.RuntimeException: insufficient stock\n\tat com.example..."
}
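For quick ad-hoc work (for example, grepping a trace out of raw files before Kibana is up), the traceId can be pulled from such a line with a regex. A stdlib sketch, not a substitute for a real JSON parser:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TraceIdExtractor {

    // Good enough for LogstashEncoder output, where traceId is a flat hex string;
    // a real consumer should use a JSON library instead of a regex.
    private static final Pattern TRACE_ID = Pattern.compile("\"traceId\"\\s*:\\s*\"([0-9a-f]+)\"");

    static Optional<String> extract(String jsonLine) {
        Matcher m = TRACE_ID.matcher(jsonLine);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "{\"level\":\"ERROR\",\"traceId\":\"a3f5c8d2e1b40967\",\"app\":\"order-service\"}";
        System.out.println(extract(line).orElse("<none>")); // prints a3f5c8d2e1b40967
    }
}
```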

4. Deploying the ELK Stack with Docker Compose

4.1 Directory Layout

elk/
├── docker-compose.yml
├── elasticsearch/
│   └── config/
│       └── elasticsearch.yml
├── logstash/
│   ├── config/
│   │   └── logstash.yml
│   └── pipeline/
│       └── springboot.conf        # core pipeline
├── kibana/
│   └── config/
│       └── kibana.yml
└── filebeat/
    └── filebeat.yml

4.2 docker-compose.yml

version: "3.9"

networks:
  elk:
    driver: bridge

volumes:
  esdata:

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    container_name: elasticsearch
    environment:
      - node.name=es01
      - cluster.name=elk-cluster
      - discovery.type=single-node
      - xpack.security.enabled=false        # auth disabled for dev; MUST be enabled in production
      - xpack.security.http.ssl.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
      - bootstrap.memory_lock=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata:/usr/share/elasticsearch/data
      - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
    networks:
      - elk
    healthcheck:
      test: ["CMD-SHELL", "curl -s http://localhost:9200/_cluster/health | grep -qE '\"status\":\"(green|yellow)\"'"]
      interval: 20s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.13.0
    container_name: logstash
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
    ports:
      - "5044:5044"    # Beats input (Filebeat pushes here)
      - "9600:9600"    # Logstash monitoring API
    environment:
      - LS_JAVA_OPTS=-Xms512m -Xmx512m
    networks:
      - elk
    depends_on:
      elasticsearch:
        condition: service_healthy

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.0
    container_name: kibana
    volumes:
      - ./kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml
    ports:
      - "5601:5601"
    networks:
      - elk
    depends_on:
      - elasticsearch

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.13.0
    container_name: filebeat
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      # Host log directory (all microservices write their log files here)
      - /var/log/microservices:/var/log/microservices:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - elk
    depends_on:
      - logstash

4.3 Logstash Pipeline (core)

# logstash/pipeline/springboot.conf

input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON events (emitted by logstash-logback-encoder)
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "log"
    }
    # Use the application's own timestamp; this must run BEFORE [log] is removed below
    date {
      match => ["[log][@timestamp]", "ISO8601"]
      target => "@timestamp"
      timezone => "Asia/Shanghai"
    }
    # Promote key fields to the top level
    mutate {
      rename => {
        "[log][app]"         => "app"
        "[log][env]"         => "env"
        "[log][traceId]"     => "traceId"
        "[log][spanId]"      => "spanId"
        "[log][userId]"      => "userId"
        "[log][level]"       => "level"
        "[log][logger_name]" => "logger"
        "[log][message]"     => "msg"
        "[log][stack_trace]" => "stackTrace"
      }
      remove_field => ["message", "log", "ecs", "agent", "input"]
    }
  } else {
    # Non-JSON logs (e.g. Nginx access logs): parse with Grok
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
  }

  # Tag slow requests (duration > 2000 ms)
  if [msg] =~ /duration/ {
    grok {
      match => { "msg" => "duration=(?<duration_ms>\d+)" }
    }
    if [duration_ms] {
      # Grok captures strings; convert before the numeric comparison
      mutate { convert => { "duration_ms" => "integer" } }
      if [duration_ms] > 2000 {
        mutate { add_tag => ["slow_request"] }
      }
    }
  }

  # Drop noise (health checks, static assets)
  if [msg] =~ /actuator\/health|\.css|\.js|\.ico/ {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    # One index per app per month, convenient for ILM management
    index => "springboot-%{app}-%{+yyyy.MM}"
    action => "create"
  }

  # Enable while debugging, disable in production
  # stdout { codec => rubydebug }
}
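The slow-request tagging above hinges on a subtle point: Grok captures duration_ms as a string, so it must become a number before any numeric comparison. The equivalent check in plain Java, using a made-up log-line format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlowRequestTagger {

    private static final Pattern DURATION = Pattern.compile("duration=(\\d+)");
    private static final long THRESHOLD_MS = 2000;

    // True when the message carries duration=<n> and n exceeds the threshold,
    // mirroring the grok + convert + compare chain in the pipeline.
    static boolean isSlow(String msg) {
        Matcher m = DURATION.matcher(msg);
        if (!m.find()) return false;
        long ms = Long.parseLong(m.group(1)); // the string capture must become a number first
        return ms > THRESHOLD_MS;
    }

    public static void main(String[] args) {
        System.out.println(isSlow("GET /api/orders duration=3500 status=200")); // true
        System.out.println(isSlow("GET /api/orders duration=120 status=200"));  // false
    }
}
```

Comparing the raw capture against the string "2000" would silently do lexicographic comparison, which is the same trap the `mutate convert` step in the pipeline avoids.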

4.4 Filebeat Configuration

# filebeat/filebeat.yml
filebeat.inputs:
  # NOTE: the `log` input is deprecated in 8.x in favor of `filestream`;
  # it still works in 8.13 and is kept here for brevity
  - type: log
    enabled: true
    paths:
      - /var/log/microservices/**/*.log
    # Merge multi-line events (exception stack traces span several lines)
    multiline.type: pattern
    multiline.pattern: '^\{'
    multiline.negate: true
    multiline.match: after
    # Extra metadata
    fields:
      source: "file"
    fields_under_root: true
    # Close handles of files idle for 5m (read offsets persist in the registry,
    # so collection resumes where it left off without duplicates)
    close_inactive: 5m

  # Also collect Docker container logs
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log
    processors:
      - add_docker_metadata:
          host: "unix:///var/run/docker.sock"

output.logstash:
  hosts: ["logstash:5044"]
  worker: 4
  bulk_max_size: 2048
  compression_level: 3

# Filebeat self-monitoring (optional)
monitoring.enabled: false
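The three multiline settings above are easy to misread: pattern identifies a line that starts an event (a JSON {), negate: true makes the non-matching lines the continuation, and match: after glues them onto the preceding line. A small Java sketch of that merge logic, with made-up input lines:

```java
import java.util.ArrayList;
import java.util.List;

public class MultilineMerger {

    // Lines matching the start pattern open a new event; everything else
    // (e.g. stack-trace lines) is appended to the current one.
    static List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (line.startsWith("{")) {           // multiline.pattern: '^\{'
                if (current != null) events.add(current.toString());
                current = new StringBuilder(line);
            } else if (current != null) {         // negate + after: glue to the previous event
                current.append("\n").append(line);
            }
        }
        if (current != null) events.add(current.toString());
        return events;
    }

    public static void main(String[] args) {
        List<String> raw = List.of(
                "{\"level\":\"ERROR\",\"message\":\"boom\"}",
                "java.lang.RuntimeException: boom",
                "\tat com.example.OrderService.create(OrderService.java:42)",
                "{\"level\":\"INFO\",\"message\":\"ok\"}");
        System.out.println(merge(raw).size() + " events"); // 2 events
    }
}
```

Without this merging, every stack-trace line would arrive in Elasticsearch as its own document, divorced from the error that produced it.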

5. Elasticsearch ILM (Index Lifecycle Management)

Production log volume is huge; you must configure ILM (Index Lifecycle Management) to roll over and delete indices automatically, or the disk will eventually fill up.
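A back-of-the-envelope calculation shows why. With assumed numbers (500 events/s cluster-wide, roughly 800 bytes per JSON event), raw volume is about 32 GiB per day and nearly 3 TiB by the 90-day delete phase, before replicas and indexing overhead:

```java
public class LogVolume {

    // All inputs are assumptions for illustration, not measurements.
    static double gibPerDay(long eventsPerSecond, long bytesPerEvent) {
        double bytesPerDay = (double) eventsPerSecond * bytesPerEvent * 86_400;
        return bytesPerDay / (1024.0 * 1024 * 1024);
    }

    public static void main(String[] args) {
        double daily = gibPerDay(500, 800);
        System.out.printf("~%.1f GiB/day, ~%.0f GiB over a 90-day retention%n", daily, daily * 90);
    }
}
```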

# Create the ILM policy (the freeze action was removed in 8.x,
# so the cold phase marks indices read-only instead)
curl -X PUT "http://localhost:9200/_ilm/policy/springboot-logs-policy" \
  -H "Content-Type: application/json" \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_primary_shard_size": "20gb",
              "max_age": "1d"
            }
          }
        },
        "warm": {
          "min_age": "7d",
          "actions": {
            "forcemerge": { "max_num_segments": 1 },
            "shrink": { "number_of_shards": 1 }
          }
        },
        "cold": {
          "min_age": "30d",
          "actions": {
            "readonly": {}
          }
        },
        "delete": {
          "min_age": "90d",
          "actions": {
            "delete": {}
          }
        }
      }
    }
  }'
# Create an index template bound to the ILM policy
curl -X PUT "http://localhost:9200/_index_template/springboot-logs-template" \
  -H "Content-Type: application/json" \
  -d '{
    "index_patterns": ["springboot-*"],
    "template": {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "index.lifecycle.name": "springboot-logs-policy",
        "index.lifecycle.rollover_alias": "springboot-logs"
      },
      "mappings": {
        "properties": {
          "@timestamp":  { "type": "date" },
          "app":         { "type": "keyword" },
          "env":         { "type": "keyword" },
          "level":       { "type": "keyword" },
          "traceId":     { "type": "keyword" },
          "spanId":      { "type": "keyword" },
          "userId":      { "type": "keyword" },
          "logger":      { "type": "keyword" },
          "msg":         { "type": "text", "analyzer": "ik_max_word" },
          "stackTrace":  { "type": "text" }
        }
      }
    }
  }'

Note: the rollover action only fires when documents are written through the rollover alias (springboot-logs). Since the Logstash output above writes date-based index names directly, either remove the rollover action from the hot phase or point the output at the alias; otherwise ILM will stall waiting for a rollover that never happens.

⚠️ Note: for the msg field to support Chinese full-text search in production, install the analysis-ik plugin:

docker exec -it elasticsearch \
  bin/elasticsearch-plugin install \
  https://get.infini.cloud/elasticsearch/analysis-ik/8.13.0

6. Kibana Setup and Common Analysis Scenarios

6.1 Create a Data View

Open http://localhost:5601 → Stack Management → Data Views → Create data view:

  • Index pattern: springboot-*
  • Timestamp field: @timestamp

6.2 Common KQL Queries

# All ERROR-level logs
level: "ERROR"

# Full-chain trace: gather logs from every microservice by traceId
traceId: "a3f5c8d2e1b40967"

# All operations of a given user
userId: "10086" and level: "INFO"

# order-service errors (pick "last 1 hour" in the time picker)
app: "order-service" and level: "ERROR"

# Slow-request analysis
tags: "slow_request"

# Logs containing a specific exception
stackTrace: "OutOfMemoryError"

6.3 Building Dashboards

Recommended panels:

Panel | Type | Purpose
Log level distribution | Pie Chart | Track the ERROR/WARN ratio in real time
Logs per minute per service | Line Chart | Spot traffic spikes and misbehaving services
Top 10 error types | Bar Chart | Quickly locate high-frequency problems
Top 20 slow requests | Data Table | Guide performance optimization
User activity trail | Data Table | Security auditing
Full-chain TraceId search | Search | Fast fault localization

7. Log Alerting with ElastAlert2 (complementing Prometheus)

When ERROR logs spike, you want a push notification immediately. ElastAlert2 watches the data arriving in Elasticsearch and fires alerts when rules match.

7.1 Running ElastAlert2 with Docker

# append to docker-compose.yml
elastalert:
  image: jertel/elastalert2:2.17.0
  container_name: elastalert2
  volumes:
    - ./elastalert/config.yaml:/opt/elastalert/config.yaml
    - ./elastalert/rules:/opt/elastalert/rules
  networks:
    - elk
  depends_on:
    - elasticsearch

7.2 Alert Rule

# elastalert/rules/high-error-rate.yaml
name: SpringBoot ERROR alert
type: frequency          # fire when more than N matches occur within the timeframe
index: springboot-*

# Alert when ERROR count exceeds 50 within 5 minutes
num_events: 50
timeframe:
  minutes: 5

filter:
  - term:
      level: "ERROR"

# Alert channel: WeCom (WeChat Work) webhook
alert:
  - "post"
alert_text: |
  🚨 Microservice ERROR alert
  Service: {0}
  ERRORs in the last 5 minutes: {1}
  Latest error: {2}
  Time: {3}
alert_text_args:
  - app
  - num_hits
  - msg
  - "@timestamp"

http_post_url: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"
# Caveat: the plain `post` alerter does not render Jinja templates like the one
# below; for templated payloads use the HTTP POST 2 alerter (post2 / http_post2_*)
http_post_payload:
  msgtype: "text"
  text:
    content: "{{alert_text}}"
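A frequency rule fires once more than num_events matches land inside the timeframe. The heart of that check is a sliding time window; a minimal Java sketch of the idea, not ElastAlert2's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FrequencyRule {

    private final int numEvents;        // num_events: 50
    private final long timeframeMillis; // timeframe: 5 minutes
    private final Deque<Long> hits = new ArrayDeque<>();

    FrequencyRule(int numEvents, long timeframeMillis) {
        this.numEvents = numEvents;
        this.timeframeMillis = timeframeMillis;
    }

    // Record one matching ERROR event; returns true when the window overflows.
    boolean onMatch(long timestampMillis) {
        hits.addLast(timestampMillis);
        while (!hits.isEmpty() && timestampMillis - hits.peekFirst() > timeframeMillis) {
            hits.removeFirst(); // expire events older than the timeframe
        }
        return hits.size() >= numEvents;
    }

    public static void main(String[] args) {
        FrequencyRule rule = new FrequencyRule(50, 5 * 60_000);
        boolean fired = false;
        for (int i = 0; i < 50; i++) fired = rule.onMatch(i * 1000L); // 50 errors in 50 s
        System.out.println("alert fired: " + fired); // true
    }
}
```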

8. Production Best Practices

8.1 Logging Conventions

// ✅ Recommended: structured logging with placeholders and context
log.error("Failed to create order, orderId={}, userId={}, reason={}",
          orderId, userId, e.getMessage(), e);

// ❌ Avoid: string concatenation (builds the message even when the level is off)
log.error("Failed to create order, orderId=" + orderId + ", error: " + e.getMessage());

// ✅ Log key business milestones at INFO
log.info("Order paid, orderId={}, amount={}, payType={}", orderId, amount, payType);

// ✅ Guard high-frequency DEBUG logs with isDebugEnabled
if (log.isDebugEnabled()) {
    log.debug("Stock cache hit, skuId={}, stock={}", skuId, stock);
}
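The isDebugEnabled guard above matters because argument expressions are evaluated at the call site even when the level is disabled. The JDK's own java.util.logging illustrates the same idea with Supplier-based messages, which are invoked only when the level is enabled (a standalone sketch, not the project's SLF4J setup):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLogging {

    // Returns how many times the "expensive" debug message was actually built.
    static int evaluationsWithLevel(Level loggerLevel) {
        Logger log = Logger.getLogger("lazy-demo");
        log.setLevel(loggerLevel);
        AtomicInteger evaluations = new AtomicInteger();
        // Supplier variant: invoked only if FINE is enabled for this logger
        log.fine(() -> "expensive dump #" + evaluations.incrementAndGet());
        return evaluations.get();
    }

    public static void main(String[] args) {
        System.out.println("FINE off: built " + evaluationsWithLevel(Level.INFO) + " times"); // 0
        System.out.println("FINE on:  built " + evaluationsWithLevel(Level.FINE) + " times"); // 1
    }
}
```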

8.2 Masking Sensitive Data

// Logback instantiates converters itself, so no Spring annotation is needed
public class SensitiveDataMaskingConverter extends ClassicConverter {

    private static final Pattern PHONE = Pattern.compile("1[3-9]\\d{9}");
    // Try the 18-digit form first: with the 15-digit alternative first,
    // only the first 15 digits of an 18-digit ID would match
    private static final Pattern ID_CARD = Pattern.compile("\\d{18}|\\d{15}");

    @Override
    public String convert(ILoggingEvent event) {
        String msg = event.getFormattedMessage();
        msg = PHONE.matcher(msg).replaceAll(m -> m.group().substring(0, 3) + "****" + m.group().substring(7));
        msg = ID_CARD.matcher(msg).replaceAll(m -> m.group().substring(0, 4) + "**********" + m.group().substring(14));
        return msg;
    }
}

Register it in logback-spring.xml, then use %mask in place of %msg in the encoder pattern:

<conversionRule conversionWord="mask" 
    converterClass="com.example.common.log.SensitiveDataMaskingConverter"/>
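Applied to a sample message, the phone rule behaves as follows; MaskingDemo is a standalone sketch reusing the same regex and slicing as the converter above:

```java
import java.util.regex.Pattern;

public class MaskingDemo {

    private static final Pattern PHONE = Pattern.compile("1[3-9]\\d{9}");

    // Keep the first 3 and last 4 digits, as in the converter above.
    static String maskPhone(String msg) {
        return PHONE.matcher(msg)
                .replaceAll(m -> m.group().substring(0, 3) + "****" + m.group().substring(7));
    }

    public static void main(String[] args) {
        System.out.println(maskPhone("user login ok, phone=13812345678"));
        // → user login ok, phone=138****5678
    }
}
```

Note that ordering matters when several patterns run over one message: an 18-digit ID number can contain a phone-shaped substring, so masking phones before IDs can mangle the ID mask.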

8.3 Elasticsearch Performance Tuning

# elasticsearch.yml production settings
cluster.name: elk-prod
node.name: es-node-01

# Lock memory to prevent swapping from degrading performance
bootstrap.memory_lock: true
# Also discourage swap at the OS level:
# echo "vm.swappiness=1" >> /etc/sysctl.conf

# Indexing buffer size (node-level setting)
indices.memory.index_buffer_size: 20%

# NOTE: index-level settings such as refresh_interval and number_of_replicas
# must NOT go in elasticsearch.yml (the node refuses to start); set them in
# the index template or via the settings API instead, e.g.:
#   "index.refresh_interval": "5s"    # default 1s; raise it under heavy writes
#   "index.number_of_replicas": 1     # 0 on a single node, 1+ in a real cluster

9. End-to-End Verification

# 1. Start the ELK Stack
cd elk && docker-compose up -d

# 2. Check service health
docker-compose ps
curl http://localhost:9200/_cluster/health?pretty
curl http://localhost:9600         # Logstash API

# 3. Start the Spring Boot services and fire a few requests to generate logs
curl http://localhost:8080/api/orders

# 4. Confirm Filebeat is shipping logs
docker logs filebeat --tail=50

# 5. Check that documents arrive in Elasticsearch
curl "http://localhost:9200/springboot-order-service-*/_count"

# 6. Open Kibana and check the dashboards
open http://localhost:5601

10. Summary

Component | Responsibility
Filebeat | Lightweight shipping of file and container logs, low resource footprint
Logstash | Structured parsing (Grok/JSON), field extraction, noise filtering
Elasticsearch | Distributed storage and full-text search, millisecond-level queries
Kibana | Visual analysis, dashboards, DevTools queries
ElastAlert2 | Rule-based alerting on log data, pushing to WeCom/DingTalk/email

Together with the earlier posts [Prometheus + Grafana monitoring], [Spring Cloud Gateway], and [Seata distributed transactions], this completes the blog's microservice observability loop:

The three pillars of observability
├── Metrics  → Prometheus + Grafana (metrics and alerting)
├── Logs     → ELK Stack (log collection and analysis)   ← this post
└── Traces   → SkyWalking / Jaeger (distributed tracing)

Up next: SkyWalking 9.x with Spring Boot 3.x: distributed tracing in practice, closing the loop on the three pillars.


Author: hshloveyy | Blog: https://92yangyi.top | Questions are welcome in the QQ group
