Spring Boot 2 Metrics: Integrating actuator with influxdb, with Grafana for Monitoring and Alerting

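Open-source components for log collection and metrics monitoring are now countless, yet many people still just run tail -f on a log file. As container technology matures, tail -f is no longer enough for logs and metrics; you may not even be able to log in to the machine where the service is deployed. Log collection and metrics collection have therefore become indispensable. Logs can be shipped to ES (Elasticsearch) with logstash or filebeat for searching. For metrics, Spring Boot provides the actuator component, which gathers statistics on CPU, memory, threads, requests, and more. This article gives a rough walkthrough of integrating influxdb to collect the data and using Grafana to display it.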

Final dashboard template: https://github.com/Ryan-Miao/boot-metrics-exporter/blob/master/grafana/grafana-dashboard-template.json

The end result is the following set of dashboards:

Statistics on the redis cache hit rate:

Statistics on individual important requests:

Alerts based on the health check:

Installing influxdb and Grafana

Install influxdb:

https://www.cnblogs.com/woshimrf/p/docker-influxdb.html

Install Grafana:

https://www.cnblogs.com/woshimrf/p/docker-grafana.html

Spring Boot configuration

You can use the prepackaged starter directly:

https://github.com/Ryan-Miao/boot-metrics-exporter

Or:

Add the dependencies

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-influx</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>

Define MeterConfig to set common tags in one place, such as the instance id

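import io.micrometer.core.instrument.MeterRegistry;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.stereotype.Component;

@Component
public class MeterConfig implements MeterRegistryCustomizer<MeterRegistry> {
    private static final Logger LOGGER = LoggerFactory.getLogger(MeterConfig.class);

    @Override
    public void customize(MeterRegistry registry) {
        try {
            // Tag every metric with the host IP so instances can be told apart.
            String hostAddress = InetAddress.getLocalHost().getHostAddress();
            LOGGER.debug("Setting metrics instance id to ip: {}", hostAddress);
            registry.config().commonTags("instance-id", hostAddress);
        } catch (UnknownHostException e) {
            // Fall back to a random UUID when the IP cannot be resolved.
            String uuid = UUID.randomUUID().toString();
            registry.config().commonTags("instance-id", uuid);
            LOGGER.error("Failed to get the instance ip, setting instance id to uuid: {}", uuid, e);
        }
    }
}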

Add the corresponding configuration:

management:
  metrics:
    export:
      influx:
        db: my-db
        uri: http://192.168.5.9:8086
        user-name: admin
        password: admin
        enabled: true
    web:
      server:
        auto-time-requests: true
    tags:
      app: ${spring.application.name}

Here the metrics are exported to influxdb; many other storage backends are available.
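To make the export step more concrete: influxdb 1.x accepts points over HTTP in its line protocol, and the micrometer influx registry produces such lines for you. The sketch below only illustrates the rough shape of one pushed point. The class LineProtocolSketch, its methods, and the sample tag values are hypothetical; it assumes a simple measurement name and a single field called value, and it does not cover every escaping rule.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Illustrative only: formats one point in influxdb line protocol. */
class LineProtocolSketch {

    /** Escapes commas, equals signs and spaces, which are special in tag keys and values. */
    static String escapeTag(String s) {
        return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    /** Builds: measurement,tag1=v1,tag2=v2 value=<number> <timestamp>. */
    static String toLine(String measurement, Map<String, String> tags, double value, long timestamp) {
        // Sort tags by key so the output is deterministic.
        String tagPart = new TreeMap<>(tags).entrySet().stream()
                .map(e -> "," + escapeTag(e.getKey()) + "=" + escapeTag(e.getValue()))
                .collect(Collectors.joining());
        // Assumption: the measurement name contains no characters that need escaping.
        return measurement + tagPart + " value=" + value + " " + timestamp;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new TreeMap<>();
        tags.put("app", "demo");               // hypothetical application name
        tags.put("instance-id", "10.0.0.1");   // hypothetical instance id
        // The timestamp unit depends on the precision used when writing.
        System.out.println(toLine("health", tags, 100.0, 1556813561098L));
        // prints: health,app=demo,instance-id=10.0.0.1 value=100.0 1556813561098
    }
}
```

The common tags set in MeterConfig and under management.metrics.tags end up in the tag section of each line, which is what lets Grafana filter by app or instance.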

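Network configuration

grafana and influxdb may be deployed in one VPC, for example a monitoring cluster, while the business systems to be monitored are spread across the VPCs of the various business lines. The business clusters therefore need network and port access to influxdb.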

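Custom Metrics

The health endpoint exposed by Spring Boot actuator only reports a status such as up/down. I have not found a way to evaluate a threshold on that in grafana, so I converted it to a number.

Custom MeterBinder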

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import lombok.Data;

@Data
public class HealthMetrics implements MeterBinder {
    /**
     * 100 = up
     * 0 = down
     * 0 = unknown
     */
    // volatile: written by the scheduler thread, read whenever the gauge is sampled
    private volatile Integer health = 100;

    @Override
    public void bindTo(MeterRegistry registry) {
        Gauge.builder("health", () -> health)
                .register(registry);
    }
}

Refresh the status every 30s:

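import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import javax.annotation.PostConstruct;

public abstract class AbstractHealthCheckStatusSetter {
    private final HealthMetrics healthMetrics;

    protected AbstractHealthCheckStatusSetter(HealthMetrics healthMetrics) {
        this.healthMetrics = healthMetrics;
    }

    /**
     * Defines how the health status is derived. Implementations update the value of HealthMetrics.health.
     */
    public abstract void setHealthStatus(HealthMetrics h);

    /**
     * Periodically refreshes the health metric.
     */
    @PostConstruct
    void doSet() {
        ScheduledExecutorService scheduledExecutorService = new ScheduledThreadPoolExecutor(1);
        // First run after 30s, then 30s after each run completes.
        scheduledExecutorService.scheduleWithFixedDelay(
                () -> setHealthStatus(healthMetrics), 30L, 30L, TimeUnit.SECONDS);
    }
}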
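One detail worth keeping in mind with scheduleWithFixedDelay: according to the ScheduledExecutorService documentation, if any execution of the task throws an exception, subsequent executions are suppressed. If the health check ever throws, the gauge would silently stop updating. A minimal sketch of one way to guard against that follows; SafeScheduleSketch and guarded are hypothetical names that are not part of the starter, and the 50 ms delay is chosen only so the demo finishes quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SafeScheduleSketch {

    /** Wraps a task so that a runtime exception is reported instead of propagating. */
    static Runnable guarded(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                // Swallow and report, so the scheduler keeps running the task.
                System.err.println("health update failed: " + e);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        AtomicInteger runs = new AtomicInteger();
        // The first run fails on purpose; later runs still happen thanks to guarded().
        scheduler.scheduleWithFixedDelay(guarded(() -> {
            if (runs.incrementAndGet() == 1) {
                throw new IllegalStateException("first run fails");
            }
        }), 0L, 50L, TimeUnit.MILLISECONDS);
        Thread.sleep(300L);
        scheduler.shutdownNow();
        System.out.println("runs = " + runs.get());
    }
}
```

In the setter above, the same idea would amount to passing guarded(() -> setHealthStatus(healthMetrics)) to the scheduler.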

Implementation class

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.boot.actuate.health.Status;

public class HealthCheckStatusSetter extends AbstractHealthCheckStatusSetter {
    private final HealthEndpoint healthEndpoint;

    public HealthCheckStatusSetter(HealthMetrics healthMetrics, HealthEndpoint healthEndpoint) {
        super(healthMetrics);
        this.healthEndpoint = healthEndpoint;
    }

    @Override
    public void setHealthStatus(HealthMetrics healthMetrics) {
        Health health = healthEndpoint.health();
        if (health != null) {
            Status status = health.getStatus();
            switch (status.getCode()) {
                case "UP":
                    healthMetrics.setHealth(100);
                    break;
                case "DOWN":
                case "UNKNOWN":
                default:
                    // DOWN, UNKNOWN and any other status all count as unhealthy.
                    healthMetrics.setHealth(0);
                    break;
            }
        }
    }
}

Register the beans

    @Bean
    @ConditionalOnMissingBean
    public HealthMetrics healthMetrics() {
        return new HealthMetrics();
    }

    /**
     * HealthEndpoint is used here to judge the health of the system. If you need something else,
     * implement AbstractHealthCheckStatusSetter and set health yourself.
     */
    @Bean
    @ConditionalOnMissingBean
    @ConditionalOnBean(HealthEndpoint.class)
    public AbstractHealthCheckStatusSetter healthCheckSchedule(HealthEndpoint healthEndpoint, HealthMetrics healthMetrics) {
        return new HealthCheckStatusSetter(healthMetrics, healthEndpoint);
    }

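Redis cache hit-rate statistics

The whole metrics setup is built on Spring Boot actuator, and actuator uses io.micrometer to do the measuring. Any metric can therefore be added by defining custom micrometer metrics. For example, redis is commonly used as a cache, and the cache hit rate is something we care about. You can write your own counters to record it: hit+1 on a hit, miss+1 on a miss.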
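As a rough illustration of the hand-written counter approach, here is a minimal sketch using only the JDK. CountingCacheSketch is a hypothetical class; an in-memory map stands in for redis, and LongAdder stands in for what would be two micrometer counters (for instance tagged result=hit and result=miss) in a real application.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Illustrative cache wrapper that counts hits and misses. */
class CountingCacheSketch<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>(); // stand-in for redis
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    /** Returns the cached value, loading and storing it on a miss. The loader must not return null. */
    V get(K key, Function<K, V> loader) {
        V cached = store.get(key);
        if (cached != null) {
            hits.increment();   // hit + 1
            return cached;
        }
        misses.increment();     // miss + 1
        V loaded = loader.apply(key);
        store.put(key, loaded);
        return loaded;
    }

    long hits() {
        return hits.sum();
    }

    long misses() {
        return misses.sum();
    }

    public static void main(String[] args) {
        CountingCacheSketch<String, Integer> cache = new CountingCacheSketch<>();
        cache.get("a", String::length);   // miss
        cache.get("a", String::length);   // hit
        cache.get("bb", String::length);  // miss
        System.out.println("hits=" + cache.hits() + " misses=" + cache.misses());
        // prints: hits=1 misses=2
    }
}
```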

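Alternatively, use redisson directly.

We use RedissonCache to integrate with spring cache, in which case the cache hit metrics are already collected.

Definition of the basic cache metrics:

However, the results are stored as separate rows:

How can the hit rate be computed from this?

hit-rate = sum(hit) / (sum(hit) + sum(miss))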
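The formula can be checked on a few sample rows. The sketch below assumes each stored row carries a cache name, a result tag (hit or miss) and a value, mirroring the row-per-result layout described above. HitRateSketch, Row and the sample numbers are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HitRateSketch {

    /** One stored row: which cache, whether it was a hit or a miss, and the count. */
    static final class Row {
        final String cache;
        final String result;
        final double value;

        Row(String cache, String result, double value) {
            this.cache = cache;
            this.result = result;
            this.value = value;
        }
    }

    /** Computes sum(hit) / (sum(hit) + sum(miss)) per cache; 0.0 when there were no gets. */
    static Map<String, Double> hitRates(List<Row> rows) {
        Map<String, double[]> totals = new TreeMap<>(); // cache -> [hit, miss]
        for (Row r : rows) {
            double[] t = totals.computeIfAbsent(r.cache, k -> new double[2]);
            if ("hit".equals(r.result)) {
                t[0] += r.value;
            } else if ("miss".equals(r.result)) {
                t[1] += r.value;
            }
        }
        Map<String, Double> rates = new TreeMap<>();
        for (Map.Entry<String, double[]> e : totals.entrySet()) {
            double total = e.getValue()[0] + e.getValue()[1];
            rates.put(e.getKey(), total == 0 ? 0.0 : e.getValue()[0] / total);
        }
        return rates;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("user", "hit", 30), new Row("user", "miss", 10),
                new Row("order", "hit", 5), new Row("order", "miss", 15));
        System.out.println(hitRates(rows));
        // prints: {order=0.25, user=0.75}
    }
}
```

Because hit and miss live in different rows, the two sums have to be brought side by side before dividing, which is why the series needs to be consolidated first.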

So I consolidated this series by hand:

DROP CONTINUOUS QUERY "cq_cache_hit" ON "my-db"

DROP CONTINUOUS QUERY "cq_cache_miss" ON "my-db"

DROP MEASUREMENT "cache_hit_rate"

CREATE CONTINUOUS QUERY "cq_cache_hit" ON "my-db" RESAMPLE EVERY 10m BEGIN SELECT sum("value") AS hit INTO "cache_hit_rate" FROM "rp_30days"."cache_gets" WHERE ("result" = 'hit') GROUP BY time(10m), "app", "cache" fill(0) END

CREATE CONTINUOUS QUERY "cq_cache_miss" ON "my-db" RESAMPLE EVERY 10m BEGIN SELECT sum("value") AS miss INTO "cache_hit_rate" FROM "rp_30days"."cache_gets" WHERE ("result" = 'miss') GROUP BY time(10m), "app", "cache" fill(0) END

Monitoring and alerting

Grafana provides an alert feature: when a queried metric fails to meet its threshold, an alert is sent.
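For the numeric health gauge, such a rule boils down to comparing an aggregate of recent samples with a threshold. The sketch below is only a local illustration of that idea, not Grafana's implementation. AlertRuleSketch is a hypothetical name, the choice of the average as the aggregate is an assumption, and so is treating an empty window as unhealthy.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class AlertRuleSketch {

    /** Returns true when the average of the recent samples falls below the threshold. */
    static boolean shouldAlert(List<Integer> samples, double threshold) {
        if (samples.isEmpty()) {
            return true; // assumption: no data is treated as unhealthy
        }
        double avg = samples.stream().mapToInt(Integer::intValue).average().orElse(0.0);
        return avg < threshold;
    }

    public static void main(String[] args) {
        // Health samples from the last few intervals: 100 = up, 0 = down.
        System.out.println(shouldAlert(Arrays.asList(100, 100, 100), 100)); // false
        System.out.println(shouldAlert(Arrays.asList(100, 0, 100), 100));   // true
        System.out.println(shouldAlert(Collections.<Integer>emptyList(), 100)); // true
    }
}
```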

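influxdb or Prometheus?

For storing the collected metrics, most tutorials use Prometheus, whose ecosystem is more complete. I chose influxdb purely because of container networking. Prometheus pulls data by connecting to each instance, so it has to be allowed into the business networks, which would have meant opening network access again and again. On top of that, the different k8s clusters sit on networks that cannot reach each other, and I did not find a way to connect them. With influxdb, the instances simply push their data. For the same reason, es would also be an option.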

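influxdb has single-node limitations, as well as stability problems once the data volume grows. The data needs to be aggregated over sensible time intervals. For example, for queries spanning days or months, roll up the fine-grained statistics in advance.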
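The idea of rolling up in advance can be shown in a few lines. The sketch below sums fine-grained points into fixed-width time buckets, which is conceptually what a GROUP BY time(10m) aggregation does on the database side. DownsampleSketch, its method, and the sample data are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class DownsampleSketch {

    /** Sums values into buckets of bucketSeconds, keyed by the bucket's start time. */
    static Map<Long, Double> sumByBucket(long[] epochSeconds, double[] values, long bucketSeconds) {
        Map<Long, Double> buckets = new TreeMap<>();
        for (int i = 0; i < epochSeconds.length; i++) {
            // Round the timestamp down to the start of its bucket.
            long bucketStart = Math.floorDiv(epochSeconds[i], bucketSeconds) * bucketSeconds;
            buckets.merge(bucketStart, values[i], Double::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        long[] times = {0L, 30L, 599L, 600L, 630L};
        double[] values = {1, 2, 3, 4, 5};
        // Five fine-grained points collapse into two 10-minute buckets.
        System.out.println(sumByBucket(times, values, 600L));
        // prints: {0=6.0, 600=9.0}
    }
}
```

A long-range dashboard then reads the small rolled-up series instead of scanning every raw point.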

Another option, said to scale without limit, is OpenTSDB. I have not looked into it yet.

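Problems you will run into

The current demo runs a single influxdb node and is extremely fragile: a query over a slightly longer time range brings it down. It is only good for a demo, or for simple real-time views such as the last 15 minutes. For long-range aggregation over recent months or a year, the only option is to pre-aggregate coarse-grained summaries in advance.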

References

https://github.com/OpenTSDB/opentsdb