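Preface

The Prometheus community provides a large number of official and third-party exporters, allowing adopters to quickly set up monitoring for key services and infrastructure.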

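For a simple application and its environment, metrics are usually collected at several layers:

Ingress gateway: this can be a load balancer such as Nginx or HAProxy, or a microservice entry point provided by a framework such as Spring Cloud Zuul. In general we collect metrics for every HTTP request, such as the request path, HTTP method, response status code, and response time. The historical data lets us analyze business load, service health, and so on.
Application services: at a minimum, the resource usage of the application itself. For a Java program this can be gathered directly from JVM information; if the application is deployed in a container, it can be gathered from the container's resource usage. Beyond resource usage, in some cases we may also collect business-specific metrics from within the application.
Infrastructure: resource usage of the virtual or physical machines.
Other: the state of middleware used in the cluster, such as databases, caches, and message queues.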

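In these scenarios, besides using the exporters provided by the Prometheus community directly, a project may also need to implement custom exporters to collect and monitor metrics for its own specific purposes.

This article uses Spring Boot/Spring Cloud as an example to show how to define and expose custom metrics with the Prometheus SDK, and covers practical use cases for the four Prometheus metric types (Counter, Gauge, Histogram, Summary).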

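Extending a Spring application to support Prometheus scraping

Adding the Prometheus Java client dependency

> Version 0.0.24 is used here. In earlier versions, the monitoring endpoint exposed through Spring Boot could not correctly handle requests from the Prometheus server. Details: https://github.com/prometheus/ ... s/265

build.gradle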

dependencies {
    // ...
    compile 'io.prometheus:simpleclient:0.0.24'
    compile 'io.prometheus:simpleclient_spring_boot:0.0.24'
    compile 'io.prometheus:simpleclient_hotspot:0.0.24'
}

Enabling the Prometheus metrics endpoint

Add the @EnablePrometheusEndpoint annotation to enable the Prometheus endpoint. The example also uses DefaultExports from simpleclient_hotspot, which makes the metrics endpoint return information about the application's JVM.

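import io.prometheus.client.hotspot.DefaultExports;
import io.prometheus.client.spring.boot.EnablePrometheusEndpoint;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnablePrometheusEndpoint
public class GatewayApplication implements CommandLineRunner {

    public static void main(String[] args) {
        SpringApplication.run(GatewayApplication.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        // Register the JVM collectors provided by simpleclient_hotspot.
        DefaultExports.initialize();
    }
}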

By default the metrics endpoint is exposed at /prometheus; this can be changed through the endpoint configuration:

endpoints:
  prometheus:
    id: metrics
  metrics:
    id: springmetrics
    sensitive: false
    enabled: true

Start the application and visit http://localhost:8080/metrics to see output like the following:

# HELP jvm_gc_collection_seconds Time spent in a given JVM garbage collector in seconds.
# TYPE jvm_gc_collection_seconds summary
jvm_gc_collection_seconds_count{gc="PS Scavenge",} 11.0
jvm_gc_collection_seconds_sum{gc="PS Scavenge",} 0.18
jvm_gc_collection_seconds_count{gc="PS MarkSweep",} 2.0
jvm_gc_collection_seconds_sum{gc="PS MarkSweep",} 0.121
# HELP jvm_classes_loaded The number of classes that are currently loaded in the JVM
# TYPE jvm_classes_loaded gauge
jvm_classes_loaded 8376.0
# HELP jvm_classes_loaded_total The total number of classes that have been loaded since the JVM has started execution
# TYPE jvm_classes_loaded_total counter
...

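Adding an interceptor to prepare for instrumentation

Besides JVM state, we may also want custom metrics that capture system performance and business state, providing data to support later optimization. First, we use an interceptor to handle every request to the application.

Extend WebMvcConfigurerAdapter and override addInterceptors to register an interceptor for all requests (/**):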

@SpringBootApplication
@EnablePrometheusEndpoint
public class GatewayApplication extends WebMvcConfigurerAdapter implements CommandLineRunner {

    // ... main() and run() omitted

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        // Apply the metrics interceptor to every request path.
        registry.addInterceptor(new PrometheusMetricsInterceptor()).addPathPatterns("/**");
    }
}

PrometheusMetricsInterceptor extends HandlerInterceptorAdapter and overrides the parent methods to hook into request handling before it starts and after it completes.

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    // Called before the request is handled.
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        return super.preHandle(request, response, handler);
    }

    // Called after the request has completed.
    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        super.afterCompletion(request, response, handler, ex);
    }
}

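Custom metrics

Prometheus provides four metric types: Counter, Gauge, Histogram, and Summary.

1) Counter: a counter that only increases

A counter records values that only go up and never down, such as the total number of requests to an application (http_requests_total) or the CPU time used (process_cpu_seconds_total).

A Counter exposes only increment operations: inc() adds 1, and inc(double) adds a given non-negative amount.

By convention, Counter metric names end with _total.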

import io.prometheus.client.Counter;
import org.springframework.web.servlet.handler.HandlerInterceptorAdapter;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    static final Counter requestCounter = Counter.build()
            .name("io_namespace_http_requests_total").labelNames("path", "method", "code")
            .help("Total requests.").register();

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        String requestURI = request.getRequestURI();
        String method = request.getMethod();
        int status = response.getStatus();

        // One increment per completed request, keyed by path, method, and status code.
        requestCounter.labels(requestURI, method, String.valueOf(status)).inc();
        super.afterCompletion(request, response, handler, ex);
    }
}

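Counter.build() creates a Counter metric. name() sets the metric name, and labelNames() declares the label dimensions the metric carries. In afterCompletion we read the request path, method, and status code of the current request, then call inc() so that the count goes up by 1 for each request.

Counter.build()...register() registers the metric with the default CollectorRegistry, so its current value is returned whenever /metrics is accessed.

With io_namespace_http_requests_total we can:

Query the total number of requests to the application

PromQL

sum(io_namespace_http_requests_total)

Query the number of HTTP requests per second

PromQL

sum(rate(io_namespace_http_requests_total[5m]))

Query the top N URIs by request count

PromQL

topk(10, sum(io_namespace_http_requests_total) by (path))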
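
To make the per-second query more concrete, here is a minimal plain-Java sketch of what a rate calculation over a counter amounts to: the increase across the samples in a window, divided by the elapsed time, with a drop in value treated as a counter reset. This is a simplified model, not Prometheus's implementation (the real rate() also extrapolates to the window boundaries), and the class and method names are hypothetical.

```java
/** Simplified model of a per-second rate over counter samples; names are hypothetical. */
final class CounterRateSketch {

    /**
     * Returns the average per-second increase between the first and last sample.
     * timestampsSeconds must be ascending and the same length as values.
     */
    static double rate(double[] timestampsSeconds, double[] values) {
        if (timestampsSeconds.length != values.length || values.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double increase = 0.0;
        for (int i = 1; i < values.length; i++) {
            double prev = values[i - 1];
            double curr = values[i];
            // A drop means the process restarted and the counter began again from zero,
            // so the whole current value counts as new increase.
            increase += (curr >= prev) ? (curr - prev) : curr;
        }
        double elapsed = timestampsSeconds[values.length - 1] - timestampsSeconds[0];
        if (elapsed <= 0) {
            throw new IllegalArgumentException("timestamps must be ascending");
        }
        return increase / elapsed;
    }

    public static void main(String[] args) {
        // Hypothetical scrapes every 15 seconds; the counter resets between the 2nd and 3rd.
        double[] timestamps = {0, 15, 30, 45};
        double[] values = {100, 130, 10, 40};
        // Increase is 30 + 10 + 30 = 70 requests over 45 seconds.
        System.out.println(rate(timestamps, values));
    }
}
```

This is also why the raw counter value is rarely graphed directly: a restart sends it back to zero, while the rate stays meaningful.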

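2) Gauge: a value that can go up and down

A gauge reflects the current state of an application. When monitoring a host, for example, this could be the amount of free memory (node_memory_MemFree) or available memory (node_memory_MemAvailable); for a container, its current CPU or memory usage.

A Gauge object has two main methods, inc() and dec(), to raise or lower the value (set() assigns a value directly). Here we use a Gauge to record the number of HTTP requests currently being processed.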

public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    // ... code omitted

    // The status code is not known until the request completes, so this gauge is labeled
    // by path and method only; inc() and dec() must target the same label set.
    static final Gauge inprogressRequests = Gauge.build()
            .name("io_namespace_http_inprogress_requests").labelNames("path", "method")
            .help("Inprogress requests.").register();

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... code omitted
        // Increment when a request starts
        inprogressRequests.labels(request.getRequestURI(), request.getMethod()).inc();
        return super.preHandle(request, response, handler);
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... code omitted
        // Decrement when the request finishes
        inprogressRequests.labels(request.getRequestURI(), request.getMethod()).dec();

        super.afterCompletion(request, response, handler, ex);
    }
}

With io_namespace_http_inprogress_requests we can directly query the number of HTTP requests the application is currently processing:

PromQL

io_namespace_http_inprogress_requests{}

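3) Histogram: distribution statistics with built-in buckets

A histogram is mainly used to record sizes (such as HTTP request bytes) or the number of events within specified ranges (buckets).

Take the request latency requests_latency_seconds as an example: suppose we want to count how many HTTP response times fall within each of the ranges {.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10}.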

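public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    static final Histogram requestLatencyHistogram = Histogram.build().labelNames("path", "method", "code")
            .name("io_namespace_http_requests_latency_seconds_histogram").help("Request latency in seconds.")
            .register();

    // Request attribute holding the start time. The interceptor is a shared singleton,
    // so per-request state must not live in an instance field.
    private static final String START_NANOS = "prometheus.histogram.startNanos";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... code omitted
        request.setAttribute(START_NANOS, System.nanoTime());
        // ... code omitted
        return super.preHandle(request, response, handler);
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... code omitted
        // The status code is only known here, so the labels are resolved at observation time.
        double seconds = (System.nanoTime() - (Long) request.getAttribute(START_NANOS)) / 1e9;
        requestLatencyHistogram
                .labels(request.getRequestURI(), request.getMethod(), String.valueOf(response.getStatus()))
                .observe(seconds);
        // ... code omitted
    }
}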

The Histogram builder creates a Histogram metric. The default buckets are {.005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10}. To override them, call .buckets(double... buckets).

A Histogram automatically creates three kinds of series:

Total number of events: basename_count

Meaning: 2 HTTP requests have been observed so far

io_namespace_http_requests_latency_seconds_histogram_count{path="/",method="GET",code="200",} 2.0

Sum of all observed values: basename_sum

Meaning: the 2 HTTP requests took 13.107670803000001 seconds in total

io_namespace_http_requests_latency_seconds_histogram_sum{path="/",method="GET",code="200",} 13.107670803000001

Number of observed values falling into each bucket: basename_bucket{le="<inclusive upper bound>"}

Of the 2 requests, 0 had a response time <= 0.005 seconds

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.005",} 0.0

Of the 2 requests, 0 had a response time <= 0.01 seconds

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.01",} 0.0

Of the 2 requests, 0 had a response time <= 0.025 seconds, and the same holds for every bucket up to le="5.0"

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.025",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.05",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.075",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.1",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.25",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.5",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="0.75",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="1.0",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="2.5",} 0.0
io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="5.0",} 0.0
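
Of the 2 requests, 2 had a response time <= 7.5 seconds; bucket counts are cumulative, so both requests took between 5 and 7.5 seconds

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="7.5",} 2.0

Of the 2 requests, 2 had a response time <= 10 seconds

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="10.0",} 2.0

The le="+Inf" bucket counts every request regardless of response time, so it always equals basename_count (2 here)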

io_namespace_http_requests_latency_seconds_histogram_bucket{path="/",method="GET",code="200",le="+Inf",} 2.0
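
The cumulative nature of the bucket counts is easy to miss, so here is a small plain-Java sketch of how such counts could be derived from raw observations. It does not use the Prometheus client; the class name is hypothetical, and the two observed durations (6.5 s and 6.607670803 s) are assumptions chosen only so that both land between 5 and 7.5 seconds, as in the output above. The actual durations are not given in the source.

```java
/** Derives cumulative "le" bucket counts from raw observations; names are hypothetical. */
final class CumulativeBucketsSketch {

    // The default bucket bounds listed above, plus the implicit +Inf bucket.
    static final double[] DEFAULT_BUCKETS = {
        .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10,
        Double.POSITIVE_INFINITY
    };

    /** Each observation is counted in every bucket whose upper bound is >= the value. */
    static long[] cumulativeCounts(double[] upperBounds, double[] observations) {
        long[] counts = new long[upperBounds.length];
        for (double value : observations) {
            for (int i = 0; i < upperBounds.length; i++) {
                if (value <= upperBounds[i]) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] observations = {6.5, 6.607670803}; // assumed request durations in seconds
        long[] counts = cumulativeCounts(DEFAULT_BUCKETS, observations);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("le=" + DEFAULT_BUCKETS[i] + " -> " + counts[i]);
        }
    }
}
```

With these inputs every bucket up to le=5.0 stays at 0 and every bucket from le=7.5 onward reads 2, which matches the shape of the exposition output above.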

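4) Summary: distribution statistics defined on the client side

Summary and Histogram are very similar: both record the number or size of events and how they are distributed.

Both provide a count of events (_count) and a sum of values (_sum), so the _count and _sum series can be used to compute the same things, for example the average HTTP response time: rate(basename_sum[5m]) / rate(basename_count[5m]).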
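
As a rough illustration of the average-latency expression above: both rates are taken over the same window, so the window length cancels and the result is simply the growth in _sum divided by the growth in _count. The plain-Java sketch below shows that arithmetic; the class and method names are hypothetical, the starting values are made up, and no Prometheus API is involved.

```java
/** Average latency over a window from two scrapes of _sum and _count; names are hypothetical. */
final class AverageLatencySketch {

    /**
     * Equivalent in spirit to rate(sum) / rate(count): the elapsed time appears in both
     * rates and cancels, leaving delta(sum) / delta(count).
     */
    static double averageSeconds(double sumStart, double countStart,
                                 double sumEnd, double countEnd) {
        double deltaCount = countEnd - countStart;
        if (deltaCount <= 0) {
            // No new observations in the window: the average is undefined.
            return Double.NaN;
        }
        return (sumEnd - sumStart) / deltaCount;
    }

    public static void main(String[] args) {
        // Assumed scrape at the start of the window: 4 requests, 20.0 s in total.
        // Scrape at the end of the window: 12 requests, 51.029495508 s in total.
        System.out.println(averageSeconds(20.0, 4, 51.029495508, 12));
    }
}
```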

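Both can also be used to compute the distribution of samples, such as the median or the 90th percentile, where 0.0 <= quantile <= 1.0.

The difference is that a Histogram's quantiles are computed on the server side with the histogram_quantile function, while a Summary's quantiles are defined and computed directly on the client. As a result, a Summary is cheaper to query with PromQL, whereas a Histogram costs more at query time; conversely, a Histogram consumes fewer resources on the client.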
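
To show what server-side quantile estimation from buckets involves, here is a simplified plain-Java model of the linear interpolation that histogram_quantile is documented to perform: find the bucket that contains the target rank, then interpolate between that bucket's lower and upper bounds. This is a sketch under stated assumptions (bounds sorted ascending, last bound is +Inf, 0 < q <= 1), not the actual Prometheus code, and all names are hypothetical.

```java
/** Simplified bucket-based quantile estimation; names are hypothetical. */
final class HistogramQuantileSketch {

    /**
     * upperBounds: ascending bucket bounds, the last one being +Inf.
     * cumulativeCounts: cumulative observation count per bucket, same length.
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || cumulativeCounts.length != n || !(q > 0.0 && q <= 1.0)) {
            throw new IllegalArgumentException("need >= 2 buckets and 0 < q <= 1");
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;
        // First bucket whose cumulative count reaches the target rank.
        int b = 0;
        while (b < n - 1 && cumulativeCounts[b] < rank) {
            b++;
        }
        if (b == n - 1) {
            // The rank falls in the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double lower = (b == 0) ? 0.0 : upperBounds[b - 1];
        double countBelow = (b == 0) ? 0.0 : cumulativeCounts[b - 1];
        double countInBucket = cumulativeCounts[b] - countBelow;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[b] - lower) * (rank - countBelow) / countInBucket;
    }

    public static void main(String[] args) {
        // A reduced bucket layout where both observations fall in (5, 7.5].
        double[] bounds = {5, 7.5, 10, Double.POSITIVE_INFINITY};
        double[] counts = {0, 2, 2, 2};
        System.out.println(quantile(0.5, bounds, counts)); // 5 + 2.5 * (1 / 2) = 6.25
    }
}
```

The estimate can only be as precise as the bucket layout: any value inside (5, 7.5] yields the same counts, which is the accuracy trade-off a Histogram makes in exchange for cheap client-side recording.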

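public class PrometheusMetricsInterceptor extends HandlerInterceptorAdapter {

    static final Summary requestLatency = Summary.build()
            .name("io_namespace_http_requests_latency_seconds_summary")
            .quantile(0.5, 0.05) // median, with 5% tolerated error
            .quantile(0.9, 0.01) // 0.9 quantile, with 1% tolerated error
            .labelNames("path", "method", "code")
            .help("Request latency in seconds.").register();

    // Per-request start time is kept in a request attribute because the
    // interceptor instance is shared across request threads.
    private static final String START_NANOS = "prometheus.summary.startNanos";

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) throws Exception {
        // ... code omitted
        request.setAttribute(START_NANOS, System.nanoTime());
        // ... code omitted
        return super.preHandle(request, response, handler);
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response, Object handler, Exception ex) throws Exception {
        // ... code omitted
        // The status code is only known once the request has completed.
        double seconds = (System.nanoTime() - (Long) request.getAttribute(START_NANOS)) / 1e9;
        requestLatency
                .labels(request.getRequestURI(), request.getMethod(), String.valueOf(response.getStatus()))
                .observe(seconds);
        // ... code omitted
    }
}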

A Summary metric automatically creates several time series:

Total number of events

Meaning: 12 HTTP requests have been observed so far

io_namespace_http_requests_latency_seconds_summary_count{path="/",method="GET",code="200",} 12.0

Sum of all observed values

Meaning: the 12 HTTP requests took 51.029495508 seconds in total

io_namespace_http_requests_latency_seconds_summary_sum{path="/",method="GET",code="200",} 51.029495508

Distribution of the observed values

Meaning: the median response time of the 12 HTTP requests is 3.052404983 seconds

io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.5",} 3.052404983

Meaning: the 0.9 quantile (90th percentile) of the response time of the 12 HTTP requests is 8.003261666 seconds

io_namespace_http_requests_latency_seconds_summary{path="/",method="GET",code="200",quantile="0.9",} 8.003261666

Exposing business metrics with a Collector

Besides building metrics with Counter, Summary, Gauge, and so on inside an interceptor, we can also expose business metrics through a custom Collector.

@SpringBootApplication
@EnablePrometheusEndpoint
public class GatewayApplication extends WebMvcConfigurerAdapter implements CommandLineRunner {

    @Autowired
    private CustomExporter customExporter;

    // ... code omitted

    @Override
    public void run(String... args) throws Exception {
        // ... code omitted
        // Register the custom collector with the default registry.
        customExporter.register();
    }
}

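CustomExporter extends io.prometheus.client.Collector. After the Collector's register() method has been called, each request to /metrics automatically pulls the current metrics from the Collector's collect() method.

Because CustomExporter lives in the Spring IoC container, it can access business code directly and return whatever business metrics are needed.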

import io.prometheus.client.Collector;
import io.prometheus.client.GaugeMetricFamily;
import org.springframework.stereotype.Component;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

@Component
public class CustomExporter extends Collector {

    @Override
    public List<MetricFamilySamples> collect() {
        List<MetricFamilySamples> mfs = new ArrayList<>();

        // Create the metric
        GaugeMetricFamily labeledGauge =
                new GaugeMetricFamily("io_namespace_custom_metrics", "custom metrics", Collections.singletonList("labelname"));

        // Set the metric's label value and sample value
        labeledGauge.addMetric(Collections.singletonList("labelvalue"), 1);

        mfs.add(labeledGauge);
        return mfs;
    }
}

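Of course, CounterMetricFamily and SummaryMetricFamily can be used here to declare other metric types.

Conclusion

That's it. Start the application and visit http://localhost:8080/metrics to see the metrics defined above.

This article covered two ways to define custom metrics in a Spring application:

Interceptor/filter: for collecting statistics on all requests to the application
Custom Collector: for collecting metrics related to the application's business capabilities

It also covered the four metric types and their use cases:

Counter: a counter that only increases
Gauge: a value that can go up and down
Histogram: distribution statistics with built-in buckets
Summary: distribution statistics defined on the client side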
