Prometheus from Scratch
The goal this time is to monitor services and send alerts. The overall flow is shown in the diagram below.
1. Run Prometheus (Docker)
docker run -itd \
-p 9090:9090 \
-v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
prom/prometheus
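To confirm the container came up, you can hit Prometheus's readiness endpoint (a quick check; the host IP 192.168.246.2 used throughout this post is assumed here):

docker ps | grep prom/prometheus
curl http://192.168.246.2:9090/-/ready
# should answer with something like: Prometheus is Ready.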
2. The initial prometheus.yml configuration file:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
3. By default, Prometheus exposes its own metrics at http://192.168.246.2:9090/metrics. A partial excerpt:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.6636e-05
go_gc_duration_seconds{quantile="0.25"} 0.000123346
go_gc_duration_seconds{quantile="0.5"} 0.000159706
go_gc_duration_seconds{quantile="0.75"} 0.000190857
go_gc_duration_seconds{quantile="1"} 0.001369042
4. You can open the Prometheus web UI on port 9090 and execute PromQL (Prometheus Query Language) expressions to get results.
For example: prometheus_target_interval_length_seconds{quantile="0.99"}
For PromQL syntax and more examples, see the official docs: https://prometheus.io/docs/prometheus/latest/querying/basics/
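As another quick sanity check (a sketch, assuming the built-in prometheus_http_requests_total metric of Prometheus 2.x), you can look at the per-second HTTP request rate of Prometheus itself:

rate(prometheus_http_requests_total[1m])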
5. The data above is Prometheus monitoring itself; next we produce our own data for it. There are many public exporters available, for example node_exporter, which exposes basic machine-level metrics.
You can also write your own exporter, for example in Python, to collect custom metrics; a minimal sketch follows.
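Here is a minimal custom-exporter sketch using the official prometheus_client library (pip install prometheus_client); the metric name myapp_queue_depth and port 8000 are illustrative assumptions, not from the original post:

# minimal_exporter.py - a hypothetical custom exporter sketch
import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical business metric; replace with whatever you actually measure.
queue_depth = Gauge('myapp_queue_depth', 'Number of pending jobs in the queue')

if __name__ == '__main__':
    # Expose /metrics on port 8000 so Prometheus can scrape it.
    start_http_server(8000)
    while True:
        # In a real exporter you would read this value from your application.
        queue_depth.set(random.randint(0, 100))
        time.sleep(5)

Prometheus would then scrape it by adding <host>:8000 as a target, just like the node_exporter targets configured below.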
Install node_exporter following the official example, but do not listen on 127.0.0.1: my Prometheus runs in Docker, which cannot reach the host's 127.0.0.1, so it would fail to scrape. Use the host's real address instead.
PS: other exporters are listed at https://prometheus.io/docs/instrumenting/exporters/
tar -xzvf node_exporter-*.*.tar.gz
cd node_exporter-*.*

# Start 3 example targets in separate terminals (listening on the host address, not 127.0.0.1):
./node_exporter --web.listen-address 192.168.246.2:8080
./node_exporter --web.listen-address 192.168.246.2:8081
./node_exporter --web.listen-address 192.168.246.2:8082
6. Modify prometheus.yml to add a job that scrapes the exporters. The updated file below adds one job with three exporter targets; the group labels are arbitrary.
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.246.2:8080', '192.168.246.2:8081']
        labels:
          group: 'production'
      - targets: ['192.168.246.2:8082']
        labels:
          group: 'canary'
Then check the Targets page in the UI to confirm the new exporters are being scraped.
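You can also ask Prometheus directly which targets it is scraping via its HTTP API (using the standard /api/v1/targets endpoint):

curl http://192.168.246.2:9090/api/v1/targets
# each target should report "health": "up" once the first scrape has succeeded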
7. Example metrics exposed by node_exporter:
node_cpu_seconds_total{cpu="0",mode="idle"} 2963.27
node_cpu_seconds_total{cpu="0",mode="iowait"} 0.38
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 0.35
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 19.19
node_cpu_seconds_total{cpu="0",mode="user"} 16.96
node_cpu_seconds_total{cpu="1",mode="idle"} 2965.47
node_cpu_seconds_total{cpu="1",mode="iowait"} 0.37
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0.03
node_cpu_seconds_total{cpu="1",mode="softirq"} 0.28
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 18.42
node_cpu_seconds_total{cpu="1",mode="user"} 17.95
8. To get the per-second rate of CPU time over the last 5 minutes, averaged across all CPUs of each instance (per mode), you can write:
avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
The result can be plotted on the Graph tab.
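A closely related query (a sketch built from the same node_cpu_seconds_total metric) gives the overall CPU utilization percentage per instance, by subtracting the idle share:

100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))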
9. Next, set up a recording rule. Create a file named prometheus.rules.yml:
groups:
  - name: cpu-node
    rules:
      - record: job_instance_mode:node_cpu_seconds:avg_rate5m
        expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
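For Prometheus to actually evaluate this file, prometheus.yml must reference it via rule_files (the full file is shown in step 14); the relevant snippet is:

rule_files:
  - 'prometheus.rules.yml'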
10. There are now several configuration files, so let's relaunch the container in a better way: put all local config files under /opt/prometheus/ and mount the whole directory; the old container can be removed.
The --web.enable-lifecycle flag enables hot reloading via curl -X POST http://192.168.246.2:9090/-/reload
docker run -itd -p 9090:9090 -v /opt/prometheus/:/etc/prometheus/ prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
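Before reloading, it is worth validating the whole configuration with promtool (shipped in the Prometheus release tarball; the 2.26.0 path below matches the one used later in this post). check config also follows the rule_files references:

./prometheus-2.26.0.linux-amd64/promtool check config /opt/prometheus/prometheus.yml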
11. Check the Rules page in the UI.
12. The above is only a recording rule; it does not alert. As a made-up test, let's say that avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5 should trigger a CPU alert.
We need an alerting rule like this:
groups:
  - name: example
    rules:
      - alert: HighCpuLatency
        expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5
        for: 10s
        labels:
          severity: page
        annotations:
          summary: High request latency
Add this to prometheus.rules.yml. Note that you cannot simply paste in a second top-level groups: key, because YAML keys must not repeat; merge the new group into the existing groups list instead. You can validate the rules file with promtool:
[root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml
Checking prometheus.rules.yml
  FAILED:
prometheus.rules.yml: yaml: unmarshal errors:
  line 6: mapping key "groups" already defined at line 1
prometheus.rules.yml: yaml: unmarshal errors:
  line 6: mapping key "groups" already defined at line 1

[root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml
Checking prometheus.rules.yml
  SUCCESS: 2 rules found
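For reference, a merged prometheus.rules.yml that passes the check, combining the recording rule from step 9 and the alert above under a single groups: key, looks like this:

groups:
  - name: cpu-node
    rules:
      - record: job_instance_mode:node_cpu_seconds:avg_rate5m
        expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
  - name: example
    rules:
      - alert: HighCpuLatency
        expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5
        for: 10s
        labels:
          severity: page
        annotations:
          summary: High request latency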
13. Hot-reload the configuration:
curl -X POST http://192.168.246.2:9090/-/reload
No container restart is needed this time.
On the Rules page a new rule appears; the alert first shows as pending, and once the condition has held for the configured duration it starts firing.
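The current alert state can also be queried over the HTTP API (the standard /api/v1/alerts endpoint), which is handy when watching the pending-to-firing transition:

curl http://192.168.246.2:9090/api/v1/alerts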
14. Next, start Alertmanager.
Two pieces of configuration are needed before starting it. First, point Prometheus at Alertmanager by adding its IP and port to prometheus.yml.
The alerting section appended at the end connects Prometheus to Alertmanager:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['192.168.246.2:8080', '192.168.246.2:8081']
        labels:
          group: 'production'
      - targets: ['192.168.246.2:8082']
        labels:
          group: 'canary'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["192.168.246.2:9093"]
The second piece is the Alertmanager configuration itself; let's test e-mail alerting first.
The parts that need changing are the SMTP settings (search online for how to obtain a QQ SMTP authorization code):
smtp_smarthost: 'smtp.qq.com:465'
smtp_from: '6171391@qq.com'
smtp_auth_username: '6171391@qq.com'
smtp_auth_password: '<QQ authorization code>'
smtp_require_tls: false
and the default receiver, which applies when no route filter matches:
- to: 'dfwl@163.com'
The full alertmanager.yml:
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '6171391@qq.com'
  smtp_auth_username: '6171391@qq.com'
  smtp_auth_password: '<QQ authorization code>'
  smtp_require_tls: false

# The directory from which notification templates are read.
templates:
  - '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 7h
  # A default receiver
  receiver: team-X-mails
  # The child route trees.
  routes:
    # This routes performs a regular expression match on alert labels to
    # catch alerts that are related to a list of services.
    - match_re:
        service: ^(foo1|foo2|baz)$
      receiver: team-X-mails
      # The service has a sub-route for critical alerts, any alerts
      # that do not match, i.e. severity != critical, fall-back to the
      # parent node and are sent to 'team-X-mails'
      routes:
        - match:
            severity: critical
          receiver: team-X-pager
    - match:
        service: files
      receiver: team-Y-mails
      routes:
        - match:
            severity: critical
          receiver: team-Y-pager
    # This route handles all alerts coming from a database service. If there's
    # no team to handle it, it defaults to the DB team.
    - match:
        service: database
      receiver: team-DB-pager
      # Also group alerts by affected database.
      group_by: [alertname, cluster, database]
      routes:
        - match:
            owner: team-X
          receiver: team-X-pager
          continue: true
        - match:
            owner: team-Y
          receiver: team-Y-pager

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Apply inhibition if the alertname is the same.
    # CAUTION:
    #   If all label names listed in `equal` are missing
    #   from both the source and target alerts,
    #   the inhibition rule will apply!
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'dfwl@163.com'

  - name: 'team-X-pager'
    email_configs:
      - to: 'team-X+alerts-critical@example.org'
    pagerduty_configs:
      - service_key: <team-X-key>

  - name: 'team-Y-mails'
    email_configs:
      - to: 'team-Y+alerts@example.org'

  - name: 'team-Y-pager'
    pagerduty_configs:
      - service_key: <team-Y-key>

  - name: 'team-DB-pager'
    pagerduty_configs:
      - service_key: <team-DB-key>
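The release tarball also ships amtool, which can validate this file before starting (a sketch, assuming amtool sits next to the alertmanager binary):

./amtool check-config alertmanager.yml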
15. Start it manually for a test run (in production you can run it via Docker, Kubernetes, etc.):
./alertmanager --config.file=alertmanager.yml
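For a Docker-based launch mirroring the Prometheus container above (the /opt/alertmanager/ host path is an assumption), something like this should work:

docker run -itd -p 9093:9093 \
  -v /opt/alertmanager/:/etc/alertmanager/ \
  prom/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml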
16. The alert shows up in the Alertmanager UI,
and after a short while the e-mail is sent.
group_interval: 1m means that after the first notification for a group has been sent, Alertmanager waits 1 minute before sending a notification covering new or still-firing alerts in that group.
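To test the mail path without waiting for the CPU rule, you can inject a throwaway alert straight into Alertmanager over its v2 API (a sketch; the alert name ManualTest is made up):

curl -X POST http://192.168.246.2:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"ManualTest","severity":"page"},"annotations":{"summary":"manually injected test alert"}}]'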