prometheus从零开始

本次的想法是做服务监控并告警主要线路如下图所示

1、运行prometheus docker方式

docker run -itd \

-p 9090:9090 \

-v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \

prom/prometheus

2、prometheus.yml 初始配置文件如下：

global:

  scrape_interval:     15s # By default, scrape targets every 15 seconds.   全局默认值 15秒抓取一次数据

  # Attach these labels to any time series or alerts when communicating with

  # external systems (federation, remote storage, Alertmanager).

  external_labels:

    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:

# Here it's Prometheus itself.

scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.

    scrape_interval: 5s

    static_configs:

      - targets: ['localhost:9090']

3、默认 prometheus 会有自己的指标接口http://192.168.246.2:9090/metrics 内容部分截取如下

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.

# TYPE go_gc_duration_seconds summary

go_gc_duration_seconds{quantile="0"} 2.6636e-05

go_gc_duration_seconds{quantile="0.25"} 0.000123346

go_gc_duration_seconds{quantile="0.5"} 0.000159706

go_gc_duration_seconds{quantile="0.75"} 0.000190857

go_gc_duration_seconds{quantile="1"} 0.001369042

4、可以登录9090端口去看看prometheus主界面可以执行PromQL (Prometheus Query Language) 来excute得到结果

比如这个 prometheus_target_interval_length_seconds{quantile="0.99"}

具体PromQL语法示例请参考官网https://prometheus.io/docs/prometheus/latest/querying/basics/

5、上面的数据是prometheus自己的，下面我们自己生产数据给它有很多公共的exporter 可以用比如 node_exporter 他可以暴露机器一些基本的通用指标。

也可以执行python自定义编程取指标让自己成为一个exporter

安装node_exporter 官网例子但是不要使用127.0.0.1 因为我的prometheus是docker起的和宿主机的127.0.0.1是不通的它抓取不到数据的，请改成实际的主机地址

ps:其他exporter 可参考地址 https://prometheus.io/docs/instrumenting/exporters/

tar -xzvf node_exporter-*.*.tar.gz

cd node_exporter-*.*

# Start 3 example targets in separate terminals:

./node_exporter --web.listen-address 127.0.0.1:8080

./node_exporter --web.listen-address 127.0.0.1:8081

./node_exporter --web.listen-address 127.0.0.1:8082

6、需要修改prometheus.yml增加job 抓取exproter 修改后的如下增加了一个job 里面有三个exporter 标签是随便配的

global:

  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with

  # external systems (federation, remote storage, Alertmanager).

  external_labels:

    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:

# Here it's Prometheus itself.

scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.

    scrape_interval: 5s

    static_configs:

      - targets: ['localhost:9090']

  - job_name: 'node'

    # Override the global default and scrape targets from this job every 5 seconds.

    scrape_interval: 5s

    static_configs:

      - targets: ['192.168.246.2:8080', '192.168.246.2:8081']

        labels:

          group: 'production'

      - targets: ['192.168.246.2:8082']

        labels:

          group: 'canary'

6、查看页面是否ok了

7、可以看看node_exporter 暴露的指标例子比如有如下的

node_cpu_seconds_total{cpu="0",mode="idle"} 2963.27

node_cpu_seconds_total{cpu="0",mode="iowait"} 0.38

node_cpu_seconds_total{cpu="0",mode="irq"} 0

node_cpu_seconds_total{cpu="0",mode="nice"} 0

node_cpu_seconds_total{cpu="0",mode="softirq"} 0.35

node_cpu_seconds_total{cpu="0",mode="steal"} 0

node_cpu_seconds_total{cpu="0",mode="system"} 19.19

node_cpu_seconds_total{cpu="0",mode="user"} 16.96

node_cpu_seconds_total{cpu="1",mode="idle"} 2965.47

node_cpu_seconds_total{cpu="1",mode="iowait"} 0.37

node_cpu_seconds_total{cpu="1",mode="irq"} 0

node_cpu_seconds_total{cpu="1",mode="nice"} 0.03

node_cpu_seconds_total{cpu="1",mode="softirq"} 0.28

node_cpu_seconds_total{cpu="1",mode="steal"} 0

node_cpu_seconds_total{cpu="1",mode="system"} 18.42

node_cpu_seconds_total{cpu="1",mode="user"} 17.95

8、如果我们想看近5分钟内每个实例的所有cpus的平均每秒CPU时间速率可以这样写

avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

图例结果

9、下面设置一个rules规则，写一个文件 prometheus.rules.yml

groups:

- name: cpu-node

  rules:

  - record: job_instance_mode:node_cpu_seconds:avg_rate5m

    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))

10、现在发现配置文件太多了我们重新用另一种方式启动docker 优化一下把本地配置文件都放在 /opt/prometheus/，原来的docker可删除了。

--web.enable-lifecycle 参数支持热更新 接口是curl -X POST http://192.168.246.2:9090/-/reload

docker run -itd -p 9090:9090 -v /opt/prometheus/:/etc/prometheus/ prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle

11、查看rules

12、上面只是规则，并没有告警，我们假设 avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5 就触发 cpu警告这是假设的测试

我们需要rules文件如下：

groups:

- name: example

  rules:

  - alert: HighCpuLatency

    expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m])) > 0.5

    for: 10s

    labels:

      severity: page

    annotations:

      summary: High request latency

写到prometheus.rules.yml文件中，注意groups不能复制进去 key不能重复可以使用命令检查rules文件是否正确

[root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml

Checking prometheus.rules.yml

  FAILED:

prometheus.rules.yml: yaml: unmarshal errors:

  line 6: mapping key "groups" already defined at line 1

prometheus.rules.yml: yaml: unmarshal errors:

  line 6: mapping key "groups" already defined at line 1

[root@test prometheus]# ./prometheus-2.26.0.linux-amd64/promtool check rules prometheus.rules.yml

Checking prometheus.rules.yml

  SUCCESS: 2 rules found

[root@test prometheus]#

13、热更新一下

curl -X POST http://192.168.246.2:9090/-/reload

这次不用重启docker了

查看页面 rules会增加一个且alert会先有pending状态，等符合条件后就触发告警

14、下面启动alertmanager

启动之前要做两个配置，首先把 alertmanager 的IP和端口配置到prometheus.yml中

最会面增加了 alerting的配置这样 prometheus 就连上 alertmanager

global:

  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with

  # external systems (federation, remote storage, Alertmanager).

  external_labels:

    monitor: 'codelab-monitor'

rule_files:

  - 'prometheus.rules.yml'

# A scrape configuration containing exactly one endpoint to scrape:

# Here it's Prometheus itself.

scrape_configs:

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.

    scrape_interval: 5s

    static_configs:

      - targets: ['localhost:9090']

  - job_name: 'node'

    # Override the global default and scrape targets from this job every 5 seconds.

    scrape_interval: 5s

    static_configs:

      - targets: ['192.168.246.2:8080', '192.168.246.2:8081']

        labels:

          group: 'production'

      - targets: ['192.168.246.2:8082']

        labels:

          group: 'canary'

alerting:

  alertmanagers:

    - static_configs:

        - targets: ["192.168.246.2:9093"]

第二个配置我们先测试邮件告警，写的 alertmanager 配置如下

注意修改的部分 qq如何申请授权码请百度一下

  smtp_smarthost: 'smtp.qq.com:465'

  smtp_from: '6171391@qq.com'

  smtp_auth_username: '6171391@qq.com'

  smtp_auth_password: 'qq授权码'

  smtp_require_tls: false
默认不做任何过滤选择的接收人

  - to: 'dfwl@163.com'

global:

  # The smarthost and SMTP sender used for mail notifications.

  smtp_smarthost: 'smtp.qq.com:465'

  smtp_from: '6171391@qq.com'

  smtp_auth_username: '6171391@qq.com'

  smtp_auth_password: 'qq授权码'

  smtp_require_tls: false

# The directory from which notification templates are read.

templates:

- '/etc/alertmanager/template/*.tmpl'

# The root route on which each incoming alert enters.

route:

  group_by: ['alertname', 'cluster', 'service']

  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch

  # of new alerts that started firing for that group.

  group_interval: 1m

  # If an alert has successfully been sent, wait 'repeat_interval' to

  # resend them.

  repeat_interval: 7h

  # A default receiver

  receiver: team-X-mails

  # The child route trees.

  routes:

  # This routes performs a regular expression match on alert labels to

  # catch alerts that are related to a list of services.

  - match_re:

      service: ^(foo1|foo2|baz)$

    receiver: team-X-mails

    # The service has a sub-route for critical alerts, any alerts

    # that do not match, i.e. severity != critical, fall-back to the

    # parent node and are sent to 'team-X-mails'

    routes:

    - match:

        severity: critical

      receiver: team-X-pager

  - match:

      service: files

    receiver: team-Y-mails

    routes:

    - match:

        severity: critical

      receiver: team-Y-pager

  # This route handles all alerts coming from a database service. If there's

  # no team to handle it, it defaults to the DB team.

  - match:

      service: database

    receiver: team-DB-pager

    # Also group alerts by affected database.

    group_by: [alertname, cluster, database]

    routes:

    - match:

        owner: team-X

      receiver: team-X-pager

      continue: true

    - match:

        owner: team-Y

      receiver: team-Y-pager

# Inhibition rules allow to mute a set of alerts given that another alert is

# firing.

# We use this to mute any warning-level notifications if the same alert is

# already critical.

inhibit_rules:

- source_match:

    severity: 'critical'

  target_match:

    severity: 'warning'

  # Apply inhibition if the alertname is the same.

  # CAUTION:

  #   If all label names listed in `equal` are missing

  #   from both the source and target alerts,

  #   the inhibition rule will apply!

  equal: ['alertname', 'cluster', 'service']

receivers:

- name: 'team-X-mails'

  email_configs:

  - to: 'dfwl@163.com'

- name: 'team-X-pager'

  email_configs:

  - to: 'team-X+alerts-critical@example.org'

  pagerduty_configs:

  - service_key: <team-X-key>

- name: 'team-Y-mails'

  email_configs:

  - to: 'team-Y+alerts@example.org'

- name: 'team-Y-pager'

  pagerduty_configs:

  - service_key: <team-Y-key>

- name: 'team-DB-pager'

  pagerduty_configs:

  - service_key: <team-DB-key>

15、手动启动测试一下生产环境可以docker或k8s等方式启动

./alertmanager --config.file=alertmanager.yml

16、alertmanager页面能同步到告警

过一会就会发邮件了

group_interval: 1m   意思 从第一次接受告警1m后还在就发

prometheus从零开始的更多相关文章

从零开始搭建Prometheus自动监控报警系统
从零搭建Prometheus监控报警系统什么是Prometheus? Prometheus是由SoundCloud开发的开源监控报警系统和时序列数据库(TSDB).Prometheus使用Go语言开 ...
从零开始学习Prometheus监控报警系统
Prometheus简介 Prometheus是一个开源的监控报警系统,它最初由SoundCloud开发. 2016年,Prometheus被纳入了由谷歌发起的Linux基金会旗下的云原生基金会( C ...
Prometheus监控学习记录
官方文档 Prometheus基础文档从零开始:Prometheus 进阶之路:Prometheus —— 技巧篇进阶之路:Prometheus —— 理解篇 prometheus的数据类型介绍 ...
基于prometheus监控k8s集群
本文建立在你已经会安装prometheus服务的基础之上,如果你还不会安装,请参考:prometheus多维度监控容器如果你还没有安装库k8s集群,情参考: 从零开始搭建基于calico的kuben ...
kubernetes 1.15.1 高可用部署 -- 从零开始
这是一本书!!! 一本写我在容器生态圈的所学!!! 重点先知: 1. centos 7.6安装优化 2. k8s 1.15.1 高可用部署 3. 网络插件calico 4. dashboard 插件 ...
你必须知道的容器监控 (3) Prometheus
本篇已加入<.NET Core on K8S学习实践系列文章索引>,可以点击查看更多容器化技术相关系列文章.上一篇介绍了Google开发的容器监控工具cAdvisor,但是其提供的操作界面 ...
从零搭建Prometheus监控报警系统
什么是Prometheus? Prometheus是由SoundCloud开发的开源监控报警系统和时序列数据库(TSDB).Prometheus使用Go语言开发,是Google BorgMon监控系统 ...
3W字干货深入分析基于Micrometer和Prometheus实现度量和监控的方案
前提最近线上的项目使用了spring-actuator做度量统计收集,使用Prometheus进行数据收集,Grafana进行数据展示,用于监控生成环境机器的性能指标和业务数据指标.一般,我们叫这样 ...
Prometheus入门教程（二）：Prometheus + Grafana实现可视化、告警
文章首发于[陈树义]公众号,点击跳转到原文:https://mp.weixin.qq.com/s/56S290p4j9KROB5uGRcGkQ Prometheus UI 提供了快速验证 PromQL ...

随机推荐

被字节跳动、小米、美团面试官问的AndroidFramework难倒了？这里有23道面试真题，助力成为offer收割机！
目录 1.Android中多进程通信的方式有哪些?a.进程通信你用过哪些?原理是什么?(字节跳动.小米)2.描述下Binder机制原理?(东方头条)3.Binder线程池的工作过程是什么样?(东方头条 ...
【LeetCode】316. 去除重复字母
316. 去除重复字母知识点:栈:单调题目描述给你一个字符串 s ,请你去除字符串中重复的字母,使得每个字母只出现一次.需保证返回结果的字典序最小(要求不能打乱其他字符的相对位置). 示例输 ...
Linux系统CPU信息查询方法
lscpu cat /proc/cpuinfo对绝大多数Linux适用,lscpu更简练 cat /proc/cpuinfo 下面是一个实例: processor : 0 vendor_id : Ge ...
阿里云NAS文件迁移项目实践
阿里云文件存储NAS是阿里云推出的用于传统文件共享的,使用NFS协议挂载的共享文件夹. 产品背景下图是NAS和阿里云另一明星产品OSS以及块存储EBS的区别 NAS核心优势:无需修改程序,挂载之后, ...
Java之JSP
JSP JSP简介 JSP指的是 JavaServerPages ,Java服务器端页面,也和Servlet一样,用来开发动态web JSP页面中可以嵌入java代码为用户提供动态数据 JSP原理 J ...
泛微OA e-cology 数据库接口信息泄露学习
泛微OA e-cology 数据库接口信息泄露漏洞信息攻击者可通过存在漏洞的页面直接获取到数据库配置信息.如果攻击者可直接访问数据库,则可直接获取用户数据,甚至可以直接控制数据库服务器:会将当前连 ...
SQL 练习26
查询 1990 年出生的学生名单 --方式1 SELECT * FROM Student WHERE Sage BETWEEN '1990-01-01' AND '1990-12-31' --方式2 ...
Dapps-是一个跨平台的应用服务商店
简介 Dapps 是一个跨平台的应用商店,包含众多软件,基于docker dapps是什么? 它是一个应用程序商店,包含丰富的软件,因为基于docker,使你本机电脑有云开发的效果. 一键安装程序:多 ...
腾讯云TDSQL MySQL版 - 开发指南分布式事务
由于事务操作的数据通常跨多个物理节点,在分布式数据库中,类似方案即称为分布式事务. TDSQL MySQL版支持普通分布式事务协议和 XA 分布式事务协议.TDSQL MySQL版(内核5.7或以上 ...
C#多线程实践-锁和线程安全
锁实现互斥的访问,用于确保在同一时刻只有一个线程可以进入特殊的代码片段,考虑下面的类: class ThreadUnsafe { static int val1, val2; static void ...

prometheus从零开始

prometheus从零开始的更多相关文章

随机推荐

热门专题