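Host monitoring rules

1. First, create a `rules` directory under the Prometheus main program directory, then create a `host.yml` file in that directory with the content shown below.

The rule set is long; adjust it to suit your actual environment.

Rule reference: https://awesome-prometheus-alerts.grep.to/rules

Some of the rules on the reference site need to be modified before use. For example: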

```yaml
  - alert: HostNetworkReceiveErrors
    expr: increase(node_network_receive_errs_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host Network Receive Errors (instance {{ $labels.instance }})"
      description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

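Before using this rule, change the `description`: replace the outermost double quotes with single quotes, because the inner `"%.0f"` already uses double quotes and would otherwise end the string early. The result is: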

'{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}'
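
One side effect of switching to single quotes is worth knowing: YAML does not process backslash escapes inside single-quoted scalars, so `\n` reaches the annotation as two literal characters rather than a line break. If the line breaks matter, an alternative sketch (not taken from the reference site) is to keep the double quotes and escape the inner ones instead:

```yaml
      # Alternative: keep double quotes so \n stays a newline escape,
      # and escape the quotes around the printf format string.
      description: "{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

Either form parses as valid YAML; they differ only in how `\n` is handled.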

Note: make sure the directory and its files are owned by the Prometheus user: chown -R prometheus:prometheus rules
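
Step 1 and the ownership note can be combined into a small shell sketch. It assumes Prometheus runs as a `prometheus` user and group, as the `chown` command above implies. The function name `setup_rules_dir` is made up for illustration.

```shell
# Create the rules directory with a host.yml file and hand it to the
# account Prometheus runs as.
# Assumption: the owner defaults to prometheus:prometheus, as in the
# chown command above; pass a different owner as the second argument.
setup_rules_dir() {
    prom_dir="$1"
    owner="${2:-prometheus:prometheus}"
    mkdir -p "$prom_dir/rules" || return 1
    # Create host.yml only if it is missing, so existing rules survive.
    [ -e "$prom_dir/rules/host.yml" ] || : > "$prom_dir/rules/host.yml" || return 1
    chown -R "$owner" "$prom_dir/rules"
}
```

Call it with the Prometheus main program directory as the first argument. Changing ownership to another account normally requires root.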

```yaml

groups:

- name: Host and hardware

  rules:

  - alert: HostOutOfMemory

    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host out of memory (instance {{ $labels.instance }})"

      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostMemoryUnderMemoryPressure

    expr: rate(node_vmstat_pgmajfault[1m]) > 1000

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host memory under memory pressure (instance {{ $labels.instance }})"

      description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostUnusualNetworkThroughputIn

    expr: sum by (instance) (irate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host unusual network throughput in (instance {{ $labels.instance }})"

      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostUnusualNetworkThroughputOut

    expr: sum by (instance) (irate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host unusual network throughput out (instance {{ $labels.instance }})"

      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostUnusualDiskReadRate

    expr: sum by (instance) (irate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host unusual disk read rate (instance {{ $labels.instance }})"

      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostUnusualDiskWriteRate

    expr: sum by (instance) (irate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host unusual disk write rate (instance {{ $labels.instance }})"

      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostOutOfDiskSpace

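    # Assumes node_exporter reports the root filesystem with mountpoint="/rootfs";
    # if it is reported as "/" on your hosts, adjust the selector accordingly.
    expr: (node_filesystem_avail_bytes{mountpoint="/rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/rootfs"} < 10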

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host out of disk space (instance {{ $labels.instance }})"

      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostDiskWillFillIn4Hours

    expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host disk will fill in 4 hours (instance {{ $labels.instance }})"

      description: "Disk will fill in 4 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostOutOfInodes

    expr: node_filesystem_files_free{mountpoint="/rootfs"} / node_filesystem_files{mountpoint="/rootfs"} * 100 < 10

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host out of inodes (instance {{ $labels.instance }})"

      description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostUnusualDiskReadLatency

    # The ratio is in seconds per read, so the 100 ms threshold is 0.1.
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host unusual disk read latency (instance {{ $labels.instance }})"

      description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostUnusualDiskWriteLatency

    # The ratio is in seconds per write, so the 100 ms threshold is 0.1.
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host unusual disk write latency (instance {{ $labels.instance }})"

      description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostHighCpuLoad

    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host high CPU load (instance {{ $labels.instance }})"

      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostContextSwitching

    expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1000

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host context switching (instance {{ $labels.instance }})"

      description: "Context switching is growing on node (> 1000 / s)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostSwapIsFillingUp

    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host swap is filling up (instance {{ $labels.instance }})"

      description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostSystemdServiceCrashed

    expr: node_systemd_unit_state{state="failed"} == 1

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host SystemD service crashed (instance {{ $labels.instance }})"

      description: "SystemD service crashed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostPhysicalComponentTooHot

    expr: node_hwmon_temp_celsius > 75

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host physical component too hot (instance {{ $labels.instance }})"

      description: "Physical hardware component too hot\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostNodeOvertemperatureAlarm

    expr: node_hwmon_temp_alarm == 1

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Host node overtemperature alarm (instance {{ $labels.instance }})"

      description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostRaidArrayGotInactive

    expr: node_md_state{state="inactive"} > 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Host RAID array got inactive (instance {{ $labels.instance }})"

      description: "RAID array {{ $labels.device }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostRaidDiskFailure

    expr: node_md_disks{state="fail"} > 0

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host RAID disk failure (instance {{ $labels.instance }})"

      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostKernelVersionDeviations

    expr: count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host kernel version deviations (instance {{ $labels.instance }})"

      description: "Different kernel versions are running\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostOomKillDetected

    expr: increase(node_vmstat_oom_kill[5m]) > 0

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host OOM kill detected (instance {{ $labels.instance }})"

      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: HostEdacCorrectableErrorsDetected

    expr: increase(node_edac_correctable_errors_total[5m]) > 0

    for: 5m

    labels:

      severity: info

    annotations:

      summary: "Host EDAC Correctable Errors detected (instance {{ $labels.instance }})"

      description: '{{ $labels.instance }} has had {{ printf "%.0f" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

  - alert: HostEdacUncorrectableErrorsDetected

    expr: node_edac_uncorrectable_errors_total > 0

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})"

      description: '{{ $labels.instance }} has had {{ printf "%.0f" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

  - alert: HostNetworkReceiveErrors

    expr: increase(node_network_receive_errs_total[5m]) > 0

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host Network Receive Errors (instance {{ $labels.instance }})"

      description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

  - alert: HostNetworkTransmitErrors

    expr: increase(node_network_transmit_errs_total[5m]) > 0

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Host Network Transmit Errors (instance {{ $labels.instance }})"

      description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last five minutes.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

  - alert: JvmMemoryFillingUp

    expr: jvm_memory_bytes_used / jvm_memory_bytes_max{area="heap"} > 0.8

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "JVM memory filling up (instance {{ $labels.instance }})"

      description: "JVM memory is filling up (> 80%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: SpeedtestSlowInternetDownload

    expr: avg_over_time(speedtest_download[30m]) < 75

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "SpeedTest Slow Internet Download (instance {{ $labels.instance }})"

      description: "Internet download speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: SpeedtestSlowInternetUpload

    expr: avg_over_time(speedtest_upload[30m]) < 20

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "SpeedTest Slow Internet Upload (instance {{ $labels.instance }})"

      description: "Internet upload speed is currently {{humanize $value}} Mbps.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
```

2. Edit `prometheus.yml` in the Prometheus main program directory so that it references the alert rules above:

rule_files:

  - "rules/*.yml"

3. Restart Prometheus.
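
Before restarting in step 3, it can help to validate the files first, since a quoting mistake in a rule file will stop Prometheus from loading it. The sketch below assumes that `promtool`, which ships with Prometheus, is on `PATH`. The helper name `validate_prometheus` is made up for illustration.

```shell
# Validate the rule files and the main config before restarting.
# Assumptions (not from the original post): promtool is on PATH;
# validate_prometheus is a hypothetical helper name.
validate_prometheus() {
    config="${1:-prometheus.yml}"
    rules_dir="${2:-rules}"
    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found" >&2
        return 2
    fi
    # Check each rule file for YAML and expression errors.
    for f in "$rules_dir"/*.yml; do
        [ -e "$f" ] || continue
        promtool check rules "$f" || return 1
    done
    # Check the main config that references the rule files.
    promtool check config "$config" || return 1
}
```

Run `validate_prometheus` from the Prometheus main program directory and restart only if it succeeds. With systemd the restart might be `systemctl restart prometheus`, assuming the unit is named `prometheus`. Prometheus also reloads its configuration and rule files on SIGHUP, or on an HTTP POST to `/-/reload` when it was started with `--web.enable-lifecycle`.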
