Redis monitoring rules

For other notes, see the host monitoring rules: https://www.cnblogs.com/sanduzxcvbnm/p/13589848.html

groups:

- name: Redis monitoring

  rules:

  - alert: BlackboxProbeFailed

    expr: probe_success == 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Blackbox probe failed (instance {{ $labels.instance }})"

      description: "Probe failed\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: BlackboxSlowProbe

    expr: avg_over_time(probe_duration_seconds[1m]) > 1

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Blackbox slow probe (instance {{ $labels.instance }})"

      description: "Blackbox probe took more than 1s to complete\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: BlackboxProbeHttpFailure

    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"

      description: "HTTP status code is not 200-399\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: BlackboxSslCertificateWillExpireSoon

    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"

      description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: BlackboxSslCertificateWillExpireSoon

    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"

      description: "SSL certificate expires in 3 days\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: BlackboxSslCertificateExpired

    expr: probe_ssl_earliest_cert_expiry - time() <= 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Blackbox SSL certificate expired (instance {{ $labels.instance }})"

      description: "SSL certificate has expired already\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: BlackboxProbeSlowHttp

    expr: avg_over_time(probe_http_duration_seconds[1m]) > 1

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Blackbox probe slow HTTP (instance {{ $labels.instance }})"

      description: "HTTP request took more than 1s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: BlackboxProbeSlowPing

    expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Blackbox probe slow ping (instance {{ $labels.instance }})"

      description: "Blackbox ping took more than 1s\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisDown

    expr: redis_up == 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis down (instance {{ $labels.instance }})"

      description: "Redis instance is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisMissingMaster

    expr: count(redis_instance_info{role="master"}) == 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis missing master (instance {{ $labels.instance }})"

      description: "Redis cluster has no node marked as master.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisTooManyMasters

    expr: count(redis_instance_info{role="master"}) > 1

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis too many masters (instance {{ $labels.instance }})"

      description: "Redis cluster has too many nodes marked as master.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisDisconnectedSlaves

    expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis disconnected slaves (instance {{ $labels.instance }})"

      description: "Redis not replicating for all slaves. Consider reviewing the redis replication status.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisReplicationBroken

    expr: delta(redis_connected_slaves[1m]) < 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis replication broken (instance {{ $labels.instance }})"

      description: "Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisClusterFlapping

    expr: changes(redis_connected_slaves[5m]) > 2

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis cluster flapping (instance {{ $labels.instance }})"

      description: "Changes have been detected in the Redis replica connection. This can occur when replica nodes lose their connection to the master and reconnect (a.k.a. flapping).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisMissingBackup

    expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis missing backup (instance {{ $labels.instance }})"

      description: "Redis has not been backed up for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisOutOfMemory

    expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Redis out of memory (instance {{ $labels.instance }})"

      description: "Redis is running out of memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisTooManyConnections

    expr: redis_connected_clients > 100

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Redis too many connections (instance {{ $labels.instance }})"

      description: "Redis instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisNotEnoughConnections

    expr: redis_connected_clients < 5

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Redis not enough connections (instance {{ $labels.instance }})"

      description: "Redis instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: RedisRejectedConnections

    expr: increase(redis_rejected_connections_total[1m]) > 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Redis rejected connections (instance {{ $labels.instance }})"

      description: "Some connections to Redis have been rejected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: TraefikBackendDown

    expr: count(traefik_backend_server_up) by (backend) == 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Traefik backend down (instance {{ $labels.instance }})"

      description: "All Traefik backends are down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: TraefikHighHttp4xxErrorRateBackend

    expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }})"

      description: "Traefik backend 4xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: TraefikHighHttp5xxErrorRateBackend

    expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }})"

      description: "Traefik backend 5xx error rate is above 5%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdInsufficientMembers

    expr: count(etcd_server_id) % 2 == 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Etcd insufficient Members (instance {{ $labels.instance }})"

      description: "Etcd cluster should have an odd number of members\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdNoLeader

    expr: etcd_server_has_leader == 0

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Etcd no Leader (instance {{ $labels.instance }})"

      description: "Etcd cluster has no leader\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighNumberOfLeaderChanges

    expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd high number of leader changes (instance {{ $labels.instance }})"

      description: "Etcd leader changed more than 3 times during last hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighNumberOfFailedGrpcRequests

    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd high number of failed GRPC requests (instance {{ $labels.instance }})"

      description: "More than 1% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighNumberOfFailedGrpcRequests

    expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.05

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Etcd high number of failed GRPC requests (instance {{ $labels.instance }})"

      description: "More than 5% GRPC request failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdGrpcRequestsSlow

    expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd GRPC requests slow (instance {{ $labels.instance }})"

      description: "GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighNumberOfFailedHttpRequests

    expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.01

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd high number of failed HTTP requests (instance {{ $labels.instance }})"

      description: "More than 1% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighNumberOfFailedHttpRequests

    expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.05

    for: 5m

    labels:

      severity: critical

    annotations:

      summary: "Etcd high number of failed HTTP requests (instance {{ $labels.instance }})"

      description: "More than 5% HTTP failure detected in Etcd for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHttpRequestsSlow

    expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd HTTP requests slow (instance {{ $labels.instance }})"

      description: "HTTP requests slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdMemberCommunicationSlow

    expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd member communication slow (instance {{ $labels.instance }})"

      description: "Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighNumberOfFailedProposals

    expr: increase(etcd_server_proposals_failed_total[1h]) > 5

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd high number of failed proposals (instance {{ $labels.instance }})"

      description: "Etcd server got more than 5 failed proposals in the past hour\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighFsyncDurations

    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd high fsync durations (instance {{ $labels.instance }})"

      description: "Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: EtcdHighCommitDurations

    expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "Etcd high commit durations (instance {{ $labels.instance }})"

      description: "Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

  - alert: OpenebsUsedPoolCapacity

    expr: (openebs_used_pool_capacity_percent) > 80

    for: 5m

    labels:

      severity: warning

    annotations:

      summary: "OpenEBS used pool capacity (instance {{ $labels.instance }})"

      description: "OpenEBS pool uses more than 80% of its capacity\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
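
To have Prometheus evaluate the rules above, the file that contains them must be listed under `rule_files` in the main Prometheus configuration. A minimal sketch follows. The file name `rules/redis_rules.yml` is an assumption made for illustration and does not come from the original post.

```yaml
# prometheus.yml (fragment)
# Assumption: the rule group above is saved as rules/redis_rules.yml,
# relative to this configuration file. The file name is hypothetical.
rule_files:
  - "rules/redis_rules.yml"
```

Before reloading Prometheus, the rule file can be syntax-checked with `promtool check rules rules/redis_rules.yml` (same hypothetical path).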
