redis监控规则
其他说明参考host主机监控规则:https://www.cnblogs.com/sanduzxcvbnm/p/13589848.html
groups:
- name: Redis monitoring
rules:
- alert: BlackboxProbeFailed
expr: probe_success == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Blackbox probe failed (instance {{ $labels.instance }})"
description: "Probe failed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: BlackboxSlowProbe
expr: avg_over_time(probe_duration_seconds[1m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Blackbox slow probe (instance {{ $labels.instance }})"
description: "Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: BlackboxProbeHttpFailure
expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
for: 5m
labels:
severity: critical
annotations:
summary: "Blackbox probe HTTP failure (instance {{ $labels.instance }})"
description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: BlackboxSslCertificateWillExpireSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
for: 5m
labels:
severity: warning
annotations:
summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
description: "SSL certificate expires in 30 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: BlackboxSslCertificateWillExpireSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
for: 5m
labels:
severity: critical
annotations:
summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
description: "SSL certificate expires in 3 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: BlackboxSslCertificateExpired
expr: probe_ssl_earliest_cert_expiry - time() <= 0
for: 5m
labels:
severity: critical
annotations:
summary: "Blackbox SSL certificate expired (instance {{ $labels.instance }})"
description: "SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: BlackboxProbeSlowHttp
expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Blackbox probe slow HTTP (instance {{ $labels.instance }})"
description: "HTTP request took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: BlackboxProbeSlowPing
expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Blackbox probe slow ping (instance {{ $labels.instance }})"
description: "Blackbox ping took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisDown
expr: redis_up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Redis down (instance {{ $labels.instance }})"
description: "Redis instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisMissingMaster
expr: count(redis_instance_info{role="master"}) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Redis missing master (instance {{ $labels.instance }})"
description: "Redis cluster has no node marked as master.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisTooManyMasters
expr: count(redis_instance_info{role="master"}) > 1
for: 5m
labels:
severity: critical
annotations:
summary: "Redis too many masters (instance {{ $labels.instance }})"
description: "Redis cluster has too many nodes marked as master.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisDisconnectedSlaves
expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1
for: 5m
labels:
severity: critical
annotations:
summary: "Redis disconnected slaves (instance {{ $labels.instance }})"
description: "Redis not replicating for all slaves. Consider reviewing the redis replication status.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisReplicationBroken
expr: delta(redis_connected_slaves[1m]) < 0
for: 5m
labels:
severity: critical
annotations:
summary: "Redis replication broken (instance {{ $labels.instance }})"
description: "Redis instance lost a slave\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisClusterFlapping
expr: changes(redis_connected_slaves[5m]) > 2
for: 5m
labels:
severity: critical
annotations:
summary: "Redis cluster flapping (instance {{ $labels.instance }})"
description: "Changes have been detected in Redis replica connection. This can occur when replica nodes lose connection to the master and reconnect (a.k.a flapping).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisMissingBackup
expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
for: 5m
labels:
severity: critical
annotations:
summary: "Redis missing backup (instance {{ $labels.instance }})"
description: "Redis has not been backuped for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisOutOfMemory
expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "Redis out of memory (instance {{ $labels.instance }})"
description: "Redis is running out of memory (> 90%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisTooManyConnections
expr: redis_connected_clients > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Redis too many connections (instance {{ $labels.instance }})"
description: "Redis instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisNotEnoughConnections
expr: redis_connected_clients < 5
for: 5m
labels:
severity: warning
annotations:
summary: "Redis not enough connections (instance {{ $labels.instance }})"
description: "Redis instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RedisRejectedConnections
expr: increase(redis_rejected_connections_total[1m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Redis rejected connections (instance {{ $labels.instance }})"
description: "Some connections to Redis has been rejected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: TraefikBackendDown
expr: count(traefik_backend_server_up) by (backend) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Traefik backend down (instance {{ $labels.instance }})"
description: "All Traefik backends are down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: TraefikHighHttp4xxErrorRateBackend
expr: sum(rate(traefik_backend_requests_total{code=~"4.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Traefik high HTTP 4xx error rate backend (instance {{ $labels.instance }})"
description: "Traefik backend 4xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: TraefikHighHttp5xxErrorRateBackend
expr: sum(rate(traefik_backend_requests_total{code=~"5.*"}[3m])) by (backend) / sum(rate(traefik_backend_requests_total[3m])) by (backend) * 100 > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Traefik high HTTP 5xx error rate backend (instance {{ $labels.instance }})"
description: "Traefik backend 5xx error rate is above 5%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdInsufficientMembers
expr: count(etcd_server_id) % 2 == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Etcd insufficient Members (instance {{ $labels.instance }})"
description: "Etcd cluster should have an odd number of members\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdNoLeader
expr: etcd_server_has_leader == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Etcd no Leader (instance {{ $labels.instance }})"
description: "Etcd cluster have no leader\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighNumberOfLeaderChanges
expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd high number of leader changes (instance {{ $labels.instance }})"
description: "Etcd leader changed more than 3 times during last hour\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighNumberOfFailedGrpcRequests
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd high number of failed GRPC requests (instance {{ $labels.instance }})"
description: "More than 1% GRPC request failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighNumberOfFailedGrpcRequests
expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Etcd high number of failed GRPC requests (instance {{ $labels.instance }})"
description: "More than 5% GRPC request failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdGrpcRequestsSlow
expr: histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd GRPC requests slow (instance {{ $labels.instance }})"
description: "GRPC requests slowing down, 99th percentil is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighNumberOfFailedHttpRequests
expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd high number of failed HTTP requests (instance {{ $labels.instance }})"
description: "More than 1% HTTP failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighNumberOfFailedHttpRequests
expr: sum(rate(etcd_http_failed_total[5m])) BY (method) / sum(rate(etcd_http_received_total[5m])) BY (method) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Etcd high number of failed HTTP requests (instance {{ $labels.instance }})"
description: "More than 5% HTTP failure detected in Etcd for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHttpRequestsSlow
expr: histogram_quantile(0.99, rate(etcd_http_successful_duration_seconds_bucket[5m])) > 0.15
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd HTTP requests slow (instance {{ $labels.instance }})"
description: "HTTP requests slowing down, 99th percentil is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])) > 0.15
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd member communication slow (instance {{ $labels.instance }})"
description: "Etcd member communication slowing down, 99th percentil is over 0.15s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighNumberOfFailedProposals
expr: increase(etcd_server_proposals_failed_total[1h]) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd high number of failed proposals (instance {{ $labels.instance }})"
description: "Etcd server got more than 5 failed proposals past hour\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd high fsync durations (instance {{ $labels.instance }})"
description: "Etcd WAL fsync duration increasing, 99th percentil is over 0.5s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: EtcdHighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25
for: 5m
labels:
severity: warning
annotations:
summary: "Etcd high commit durations (instance {{ $labels.instance }})"
description: "Etcd commit duration increasing, 99th percentil is over 0.25s for 5 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: OpenebsUsedPoolCapacity
expr: (openebs_used_pool_capacity_percent) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "OpenEBS used pool capacity (instance {{ $labels.instance }})"
description: "OpenEBS Pool use more than 80% of his capacity\n VALUE = {{ $value }}\n LABELS: {{ $labels }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
redis监控规则的更多相关文章
- DB监控-redis监控
公司的redis业务很多,redis监控自然也是DB监控的一大模块,包括采集.展示.监控告警.本文主要介绍redis监控的主要指标和采集方法. 一.Redis监控系统逻辑 1.DBA通过前台页面添加r ...
- [转]细说Redis监控和告警
原文 https://zhuoroger.github.io/2016/08/20/redis-monitor-and-alarm/? 对于任何应用服务和组件,都需要一套完善可靠谱监控方案. 尤其r ...
- Redis 监控
redis 监控有 redis-live 和 redis-stat Redis-Live是一个用来监控redis实例,分析查询语句并且有web界面的监控工具,使用python编写. redis-sta ...
- [Monitor] 监控规则定义
系统监控规则:
- Redis监控技巧(转)
来自:http://blog.nosqlfan.com/html/4166.html Redis 监控最直接的方法当然就是使用系统提供的 info 命令来做了,你只需要执行下面一条命令,就能获得 Re ...
- redis监控状态
Redis介绍 Redis是一种高级key-value数据库.它跟memcached类似,不过数据可以持久化,而且支持的数据类型很丰富.有字符串,链表.哈希.集合和有序集合5种.支持在服务器端计算集合 ...
- Redis监控方案
Redis 监控最直接的方法当然就是使用系统提供的 info 命令来做了,你只需要执行下面一条命令,就能获得 Redis 系统的状态报告. redis-cli info 内存使用 如果 Redis 使 ...
- Redis监控
首先判断客户端和服务器连接是否正常 # 客户端和服务器连接正常,返回PONG redis> PING PONG # 客户端和服务器连接不正常(网络不正常或服务器未能正常运行),返回连接异常 re ...
- 做个简单的Redis监控(源码分享)
Redis监控 Redis 是目前应用广泛的NoSQL,我做的项目中大部分都是与Redis打交道,发现身边的朋友也更多人在用,相对于memcached 来说,它的优势也确实是可圈可点.在随着业务,数据 ...
随机推荐
- webapi <Message>已拒绝为此请求授权。</Message>
webapi <Message>已拒绝为此请求授权.</Message> 原有的调用base.OnAuthorization(actionContext); 换成下面这个 // ...
- Pytorch Tensor 维度的扩充和压缩
维度扩展 x.unsqueeze(n) 在 n 号位置添加一个维度 例子: import torch x = torch.rand(3,2) x1 = x.unsqueeze(0) # 在第一维的位置 ...
- ServerlessBench 2.0:华为云联合上海交大发布Serverless基准测试平台
摘要:华为云联合上海交大重磅推出ServerlessBench 2.0,为社区提供涵盖12类基准测试用例.新增5大类跨平台测试用例.4大类关键特性指标.且多平台兼容的Serverless开放基准测试集 ...
- 如何有效地开发 Jmix 扩展组件
扩展组件的概念在使用 Jmix 框架开发中扮演着非常重要的角色.我们将在本文探索什么是扩展组件以及 Jmix Studio 在扩展组件开发和应用程序模块化方面能给开发者带来什么帮助. Jmix 中的扩 ...
- WPF 截图控件之移除控件(九)「仿微信」
WPF 截图控件之移除控件(九)「仿微信」 WPF 截图控件之移除控件(九)「仿微信」 作者:WPFDevelopersOrg 原文链接: https://github.com/WPFDevelope ...
- 简单学习一下ibd数据文件解析
来源:原创投稿 作者:花家舍 简介:数据库技术爱好者. GreatSQL社区原创内容未经授权不得随意使用,转载请联系小编并注明来源. 简单学习一下数据文件解析 这是尝试使用Golang语言简单解析My ...
- java学习第一天.day06
方法 方法的优点 1. 使程序变得更简短而清晰. 2. 有利于程序维护. 3. 可以提高程序开发的效率. 4. 提高了代码的重用性. static的作用 static在方法中如果没有添加就只能用对象调 ...
- 在微信小程序中,如何获取 for 循环的 index
微信小程序中,for 循环的 index(索引值)可以用wx:for-index="index"来获取. <view class="item" wx:fo ...
- 我就获取个时间,机器就down了
本文主要讲解linux 时间管理系统中的一个问题 背景:linux 时间管理,包含clocksource,clockevent,timer,tick,timekeeper等等概念 , 这些概念有机地组 ...
- Postman中的断言
Postman设置断言 一.断言的定义 1.什么是断言? 一般一个完整的接口测试,包括:请求->获取响应正文->断言,请求和获取响应正文很常见.断言一般是对请求的响应结果做操作,判断预期结 ...