1.先在 Prometheus 主程序目录下创建rules目录,然后在该目录下创建 prometheus-test.yml文件,内容如下:

内容很多,可以根据实际情况进行调整。

规则参考网址:https://awesome-prometheus-alerts.grep.to/rules

注意:注意目录和文件的权限:chown -R prometheus:prometheus rules

groups:
- name: Prometheus self-monitoring
rules:
- alert: PrometheusJobMissing
expr: absent(up{job="my-job"})
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus job missing (instance {{ $labels.instance }})"
description: "A Prometheus job has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTargetMissing
expr: up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus target missing (instance {{ $labels.instance }})"
description: "A Prometheus target has disappeared. An exporter might be crashed.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusAllTargetsMissing
expr: count by (job) (up) == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus all targets missing (instance {{ $labels.instance }})"
description: "A Prometheus job does not have living target anymore.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusConfigurationReloadFailure
expr: prometheus_config_last_reload_successful != 1
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus configuration reload failure (instance {{ $labels.instance }})"
description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTooManyRestarts
expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus too many restarts (instance {{ $labels.instance }})"
description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusAlertmanagerConfigurationReloadFailure
expr: alertmanager_config_last_reload_successful != 1
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})"
description: "AlertManager configuration reload error\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusAlertmanagerConfigNotSynced
expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus AlertManager config not synced (instance {{ $labels.instance }})"
description: "Configurations of AlertManager cluster instances are out of sync\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusAlertmanagerE2eDeadManSwitch
expr: vector(1)
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})"
description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusNotConnectedToAlertmanager
expr: prometheus_notifications_alertmanagers_discovered < 1
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus not connected to alertmanager (instance {{ $labels.instance }})"
description: "Prometheus cannot connect the alertmanager\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusRuleEvaluationFailures
expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus rule evaluation failures (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTemplateTextExpansionFailures
expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus template text expansion failures (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusRuleEvaluationSlow
expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus rule evaluation slow (instance {{ $labels.instance }})"
description: "Prometheus rule evaluation took more time than the scheduled interval. I indicates a slower storage backend access or too complex query.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusNotificationsBacklog
expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus notifications backlog (instance {{ $labels.instance }})"
description: "The Prometheus notification queue has not been empty for 10 minutes\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusAlertmanagerNotificationFailing
expr: rate(alertmanager_notifications_failed_total[1m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus AlertManager notification failing (instance {{ $labels.instance }})"
description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTargetEmpty
expr: prometheus_sd_discovered_targets == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus target empty (instance {{ $labels.instance }})"
description: "Prometheus has no target in service discovery\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTargetScrapingSlow
expr: prometheus_target_interval_length_seconds{quantile="0.9"} > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus target scraping slow (instance {{ $labels.instance }})"
description: "Prometheus is scraping exporters slowly\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusLargeScrape
expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus large scrape (instance {{ $labels.instance }})"
description: "Prometheus has many scrapes that exceed the sample limit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTargetScrapeDuplicate
expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Prometheus target scrape duplicate (instance {{ $labels.instance }})"
description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTsdbCheckpointCreationFailures
expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} checkpoint creation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTsdbCheckpointDeletionFailures
expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTsdbCompactionsFailed
expr: increase(prometheus_tsdb_compactions_failed_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus TSDB compactions failed (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} TSDB compactions failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTsdbHeadTruncationsFailed
expr: increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus TSDB head truncations failed (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTsdbReloadFailures
expr: increase(prometheus_tsdb_reloads_failures_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus TSDB reload failures (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} TSDB reload failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTsdbWalCorruptions
expr: increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PrometheusTsdbWalTruncationsFailed
expr: increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})"
description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlDown
expr: mysql_up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "MySQL down (instance {{ $labels.instance }})"
description: "MySQL instance is down on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlTooManyConnections
expr: avg by (instance) (max_over_time(mysql_global_status_threads_connected[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL too many connections (instance {{ $labels.instance }})"
description: "More than 80% of MySQL connections are in use on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlHighThreadsRunning
expr: avg by (instance) (max_over_time(mysql_global_status_threads_running[5m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 60
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL high threads running (instance {{ $labels.instance }})"
description: "More than 60% of MySQL connections are in running state on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlSlaveIoThreadNotRunning
expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0
for: 5m
labels:
severity: critical
annotations:
summary: "MySQL Slave IO thread not running (instance {{ $labels.instance }})"
description: "MySQL Slave IO thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlSlaveSqlThreadNotRunning
expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0
for: 5m
labels:
severity: critical
annotations:
summary: "MySQL Slave SQL thread not running (instance {{ $labels.instance }})"
description: "MySQL Slave SQL thread not running on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlSlaveReplicationLag
expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 300
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL Slave replication lag (instance {{ $labels.instance }})"
description: "MysqL replication lag on {{ $labels.instance }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlSlowQueries
expr: mysql_global_status_slow_queries > 0
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL slow queries (instance {{ $labels.instance }})"
description: "MySQL server is having some slow queries.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MysqlRestarted
expr: mysql_global_status_uptime < 60
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL restarted (instance {{ $labels.instance }})"
description: "MySQL has just been restarted, less than one minute ago on {{ $labels.instance }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KafkaTopicsReplicas
expr: sum(kafka_topic_partition_in_sync_replica) by (topic) < 3
for: 5m
labels:
severity: critical
annotations:
summary: "Kafka topics replicas (instance {{ $labels.instance }})"
description: "Kafka topic in-sync partition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KafkaConsumersGroup
expr: sum(kafka_consumergroup_lag) by (consumergroup) > 50
for: 5m
labels:
severity: critical
annotations:
summary: "Kafka consumers group (instance {{ $labels.instance }})"
description: "Kafka consumers group\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: NginxHighHttp4xxErrorRate
expr: sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Nginx high HTTP 4xx error rate (instance {{ $labels.instance }})"
description: "Too many HTTP requests with status 4xx (> 5%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: NginxHighHttp5xxErrorRate
expr: sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Nginx high HTTP 5xx error rate (instance {{ $labels.instance }})"
description: "Too many HTTP requests with status 5xx (> 5%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: NginxLatencyHigh
expr: histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Nginx latency high (instance {{ $labels.instance }})"
description: "Nginx p99 latency is higher than 10 seconds\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesNodeReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes Node ready (instance {{ $labels.instance }})"
description: "Node {{ $labels.node }} has been unready for a long time\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesMemoryPressure
expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes memory pressure (instance {{ $labels.instance }})"
description: "{{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesDiskPressure
expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes disk pressure (instance {{ $labels.instance }})"
description: "{{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesOutOfDisk
expr: kube_node_status_condition{condition="OutOfDisk",status="true"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes out of disk (instance {{ $labels.instance }})"
description: "{{ $labels.node }} has OutOfDisk condition\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesJobFailed
expr: kube_job_status_failed > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes Job failed (instance {{ $labels.instance }})"
description: "Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesCronjobSuspended
expr: kube_cronjob_spec_suspend != 0
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes CronJob suspended (instance {{ $labels.instance }})"
description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesPersistentvolumeclaimPending
expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }})"
description: "PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesVolumeOutOfDiskSpace
expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes Volume out of disk space (instance {{ $labels.instance }})"
description: "Volume is almost full (< 10% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesVolumeFullInFourDays
expr: predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes Volume full in four days (instance {{ $labels.instance }})"
description: "{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesPersistentvolumeError
expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes PersistentVolume error (instance {{ $labels.instance }})"
description: "Persistent volume is in bad state\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesStatefulsetDown
expr: (kube_statefulset_status_replicas_ready / kube_statefulset_status_replicas_current) != 1
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes StatefulSet down (instance {{ $labels.instance }})"
description: "A StatefulSet went down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesHpaScalingAbility
expr: kube_hpa_status_condition{condition="false", status="AbleToScale"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes HPA scaling ability (instance {{ $labels.instance }})"
description: "Pod is unable to scale\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesHpaMetricAvailability
expr: kube_hpa_status_condition{condition="false", status="ScalingActive"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes HPA metric availability (instance {{ $labels.instance }})"
description: "HPA is not able to colelct metrics\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesHpaScaleCapability
expr: kube_hpa_status_desired_replicas >= kube_hpa_spec_max_replicas
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes HPA scale capability (instance {{ $labels.instance }})"
description: "The maximum number of desired Pods has been hit\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesPodNotHealthy
expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes Pod not healthy (instance {{ $labels.instance }})"
description: "Pod has been in a non-ready state for longer than an hour.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesPodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 5
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes pod crash looping (instance {{ $labels.instance }})"
description: "Pod {{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesReplicassetMismatch
expr: kube_replicaset_spec_replicas != kube_replicaset_status_ready_replicas
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes ReplicasSet mismatch (instance {{ $labels.instance }})"
description: "Deployment Replicas mismatch\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesDeploymentReplicasMismatch
expr: kube_deployment_spec_replicas != kube_deployment_status_replicas_available
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }})"
description: "Deployment Replicas mismatch\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesStatefulsetReplicasMismatch
expr: kube_statefulset_status_replicas_ready != kube_statefulset_status_replicas
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }})"
description: "A StatefulSet has not matched the expected number of replicas for longer than 15 minutes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesDeploymentGenerationMismatch
expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes Deployment generation mismatch (instance {{ $labels.instance }})"
description: "A Deployment has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesStatefulsetGenerationMismatch
expr: kube_statefulset_status_observed_generation != kube_statefulset_metadata_generation
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }})"
description: "A StatefulSet has failed but has not been rolled back.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesStatefulsetUpdateNotRolledOut
expr: max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated)
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }})"
description: "StatefulSet update has not been rolled out.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesDaemonsetRolloutStuck
expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})"
description: "Some Pods of DaemonSet are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesDaemonsetMisscheduled
expr: kube_daemonset_status_number_misscheduled > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }})"
description: "Some DaemonSet Pods are running where they are not supposed to run\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesCronjobTooLong
expr: time() - kube_cronjob_next_schedule_time > 3600
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes CronJob too long (instance {{ $labels.instance }})"
description: "CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesJobCompletion
expr: kube_job_spec_completions - kube_job_status_succeeded > 0 or kube_job_status_failed > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes job completion (instance {{ $labels.instance }})"
description: "Kubernetes Job failed to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesApiServerErrors
expr: sum(rate(apiserver_request_count{job="apiserver",code=~"^(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="apiserver"}[2m])) * 100 > 3
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes API server errors (instance {{ $labels.instance }})"
description: "Kubernetes API server is experiencing high error rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesApiClientErrors
expr: (sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 1
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes API client errors (instance {{ $labels.instance }})"
description: "Kubernetes API client is experiencing high error rate\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesClientCertificateExpiresNextWeek
expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 7*24*60*60
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes client certificate expires next week (instance {{ $labels.instance }})"
description: "A client certificate used to authenticate to the apiserver is expiring next week.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesClientCertificateExpiresSoon
expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 24*60*60
for: 5m
labels:
severity: critical
annotations:
summary: "Kubernetes client certificate expires soon (instance {{ $labels.instance }})"
description: "A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: KubernetesApiServerLatency
expr: histogram_quantile(0.99, sum(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}) WITHOUT (instance, resource)) / 1e+06 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes API server latency (instance {{ $labels.instance }})"
description: "Kubernetes API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephState
expr: ceph_health_status != 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ceph State (instance {{ $labels.instance }})"
description: "Ceph instance unhealthy\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephMonitorClockSkew
expr: abs(ceph_monitor_clock_skew_seconds) > 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph monitor clock skew (instance {{ $labels.instance }})"
description: "Ceph monitor clock skew detected. Please check ntp and hardware clock settings\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephMonitorLowSpace
expr: ceph_monitor_avail_percent < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph monitor low space (instance {{ $labels.instance }})"
description: "Ceph monitor storage is low.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephOsdDown
expr: ceph_osd_up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ceph OSD Down (instance {{ $labels.instance }})"
description: "Ceph Object Storage Daemon Down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephHighOsdLatency
expr: ceph_osd_perf_apply_latency_seconds > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph high OSD latency (instance {{ $labels.instance }})"
description: "Ceph Object Storage Daemon latetncy is high. Please check if it doesn't stuck in weird state.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephOsdLowSpace
expr: ceph_osd_utilization > 90
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph OSD low space (instance {{ $labels.instance }})"
description: "Ceph Object Storage Daemon is going out of space. Please add more disks.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephOsdReweighted
expr: ceph_osd_weight < 1
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph OSD reweighted (instance {{ $labels.instance }})"
description: "Ceph Object Storage Daemon take ttoo much time to resize.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephPgDown
expr: ceph_pg_down > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ceph PG down (instance {{ $labels.instance }})"
description: "Some Ceph placement groups are down. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephPgIncomplete
expr: ceph_pg_incomplete > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ceph PG incomplete (instance {{ $labels.instance }})"
description: "Some Ceph placement groups are incomplete. Please ensure that all the data are available.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephPgInconsistant
expr: ceph_pg_inconsistent > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph PG inconsistant (instance {{ $labels.instance }})"
description: "Some Ceph placement groups are inconsitent. Data is available but inconsistent across nodes.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephPgActivationLong
expr: ceph_pg_activating > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph PG activation long (instance {{ $labels.instance }})"
description: "Some Ceph placement groups are too long to activate.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephPgBackfillFull
expr: ceph_pg_backfill_toofull > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Ceph PG backfill full (instance {{ $labels.instance }})"
description: "Some Ceph placement groups are located on full Object Storage Daemon on cluster. Those PGs can be unavailable shortly. Please check OSDs, change weight or reconfigure CRUSH rules.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: CephPgUnavailable
expr: ceph_pg_total - ceph_pg_active > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Ceph PG unavailable (instance {{ $labels.instance }})"
description: "Some Ceph placement groups are unavailable.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ThanosCompactionHalted
expr: thanos_compactor_halted == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Thanos compaction halted (instance {{ $labels.instance }})"
description: "Thanos compaction has failed to run and is now halted.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ThanosCompactBucketOperationFailure
expr: rate(thanos_objstore_bucket_operation_failures_total[1m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Thanos compact bucket operation failure (instance {{ $labels.instance }})"
description: "Thanos compaction has failing storage operations\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ThanosCompactNotRun
expr: (time() - thanos_objstore_bucket_last_successful_upload_time) > 24*60*60
for: 5m
labels:
severity: critical
annotations:
summary: "Thanos compact not run (instance {{ $labels.instance }})"
description: "Thanos compaction has not run in 24 hours.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

2.在 Prometheus 主程序目录下的prometheus.yml进行修改,引用上述告警规则

rule_files:
- "rules/*.yml"

3.重启Prometheus

Prometheus自身的监控告警规则的更多相关文章

  1. Spring Boot 微服务应用集成Prometheus + Grafana 实现监控告警

    Spring Boot 微服务应用集成Prometheus + Grafana 实现监控告警 一.添加依赖 1.1 Actuator 的 /prometheus端点 二.Prometheus 配置 部 ...

  2. k8s实战之部署Prometheus+Grafana可视化监控告警平台

    写在前面 之前部署web网站的时候,架构图中有一环节是监控部分,并且搭建一套有效的监控平台对于运维来说非常之重要,只有这样才能更有效率的保证我们的服务器和服务的稳定运行,常见的开源监控软件有好几种,如 ...

  3. Prometheus中使用的告警规则

    参考网站:https://awesome-prometheus-alerts.grep.to/rules 这个网站上有好多常用软件的告警规则,但是有些并不一定实用,有些使用起来会有错误,这里就把这些都 ...

  4. Prometheus+Grafana企业监控系统

    Prometheus+Grafana企业监控系统 作者 刘畅 实验配置: 主机名称 Ip地址 controlnode 172.16.1.70/24 slavenode1 172.16.1.71/24 ...

  5. Prometheus监控学习笔记之Prometheus 2.0 告警规则介绍

    0x00 变化 Prometheus 2.0 已经发布一段时间了,从今天开始我将分几篇文章为大家介绍其中的一些变化. 此篇文章主要介绍 2.0 的告警规则声明的新写法. 从 1.x 到 2.0 规则声 ...

  6. Prometheus 编写告警规则案例

    Prometheus 编写告警规则案例 注:确保alertmanager配置完毕! 1.创建编辑文件:vim /usr/local/prometheus/rules/node.yml # groups ...

  7. kubernetes(k8s) Prometheus+grafana监控告警安装部署

    主机数据收集 主机数据的采集是集群监控的基础:外部模块收集各个主机采集到的数据分析就能对整个集群完成监控和告警等功能.一般主机数据采集和对外提供数据使用cAdvisor 和node-exporter等 ...

  8. Grafana+Prometheus实现Ceph监控和钉钉告警-转载(云栖社区)

    获取软件包 最新的软件包获取地址 https://prometheus.io/download/ Prometheus 1.下载Prometheus $ wget https://github.com ...

  9. 实用干货丨如何使用Prometheus配置自定义告警规则

    前 言 Prometheus是一个用于监控和告警的开源系统.一开始由Soundcloud开发,后来在2016年,它迁移到CNCF并且称为Kubernetes之后最流行的项目之一.从整个Linux服务器 ...

随机推荐

  1. 在 SQL Server 中使用 Try Catch 处理异常

    如何在 SQL Server 中使用 Try Catch 处理错误? 从 SQL Server 2005 开始,我们在TRY 和 CATCH块的帮助下提供了结构错误处理机制.使用TRY-CATCH的语 ...

  2. SimpleMongoDbFactory类已经失效,被SimpleMongoClientDbFactory替代

    老版本的mongodbtemplate连接池的用法 spring: data: mongodb: address: 127.0.0.1:37017 replica-set: database: xxx ...

  3. Stream流的特点_只能使用一次和Stream流中的常用方法_map

    Stream流的特点_只能使用一次 public class FilterStudy04 { public static void main(String[] args) { //创建一个Stream ...

  4. 推荐几款最好用的MySQL开源客户端,建议收藏!

    一.摘要 众所周知,MYSQL 是目前使得最广泛.最流行的数据库技术之一,为了更方便的管理数据库,市场上出现了大量软件公司和个人开发者研发的客户端工具,比如我们所熟知的比较知名的客户端: Navica ...

  5. 编译器优化:何为SLP矢量化

    摘要:SLP矢量化的目标是将相似的独立指令组合成向量指令,内存访问.算术运算.比较运算.PHI节点都可以使用这种技术进行矢量化. 本文分享自华为云社区<编译器优化那些事儿(1):SLP矢量化介绍 ...

  6. BZOJ3295/Luogu3157 [CQOI2011]动态逆序对 (CDQ or 树套树 )

    /* Dear friend, wanna learn CDQ? As a surprice, this code is totally wrong. You may ask, then why yo ...

  7. 使用Linux、Nginx和Github Actions托管部署ASP.NET Core 6.0应用

    使用Linux.Nginx和Github Actions托管部署ASP.NET Core 6.0应用 前言 本文主要参考微软这篇文档而来 Host ASP.NET Core on Linux with ...

  8. Java精进-手写持久层框架

    前言 本文适合有一定java基础的同学,通过自定义持久层框架,可以更加清楚常用的mybatis等开源框架的原理. JDBC操作回顾及问题分析 学习java的同学一定避免不了接触过jdbc,让我们来回顾 ...

  9. Python之验证码识别功能

    Python之pytesseract 识别验证码 1.验证码来一个 2.适合什么样的验证码呢? 只能识别简单.静态.无重叠.只有数字字母的验证码 3.实际应用:模拟人工登录.页面内容识别.爬虫抓取信息 ...

  10. K8s小白?应用部署太难?看这篇就够了!

    在云原生趋势下,容器和 Kubernetes 可谓是家喻户晓,许多企业内部的研发团队都在使用 Kubernetes 打造 DevOps 平台.从最早的容器概念到 Kubernetes 再到 DevOp ...