I. Overview

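Prometheus's overall monitoring architecture is fairly complex, and deploying its components one by one is not simple. On top of that, monitoring Kubernetes requires access to cluster-internal data, which means going through authentication, authorization, and admission control. Doing all of that by hand is considerably harder and takes time, so unless you have especially demanding requirements, I recommend picking one of the better open-source solutions.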

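I won't repeat the introduction to Prometheus itself here; see my other post, "Kubernetes实战总结 - Prometheus部署(v0.3.0)" ("Kubernetes in Practice: Prometheus Deployment (v0.3.0)").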

This post focuses on the configuration involved in deploying Prometheus on Kubernetes. I use the open-source deployment project on GitHub: prometheus-operator/kube-prometheus.

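In my view kube-prometheus is probably the best open-source option available at the moment. The repository collects Kubernetes manifests, Grafana dashboards, and Prometheus rules, together with documentation and scripts, to provide easy-to-operate end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator. Everything is deployed to the k8s cluster as containers, and the configuration can be customized, which is very convenient.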

Note: I am using kubernetes-1.17.5 with kube-prometheus release-0.3. Because of network restrictions, I have changed all of the image addresses.


II. Structure analysis

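The kube-prometheus deployment files live in the manifests directory, 65 YAML files in total. The setup folder contains all of the CustomResourceDefinition (CRD) resources, which generally do not need to be modified and should not be changed lightly, along with the namespace and the prometheus-operator itself. The setup folder must therefore be applied first.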
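As a rough illustration of that ordering, the sketch below builds the two kubectl invocations in sequence. It assumes kubectl is on the PATH and that the script runs from the repository root. The helper name deploy_commands and the --apply switch are made up for this example, and using kubectl apply rather than kubectl create is my choice, not something the original states.

```python
import subprocess
import sys


def deploy_commands(manifests_dir="manifests"):
    """Return the kubectl invocations in the order they should run."""
    return [
        # The setup folder (namespace, CRDs, operator) goes in first ...
        ["kubectl", "apply", "-f", f"{manifests_dir}/setup"],
        # ... then the manifests that depend on those CRDs.
        ["kubectl", "apply", "-f", manifests_dir],
    ]


# Only touch the cluster when explicitly asked to (hypothetical --apply switch).
if __name__ == "__main__" and "--apply" in sys.argv:
    for cmd in deploy_commands():
        # check=True stops at the first step that fails
        subprocess.run(cmd, check=True)
```

If the second step is run before the API server has finished registering the new CRDs, it can fail with "no matches for kind" errors; waiting a few seconds and running it again is usually enough.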

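These CRDs include Alertmanager (alerting), Prometheus (monitoring), and PrometheusRule (monitoring rules), alongside PodMonitor and ServiceMonitor. Because the operator generates the underlying workloads from these custom resources, editing the corresponding controller directly in k8s (for example kubectl edit sts prometheus-k8s -n monitoring) is useless: the operator will reconcile the change away. Modify the custom resource instead.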

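There appear to be a lot of YAML files, but they are easy to understand once sorted. They fall into 7 components: prometheus-operator, prometheus-adapter, prometheus, alertmanager, grafana, kube-state-metrics, and node-exporter. Each component defines its controller, configuration, cluster permissions, access configuration, and so on. Usually we only need to customize the alerting configuration and the monitoring rules, so after filtering, only a few files need to be changed (the files marked with comments in the listing below are covered in detail later; resource settings in the other manifests can be adjusted to suit your project).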

[root@ymt108 manifests]# tree
.
├── alertmanager-alertmanager.yaml
├── alertmanager-secret.yaml    # alerting configuration
├── alertmanager-serviceAccount.yaml
├── alertmanager-serviceMonitor.yaml
├── alertmanager-service.yaml
├── grafana-dashboardDatasources.yaml
├── grafana-dashboardDefinitions.yaml
├── grafana-dashboardSources.yaml
├── grafana-deployment.yaml
├── grafana-serviceAccount.yaml
├── grafana-serviceMonitor.yaml
├── grafana-service.yaml
├── kube-state-metrics-clusterRoleBinding.yaml
├── kube-state-metrics-clusterRole.yaml
├── kube-state-metrics-deployment.yaml
├── kube-state-metrics-roleBinding.yaml
├── kube-state-metrics-role.yaml
├── kube-state-metrics-serviceAccount.yaml
├── kube-state-metrics-serviceMonitor.yaml
├── kube-state-metrics-service.yaml
├── node-exporter-clusterRoleBinding.yaml
├── node-exporter-clusterRole.yaml
├── node-exporter-daemonset.yaml
├── node-exporter-serviceAccount.yaml
├── node-exporter-serviceMonitor.yaml
├── node-exporter-service.yaml
├── prometheus-adapter-apiService.yaml
├── prometheus-adapter-clusterRoleAggregatedMetricsReader.yaml
├── prometheus-adapter-clusterRoleBindingDelegator.yaml
├── prometheus-adapter-clusterRoleBinding.yaml
├── prometheus-adapter-clusterRoleServerResources.yaml
├── prometheus-adapter-clusterRole.yaml
├── prometheus-adapter-configMap.yaml
├── prometheus-adapter-deployment.yaml
├── prometheus-adapter-roleBindingAuthReader.yaml
├── prometheus-adapter-serviceAccount.yaml
├── prometheus-adapter-service.yaml
├── prometheus-clusterRoleBinding.yaml
├── prometheus-clusterRole.yaml
├── prometheus-operator-serviceMonitor.yaml
├── prometheus-prometheus.yaml  # monitoring configuration
├── prometheus-roleBindingConfig.yaml
├── prometheus-roleBindingSpecificNamespaces.yaml
├── prometheus-roleConfig.yaml
├── prometheus-roleSpecificNamespaces.yaml
├── prometheus-rules.yaml  # default monitoring rules
├── prometheus-serviceAccount.yaml
├── prometheus-serviceMonitorApiserver.yaml
├── prometheus-serviceMonitorCoreDNS.yaml
├── prometheus-serviceMonitorKubeControllerManager.yaml
├── prometheus-serviceMonitorKubelet.yaml
├── prometheus-serviceMonitorKubeScheduler.yaml
├── prometheus-serviceMonitor.yaml
├── prometheus-service.yaml
└── setup
    ├── 0namespace-namespace.yaml
    ├── prometheus-operator-0alertmanagerCustomResourceDefinition.yaml
    ├── prometheus-operator-0podmonitorCustomResourceDefinition.yaml
    ├── prometheus-operator-0prometheusCustomResourceDefinition.yaml
    ├── prometheus-operator-0prometheusruleCustomResourceDefinition.yaml
    ├── prometheus-operator-0servicemonitorCustomResourceDefinition.yaml
    ├── prometheus-operator-clusterRoleBinding.yaml
    ├── prometheus-operator-clusterRole.yaml
    ├── prometheus-operator-deployment.yaml
    ├── prometheus-operator-serviceAccount.yaml
    └── prometheus-operator-service.yaml

1 directory, 65 files

III. Modifying the Prometheus configuration

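To keep the original file intact, make a copy of prometheus-prometheus.yaml and modify the copy as follows:

1) replicas: adjust the number of replicas to suit your project.

2) retention: the Prometheus data retention period. The default is "24h", and the value must match the regular expression "[0-9]+(ms|s|m|h|d|w|y)".

3) additionalScrapeConfigs: adds extra scrape configuration; for details see Section V, "Adding monitoring for targets outside k8s".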
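To make the retention format in item 2 concrete, here is a small sketch that checks candidate values against that regular expression before they go into the manifest. The function name is_valid_retention is made up for this example, and anchoring the pattern at both ends with fullmatch is my assumption about how the rule is meant to be read.

```python
import re

# Pattern quoted in item 2; fullmatch() anchors it at both ends (an assumption).
RETENTION_PATTERN = re.compile(r"[0-9]+(ms|s|m|h|d|w|y)")


def is_valid_retention(value: str) -> bool:
    """Return True if value is a whole number followed by one supported unit."""
    return RETENTION_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("15d", "24h", "2w", "15 d", "1.5d", "d"):
        print(candidate, is_valid_retention(candidate))
```

Values such as 15d, 24h, and 2w pass; a value with a space, a decimal point, or no number at all does not.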

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  # baseImage: quay.io/prometheus/prometheus
  baseImage: registry.cn-shanghai.aliyuncs.com/leozhanggg/prometheus/prometheus
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  retention: 15d
  nodeSelector:
    kubernetes.io/os: linux
  podMonitorSelector: {}
  replicas:
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.11.0
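The additionalScrapeConfigs field in the spec above points at a key inside a Secret (name additional-scrape-configs, key prometheus-additional.yaml), so that Secret has to exist in the monitoring namespace. One possible way to produce it is sketched below. The scrape job, its name external-node, and the target address are placeholders invented for illustration, and build_secret is a hypothetical helper, not part of kube-prometheus.

```python
import base64
import json

# Hypothetical scrape job, for illustration only: the job name and the
# target address (a documentation-range IP) are placeholders.
SCRAPE_CONFIG = """\
- job_name: external-node
  static_configs:
  - targets:
    - 192.0.2.10:9100
"""


def build_secret(scrape_config,
                 name="additional-scrape-configs",
                 key="prometheus-additional.yaml",
                 namespace="monitoring"):
    """Build a Secret manifest whose name and key match the Prometheus spec."""
    # Secret values under "data" must be base64-encoded.
    encoded = base64.b64encode(scrape_config.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: encoded},
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_secret(SCRAPE_CONFIG), indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f. The default name and key have to stay in sync with the values in the Prometheus spec, otherwise the operator cannot find the extra configuration.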

IV. Modifying the PrometheusRule configuration

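Start with the default rules file, prometheus-rules.yaml. It contains 76 alert rules and covers most of the commonly monitored points in k8s. Again, to keep the source file intact, make a copy of prometheus-rules.yaml and modify the copy.

In the version I use, the scheduler and controller-manager targets cannot be scraped on a bare-metal k8s cluster, so I commented out those rules. I also commented out the general-rules group because it conflicted with my custom rules. Finally, I added a platform annotation to distinguish environments (here "测试平台", meaning "test platform") and translated some of the alert messages into Chinese.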
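Adding the platform annotation to dozens of alerts by hand is tedious, so it may be worth scripting. The sketch below works on a PrometheusRule document that has already been loaded into a Python dict (for example with a YAML parser, which is not shown here). The function name add_platform and the tiny example document are illustrative; only the KubeAPIDown expression is taken from the rules file.

```python
import copy


def add_platform(rule_doc, platform):
    """Return a copy of a PrometheusRule dict with annotations.platform set on every alert."""
    result = copy.deepcopy(rule_doc)
    for group in result.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules carry "record" instead of "alert" and are left alone.
            if "alert" in rule:
                rule.setdefault("annotations", {})["platform"] = platform
    return result


# Minimal example document with one recording rule and one alerting rule.
example = {
    "spec": {
        "groups": [
            {
                "name": "demo",
                "rules": [
                    {"record": "instance:node_cpu:rate:sum", "expr": "vector(1)"},
                    {
                        "alert": "KubeAPIDown",
                        "expr": 'absent(up{job="apiserver"} == 1)',
                        "labels": {"severity": "critical"},
                    },
                ],
            }
        ]
    }
}

tagged = add_platform(example, "测试平台")
```

Because the function returns a deep copy, the same source document can be tagged once per environment without the variants interfering with each other.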

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: node-exporter.rules
    rules:
    - expr: |
        count without (cpu) (
          count without (mode) (
            node_cpu_seconds_total{job="node-exporter"}
          )
        )
      record: instance:node_num_cpu:sum
    - expr: |
        1 - avg without (cpu, mode) (
          rate(node_cpu_seconds_total{job="node-exporter", mode="idle"}[1m])
        )
      record: instance:node_cpu_utilisation:rate1m
    - expr: |
        (
          node_load1{job="node-exporter"}
        /
          instance:node_num_cpu:sum{job="node-exporter"}
        )
      record: instance:node_load1_per_cpu:ratio
    - expr: |
        1 - (
          node_memory_MemAvailable_bytes{job="node-exporter"}
        /
          node_memory_MemTotal_bytes{job="node-exporter"}
        )
      record: instance:node_memory_utilisation:ratio
    - expr: |
        rate(node_vmstat_pgmajfault{job="node-exporter"}[1m])
      record: instance:node_vmstat_pgmajfault:rate1m
    - expr: |
        rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
      record: instance_device:node_disk_io_time_seconds:rate1m
    - expr: |
        rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+"}[1m])
      record: instance_device:node_disk_io_time_weighted_seconds:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_receive_bytes_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_transmit_bytes_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_receive_drop_excluding_lo:rate1m
    - expr: |
        sum without (device) (
          rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[1m])
        )
      record: instance:node_network_transmit_drop_excluding_lo:rate1m
  - name: kube-apiserver.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (namespace, pod, container) (
          rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container!="POD"}[5m])
        ) * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        container_memory_working_set_bytes{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_working_set_bytes
    - expr: |
        container_memory_rss{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_rss
    - expr: |
        container_memory_cache{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_cache
    - expr: |
        container_memory_swap{job="kubelet", image!=""}
        * on (namespace, pod) group_left(node) max by(namespace, pod, node) (kube_pod_info)
      record: node_namespace_pod_container:container_memory_swap
    - expr: |
        sum(container_memory_usage_bytes{job="kubelet", image!="", container!="POD"}) by (namespace)
      record: namespace:container_memory_usage_bytes:sum
    - expr: |
        sum by (namespace, label_name) (
            sum(kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod)
          * on (namespace, pod)
            group_left(label_name) kube_pod_labels{job="kube-state-metrics"}
        )
      record: namespace:kube_pod_container_resource_requests_memory_bytes:sum
    - expr: |
        sum by (namespace, label_name) (
            sum(kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"} * on (endpoint, instance, job, namespace, pod, service) group_left(phase) (kube_pod_status_phase{phase=~"Pending|Running"} == 1)) by (namespace, pod)
          * on (namespace, pod)
            group_left(label_name) kube_pod_labels{job="kube-state-metrics"}
        )
      record: namespace:kube_pod_container_resource_requests_cpu_cores:sum
    - expr: |
        sum(
          label_replace(
            label_replace(
              kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
              "replicaset", "$1", "owner_name", "(.*)"
            ) * on(replicaset, namespace) group_left(owner_name) kube_replicaset_owner{job="kube-state-metrics"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: deployment
      record: mixin_pod_workload
    - expr: |
        sum(
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: daemonset
      record: mixin_pod_workload
    - expr: |
        sum(
          label_replace(
            kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
            "workload", "$1", "owner_name", "(.*)"
          )
        ) by (namespace, workload, pod)
      labels:
        workload_type: statefulset
      record: mixin_pod_workload
  - name: kube-scheduler.rules
    rules:
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.99, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.99"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.9, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.9"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
    - expr: |
        histogram_quantile(0.5, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m])) without(instance, pod))
      labels:
        quantile: "0.5"
      record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
  - name: node.rules
    rules:
    - expr: sum(min(kube_pod_info) by (node))
      record: ':kube_pod_info_node_count:'
    - expr: |
        max(label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)")) by (node, namespace, pod)
      record: 'node_namespace_pod:kube_pod_info:'
    - expr: |
        count by (node) (sum by (node, cpu) (
          node_cpu_seconds_total{job="node-exporter"}
        * on (namespace, pod) group_left(node)
          node_namespace_pod:kube_pod_info:
        ))
      record: node:node_num_cpu:sum
    - expr: |
        sum(
          node_memory_MemAvailable_bytes{job="node-exporter"} or
          (
            node_memory_Buffers_bytes{job="node-exporter"} +
            node_memory_Cached_bytes{job="node-exporter"} +
            node_memory_MemFree_bytes{job="node-exporter"} +
            node_memory_Slab_bytes{job="node-exporter"}
          )
        )
      record: :node_memory_MemAvailable_bytes:sum
  - name: kube-prometheus-node-recording.rules
    rules:
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[3m])) BY
        (instance)
      record: instance:node_cpu:rate:sum
    - expr: sum((node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"}))
        BY (instance)
      record: instance:node_filesystem_usage:sum
    - expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
      record: instance:node_network_receive_bytes:rate:sum
    - expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
      record: instance:node_network_transmit_bytes:rate:sum
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m])) WITHOUT
        (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total)
        BY (instance, cpu)) BY (instance)
      record: instance:node_cpu:ratio
    - expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait"}[5m]))
      record: cluster:node_cpu:sum_rate5m
    - expr: cluster:node_cpu_seconds_total:rate5m / count(sum(node_cpu_seconds_total)
        BY (instance, cpu))
      record: cluster:node_cpu:ratio
  - name: node-exporter
    rules:
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up.
        summary: "预计文件系统将在接下来的24小时内用完空间。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemSpaceFillingUp
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left and is filling
          up fast.
        summary: "预计文件系统将在接下来的4个小时内用完空间。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        summary: "文件系统剩余空间不到5%。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfSpace
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available space left.
        summary: "文件系统剩余空间不到3%。"
      expr: |
        (
          node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up.
        summary: "预计文件系统将在接下来的24小时内用尽inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemFilesFillingUp
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left and is filling
          up fast.
        summary: "预计文件系统将在接下来的4小时内用尽inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 20
        and
          predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 4*60*60) < 0
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        summary: "文件系统仅剩不到5%的inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 5
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: warning
    - alert: NodeFilesystemAlmostOutOfFiles
      annotations:
        platform: "测试平台"
        description: Filesystem on {{ $labels.device }} at {{ $labels.instance }}
          has only {{ printf "%.2f" $value }}% available inodes left.
        summary: "文件系统仅剩不到3%的inodes。"
      expr: |
        (
          node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3
        and
          node_filesystem_readonly{job="node-exporter",fstype!=""} == 0
        )
      for: 1h
      labels:
        severity: critical
    - alert: NodeNetworkReceiveErrs
      annotations:
        platform: "测试平台"
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} receive errors in the last two minutes.'
        summary: "网络接口报告许多接收错误。"
      expr: |
        increase(node_network_receive_errs_total[2m]) > 10
      for: 1h
      labels:
        severity: warning
    - alert: NodeNetworkTransmitErrs
      annotations:
        platform: "测试平台"
        description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered
          {{ printf "%.0f" $value }} transmit errors in the last two minutes.'
        summary: "网络接口报告许多传输错误。"
      expr: |
        increase(node_network_transmit_errs_total[2m]) > 10
      for: 1h
      labels:
        severity: warning
  - name: kubernetes-apps
    rules:
    - alert: KubePodCrashLooping
      annotations:
        platform: "测试平台"
        message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
          }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
      expr: |
        rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0
      for: 15m
      labels:
        severity: critical
    - alert: KubePodNotReady
      annotations:
        platform: "测试平台"
        message: "Pod {{$labels.namespace}}/{{$labels.pod}}处于未就绪状态的时间超过15分钟。"
      expr: |
        sum by (namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Failed|Pending|Unknown"} * on(namespace, pod) group_left(owner_kind) kube_pod_owner{owner_kind!="Job"}) > 0
      for: 15m
      labels:
        severity: critical
    - alert: KubeDeploymentGenerationMismatch
      annotations:
        platform: "测试平台"
        message: "Deployment {{$labels.namespace}}/{{$labels.deployment}}生成不匹配,这表明Deployment已失败但尚未回滚。"
      expr: |
        kube_deployment_status_observed_generation{job="kube-state-metrics"}
          !=
        kube_deployment_metadata_generation{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeDeploymentReplicasMismatch
      annotations:
        platform: "测试平台"
        message: "Deployment{{$labels.namespace}}/{{$labels.deployment}}超过15分钟未匹配预期的副本数。"
      expr: |
        kube_deployment_spec_replicas{job="kube-state-metrics"}
          !=
        kube_deployment_status_replicas_available{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeStatefulSetReplicasMismatch
      annotations:
        platform: "测试平台"
        message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}超过15分钟未匹配预期的副本数。"
      expr: |
        kube_statefulset_status_replicas_ready{job="kube-state-metrics"}
          !=
        kube_statefulset_status_replicas{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeStatefulSetGenerationMismatch
      annotations:
        platform: "测试平台"
        message: "StatefulSet {{$labels.namespace}}/{{$labels.statefulset}}生成不匹配,这表明StatefulSet已失败但尚未回滚。"
      expr: |
        kube_statefulset_status_observed_generation{job="kube-state-metrics"}
          !=
        kube_statefulset_metadata_generation{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: critical
    - alert: KubeStatefulSetUpdateNotRolledOut
      annotations:
        platform: "测试平台"
        message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update
          has not been rolled out.
      expr: |
        max without (revision) (
          kube_statefulset_status_current_revision{job="kube-state-metrics"}
            unless
          kube_statefulset_status_update_revision{job="kube-state-metrics"}
        )
          *
        (
          kube_statefulset_replicas{job="kube-state-metrics"}
            !=
          kube_statefulset_status_replicas_updated{job="kube-state-metrics"}
        )
      for: 15m
      labels:
        severity: critical
    - alert: KubeDaemonSetRolloutStuck
      annotations:
        platform: "测试平台"
        message: Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet
          {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready.
      expr: |
        kube_daemonset_status_number_ready{job="kube-state-metrics"}
          /
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00
      for: 15m
      labels:
        severity: critical
    - alert: KubeContainerWaiting
      annotations:
        platform: "测试平台"
        message: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}}
          has been in waiting state for longer than 1 hour.
      expr: |
        sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0
      for: 1h
      labels:
        severity: warning
    - alert: KubeDaemonSetNotScheduled
      annotations:
        platform: "测试平台"
        message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
          }} are not scheduled.'
      expr: |
        kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"}
          -
        kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0
      for: 10m
      labels:
        severity: warning
    - alert: KubeDaemonSetMisScheduled
      annotations:
        platform: "测试平台"
        message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
          }} are running where they are not supposed to run.'
      expr: |
        kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0
      for: 10m
      labels:
        severity: warning
    - alert: KubeCronJobRunning
      annotations:
        platform: "测试平台"
        message: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more
          than 1h to complete.
      expr: |
        time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600
      for: 1h
      labels:
        severity: warning
    - alert: KubeJobCompletion
      annotations:
        platform: "测试平台"
        message: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more
          than one hour to complete.
      expr: |
        kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0
      for: 1h
      labels:
        severity: warning
    - alert: KubeJobFailed
      annotations:
        platform: "测试平台"
        message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
      expr: |
        kube_job_failed{job="kube-state-metrics"} > 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeHpaReplicasMismatch
      annotations:
        platform: "测试平台"
        message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the
          desired number of replicas for longer than 15 minutes.
      expr: |
        (kube_hpa_status_desired_replicas{job="kube-state-metrics"}
          !=
        kube_hpa_status_current_replicas{job="kube-state-metrics"})
          and
        changes(kube_hpa_status_current_replicas[15m]) == 0
      for: 15m
      labels:
        severity: warning
    - alert: KubeHpaMaxedOut
      annotations:
        platform: "测试平台"
        message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at
          max replicas for longer than 15 minutes.
      expr: |
        kube_hpa_status_current_replicas{job="kube-state-metrics"}
          ==
        kube_hpa_spec_max_replicas{job="kube-state-metrics"}
      for: 15m
      labels:
        severity: warning
  - name: kubernetes-resources
    rules:
    - alert: KubeCPUOvercommit
      annotations:
        platform: "测试平台"
        message: "群集已超额使用Pod的CPU资源请求,因此无法容忍节点故障。"
      expr: |
        sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum)
          /
        sum(kube_node_status_allocatable_cpu_cores)
          >
        (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores)
      for: 5m
      labels:
        severity: warning
    - alert: KubeMemOvercommit
      annotations:
        platform: "测试平台"
        message: "群集已过量使用Pod的内存资源请求,因此无法容忍节点故障。"
      expr: |
        sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum)
          /
        sum(kube_node_status_allocatable_memory_bytes)
          >
        (count(kube_node_status_allocatable_memory_bytes)-1)
          /
        count(kube_node_status_allocatable_memory_bytes)
      for: 5m
      labels:
        severity: warning
    - alert: KubeCPUOvercommit
      annotations:
        platform: "测试平台"
        message: "群集已超额使用了对命名空间的CPU资源请求。"
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"})
          /
        sum(kube_node_status_allocatable_cpu_cores)
          > 1.5
      for: 5m
      labels:
        severity: warning
    - alert: KubeMemOvercommit
      annotations:
        platform: "测试平台"
        message: "群集已过量使用了对命名空间的内存资源请求。"
      expr: |
        sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"})
          /
        sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"})
          > 1.5
      for: 5m
      labels:
        severity: warning
    - alert: KubeQuotaExceeded
      annotations:
        platform: "测试平台"
        message: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage
          }} of its {{ $labels.resource }} quota.
      expr: |
        kube_resourcequota{job="kube-state-metrics", type="used"}
          / ignoring(instance, job, type)
        (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
          > 0.90
      for: 15m
      labels:
        severity: warning
    # - alert: CPUThrottlingHigh
    #   annotations:
    #     message: '{{ $value | humanizePercentage }} throttling of CPU in namespace
    #       {{ $labels.namespace }} for container {{ $labels.container }} in pod {{
    #       $labels.pod }}.'
    #   expr: |
    #     sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace)
    #       /
    #     sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace)
    #       > ( 25 / 100 )
    #   for: 15m
    #   labels:
    #     severity: warning
  - name: kubernetes-storage
    rules:
    - alert: KubePersistentVolumeUsageCritical
      annotations:
        platform: "测试平台"
        message: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
          }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage
          }} free.
      expr: |
        kubelet_volume_stats_available_bytes{job="kubelet"}
          /
        kubelet_volume_stats_capacity_bytes{job="kubelet"}
          < 0.03
      for: 1m
      labels:
        severity: critical
    - alert: KubePersistentVolumeFullInFourDays
      annotations:
        platform: "测试平台"
        message: "根据最近的抽样,{{$labels.persistentvolumeclaim}}在命名空间{{$labels.namespace}}中声明的PersistentVolume预计将在四天内填满,目前{{$value | humanizePercentage}}可用。"
      expr: |
        (
          kubelet_volume_stats_available_bytes{job="kubelet"}
            /
          kubelet_volume_stats_capacity_bytes{job="kubelet"}
        ) < 0.15
        and
        predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0
      for: 1h
      labels:
        severity: critical
    - alert: KubePersistentVolumeErrors
      annotations:
        platform: "测试平台"
        message: The persistent volume {{ $labels.persistentvolume }} has status {{
          $labels.phase }}.
      expr: |
        kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0
      for: 5m
      labels:
        severity: critical
  - name: kubernetes-system
    rules:
    - alert: KubeVersionMismatch
      annotations:
        platform: "测试平台"
        message: There are {{ $value }} different semantic versions of Kubernetes
          components running.
      expr: |
        count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1
      for: 15m
      labels:
        severity: warning
    - alert: KubeClientErrors
      annotations:
        platform: "测试平台"
        message: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance
          }}' is experiencing {{ $value | humanizePercentage }} errors.'
      expr: |
        (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job)
          /
        sum(rate(rest_client_requests_total[5m])) by (instance, job))
        > 0.01
      for: 15m
      labels:
        severity: warning
  - name: kubernetes-system-apiserver
    rules:
    - alert: KubeAPILatencyHigh
      annotations:
        platform: "测试平台"
        message: The API server has a 99th percentile latency of {{ $value }} seconds
          for {{ $labels.verb }} {{ $labels.resource }}.
      expr: |
        cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 1
      for: 10m
      labels:
        severity: warning
    - alert: KubeAPILatencyHigh
      annotations:
        platform: "测试平台"
        message: The API server has a 99th percentile latency of {{ $value }} seconds
          for {{ $labels.verb }} {{ $labels.resource }}.
      expr: |
        cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|PROXY|CONNECT"} > 4
      for: 10m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage
          }} of requests.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.03
      for: 10m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage
          }} of requests.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m]))
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) > 0.01
      for: 10m
      labels:
        severity: warning
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage
          }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource
          }}.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.10
      for: 10m
      labels:
        severity: critical
    - alert: KubeAPIErrorsHigh
      annotations:
        platform: "测试平台"
        message: API server is returning errors for {{ $value | humanizePercentage
          }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource
          }}.
      expr: |
        sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb)
          /
        sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05
      for: 10m
      labels:
        severity: warning
- alert: KubeClientCertificateExpiration
annotations:
platform: "Test platform"
message: "A client certificate used to authenticate to the apiserver expires in less than 7.0 days."
expr: |
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800
labels:
severity: warning
- alert: KubeClientCertificateExpiration
annotations:
platform: "Test platform"
message: "A client certificate used to authenticate to the apiserver expires in less than 24.0 hours."
expr: |
apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400
labels:
severity: critical
- alert: KubeAPIDown
annotations:
platform: "Test platform"
message: "KubeAPI has disappeared from Prometheus target discovery."
expr: |
absent(up{job="apiserver"} == 1)
for: 15m
labels:
severity: critical
- name: kubernetes-system-kubelet
rules:
- alert: KubeNodeNotReady
annotations:
platform: "Test platform"
message: "{{$labels.node}} has been unready for more than 15 minutes."
expr: |
kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
for: 15m
labels:
severity: warning
- alert: KubeNodeUnreachable
annotations:
platform: "Test platform"
message: "{{$labels.node}} is unreachable and some workloads may be rescheduled."
expr: |
kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1
labels:
severity: warning
- alert: KubeletTooManyPods
annotations:
platform: "Test platform"
message: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage
}} of its Pod capacity.
expr: |
max(max(kubelet_running_pod_count{job="kubelet"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"}) by(node) > 0.95
for: 15m
labels:
severity: warning
- alert: KubeletDown
annotations:
platform: "Test platform"
message: "Kubelet has disappeared from Prometheus target discovery."
expr: |
absent(up{job="kubelet"} == 1)
for: 15m
labels:
severity: critical
# - name: kubernetes-system-scheduler
# rules:
# - alert: KubeSchedulerDown
# annotations:
# message: KubeScheduler has disappeared from Prometheus target discovery.
# runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown
# expr: |
# absent(up{job="kube-scheduler"} == 1)
# for: 15m
# labels:
# severity: critical
# - name: kubernetes-system-controller-manager
# rules:
# - alert: KubeControllerManagerDown
# annotations:
# message: KubeControllerManager has disappeared from Prometheus target discovery.
# runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown
# expr: |
# absent(up{job="kube-controller-manager"} == 1)
# for: 15m
# labels:
# severity: critical
- name: prometheus
rules:
- alert: PrometheusBadConfig
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to
reload its configuration.
summary: "Failed Prometheus configuration reload."
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_config_last_reload_successful{job="prometheus-k8s",namespace="monitoring"}[5m]) == 0
for: 10m
labels:
severity: critical
- alert: PrometheusNotificationQueueRunningFull
annotations:
platform: "Test platform"
description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}}
is running full.
summary: "Prometheus alert notification queue is predicted to run full in less than 30m."
expr: |
# Without min_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
predict_linear(prometheus_notifications_queue_length{job="prometheus-k8s",namespace="monitoring"}[5m], 60 * 30)
>
min_over_time(prometheus_notifications_queue_capacity{job="prometheus-k8s",namespace="monitoring"}[5m])
)
for: 15m
labels:
severity: warning
- alert: PrometheusErrorSendingAlertsToSomeAlertmanagers
annotations:
platform: "Test platform"
description: '{{ printf "%.1f" $value }}% errors while sending alerts from
Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.'
summary: "Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager."
expr: |
(
rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m])
/
rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])
)
* 100
> 1
for: 15m
labels:
severity: warning
- alert: PrometheusErrorSendingAlertsToAnyAlertmanager
annotations:
platform: "Test platform"
description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts
from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.'
summary: "Prometheus has encountered more than 3% errors sending alerts to any Alertmanager."
expr: |
min without(alertmanager) (
rate(prometheus_notifications_errors_total{job="prometheus-k8s",namespace="monitoring"}[5m])
/
rate(prometheus_notifications_sent_total{job="prometheus-k8s",namespace="monitoring"}[5m])
)
* 100
> 3
for: 15m
labels:
severity: critical
- alert: PrometheusNotConnectedToAlertmanagers
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected
to any Alertmanagers.
summary: "Prometheus is not connected to any Alertmanager."
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-k8s",namespace="monitoring"}[5m]) < 1
for: 10m
labels:
severity: warning
- alert: PrometheusTSDBReloadsFailing
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected
{{$value | humanize}} reload failures over the last 3h.
summary: "Prometheus has issues reloading blocks from disk."
expr: |
increase(prometheus_tsdb_reloads_failures_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0
for: 4h
labels:
severity: warning
- alert: PrometheusTSDBCompactionsFailing
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected
{{$value | humanize}} compaction failures over the last 3h.
summary: "Prometheus has issues compacting blocks."
expr: |
increase(prometheus_tsdb_compactions_failed_total{job="prometheus-k8s",namespace="monitoring"}[3h]) > 0
for: 4h
labels:
severity: warning
- alert: PrometheusNotIngestingSamples
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting
samples.
summary: "Prometheus is not ingesting samples."
expr: |
rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-k8s",namespace="monitoring"}[5m]) <= 0
for: 10m
labels:
severity: warning
- alert: PrometheusDuplicateTimestamps
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping
{{ printf "%.4g" $value }} samples/s with different values but duplicated
timestamp.
summary: "Prometheus is dropping samples with duplicate timestamps."
expr: |
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 10m
labels:
severity: warning
- alert: PrometheusOutOfOrderTimestamps
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping
{{ printf "%.4g" $value }} samples/s with timestamps arriving out of order.
summary: "Prometheus is dropping samples with out-of-order timestamps."
expr: |
rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 10m
labels:
severity: warning
- alert: PrometheusRemoteStorageFailures
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send
{{ printf "%.1f" $value }}% of the samples to queue {{$labels.queue}}.
summary: "Prometheus fails to send samples to remote storage."
expr: |
(
rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
/
(
rate(prometheus_remote_storage_failed_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
+
rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-k8s",namespace="monitoring"}[5m])
)
)
* 100
> 1
for: 15m
labels:
severity: critical
- alert: PrometheusRemoteWriteBehind
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write
is {{ printf "%.1f" $value }}s behind for queue {{$labels.queue}}.
summary: "Prometheus remote write is behind."
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-k8s",namespace="monitoring"}[5m])
- on(job, instance) group_right
max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-k8s",namespace="monitoring"}[5m])
)
> 120
for: 15m
labels:
severity: critical
- alert: PrometheusRemoteWriteDesiredShards
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write
desired shards calculation wants to run {{ $value }} shards, which is more
than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-k8s",namespace="monitoring"}`
$labels.instance | query | first | value }}.
summary: "Prometheus remote write desired shards calculation wants to run more shards than the configured maximum."
expr: |
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-k8s",namespace="monitoring"}[5m])
>
max_over_time(prometheus_remote_storage_shards_max{job="prometheus-k8s",namespace="monitoring"}[5m])
)
for: 15m
labels:
severity: warning
- alert: PrometheusRuleFailures
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to
evaluate {{ printf "%.0f" $value }} rules in the last 5m.
summary: "Prometheus is failing rule evaluations."
expr: |
increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: critical
- alert: PrometheusMissingRuleEvaluations
annotations:
platform: "Test platform"
description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{
printf "%.0f" $value }} rule group evaluations in the last 5m.
summary: "Prometheus is missing rule evaluations because rule group evaluation is slow."
expr: |
increase(prometheus_rule_group_iterations_missed_total{job="prometheus-k8s",namespace="monitoring"}[5m]) > 0
for: 15m
labels:
severity: warning
- name: alertmanager.rules
rules:
- alert: AlertmanagerConfigInconsistent
annotations:
platform: "Test platform"
message: "The configurations of the Alertmanager {{$labels.service}} instances are out of sync."
expr: |
count_values("config_hash", alertmanager_config_hash{job="alertmanager-main",namespace="monitoring"}) BY (service) / ON(service) GROUP_LEFT() label_replace(max(prometheus_operator_spec_replicas{job="prometheus-operator",namespace="monitoring",controller="alertmanager"}) by (name, job, namespace, controller), "service", "alertmanager-$1", "name", "(.*)") != 1
for: 5m
labels:
severity: critical
- alert: AlertmanagerFailedReload
annotations:
platform: "Test platform"
message: "Reloading the configuration of Alertmanager {{$labels.namespace}}/{{$labels.pod}} failed."
expr: |
alertmanager_config_last_reload_successful{job="alertmanager-main",namespace="monitoring"} == 0
for: 10m
labels:
severity: warning
- alert: AlertmanagerMembersInconsistent
annotations:
platform: "Test platform"
message: "Alertmanager has not found all other members of the cluster."
expr: |
alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"}
!= on (service) GROUP_LEFT()
count by (service) (alertmanager_cluster_members{job="alertmanager-main",namespace="monitoring"})
for: 5m
labels:
severity: critical
# - name: general.rules
# rules:
# - alert: TargetDown
# annotations:
# platform: "Test platform"
# message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }} targets in
# {{ $labels.namespace }} namespace are down.'
# expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job,
# namespace, service)) > 10
# for: 10m
# labels:
# severity: warning
# - alert: Watchdog
# annotations:
# platform: "Test platform"
# message: "This alert is always firing; it is meant to ensure that the entire alerting pipeline is functional."
# runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md
# expr: vector(1)
# labels:
# severity: none
- name: node-time
rules:
- alert: ClockSkewDetected
annotations:
platform: "Test platform"
message: Clock skew detected on node-exporter {{ $labels.namespace }}/{{ $labels.pod
}}. Ensure NTP is configured correctly on this host.
expr: |
abs(node_timex_offset_seconds{job="node-exporter"}) > 0.05
for: 2m
labels:
severity: warning
- name: node-network
rules:
- alert: NodeNetworkInterfaceFlapping
annotations:
platform: "Test platform"
message: Network interface "{{ $labels.device }}" changing its up status
often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}
expr: |
changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
for: 2m
labels:
severity: warning
- name: prometheus-operator
rules:
- alert: PrometheusOperatorReconcileErrors
annotations:
platform: "Test platform"
message: Errors while reconciling {{ $labels.controller }} in {{ $labels.namespace
}} Namespace.
expr: |
rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
for: 10m
labels:
severity: warning
- alert: PrometheusOperatorNodeLookupErrors
annotations:
platform: "Test platform"
message: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace.
expr: |
rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator",namespace="monitoring"}[5m]) > 0.1
for: 10m
labels:
severity: warning

prometheus-rules.yaml

Next, using prometheus-rules.yaml as a reference, create a file for our own alert rules, prometheus-additional-rules.yaml.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-additional-rules
  namespace: monitoring
spec:
  groups:
  - name: general.rules
    rules:
    - alert: InstanceDown
      expr: up == 0
      for: 30s
      labels:
        severity: critical
      annotations:
        platform: "育苗通 test platform"
        summary: "Monitoring exporter {{ $labels.instance }} has stopped working!"
        value: "{{ $value }}"
  - name: node.rules
    rules:
    - alert: NodeFilesystemUsage
      expr: 100 - (node_filesystem_free_bytes{device="rootfs"} / node_filesystem_size_bytes{device="rootfs"} * 100) > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        platform: "育苗通 test platform"
        summary: "Host {{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} exceeds 85%!"
        value: "{{ $value }}"
    - alert: NodeMemoryUsage
      expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        platform: "育苗通 test platform"
        summary: "Host {{ $labels.instance }}: memory usage exceeds 80%!"
        value: "{{ $value }}"
    - alert: NodeCPUUsage
      expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        platform: "育苗通 test platform"
        summary: "Host {{ $labels.instance }}: CPU usage exceeds 80%!"
        value: "{{ $value }}"
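
The two labels under metadata.labels are what get this file loaded at all. As far as I can tell, the Prometheus resource shipped in prometheus-prometheus.yaml selects PrometheusRule objects by label, so a rule file without matching labels would be ignored without any error. A sketch of the relevant excerpt, assuming the release-0.3 defaults are unchanged (check your own copy of the manifest):

```yaml
# Excerpt of the Prometheus resource; all other spec fields are omitted here.
# Assumption: selector values are the release-0.3 defaults.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # Only PrometheusRule objects carrying both labels are picked up.
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
```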


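5. Adding monitoring for components outside k8s

At the start of a project it is often hard to containerize everything, for example databases or a CDH cluster. We still need to monitor them, and splitting the work across two separate Prometheus deployments is harder to manage, so we add these targets to kube-prometheus as well.

Next, create the file prometheus-additional.yaml and add the scrape_configs entries for the extra components.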

# Note: *.*.*.N is a masked IP address; substitute the real address before use.
- job_name: 'node_export_others'
  static_configs:
  - targets:
    - *.*.*.118:9591
    - *.*.*.137:9591
    - *.*.*.138:9591
    - *.*.*.139:9591
    - *.*.*.153:9591
    - *.*.*.154:9591
    - *.*.*.156:9591
    - *.*.*.159:9591
    - *.*.*.166:9591
    - *.*.*.167:9591
    - *.*.*.168:9591
    - *.*.*.169:9591
    - *.*.*.175:9591
    - *.*.*.177:9591
    - *.*.*.179:9591
    - *.*.*.180:9591
    - *.*.*.181:9591
- job_name: 'nginx'
  static_configs:
  - targets:
    - *.*.*.159:9593
- job_name: 'mysql'
  static_configs:
  - targets:
    - *.*.*.156:9592
- job_name: 'pushgateway-flink'
  static_configs:
  - targets:
    - *.*.*.137:9091
- job_name: 'kafka_jmx_export'
  static_configs:
  - targets:
    - *.*.*.118:9991
    - *.*.*.138:9991
    - *.*.*.139:9991
- job_name: 'zookeeper_cdh_export'
  static_configs:
  - targets:
    - *.*.*.118:9595
    - *.*.*.138:9595
    - *.*.*.139:9595
- job_name: 'hdfs_namenode_cdh'
  static_configs:
  - targets:
    - *.*.*.137:9600
    - *.*.*.139:9600
- job_name: 'hdfs_datanode_cdh'
  static_configs:
  - targets:
    - *.*.*.118:9601
    - *.*.*.138:9601
    - *.*.*.139:9601
- job_name: 'yarn_resourcemanager_cdh'
  static_configs:
  - targets:
    - *.*.*.137:9602
    - *.*.*.139:9602
- job_name: 'yarn_nodemanager_cdh'
  static_configs:
  - targets:
    - *.*.*.118:9603
    - *.*.*.138:9603
    - *.*.*.139:9603
- job_name: 'redis_exporter'
  static_configs:
  - targets:
    - *.*.*.166:9594
- job_name: 'elasticsearch'
  metrics_path: "/_prometheus/metrics"
  static_configs:
  - targets:
    - *.*.*.169:8970
    - *.*.*.175:8970
    - *.*.*.177:8970
- job_name: 'nacos'
  metrics_path: '/nacos/actuator/prometheus'
  static_configs:
  - targets:
    - *.*.*.166:8848
    - *.*.*.167:8848
    - *.*.*.168:8848
- job_name: 'redis_exporter_targets'
  static_configs:
  - targets:
    - redis://*.*.*.179:6179
    - redis://*.*.*.180:6179
    - redis://*.*.*.181:6179
    - redis://*.*.*.179:6178
    - redis://*.*.*.180:6178
    - redis://*.*.*.181:6178
    - redis://*.*.*.179:6177
    - redis://*.*.*.179:6176
  metrics_path: /scrape
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: *.*.*.166:9594

prometheus-additional.yaml

Then we need to store this scrape configuration in the k8s cluster as a Secret.

kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring
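
The Secret does nothing on its own until the Prometheus resource references it. A minimal sketch of that reference, assuming the Prometheus object is the one named k8s in prometheus-prometheus.yaml, and that the Secret key is the file name passed to --from-file. If your prometheus-prometheus.yaml already contains this field, there is nothing more to add:

```yaml
# Excerpt: only the field to add under spec; other spec fields are omitted.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs   # the Secret created above
    key: prometheus-additional.yaml   # key = file name given to --from-file
```

The operator appends the referenced scrape_configs to the configuration it generates, so a syntax error in the file can break the whole Prometheus configuration. It is worth checking the file before creating the Secret.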


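6. Configuring Alertmanager

The monitoring targets and alert rules are now in place, so next comes the Alertmanager notification configuration.

Email is the most common receiver, but here we deliver alerts to WeChat Work (企业微信), so we developed an application that connects to WeChat, appalertservice, to forward and process the messages.

Of course, you can also configure the WeChat account and message template directly in Alertmanager; see: Chapter 3, Prometheus alert handling (第3章 Prometheus告警处理)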

global:
  resolve_timeout: 5m
  # smtp_smarthost: 'smtp.sina.com:25'
  # smtp_from: '******@sina.com'
  # smtp_auth_username: '******@sina.com'
  # smtp_auth_password: '******'
route:
  group_by: ['job']
  group_wait: 20s
  group_interval: 30m
  repeat_interval: 12h
  receiver: webhook
receivers:
- name: webhook
  webhook_configs:
  - url: 'http://appalertservice:20119/'
# - name: 'email'
#   email_configs:
#   - to: '******@163.com'
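
If you would rather not run a forwarding service, Alertmanager can send to WeChat Work itself through wechat_configs. A rough sketch of such a receiver, not what I actually deployed. Every value below is a made-up placeholder that you would take from your own WeChat Work admin console:

```yaml
# Alternative receiver (sketch). All IDs and secrets are placeholders.
receivers:
- name: wechat
  wechat_configs:
  - corp_id: 'YOUR_CORP_ID'        # placeholder: enterprise ID
    agent_id: 'YOUR_AGENT_ID'      # placeholder: application agent ID
    api_secret: 'YOUR_API_SECRET'  # placeholder: application secret
    to_party: '1'                  # placeholder: department ID to notify
    send_resolved: true            # also notify when the alert resolves
```

With this variant, route.receiver would have to point at wechat instead of webhook.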

Then we need to store the Alertmanager configuration in the k8s cluster as a Secret.

kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
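
The same Secret can also be written declaratively, in the spirit of the alertmanager-secret.yaml file in the manifests directory. A sketch, assuming the key has to be alertmanager.yaml (the file name used in the command above), with stringData so the configuration stays readable instead of base64-encoded:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
stringData:
  # The key name matches the file name passed to --from-file above.
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      group_by: ['job']
      group_wait: 20s
      group_interval: 30m
      repeat_interval: 12h
      receiver: webhook
    receivers:
    - name: webhook
      webhook_configs:
      - url: 'http://appalertservice:20119/'
```

Applying a manifest with kubectl apply -f has the advantage that it also updates a Secret that already exists, whereas kubectl create fails in that case.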


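7. Configuring Grafana

Monitoring and alerting are now configured, which leaves only visualization. Open Grafana -> click the add button -> Import -> Upload .json file, and import the monitoring dashboard.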

{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "This dashboard provides cluster admins with the ability to monitor nodes and identify workload bottlenecks. It can be deployed with PSPs enabled using the following helm chart - https://github.com/pivotal-cf/charts-grafana",
"editable": true,
"gnetId": 10000,
"graphTooltip": 0,
"id": 102,
"iteration": 1597137794957,
"links": [],
"panels": [
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 34,
"panels": [],
"repeat": null,
"title": "Summary",
"type": "row"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"editable": true,
"error": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 5,
"w": 8,
"x": 0,
"y": 1
},
"height": "180px",
"id": 4,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}) / sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"}) * 100",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "65, 90",
"title": "Cluster memory usage",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 5,
"w": 8,
"x": 8,
"y": 1
},
"height": "180px",
"id": 6,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) / sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"}) * 100",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "65, 90",
"title": "Cluster CPU usage",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": true,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 5,
"w": 8,
"x": 16,
"y": 1
},
"height": "180px",
"id": 7,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_fs_usage_bytes{id=\"/\"}) / sum (container_fs_limit_bytes{id=\"/\"}) * 100",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"metric": "",
"refId": "A",
"step": 10
}
],
"thresholds": "65, 90",
"title": "Cluster filesystem usage",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 0,
"y": 6
},
"height": "1px",
"id": 9,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "20%",
"prefix": "",
"prefixFontSize": "20%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_memory_working_set_bytes{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Used",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 4,
"y": 6
},
"height": "1px",
"id": 10,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (machine_memory_bytes{kubernetes_io_hostname=~\"^$Node$\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Total",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 8,
"y": 6
},
"height": "1px",
"id": 11,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": " cores",
"postfixFontSize": "30%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (rate (container_cpu_usage_seconds_total{id=\"/\",kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Used",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "none",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 12,
"y": 6
},
"height": "1px",
"id": 12,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": " cores",
"postfixFontSize": "30%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (machine_cpu_cores{kubernetes_io_hostname=~\"^$Node$\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Total",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 16,
"y": 6
},
"height": "1px",
"id": 13,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_fs_usage_bytes{id=\"/\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Used",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": false,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 3,
"w": 4,
"x": 20,
"y": 6
},
"height": "1px",
"id": 14,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum (container_fs_limit_bytes{id=\"/\"})",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 10
}
],
"thresholds": "",
"title": "Total",
"type": "singlestat",
"valueFontSize": "50%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 9
},
"id": 35,
"panels": [],
"repeat": null,
"title": "Memory",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fill": 0,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 10
},
"id": 25,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": false,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"sideWidth": 200,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum (container_memory_working_set_bytes{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "{{ pod }}",
"metric": "container_memory_usage:sort_desc",
"refId": "A",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Pods memory usage",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 17
},
"id": 37,
"panels": [],
"title": "CPU",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 3,
"editable": true,
"error": false,
"fill": 0,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 18
},
"height": "",
"id": 17,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": true,
"steppedLine": false,
"targets": [
{
"expr": "sum (rate (container_cpu_usage_seconds_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "{{ pod }}",
"metric": "container_cpu",
"refId": "A",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Pods CPU usage",
"tooltip": {
"msResolution": true,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "none",
"label": "cores",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 25
},
"id": 33,
"panels": [],
"repeat": null,
"title": "Network I/O",
"type": "row"
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fill": 1,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 7,
"w": 24,
"x": 0,
"y": 26
},
"id": 16,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"max": false,
"min": false,
"rightSide": true,
"show": true,
"sideWidth": 200,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum (rate (container_network_receive_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "-> {{ pod }}",
"metric": "network",
"refId": "A",
"step": 10
},
{
"expr": "- sum (rate (container_network_transmit_bytes_total{image!=\"\",name=~\"^k8s_.*\",kubernetes_io_hostname=~\"^$Node$\"}[$interval])) by (pod)",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "<- {{ pod }}",
"metric": "network",
"refId": "B",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Pods network I/O",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fill": 1,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 5,
"w": 24,
"x": 0,
"y": 33
},
"height": "200px",
"id": 32,
"legend": {
"alignAsTable": false,
"avg": true,
"current": true,
"max": false,
"min": false,
"rightSide": false,
"show": false,
"sideWidth": 200,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "connected",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum (rate (container_network_receive_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "Received",
"metric": "network",
"refId": "A",
"step": 10
},
{
"expr": "- sum (rate (container_network_transmit_bytes_total{kubernetes_io_hostname=~\"^$Node$\"}[$interval]))",
"format": "time_series",
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "Sent",
"metric": "network",
"refId": "B",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Network I/O pressure",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "Bps",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "10s",
"schemaVersion": 20,
"style": "dark",
"tags": [
"Prometheus",
"Kubernetes"
],
"templating": {
"list": [
{
"auto": true,
"auto_count": 20,
"auto_min": "2m",
"current": {
"text": "auto",
"value": "$__auto_interval_interval"
},
"hide": 2,
"label": null,
"name": "interval",
"options": [
{
"selected": true,
"text": "auto",
"value": "$__auto_interval_interval"
},
{
"selected": false,
"text": "1m",
"value": "1m"
},
{
"selected": false,
"text": "10m",
"value": "10m"
},
{
"selected": false,
"text": "30m",
"value": "30m"
},
{
"selected": false,
"text": "1h",
"value": "1h"
},
{
"selected": false,
"text": "6h",
"value": "6h"
},
{
"selected": false,
"text": "12h",
"value": "12h"
},
{
"selected": false,
"text": "1d",
"value": "1d"
},
{
"selected": false,
"text": "7d",
"value": "7d"
},
{
"selected": false,
"text": "14d",
"value": "14d"
},
{
"selected": false,
"text": "30d",
"value": "30d"
}
],
"query": "1m,10m,30m,1h,6h,12h,1d,7d,14d,30d",
"refresh": 2,
"skipUrlSync": false,
"type": "interval"
},
{
"current": {
"text": "prometheus",
"value": "prometheus"
},
"hide": 0,
"includeAll": false,
"label": null,
"multi": false,
"name": "datasource",
"options": [],
"query": "prometheus",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"type": "datasource"
},
{
"allValue": ".*",
"current": {
"text": "All",
"value": "$__all"
},
"datasource": "prometheus",
"definition": "",
"hide": 0,
"includeAll": true,
"label": null,
"multi": false,
"name": "Node",
"options": [],
"query": "label_values(kubernetes_io_hostname)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 0,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-5m",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "育苗通K8S集群监控",
"uid": "6KoW2MIGk",
"version": 15
}
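
The export above is long, so it can help to list which Prometheus queries each panel runs before importing it, for example to confirm that labels such as `kubernetes_io_hostname` exist in your cluster. Below is a minimal sketch of that check, not part of the original deployment. The helper names (`iter_panels`, `panel_queries`, `load_dashboard`) and the trimmed `SAMPLE` dashboard are hypothetical and only illustrate the shape of the JSON.

```python
import json


def iter_panels(panels):
    """Yield every panel, descending into rows that nest their own panels."""
    for panel in panels or []:
        yield panel
        # Collapsed rows keep their child panels in a nested "panels" list.
        yield from iter_panels(panel.get("panels") or [])


def panel_queries(dashboard):
    """Return (panel title, PromQL expression) pairs for a dashboard dict."""
    pairs = []
    for panel in iter_panels(dashboard.get("panels")):
        for target in panel.get("targets") or []:
            expr = target.get("expr")
            if expr:
                pairs.append((panel.get("title", ""), expr))
    return pairs


def load_dashboard(path):
    """Load an exported dashboard; the path is whatever file you saved it to."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical, trimmed-down sample mirroring two panels from the export above.
SAMPLE = {
    "panels": [
        {"title": "Memory", "type": "row", "panels": []},
        {
            "title": "Total",
            "type": "singlestat",
            "targets": [
                {"expr": 'sum (container_fs_limit_bytes{id="/"})', "refId": "A"}
            ],
        },
    ]
}

for title, expr in panel_queries(SAMPLE):
    print(f"{title}: {expr}")
```

To run it against a real export, pass the result of `load_dashboard(...)` (with the path of your saved JSON file) to `panel_queries` instead of `SAMPLE`.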

k8s-model.json

{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "【中文版本】2020.06.28更新,增加整体资源展示!支持 Grafana6&7,Node Exporter v0.16及以上的版本,优化重要指标展示。包含整体资源展示与资源明细图表:CPU 内存 磁盘 IO 网络等监控指标。https://github.com/starsliao/Prometheus",
"editable": true,
"gnetId": 8919,
"graphTooltip": 0,
"id": 72,
"iteration": 1597137684806,
"links": [
{
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "更新node_exporter",
"tooltip": "",
"type": "link",
"url": "https://github.com/prometheus/node_exporter/releases"
},
{
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "更新当前仪表板",
"tooltip": "",
"type": "link",
"url": "https://grafana.com/dashboards/8919"
},
{
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "StarsL.cn",
"tooltip": "",
"type": "link",
"url": "https://starsl.cn"
},
{
"asDropdown": true,
"icon": "external link",
"tags": [],
"targetBlank": true,
"title": "",
"type": "dashboards"
}
],
"panels": [
{
"collapsed": false,
"datasource": "prometheus",
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 0
},
"id": 187,
"panels": [],
"title": "资源总览(关联JOB项)当前选中主机:【$show_hostname】实例:$node",
"type": "row"
},
{
"columns": [],
"datasource": "prometheus",
"description": "分区使用率、磁盘读取、磁盘写入、下载带宽、上传带宽,如果有多个网卡或者多个分区,是采集的使用率最高的网卡或者分区的数值。",
"fontSize": "100%",
"gridPos": {
"h": 12,
"w": 24,
"x": 0,
"y": 1
},
"id": 185,
"options": {},
"pageSize": 10,
"showHeader": true,
"sort": {
"col": 5,
"desc": false
},
"styles": [
{
"alias": "主机名",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 1,
"link": false,
"linkTooltip": "",
"linkUrl": "",
"mappingType": 1,
"pattern": "nodename",
"thresholds": [],
"type": "string",
"unit": "bytes"
},
{
"alias": "IP(链接到明细)",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": true,
"linkTargetBlank": false,
"linkTooltip": "浏览主机明细",
"linkUrl": "/d/9CWBz0bik/node-exporter?orgId=1&var-job=${job}&var-hostname=All&var-node=${__cell}&var-device=All",
"mappingType": 1,
"pattern": "instance",
"thresholds": [],
"type": "number",
"unit": "short"
},
{
"alias": "内存",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"mappingType": 1,
"pattern": "Value #B",
"thresholds": [],
"type": "number",
"unit": "bytes"
},
{
"alias": "CPU核",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": null,
"mappingType": 1,
"pattern": "Value #C",
"thresholds": [],
"type": "number",
"unit": "short"
},
{
"alias": " 运行时间",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #D",
"thresholds": [],
"type": "number",
"unit": "s"
},
{
"alias": "分区使用率*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #E",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "CPU使用率",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #F",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "内存使用率",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #G",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "磁盘读取*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #H",
"thresholds": [
"",
""
],
"type": "number",
"unit": "Bps"
},
{
"alias": "磁盘写入*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #I",
"thresholds": [
"",
""
],
"type": "number",
"unit": "Bps"
},
{
"alias": "下载带宽*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #J",
"thresholds": [
"",
""
],
"type": "number",
"unit": "bps"
},
{
"alias": "上传带宽*",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #K",
"thresholds": [
"",
""
],
"type": "number",
"unit": "bps"
},
{
"alias": "5m负载",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "Value #L",
"thresholds": [],
"type": "number",
"unit": "short"
},
{
"alias": "",
"align": "right",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"decimals": 2,
"pattern": "/.*/",
"thresholds": [],
"type": "hidden",
"unit": "short"
}
],
"targets": [
{
"expr": "node_uname_info{job=~\"$job\"} - 0",
"format": "table",
"instant": true,
"interval": "",
"legendFormat": "主机名",
"refId": "A"
},
{
"expr": "sum(time() - node_boot_time_seconds{job=~\"$job\"})by(instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "运行时间",
"refId": "D"
},
{
"expr": "node_memory_MemTotal_bytes{job=~\"$job\"} - 0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "总内存",
"refId": "B"
},
{
"expr": "count(node_cpu_seconds_total{job=~\"$job\",mode='system'}) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "总核数",
"refId": "C"
},
{
"expr": "node_load5{job=~\"$job\"}",
"format": "table",
"instant": true,
"interval": "",
"legendFormat": "5分钟负载",
"refId": "L"
},
{
"expr": "(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "CPU使用率",
"refId": "F"
},
{
"expr": "(1 - (node_memory_MemAvailable_bytes{job=~\"$job\"} / (node_memory_MemTotal_bytes{job=~\"$job\"})))* 100",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "内存使用率",
"refId": "G"
},
{
"expr": "max((node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}) *100/(node_filesystem_avail_bytes {job=~\"$job\",fstype=~\"ext.?|xfs\"}+(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"}-node_filesystem_free_bytes{job=~\"$job\",fstype=~\"ext.?|xfs\"})))by(instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "分区使用率",
"refId": "E"
},
{
"expr": "max(irate(node_disk_read_bytes_total{job=~\"$job\"}[5m])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "最大读取",
"refId": "H"
},
{
"expr": "max(irate(node_disk_written_bytes_total{job=~\"$job\"}[5m])) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "最大写入",
"refId": "I"
},
{
"expr": "max(irate(node_network_receive_bytes_total{job=~\"$job\"}[5m])*8) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "下载带宽",
"refId": "J"
},
{
"expr": "max(irate(node_network_transmit_bytes_total{job=~\"$job\"}[5m])*8) by (instance)",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "上传带宽",
"refId": "K"
}
],
"timeFrom": null,
"timeShift": null,
"title": "服务器资源总览表(每页10行)",
"transform": "table",
"type": "table"
},
{
"aliasColors": {
"192.168.200.241:9100_Total": "dark-red",
"Idle - Waiting for something to happen": "#052B51",
"guest": "#9AC48A",
"idle": "#052B51",
"iowait": "#EAB839",
"irq": "#BF1B00",
"nice": "#C15C17",
"sdb_每秒I/O操作%": "#d683ce",
"softirq": "#E24D42",
"steal": "#FCE2DE",
"system": "#508642",
"user": "#5195CE",
"磁盘花费在I/O操作占比": "#ba43a9"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": null,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 13
},
"hiddenSeries": false,
"id": 191,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "总平均使用率",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
},
{
"alias": "总核数",
"color": "#C4162A"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "count(node_cpu_seconds_total{job=~\"$job\", mode='system'})",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总核数",
"refId": "B",
"step": 240
},
{
"expr": "sum(node_load5{job=~\"$job\"})",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总5分钟负载",
"refId": "A",
"step": 240
},
{
"expr": "avg(1 - avg(irate(node_cpu_seconds_total{job=~\"$job\",mode=\"idle\"}[5m])) by (instance)) * 100",
"format": "time_series",
"hide": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "总平均使用率",
"refId": "F",
"step": 240
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "$job:整体总负载与整体平均CPU使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "short",
"label": "总负载",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"decimals": 0,
"format": "percent",
"label": "平均使用率",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.200.241:9100_总内存": "dark-red",
"内存_Avaliable": "#6ED0E0",
"内存_Cached": "#EF843C",
"内存_Free": "#629E51",
"内存_Total": "#6d1f62",
"内存_Used": "#eab839",
"可用": "#9ac48a",
"总内存": "#bf1b00"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 1,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 13
},
"height": "",
"hiddenSeries": false,
"id": 195,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": false,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "总内存",
"color": "#C4162A",
"fill": 0
},
{
"alias": "总平均使用率",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"})",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总内存",
"refId": "A",
"step": 4
},
{
"expr": "sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"})",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总已用",
"refId": "B",
"step": 4
},
{
"expr": "(sum(node_memory_MemTotal_bytes{job=~\"$job\"} - node_memory_MemAvailable_bytes{job=~\"$job\"}) / sum(node_memory_MemTotal_bytes{job=~\"$job\"}))*100",
"format": "time_series",
"hide": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "总平均使用率",
"refId": "H"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "$job:整体总内存与整体平均内存使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "bytes",
"label": "总内存量",
"logBase": 1,
"max": null,
"min": "",
"show": true
},
{
"decimals": null,
"format": "percent",
"label": "平均使用率",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 1,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 13
},
"hiddenSeries": false,
"id": 197,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"hideEmpty": false,
"hideZero": false,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "总平均使用率",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
},
{
"alias": "总磁盘量",
"color": "#C4162A"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总磁盘量",
"refId": "E"
},
{
"expr": "sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "总使用量",
"refId": "C"
},
{
"expr": "(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))) *100/(sum(avg(node_filesystem_avail_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))+(sum(avg(node_filesystem_size_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance)) - sum(avg(node_filesystem_free_bytes{job=~\"$job\",fstype=~\"xfs|ext.*\"})by(device,instance))))",
"format": "time_series",
"instant": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "总平均使用率",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "$job:整体总磁盘与整体平均磁盘使用率",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": 1,
"format": "bytes",
"label": "总磁盘量",
"logBase": 1,
"max": null,
"min": "",
"show": true
},
{
"decimals": null,
"format": "percent",
"label": "平均使用率",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"collapsed": false,
"datasource": "prometheus",
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 21
},
"id": 189,
"panels": [],
"title": "资源明细:【$show_hostname】",
"type": "row"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorPrefix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": 0,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "s",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 0,
"y": 22
},
"hideTimeOverride": true,
"id": 15,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "null",
"nullText": null,
"options": {},
"pluginVersion": "6.4.2",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "avg(time() - node_boot_time_seconds{instance=~\"$node\"})",
"format": "time_series",
"hide": false,
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 40
}
],
"thresholds": "1,3",
"title": "运行时间",
"type": "singlestat",
"valueFontSize": "70%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"datasource": "prometheus",
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"custom": {},
"decimals": 2,
"displayName": "",
"mappings": [
{
"from": "",
"id": 1,
"operator": "",
"text": "N/A",
"to": "",
"type": 1,
"value": ""
}
],
"max": 100,
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "#EAB839",
"value": 70
},
{
"color": "red",
"value": 90
}
]
},
"unit": "percent"
},
"overrides": []
},
"gridPos": {
"h": 6,
"w": 3,
"x": 2,
"y": 22
},
"id": 177,
"options": {
"displayMode": "lcd",
"fieldOptions": {
"calcs": [
"last"
],
"defaults": {
"decimals": 1,
"mappings": [
{
"from": "",
"id": 1,
"operator": "",
"text": "N/A",
"to": "",
"type": 1,
"value": ""
}
],
"max": 100,
"min": 0.1,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "#EAB839",
"value": 70
},
{
"color": "red",
"value": 90
}
]
},
"unit": "percent"
},
"override": {},
"overrides": [],
"values": false
},
"orientation": "horizontal",
"reduceOptions": {
"calcs": [
"mean"
],
"values": false
},
"showUnfilled": true
},
"pluginVersion": "6.4.3",
"targets": [
{
"expr": "100 - (avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) * 100)",
"instant": true,
"interval": "",
"legendFormat": "总CPU使用率",
"refId": "A"
},
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100",
"hide": true,
"instant": true,
"interval": "",
"legendFormat": "IOwait使用率",
"refId": "C"
},
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100",
"instant": true,
"interval": "",
"legendFormat": "内存使用率",
"refId": "B"
},
{
"expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"})*100 /(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint=\"$maxmount\"}))",
"hide": false,
"instant": true,
"interval": "",
"legendFormat": "最大分区({{mountpoint}})使用率",
"refId": "D"
},
{
"expr": "(1 - ((node_memory_SwapFree_bytes{instance=~\"$node\"} + 1)/ (node_memory_SwapTotal_bytes{instance=~\"$node\"} + 1))) * 100",
"instant": true,
"legendFormat": "交换分区使用率",
"refId": "F"
}
],
"timeFrom": null,
"timeShift": null,
"title": "",
"type": "bargauge"
},
{
"columns": [],
"datasource": "prometheus",
"description": "本看板中的:磁盘总量、使用量、可用量、使用率保持和df命令的Size、Used、Avail、Use% 列的值一致,并且Use%的值会四舍五入保留一位小数,会更加准确。\n\n注:df中Use%算法为:(size - free) * 100 / (avail + (size - free)),结果是整除则为该值,非整除则为该值+1,结果的单位是%。\n参考df命令源码:",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fontSize": "100%",
"gridPos": {
"h": 6,
"w": 10,
"x": 5,
"y": 22
},
"id": 181,
"links": [
{
"targetBlank": true,
"title": "https://github.com/coreutils/coreutils/blob/master/src/df.c",
"url": "https://github.com/coreutils/coreutils/blob/master/src/df.c"
}
],
"options": {},
"pageSize": null,
"scroll": true,
"showHeader": true,
"sort": {
"col": 6,
"desc": false
},
"styles": [
{
"alias": "分区",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"mappingType": 1,
"pattern": "mountpoint",
"thresholds": [
""
],
"type": "string",
"unit": "bytes"
},
{
"alias": "可用空间",
"align": "auto",
"colorMode": "value",
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 1,
"mappingType": 1,
"pattern": "Value #A",
"thresholds": [
"",
""
],
"type": "number",
"unit": "bytes"
},
{
"alias": "使用率",
"align": "auto",
"colorMode": "cell",
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"rgba(245, 54, 54, 0.9)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 1,
"mappingType": 1,
"pattern": "Value #B",
"thresholds": [
"",
""
],
"type": "number",
"unit": "percent"
},
{
"alias": "总空间",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 0,
"link": false,
"mappingType": 1,
"pattern": "Value #C",
"thresholds": [],
"type": "number",
"unit": "bytes"
},
{
"alias": "文件系统",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"mappingType": 1,
"pattern": "fstype",
"thresholds": [],
"type": "string",
"unit": "short"
},
{
"alias": "设备名",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"dateFormat": "YYYY-MM-DD HH:mm:ss",
"decimals": 2,
"link": false,
"mappingType": 1,
"pattern": "device",
"preserveFormat": false,
"sanitize": false,
"thresholds": [],
"type": "string",
"unit": "short"
},
{
"alias": "",
"align": "auto",
"colorMode": null,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"decimals": 2,
"pattern": "/.*/",
"preserveFormat": true,
"sanitize": false,
"thresholds": [],
"type": "hidden",
"unit": "short"
}
],
"targets": [
{
"expr": "node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "Total",
"refId": "C"
},
{
"expr": "node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-0",
"format": "table",
"hide": false,
"instant": true,
"interval": "10s",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A"
},
{
"expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))",
"format": "table",
"hide": false,
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "B"
}
],
"title": "[$show_hostname]: Available space per partition (ext*/xfs)",
"transform": "table",
"type": "table"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(50, 172, 45, 0.97)",
"rgba(237, 129, 40, 0.89)",
"#d44a3a"
],
"datasource": "prometheus",
"decimals": 2,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "percent",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 15,
"y": 22
},
"id": 20,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"nullPointMode": "connected",
"nullText": null,
"options": {},
"pluginVersion": "6.4.2",
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": true,
"lineColor": "#3274D9",
"show": true,
"ymax": null,
"ymin": null
},
"tableColumn": "",
"targets": [
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) * 100",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "20,50",
"timeFrom": null,
"timeShift": null,
"title": "CPU iowait",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "avg"
},
{
"aliasColors": {
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in": "light-red",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_in_download": "green",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_cni0_out_upload": "yellow",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_in_download": "purple",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out": "purple",
"cn-shenzhen.i-wz9cq1dcb6zwc39ehw59_eth0_out_upload": "blue"
},
"bars": true,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"editable": true,
"error": false,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"grid": {},
"gridPos": {
"h": 6,
"w": 7,
"x": 17,
"y": 22
},
"hiddenSeries": false,
"id": 183,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": false,
"show": false,
"sort": "current",
"sortDesc": true,
"total": true,
"values": true
},
"lines": false,
"linewidth": 2,
"links": [],
"nullPointMode": "null as zero",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 1,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/.*_out_upload$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "increase(node_network_receive_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])",
"interval": "60m",
"intervalFactor": 1,
"legendFormat": "{{device}}_in_download",
"metric": "",
"refId": "A",
"step": 600,
"target": ""
},
{
"expr": "increase(node_network_transmit_bytes_total{instance=~\"$node\",device=~\"$device\"}[60m])",
"hide": false,
"interval": "60m",
"intervalFactor": 1,
"legendFormat": "{{device}}_out_upload",
"refId": "B",
"step": 600
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Hourly traffic $device",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 0,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": "Upload (-) / Download (+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "short",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 0,
"y": 24
},
"id": 14,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "count(node_cpu_seconds_total{instance=~\"$node\", mode='system'})",
"format": "time_series",
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "1,2",
"title": "CPU cores",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": null,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "short",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 15,
"y": 24
},
"id": 179,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "avg(node_filesystem_files_free{instance=~\"$node\",mountpoint=\"$maxmount\",fstype=~\"ext.?|xfs\"})",
"format": "time_series",
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "100000,1000000",
"title": "Free inodes: $maxmount",
"type": "singlestat",
"valueFontSize": "70%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": 0,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "bytes",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 0,
"y": 26
},
"id": 75,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "70%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "sum(node_memory_MemTotal_bytes{instance=~\"$node\"})",
"format": "time_series",
"instant": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{instance}}",
"refId": "A",
"step": 20
}
],
"thresholds": "2,3",
"title": "Total memory",
"type": "singlestat",
"valueFontSize": "80%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"cacheTimeout": null,
"colorBackground": false,
"colorPostfix": false,
"colorValue": true,
"colors": [
"rgba(245, 54, 54, 0.9)",
"rgba(237, 129, 40, 0.89)",
"rgba(50, 172, 45, 0.97)"
],
"datasource": "prometheus",
"decimals": null,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"format": "locale",
"gauge": {
"maxValue": 100,
"minValue": 0,
"show": false,
"thresholdLabels": false,
"thresholdMarkers": true
},
"gridPos": {
"h": 2,
"w": 2,
"x": 15,
"y": 26
},
"id": 178,
"interval": null,
"links": [],
"mappingType": 1,
"mappingTypes": [
{
"name": "value to text",
"value": 1
},
{
"name": "range to text",
"value": 2
}
],
"maxDataPoints": 100,
"maxPerRow": 6,
"nullPointMode": "null",
"nullText": null,
"options": {},
"postfix": "",
"postfixFontSize": "50%",
"prefix": "",
"prefixFontSize": "50%",
"rangeMaps": [
{
"from": "null",
"text": "N/A",
"to": "null"
}
],
"sparkline": {
"fillColor": "rgba(31, 118, 189, 0.18)",
"full": false,
"lineColor": "rgb(31, 120, 193)",
"show": false
},
"tableColumn": "",
"targets": [
{
"expr": "avg(node_filefd_maximum{instance=~\"$node\"})",
"format": "time_series",
"instant": true,
"intervalFactor": 1,
"legendFormat": "",
"refId": "A",
"step": 20
}
],
"thresholds": "1024,10000",
"title": "Max file descriptors",
"type": "singlestat",
"valueFontSize": "70%",
"valueMaps": [
{
"op": "=",
"text": "N/A",
"value": "null"
}
],
"valueName": "current"
},
{
"aliasColors": {
"192.168.200.241:9100_Total": "dark-red",
"Idle - Waiting for something to happen": "#052B51",
"guest": "#9AC48A",
"idle": "#052B51",
"iowait": "#EAB839",
"irq": "#BF1B00",
"nice": "#C15C17",
"sdb_IO time %": "#d683ce",
"softirq": "#E24D42",
"steal": "#FCE2DE",
"system": "#508642",
"user": "#5195CE",
"Disk time spent on I/O": "#ba43a9"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 28
},
"hiddenSeries": false,
"id": 7,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/.*Total usage/",
"color": "#C4162A",
"fill": 0
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"system\"}[5m])) by (instance) *100",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "System usage",
"refId": "A",
"step": 20
},
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"user\"}[5m])) by (instance) *100",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "User usage",
"refId": "B",
"step": 240
},
{
"expr": "avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"iowait\"}[5m])) by (instance) *100",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "I/O wait",
"refId": "D",
"step": 240
},
{
"expr": "(1 - avg(irate(node_cpu_seconds_total{instance=~\"$node\",mode=\"idle\"}[5m])) by (instance))*100",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "Total usage",
"refId": "F",
"step": 240
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "CPU usage",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": 0,
"format": "percent",
"label": "",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.200.241:9100_Total": "dark-red",
"Usage": "yellow",
"Mem_Available": "#6ED0E0",
"Mem_Cached": "#EF843C",
"Mem_Free": "#629E51",
"Mem_Total": "#6d1f62",
"Mem_Used": "#eab839",
"Available": "#9ac48a",
"Total": "#bf1b00"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 28
},
"height": "",
"hiddenSeries": false,
"id": 156,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "Total",
"color": "#C4162A",
"fill": 0
},
{
"alias": "Usage",
"color": "rgb(0, 209, 255)",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_memory_MemTotal_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "Total",
"refId": "A",
"step": 4
},
{
"expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - node_memory_MemAvailable_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "Used",
"refId": "B",
"step": 4
},
{
"expr": "node_memory_MemAvailable_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "Available",
"refId": "F",
"step": 4
},
{
"expr": "node_memory_Buffers_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"legendFormat": "Mem_Buffers",
"refId": "D",
"step": 4
},
{
"expr": "node_memory_MemFree_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"legendFormat": "Mem_Free",
"refId": "C",
"step": 4
},
{
"expr": "node_memory_Cached_bytes{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"legendFormat": "Mem_Cached",
"refId": "E",
"step": 4
},
{
"expr": "node_memory_MemTotal_bytes{instance=~\"$node\"} - (node_memory_Cached_bytes{instance=~\"$node\"} + node_memory_Buffers_bytes{instance=~\"$node\"} + node_memory_MemFree_bytes{instance=~\"$node\"})",
"format": "time_series",
"hide": true,
"intervalFactor": 1,
"refId": "G"
},
{
"expr": "(1 - (node_memory_MemAvailable_bytes{instance=~\"$node\"} / (node_memory_MemTotal_bytes{instance=~\"$node\"})))* 100",
"format": "time_series",
"hide": false,
"interval": "30m",
"intervalFactor": 10,
"legendFormat": "Usage",
"refId": "H"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Memory",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bytes",
"label": null,
"logBase": 1,
"max": null,
"min": "",
"show": true
},
{
"format": "percent",
"label": "Memory usage",
"logBase": 1,
"max": "",
"min": "",
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.10.227:9100_em1_in_download": "super-light-green",
"192.168.10.227:9100_em1_out_upload": "dark-blue"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 28
},
"height": "",
"hiddenSeries": false,
"id": 157,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*_out_upload$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_network_receive_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_in_download",
"refId": "A",
"step": 4
},
{
"expr": "irate(node_network_transmit_bytes_total{instance=~'$node',device=~\"$device\"}[5m])*8",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_out_upload",
"refId": "B",
"step": 4
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Network bandwidth per second $device",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "bps",
"label": "Upload (-) / Download (+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"15 min": "#6ED0E0",
"1 min": "#BF1B00",
"5 min": "#CCA300"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"editable": true,
"error": false,
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 1,
"grid": {},
"gridPos": {
"h": 8,
"w": 8,
"x": 0,
"y": 36
},
"height": "",
"hiddenSeries": false,
"id": 13,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null as zero",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"repeat": null,
"seriesOverrides": [
{
"alias": "/.*CPU cores/",
"color": "#C4162A"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_load1{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "1 min load",
"metric": "",
"refId": "A",
"step": 20,
"target": ""
},
{
"expr": "node_load5{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "5 min load",
"refId": "B",
"step": 20
},
{
"expr": "node_load15{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "15 min load",
"refId": "C",
"step": 20
},
{
"expr": "sum(count(node_cpu_seconds_total{instance=~\"$node\", mode='system'}) by (cpu,instance)) by(instance)",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "Total CPU cores",
"refId": "D",
"step": 20
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "System load average",
"tooltip": {
"msResolution": false,
"shared": true,
"sort": 2,
"value_type": "cumulative"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"vda_write": "#6ED0E0"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Read bytes: bytes read per second on each disk device\nWritten bytes: bytes written per second on each disk device",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 1,
"gridPos": {
"h": 8,
"w": 8,
"x": 8,
"y": 36
},
"height": "",
"hiddenSeries": false,
"id": 168,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*_read$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_read_bytes_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_read",
"refId": "A",
"step": 10
},
{
"expr": "irate(node_disk_written_bytes_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_write",
"refId": "B",
"step": 10
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Disk read/write throughput per second",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "Bps",
"label": "Read (-) / Write (+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 1,
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 36
},
"hiddenSeries": false,
"id": 174,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/Inodes.*/",
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}) *100/(node_filesystem_avail_bytes {instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}+(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}-node_filesystem_free_bytes{instance=~'$node',fstype=~\"ext.*|xfs\",mountpoint !~\".*pod.*\"}))",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{mountpoint}}",
"refId": "A"
},
{
"expr": "node_filesystem_files_free{instance=~'$node',fstype=~\"ext.?|xfs\"} / node_filesystem_files{instance=~'$node',fstype=~\"ext.?|xfs\"}",
"hide": true,
"interval": "",
"legendFormat": "Inodes:{{instance}}:{{mountpoint}}",
"refId": "B"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Disk usage",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "percent",
"label": "",
"logBase": 1,
"max": "",
"min": "",
"show": true
},
{
"decimals": 2,
"format": "percentunit",
"label": null,
"logBase": 1,
"max": "",
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"vda_write": "#6ED0E0"
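},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Reads completed: reads completed per second on each disk device\n\nWrites completed: writes completed per second on each disk device\n\nIO now: number of I/O requests currently in progress on each disk device",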
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 9,
"w": 8,
"x": 0,
"y": 44
},
"height": "",
"hiddenSeries": false,
"id": 161,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*_read$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_read",
"refId": "A",
"step": 10
},
{
"expr": "irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_write",
"refId": "B",
"step": 10
},
{
"expr": "node_disk_io_now{instance=~\"$node\"}",
"format": "time_series",
"hide": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Disk read/write rate (IOPS)",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "iops",
"label": "Read (-) / Write (+) I/O ops/sec",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"Idle - Waiting for something to happen": "#052B51",
"guest": "#9AC48A",
"idle": "#052B51",
"iowait": "#EAB839",
"irq": "#BF1B00",
"nice": "#C15C17",
"sdb_IO time %": "#d683ce",
"softirq": "#E24D42",
"steal": "#FCE2DE",
"system": "#508642",
"user": "#5195CE",
"Disk time spent on I/O": "#ba43a9"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": null,
"description": "Time spent on I/O per second of wall-clock time.\n\nnode_disk_io_time_seconds_total:\nCumulative number of seconds the disk has spent doing I/O; this is a counter. (Kernel diskstats field: milliseconds spent doing I/Os)\n\nirate(node_disk_io_time_seconds_total[5m]):\nPer-second rate computed from the last two samples in the window: (last value - previous value) / timestamp difference, i.e. the fraction of each second the disk spent doing I/O.",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 9,
"w": 8,
"x": 8,
"y": 44
},
"hiddenSeries": false,
"id": 175,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": false,
"rightSide": false,
"show": true,
"sideWidth": null,
"sort": null,
"sortDesc": null,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"maxPerRow": 6,
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_IO time %",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Fraction of each second spent on I/O",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"decimals": null,
"format": "percentunit",
"label": "",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"vda": "#6ED0E0"
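},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Read time seconds: seconds spent on read operations on each disk device\n\nWrite time seconds: seconds spent on write operations on each disk device\n\nIO time seconds: seconds spent on I/O operations on each disk device\n\nIO time weighted seconds: weighted seconds spent on I/O operations on each disk device",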
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 1,
"fillGradient": 1,
"gridPos": {
"h": 9,
"w": 8,
"x": 16,
"y": 44
},
"height": "",
"hiddenSeries": false,
"id": 160,
"legend": {
"alignAsTable": true,
"avg": true,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": true,
"show": true,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null as zero",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*_read$/",
"transform": "negative-Y"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "irate(node_disk_read_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_reads_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_read",
"refId": "B"
},
{
"expr": "irate(node_disk_write_time_seconds_total{instance=~\"$node\"}[5m]) / irate(node_disk_writes_completed_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_write",
"refId": "C"
},
{
"expr": "irate(node_disk_io_time_seconds_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}",
"refId": "A",
"step": 10
},
{
"expr": "irate(node_disk_io_time_weighted_seconds_total{instance=~\"$node\"}[5m])",
"format": "time_series",
"hide": true,
"interval": "",
"intervalFactor": 1,
"legendFormat": "{{device}}_weighted",
"refId": "D"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Time per I/O read/write (reference: under 100 ms) (beta)",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "s",
"label": "read (-) / write (+)",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": false
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"192.168.200.241:9100_TCP_alloc": "semi-dark-blue",
"TCP": "#6ED0E0",
"TCP_alloc": "blue"
},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"decimals": 2,
"description": "Sockets_used - total sockets in use across all protocols\n\nCurrEstab - TCP connections currently in the ESTABLISHED or CLOSE-WAIT state\n\nTCP_alloc - allocated TCP sockets (established, with an sk_buff already obtained)\n\nTCP_tw - TCP connections waiting to close (TIME_WAIT)\n\nUDP_inuse - UDP sockets in use\n\nRetransSegs - TCP segments retransmitted\n\nOutSegs - TCP segments sent\n\nInSegs - TCP segments received",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 16,
"x": 0,
"y": 53
},
"height": "",
"hiddenSeries": false,
"id": 158,
"interval": "",
"legend": {
"alignAsTable": true,
"avg": false,
"current": true,
"hideEmpty": true,
"hideZero": true,
"max": true,
"min": false,
"rightSide": true,
"show": true,
"sideWidth": null,
"sort": "current",
"sortDesc": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 1,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pointradius": 5,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/.*Sockets_used/",
"color": "#E02F44",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_netstat_Tcp_CurrEstab{instance=~'$node'}",
"format": "time_series",
"hide": false,
"instant": false,
"interval": "",
"intervalFactor": 1,
"legendFormat": "CurrEstab",
"refId": "A",
"step": 20
},
{
"expr": "node_sockstat_TCP_tw{instance=~'$node'}",
"format": "time_series",
"interval": "",
"intervalFactor": 1,
"legendFormat": "TCP_tw",
"refId": "D"
},
{
"expr": "node_sockstat_sockets_used{instance=~'$node'}",
"hide": false,
"interval": "30m",
"intervalFactor": 1,
"legendFormat": "Sockets_used",
"refId": "B"
},
{
"expr": "node_sockstat_UDP_inuse{instance=~'$node'}",
"interval": "",
"legendFormat": "UDP_inuse",
"refId": "C"
},
{
"expr": "node_sockstat_TCP_alloc{instance=~'$node'}",
"interval": "",
"legendFormat": "TCP_alloc",
"refId": "E"
},
{
"expr": "irate(node_netstat_Tcp_PassiveOpens{instance=~'$node'}[5m])",
"hide": true,
"interval": "",
"legendFormat": "{{instance}}_Tcp_PassiveOpens",
"refId": "G"
},
{
"expr": "irate(node_netstat_Tcp_ActiveOpens{instance=~'$node'}[5m])",
"hide": true,
"interval": "",
"legendFormat": "{{instance}}_Tcp_ActiveOpens",
"refId": "F"
},
{
"expr": "irate(node_netstat_Tcp_InSegs{instance=~'$node'}[5m])",
"interval": "",
"legendFormat": "Tcp_InSegs",
"refId": "H"
},
{
"expr": "irate(node_netstat_Tcp_OutSegs{instance=~'$node'}[5m])",
"interval": "",
"legendFormat": "Tcp_OutSegs",
"refId": "I"
},
{
"expr": "irate(node_netstat_Tcp_RetransSegs{instance=~'$node'}[5m])",
"hide": false,
"interval": "",
"legendFormat": "Tcp_RetransSegs",
"refId": "J"
},
{
"expr": "irate(node_netstat_TcpExt_ListenDrops{instance=~'$node'}[5m])",
"hide": true,
"interval": "",
"legendFormat": "",
"refId": "K"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Network socket connections",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"transformations": [],
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": "Total sockets in use (all protocols)",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
},
{
"aliasColors": {
"filefd_192.168.200.241:9100": "super-light-green",
"switches_192.168.200.241:9100": "semi-dark-red",
"使用的文件描述符_10.118.72.128:9100": "red",
"每秒上下文切换次数_10.118.71.245:9100": "yellow",
"每秒上下文切换次数_10.118.72.128:9100": "yellow"
},
"bars": false,
"cacheTimeout": null,
"dashLength": 10,
"dashes": false,
"datasource": "prometheus",
"description": "",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"fill": 0,
"fillGradient": 1,
"gridPos": {
"h": 8,
"w": 8,
"x": 16,
"y": 53
},
"hiddenSeries": false,
"hideTimeOverride": false,
"id": 16,
"legend": {
"alignAsTable": false,
"avg": false,
"current": true,
"max": false,
"min": false,
"rightSide": false,
"show": true,
"total": false,
"values": true
},
"lines": true,
"linewidth": 2,
"links": [],
"nullPointMode": "null",
"options": {
"dataLinks": []
},
"percentage": false,
"pluginVersion": "6.4.2",
"pointradius": 1,
"points": false,
"renderer": "flot",
"seriesOverrides": [
{
"alias": "/每秒上下文切换次数.*/",
"color": "#FADE2A",
"lines": false,
"pointradius": 1,
"points": true,
"yaxis": 2
},
{
"alias": "/使用的文件描述符.*/",
"color": "#F2495C"
}
],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "node_filefd_allocated{instance=~\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"intervalFactor": 5,
"legendFormat": "使用的文件描述符",
"refId": "B"
},
{
"expr": "irate(node_context_switches_total{instance=~\"$node\"}[5m])",
"interval": "",
"intervalFactor": 5,
"legendFormat": "每秒上下文切换次数",
"refId": "A"
},
{
"expr": " (node_filefd_allocated{instance=~\"$node\"}/node_filefd_maximum{instance=~\"$node\"}) *100",
"format": "time_series",
"hide": true,
"instant": false,
"interval": "",
"intervalFactor": 5,
"legendFormat": "使用的文件描述符占比_{{instance}}",
"refId": "C"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Open file descriptors (left) / context switches per second (right)",
"tooltip": {
"shared": true,
"sort": 2,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "short",
"label": "File descriptors in use",
"logBase": 1,
"max": null,
"min": null,
"show": true
},
{
"format": "short",
"label": "context_switches",
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
],
"refresh": "",
"schemaVersion": 20,
"style": "dark",
"tags": [
"Prometheus",
"node_exporter"
],
"templating": {
"list": [
{
"allValue": null,
"current": {
"tags": [],
"text": "node-exporter",
"value": "node-exporter"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info, job)",
"hide": 0,
"includeAll": false,
"index": -1,
"label": "JOB",
"multi": false,
"name": "job",
"options": [],
"query": "label_values(node_uname_info, job)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"text": "All",
"value": "$__all"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info{job=~\"$job\"}, nodename)",
"hide": 0,
"includeAll": true,
"index": -1,
"label": "Hostname",
"multi": false,
"name": "hostname",
"options": [],
"query": "label_values(node_uname_info{job=~\"$job\"}, nodename)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allFormat": "glob",
"allValue": null,
"current": {
"text": "ymt108",
"value": "ymt108"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)",
"hide": 0,
"includeAll": false,
"index": -1,
"label": "Instance",
"multi": true,
"multiFormat": "regex values",
"name": "node",
"options": [],
"query": "label_values(node_uname_info{job=~\"$job\",nodename=~\"$hostname\"},instance)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allFormat": "glob",
"allValue": null,
"current": {
"text": "All",
"value": "$__all"
},
"datasource": "prometheus",
"definition": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)",
"hide": 0,
"includeAll": true,
"index": -1,
"label": "Network interface",
"multi": true,
"multiFormat": "regex values",
"name": "device",
"options": [],
"query": "label_values(node_network_info{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*|cni.*'},device)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 1,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"text": "/",
"value": "/"
},
"datasource": "prometheus",
"definition": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))",
"hide": 2,
"includeAll": false,
"index": -1,
"label": "Largest mount point",
"multi": false,
"name": "maxmount",
"options": [],
"query": "query_result(topk(1,sort_desc (max(node_filesystem_size_bytes{instance=~'$node',fstype=~\"ext.?|xfs\",mountpoint!~\".*pods.*\"}) by (mountpoint))))",
"refresh": 2,
"regex": "/.*\\\"(.*)\\\".*/",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
},
{
"allValue": null,
"current": {
"text": "ymt108",
"value": "ymt108"
},
"datasource": "prometheus",
"definition": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)",
"hide": 2,
"includeAll": false,
"index": -1,
"label": "Hostname for display",
"multi": false,
"name": "show_hostname",
"options": [],
"query": "label_values(node_uname_info{job=~\"$job\",instance=~\"$node\"}, nodename)",
"refresh": 1,
"regex": "",
"skipUrlSync": false,
"sort": 5,
"tagValuesQuery": "",
"tags": [],
"tagsQuery": "",
"type": "query",
"useTags": false
}
]
},
"time": {
"from": "now-12h",
"to": "now"
},
"timepicker": {
"hidden": false,
"now": true,
"refresh_intervals": [
"15s",
"30s",
"1m",
"5m",
"15m",
"30m"
],
"time_options": [
"5m",
"15m",
"1h",
"6h",
"12h",
"24h",
"2d",
"7d",
"30d"
]
},
"timezone": "browser",
"title": "育苗通 Node Resource Monitoring",
"uid": "hb7fSE0Zz",
"version": 11
}

node-model.json
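
The per-I/O latency panel in this dashboard divides `irate(node_disk_read_time_seconds_total[5m])` by `irate(node_disk_reads_completed_total[5m])`. When both counters come from the same scrapes, the sampling interval cancels out, so the value is simply the seconds spent divided by the operations completed between the last two samples. Below is a minimal sketch of that arithmetic. The helper name `avg_io_seconds` and the sample values are made up for illustration and are not part of the dashboard.

```shell
#!/bin/sh
# avg_io_seconds TIME_PREV TIME_NOW OPS_PREV OPS_NOW
# Mirrors irate(time_total) / irate(ops_total) for two consecutive samples:
# (increase in seconds spent) / (increase in completed operations).
avg_io_seconds() {
  awk -v t0="$1" -v t1="$2" -v n0="$3" -v n1="$4" 'BEGIN {
    ops = n1 - n0
    # no completed operations between the two samples: nothing to average
    if (ops <= 0) { print "NaN"; exit }
    printf "%.6f\n", (t1 - t0) / ops
  }'
}

# Hypothetical samples: 0.30 s of read time spread over 150 completed reads.
avg_io_seconds 120.50 120.80 40000 40150   # prints 0.002000
```

With these numbers the average read takes 2 ms, well under the 100 ms reference mentioned in the panel title.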

Of course, many Kubernetes resource monitoring dashboards are also built in by default.


8. Summary

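Once all the configuration is done, it is worth tidying things up and writing an upgrade script, upgrade.sh, so that later deployments and configuration updates are easy to repeat.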

#!/bin/sh

# upgrade alertmanager configuration
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring

# upgrade scrape configs
kubectl create secret generic additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring

# upgrade prometheus rules
kubectl apply -f prometheus-additional-rules.yaml
kubectl apply -f prometheus-rules.yaml

# upgrade prometheus configuration
kubectl apply -f prometheus-prometheus.yaml
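
One caveat with the script above: `kubectl create secret` returns an AlreadyExists error when the Secret is already present, so on a second run the two Secrets are left unchanged. Below is a minimal sketch of a re-runnable variant. It assumes kubectl 1.18 or later for `--dry-run=client`; older clients such as 1.17 take a bare `--dry-run` instead. The helper names `apply_secret` and `upgrade` and the `run` argument are hypothetical. The Secret names, file names, and namespace are the ones used above.

```shell
#!/bin/sh
# Sketch of a re-runnable upgrade.sh. apply_secret and upgrade are hypothetical
# helper names; Secret names, file names and namespace come from the script above.

# apply_secret NAME FILE
# Renders the Secret locally, then creates or updates it through kubectl apply.
# Assumes kubectl 1.18+ for --dry-run=client (older clients take a bare --dry-run).
apply_secret() {
  kubectl create secret generic "$1" --from-file="$2" -n monitoring \
    --dry-run=client -o yaml | kubectl apply -n monitoring -f -
}

upgrade() {
  # upgrade alertmanager configuration
  apply_secret alertmanager-main alertmanager.yaml
  # upgrade scrape configs
  apply_secret additional-scrape-configs prometheus-additional.yaml
  # upgrade prometheus rules
  kubectl apply -f prometheus-additional-rules.yaml
  kubectl apply -f prometheus-rules.yaml
  # upgrade prometheus configuration
  kubectl apply -f prometheus-prometheus.yaml
}

# Only touch the cluster when asked to: sh upgrade.sh run
if [ "${1:-}" = "run" ]; then
  upgrade
fi
```

Piping the locally rendered manifest into `kubectl apply` lets the same command both create the Secret and update it later, so the script can be rerun after every configuration change.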

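Author: Leozhanggg

Source: https://www.cnblogs.com/leozhanggg/p/13502983.html

Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a clearly visible link to the original article must be placed on the page; otherwise the author reserves the right to pursue legal liability.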
