node pressure and pod eviction
0. overview
There are too many guides about node pressure and pod eviction, most of them are specific, and no system. so here is to combine the knowledge together and see it system and specific. Before introduce the knowledge lets us thinking those questions(actually too many questions..). What happened when the node shutdown, what happened if the API-server of Kubernetes can not connect to the Kubelet, what happened if the pod use too much resource even made the node unhealthy.
Yes, it just a piece of node pressure and pod eviction, the node pressure and pod eviction are more likely to belong to the abnormal cases, and yes those cases are very very important for business.
And here has a manifest about the guide cause is the abnormal cases and I just have the env with company product, it's serious and I don't want touch and make the env bad for our abnormal cases. So there is no too much demo with trials, for the demo I will attach the links if needed, and yes minikube is a good VM Kubernetes platform for newcomer and I also do something in minikube but I have checked with some materials online there already has a similar demo so I will pick the online materials directly.
1. container resource
As we know the app is running within a container and the resource is covered at the container level, there are limited by Cgroups. For container resource, it can be defined with CPU, memory, disk. For CPU is the compressible resources, but for memory, disk is the incompressible resources which means if one container uses too much memory, others(container/process/thread) will no too much memory can be used. so the resource limit is important, in Kubernetes, there are two fields that can define the resource: request and limit.
- requests: is the minimal resource container requests, but the container can use less than the request value.
- limits: is the maximal resource container requests, once out of the limits OOM killer will kill the container.
ok, lets us focus on the requests field, design an appropriate value that is good enough for requests cause Kubernetes will schedule the pod(container) to nodes bases on the requests, not the actual resources. For example, there is 500M memory of node can be allocated,one pod(just have one container) requests 600M but actually, the container just use 100M(the request value is more than actual value).. the container still can not scheduler to the node cause the requested memory is over than the allocated memory of the node.
here we are discussing the container resource, for pod resources, yes you probably know is the sum of all containers. now lets we thinking if the node resource is not enough and the pod needs to be evicted, which pod should be the first one? we must be thinking the low priority pods should be evicted, correct, Kubernetes has a QoS field that defined the priority of pod, how to define the priority is pretty simple by requests and limits.
there are too many guides that introduce how to define the priority: BestEffort, Burstable, and Guaranteed. I just want mentioned requests/limits are for container level, for pod QoS depends on all container requests/limits, which means all QoS of the container is the same, the pod QoS is the same as the container's Qos, but once one container QoS is different with others, the QoS of the pod is Burstable. let see the table directly:
| container1 QoS | container2 QoS | pod QoS |
|---|---|---|
| BestEffort | BestEffort | BestEffort |
| BestEffort | Burstable | Burstable |
| BestEffort | Guaranteed | Burstable |
| BestEffort | Burstable | Burstable |
| Burstable | Burstable | Burstable |
| Burstable | Burstable | Burstable |
| Burstable | Guaranteed | Burstable |
| Guaranteed | Guaranteed | Guaranteed |
For the description of QoS please refer to here.
2. node pressure eviction
In chapter1 we know that if the pod wants to use the incompressible resource more than the limits of resource, the OOM killer will kill the process, more detailed info can refer here.
Now if the node has the pressure of incompressible resources, but pods haven't use the resource over the limits, how do the Kubernetes do? it will introduce the Kubelet module of Kubernetes, Kubelet will monitor the resource of pods and report the resource to API-server, once meet the condition of pod eviction, the Kubelet will do the eviction action.
let's see the configure for Kubelet, just pick the necessary field:
[root@chunqiu ~ (Master)]# systemctl cat kubelet.service
# /etc/systemd/system/kubelet.service
...
ExecStart=/usr/local/bin/kubelet \
--pod-infra-container-image=bcmt-registry:5000/gcr.io/google-containers/pause-amd64:3.1 \
--kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
--config=/etc/kubernetes/kubelet-config.yml \
...
Restart=always
RestartSec=5
Slice=system.slice
OOMScoreAdjust=-999
[root@chunqiu ~ (Master)]# cat /etc/kubernetes/kubelet-config.yml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
...
nodeStatusUpdateFrequency: 8s
evictionPressureTransitionPeriod: "60s"
evictionMaxPodGracePeriod: 120
evictionHard:
memory.available: 9588Mi
nodefs.available: 10%
nodefs.inodesFree: 5%
imagefs.available: 15%
evictionSoft:
memory.available: 15341Mi
evictionMminimumReclaim:
memory.available: 15341Mi
evictionSoftGracePeriod:
memory.available: "30s"
systemReserved:
memory: 33Gi
cpu: "44"
Before introducing these parameters, let's see the eviction is split into two kinds of styles: soft eviction and hard eviction. For soft eviction, there are graceful eviction time(evictionSoftGracePeriod) and also graceful eviction time for pod(evictionMaxPodGracePeriod). For hard eviction, it more like an urgent status for node, so the pod will be evicted immediately to relieve symptoms asap.
ok, let's see the parameters now, and yes the detailed introduce can be found from Kubernetes Docs here.
common parameters
- nodeStatusUpdateFrequency: as the name, Kubelet will report the node status to API-server with this frequency, for the frequency is also too much is too slow to check the status of the node, too lower will make the node update frequently. it has been discussed with the article here.
- evictionPressureTransitionPeriod: when a pod has been evicted, probably the pod will be scheduled to the same node(with a label, affinity..) at that time the node pressure will become high again, then eviction will be triggered again, so the pod will be eviction and do the same procedure like a circle(and also here has an exponential delay for pod delete, it means pod will be deleted after 10s, 20s,30s.. till default value 5minutes, after that pod will be deleted by 5min, 10min.. this mechanism is for avoiding the frequent deletion of a pod), so Kubernetes has defined a period with parameter evictionPressureTransitionPeriod is to keep no pod scheduled to the node which just eviction pods.
- evictionMminimumReclaim: see how much resource Kubernetes will be clean up, if too low the purpose of eviction is not obvious, which means the pod will scheduler to the node again and the resource pressure will high after a while. so Kubernetes will clean up to enough resources which have been defined with parameter evictionMminimumReclaim so that no more resource pressure frequently.
soft eviction
- memory.available: is the target of memory available, once over the target, the pod eviction will be triggered.
- evictionSoftGracePeriod: after fitting the target, wait the time of evictionSoftGracePeriod to see whether the target is always satisfied, once satisfied the "alarm" will be canceled.
- evictionMaxPodGracePeriod: once fitting the target and satisfied the time of evictionSoftGracePeriod, the pod will be eviction by Kubelet, and the pod needs to graceful shutdown during the time evictionMaxPodGracePeriod, is important for business. The time of evictionMaxPodGracePeriod has two places one is the time that set in the pod manifest, Kubelet will choose the minimal value, such as here the evictionMaxPodGracePeriod is 120(s), and the pod manifest has defined the time with terminationGracePeriodSeconds: 30(s), so the evictionMaxPodGracePeriod time for Kubelet is 30(s).
hard eviction
- there are no too many parameters defined for hard eviction cause hard eviction is an urgent case once met the hard condition, the pod will be evicted immediately.
emm.., another question raised up: which pod should be eviction? it more related to the QoS that we discussed in the chapter1, and the incompressible resource itself, for how to pick the evicted pod here has been introduced in detail, we just skip it, and then we know the framework now, but how do we know the real-time resource status of pod or specific containers?
It will relate with the monitor mechanism of Kubernetes, the Kubelet module of Kubernetes has integrated the cAdvisor module build-in which is to monitor the pods status. and there is another module called metrics-server that will collect all cAdvisor data together then provide to outside with Kubernetes service clusterIP:
[root@chunqiu ~ (Master)]# kubectl get pods -n kube-system | grep met
dashboard-metrics-scraper-6554464489-kptsr 1/1 Running 1 102d
metrics-server-78f8b874b5-4hdzj 2/2 Running 4 102d
[root@chunqiu ~ (Master)]# kubectl get service -n kube-system | grep met
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
metrics-server ClusterIP **.***.246.177 <none> 443/TCP 102d
for more detailed Infos about the metrics server please refer here.
we can get the metrics server data by curl, but a more common way is to use the command: kubectl top, like:
[root@chunqiu ~ (Master)]# kubectl top pod --containers -n ci
POD NAME CPU(cores) MEMORY(bytes)
chunqiu-pod-0 container1 7m 64Mi
chunqiu-pod-0 container2 9m 83Mi
chunqiu-pod-0 container3 4m 20Mi
...
The result is different from the Kubectl describe command, this is the real-time resource status of pods.
ok, someone might see we can check the real-time result, but how do I know the resource define of containers? yes, this is related to the Linux cgroup, we can check with the container inside:
[root@chunqiu ~ (Master)]# kubectl exec -it chunqiu-pod-0 /bin/bash -n ci
bash-5.0$ ls /sys/fs/cgroup/memory/
memory.kmem.max_usage_in_bytes memory.max_usage_in_bytes memory.pressure_level
...
the container resource has been defined in cgroup, and also can check the defined with pod level, more detailed info can refer to here.
3. node eviction
Chapter2 we have discussed the case of node pressure, but how about this case that node shutdown and can not recovery long time, or network issue cause work node can not connect to the master node. For this case is also covered in Kubernetes, the controller manager has to monitor the status of the node, once the node "failed"(can cause by many factors), the Kubelet can not connect to API-server, which is an offline status, and then the controller manager will do something "delete" the pod, add Taints and scheduler the pod to other nodes. Describe the procedure is so complex, let's see a procedure figure first:

The figure is clear, we will introduce the procedure based on the figure, first, let's see the parameters that can be found by the controller-manager defined in the picture.
[root@chunqiu ~ (Master)]# kubectl get pods -n kube-system | grep controller-manager
kube-controller-manager-chunqiu-0 1/1 Running 1 102d
[root@chunqiu ~ (Master)]# kubectl describe pods kube-controller-manager-chunqiu-0 -n kube-system
Name: kube-controller-manager-chunqiu-0
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: chunqiu-0/172.17.66.2
Controlled By: Node/chunqiu-0
Command:
/usr/local/bin/kube-controller-manager
--node-monitor-period=5s
--node-monitor-grace-period=35s
--pod-eviction-timeout=120s
--terminated-pod-gc-threshold=100
State: Running
Last State: Terminated
Reason: Error
Exit Code: 255
Ready: True
Restart Count: 1
Limits:
cpu: 300m
memory: 1000Mi
Requests:
cpu: 100m
memory: 30Mi
Liveness: http-get https://127.0.0.1:10252/healthz delay=15s timeout=1s period=10s #success=1 #failure=3
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
...
There are some necessary parameters that need to introduce, the detained info can check here:
- node-monitor-period: the controller manager will query the API-server by the period of node-monitor-period.
- node-monitor-grace-period: if the status of the node is not ready or unknown with the period of node-monitor-grace-period, the node will be set as unhealthy(unknown), the default time of node-monitor-grace-period is the 40s.
- pod-eviction-timeout: after fitting the node-monitor-grace-period, the controller manager will wait the time of pod-eviction-timeout, after meeting the time, the controller manager will update the pod status as Terminating. The default time of pod-eviction timeout is 5 minutes.
Combine the procedure and parameters. Now let us summary the procedure here:
The Kubelet will report the node status to API-server with the period of nodeStatusUpdateFrequency, and then the controller manager will query the node status by API-server, here if the time is over the node-monitor-period, the node will be set as not ready(the meaning is like one give a signal to other, but other handles the signal so slowly, and also the revert case one query signal from other, but other haven't sent the signal), the more discussed the info can be checked here. ok, then if the status of the node is not ready(unhealthy)(actually, I haven't find the info about how to define the node status, like when is not ready, when is unhealthy, when is unknown, a related article is in here, it might occur together, but still need check, anyone knows the info please tell me, thanks so much). and then if the not ready status keeps until the period of node-monitor-grace-period, the node will be set as unknown, and the controller manager will wait for the time defined in pod-eviction-timeout, after that the pod will be set as Terminating but it still existing in Kubernetes cause Kubernetes doesn't know what happened with the unknown node, and also doesn't know the pod alive or not. The controller manager will set the status of the pod, and then add the Taints to the unknown nodes:
node.kubernetes.io/not-ready: Node is not ready. This corresponds to the NodeCondition Ready being “False”;
node.kubernetes.io/unreachable: Node is unreachable from the node controller. This corresponds to the NodeCondition Ready being “Unknown”.
after that, the new pod will be scheduled to other nodes, and once the unknown nodes become normal the old pod will be deleted by Kubelet which is triggered by the controller manager, also the operator can delete the Terminating pod manually in this case.
ok, is finished all? I would like to see no. lets us thinking again pod has been controller by the controller manager which divided into deployments, statefulSet, Daemonset.. so we talked above is what kind of controller? I prefer is deployments much cause for statefulSet the controller manager doesn't scheduler the new pod, the real action is to wait for the status of the old pod from Terminating to Ready, for daemonset is almost same, and the daemonset is more adapt for the hot env, more detailed info can refer to here.
4. What Next?
- For CPU/memory.. is the system resource, and if we want to define the resource by ourself how should we do?
- We have seen the priority 2000001000 of pod kube-controller-manager-chunqiu-0, how do the priority get and what's the functionality of the priority?
- What happened if the node has been added the Taints.
node pressure and pod eviction的更多相关文章
- kubernetes continually evict pod when node's inode exhausted
kubernetes等容器技术可以将所有的业务进程运行在公共的资源池中,提高资源利用率,节约成本,但是为避免不同进程之间相互干扰,对底层docker, kubernetes的隔离性就有了更高的要求,k ...
- kubernetes 利用label标签来绑定到特定node运行pod
利用label标签来绑定到特定node运行pod: 不如将有大量I/O的pod部署到配置了ssd的node上或者需要使用GPU的pod部署到某些安装了GPU的节点上 查看节点的标签: kubectl ...
- Kubernetes调整Node节点快速驱逐pod的时间
在高可用的k8s集群中,当Node节点挂掉,kubelet无法提供工作的时候,pod将会自动调度到其他的节点上去,而调度到节点上的时间需要我们慎重考量,因为它决定了生产的稳定性.可靠性,更快的迁移可以 ...
- k8s,coredns内部测试node节点上的pod的calico是否正常的一个小技巧
最近由于master整个挂掉,导致相关一些基础服务瘫掉,修复中测试有些节点网络又出现不通的情况正常的启动相关一些服务后,测试一些节点,比较费劲,还有进入pod,以及还有可能涉及命名空间操作这里可以这样 ...
- 我们可以定向调度某个pod在某个node上进行创建
集群环境:1.k8s用的是二进制方式安装 2.操作系统是linux (centos)3.操作系统版本为 7.2/7.4/7.94.k8s的应用管理.node管理.pod管理等用rancher.k8s令 ...
- 8.深入k8s:资源控制Qos和eviction及其源码分析
转载请声明出处哦~,本篇文章发布于luozhiyun的博客:https://www.luozhiyun.com,源码版本是1.19 又是一个周末,可以愉快的坐下来静静的品味一段源码,这一篇涉及到资源的 ...
- Kubernetes基础:Pod的详细介绍
本文的演练环境为基于Virtualbox搭建的Kubernetes集群,具体搭建步骤可以参考kubeadm安装kubernetes V1.11.1 集群 1. 基本概念 1.1 Pod是什么 Pod是 ...
- kubernetes-深入理解pod对象(七)
Pod中如何管理多个容器 Pod中可以同时运行多个进程(作为容器运行)协同工作.同一个Pod中的容器会自动的分配到同一个 node 上.同一个Pod中的容器共享资源.网络环境和依赖,它们总是被同时调度 ...
- tune kubernetes eviction parameter
Highlight 本文会介绍kubernetes中关于集群驱逐的相关参数, 合理设置驱逐速率的考虑因素, 但是不会涉及node层面资源的驱逐阈值的设置. Basic 在kubernetes中, 如果 ...
- 深入掌握K8S Pod
k8s系列文章: 什么是K8S K8S configmap介绍 Pod是k8s中最小的调度单元,包含了一个"根容器"和其它用户业务容器. 如果你使用过k8s的话,当然会了解pod的 ...
随机推荐
- VMware安装虚拟机详细步骤
在VMware中安装CentOS7 01.目录 CentOS7的下载 CentOS7的配置 CentOS7的安装 CentOS7的网络配置 自动获取IP 固定获取IP 02.安装前提 准备工作: 提前 ...
- JXNU acm选拔赛 壮壮的数组
壮壮的数组 Time Limit : 3000/1000ms (Java/Other) Memory Limit : 65535/32768K (Java/Other) Total Submiss ...
- Reformer 模型 - 突破语言建模的极限
Reformer 如何在不到 8GB 的内存上训练 50 万个词元 Kitaev.Kaiser 等人于 20202 年引入的 Reformer 模型 是迄今为止长序列建模领域内存效率最高的 trans ...
- MyBatis高频面试题
1.MyBatis中使用#和$书写占位符有什么区别? 2.Hibernate 与 Mybatis区别(MyBatis与Hibernate有什么不同). 3.持久层设计要考虑的问题有哪些? 4.你用过的 ...
- MinIO客户端之rm
MinIO提供了一个命令行程序mc用于协助用户完成日常的维护.管理类工作. 官方资料 mc rm 删除指定的对象. 准备待删除的对象,查看对象,命令如下: ./mc ls local1/bkt2/ 控 ...
- 听说生鲜领军企业k8s集群都上云了,鱼会飞了?
在这个中秋都和国庆在一起的双节里,我们的小明还在辛辛苦苦的找工作,听说他经历了一段"难忘"的面试. 小明面对着面试官的"层层拷问",游刃有余的化解了这些难题. ...
- C++篇:第十一章_标准库_知识点大全
C++篇为本人学C++时所做笔记(特别是疑难杂点),全是硬货,虽然看着枯燥但会让你收益颇丰,可用作学习C++的一大利器 十一.标准库 include头文件: ① 一般来说,导入objective c的 ...
- CodeLab:一款让你体验丝滑般的云化JupyterLab
摘要:从AI开发特点着手,华为云AI DTSE技术布道师陈阳在DTT第五期带来主题为<云化JupyterLab:华为云CodeLab介绍>技术分享. DTSE Tech Talk是华为云开 ...
- 掌握渗透测试,从Web漏洞靶场搭建开始
摘要:漏洞靶场,不仅可以帮助我们锻炼渗透测试能力.可以帮助我们分析漏洞形成机理.更可以学习如何修复提高代码能力,同时也可以帮助我们检测各种各样漏洞扫描器的效果. 本文分享自华为云社区<Web漏洞 ...
- storybook添加全局样式与sass全局变量设置
storybook组件需要全局样式,只需在.storybook/preview.js 增加全局样式即可. import '../src/style/index.scss'; export const ...