折腾kubernetes各种问题汇总-<1>

折腾部署fluend-elasticsearch日志,折腾出一大堆问题,解决这些问题过程中,感觉又了解了不少.

如何删除不一致状态下的rc,deployment,service.

在某些情况下,经常发现kubectl进程挂起现象,然后在get时候发现删了一半,而另外的删除不了

[root@k8s-master ~]# kubectl get -f fluentd-elasticsearch/
NAME DESIRED CURRENT READY AGE
rc/elasticsearch-logging-v1 0 2 2 15h NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/kibana-logging 0 1 1 1 15h
Error from server (NotFound): services "elasticsearch-logging" not found
Error from server (NotFound): daemonsets.extensions "fluentd-es-v1.22" not found
Error from server (NotFound): services "kibana-logging" not found

删除这些deployment,service或者rc命令如下:

kubectl delete deployment kibana-logging -n kube-system --cascade=false

kubectl delete deployment kibana-logging -n kube-system  --ignore-not-found

delete rc elasticsearch-logging-v1  -n kube-system --force now --grace-period=0

删除不了后如何重置etcd

rm -rf  /var/lib/etcd/*

删除后重新reboot master结点.

reset etcd后需要重新设置网络

etcdctl mk /atomic.io/network/config '{ "Network": "192.168.0.0/16" }'

启动apiserver失败

每次启动都是报

start request repeated too quickly for kube-apiserver.service

但其实不是启动频率问题,需要查看,/var/log/messages,在我的情况中是因为开启ServiceAccount后找不到ca.crt等文件,导致启动出错

May 21 07:56:41 k8s-master kube-apiserver: Flag --port has been deprecated, see --insecure-port instead.
May 21 07:56:41 k8s-master kube-apiserver: F0521 07:56:41.692480 4299 universal_validation.go:104] Validate server run options failed: unable to load client CA file: open /var/run/kubernetes/ca.crt: no such file or directory
May 21 07:56:41 k8s-master systemd: kube-apiserver.service: main process exited, code=exited, status=255/n/a
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.
May 21 07:56:41 k8s-master systemd: Unit kube-apiserver.service entered failed state.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service failed.
May 21 07:56:41 k8s-master systemd: kube-apiserver.service holdoff time over, scheduling restart.
May 21 07:56:41 k8s-master systemd: start request repeated too quickly for kube-apiserver.service
May 21 07:56:41 k8s-master systemd: Failed to start Kubernetes API Server.

在部署fluentd等日志组件的时候,很多问题都是因为需要开启ServiceAccount选项需要配置安全导致,所以说到底还是需要配置好ServiceAccount.

出现Permission denied情况

在配置fluentd时候出现cannot create /var/log/fluentd.log: Permission denied错误,这是因为没有关掉SElinux安全导致.

可以在/etc/selinux/config中将SELINUX=enforcing设置成disabled,然后reboot

基于ServiceAccount的配置

首先生成各种需要的keys,k8s-master需替换成master的主机名.

openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=k8s-master" -days 10000 -out ca.crt
openssl genrsa -out server.key 2048 echo subjectAltName=IP:10.254.0.1 > extfile.cnf #ip由下述命令决定 #kubectl get services --all-namespaces |grep 'default'|grep 'kubernetes'|grep '443'|awk '{print $3}' openssl req -new -key server.key -subj "/CN=k8s-master" -out server.csr openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -extfile extfile.cnf -out server.crt -days 10000

如果修改/etc/kubernetes/apiserver的配置文件参数的话,通过systemctl start kube-apiserver启动失败,出错信息为:

Validate server run options failed: unable to load client CA file: open /root/keys/ca.crt: permission denied

但可以通过命令行启动API Server

/usr/bin/kube-apiserver --logtostderr=true --v=0 --etcd-servers=http://k8s-master:2379 --address=0.0.0.0 --port=8080 --kubelet-port=10250 --allow-privileged=true --service-cluster-ip-range=10.254.0.0/16 --admission-control=ServiceAccount --insecure-bind-address=0.0.0.0 --client-ca-file=/root/keys/ca.crt --tls-cert-file=/root/keys/server.crt --tls-private-key-file=/root/keys/server.key --basic-auth-file=/root/keys/basic_auth.csv --secure-port=443 &>> /var/log/kubernetes/kube-apiserver.log &

命令行启动Controller-manager

/usr/bin/kube-controller-manager --logtostderr=true --v=0 --master=http://k8s-master:8080 --root-ca-file=/root/keys/ca.crt --service-account-private-key-file=/root/keys/server.key & >>/var/log/kubernetes/kube-controller-manage.log

ETCD启动不起来-问题<1>

etcd是kubernetes集群的zookeeper进程,几乎所有的service都依赖于etcd的启动,比如flanneld,apiserver,docker.....

在启动etcd是报错日志如下

May 24 13:39:09 k8s-master systemd: Stopped Flanneld overlay address etcd agent.
May 24 13:39:28 k8s-master systemd: Starting Etcd Server...
May 24 13:39:28 k8s-master etcd: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_NAME, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadowed by corresponding flag
May 24 13:39:28 k8s-master etcd: etcd Version: 3.1.3
May 24 13:39:28 k8s-master etcd: Git SHA: 21fdcc6
May 24 13:39:28 k8s-master etcd: Go Version: go1.7.4
May 24 13:39:28 k8s-master etcd: Go OS/Arch: linux/amd64
May 24 13:39:28 k8s-master etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
May 24 13:39:28 k8s-master etcd: the server is already initialized as member before, starting as etcd member...
May 24 13:39:28 k8s-master etcd: listening for peers on http://localhost:2380
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:2379
May 24 13:39:28 k8s-master etcd: listening for client requests on 0.0.0.0:4001
May 24 13:39:28 k8s-master etcd: recovered store from snapshot at index 140014
May 24 13:39:28 k8s-master etcd: name = master
May 24 13:39:28 k8s-master etcd: data dir = /var/lib/etcd/default.etcd
May 24 13:39:28 k8s-master etcd: member dir = /var/lib/etcd/default.etcd/member
May 24 13:39:28 k8s-master etcd: heartbeat = 100ms
May 24 13:39:28 k8s-master etcd: election = 1000ms
May 24 13:39:28 k8s-master etcd: snapshot count = 10000
May 24 13:39:28 k8s-master etcd: advertise client URLs = http://etcd:2379,http://etcd:4001
May 24 13:39:28 k8s-master etcd: ignored file 0000000000000001-0000000000012700.wal.broken in wal
May 24 13:39:29 k8s-master etcd: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 148905
May 24 13:39:29 k8s-master etcd: 8e9e05c52164694d became follower at term 12
May 24 13:39:29 k8s-master etcd: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 12, commit: 148905, applied: 140014, lastindex: 148905, lastterm: 12]
May 24 13:39:29 k8s-master etcd: enabled capabilities for version 3.1
May 24 13:39:29 k8s-master etcd: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
May 24 13:39:29 k8s-master etcd: set the cluster version to 3.1 from store
May 24 13:39:29 k8s-master etcd: starting server... [version: 3.1.3, cluster version: 3.1]
May 24 13:39:29 k8s-master etcd: raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory
May 24 13:39:29 k8s-master systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
May 24 13:39:29 k8s-master systemd: Failed to start Etcd Server.
May 24 13:39:29 k8s-master systemd: Unit etcd.service entered failed state.
May 24 13:39:29 k8s-master systemd: etcd.service failed.
May 24 13:39:29 k8s-master systemd: etcd.service holdoff time over, scheduling restart.

核心语句

raft save state and entries error: open /var/lib/etcd/default.etcd/member/wal/0.tmp: is a directory进入相关目录,删除0.tmp,然后就可以启动啦!

ETCD启动不起来-超时问题<2>

问题背景:当前部署了3个etcd节点,突然有一天3台集群全部停电宕机了。重新启动之后发现K8S 集群是可以正常使用的,但是检查了一遍组件之后,发现有一个节点的etcd启动不了。

经过一遍探查,发现时间不准确,通过以下命令ntpdate ntp.aliyun.com 重新将时间调整正确,重新启动etcd,发现还是起不来,报错如下:

Mar 05 14:27:15 k8s-node2 etcd[3248]: etcd Version: 3.3.13
Mar 05 14:27:15 k8s-node2 etcd[3248]: Git SHA: 98d3084
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go Version: go1.10.8
Mar 05 14:27:15 k8s-node2 etcd[3248]: Go OS/Arch: linux/amd64
Mar 05 14:27:15 k8s-node2 etcd[3248]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Mar 05 14:27:15 k8s-node2 etcd[3248]: the server is already initialized as member before, starting as etcd member
...
Mar 05 14:27:15 k8s-node2 etcd[3248]: peerTLS: cert = /opt/etcd/ssl/server.pem, key = /opt/etcd/ssl/server-key.pe
m, ca = , trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file =
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for peers on https://192.168.25.226:2380
Mar 05 14:27:15 k8s-node2 etcd[3248]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert
files are presented. Ignored key/cert files.
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 127.0.0.1:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: listening for client requests on 192.168.25.226:2379
Mar 05 14:27:15 k8s-node2 etcd[3248]: member 9c166b8b7cb6ecb8 has already been bootstrapped
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Mar 05 14:27:15 k8s-node2 systemd[1]: Failed to start Etcd Server.
Mar 05 14:27:15 k8s-node2 systemd[1]: Unit etcd.service entered failed state.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service failed.
Mar 05 14:27:15 k8s-node2 systemd[1]: etcd.service holdoff time over, scheduling restart.
Mar 05 14:27:15 k8s-node2 systemd[1]: Starting Etcd Server...
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_NAME, but unused: shadowed by correspo
nding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_DATA_DIR, but unused: shadowed by corr
esponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_PEER_URLS, but unused: shadowed
by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_LISTEN_CLIENT_URLS, but unused: shadow
ed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS, but unuse
d: shadowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_ADVERTISE_CLIENT_URLS, but unused: sha
dowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER, but unused: shadowed
by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_TOKEN, but unused: sha
dowed by corresponding flag
Mar 05 14:27:15 k8s-node2 etcd[3258]: recognized environment variable ETCD_INITIAL_CLUSTER_STATE, but unused: sha
dowed by corresponding flag

解决方法:

检查日志发现并没有特别明显的错误,根据经验来讲,etcd节点坏掉一个其实对集群没有大的影响,这时集群已经可以正常使用了,但是这个坏掉的etcd节点并没有启动,解决方法如下:

  1. 进入etcd的数据存储目录进行备份

    备份原有数据:
    cd /var/lib/etcd/default.etcd/member/
    cp * /data/bak/
  2. 删除这个目录下的所有数据文件

    rm -rf /var/lib/etcd/default.etcd/member/*
  3. 停止另外两台etcd 节点,因为etcd节点启动时需要所有节点一起启动,启动成功后即可使用。

    master节点
    systemctl stop etcd
    systemctl restart etcd node1节点
    systemctl stop etcd
    systemctl restart etcd node2节点
    systemctl stop etcd
    systemctl restart etcd

CentOS下配置主机互信

  • 在每台服务器需要建立主机互信的用户名执行以下命令生成公钥/密钥,默认回车即可
ssh-keygen -t rsa

可以看到生成个公钥的文件

  • 互传公钥,第一次需要输入密码,之后就OK了
ssh-copy-id -i /root/.ssh/id_rsa.pub root@192.168.199.132 (-p 2222)

-p 端口 默认端口不加-p,如果更改过端口,就得加上-p

可以看到是在.ssh/下生成了个authorized_keys的文件,记录了能登陆这台服务器的其他服务器的公钥

  • 测试看是否能登陆
ssh 192.168.199.132 (-p 2222)

CentOS主机名的修改

hostnamectl set-hostname k8s-master1

Virtualbox实现CentOS复制和粘贴功能

如果不安装或者不输出,可以将update修改成install再运行

yum install update
yum update kernel
yum update kernel-devel
yum install kernel-headers
yum install gcc
yum install gcc make

运行完后sh VBoxLinuxAdditions.run

删除Pod一直处于Terminating状态

可以通过下面命令强制删除

kubectl delete pod NAME --grace-period=0 --force

删除namespace一直处于Terminating状态

可以通过以下脚本强制删除

[root@k8s-master1 k8s]# cat delete-ns.sh
#!/bin/bash
set -e useage(){
echo "useage:"
echo " delns.sh NAMESPACE"
} if [ $# -lt 1 ];then
useage
exit
fi NAMESPACE=$1
JSONFILE=${NAMESPACE}.json
kubectl get ns "${NAMESPACE}" -o json > "${JSONFILE}"
vi "${JSONFILE}"
curl -k -H "Content-Type: application/json" -X PUT --data-binary @"${JSONFLE}" \
http://127.0.0.1:8001/api/v1/namespaces/"${NAMESPACE}"/finalize

容器包含有效的 CPU/内存requests且没有指定limits可能会出现什么问题?

下面我们创建一个对应的容器,该容器只有requests设定,但是没有limits 设定,

- name: busybox-cnt02
image: busybox
command: ["/bin/sh"]
args: ["-c", "while true; do echo hello from cnt02; sleep 10;done"]
resources:
requests:
memory: "100Mi"
cpu: "100m"

这个容器创建出来会有什么问题呢?

其实对于正常的环境来说没有什么问题,但是对于资源型pod来说,如果有的容器没有设定limit限制,资源会被其他的pod抢占走,可能会造成容器应用失败的情况。可以通过limitrange 策略来去匹配,让pod 自动设定,前提是要提前配置好limitrange规则。

折腾kubernetes各种问题汇总-<1>的更多相关文章

  1. 折腾kubernetes各种问题汇总

    折腾fluend-elasticsearch日志,折腾出一大堆问题,解决这些问题过程中,感觉又了解了不少. 1.如何删除不一致状态下的rc,deployment,service. 在某些情况下,经常发 ...

  2. kubernetes 报错汇总

    一. pod的报错: 1. pod的容器无法启动报错: 报错信息: Normal SandboxChanged 4m9s (x12 over 5m18s) kubelet, k8sn1 Pod san ...

  3. .NET Core Run On Docker By Kubernetes 系列文章汇总

    前言介绍 .NET Core是微软新一代主力编程平台,开源.免费.跨平台.轻量级.高性能,支持Linux.Docker.k8s等环境,适合开发微服务.云原生.大型互联网应用.全开源解决方案. Dock ...

  4. docker 学习

    vim /usr/lib/systemd/system/docker.service ExecStart=/usr/bin/docker daemon --bip=172.18.42.1/16 --r ...

  5. docker.service启动失败:Unit not found

    docker.service启动失败:Unit not found 版权声明:本文为博主原创文章,未经博主允许不得转载. 背景 因为最近一直在折腾Kubernetes集群版本升级.Docker版本升级 ...

  6. kubernetes 基本概念和资源对象汇总

    kubernetes 基本概念和知识点脑图 基本概念 kubernetes 中的绝大部分概念都抽象成kubernets管理的资源对象,主要有以下类别: Master : Master节点是kubern ...

  7. ASP.NET Core 折腾笔记二:自己写个完整的Cache缓存类来支持.NET Core

    背景: 1:.NET Core 已经没System.Web,也木有了HttpRuntime.Cache,因此,该空间下Cache也木有了. 2:.NET Core 有新的Memory Cache提供, ...

  8. kubernetes部署Fluentd+Elasticsearch+kibana 日志收集系统

    一.介绍 1. Fluentd 是一个开源收集事件和日志系统,用与各node节点日志数据的收集.处理等等.详细介绍移步-->官方地址:http://fluentd.org/ 2. Elastic ...

  9. Openstack+Kubernetes+Docker微服务实践之路--服务发布

    结合上文,我们的服务已经可以正常运行了,但它的访问方式只能通过服务器IP加上端口来访问,如何通过域名的方式来访问到我们服务,本来想使用Kubernetes的Ingress来做,折腾一天感觉比较麻烦,I ...

随机推荐

  1. switchable css dark theme in js & html custom element

    switchable css dark theme in js & html custom element dark theme / dark mode https://codepen.io/ ...

  2. OLAP

    OLAP Online Analytical Processing https://en.wikipedia.org/wiki/Online_analytical_processing 在线分析处理 ...

  3. 调整是为了更好的上涨,牛市下的SPC空投来了!

    2021年刚过没几天,比特币就开启了牛市的旅程,BTC涨到4万美元,ETH涨到1300多美元,BGV也涨到了621.05美元,牛市已然来袭. 虽然从近两日,比特币带领着主流币进行了一波调整,但是只涨不 ...

  4. 主打开放式金融的Baccarat项目如何开疆拓土?

    DeFi在这个夏天迎来了大爆发,像无托管交易.流动性挖矿.保险协议.NFT代币都在今年看到了有别于以往的应用.随着比特币走入主流,DeFi热度下降,不少人都觉得DeFi热潮已死.但事实是,DeFi的总 ...

  5. Java的稀疏数组的简单代码实现

    目录 Java的稀疏数组的简单代码实现 一.稀疏数组的基本概念 二.稀疏数组的Java代码实现思路 三.稀释数组的Java代码实现 四.结语 Java的稀疏数组的简单代码实现 一.稀疏数组的基本概念 ...

  6. 利用 Java 操作 Jenkins API 实现对 Jenkins 的控制详解

    本文转载自利用 Java 操作 Jenkins API 实现对 Jenkins 的控制详解 导语 由于最近工作需要利用 Jenkins 远程 API 操作 Jenkins 来完成一些列操作,就抽空研究 ...

  7. 微信小程序开发小技巧:

    小技巧:输入view.tabs_content就可以生成下面的代码. 输入p10,就可以得到: 输入jc:c得到:文字水平对齐 输入d:f得到: 输入ai:c得到: 输入bb得到: currentCo ...

  8. Java基本概念:接口

    一.简介 描述: 普通类只有具体实现,抽象类具体实现和规范都有,接口只有规范! 接口就是比抽象类还抽象的抽象类,可以更加规范的对子类进行约束,全面专业地实现了规范和具体实现的分离. 抽象类还提供某些具 ...

  9. 第44天学习打卡(JUC 线程和进程 并发和并行 Lock锁 生产者和消费者问题 如何判断锁(8锁问题) 集合类不安全)

    什么是JUC 1.java.util工具包 包 分类 业务:普通的线程代码 Thread Runnable 没有返回值.效率相比Callable相对较低 2.线程和进程 进程:一个程序.QQ.exe, ...

  10. linux的0,1,2号进程

    一.答案 https://blog.csdn.net/gatieme/article/details/51532804 仔细读一下作者的博客,都是操作系统底层相关. 二.补充: 1.linux代码,a ...