1. etcd installation

rpm -ivh etcd-3.2.15-1.el7.x86_64.rpm
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd
export ETCDCTL_API=3
systemctl status etcd

The /etc/hosts entries are as follows:

192.168.0.100 etcd01
192.168.0.101 etcd02
192.168.0.102 etcd03
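
A minimal sketch for applying these entries on each of the three nodes (assumption: /etc/hosts is not managed by other tooling):

# cat >> /etc/hosts <<'EOF'
192.168.0.100 etcd01
192.168.0.101 etcd02
192.168.0.102 etcd03
EOF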

2. etcd configuration

The configuration on etcd02 is shown below; for the full procedure see the Kubernetes 1.9 cluster configuration guide.

# egrep -v "^$|^#" /etc/etcd/etcd.conf
ETCD_DATA_DIR="/var/lib/etcd/"
ETCD_LISTEN_PEER_URLS="https://192.168.0.101:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.101:2379,http://127.0.0.1:2379"
ETCD_NAME="etcd02"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.101:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.101:2379"
ETCD_INITIAL_CLUSTER="etcd01=https://192.168.0.100:2380,etcd02=https://192.168.0.101:2380,etcd03=https://192.168.0.102:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_CERT_FILE="/etc/kubernetes/ssl/etcd.pem"
ETCD_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem"
ETCD_CLIENT_CERT_AUTH="true"
ETCD_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem"
ETCD_AUTO_TLS="true"
ETCD_PEER_CERT_FILE="/etc/kubernetes/ssl/etcd.pem"
ETCD_PEER_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem"
ETCD_PEER_CLIENT_CERT_AUTH="true"
ETCD_PEER_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem"
ETCD_PEER_AUTO_TLS="true"
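
etcd01 and etcd03 use the same file with only the member name and node IP changed. A minimal sketch of the differing lines for etcd01 (assumption: the certificate paths and all remaining settings are identical to etcd02):

ETCD_NAME="etcd01"
ETCD_LISTEN_PEER_URLS="https://192.168.0.100:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.100:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.100:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.100:2379"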

3. Failure and error messages

With a three-node cluster, after the machines were powered off abruptly, etcd02 failed to start and logged the following errors:

etcd: advertise client URLs = https://192.168.0.101:2379
etcd: read wal error (wal: crc mismatch) and cannot be repaired
systemd: etcd.service: main process exited, code=exited, status=1/FAILURE

The WAL CRC check failed, and searching for the error turned up nothing useful, so the plan was to remove this etcd member from the cluster and then bring it back.
On a healthy etcd node, remove the failed member:

# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
ac2f188e97f50eb7, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379
# etcdctl member remove ac2f188e97f50eb7
Member ac2f188e97f50eb7 removed from cluster 194cd14a48430083

Then start the etcd service on etcd02 again:

# systemctl start etcd

It fails with:

etcd: error validating peerURLs {ClusterID:194cd14a48430083 Members:[&{ID:1ce6d6d01109192 RaftAttributes:{PeerURLs:[https://192.168.0.102:2380]} Attributes:{Name:etcd03 ClientURLs:[https://192.168.0.102:2379]}} &{ID:9b534175b46ea789 RaftAttributes:{PeerURLs:[https://192.168.0.100:2380]} Attributes:{Name:etcd01 ClientURLs:[https://192.168.0.100:2379]}}] RemovedMemberIDs:[]}: member count is unequal

followed by:

etcd: the member has been permanently removed from the cluster the data-dir used by this member must be removed

4. etcd data recovery

Try restoring the data on the etcd02 node:

# mv /var/lib/etcd/member /var/lib/member
# rm -rf /var/lib/etcd/*
# etcdctl snapshot restore /var/lib/member/snap/db --skip-hash-check=true
2018-06-22 11:28:35.622666 I | mvcc: restore compact to 10177401
2018-06-22 11:28:35.659626 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
# systemctl start etcd

The service started and elected itself leader of a brand-new single-member cluster, but it still failed when trying to join the original cluster. So take a snapshot on a healthy node and re-add the member from there:

# etcdctl snapshot save etcdback.db
# etcdctl member add etcd02 http://192.168.0.101:2380
Error: member name not provided.

Check the other two etcd members currently in the cluster:

# curl -k --key /etc/kubernetes/ssl/etcd-key.pem --cert /etc/kubernetes/ssl/etcd.pem https://192.168.0.100:2380/members
[{"id":130161754177048978,"peerURLs":["https://192.168.0.102:2380"],"name":"etcd03","clientURLs":["https://192.168.0.102:2379"]},{"id":11192361472739944329,"peerURLs":["https://192.168.0.100:2380"],"name":"etcd01","clientURLs":["https://192.168.0.100:2379"]}]

The reference documentation gives:

etcdctl member add etcd_name --peer-urls="https://peerURLs"

Add the member again:

# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380"
Member 41c2a7b938a5e387 added to cluster 194cd14a48430083

ETCD_NAME="etcd02"
ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

Check the etcd member status:

# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
ad17c3da831c84c7, unstarted, , https://192.168.0.101:2380,

Starting etcd on etcd02 then fails with:

etcd: request cluster ID mismatch (got 194cd14a48430083 want cdf818194e3a8c32)

The steps were done in the wrong order: the member must be added to the cluster first, and only then should the etcd service be started. Because etcd was started first on the restored snapshot, it bootstrapped itself as a standalone single-member cluster with a different cluster ID (cdf818194e3a8c32 instead of 194cd14a48430083).

Rejoining the etcd node to the cluster

On the failed etcd host:

# systemctl stop etcd

On a healthy etcd host:

# etcdctl member remove ad17c3da831c84c7
# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380"
Member 41c2a7b938a5e387 added to cluster 194cd14a48430083

ETCD_NAME="etcd02"
ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
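
Back on the failed host, etcd can now be started against an empty data directory with the cluster state set to "existing"; a minimal sketch, assuming the /etc/etcd/etcd.conf shown in section 2:

# rm -rf /var/lib/etcd/*
# grep ETCD_INITIAL_CLUSTER_STATE /etc/etcd/etcd.conf
ETCD_INITIAL_CLUSTER_STATE="existing"
# systemctl start etcd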

On the previously failed etcd host, check the current etcd status:

# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 24.52485ms
# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
41c2a7b938a5e387, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379

At this point the etcd failure has been fully repaired.

5. Common etcd commands

Check status:

# export ETCDCTL_API=3

# etcdctl endpoint status --write-out=table
+----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------+------------------+---------+---------+-----------+-----------+------------+
| 127.0.0.1:2379 | 41c2a7b938a5e387 | 3.2.15 | 15 MB | true | 317 | 11051403 |
+----------------+------------------+---------+---------+-----------+-----------+------------+
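
To check all three members instead of only the local endpoint, the endpoints and the client TLS material can be passed explicitly; a sketch assuming the certificate paths from section 2:

# etcdctl --endpoints="https://192.168.0.100:2379,https://192.168.0.101:2379,https://192.168.0.102:2379" \
    --cacert=/etc/kubernetes/ssl/ca.pem \
    --cert=/etc/kubernetes/ssl/etcd.pem \
    --key=/etc/kubernetes/ssl/etcd-key.pem \
    endpoint status --write-out=table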

Backup and restore:

etcdctl snapshot save etcdback.db
etcdctl snapshot status etcdback.db --write-out=table
etcdctl snapshot restore etcdback.db --skip-hash-check=true
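
For restoring a full cluster (rather than the single-node restore from section 4), snapshot restore also takes flags that rewrite the member metadata; a sketch for etcd02, assuming the restore goes into a fresh data directory before the service is started:

# etcdctl snapshot restore etcdback.db \
    --name etcd02 \
    --data-dir /var/lib/etcd \
    --initial-cluster "etcd01=https://192.168.0.100:2380,etcd02=https://192.168.0.101:2380,etcd03=https://192.168.0.102:2380" \
    --initial-cluster-token etcd-cluster \
    --initial-advertise-peer-urls https://192.168.0.101:2380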

etcd monitoring

# curl -L http://localhost:2379/metrics
# HELP etcd_debugging_mvcc_keys_total Total number of keys.
# TYPE etcd_debugging_mvcc_keys_total gauge
etcd_debugging_mvcc_keys_total 776
# HELP etcd_debugging_mvcc_pending_events_total Total number of pending events to be sent.
# TYPE etcd_debugging_mvcc_pending_events_total gauge
etcd_debugging_mvcc_pending_events_total 0
# HELP etcd_debugging_mvcc_put_total Total number of puts seen by this member.
# TYPE etcd_debugging_mvcc_put_total counter
etcd_debugging_mvcc_put_total 9.548201e+06
# HELP etcd_debugging_mvcc_range_total Total number of ranges seen by this member.
# TYPE etcd_debugging_mvcc_range_total counter
etcd_debugging_mvcc_range_total 2.1052143e+07
# HELP etcd_debugging_mvcc_slow_watcher_total Total number of unsynced slow watchers.
# TYPE etcd_debugging_mvcc_slow_watcher_total gauge
etcd_debugging_mvcc_slow_watcher_total 0
# HELP etcd_debugging_mvcc_txn_total Total number of txns seen by this member.
# TYPE etcd_debugging_mvcc_txn_total counter
etcd_debugging_mvcc_txn_total 0
# HELP etcd_debugging_mvcc_watch_stream_total Total number of watch streams.
# TYPE etcd_debugging_mvcc_watch_stream_total gauge
etcd_debugging_mvcc_watch_stream_total 125
# HELP etcd_debugging_mvcc_watcher_total Total number of watchers.
# TYPE etcd_debugging_mvcc_watcher_total gauge
etcd_debugging_mvcc_watcher_total 125
# HELP etcd_debugging_server_lease_expired_total The total number of expired leases.
# TYPE etcd_debugging_server_lease_expired_total counter
etcd_debugging_server_lease_expired_total 3649

These metrics are well suited to scraping with Prometheus:

global:
  scrape_interval: 10s
scrape_configs:
  - job_name: etcd
    static_configs:
      - targets: ['192.168.0.100:2379','192.168.0.101:2379','192.168.0.102:2379']
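
Because the client ports here serve HTTPS with client-certificate auth, the scrape job also needs a TLS section; a sketch reusing the certificate paths from section 2 (assumption: these files are readable by the Prometheus process):

scrape_configs:
  - job_name: etcd
    scheme: https
    tls_config:
      ca_file: /etc/kubernetes/ssl/ca.pem
      cert_file: /etc/kubernetes/ssl/etcd.pem
      key_file: /etc/kubernetes/ssl/etcd-key.pem
    static_configs:
      - targets: ['192.168.0.100:2379','192.168.0.101:2379','192.168.0.102:2379']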

An illustrated walkthrough of the Raft algorithm: http://thesecretlivesofdata.com/raft/

Reading Kubernetes data from etcd

# export ETCDCTL_API=3
# etcdctl get /registry/namespaces/default --prefix -w json|python -m json.tool
{
    "count": 1,
    "header": {
        "cluster_id": 1823062066148343939,
        "member_id": 11192361472739944329,
        "raft_term": 317,
        "revision": 10880816
    },
    "kvs": [
        {
            "create_revision": 6,
            "key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==",
            "mod_revision": 6,
            "value": "azhzAAoPCgJ2MRIJTmFtZXNwYWNlEl8KRQoHZGVmYXVsdBIAGgAiACokOTVlNzdjMWEtM2Q1Ny0xMWU4LTk5YzItMDA1MDU2YmU3NWEzMgA4AEIICK7qttYFEAB6ABIMCgprdWJlcm5ldGVzGggKBkFjdGl2ZRoAIgA=",
            "version": 1
        }
    ]
}

Decode the key contents:

# echo L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA== |base64 -d
/registry/namespaces/default

Get the keys of all Kubernetes objects in etcd:

#!/bin/bash
# Get kubernetes keys from etcd
export ETCDCTL_API=3
keys=`etcdctl get /registry --prefix -w json | python -m json.tool | grep key | cut -d ":" -f2 | tr -d '"' | tr -d ","`
for x in $keys; do
    echo $x | base64 -d | sort
done
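
A usage sketch, assuming the script above is saved as get-k8s-keys.sh (hypothetical filename), e.g. to list only the pod keys:

# bash get-k8s-keys.sh | grep '^/registry/pods/'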

Original source: bbotte -> http://bbotte.com/server-config/etcd-cluster-troubleshooting/
