etcd集群故障处理(转)
1. etcd安装
rpm -ivh etcd-3.2.15-1.el7.x86_64.rpm
systemctl daemon-reload
systemctl enable etcd
systemctl start etcd
export ETCDCTL_API=3
systemctl status etcd
hosts如下
192.168.0.100 etcd01
192.168.0.101 etcd02
192.168.0.102 etcd03
2. etcd配置
etcd02配置如下,详细见kubernetes1.9版本集群配置向导
# egrep -v "^$|^#" /etc/etcd/etcd.conf
ETCD_DATA_DIR="/var/lib/etcd/"
ETCD_LISTEN_PEER_URLS="https://192.168.0.101:2380"
ETCD_LISTEN_CLIENT_URLS="https://192.168.0.101:2379,http://127.0.0.1:2379"
ETCD_NAME="etcd02"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.101:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.101:2379"
ETCD_INITIAL_CLUSTER="etcd01=https://192.168.0.100:2380,etcd02=https://192.168.0.101:2380,etcd03=https://192.168.0.102:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_CERT_FILE="/etc/kubernetes/ssl/etcd.pem"
ETCD_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem"
ETCD_CLIENT_CERT_AUTH="true"
ETCD_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem"
ETCD_AUTO_TLS="true"
ETCD_PEER_CERT_FILE="/etc/kubernetes/ssl/etcd.pem"
ETCD_PEER_KEY_FILE="/etc/kubernetes/ssl/etcd-key.pem"
ETCD_PEER_CLIENT_CERT_AUTH="true"
ETCD_PEER_TRUSTED_CA_FILE="/etc/kubernetes/ssl/ca.pem"
ETCD_PEER_AUTO_TLS="true"
3. 故障报错
3个节点做集群,直接关机后,etcd02故障,报错:
etcd: advertise client URLs = https://192.168.0.101:2379
etcd: read wal error (wal: crc mismatch) and cannot be repaired
systemd: etcd.service: main process exited, code=exited, status=1/FAILURE
wal的cec校验出错,谷歌了一下,没什么结果,于是移除这个etcd,再恢复
在正常的etcd节点移除
# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
ac2f188e97f50eb7, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379
# etcdctl member remove ac2f188e97f50eb7
Member ac2f188e97f50eb7 removed from cluster 194cd14a48430083
再启动etcd服务
# systemctl start etcd
报错:
etcd: error validating peerURLs {ClusterID:194cd14a48430083 Members:[&{ID:1ce6d6d01109192 RaftAttributes:{PeerURLs:[https://192.168.0.102:2380]} Attributes:{Name:etcd03 ClientURLs:[https://192.168.0.102:2379]}} &{ID:9b534175b46ea789 RaftAttributes:{PeerURLs:[https://192.168.0.100:2380]} Attributes:{Name:etcd01 ClientURLs:[https://192.168.0.100:2379]}}] RemovedMemberIDs:[]}: member count is unequal
报错:
etcd: the member has been permanently removed from the cluster the data-dir used by this member must be removed
4. etcd恢复数据
在etcd02节点恢复一下数据试试:
# mv /var/lib/etcd/member /var/lib/member
# rm -rf /var/lib/etcd/*
# etcdctl snapshot restore /var/lib/member/snap/db --skip-hash-check=true
2018-06-22 11:28:35.622666 I | mvcc: restore compact to 10177401
2018-06-22 11:28:35.659626 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32
# systemctl start etcd
服务启动了,自己把自己选做主,服务倒是启动了,加入集群还是出错,用正常的节点备份再恢复
# etcdctl snapshot save etcdback.db
# etcdctl member add etcd02 http://192.168.0.101:2380
Error: member name not provided.
看看现在集群的其他2个etcd
# curl -k --key /etc/kubernetes/ssl/etcd-key.pem --cert /etc/kubernetes/ssl/etcd.pem https://192.168.0.100:2380/members
[{"id":130161754177048978,"peerURLs":["https://192.168.0.102:2380"],"name":"etcd03","clientURLs":["https://192.168.0.102:2379"]},{"id":11192361472739944329,"peerURLs":["https://192.168.0.100:2380"],"name":"etcd01","clientURLs":["https://192.168.0.100:2379"]}]
参考文档:
etcdctl member add etcd_name –peer-urls=”https://peerURLs”
再次添加
# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380"
Member 41c2a7b938a5e387 added to cluster 194cd14a48430083
ETCD_NAME="etcd02"
ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
查看etcd member状态:
# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
ad17c3da831c84c7, unstarted, , https://192.168.0.101:2380,
报错:
etcd: request cluster ID mismatch (got 194cd14a48430083 want cdf818194e3a8c32)
发现步骤顺序错误,应该是先添加到etcd集群,再启动etcd服务,我们现在先启动etcd服务,就是一个etcd单点
etcd节点加入集群
故障的etcd主机
# systemctl stop etcd
正常的etcd主机:
# etcdctl member remove ad17c3da831c84c7
# etcdctl member add etcd02 --peer-urls="https://192.168.0.101:2380"
Member 41c2a7b938a5e387 added to cluster 194cd14a48430083
ETCD_NAME="etcd02"
ETCD_INITIAL_CLUSTER="etcd03=https://192.168.0.102:2380,etcd02=https://192.168.0.101:2380,etcd01=https://192.168.0.100:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
故障的etcd主机,查看现在etcd状态:
# etcdctl endpoint health
127.0.0.1:2379 is healthy: successfully committed proposal: took = 24.52485ms
# etcdctl member list
1ce6d6d01109192, started, etcd03, https://192.168.0.102:2380, https://192.168.0.102:2379
41c2a7b938a5e387, started, etcd02, https://192.168.0.101:2380, https://192.168.0.101:2379
9b534175b46ea789, started, etcd01, https://192.168.0.100:2380, https://192.168.0.100:2379
到这里,etcd故障修复完毕
5. etcd常用命令
查看状态
# export ETCDCTL_API=3
# etcdctl endpoint status --write-out=table
+----------------+------------------+---------+---------+-----------+-----------+------------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------+------------------+---------+---------+-----------+-----------+------------+
| 127.0.0.1:2379 | 41c2a7b938a5e387 | 3.2.15 | 15 MB | true | 317 | 11051403 |
+----------------+------------------+---------+---------+-----------+-----------+------------+
备份及恢复
etcdctl snapshot save etcdback.db
etcdctl snapshot status etcdback.db --write-out=table
etcdctl snapshot restore etcdback.db --skip-hash-check=true
etcd监控
# curl -L http://localhost:2379/metrics
# HELP etcd_debugging_mvcc_keys_total Total number of keys.
# TYPE etcd_debugging_mvcc_keys_total gauge
etcd_debugging_mvcc_keys_total 776
# HELP etcd_debugging_mvcc_pending_events_total Total number of pending events to be sent.
# TYPE etcd_debugging_mvcc_pending_events_total gauge
etcd_debugging_mvcc_pending_events_total 0
# HELP etcd_debugging_mvcc_put_total Total number of puts seen by this member.
# TYPE etcd_debugging_mvcc_put_total counter
etcd_debugging_mvcc_put_total 9.548201e+06
# HELP etcd_debugging_mvcc_range_total Total number of ranges seen by this member.
# TYPE etcd_debugging_mvcc_range_total counter
etcd_debugging_mvcc_range_total 2.1052143e+07
# HELP etcd_debugging_mvcc_slow_watcher_total Total number of unsynced slow watchers.
# TYPE etcd_debugging_mvcc_slow_watcher_total gauge
etcd_debugging_mvcc_slow_watcher_total 0
# HELP etcd_debugging_mvcc_txn_total Total number of txns seen by this member.
# TYPE etcd_debugging_mvcc_txn_total counter
etcd_debugging_mvcc_txn_total 0
# HELP etcd_debugging_mvcc_watch_stream_total Total number of watch streams.
# TYPE etcd_debugging_mvcc_watch_stream_total gauge
etcd_debugging_mvcc_watch_stream_total 125
# HELP etcd_debugging_mvcc_watcher_total Total number of watchers.
# TYPE etcd_debugging_mvcc_watcher_total gauge
etcd_debugging_mvcc_watcher_total 125
# HELP etcd_debugging_server_lease_expired_total The total number of expired leases.
# TYPE etcd_debugging_server_lease_expired_total counter
etcd_debugging_server_lease_expired_total 3649
适合用prometheus监控
global:
scrape_interval: 10s
scrape_configs:
- job_name: etcd
static_configs:
- targets: ['192.168.0.100:2379','192.168.0.101:2379','192.168.0.102:2379']
图解raft算法 http://thesecretlivesofdata.com/raft/
etcd获取kubernetes的数据
# export ETCDCTL_API=3
# etcdctl get /registry/namespaces/default --prefix -w json|python -m json.tool
{
"count": 1,
"header": {
"cluster_id": 1823062066148343939,
"member_id": 11192361472739944329,
"raft_term": 317,
"revision": 10880816
},
"kvs": [
{
"create_revision": 6,
"key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA==",
"mod_revision": 6,
"value": "azhzAAoPCgJ2MRIJTmFtZXNwYWNlEl8KRQoHZGVmYXVsdBIAGgAiACokOTVlNzdjMWEtM2Q1Ny0xMWU4LTk5YzItMDA1MDU2YmU3NWEzMgA4AEIICK7qttYFEAB6ABIMCgprdWJlcm5ldGVzGggKBkFjdGl2ZRoAIgA=",
"version": 1
}
]
}
查看key的内容
# echo L3JlZ2lzdHJ5L25hbWVzcGFjZXMvZGVmYXVsdA== |base64 -d
/registry/namespaces/default
#!/bin/bash
# Get kubernetes keys from etcd
export ETCDCTL_API=3
keys=`etcdctl get /registry --prefix -w json|python -m json.tool|grep key|cut -d ":" -f2|tr -d '"'|tr -d ","`
for x in $keys;do
echo $x|base64 -d|sort
done
获取etcd中kubernetes所有对象的key
原文出处:bbotte -> http://bbotte.com/server-config/etcd-cluster-troubleshooting/
etcd集群故障处理(转)的更多相关文章
- Centos7下Etcd集群搭建
一.简介 "A highly-available key value store for shared configuration and service discovery." ...
- 一键部署ETCD集群脚本
这里使用三个节点,系统版本为CentOS7 # vim deploy-etcd.sh #!/bin/bash set -x set -e #更改这里的IP, 只支持部署3个节点etcd集群 decla ...
- Docker 搭建 etcd 集群
阅读目录: 主机安装 集群搭建 API 操作 API 说明和 etcdctl 命令说明 etcd 是 CoreOS 团队发起的一个开源项目(Go 语言,其实很多这类项目都是 Go 语言实现的,只能说很 ...
- etcd集群部署
etcd是用于共享配置和服务发现的分布式KV存储系统,随着CoreOS和Kubernetes等项目在开源社区日益火热,它们都用到了etcd组件作为一个高可用.强一致性的服务发现存储仓库.操作系统版本: ...
- Docker下ETCD集群搭建
搭建集群之前首先准备两台安装了CentOS 7的主机,并在其上安装好Docker. Master 10.100.97.46 Node 10.100.97.64 ETCD集群搭建有三种方式,分别是Sta ...
- kubernetes 集群安装etcd集群,带证书
install etcd 准备证书 https://www.kubernetes.org.cn/3096.html 在master1需要安装CFSSL工具,这将会用来建立 TLS certificat ...
- centos下etcd集群安装
先仔细了解学习etcd 官方: https://github.com/etcd-io/etcd https://www.cnblogs.com/softidea/p/6517959.html http ...
- 灵雀云:etcd 集群运维实践
[编者的话]etcd 是 Kubernetes 集群的数据核心,最严重的情况是,当 etcd 出问题彻底无法恢复的时候,解决问题的办法可能只有重新搭建一个环境.因此围绕 etcd 相关的运维知识就比较 ...
- Kubernetes集群搭建之Etcd集群配置篇
介绍 etcd 是一个分布式一致性k-v存储系统,可用于服务注册发现与共享配置,具有以下优点. 简单 : 相比于晦涩难懂的paxos算法,etcd基于相对简单且易实现的raft算法实现一致性,并通过g ...
随机推荐
- [UE4]虚幻UE4 .uproject文件无关联 右键菜单少了
前一段时间因为一些事,重装系统 然后重新安装UE4跟VS ,突然发现...竟然之前的UE4原先的项目找不到了,然后用UE4打开就提示 “该文件没有与之关联的程序来执行该操作,请先安装一个程序... ...
- MySQL 之管理脚本
Mysql中查看每个IP的连接数 ) as ip , count(*) from information_schema.processlist group by ip;
- Postgres 主从复制搭建步骤
系统版本: CentOS Linux release 7.5.1804 (Core) 数据库 psql (PostgreSQL) 10.5 2台机器ip : 172.17.0.3 /172.17.0. ...
- hive spark版本对应关系
查看hive source下面的pom.xml,可以找到官方默认发布的hive版本对应的spark版本,在实际部署的时候,最好按照这个版本关系来,这样出现兼容问题的概率相对较小. 下面面列出一部分对应 ...
- python-简单的sqlite3使用
# 导入SQLite驱动: >>> import sqlite3 # 连接到SQLite数据库 # 数据库文件是test.db # 如果文件不存在,会自动在当前目录创建: >& ...
- crawlspider 多分页处理
提问:如果想要通过爬虫程序去爬取”糗百“全站数据新闻数据的话,有几种实现方法? 方法一:基于Scrapy框架中的Spider的递归爬取进行实现(Request模块递归回调parse方法). 方法二:基 ...
- 爬虫概念 requests模块
requests模块 - 基于如下5点展开requests模块的学习 什么是requests模块 requests模块是python中原生的基于网络请求的模块,其主要作用是用来模拟浏览器发起请求.功能 ...
- [UGUI]图文混排(五):添加下划线
0.下划线标签 标签格式:<material=underline c=#ffffff h=1 n=*** p=***>blablabla...</material> mater ...
- Spring MVC 学习笔记11 —— 后端返回json格式数据
Spring MVC 学习笔记11 -- 后端返回json格式数据 我们常常听说json数据,首先,什么是json数据,总结起来,有以下几点: 1. JSON的全称是"JavaScript ...
- Spring MVC 学习笔记8 —— 实现简单的用户管理(4)用户登录
Spring MVC 学习笔记8 -- 实现简单的用户管理(4)用户登录 增删改查,login 1. login.jsp,写在外面,及跟WEB-INF同一级目录,如:ls Webcontent; &g ...