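After a Kudu tserver goes offline, is migrated, or has its hostname changed, the old tserver keeps showing up in the dead state, and the tserver logs fill with connection-retry warnings. The error logs can grow to several GB per day: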

W0322 22:13:59.202749 16927 tablet_service.cc:290] Invalid argument: UpdateConsensus: Wrong destination UUID requested. Local UUID: e2f80a1fcf0c47f6b7f220a44d69297f. Requested UUID: 45bfb5b3e3ff41d9b1b1d2afab78d65c: from {username='kudu'} at 192.168.0.1:34724: tablet_id: "9933f18e59554ae6b5354e2a948469e9" caller_uuid: "9b164f37d04a484c8634ea86eae1b048" caller_term: 3 preceding_id { term: 2 index: 1873 } ops { id { term: 3 index: 1874 } timestamp: 6359719759241142272 op_type: NO_OP noop_request { } } dest_uuid: "45bfb5b3e3ff41d9b1b1d2afab78d65c" committed_index: 1874 all_replicated_index: 0 safe_timestamp: 6359719761707556864 last_idx_appended_to_leader: 1874
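To see which stale UUIDs the warnings are aimed at, the log can be summarized by the "Requested UUID" field. The following is a minimal sketch, assuming the warning lines have the same layout as the one above; the function name `count_stale_uuids` is hypothetical.

```shell
# Count "Wrong destination UUID" warnings per requested (stale) UUID.
# Reads tserver log text on stdin and prints "<count> <uuid>" lines,
# most frequent first. Assumes the log layout shown above.
count_stale_uuids() {
  grep 'Wrong destination UUID requested' \
    | sed -n 's/.*Requested UUID: \([0-9a-f]\{32\}\).*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{print $1, $2}'
}
```

It can be fed a log file on stdin, for example `count_stale_uuids < kudu-tserver.WARNING`, where the file name is a placeholder for your own tserver warning log.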

There is no direct command for removing these dead tservers. The official documentation gives the following procedure:

Kudu does not currently have an automated way to remove a tablet server from a cluster permanently. Instead, use the following steps:

  • 1 Ensure the cluster is in good health using ksck. See Checking Cluster Health with ksck.
    •   First confirm that the cluster is healthy (using the ksck command).
  • 2 If the tablet server contains any replicas of tables with replication factor 1, these replicas must be manually moved off the tablet server prior to shutting it down. The kudu tablet change_config move_replica tool can be used for this.
    •   Migrate the replicas off the server being removed. Any data with replication factor 1 must be moved manually before the server is shut down.
  • 3 Shut down the tablet server. After -follower_unavailable_considered_failed_sec, which defaults to 5 minutes, Kudu will begin to re-replicate the tablet server’s replicas to other servers. Wait until the process is finished. Progress can be monitored using ksck.
    •   Once the tserver has been down for more than 5 minutes, its replicas are re-replicated automatically.
  • 4 Once all the copies are complete, ksck will continue to report the tablet server as unavailable. The cluster will otherwise operate fine without the tablet server. To completely remove it from the cluster so ksck shows the cluster as completely healthy, restart the masters. In the case of a single master, this will cause cluster downtime. With multimaster, restart the masters in sequence to avoid cluster downtime.
    •   After all replicas have been migrated, ksck still reports the tserver as unavailable. To get rid of the dead servers completely, restart the masters.

Do not shut down multiple tablet servers at once. To remove multiple tablet servers from the cluster, follow the above instructions for each tablet server, ensuring that the previous tablet server is removed from the cluster and ksck is healthy before shutting down the next.
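The "ksck is healthy before shutting down the next" check can be scripted as a polling loop. This is a minimal sketch, assuming that `kudu cluster ksck` exits with a non-zero status whenever it finds a problem; the function name `wait_for_healthy` and its default attempt count and delay are illustrative choices, not values from the Kudu documentation.

```shell
# Poll `kudu cluster ksck` until it reports a healthy cluster or we give up.
# Arguments: <master_addresses> [max_attempts] [sleep_seconds]
# Assumption: ksck exits non-zero while the cluster is unhealthy.
wait_for_healthy() {
  local masters="$1" attempts="${2:-60}" delay="${3:-30}" i=1
  while [ "$i" -le "$attempts" ]; do
    if kudu cluster ksck "$masters" >/dev/null 2>&1; then
      echo "ksck healthy after $i check(s)"
      return 0
    fi
    # Sleep between checks, but not after the final failed one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "ksck still unhealthy after $attempts check(s)" >&2
  return 1
}
```

For example, `wait_for_healthy master-01:7051 120 30 && echo "safe to continue"`, where `master-01:7051` is a placeholder master address. Because ksck keeps reporting a shut-down tserver as unavailable until the masters are restarted, this check is only expected to pass before a shutdown or after the master restart.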

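Finally, after the masters have been restarted, restart the tservers one at a time, confirming that the cluster is healthy before each restart.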

If errors persist after these steps, a leader replica may have been lost. For example, ksck may report:

Tablet c58cef3f36a846b4bdf58447f77a6bcf of table 'impala::impala.test_kudu' is unavailable: 2 replica(s) not RUNNING
a46f0fd38eba4a5286098ff7fe260eb1: TS unavailable
45bfb5b3e3ff41d9b1b1d2afab78d65c: TS unavailable
9b164f37d04a484c8634ea86eae1b048 (server02:7050): RUNNING [LEADER]
All reported replicas are:
A = a46f0fd38eba4a5286098ff7fe260eb1
B = 45bfb5b3e3ff41d9b1b1d2afab78d65c
C = 9b164f37d04a484c8634ea86eae1b048
The consensus matrix is:
 Config source | Replicas               | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A B C*                 |              |              | Yes
 A             | [config not available] |              |              |
 B             | [config not available] |              |              |
 C             | [config not available] |              |              |

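The surviving replica may have been lagging in replication, so some data may be lost. If you have confirmed that the leader replica cannot be recovered, you can force the remaining available replica to become the leader, which brings the tablet back to a healthy state.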

The remaining replica is not the leader, so the leader replica failed as well. This means the chance of data loss is higher since the remaining replica on tserver-00 may have been lagging.

$ sudo -u kudu kudu remote_replica unsafe_change_config tserver-00:7150 <tablet-id> <tserver-00-uuid>

where <tablet-id> is e822cab6c0584bc0858219d1539a17e6 and <tserver-00-uuid> is the uuid of tserver-00, 638a20403e3e4ae3b55d4d07d920e6de.

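Here <tablet-id> is the unhealthy tablet, tserver-00:7150 is the tserver that hosts the available replica, and <tserver-00-uuid> is that tserver's UUID. This recovers the tablet, at the cost of possibly losing a small amount of data.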

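If many tablets are affected, the following command can be used as a starting point. It only prints the commands, so they can be reviewed before being run: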

$ kudu cluster ksck localhost | grep -e '^Tablet ' | awk '{print $2}' | xargs -I {} echo "sudo -u kudu kudu remote_replica unsafe_change_config tserver-00:7150 {} <tserver-00-uuid>"
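When the surviving replicas are spread over several tservers, a single hard-coded address and UUID will not fit every tablet. The following is a minimal sketch that reads ksck output and, for each tablet reported as unavailable, fills in the address and UUID of the first replica listed as RUNNING. It assumes the ksck output layout shown earlier in this post; the function name `gen_unsafe_change_config` is hypothetical, and taking the first RUNNING replica is a simplification, not a rule from the Kudu documentation.

```shell
# Read ksck output on stdin and print one unsafe_change_config command
# per unavailable tablet, using the first RUNNING replica listed for it.
# The commands are only printed, never executed.
gen_unsafe_change_config() {
  awk '
    # Start of a tablet block: remember the id only for unavailable tablets.
    $1 == "Tablet" { tablet = ($0 ~ /is unavailable/) ? $2 : ""; next }
    # Replica line of the form "<uuid> (<host:port>): RUNNING ...".
    tablet != "" && length($1) == 32 && $1 ~ /^[0-9a-f]+$/ && $3 == "RUNNING" {
      addr = $2
      sub(/^\(/, "", addr)   # drop the leading "("
      sub(/\):$/, "", addr)  # drop the trailing "):"
      printf "sudo -u kudu kudu remote_replica unsafe_change_config %s %s %s\n", addr, tablet, $1
      tablet = ""            # at most one command per tablet
    }
  '
}
```

Usage would look like `kudu cluster ksck localhost | gen_unsafe_change_config`. Review the printed commands before running any of them, since forcing a configuration onto a lagging replica can lose data.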

References:

https://kudu.apache.org/docs/administration.html#tablet_server_decommissioning

https://kudu.apache.org/docs/administration.html#tablet_majority_down_recovery
