Short Description:

This article covers the basic health checks to perform when troubleshooting slow ZooKeeper performance.

Article

ZooKeeper is one of the most critical components in an HDP cluster, yet it usually gets the least attention when tuning a cluster for performance or troubleshooting slowness. Here is a basic health checklist to go through to ensure that ZooKeeper is running well.

Let's keep the zookeeper happy to be able to better manage the occupants of the zoo :)

1. Are all the ZooKeeper servers given dedicated disks for the transaction log directory ('dataDir' / 'dataLogDir')?

ZooKeeper writes every new transaction to the log and completes an 'fsync' before the update takes place and before a response is sent back to the client, so fast disks are very important. Slow 'fsync' on the transaction log is one of the most common causes of slow ZooKeeper responses. ZooKeeper's disk space requirement is usually modest, and one might wonder whether it is worth dedicating a whole disk to the transaction log directory, but a dedicated disk is needed to prevent I/O from other applications and processes from keeping the disk busy.
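One way to verify this is to compare the filesystems behind the two directories. The sketch below is a minimal illustration, assuming a plain key=value 'zoo.cfg'. The helper names 'cfg_value' and 'check_txnlog_disk' are hypothetical (they are not part of ZooKeeper), and the config path in the usage comment is an assumption. Two directories on different filesystems can still share one physical disk, so a SEPARATE result is only a hint.

```shell
# Print the value of a key from a zoo.cfg-style properties file.
cfg_value() {  # usage: cfg_value <key> <file>
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2" | tail -n 1
}

# Report whether dataDir and dataLogDir resolve to different filesystems.
check_txnlog_disk() {  # usage: check_txnlog_disk <zoo.cfg>
  data_dir=$(cfg_value dataDir "$1")
  log_dir=$(cfg_value dataLogDir "$1")
  # When dataLogDir is unset, ZooKeeper writes transaction logs to dataDir.
  [ -n "$log_dir" ] || log_dir=$data_dir
  data_fs=$(df -P "$data_dir" | awk 'NR==2 {print $1}')
  log_fs=$(df -P "$log_dir" | awk 'NR==2 {print $1}')
  if [ "$data_fs" = "$log_fs" ]; then
    echo "SHARED: $log_dir and $data_dir are both on $data_fs"
  else
    echo "SEPARATE: transaction log on $log_fs, snapshots on $data_fs"
  fi
}

# Usage (path is an assumption; adjust to your installation):
#   check_txnlog_disk /etc/zookeeper/conf/zoo.cfg
```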

Common symptoms of slow writes to the transaction log are:

  • Services such as NameNode ZKFC and HBase RegionServers, which use ephemeral znodes to track their liveness, shut down after repeated ZooKeeper server connection timeouts.
  • The ZooKeeper server log frequently reports warnings such as:

WARN [SyncThread:2:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:2 took 7050ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
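To see how often and how badly this happens, the durations can be pulled out of the server log. This is a minimal sketch that assumes the warning text matches the sample above. The function name 'fsync_durations_ms' is hypothetical, and the log path in the usage comment is an assumption.

```shell
# Print the fsync duration (in ms) from each slow-fsync warning read on stdin.
fsync_durations_ms() {
  grep 'fsync-ing the write ahead log' \
    | sed -n 's/.* took \([0-9][0-9]*\)ms.*/\1/p'
}

# Usage: show the five slowest fsyncs (log path is an assumption):
#   fsync_durations_ms < /var/log/zookeeper/zookeeper.log | sort -n | tail -5
```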

2. Is the ZooKeeper process given enough heap memory for the number of znodes, clients and watchers connecting to it?

To arrive at the right ZooKeeper heap size, run load tests and estimate the heap required. With too little memory, ZooKeeper's performance suffers because it goes through very frequent GC cycles once heap usage approaches 100% of the allocation. The following four-letter ZooKeeper commands provide a lot of useful information about the running ZooKeeper instances:

  1. # echo 'stat' | nc <ZK_HOST> 2181
  2. # echo 'mntr' | nc <ZK_HOST> 2181
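The 'mntr' command prints one metric per line as a name followed by a value, which makes it easy to filter. The sketch below pulls a handful of metrics out of a saved dump. The function names 'mntr_value' and 'mntr_summary' are hypothetical, and the metric names are the ones reported by ZooKeeper 3.4-era servers, so other versions may differ.

```shell
# Print the value of one metric from `mntr` output read on stdin.
mntr_value() {  # usage: mntr_value <metric_name>
  awk -v key="$1" '$1 == key { print $2 }'
}

# Summarise a few load-related metrics from a saved `mntr` dump.
mntr_summary() {  # usage: mntr_summary <file>
  for key in zk_znode_count zk_watch_count zk_num_alive_connections \
             zk_avg_latency zk_max_latency zk_outstanding_requests; do
    printf '%s=%s\n' "$key" "$(mntr_value "$key" < "$1")"
  done
}

# Usage:
#   echo 'mntr' | nc <ZK_HOST> 2181 > /tmp/mntr.out && mntr_summary /tmp/mntr.out
```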

In the output of the above commands, watch the values for stats such as znode count, number of watches, number of client connections and max/avg latency, among others. In most cases a heap size between 2GB and 4GB should be adequate, but as mentioned above, this depends on the kind of load on ZooKeeper. In addition to the 'four letter' commands, it is also recommended to keep an eye on heap growth and GC activity, especially during periods of slowness, using tools such as:

  1. # sudo -u zookeeper jmap -heap <ZK_PID>
  2. # sudo -u zookeeper jstat -gcutil <ZK_PID> 2000 10
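A quick way to spot heap pressure in the 'jstat -gcutil' samples is to flag rows where the old generation is nearly full. The sketch below is illustrative only: the function name 'old_gen_above' and the 90% threshold in the usage comment are assumptions. It finds the "O" column from the header line, because the column layout differs between Java versions.

```shell
# Read `jstat -gcutil` output on stdin and print the samples whose
# old-generation occupancy (column "O") is at or above the given percentage.
old_gen_above() {  # usage: old_gen_above <threshold_percent>
  awk -v limit="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "O") col = i; next }
    col && $col + 0 >= limit + 0 { print "old gen at " $col "%" }
  '
}

# Usage (threshold is an arbitrary example value):
#   sudo -u zookeeper jstat -gcutil <ZK_PID> 2000 10 | old_gen_above 90
```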

3. Are there too many ZooKeeper servers in the ensemble?

Three ZooKeeper servers is the minimum recommended size for an ensemble, and in most cases three are enough. A larger ensemble gives more reliability (a 7-node ensemble can withstand the loss of 3 nodes, compared to a tolerance of 1 node loss in a 3-node ensemble) and better read throughput when a large number of concurrent clients are connected, but it can slow down write operations, since every update/write must be acknowledged by a majority (more than half) of the nodes in the ensemble.
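The trade-off follows from simple majority arithmetic: an ensemble of n voting servers needs floor(n/2) + 1 acknowledgements per write and can lose floor((n-1)/2) servers. The helper names below are hypothetical and only illustrate the calculation.

```shell
# Number of servers that must acknowledge a write in an n-server ensemble.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# Number of servers that can fail while a majority is still available.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 3 5 7; do
  echo "ensemble=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
# ensemble=3 quorum=2 tolerated_failures=1
# ensemble=5 quorum=3 tolerated_failures=2
# ensemble=7 quorum=4 tolerated_failures=3
```

Each step up in size adds one more acknowledgement to every write, which is why larger ensembles trade write latency for fault tolerance.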

Some alternatives to avoid the slower writes of larger ensembles are:

  1. Use a dedicated ZooKeeper ensemble for certain workloads in the cluster
  2. For larger ensembles, use ZooKeeper observers - Ref. http://zookeeper.apache.org/doc/trunk/zookeeperObservers.html (although configuring ZooKeeper observers is not supported in the current Ambari version as of this writing).

4. Are the 'dataDir' / 'dataLogDir' filling up too fast?

As mentioned above, every ZooKeeper transaction is written to the transaction log file. When a large number of concurrent ZK clients connect continuously and make very frequent updates, possibly due to an error condition at the client, the transaction logs can roll over multiple times a minute because of their steadily increasing size, which also produces a large number of snapshot files. This can in turn cause the disks to run out of free space.

For such issues, identify and fix the client application. Review the stats from above along with the ZooKeeper logs and/or the latest transaction log to find the most recent updates to the znodes, using the 'LogFormatter' tool:

  1. # java -cp /usr/hdp/current/zookeeper-server/*:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.LogFormatter /hadoop/zookeeper/version-2/log.xxxxx
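To gauge how quickly transaction logs are rolling over, it can help to count the log files written recently. This is a rough sketch: the function name 'recent_txn_logs' is hypothetical, it assumes a 'find' that supports '-maxdepth' and '-mmin' (GNU and BSD versions do), and the directory in the usage comment is taken from the command above.

```shell
# Count transaction log files in a version-2 directory that were modified
# within the last <minutes> minutes.
recent_txn_logs() {  # usage: recent_txn_logs <version-2 dir> <minutes>
  find "$1" -maxdepth 1 -type f -name 'log.*' -mmin "-$2" | wc -l | tr -d ' '
}

# Usage:
#   recent_txn_logs /hadoop/zookeeper/version-2 10
```

A count that keeps climbing between runs suggests a client is generating updates faster than expected.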

Further, the ZooKeeper properties 'autopurge.snapRetainCount' and 'autopurge.purgeInterval' have to be tuned to the required retention count and purge frequency, to limit the growing number of transaction log and snapshot files.
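A quick check of the effective autopurge settings might look like the sketch below. The function name 'autopurge_status' is hypothetical, the config path in the usage comment is an assumption, and the parser assumes plain key=value lines with no spaces around '='. The fallback values are ZooKeeper's documented defaults: a purge interval of 0 hours, which disables purging, and a retain count of 3.

```shell
# Report the autopurge settings found in a zoo.cfg-style file.
autopurge_status() {  # usage: autopurge_status <zoo.cfg>
  interval=$(awk -F= '$1 == "autopurge.purgeInterval" { print $2 }' "$1" \
    | tail -n 1 | tr -d '[:space:]')
  retain=$(awk -F= '$1 == "autopurge.snapRetainCount" { print $2 }' "$1" \
    | tail -n 1 | tr -d '[:space:]')
  # Fall back to ZooKeeper's defaults when a property is not set.
  interval=${interval:-0}
  retain=${retain:-3}
  if [ "$interval" -eq 0 ]; then
    echo "autopurge disabled"
  else
    echo "autopurge every ${interval}h, keeping $retain snapshots"
  fi
}

# Usage (path is an assumption; adjust to your installation):
#   autopurge_status /etc/zookeeper/conf/zoo.cfg
```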
