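[Original] Dashu's Troubleshooting Notes (1): HBase RegionServers Frequently Going Down
Recently, many region servers in our HBase cluster have been going down. The log of one of them, RegionServer1, shows that when it died at 17:17:14 the server was under heavy load: there were many responseTooSlow warnings and quite a few GCs. However, plenty of memory was still free at that point, so the process was not killed because of an OOM.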
2018-03-13T17:17:13.372+0800: [GC (Allocation Failure) 2018-03-13T17:17:13.372+0800: [ParNew: 3280066K->256481K(3762880K), 0.0429718 secs] 23536952K->20513367K(32801920K), 0.0432932 secs] [Times: user=1.92 sys=0.01, real=0.04 secs]
Heap
par new generation total 3762880K, used 1416770K [0x00007f2f4c000000, 0x00007f305f990000, 0x00007f305f990000)
eden space 3010368K, 38% used [0x00007f2f4c000000, 0x00007f2f92d182c0, 0x00007f3003bd0000)
from space 752512K, 34% used [0x00007f3003bd0000, 0x00007f30136487c0, 0x00007f3031ab0000)
to space 752512K, 0% used [0x00007f3031ab0000, 0x00007f3031ab0000, 0x00007f305f990000)
concurrent mark-sweep generation total 29039040K, used 19905973K [0x00007f305f990000, 0x00007f374c000000, 0x00007f374c000000)
Metaspace used 49530K, capacity 50072K, committed 55492K, reserved 57344K
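The region server went down because, during tryRegionServerReport, the master rejected its report: the server had already been marked as a dead server, so a YouAreDeadException was thrown. HRegionServer.run then exits, which means the region server process exits.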
2018-03-13 17:17:08,159 FATAL [regionserver/RegionServer1/RegionServer1:16020] regionserver.HRegionServer: ABORTING region server RegionServer1,16020,1519805863844: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing RegionServer1,16020,1519805863844 as dead server
org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing RegionServer1,16020,1519805863844 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:434)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:339)
at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:339)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)
at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1153)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:962)
The server was marked as dead because its ZooKeeper session had expired:
2018-03-13 17:17:08,128 INFO [regionserver/RegionServer1/RegionServer1:16020-SendThread(ZKServer1:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 41499ms for sessionid 0x161d1433ba86c84, closing socket connection and attempting reconnect
2018-03-13 17:17:08,445 FATAL [main-EventThread] regionserver.HRegionServer: ABORTING region server RegionServer1,16020,1519805863844: regionserver:16020-0x35fe18af217e685, quorum=ZKServer1:2181, baseZNode=/hbase regionserver:16020-0x35fe18af217e685 received expired from ZooKeeper, aborting
The 41499 ms timeout reported at 17:17:08 was caused by a full GC that started at 17:16:38 and paused the JVM for about 29.8 s:
2018-03-13T17:16:38.268+0800: [GC (Allocation Failure)
2018-03-13T17:16:38.268+0800: [ParNew (promotion failed): 3762880K->3762880K(3762880K), 3.2398750 secs]
2018-03-13T17:16:41.508+0800: [CMS2018-03-13T17:16:47.078+0800: [CMS-concurrent-sweep: 10.829/17.135 secs] [Times: user=161.97 sys=2.65, real=17.13 secs]
(concurrent mode failure): 28187725K->21036514K(29039040K), 26.5884371 secs] 31874835K->21036514K(32801920K), [Metaspace: 49335K->49335K(57344K)], 29.8297421 secs] [Times: user=40.45 sys=0.14, real=29.82 secs]
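The timing can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the original investigation: the class and method names are made up for illustration, and the timestamps are copied from the log lines above (with the comma in the log4j timestamp written as a dot so that `LocalTime.parse` accepts it).

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper for lining up the GC log with the ZooKeeper client log.
class GcPauseTimeline {
    // Returns the wall-clock time at which a stop-the-world pause ends.
    static LocalTime pauseEnd(LocalTime start, double pauseSeconds) {
        return start.plus(Duration.ofMillis(Math.round(pauseSeconds * 1000)));
    }

    public static void main(String[] args) {
        LocalTime gcStart = LocalTime.parse("17:16:38.268");       // start of the full GC
        LocalTime gcEnd = pauseEnd(gcStart, 29.8297421);            // total pause from the GC log
        LocalTime timeoutLogged = LocalTime.parse("17:17:08.128");  // "Client session timed out" line
        // The client reported not hearing from the server for 41499 ms.
        LocalTime lastHeard = timeoutLogged.minus(Duration.ofMillis(41499));

        System.out.println("Last heard from ZK server: " + lastHeard);  // 17:16:26.629
        System.out.println("GC pause ended at:         " + gcEnd);      // 17:17:08.098
        System.out.println("Timeout logged after pause: "
                + Duration.between(gcEnd, timeoutLogged).toMillis() + " ms"); // 30 ms
    }
}
```

Under these assumptions the client last heard from the server about 11.6 s before the GC started, and the timeout was logged roughly 30 ms after the pause ended, which is consistent with the pause being the cause.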
However, hbase-site.xml sets the session timeout to 180 s, so why did the session time out after only 41 s?
<property>
<name>zookeeper.session.timeout</name>
<value>180000</value>
</property>
When the ZooKeeper session was established, the negotiated timeout was 40000 ms, i.e. 40 s. Why is the effective value 40 s when 180 s is configured?
2018-03-13 17:15:11,522 INFO [main-SendThread(ZKServer1:2181)] zookeeper.ClientCnxn: Session establishment complete on server ZKServer1/ZKServer1:2181, sessionid = 0x35fe18af217e685, negotiated timeout = 40000
2018-03-13 17:15:11,526 INFO [regionserver/RegionServer1/RegionServer1:16020-SendThread(ZKServer1:2181)] zookeeper.ClientCnxn: Session establishment complete on server RegionServer1/RegionServer1:2181, sessionid = 0x161d1433ba86c84, negotiated timeout = 40000
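A mismatch like this is easy to miss, so it can help to check the log mechanically. The snippet below is a rough sketch with hypothetical class and method names: it pulls the negotiated timeout out of a ClientCnxn log line with a regular expression and compares it with the value configured in hbase-site.xml (hard-coded here as an assumption).

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker: compares the negotiated ZK timeout in a log line
// with the timeout that was requested in hbase-site.xml.
class NegotiatedTimeoutCheck {
    private static final Pattern NEGOTIATED =
            Pattern.compile("negotiated timeout = (\\d+)");

    // Extracts the negotiated timeout in ms, or empty if the line has none.
    static OptionalInt negotiatedTimeoutMs(String logLine) {
        Matcher m = NEGOTIATED.matcher(logLine);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        int configuredMs = 180000; // zookeeper.session.timeout from hbase-site.xml
        String line = "zookeeper.ClientCnxn: Session establishment complete on server "
                + "ZKServer1/ZKServer1:2181, sessionid = 0x35fe18af217e685, "
                + "negotiated timeout = 40000";

        negotiatedTimeoutMs(line).ifPresent(actualMs -> {
            if (actualMs < configuredMs) {
                System.out.println("Requested " + configuredMs
                        + " ms but the server granted only " + actualMs + " ms");
            }
        });
    }
}
```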
After the connection is established, the ZooKeeper client logs this message:
LOG.info("Session establishment complete on server "
+ clientCnxnSocket.getRemoteSocketAddress()
+ ", sessionid = 0x" + Long.toHexString(sessionId)
+ ", negotiated timeout = " + negotiatedSessionTimeout
+ (isRO ? " (READ-ONLY mode)" : ""));
negotiatedSessionTimeout is passed in through the onConnected callback:
void onConnected(int _negotiatedSessionTimeout, long _sessionId,
byte[] _sessionPasswd, boolean isRO) throws IOException {
The callback is invoked from ClientCnxnSocket, which passes in the value of ConnectResponse.getTimeOut():
ConnectResponse conRsp = new ConnectResponse();
conRsp.deserialize(bbia, "connect");
sendThread.onConnected(conRsp.getTimeOut(), this.sessionId,
conRsp.getPasswd(), isRO);
That value is the sessionTimeout held by ServerCnxn on the server side, which is initialized in ZooKeeperServer:
int minSessionTimeout = getMinSessionTimeout();
if (sessionTimeout < minSessionTimeout) {
sessionTimeout = minSessionTimeout;
}
int maxSessionTimeout = getMaxSessionTimeout();
if (sessionTimeout > maxSessionTimeout) {
sessionTimeout = maxSessionTimeout;
}
As the code shows, the requested timeout is clamped to the range between minSessionTimeout and maxSessionTimeout:
public int getMaxSessionTimeout() {
return maxSessionTimeout == -1 ? tickTime * 20 : maxSessionTimeout;
}
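Because tickTime on the ZooKeeper servers is set to 2000 ms, maxSessionTimeout defaults to 20 × 2000 = 40000 ms, so the 180 s requested by HBase is capped at 40 s. To fix this, maxSessionTimeout has to be raised on the ZooKeeper servers (for example, to at least 180000 in zoo.cfg) so that the timeout configured in hbase-site.xml can actually take effect.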
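The clamping can be reproduced outside ZooKeeper. The following is a minimal standalone sketch that mirrors the snippets above; the class and the `negotiate` helper are hypothetical, and the default lower bound of `tickTime * 2` is an assumption based on the ZooKeeper defaults rather than something shown in the logs.

```java
// Hypothetical standalone model of the server-side session timeout clamping.
class SessionTimeoutNegotiation {
    // -1 means "not configured", in which case the default is derived from tickTime.
    static int effectiveMin(int tickTime, int minSessionTimeout) {
        return minSessionTimeout == -1 ? tickTime * 2 : minSessionTimeout;
    }

    static int effectiveMax(int tickTime, int maxSessionTimeout) {
        return maxSessionTimeout == -1 ? tickTime * 20 : maxSessionTimeout;
    }

    // Clamps the timeout requested by the client to [min, max].
    static int negotiate(int requested, int tickTime,
                         int minSessionTimeout, int maxSessionTimeout) {
        int min = effectiveMin(tickTime, minSessionTimeout);
        int max = effectiveMax(tickTime, maxSessionTimeout);
        return Math.max(min, Math.min(requested, max));
    }

    public static void main(String[] args) {
        // HBase asks for 180 s; with tickTime = 2000 and no explicit maximum it gets 40 s.
        System.out.println(negotiate(180000, 2000, -1, -1));     // 40000
        // With maxSessionTimeout raised to 180 s the request is honored.
        System.out.println(negotiate(180000, 2000, -1, 180000)); // 180000
    }
}
```

The first call reproduces the negotiated timeout of 40000 seen in the region server log; the second shows the effect of the proposed fix.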