• Issues:

The customer reported that real-time data was frequently interrupted.

  • Root Cause:

Some Storm workers spent too much time in Full GC, which caused Nimbus to reset those workers, so data went missing.

  • Steps to investigate this issue

1. Check the workers in the Storm UI: all workers appeared normal, but some tasks had failed.

Check the Storm worker logs, which contained exceptions such as the following:

2018-09-26 07:35:55.081 FlowKafkaReadSpout getDataThread-8 [INFO] partition:8,offset:676160000,key:2018-09-26 07:35:53_15067,valueLength:1755
2018-09-26 07:36:00.490 o.a.s.m.n.StormServerHandler Netty-server-localhost-6700-worker-1 [ERROR] server errors in handling the request
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-1.1.1.jar:1.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2018-09-26 07:36:00.490 o.a.s.m.n.StormClientHandler client-worker-1 [INFO] Connection to ip-10-9-248-74.us-west-2.compute.internal/10.9.248.74:6700 failed:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]

It seems that some workers could not connect to each other. Similar "Connection reset by peer" exceptions also appeared on connections to the ZooKeeper cluster and the Kafka cluster. Checking the port usage showed:

tcp6       0      0 10.9.248.61:38050       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:38066       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:39160       10.9.248.97:2181        TIME_WAIT
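
With many connections, a listing like this is easier to read when grouped by remote endpoint and state. The following is a minimal sketch, assuming the text has the six-column layout shown above (protocol, receive queue, send queue, local address, remote address, state). The function name `tally_connections` is hypothetical and not part of any Storm or system tool.

```python
from collections import Counter


def tally_connections(netstat_output: str) -> Counter:
    """Count TCP connections per (remote endpoint, state).

    Assumes each line has the columns:
    proto, recv-q, send-q, local address, remote address, state.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        remote, state = fields[4], fields[5]
        counts[(remote, state)] += 1
    return counts


# The three lines quoted in the text above.
sample = """\
tcp6       0      0 10.9.248.61:38050       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:38066       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:39160       10.9.248.97:2181        TIME_WAIT
"""

for (remote, state), n in tally_connections(sample).most_common():
    print(f"{remote:24} {state:12} {n}")
```

A large TIME_WAIT count toward port 9092 (the Kafka default) or 2181 (ZooKeeper) would suggest that connections to those clusters are being closed and reopened repeatedly.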

We restarted all Storm workers, the Kafka cluster, and the ZooKeeper cluster, but the issue was still not fixed.

We checked the Storm workers again and found that other workers also had problems, as shown below:

2018-09-26 06:34:34.834 STDERR Thread-2 [INFO] 740.606: [Full GC [PSYoungGen: 1298054K->370618K(6126592K)] [ParOldGen: 5032811K->5122788K(6748672K)] 6330866K->5493406K(12875264K) [PSPermGen: 55526K->55525K(524288K)], 6.4880090 secs] [Times: user=100.76 sys=0.00, real=6.49 secs]
2018-09-26 06:34:34.834 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9008ms for sessionid 0xb65a69ab5380782, closing socket connection and attempting reconnect
2018-09-26 06:34:34.840 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 10147ms for sessionid 0xa5beb7fcf46ff88, closing socket connection and attempting reconnect
2018-09-26 06:34:34.835 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9398ms for sessionid 0xa5beb7fcf46ff82, closing socket connection and attempting reconnect
2018-09-26 06:34:34.935 o.a.s.s.o.a.c.f.s.ConnectionStateManager Thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.941 o.a.c.f.s.ConnectionStateManager Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.942 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry executor-heartbeat-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-active-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-connections-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :disconnected::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.182 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, initiating session
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, sessionid = 0xb65a69ab5380782, negotiated timeout = 10000
2018-09-26 06:34:35.183 o.a.s.s.o.a.c.f.s.ConnectionStateManager Thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff82 has expired, closing socket connection
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.789 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :expired::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.ConnectionState main-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ZooKeeper main-EventThread [INFO] Initiating client connection, connectString=ec2-52-27-163-101.us-west-2.compute.amazonaws.com:2181,ec2-52-27-236-22.us-west-2.compute.amazonaws.com:2181,ec2-52-24-149-36.us-west-2.compute.amazonaws.com:2181/storm111 sessionTimeout=90000 watcher=org.apache.storm.shade.org.apache.curator.ConnectionState@383fbe82
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-EventThread [INFO] EventThread shut down
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.804 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078c, negotiated timeout = 10000
2018-09-26 06:34:35.805 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.935 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.935 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.937 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff88 has expired, closing socket connection
2018-09-26 06:34:35.937 o.a.c.f.s.ConnectionStateManager Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.937 o.a.c.ConnectionState Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.938 o.a.z.ZooKeeper Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] Initiating client connection, connectString=zookeeper-prod-1.compass-calix.com:2181,zookeeper-prod-2.compass-calix.com:2181,zookeeper-prod-3.compass-calix.com:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@5b51a389
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] EventThread shut down
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:36.179 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078d, negotiated timeout = 10000
2018-09-26 06:34:36.180 o.a.c.f.s.ConnectionStateManager Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:36.217 c.c.s.r.z.ZNodeTreeListener Curator-TreeCache-0 [INFO] Listen: Add path:/realtime/subscriptions/location/1127582/1033 , timestamp is:
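
In a log this dense, it helps to pull out only the ZooKeeper session events: how long each session went without hearing from the server, and which sessions then expired. The sketch below matches the two message formats that appear in the log above. The function name `summarize_sessions` is hypothetical, and the sample lines are abbreviated excerpts (logger and thread names replaced by "...").

```python
import re

# Message formats taken from the ZooKeeper client lines in the log above.
TIMEOUT_RE = re.compile(
    r"have not heard from server in (?P<ms>\d+)ms for sessionid (?P<sid>0x[0-9a-f]+)"
)
EXPIRED_RE = re.compile(r"session (?P<sid>0x[0-9a-f]+) has expired")


def summarize_sessions(log_text):
    """Return ({session id: silent ms}, {expired session ids})."""
    silent_ms = {}
    expired = set()
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            silent_ms[m.group("sid")] = int(m.group("ms"))
        m = EXPIRED_RE.search(line)
        if m:
            expired.add(m.group("sid"))
    return silent_ms, expired


# Abbreviated excerpts of the log lines above.
sample = """\
2018-09-26 06:34:34.834 ... Client session timed out, have not heard from server in 9008ms for sessionid 0xb65a69ab5380782, closing socket connection and attempting reconnect
2018-09-26 06:34:34.840 ... Client session timed out, have not heard from server in 10147ms for sessionid 0xa5beb7fcf46ff88, closing socket connection and attempting reconnect
2018-09-26 06:34:34.835 ... Client session timed out, have not heard from server in 9398ms for sessionid 0xa5beb7fcf46ff82, closing socket connection and attempting reconnect
2018-09-26 06:34:35.789 ... Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff82 has expired, closing socket connection
2018-09-26 06:34:35.937 ... Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff88 has expired, closing socket connection
"""

silent_ms, expired = summarize_sessions(sample)
for sid, ms in sorted(silent_ms.items()):
    status = "expired" if sid in expired else "reconnected or unknown"
    print(f"{sid}: silent for {ms} ms, {status}")
```

All three sessions timed out within a few milliseconds of the Full GC line, and the negotiated session timeout in the log is 10000 ms, so a stop-the-world pause of several seconds is enough to put sessions at risk.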

The excerpt above shows a 6.49 s Full GC pause followed immediately by ZooKeeper session timeouts. The logs also show that a worker's Full GC sometimes runs for more than 30 s, while the topology is configured with "topology.message.timeout.secs=30". When a Full GC pause exceeds 30 s, the other workers get no response from this worker, and Nimbus disconnects it.
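
To check how often Full GC pauses approach these limits, the pause durations can be extracted from the worker log. This is a minimal sketch, assuming Full GC lines in the Parallel collector format quoted earlier (with `ParOldGen` and a `real=... secs` field). The names `full_gc_pauses` and `pauses_over` are hypothetical helpers, not Storm or JVM APIs.

```python
import re

# Matches Full GC lines in the format quoted in the worker log above.
FULL_GC_RE = re.compile(
    r"\[Full GC.*?\[ParOldGen: (?P<before>\d+)K->(?P<after>\d+)K\((?P<cap>\d+)K\)\]"
    r".*?real=(?P<real>\d+(?:\.\d+)?) secs\]"
)


def full_gc_pauses(log_text):
    """Yield (pause seconds, old-gen fraction still used after GC) per Full GC line."""
    for line in log_text.splitlines():
        m = FULL_GC_RE.search(line)
        if m:
            used_after = int(m.group("after")) / int(m.group("cap"))
            yield float(m.group("real")), used_after


def pauses_over(log_text, limit_secs):
    """Return the Full GC pauses (in seconds) that lasted at least limit_secs."""
    return [pause for pause, _ in full_gc_pauses(log_text) if pause >= limit_secs]


# The Full GC line quoted in the worker log above.
sample = (
    "2018-09-26 06:34:34.834 STDERR Thread-2 [INFO] 740.606: [Full GC "
    "[PSYoungGen: 1298054K->370618K(6126592K)] "
    "[ParOldGen: 5032811K->5122788K(6748672K)] 6330866K->5493406K(12875264K) "
    "[PSPermGen: 55526K->55525K(524288K)], 6.4880090 secs] "
    "[Times: user=100.76 sys=0.00, real=6.49 secs]"
)

for pause, used_after in full_gc_pauses(sample):
    print(f"Full GC pause {pause:.2f} s, old gen {used_after:.0%} full afterwards")

# 30 s is the topology.message.timeout.secs value from the text.
print("pauses >= 30 s:", pauses_over(sample, 30.0))
```

The single line quoted in this write-up is a 6.49 s pause; the longer pauses described above would appear as further lines in the same format. The old generation being about three quarters full right after a Full GC suggests that the collection reclaimed little memory, which is consistent with frequent repeated Full GCs.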
