• Issues:

A customer reported that the real-time data feed is interrupted frequently.

  • Root Cause:

Some Storm workers spent too much time in Full GC, which caused Nimbus to reset those workers, so data went missing.

  • Steps to investigate this issue

1. Check the workers in the Storm UI. All workers appeared normal, but some tasks had failed.

Check the logs of the Storm workers. They contained exceptions such as the following:

2018-09-26 07:35:55.081 FlowKafkaReadSpout getDataThread-8 [INFO] partition:8,offset:676160000,key:2018-09-26 07:35:53_15067,valueLength:1755
2018-09-26 07:36:00.490 o.a.s.m.n.StormServerHandler Netty-server-localhost-6700-worker-1 [ERROR] server errors in handling the request
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-1.1.1.jar:1.1.1]
at org.apache.storm.shade.org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-1.1.1.jar:1.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_80]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_80]
at java.lang.Thread.run(Thread.java:745) [?:1.7.0_80]
2018-09-26 07:36:00.490 o.a.s.m.n.StormClientHandler client-worker-1 [INFO] Connection to ip-10-9-248-74.us-west-2.compute.internal/10.9.248.74:6700 failed:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.7.0_80]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[?:1.7.0_80]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.7.0_80]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384) ~[?:1.7.0_80]

It seems that some workers could not connect to each other. The same "Connection reset by peer" exception also appeared on connections to the ZooKeeper cluster and the Kafka cluster. Checking the port usage showed:

tcp6       0      0 10.9.248.61:38050       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:38066       10.9.248.70:9092        TIME_WAIT
tcp6       0      0 10.9.248.61:39160       10.9.248.97:2181        TIME_WAIT
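
When there are many such lines, it can help to group them by remote endpoint to see which service (Kafka on 9092, ZooKeeper on 2181) accumulates the most closed connections. The following is a minimal sketch, assuming the six-column `netstat` layout shown above; the function name `count_by_remote` and the sample constant are hypothetical and were not part of the original investigation.

```python
from collections import Counter

# Sample lines copied from the netstat output above.
NETSTAT_SAMPLE = """\
tcp6       0      0 10.9.248.61:38050       10.9.248.70:9092        TIME_WAIT
tcp6 0 0 10.9.248.61:38066 10.9.248.70:9092 TIME_WAIT
tcp6 0 0 10.9.248.61:39160 10.9.248.97:2181 TIME_WAIT
"""


def count_by_remote(netstat_text, state="TIME_WAIT"):
    """Count connections in the given TCP state, keyed by remote ip:port."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state
        if len(fields) >= 6 and fields[5] == state:
            counts[fields[4]] += 1
    return counts


if __name__ == "__main__":
    # Print the remote endpoints, busiest first.
    for remote, n in count_by_remote(NETSTAT_SAMPLE).most_common():
        print(remote, n)
```

In practice the text would come from the real `netstat` output on the worker host rather than from the embedded sample.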

We restarted all Storm workers, the Kafka cluster, and the ZooKeeper cluster, but the issue was still not fixed.

We checked the Storm workers again and found further problems on other workers, as shown below:

2018-09-26 06:34:34.834 STDERR Thread-2 [INFO] 740.606: [Full GC [PSYoungGen: 1298054K->370618K(6126592K)] [ParOldGen: 5032811K->5122788K(6748672K)] 6330866K->5493406K(12875264K) [PSPermGen: 55526K->55525K(524288K)], 6.4880090 secs] [Times: user=100.76 sys=0.00, real=6.49 secs]
2018-09-26 06:34:34.834 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9008ms for sessionid 0xb65a69ab5380782, closing socket connection and attempting reconnect
2018-09-26 06:34:34.840 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 10147ms for sessionid 0xa5beb7fcf46ff88, closing socket connection and attempting reconnect
2018-09-26 06:34:34.835 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Client session timed out, have not heard from server in 9398ms for sessionid 0xa5beb7fcf46ff82, closing socket connection and attempting reconnect
2018-09-26 06:34:34.935 o.a.s.s.o.a.c.f.s.ConnectionStateManager Thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.941 o.a.c.f.s.ConnectionStateManager Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.942 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: SUSPENDED
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry executor-heartbeat-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-active-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.u.StormBoundedExponentialBackoffRetry refresh-connections-timer [WARN] WILL SLEEP FOR 1001ms (NOT MAX)
2018-09-26 06:34:34.943 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :disconnected::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.182 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, initiating session
2018-09-26 06:34:35.183 o.a.s.s.o.a.z.ClientCnxn Thread-23-$mastercoord-bg1-executor[2 2]-SendThread(ip-10-9-248-98.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-98.us-west-2.compute.internal/10.9.248.98:2181, sessionid = 0xb65a69ab5380782, negotiated timeout = 10000
2018-09-26 06:34:35.183 o.a.s.s.o.a.c.f.s.ConnectionStateManager Thread-23-$mastercoord-bg1-executor[2 2]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.787 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff82 has expired, closing socket connection
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.789 o.a.s.c.zookeeper-state-factory main-EventThread [WARN] Received event :expired::none: with disconnected Writer Zookeeper.
2018-09-26 06:34:35.789 o.a.s.s.o.a.c.ConnectionState main-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.789 o.a.s.s.o.a.z.ZooKeeper main-EventThread [INFO] Initiating client connection, connectString=ec2-52-27-163-101.us-west-2.compute.amazonaws.com:2181,ec2-52-27-236-22.us-west-2.compute.amazonaws.com:2181,ec2-52-24-149-36.us-west-2.compute.amazonaws.com:2181/storm111 sessionTimeout=90000 watcher=org.apache.storm.shade.org.apache.curator.ConnectionState@383fbe82
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-EventThread [INFO] EventThread shut down
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.802 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.804 o.a.s.s.o.a.z.ClientCnxn main-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078c, negotiated timeout = 10000
2018-09-26 06:34:35.805 o.a.s.s.o.a.c.f.s.ConnectionStateManager main-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:35.935 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:35.935 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:35.937 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Unable to reconnect to ZooKeeper service, session 0xa5beb7fcf46ff88 has expired, closing socket connection
2018-09-26 06:34:35.937 o.a.c.f.s.ConnectionStateManager Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: LOST
2018-09-26 06:34:35.937 o.a.c.ConnectionState Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [WARN] Session expired event received
2018-09-26 06:34:35.938 o.a.z.ZooKeeper Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] Initiating client connection, connectString=zookeeper-prod-1.compass-calix.com:2181,zookeeper-prod-2.compass-calix.com:2181,zookeeper-prod-3.compass-calix.com:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@5b51a389
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Opening socket connection to server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] EventThread shut down
2018-09-26 06:34:36.177 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Socket connection established to ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, initiating session
2018-09-26 06:34:36.179 o.a.z.ClientCnxn Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-SendThread(ip-10-9-248-97.us-west-2.compute.internal:2181) [INFO] Session establishment complete on server ip-10-9-248-97.us-west-2.compute.internal/10.9.248.97:2181, sessionid = 0xb65a69ab538078d, negotiated timeout = 10000
2018-09-26 06:34:36.180 o.a.c.f.s.ConnectionStateManager Thread-279-spout-DataKafkaSpout1537942916091-executor[824 824]-EventThread [INFO] State change: RECONNECTED
2018-09-26 06:34:36.217 c.c.s.r.z.ZNodeTreeListener Curator-TreeCache-0 [INFO] Listen: Add path:/realtime/subscriptions/location/1127582/1033 , timestamp is:

  From the logs we found that a worker's Full GC sometimes takes more than 30 s, while the topology is configured with "topology.message.timeout.secs=30". In the excerpt above, a 6.49 s Full GC already coincides with ZooKeeper client session timeouts (negotiated timeout = 10000). When a Full GC runs for more than 30 s, the other workers cannot get a response from this worker and Nimbus disconnects it, so data is lost.
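
To find the long pauses without reading the whole worker log by hand, the Full GC lines can be filtered with a short script. This is only a sketch: it assumes the JDK 7 Parallel GC log format seen in the excerpt above (other collectors and JVM versions print different lines), and the names `parse_full_gc` and `long_pauses` as well as the threshold values are hypothetical.

```python
import re

# One Full GC line copied from the worker log above.
SAMPLE_LINE = (
    "2018-09-26 06:34:34.834 STDERR Thread-2 [INFO] 740.606: "
    "[Full GC [PSYoungGen: 1298054K->370618K(6126592K)] "
    "[ParOldGen: 5032811K->5122788K(6748672K)] "
    "6330866K->5493406K(12875264K) "
    "[PSPermGen: 55526K->55525K(524288K)], 6.4880090 secs] "
    "[Times: user=100.76 sys=0.00, real=6.49 secs]"
)

# Captures old-generation usage before/after, its capacity, and wall-clock time.
FULL_GC = re.compile(
    r"\[Full GC .*?ParOldGen: (\d+)K->(\d+)K\((\d+)K\).*?real=([\d.]+) secs\]"
)


def parse_full_gc(line):
    """Return pause time and old-gen usage for a Full GC line, else None."""
    m = FULL_GC.search(line)
    if m is None:
        return None
    before_k, after_k, capacity_k = (int(g) for g in m.group(1, 2, 3))
    return {
        "pause_secs": float(m.group(4)),
        "old_gen_before_k": before_k,
        "old_gen_after_k": after_k,
        # Fraction of the old generation still in use after the collection.
        "old_gen_used_after": after_k / capacity_k,
    }


def long_pauses(lines, limit_secs):
    """Return parsed Full GC events whose wall-clock pause exceeds limit_secs."""
    events = (parse_full_gc(line) for line in lines)
    return [e for e in events if e is not None and e["pause_secs"] > limit_secs]


if __name__ == "__main__":
    # 30 s mirrors topology.message.timeout.secs; 5 s is an arbitrary lower bound.
    for limit in (30.0, 5.0):
        print(limit, long_pauses([SAMPLE_LINE], limit))
```

For the sample line, the old generation holds more data after the collection than before it, which suggests the Full GC is reclaiming almost nothing from the old generation and is likely to run again soon.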
