转载请注明原创地址:http://www.cnblogs.com/dongxiao-yang/p/7600561.html

一  broker启动后ReplicaFetcherThread OOM

版本:0.8.2.2

错误现象,server启动日志:

WARN [ReplicaFetcherThread-1-21], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 8; ClientId: ReplicaFetcherThread-1-21; ReplicaId: 20; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [topic_a,28] -> PartitionFetchInfo(13372825,400000000),[topic_b-live,19] -> PartitionFetchInfo(2382784277,400000000),[topic_b-user,2] -> PartitionFetchInfo(2230117428,400000000),[topic_b_like,20] -> PartitionFetchInfo(0,400000000),[topic_b-group,32] -> PartitionFetchInfo(0,400000000),[topic_b_live,1] -> PartitionFetchInfo(1352237,400000000),[topic_b-log,1] -> PartitionFetchInfo(12598395616,400000000),[topic_b_microvideo,18] -> PartitionFetchInfo(0,400000000),[topic_b-group,92] -> PartitionFetchInfo(0,400000000),[bl_generic_event_detail,10] -> PartitionFetchInfo(27302121,400000000),[__consumer_offsets,14] -> PartitionFetchInfo(0,400000000),[__consumer_offsets,18] -> PartitionFetchInfo(0,400000000),[topic_b_group,12] -> PartitionFetchInfo(7323,400000000),[topic_b_feed,19] -> PartitionFetchInfo(232378,400000000),[topic_b_nearby,26] -> PartitionFetchInfo(0,400000000),[topic_b_user,20] -> PartitionFetchInfo(265,400000000),[topic_b_banners,2] -> PartitionFetchInfo(0,400000000). Possible cause: java.lang.OutOfMemoryError: Java heap space (kafka.server.ReplicaFetcherThread)
[2017-08-10 11:09:55,310] WARN [ReplicaFetcherThread-3-23], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 5; ClientId: ReplicaFetcherThread-3-23; ReplicaId: 20; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [topic_b_microvideo,8] -> PartitionFetchInfo(37116,400000000),[topic_b-common,12] -> PartitionFetchInfo(5378147776,400000000),[topic_b_user,2] -> PartitionFetchInfo(76267,400000000),[topic_b-group,82] -> PartitionFetchInfo(0,400000000),[bl_generic_event_detail,24] -> PartitionFetchInfo(27457189,400000000),[topic_b-group,22] -> PartitionFetchInfo(0,400000000),[topic_b_like,10] -> PartitionFetchInfo(0,400000000),[topic_b-nearby,28] -> PartitionFetchInfo(471861516,400000000),[topic_b_nearby,16] -> PartitionFetchInfo(28,400000000),[topic_b_feed,9] -> PartitionFetchInfo(79605,400000000),[topic_a,10] -> PartitionFetchInfo(13365304,400000000),[topic_b-live,1] -> PartitionFetchInfo(2381944301,400000000),[topic_b_log,13] -> PartitionFetchInfo(366613,400000000). Possible cause: java.nio.channels.ClosedSelectorException (kafka.server.ReplicaFetcherThread)
[2017-08-10 11:09:55,310] INFO Reconnect due to socket error: java.nio.channels.ClosedChannelException (kafka.consumer.SimpleConsumer)
[2017-08-10 11:09:55,671] ERROR OOME with size 491278751 (kafka.network.BoundedByteBufferReceive)
java.lang.OutOfMemoryError: Java heap space

原因 :replica.fetch.max.bytes参数设置的太大,在集群正常运行的情况下不会有问题,一旦某台broker挂掉重启后需要追的分区数过多,每个需要追数的partition都是申请一块replica.fetch.max.bytes大小的内存造成OOM。

解决办法:减少replica.fetch.max.bytes的值

二 重启后分区完全无法自平衡leader

版本:0.8.2.2

错误现象:1 broker挂掉重启后启动正常,追数据正常,所有副本进入isr后controller不会自动指定为leader。所有服务端日志正常,crontroller不打日志

2 jstack crontroller 线程栈发现delete-topics-thread一直block,另有Controller-xx-to-broker-yy-send-thread block在deleteTopicStopReplicaCallback位置

"Controller-29-to-broker-26-send-thread" prio=10 tid=0x00007f6520041000 nid=0x4d9d waiting on condition [0x00007f649e90e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000005d3831d38> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:871)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1201)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at kafka.utils.Utils$.inLock(Utils.scala:533)
at kafka.controller.TopicDeletionManager.kafka$controller$TopicDeletionManager$$deleteTopicStopReplicaCallback(TopicDeletionManager.scala:371)
at kafka.controller.TopicDeletionManager$$anonfun$startReplicaDeletion$2$$anonfun$apply$3.apply(TopicDeletionManager.scala:338)
at kafka.controller.TopicDeletionManager$$anonfun$startReplicaDeletion$2$$anonfun$apply$3.apply(TopicDeletionManager.scala:338)
at kafka.controller.ControllerBrokerRequestBatch$$anonfun$addStopReplicaRequestForBrokers$2$$anonfun$apply$mcVI$sp$2.apply(ControllerChannelManager.scala:229)
at kafka.controller.ControllerBrokerRequestBatch$$anonfun$addStopReplicaRequestForBrokers$2$$anonfun$apply$mcVI$sp$2.apply(ControllerChannelManager.scala:229)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:160)
- locked <0x000000065d279300> (a java.lang.Object)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

"delete-topics-thread-29" prio=10 tid=0x00007f652006c000 nid=0x4da3 waiting on condition [0x00007f649dbdd000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000065d2800e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:282)
at kafka.controller.ControllerChannelManager.sendRequest(ControllerChannelManager.scala:57)
- locked <0x000000065d279858> (a java.lang.Object)
at kafka.controller.KafkaController.sendRequest(KafkaController.scala:668)
at kafka.controller.ControllerBrokerRequestBatch$$anonfun$sendRequestsToBrokers$3$$anonfun$apply$9.apply(ControllerChannelManager.scala:312)
at kafka.controller.ControllerBrokerRequestBatch$$anonfun$sendRequestsToBrokers$3$$anonfun$apply$9.apply(ControllerChannelManager.scala:309)
at scala.collection.immutable.List.foreach(List.scala:318)
at kafka.controller.ControllerBrokerRequestBatch$$anonfun$sendRequestsToBrokers$3.apply(ControllerChannelManager.scala:309)
at kafka.controller.ControllerBrokerRequestBatch$$anonfun$sendRequestsToBrokers$3.apply(ControllerChannelManager.scala:302)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at kafka.controller.ControllerBrokerRequestBatch.sendRequestsToBrokers(ControllerChannelManager.scala:302)
at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:115)
at kafka.controller.TopicDeletionManager$$anonfun$startReplicaDeletion$2.apply(TopicDeletionManager.scala:337)
at kafka.controller.TopicDeletionManager$$anonfun$startReplicaDeletion$2.apply(TopicDeletionManager.scala:327)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at kafka.controller.TopicDeletionManager.startReplicaDeletion(TopicDeletionManager.scala:327)
at kafka.controller.TopicDeletionManager.kafka$controller$TopicDeletionManager$$onPartitionDeletion(TopicDeletionManager.scala:360)
at kafka.controller.TopicDeletionManager$$anonfun$kafka$controller$TopicDeletionManager$$onTopicDeletion$2.apply(TopicDeletionManager.scala:306)
at kafka.controller.TopicDeletionManager$$anonfun$kafka$controller$TopicDeletionManager$$onTopicDeletion$2.apply(TopicDeletionManager.scala:305)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
at kafka.controller.TopicDeletionManager.kafka$controller$TopicDeletionManager$$onTopicDeletion(TopicDeletionManager.scala:305)
at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1$$anonfun$apply$mcV$sp$4.apply(TopicDeletionManager.scala:424)
at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1$$anonfun$apply$mcV$sp$4.apply(TopicDeletionManager.scala:396)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1.apply$mcV$sp(TopicDeletionManager.scala:396)
at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1.apply(TopicDeletionManager.scala:390)
at kafka.controller.TopicDeletionManager$DeleteTopicsThread$$anonfun$doWork$1.apply(TopicDeletionManager.scala:390)
at kafka.utils.Utils$.inLock(Utils.scala:535)
at kafka.controller.TopicDeletionManager$DeleteTopicsThread.doWork(TopicDeletionManager.scala:390)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

原因:kafka版本bug,原因参考https://issues.apache.org/jira/browse/KAFKA-2046和https://issues.apache.org/jira/browse/KAFKA-2122,controller内部事件处理共用一个与controller.message.queue.size大小相关的队列,如果这个参数配置为一个过小的值的话,会导致controller内部线程互相block。

解决办法:调大controller.message.queue.size或者注销掉(默认值为Int.MaxValue)然后滚动重启所有状态不正常的controller,直到controller落到一个配置正常的broker上

三 WARN [Controller-60-to-broker-60-send-thread], Controller 60's connection to broker kafka60:9092 (id: 60 rack: null) was unsuccessful (kafka.controller.RequestSendThread)

版本 0.10.2.0,1.0.0

现象:

kafka服务端配置好kerberos后,controller持续报无法连接到broker的错误(包括连接自身实例),大概错误如下

[2018-01-25 17:48:41,864] WARN [Controller-60-to-broker-60-send-thread], Controller 60's connection to broker kafka60:9092 (id: 60 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to kafka60:9092 (id: 60 rack: null) failed
at kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:84)
at kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:94)
at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:232)
at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:185)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:184)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2018-01-25 17:48:41,970] WARN [Controller-60-to-broker-60-send-thread], Controller 60's connection to broker kafka60:9092 (id: 60 rack: null) was unsuccessful (kafka.controller.RequestSendThread)

......

原因: 打开kafka-authorizer.log的DEBUG日志会看到具体错误,这个是由于线上jre的环境缺少kerberos认证的算法库导致的,更新jre相关类库即可

[2018-01-25 17:55:31,155] DEBUG Connection with /host disconnected (org.apache.kafka.common.network.Selector)
java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Encryption type AES256 CTS mode with HMAC SHA1-96 is not supported/enabled)]
at org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:250)
at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:71)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:350)
at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
at kafka.network.Processor.poll(SocketServer.scala:494)
at kafka.network.Processor.run(SocketServer.scala:432)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: Failure unspecified at GSS-API level (Mechanism level: Encryption type AES256 CTS mode with HMAC SHA1-96 is not supported/enabled)]
at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:199)
at org.apache.kafka.common.security.authenticator.SaslServerAuthenticator.authenticate(SaslServerAuthenticator.java:235)
... 6 more
Caused by: GSSException: Failure unspecified at GSS-API level (Mechanism level: Encryption type AES256 CTS mode with HMAC SHA1-96 is not supported/enabled)
at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:856)
at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:342)
at sun.security.jgss.GSSContextImpl.acceptSecContext(GSSContextImpl.java:285)
at com.sun.security.sasl.gsskerb.GssKrb5Server.evaluateResponse(GssKrb5Server.java:167)
... 7 more
Caused by: KrbException: Encryption type AES256 CTS mode with HMAC SHA1-96 is not supported/enabled
at sun.security.krb5.EncryptionKey.findKey(EncryptionKey.java:522)
at sun.security.krb5.KrbApReq.authenticate(KrbApReq.java:273)
at sun.security.krb5.KrbApReq.<init>(KrbApReq.java:149)
at sun.security.jgss.krb5.InitSecContextToken.<init>(InitSecContextToken.java:108)
at sun.security.jgss.krb5.Krb5Context.acceptSecContext(Krb5Context.java:829)
... 10 more

四 WARN Map failed (kafka.utils.CoreUtils$)
java.io.IOException: Map failed

WARN Map failed (kafka.utils.CoreUtils$)
java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940)
at kafka.log.AbstractIndex$$anonfun$resize$1.apply(AbstractIndex.scala:114)
at kafka.log.AbstractIndex$$anonfun$resize$1.apply(AbstractIndex.scala:104)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
at kafka.log.AbstractIndex.resize(AbstractIndex.scala:104)
at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply$mcV$sp(AbstractIndex.scala:163)
at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply(AbstractIndex.scala:163)
at kafka.log.AbstractIndex$$anonfun$trimToValidSize$1.apply(AbstractIndex.scala:163)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
at kafka.log.AbstractIndex.trimToValidSize(AbstractIndex.scala:162)
at kafka.log.AbstractIndex.close(AbstractIndex.scala:174)
at kafka.log.LogSegment$$anonfun$close$2.apply$mcV$sp(LogSegment.scala:471)
at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:82)
at kafka.utils.Logging$class.swallowWarn(Logging.scala:96)
at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:46)
at kafka.utils.Logging$class.swallow(Logging.scala:98)
at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:46)
at kafka.log.LogSegment.close(LogSegment.scala:471)
at kafka.log.Log$$anonfun$close$1$$anonfun$apply$mcV$sp$1.apply(Log.scala:573)
at kafka.log.Log$$anonfun$close$1$$anonfun$apply$mcV$sp$1.apply(Log.scala:573)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at kafka.log.Log$$anonfun$close$1.apply$mcV$sp(Log.scala:573)
at kafka.log.Log$$anonfun$close$1.apply(Log.scala:568)
at kafka.log.Log$$anonfun$close$1.apply(Log.scala:568)
at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
at kafka.log.Log.close(Log.scala:568)
at kafka.log.LogManager$$anonfun$shutdown$4$$anonfun$10$$anonfun$apply$3.apply$mcV$sp(LogManager.scala:403)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:61)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
... 35 more

解决方案:

确保jvm是64位的以后

echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p

参考 https://jira.apache.org/jira/browse/KAFKA-5962

kafka相关patch

源码版本 1.0

(1)KAFKA-6185  Selector memory leak with high likelihood of OOM in case of down conversion

patch 4193

(2)KAFKA-6362 auto commit not work since coordinatorUnknown() is always true.

patch 4349,4326,4641

kafka 部分问题处理记录的更多相关文章

  1. Kafka学习(学习过程记录)

    Apache kafka 这,仅是我学习过程中记录的笔记.确定了一个待研究的主题,对这个主题进行全方面的剖析.笔记是用来方便我回顾与学习的,欢迎大家与我进行交流沟通,共同成长.不止是技术. Kafka ...

  2. 华为云kafka POC 踩坑记录

    2019/03/08 18:29 最近在进行华为云相关POC验证,个人主要负责华为云DMS kafka相关.大致数据流程是,从DIS取出数据,进行解析处理,然后放入kafka,再从kafka中取出数据 ...

  3. kafka集群搭建记录

    本文记录搭建kafka搭建过程. 一.硬件机器介绍 192.168.183.195 master-node 192.168.183.194 data-node1 192.168.183.196 dat ...

  4. Kafka远程调试简单记录

    Kafka启动脚本: ./kafka-server-start.sh -daemon ../config/server.properties 最终翻阅脚本可以确定是调用kafka-run-class. ...

  5. kafka服务端实验记录

    kafka单机实验: 环境准备: 1.下载kafka,zookeeper,并解压 wget http://mirror.bit.edu.cn/apache/kafka/2.3.0/kafka_2.11 ...

  6. 简单测试flume+kafka+storm的集成

    集成 Flume/kafka/storm 是为了收集日志文件而引入的方法,最终将日志转到storm中进行分析.storm的分析方法见后面文章,这里只讨论集成方法. 以下为具体步骤及测试方法: 1.分别 ...

  7. Kafka实战系列--Kafka API使用体验

    前言: kafka是linkedin开源的消息队列, 淘宝的metaq就是基于kafka而研发. 而消息队列作为一个分布式组件, 在服务解耦/异步化, 扮演非常重要的角色. 本系列主要研究kafka的 ...

  8. Flume+Kafka+Strom基于伪分布式环境的结合使用

    目录: 一.Flume.Kafka.Storm是什么,如何安装? 二.Flume.Kafka.Storm如何结合使用? 1) 原理是什么? 2) Flume和Kafka的整合  3) Kafka和St ...

  9. storm集成kafka

    kafkautil: import java.util.Properties; import kafka.javaapi.producer.Producer; import kafka.produce ...

随机推荐

  1. NSOperation的并发与非并发

    NSoperation也是多线程的一种,NSopertaion有2种形式  (1) 并发执行       并发执行你需要重载如下4个方法     //执行任务主函数,线程运行的入口函数    - (v ...

  2. 狗日的Javascript中的闭包

    前面的话: 闭包,是 javascript 中重要的一个概念,对于初学者来讲,闭包是一个特别抽象的概念,特别是ECMA规范给的定义,如果没有实战经验,你很难从定义去理解它.下面是作者从作用域链慢慢讲到 ...

  3. NEXUS7 学习

    一.编译环境搭建 (更细节的环境搭建请参考:How to Build CyanogenMod for Nexus 7 (Wi-Fi, 2012 version) (codename: grouper) ...

  4. Asp.Net生命周期和Http管道技术

    本篇主要介绍一下内容: 1.ASP.NET生命周期 2.Http运行时 3.Http管道技术 a)inetinfo.exe b)asp.net_isapi.dll c)aspnet_wp.exe d) ...

  5. 织梦默认分页样式改动 解决分页列表显示,去掉li

    近期装了个织梦dedecmsV5.7版本号时,调用分页显示出现的结果出现好几行,怎么也不能在一排显示,找了非常多资料,才了解到是由织梦模板中分页加了<Li>列表标签,解决有两种方法,以下将 ...

  6. u-boot修改出错的问题

    一.u-boot找不到nandflash 我用的开发板是fl2440的,里面没有norflash只有nandflash.我用了s3c2410的u-boot改的,所以要把内核从norflash启动改成适 ...

  7. JavaScript面向对象总结

    对象(Object)应该算是js中最为重要的部分,也是js中非常难懂晦涩的一部分.更是面试以及框架设计中各出没.本文章,主要参考JavaScript红宝书(JavaScript高级程序设计 第六章)以 ...

  8. 实现一个Cglib代理Demo

    Cglib动态代理采用的是创建目标类的子类的方式.优点:不用实现额外接口,只操作我们关心类,高性能. package jesse.test; import java.lang.reflect.Meth ...

  9. Solr删除数据

    步骤: 1.在Solr客户端左下方 Core Selector 中点选想要删除数据的索引库 2.点选Documents 3.右侧Document Type中点选XML 4.Document(s)中输入 ...

  10. 蓝鲸安装Agent

    1. APPO 所在机器(在 app 运行所在机器) 必须能通过 ssh 登陆到 Agent 机器2. Agent 所在机器可以访问到 zk 的端口3. 支持 Linux/Windows/AIX 操作 ...