【原创】大叔问题定位分享(7)Spark任务中Job进度卡住不动
Spark2.1.1
最近运行spark任务时会发现任务经常运行很久,具体job如下:
|
Stages: Succeeded/Total |
Tasks (for all stages): Succeeded/Total |
||||
|
16 |
2018/12/03 12:39:50 |
2.3 h |
0/5 |
196/4723 |
job中正在运行的stage如下:
|
Tasks: Succeeded/Total |
||||||||
|
60 |
2018/12/03 12:39:57 |
2.3 h |
196/200 |
4.5 GB |
1455.1 MB |
该stage中有4个task一直处于running状态,这些task的统计信息异常(Input Size / Records和Shuffle Write Size / Records均为0.0B/0),并且这4个task都位于同一个executor上:
|
33 |
8938 |
0 |
RUNNING |
PROCESS_LOCAL |
12 / $executor_server_ip |
2018/12/03 12:39:57 |
2.3 h |
0.0 B / 0 |
0.0 B / 0 |
有问题的task所在的executor统计信息也有异常(Total Tasks为0),该executor如下:
|
12 |
$executor_server_ip:36755 |
0 ms |
0 |
0 |
0 |
0 |
0.0 B / 0 |
0.0 B / 0 |
此时Driver堆栈信息如下:
"Driver" #26 prio=5 os_prio=0 tid=0x00007f163a116000 nid=0x5192 waiting on condition [0x00007f15bb9a0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000001a8c4f9e0> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
at org.apache.spark.rdd.RDD$$anonfun$treeReduce$1.apply(RDD.scala:1059)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeReduce(RDD.scala:1037)
at breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:41)
at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:30)
at breeze.optimize.StrongWolfeLineSearch.breeze$optimize$StrongWolfeLineSearch$$phi$1(StrongWolfe.scala:69)
at breeze.optimize.StrongWolfeLineSearch$$anonfun$minimize$1.apply$mcVI$sp(StrongWolfe.scala:142)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at breeze.optimize.StrongWolfeLineSearch.minimize(StrongWolfe.scala:141)
at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:78)
at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:40)
at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:64)
at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:62)
at scala.collection.Iterator$$anon$7.next(Iterator.scala:129)
at breeze.util.IteratorImplicits$RichIterator$$anon$2.next(Implicits.scala:71)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at app.package.AppClass.main(AppClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
可见正在runJob,并且等待executor执行结果;
有问题的executor上堆栈信息有一个可疑的thread长时间一直在running:
"shuffle-client-5-4" #94 daemon prio=5 os_prio=0 tid=0x00007fbae0e42800 nid=0x2a3a runnable [0x00007fbae4760000]
java.lang.Thread.State: RUNNABLE
at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:476)
at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
at io.netty.util.Recycler.get(Recycler.java:144)
at io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
at io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
ps:出问题的executor上当时的内存资源很空闲,进程状态也正常:
-bash-4.2$ free -m
total used free shared buff/cache available
Mem: 257676 29251 5274 517 223150 226669
Swap: 0 0 0
怀疑此处可能有死循环,spark2.1.1使用的netty版本是4.0.42,查看netty代码:
io.netty.util.Recycler
boolean scavengeSome() {
WeakOrderQueue cursor = this.cursor;
if (cursor == null) {
cursor = head;
if (cursor == null) {
return false;
}
}
boolean success = false;
WeakOrderQueue prev = this.prev;
do {
if (cursor.transfer(this)) {
success = true;
break;
}
WeakOrderQueue next = cursor.next;
if (cursor.owner.get() == null) {
// If the thread associated with the queue is gone, unlink it, after
// performing a volatile read to confirm there is no data left to collect.
// We never unlink the first queue, as we don't want to synchronize on updating the head.
if (cursor.hasFinalData()) {
for (;;) {
if (cursor.transfer(this)) {
success = true;
} else {
break;
}
}
}
if (prev != null) {
prev.next = next;
}
} else {
prev = cursor;
}
cursor = next;
} while (cursor != null && !success);
this.prev = prev;
this.cursor = cursor;
return success;
}
问题在于cursor初始化的时候没有清空prev:
if (cursor == null) {
cursor = head;
该问题在4.0.43中被修复,升级spark2.1.1中的netty到4.0.43或以上版本可以修复问题;
官方issues位于:https://github.com/netty/netty/issues/6153
【原创】大叔问题定位分享(7)Spark任务中Job进度卡住不动的更多相关文章
- 【原创】大叔问题定位分享(18)beeline连接spark thrift有时会卡住
spark 2.1.1 beeline连接spark thrift之后,执行use database有时会卡住,而use database 在server端对应的是 setCurrentDatabas ...
- 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration
spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...
- 【原创】大叔问题定位分享(27)spark中rdd.cache
spark 2.1.1 spark应用中有一些task非常慢,持续10个小时,有一个task日志如下: 2019-01-24 21:38:56,024 [dispatcher-event-loop-2 ...
- 【原创】大叔问题定位分享(21)spark执行insert overwrite非常慢,比hive还要慢
最近把一些sql执行从hive改到spark,发现执行更慢,sql主要是一些insert overwrite操作,从执行计划看到,用到InsertIntoHiveTable spark-sql> ...
- 【原创】大叔问题定位分享(19)spark task在executors上分布不均
最近提交一个spark应用之后发现执行非常慢,点开spark web ui之后发现卡在一个job的一个stage上,这个stage有100000个task,但是绝大部分task都分配到两个execut ...
- 【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException
spark查orc格式的数据有时会报这个错 Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.io.orc. ...
- 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...
- 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...
- 【原创】大叔问题定位分享(12)Spark保存文本类型文件(text、csv、json等)到hdfs时为什么是压缩格式的
问题重现 rdd.repartition(1).write.csv(outPath) 写文件之后发现文件是压缩过的 write时首先会获取hadoopConf,然后从中获取是否压缩以及压缩格式 org ...
随机推荐
- 打包优化实践(如何Code Spliting)
项目地址:ReactSPA 使用 webpack 插件找出占用空间较大的包 开发环境中可使用 analyze-webpack-plugin 观察各模块的占用情况.以该项目为例:浏览器中输入 http: ...
- First ServiceStack Service
博客1:ServiceStack Web Service 创建与调用简单示列 博客2:Hbuilder+vs2017 web api开发app 官方文档:servicestack docs 1.手动下 ...
- ES6类封装判断用户上下左右滑动事件!
/** * param 原生js方式实现判断用户的滑动方向 * 返回1 向上 * 返回2 向下 * 返回3 向左 * 返回4 向右 */ class juedgeSlide { constructor ...
- iview 无法缓存的问题
- Kafka--消息队列
说明:解耦指的是客户端A和客户端B不需要同步,两者之间的通信是异步的:消息队列是可以复制备份的,因此消息队列具有冗余性和可恢复性:所谓拓展性指的是,在大型分布式系统中,消息队列是分布在集群中的:队列是 ...
- ABP中的模块初始化过程(一)
在总结完整个ABP项目的结构之后,我们就来看一看ABP中这些主要的模块是按照怎样的顺序进行加载的,在加载的过程中我们会一步步分析源代码来进行解释,从而使自己对于整个框架有一个清晰的脉络,在整个Asp. ...
- js 实现数据结构 -- 队列
原文: 在 Javascript 中学习数据结构与算法. 概念: 与栈相反,队列是一种遵循先进先出 (FIFO / First In First Out) 原则的一组有序的项:队列在尾部添加新元素,并 ...
- springdata find立即加载 get延迟加载 get返回的是一个动态代理对象 特点是 用的时候才会查询 否则不查询
- Kindle官方广告“自封”泡面盖
动点科技获悉:亚马逊出品的电子书阅读器 kindle 在最新投放天猫平台上的 banner 栏醒目位置投放了一组广告,令人大跌眼镜的是图片使用了 kindle 放在一桶泡面上的形式,怕你看不懂用意的官 ...
- python基础4 列表和元组
一. 列表列表:python基础数据类型之一:其他语言中也有列表的概念,js 数组,可索引,可切片,可加步长li = ['hello', 100, True, [1, 2, 3], {'name':' ...