【原创】大叔问题定位分享(7)Spark任务中Job进度卡住不动
Spark2.1.1
最近运行spark任务时会发现任务经常运行很久,具体job如下:
|
Stages: Succeeded/Total |
Tasks (for all stages): Succeeded/Total |
||||
|
16 |
2018/12/03 12:39:50 |
2.3 h |
0/5 |
196/4723 |
job中正在运行的stage如下:
|
Tasks: Succeeded/Total |
||||||||
|
60 |
2018/12/03 12:39:57 |
2.3 h |
196/200 |
4.5 GB |
1455.1 MB |
该stage中有4个task一直处于running状态,这些task的统计信息异常(Input Size / Records和Shuffle Write Size / Records均为0.0B/0),并且这4个task都位于同一个executor上:
|
33 |
8938 |
0 |
RUNNING |
PROCESS_LOCAL |
12 / $executor_server_ip |
2018/12/03 12:39:57 |
2.3 h |
0.0 B / 0 |
0.0 B / 0 |
有问题的task所在的executor统计信息也有异常(Total Tasks为0),该executor如下:
|
12 |
$executor_server_ip:36755 |
0 ms |
0 |
0 |
0 |
0 |
0.0 B / 0 |
0.0 B / 0 |
此时Driver堆栈信息如下:
"Driver" #26 prio=5 os_prio=0 tid=0x00007f163a116000 nid=0x5192 waiting on condition [0x00007f15bb9a0000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000001a8c4f9e0> (a scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
at org.apache.spark.rdd.RDD$$anonfun$treeReduce$1.apply(RDD.scala:1059)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.treeReduce(RDD.scala:1037)
at breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:41)
at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:30)
at breeze.optimize.StrongWolfeLineSearch.breeze$optimize$StrongWolfeLineSearch$$phi$1(StrongWolfe.scala:69)
at breeze.optimize.StrongWolfeLineSearch$$anonfun$minimize$1.apply$mcVI$sp(StrongWolfe.scala:142)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at breeze.optimize.StrongWolfeLineSearch.minimize(StrongWolfe.scala:141)
at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:78)
at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:40)
at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:64)
at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:62)
at scala.collection.Iterator$$anon$7.next(Iterator.scala:129)
at breeze.util.IteratorImplicits$RichIterator$$anon$2.next(Implicits.scala:71)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at app.package.AppClass.main(AppClass.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
可见正在runJob,并且等待executor执行结果;
有问题的executor上堆栈信息有一个可疑的thread长时间一直在running:
"shuffle-client-5-4" #94 daemon prio=5 os_prio=0 tid=0x00007fbae0e42800 nid=0x2a3a runnable [0x00007fbae4760000]
java.lang.Thread.State: RUNNABLE
at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:476)
at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
at io.netty.util.Recycler.get(Recycler.java:144)
at io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
at io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
ps:出问题的executor上当时的内存资源很空闲,进程状态也正常:
-bash-4.2$ free -m
total used free shared buff/cache available
Mem: 257676 29251 5274 517 223150 226669
Swap: 0 0 0
怀疑此处可能有死循环,spark2.1.1使用的netty版本是4.0.42,查看netty代码:
io.netty.util.Recycler
boolean scavengeSome() {
WeakOrderQueue cursor = this.cursor;
if (cursor == null) {
cursor = head;
if (cursor == null) {
return false;
}
}
boolean success = false;
WeakOrderQueue prev = this.prev;
do {
if (cursor.transfer(this)) {
success = true;
break;
}
WeakOrderQueue next = cursor.next;
if (cursor.owner.get() == null) {
// If the thread associated with the queue is gone, unlink it, after
// performing a volatile read to confirm there is no data left to collect.
// We never unlink the first queue, as we don't want to synchronize on updating the head.
if (cursor.hasFinalData()) {
for (;;) {
if (cursor.transfer(this)) {
success = true;
} else {
break;
}
}
}
if (prev != null) {
prev.next = next;
}
} else {
prev = cursor;
}
cursor = next;
} while (cursor != null && !success);
this.prev = prev;
this.cursor = cursor;
return success;
}
问题在于cursor初始化的时候没有清空prev:
if (cursor == null) {
cursor = head;
该问题在4.0.43中被修复,升级spark2.1.1中的netty到4.0.43或以上版本可以修复问题;
官方issues位于:https://github.com/netty/netty/issues/6153
【原创】大叔问题定位分享(7)Spark任务中Job进度卡住不动的更多相关文章
- 【原创】大叔问题定位分享(18)beeline连接spark thrift有时会卡住
spark 2.1.1 beeline连接spark thrift之后,执行use database有时会卡住,而use database 在server端对应的是 setCurrentDatabas ...
- 【原创】大叔问题定位分享(10)提交spark任务偶尔报错 org.apache.spark.SparkException: A master URL must be set in your configuration
spark 2.1.1 一 问题重现 问题代码示例 object MethodPositionTest { val sparkConf = new SparkConf().setAppName(&qu ...
- 【原创】大叔问题定位分享(27)spark中rdd.cache
spark 2.1.1 spark应用中有一些task非常慢,持续10个小时,有一个task日志如下: 2019-01-24 21:38:56,024 [dispatcher-event-loop-2 ...
- 【原创】大叔问题定位分享(21)spark执行insert overwrite非常慢,比hive还要慢
最近把一些sql执行从hive改到spark,发现执行更慢,sql主要是一些insert overwrite操作,从执行计划看到,用到InsertIntoHiveTable spark-sql> ...
- 【原创】大叔问题定位分享(19)spark task在executors上分布不均
最近提交一个spark应用之后发现执行非常慢,点开spark web ui之后发现卡在一个job的一个stage上,这个stage有100000个task,但是绝大部分task都分配到两个execut ...
- 【原创】大叔问题定位分享(17)spark查orc格式数据偶尔报错NullPointerException
spark查orc格式的数据有时会报这个错 Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.io.orc. ...
- 【原创】大叔问题定位分享(16)spark写数据到hive外部表报错ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
spark 2.1.1 spark在写数据到hive外部表(底层数据在hbase中)时会报错 Caused by: java.lang.ClassCastException: org.apache.h ...
- 【原创】大叔问题定位分享(15)spark写parquet数据报错ParquetEncodingException: empty fields are illegal, the field should be ommited completely instead
spark 2.1.1 spark里执行sql报错 insert overwrite table test_parquet_table select * from dummy 报错如下: org.ap ...
- 【原创】大叔问题定位分享(12)Spark保存文本类型文件(text、csv、json等)到hdfs时为什么是压缩格式的
问题重现 rdd.repartition(1).write.csv(outPath) 写文件之后发现文件是压缩过的 write时首先会获取hadoopConf,然后从中获取是否压缩以及压缩格式 org ...
随机推荐
- [2019BUAA软工助教]第0次个人作业
[2019BUAA软工助教]第0次个人作业 一.前言 我认为人生就是一次次地从<存在>到<光明>. 二.软件工程师的成长 博客索引 同学们在上这门课的时候基本都是大三,觉得在大 ...
- h5-canvas 单像素操作
###1. 自定义获取指定坐标像素 var canvas = document.querySelector("#cav"); if(canvas.getContext){ var ...
- NOIP2010提高组复赛C 关押罪犯
题目链接:https://ac.nowcoder.com/acm/contest/258/C 题目大意: 略 分析: 这题是并查集的一个变题,先按积怨值从大到小排序,然后一个一个看能否完全分开,遇到的 ...
- 2733: [HNOI2012]永无乡 线段树合并
题目: https://www.lydsy.com/JudgeOnline/problem.php?id=2733 题解: 建n棵动态开点的权值线段树,然后边用并查集维护连通性,边合并线段树维护第k重 ...
- CF226D The table
题目链接 题意 给出一个\(n\times m\)的矩阵,可以把某些行和某些列上面的数字变为相反数.问修改那些行和哪些列可以使得所有行和所有列之和都为非负数. 思路 每次将负数的行或者列变为相反数.因 ...
- 测试框架httpclent 4.HttpClient Post方法实现
startupWithCookies.json [ { "description":"这是一个会返回cookies信息的get请求", "reques ...
- Linux 下 boost 库的安装,配置个人环境变量
部分引自: https://blog.csdn.net/this_capslock/article/details/47170313 1. 下载boost安装包并解压缩到http://www.boos ...
- 金融量化分析【day113】:布林带策略
一.布林带策略简介 1.简介 2.计算公式 3.图形 二.布林带策略代码 import jqdata def initialize(context): set_benchmark('0000 ...
- Quartz.net 3.x使用总结(一)——入门介绍
1.Quartz.net简介 Quartz.NET是一个强大.开源.轻量级的任务调度框架.任务调度在我们的开发中经常遇到,如说:每天晚上三点让程序或网站执行某些代码,或者每隔5秒种执行一个方法等.Wi ...
- 语义化标签和jQuery选择器
关于语义化标签 https://blog.csdn.net/nongweiyilady/article/details/53885433 更详细的语义化标签:https://www.cnblogs.c ...