Spark 2.1.1

Recently, Spark applications have frequently been running for a very long time. The job in question looks like this:

Job Id: 16
Description: treeReduce at CRFWithLBFGS.scala:160
Submitted: 2018/12/03 12:39:50
Duration: 2.3 h
Stages (Succeeded/Total): 0/5
Tasks for all stages (Succeeded/Total): 196/4723

The stage currently running in this job:

Stage Id: 60
Description: treeReduce at CRFWithLBFGS.scala:160
Submitted: 2018/12/03 12:39:57
Duration: 2.3 h
Tasks (Succeeded/Total): 196/200
Input: 4.5 GB
Shuffle Write: 1455.1 MB

Four tasks in this stage stay in the RUNNING state. Their statistics are abnormal (both Input Size / Records and Shuffle Write Size / Records are 0.0 B / 0), and all four tasks sit on the same executor:

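Index: 33
ID: 8938
Attempt: 0
Status: RUNNING
Locality Level: PROCESS_LOCAL
Executor ID / Host: 12 / $executor_server_ip
Logs: stdout, stderr
Launch Time: 2018/12/03 12:39:57
Duration: 2.3 h
Input Size / Records: 0.0 B / 0
Shuffle Write Size / Records: 0.0 B / 0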

The executor hosting the problematic tasks also shows abnormal statistics (Total Tasks is 0):

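Executor ID: 12
Logs: stdout, stderr
Address: $executor_server_ip:36755
Task Time: 0 ms
Total Tasks: 0
Failed Tasks: 0
Killed Tasks: 0
Succeeded Tasks: 0
Input Size / Records: 0.0 B / 0
Shuffle Write Size / Records: 0.0 B / 0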

The Driver thread stack at this point:

"Driver" #26 prio=5 os_prio=0 tid=0x00007f163a116000 nid=0x5192 waiting on condition [0x00007f15bb9a0000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000001a8c4f9e0> (a scala.concurrent.impl.Promise$CompletionLatch)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:619)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1988)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
    at org.apache.spark.rdd.RDD$$anonfun$treeReduce$1.apply(RDD.scala:1059)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.treeReduce(RDD.scala:1037)
    at breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)
    at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:41)
    at breeze.optimize.LineSearch$$anon$1.calculate(LineSearch.scala:30)
    at breeze.optimize.StrongWolfeLineSearch.breeze$optimize$StrongWolfeLineSearch$$phi$1(StrongWolfe.scala:69)
    at breeze.optimize.StrongWolfeLineSearch$$anonfun$minimize$1.apply$mcVI$sp(StrongWolfe.scala:142)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at breeze.optimize.StrongWolfeLineSearch.minimize(StrongWolfe.scala:141)
    at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:78)
    at breeze.optimize.LBFGS.determineStepSize(LBFGS.scala:40)
    at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:64)
    at breeze.optimize.FirstOrderMinimizer$$anonfun$infiniteIterations$1.apply(FirstOrderMinimizer.scala:62)
    at scala.collection.Iterator$$anon$7.next(Iterator.scala:129)
    at breeze.util.IteratorImplicits$RichIterator$$anon$2.next(Implicits.scala:71)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at app.package.AppClass.main(AppClass.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)

This shows the Driver is inside runJob, waiting for the executors to return their results.

In the thread dump of the problematic executor, one suspicious thread stays RUNNABLE for a long time:

"shuffle-client-5-4" #94 daemon prio=5 os_prio=0 tid=0x00007fbae0e42800 nid=0x2a3a runnable [0x00007fbae4760000]
   java.lang.Thread.State: RUNNABLE
    at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:476)
    at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
    at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
    at io.netty.util.Recycler.get(Recycler.java:144)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
    at io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
    at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
    at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
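Spotting a thread like this usually means comparing two thread dumps taken some time apart. The same comparison can be sketched inside a JVM: the hypothetical StuckThreadSampler below (not part of Spark or Netty) samples all threads twice and reports those that are RUNNABLE with the same top frame in both samples. It is only a heuristic, since threads blocked in native calls (for example a selector wait) are also reported as RUNNABLE, so each hit still needs a look at the full stack.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StuckThreadSampler {
    static volatile boolean stop = false;
    static volatile boolean spinning = false;

    /** Demo workload: stays RUNNABLE inside this one method until stop is set. */
    static void spin() {
        while (!stop) {
            spinning = true; // volatile write, not a method call, so spin() stays the top frame
        }
    }

    /** Starts a daemon thread running spin() and returns once the loop has been entered. */
    static Thread startSpinner(String name) {
        Thread t = new Thread(StuckThreadSampler::spin, name);
        t.setDaemon(true);
        t.start();
        while (!spinning) {
            Thread.yield();
        }
        return t;
    }

    /** One sample: every RUNNABLE thread (except the caller) mapped to "Class.method" of its top frame. */
    static Map<Thread, String> runnableTopFrames() {
        Map<Thread, String> sample = new HashMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            StackTraceElement[] frames = e.getValue();
            if (t != Thread.currentThread()
                    && t.getState() == Thread.State.RUNNABLE
                    && frames.length > 0) {
                sample.put(t, frames[0].getClassName() + "." + frames[0].getMethodName());
            }
        }
        return sample;
    }

    /** Takes two samples intervalMillis apart and lists threads whose top frame did not change. */
    static List<String> suspects(long intervalMillis) {
        Map<Thread, String> first = runnableTopFrames();
        try {
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        Map<Thread, String> second = runnableTopFrames();
        List<String> result = new ArrayList<>();
        for (Map.Entry<Thread, String> e : first.entrySet()) {
            if (e.getValue().equals(second.get(e.getKey()))) {
                result.add(e.getKey().getName() + " @ " + e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        startSpinner("demo-spinner");
        for (String s : suspects(200)) {
            System.out.println(s);
        }
        stop = true;
    }
}
```

In the case described here, the equivalent finding is that shuffle-client-5-4 sits in Recycler$Stack.scavengeSome in every dump.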

P.S. Memory on the problematic executor's host was mostly idle at the time, and the process state was normal:

-bash-4.2$ free -m
              total        used        free      shared  buff/cache   available
Mem:         257676       29251        5274         517      223150      226669
Swap:             0           0           0

This suggests an infinite loop at that point. Spark 2.1.1 uses Netty 4.0.42, so here is the relevant Netty code:

io.netty.util.Recycler

        boolean scavengeSome() {
            WeakOrderQueue cursor = this.cursor;
            if (cursor == null) {
                cursor = head;
                if (cursor == null) {
                    return false;
                }
            }
            boolean success = false;
            WeakOrderQueue prev = this.prev;
            do {
                if (cursor.transfer(this)) {
                    success = true;
                    break;
                }
                WeakOrderQueue next = cursor.next;
                if (cursor.owner.get() == null) {
                    // If the thread associated with the queue is gone, unlink it, after
                    // performing a volatile read to confirm there is no data left to collect.
                    // We never unlink the first queue, as we don't want to synchronize on updating the head.
                    if (cursor.hasFinalData()) {
                        for (;;) {
                            if (cursor.transfer(this)) {
                                success = true;
                            } else {
                                break;
                            }
                        }
                    }
                    if (prev != null) {
                        prev.next = next;
                    }
                } else {
                    prev = cursor;
                }
                cursor = next;
            } while (cursor != null && !success);
            this.prev = prev;
            this.cursor = cursor;
            return success;
        }

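The problem is that prev is not cleared when cursor is reset to head, so a stale prev left over from the previous pass is carried into the new one. If the owner thread of the head queue has exited by then, prev.next = next can link a later queue back to an earlier one, and the do/while loop then walks a cycle in which cursor never becomes null:

            if (cursor == null) {
                cursor = head;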
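The stale-prev failure mode can be reproduced in isolation. The sketch below is a simplified model, not Netty code: Node stands in for WeakOrderQueue, ownerAlive replaces the weak owner reference, transfer is assumed to always find nothing, and a step limit makes the endless walk observable. The resetPrev flag switches between the behaviour quoted above and a variant that clears prev whenever cursor restarts from head, which is what the diagnosis implies a fix has to do.

```java
final class StalePrevDemo {
    /** Simplified stand-in for a WeakOrderQueue: just a link and an "owner thread alive" flag. */
    static final class Node {
        Node next;
        boolean ownerAlive = true;
    }

    static final class Stack {
        Node head, cursor, prev;
        final boolean resetPrev; // false = keep stale prev (as in the quoted code), true = clear it

        Stack(boolean resetPrev) {
            this.resetPrev = resetPrev;
        }

        /** Returns the number of nodes visited, or -1 if the walk exceeded maxSteps (a cycle). */
        int scavengeSome(int maxSteps) {
            Node prev;
            Node cursor = this.cursor;
            if (cursor == null) {
                prev = resetPrev ? null : this.prev;
                cursor = head;
                if (cursor == null) {
                    return 0;
                }
            } else {
                prev = this.prev;
            }
            int steps = 0;
            do {
                if (++steps > maxSteps) {
                    return -1;
                }
                Node next = cursor.next;
                if (!cursor.ownerAlive) {
                    if (prev != null) {
                        prev.next = next; // with a stale prev this can point backwards
                    }
                } else {
                    prev = cursor;
                }
                cursor = next;
            } while (cursor != null); // transfer is assumed to never succeed
            this.prev = prev;
            this.cursor = cursor;
            return steps;
        }
    }

    /** Builds A -> B -> C, runs one full pass, lets A's owner die, then runs a second pass. */
    static int run(boolean resetPrev) {
        Stack s = new Stack(resetPrev);
        Node a = new Node(), b = new Node(), c = new Node();
        a.next = b;
        b.next = c;
        s.head = a;
        s.scavengeSome(100);   // ends with cursor == null and prev == C
        a.ownerAlive = false;  // the head queue's owner thread exits
        return s.scavengeSome(100);
    }

    public static void main(String[] args) {
        System.out.println("stale prev: " + run(false)); // -1: C.next was set to B, so B <-> C cycle
        System.out.println("reset prev: " + run(true));  // 3: A, B, C visited, then cursor == null
    }
}
```

In the stale case the second pass starts at A with prev still pointing at C, so unlinking A executes C.next = B and the walk bounces between B and C until the step limit trips.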

The problem was fixed in Netty 4.0.43. Upgrading the Netty used by Spark 2.1.1 to 4.0.43 or later resolves it.
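To guard against running with an older Netty by accident, a deployment could check the version string at startup. The helper below is a hypothetical sketch: it only compares the numeric parts of a version string against 4.0.43 and assumes the string is supplied by the caller (for example, taken from the Netty jar's file name). It reflects the 4.0.x line discussed here and makes no claim about other release lines.

```java
final class NettyVersionCheck {
    /** Extracts up to three leading numeric components, e.g. "4.0.42.Final" -> {4, 0, 42}. */
    static int[] parse(String version) {
        String[] parts = version.split("\\.");
        int[] nums = new int[3];
        for (int i = 0; i < nums.length && i < parts.length; i++) {
            try {
                nums[i] = Integer.parseInt(parts[i]);
            } catch (NumberFormatException e) {
                break; // stop at the first non-numeric component such as "Final"
            }
        }
        return nums;
    }

    /** Compares major, minor and patch in order. */
    static boolean isAtLeast(String version, int major, int minor, int patch) {
        int[] v = parse(version);
        int[] min = {major, minor, patch};
        for (int i = 0; i < 3; i++) {
            if (v[i] != min[i]) {
                return v[i] > min[i];
            }
        }
        return true;
    }

    /** True if the version is at or above 4.0.43, the release named above as containing the fix. */
    static boolean hasScavengeFix(String version) {
        return isAtLeast(version, 4, 0, 43);
    }

    public static void main(String[] args) {
        System.out.println(hasScavengeFix("4.0.42.Final")); // false
        System.out.println(hasScavengeFix("4.0.43.Final")); // true
    }
}
```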

The upstream issue is at: https://github.com/netty/netty/issues/6153
