Spark Advanced
Spark source code analysis:
https://yq.aliyun.com/articles/28400?utm_campaign=wenzhang&utm_medium=article&utm_source=QQ-qun&utm_content=m_11999
Spark shuffle:
http://blog.csdn.net/johnny_lee/article/details/22619585
Spark java.lang.OutOfMemoryError: Java heap space
My cluster: 1 master, 11 slaves, each node has 6 GB memory.
My settings:
spark.executor.memory=4g, -Dspark.akka.frameSize=512
Here is the problem:
First, I read some data (2.19 GB) from HDFS into an RDD:
val imageBundleRDD = sc.newAPIHadoopFile(...)
Second, I do something with this RDD:
val res = imageBundleRDD.map(data => {
  val desPoints = threeDReconstruction(data._2, bg)
  (data._1, desPoints)
})
Last, output to HDFS:
res.saveAsNewAPIHadoopFile(...)
When I run my program it shows:
.....
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:24 as TID 33 on executor 9: Salve7.Hadoop (NODE_LOCAL)
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:24 as 30618515 bytes in 210 ms
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:36 as TID 34 on executor 2: Salve11.Hadoop (NODE_LOCAL)
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:36 as 30618515 bytes in 449 ms
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL)
Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[spark]
Answer 1:
I have a few suggestions:
- If your nodes are configured to have 6g maximum for Spark (and are leaving a little for other processes), then use 6g rather than 4g: spark.executor.memory=6g. Make sure you're using as much memory as possible by checking the UI (it will say how much memory you're using).
- Try using more partitions; you should have 2-4 per CPU. IME, increasing the number of partitions is often the easiest way to make a program more stable (and often faster). For huge amounts of data you may need far more than 4 per CPU; I've had to use 8000 partitions in some cases! (See the Scala sketch after this list.)
- Decrease the fraction of memory reserved for caching, using spark.storage.memoryFraction. If you don't use cache() or persist in your code, this might as well be 0. Its default is 0.6, which means you only get 0.4 * 4g of memory for your heap. IME, reducing the memory fraction often makes OOMs go away. UPDATE: from Spark 1.6 we apparently no longer need to play with these values; Spark will determine them automatically.
- Similar to the above, but for the shuffle memory fraction (spark.shuffle.memoryFraction). If your job doesn't need much shuffle memory then set it to a lower value (this might cause your shuffles to spill to disk, which can have a catastrophic impact on speed). Sometimes, when it's a shuffle operation that's OOMing, you need to do the opposite, i.e. set it to something large like 0.8, or make sure your shuffles are allowed to spill to disk (the default since 1.0.0).
- Watch out for memory leaks; these are often caused by accidentally closing over objects you don't need in your lambdas. The way to diagnose is to look out for "task serialized as XXX bytes" in the logs; if XXX is larger than a few KB or more than an MB, you may have a memory leak. See http://stackoverflow.com/a/25270600/1586965
- Related to the above: use broadcast variables if you really do need large objects (the sketch after this list broadcasts the question's bg object as an example).
- If you are caching large RDDs and can sacrifice some access time, consider serialising the RDD (http://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage), or even caching them on disk (which sometimes isn't that bad if you're using SSDs).
- (Advanced) Related to the above: avoid String and heavily nested structures (like Map and nested case classes). If possible, try to use only primitive types and index all non-primitives, especially if you expect a lot of duplicates. Choose WrappedArray over nested structures whenever possible. Or even roll your own serialisation: YOU will have the most information regarding how to efficiently pack your data into bytes, USE IT!
- (A bit hacky) Again when caching, consider using a Dataset to cache your structure, as it will use more efficient serialisation. This should be regarded as a hack compared to the previous bullet point. Building your domain knowledge into your algo/serialisation can minimise memory/cache space by 100x or 1000x, whereas all a Dataset will likely give is 2x-5x in memory and 10x compressed (Parquet) on disk.
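To make several of these suggestions concrete (more partitions, broadcasting instead of closing over a large object, serialised caching), here is a minimal Scala sketch against the asker's pipeline. It reuses the names from the question (imageBundleRDD, threeDReconstruction, bg); the partition count of 500, and the assumption that bg is the large object being captured, are illustrative only:

import org.apache.spark.storage.StorageLevel

// Ship the shared object once per executor instead of serialising it into
// every task closure (a plausible cause of the ~30 MB task sizes above).
val bgBroadcast = sc.broadcast(bg)

val res = imageBundleRDD
  .repartition(500) // illustrative count: aim for at least 2-4 partitions per CPU
  .map(data => (data._1, threeDReconstruction(data._2, bgBroadcast.value)))

// If res is reused, cache it in serialised form to shrink its memory footprint.
res.persist(StorageLevel.MEMORY_ONLY_SER)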
http://spark.apache.org/docs/1.2.1/configuration.html
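For reference, a hedged sketch of setting the values discussed above programmatically on a 1.x cluster; the numbers are placeholders, not recommendations:

import org.apache.spark.SparkConf

// Legacy 1.x knobs; from Spark 1.6 the unified memory manager sizes the
// storage and shuffle regions automatically, as noted above.
val conf = new SparkConf()
  .setAppName("imageBundle") // hypothetical app name
  .set("spark.executor.memory", "6g")
  .set("spark.storage.memoryFraction", "0.2") // low if you barely cache()/persist
  .set("spark.shuffle.memoryFraction", "0.2")
// Pass this conf to new SparkContext(conf), or set the same keys via spark-submit --conf.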
EDIT (so I can google this more easily): the following error is also indicative of this problem:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Answer 2:
Have a look at the start-up scripts: a Java heap size is set there, and it looks like you're not setting it before running the Spark worker.
# Set SPARK_MEM if it isn't already set since we also use it for this process
SPARK_MEM=${SPARK_MEM:-512m}
export SPARK_MEM
# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"
You can find the documentation for the deploy scripts here.
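So a minimal fix, assuming the old standalone deploy scripts quoted above, is to export a larger value (or set it in conf/spark-env.sh) before starting the workers; the 4g here is illustrative:

# Must exceed the 512m default shown above.
export SPARK_MEM=4g
./bin/start-slaves.sh # script location varies by Spark version (bin/ or sbin/)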