Spark Advanced
Spark source code analysis:
https://yq.aliyun.com/articles/28400?utm_campaign=wenzhang&utm_medium=article&utm_source=QQ-qun&utm_content=m_11999
Spark shuffle:
http://blog.csdn.net/johnny_lee/article/details/22619585
Spark java.lang.OutOfMemoryError: Java heap space
My cluster: 1 master, 11 slaves, each node has 6 GB memory.
My settings:
spark.executor.memory=4g, -Dspark.akka.frameSize=512
Here is the problem:
First, I read some data (2.19 GB) from HDFS into an RDD:
val imageBundleRDD = sc.newAPIHadoopFile(...)
Second, I do something with this RDD:
val res = imageBundleRDD.map(data => {
  val desPoints = threeDReconstruction(data._2, bg)
  (data._1, desPoints)
})
Last, output to HDFS:
res.saveAsNewAPIHadoopFile(...)
When I run my program it shows:
.....
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:24 as TID 33 on executor 9: Salve7.Hadoop (NODE_LOCAL)
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:24 as 30618515 bytes in 210 ms
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:36 as TID 34 on executor 2: Salve11.Hadoop (NODE_LOCAL)
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:36 as 30618515 bytes in 449 ms
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL)
Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[spark]
Answer 1:
I have a few suggestions:
- If your nodes are configured to have 6g maximum for Spark (and are leaving a little for other processes), then use 6g rather than 4g: spark.executor.memory=6g. Make sure you're using as much memory as possible by checking the UI (it will say how much memory you're using).
- Try using more partitions; you should have 2-4 per CPU. IME, increasing the number of partitions is often the easiest way to make a program more stable (and often faster). For huge amounts of data you may need far more than 4 per CPU; I've had to use 8000 partitions in some cases! (See the Scala sketch after this list.)
- Decrease the fraction of memory reserved for caching, using spark.storage.memoryFraction. If you don't use cache() or persist in your code, this might as well be 0. Its default is 0.6, which means you only get 0.4 * 4g of memory for your heap. IME, reducing the memory fraction often makes OOMs go away. UPDATE: from Spark 1.6 we apparently no longer need to play with these values; Spark will determine them automatically.
- Similar to the above, but for the shuffle memory fraction (spark.shuffle.memoryFraction). If your job doesn't need much shuffle memory then set it to a lower value (this might cause your shuffles to spill to disk, which can have a catastrophic impact on speed). Sometimes, when it's a shuffle operation that's OOMing, you need to do the opposite, i.e. set it to something large like 0.8, or make sure your shuffles are allowed to spill to disk (the default since 1.0.0).
- Watch out for memory leaks; these are often caused by accidentally closing over objects you don't need in your lambdas. The way to diagnose is to look out for "task serialized as XXX bytes" in the logs; if XXX is larger than a few KB or more than an MB, you may have a memory leak. See http://stackoverflow.com/a/25270600/1586965
- Related to the above: use broadcast variables if you really do need large objects (the sketch after this list broadcasts the question's bg object as an example).
- If you are caching large RDDs and can sacrifice some access time, consider serialising the RDD (http://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage), or even caching them on disk (which sometimes isn't that bad if you're using SSDs).
- (Advanced) Related to the above: avoid String and heavily nested structures (like Map and nested case classes). If possible, try to use only primitive types and index all non-primitives, especially if you expect a lot of duplicates. Choose WrappedArray over nested structures whenever possible. Or even roll your own serialisation: YOU will have the most information regarding how to efficiently pack your data into bytes, USE IT!
- (A bit hacky) Again when caching, consider using a Dataset to cache your structure, as it will use more efficient serialisation. This should be regarded as a hack compared to the previous bullet point. Building your domain knowledge into your algo/serialisation can minimise memory/cache space by 100x or 1000x, whereas all a Dataset will likely give is 2x-5x in memory and 10x compressed (Parquet) on disk.
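To make several of these suggestions concrete (more partitions, broadcasting instead of closing over a large object, serialised caching), here is a minimal Scala sketch against the asker's pipeline. It reuses the names from the question (imageBundleRDD, threeDReconstruction, bg); the partition count of 500, and the assumption that bg is the large object being captured, are illustrative only:

import org.apache.spark.storage.StorageLevel

// Ship the shared object once per executor instead of serialising it into
// every task closure (a plausible cause of the ~30 MB task sizes above).
val bgBroadcast = sc.broadcast(bg)

val res = imageBundleRDD
  .repartition(500) // illustrative count: aim for at least 2-4 partitions per CPU
  .map(data => (data._1, threeDReconstruction(data._2, bgBroadcast.value)))

// If res is reused, cache it in serialised form to shrink its memory footprint.
res.persist(StorageLevel.MEMORY_ONLY_SER)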
http://spark.apache.org/docs/1.2.1/configuration.html
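For reference, a hedged sketch of setting the values discussed above programmatically on a 1.x cluster; the numbers are placeholders, not recommendations:

import org.apache.spark.SparkConf

// Legacy 1.x knobs; from Spark 1.6 the unified memory manager sizes the
// storage and shuffle regions automatically, as noted above.
val conf = new SparkConf()
  .setAppName("imageBundle") // hypothetical app name
  .set("spark.executor.memory", "6g")
  .set("spark.storage.memoryFraction", "0.2") // low if you barely cache()/persist
  .set("spark.shuffle.memoryFraction", "0.2")
// Pass this conf to new SparkContext(conf), or set the same keys via spark-submit --conf.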
EDIT (so I can google this more easily): the following error is also indicative of this problem:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Answer 2:
Have a look at the start-up scripts: a Java heap size is set there, and it looks like you're not setting it before running the Spark worker.
# Set SPARK_MEM if it isn't already set since we also use it for this process
SPARK_MEM=${SPARK_MEM:-512m}
export SPARK_MEM
# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"
You can find the documentation for the deploy scripts here.
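So a minimal fix, assuming the old standalone deploy scripts quoted above, is to export a larger value (or set it in conf/spark-env.sh) before starting the workers; the 4g here is illustrative:

# Must exceed the 512m default shown above.
export SPARK_MEM=4g
./bin/start-slaves.sh # script location varies by Spark version (bin/ or sbin/)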