spark通过合理设置spark.default.parallelism参数提高执行效率

spark中有partition的概念（和slice是同一个概念，在spark1.2中官网已经做出了说明），一般每个partition对应一个task。在我的测试过程中，如果没有设置spark.default.parallelism参数，spark计算出来的partition非常巨大，与我的cores非常不搭。我在两台机器上（8cores *2 +6g * 2）上，spark计算出来的partition达到2.8万个，也就是2.9万个tasks，每个task完成时间都是几毫秒或者零点几毫秒，执行起来非常缓慢。在我尝试设置了 spark.default.parallelism 后，任务数减少到10，执行一次计算过程从minute降到20second。

参数可以通过spark_home/conf/spark-default.conf配置文件设置。

eg.

 spark.master                       spark://master:7077

 spark.default.parallelism

 spark.driver.memory                2g

 spark.serializer                   org.apache.spark.serializer.KryoSerializer

 spark.sql.shuffle.partitions

Property Name Default Meaning

Property Name	Default	Meaning
`spark.default.parallelism`	For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like`parallelize` with no parent RDDs, it depends on the cluster manager: Local mode: number of cores on the local machine Mesos fine grained mode: 8 Others: total number of cores on all executor nodes or 2, whichever is larger	Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations likeparallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

from:http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config propertyspark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.

spark通过合理设置spark.default.parallelism参数提高执行效率的更多相关文章

Eclipse:设置自动补全，提高编程效率
一.设置自动补全 1.进入eclipse的window里的perferences页面 2.找到java->Editor->Content Assist设置界面 3.在Auto activa ...
spark系列-7、spark调优
官网说明:http://spark.apache.org/docs/2.1.1/tuning.html#data-serialization 一.JVM调优 1.1.Java虚拟机垃圾回收调优的背景 ...
spark提交命令 spark-submit 的参数 executor-memory、executor-cores、num-executors、spark.default.parallelism分析
转载:https://blog.csdn.net/zimiao552147572/article/details/96482120 nohup spark-submit --master yarn - ...
spark.sql.shuffle.partitions和spark.default.parallelism的区别
在关于spark任务并行度的设置中,有两个参数我们会经常遇到,spark.sql.shuffle.partitions 和 spark.default.parallelism, 那么这两个参数到底有什 ...
[Spark]What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used ...
streaming优化：spark.default.parallelism调整处理并行度
官方是这么说的: Cluster resources can be under-utilized if the number of parallel tasks used in any stage o ...
【Spark调优】提交job资源参数调优
[场景] Spark提交作业job的时候要指定该job可以使用的CPU.内存等资源参数,生产环境中,任务资源分配不足会导致该job执行中断.失败等问题,所以对Spark的job资源参数分配调优非常重要 ...
【Spark调优】内存模型与参数调优
[Spark内存模型] Spark在一个executor中的内存分为3块:storage内存.execution内存.other内存. 1. storage内存:存储broadcast,cache,p ...
[spark]-Spark2.x集群搭建与参数详解
在前面的Spark发展历程和基本概念中介绍了Spark的一些基本概念,熟悉了这些基本概念对于集群的搭建是很有必要的.我们可以了解到每个参数配置的作用是什么.这里将详细介绍Spark集群搭建以及xml参 ...

随机推荐

iOS：二维码的扫描
iOS 中二维码的扫描借用#import <AVFoundation/AVFoundation.h> 实现,会用到<AVCaptureMetadataOutputObjectsDel ...
windows 7系统搭建PHP网站环境
2.新建数据库打开浏览器,输入http://localhost:9999或者http://127.0.0.1:9999回车填写用户名root和密码回车登录点击权限-添加新用户填写用户名,主机选择本地, ...
Hadoop Trash回收站使用指南
转载:https://blog.csdn.net/sunnyyoona/article/details/78869778 我们在删除一个文件时,遇到如下问题,提示我们不能删除文件放回回收站: sudo ...
Spark（十二） -- Spark On Yarn & Spark as a Service & Spark On Tachyon
Spark On Yarn: 从0.6.0版本其,就可以在在Yarn上运行Spark 通过Yarn进行统一的资源管理和调度进而可以实现不止Spark,多种处理框架并存工作的场景部署Spark On ...
谋哥：玩App怎么赚钱（三）
谋哥每天坚持写文章,如今写作速度是越来越快了,当然这样也能节省点时间.只是坚持每天写,确实须要极大的耐力和毅力,由于偶然事件会影响你心情和灵感.只是我一直相信秦刚老师(微信/QQ1111884 )说的 ...
Sql中存在斜杠“/”怎么办？
比如下面的语句 select concat(name,'/',description) from table1 这样的语句在数据库访问工具中执行没问题,到java中就报错. 解决办法也很简单,用单引号 ...
[Android] android:visibility属性应用
在Android开发中,有这样一种场景: 当edittext中有内容时候,旁边的提交或确定按钮显示: 当edittext中内容为空的时候,提交或确定按钮隐藏: 要实现起来其实并不困难.很多控件都有vi ...
【Java】Java_11运算符
1.运算符(operator) Java 语言支持如下运算符: 算术运算符: +,-,*,/,%,++ 赋值运算符 = 关系运算符: >,<,>=,<=,==,!= i ...
C语言-十进制转换为二进制函数
char * itobs(int num, char * str) { int i; * sizeof(int); ; i >= ; i--, num >>= ) { str[i] ...
窥探try ... catch与__try ... __except的区别
VC中的这两个东西肯定谁都用过, 不过它们之间有什么区别, 正好有时间研究了一下, 如果有错误欢迎拍砖.基于VC2005, 32位XP 平台测试通过. 估计对于其他版本的VC和操作系统是不通用的. 1 ...

spark通过合理设置spark.default.parallelism参数提高执行效率

Level of Parallelism

spark通过合理设置spark.default.parallelism参数提高执行效率的更多相关文章

随机推荐

热门专题