spark通过合理设置spark.default.parallelism参数提高执行效率

spark中有partition的概念（和slice是同一个概念，在spark1.2中官网已经做出了说明），一般每个partition对应一个task。在我的测试过程中，如果没有设置spark.default.parallelism参数，spark计算出来的partition非常巨大，与我的cores非常不搭。我在两台机器上（8cores *2 +6g * 2）上，spark计算出来的partition达到2.8万个，也就是2.9万个tasks，每个task完成时间都是几毫秒或者零点几毫秒，执行起来非常缓慢。在我尝试设置了 spark.default.parallelism 后，任务数减少到10，执行一次计算过程从minute降到20second。

参数可以通过spark_home/conf/spark-default.conf配置文件设置。

eg.

 spark.master                       spark://master:7077

 spark.default.parallelism

 spark.driver.memory                2g

 spark.serializer                   org.apache.spark.serializer.KryoSerializer

 spark.sql.shuffle.partitions

Property Name Default Meaning

Property Name	Default	Meaning
`spark.default.parallelism`	For distributed shuffle operations like `reduceByKey` and `join`, the largest number of partitions in a parent RDD. For operations like`parallelize` with no parent RDDs, it depends on the cluster manager: Local mode: number of cores on the local machine Mesos fine grained mode: 8 Others: total number of cores on all executor nodes or 2, whichever is larger	Default number of partitions in RDDs returned by transformations like `join`, `reduceByKey`, and `parallelize` when not set by user.

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations likeparallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

from:http://spark.apache.org/docs/latest/tuning.html

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config propertyspark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.

spark通过合理设置spark.default.parallelism参数提高执行效率的更多相关文章

Eclipse:设置自动补全，提高编程效率
一.设置自动补全 1.进入eclipse的window里的perferences页面 2.找到java->Editor->Content Assist设置界面 3.在Auto activa ...
spark系列-7、spark调优
官网说明:http://spark.apache.org/docs/2.1.1/tuning.html#data-serialization 一.JVM调优 1.1.Java虚拟机垃圾回收调优的背景 ...
spark提交命令 spark-submit 的参数 executor-memory、executor-cores、num-executors、spark.default.parallelism分析
转载:https://blog.csdn.net/zimiao552147572/article/details/96482120 nohup spark-submit --master yarn - ...
spark.sql.shuffle.partitions和spark.default.parallelism的区别
在关于spark任务并行度的设置中,有两个参数我们会经常遇到,spark.sql.shuffle.partitions 和 spark.default.parallelism, 那么这两个参数到底有什 ...
[Spark]What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism?
From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used ...
streaming优化：spark.default.parallelism调整处理并行度
官方是这么说的: Cluster resources can be under-utilized if the number of parallel tasks used in any stage o ...
【Spark调优】提交job资源参数调优
[场景] Spark提交作业job的时候要指定该job可以使用的CPU.内存等资源参数,生产环境中,任务资源分配不足会导致该job执行中断.失败等问题,所以对Spark的job资源参数分配调优非常重要 ...
【Spark调优】内存模型与参数调优
[Spark内存模型] Spark在一个executor中的内存分为3块:storage内存.execution内存.other内存. 1. storage内存:存储broadcast,cache,p ...
[spark]-Spark2.x集群搭建与参数详解
在前面的Spark发展历程和基本概念中介绍了Spark的一些基本概念,熟悉了这些基本概念对于集群的搭建是很有必要的.我们可以了解到每个参数配置的作用是什么.这里将详细介绍Spark集群搭建以及xml参 ...

随机推荐

php之变量覆盖漏洞讲解
1.变量没有初始化的问题(1): wooyun连接1:[link href="WooYun: PHPCMS V9 member表内容随意修改漏洞"]tenzy[/link] $up ...
Java程序员到架构师的推荐阅读书籍
作为Java程序员来说,最痛苦的事情莫过于可以选择的范围太广,可以读的书太多,往往容易无所适从.我想就我自己读过的技术书籍中挑选出来一些,按照学习的先后顺序,推荐给大家,特别是那些想不断提高自己技术水 ...
算法导论-求x的n次方
目录 1.分治求x的n次方思路 2.c++代码实现内容 1.分治求x的n次方思路T(n)=Θ(lgn) 为了计算乘方数a^n,传统的做法(所谓的Naive algorithm)就是循环相乘n次,算法 ...
ARP协议具体解释之ARP动态与静态条目的生命周期
ARP协议详细解释之ARP动态与静态条目的生命周期 ARP动态条目的生命周期动态条目随时间推移自己主动加入和删除. q 每一个动态ARP缓存条目默认的生命周期是两分钟.当超过两分钟,该条目会被删掉 ...
Myeclipse中文件已经上传到server文件夹下，文件也没有被占用，可是页面中无法读取和使用问题的解决方法
这个问题是因为Myeclipse中文件不同步引起的.在Myeclipse中,project文件是由Myeclipse自己主动扫描加入的,假设在外部改动了project文件夹中的文件但又关闭了自己主动刷 ...
idea tomcat 怎样出现update classes and resources
idea Tomcat 出现update classes and resources 出现热加载正确配置应该是这个在 Deployment (调度,部署) 中点击 + 选择war explored ...
DevExpress.Build
using System.Collections.Generic; using Microsoft.Build.AppxPackage; using Microsoft.Build.Framework ...
POJ3264 Balanced Lineup 【线段树】+【单点更新】
Balanced Lineup Time Limit: 5000MS Memory Limit: 65536K Total Submissions: 32778 Accepted: 15425 ...
倍福TwinCAT(贝福Beckhoff)基础教程7.1 TwinCAT如何简单执行NC功能块 TC2
TC2的程序是在TC3的基础上稍作调整,只说明不同点,请先看TC3的. TC2中的一个原本是AXIS_REF类型变量被拆成了两个(PLCTONC_AXLESTRUCT和NCTOPLC_AXLESTRU ...
Java8 对多个异步任务进行流水线操作(笔记)
现在我们要对商店商品进行折扣服务.每个折扣代码对应不同的折扣率,使用一个枚举变量Discount.Code来实现这一想法,具体代码如下所示. 以枚举类型定义的折扣代码 /** * 折扣服务api * ...

spark通过合理设置spark.default.parallelism参数提高执行效率

Level of Parallelism

spark通过合理设置spark.default.parallelism参数提高执行效率的更多相关文章

随机推荐

热门专题