How Many Partitions Does An RDD Have
From https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
For tuning and troubleshooting, it's often necessary to know how many paritions an RDD represents. There are a few ways to find this information:
View Task Execution Against Partitions Using the UI
When a stage executes, you can see the number of partitions for a given stage in the Spark UI. For example, the following simple job creates an RDD of 100 elements across 4 partitions, then distributes a dummy map task before collecting the elements back to the driver program:
scala> val someRDD = sc.parallelize(1 to 100, 4)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
In Spark's application UI, you can see from the following screenshot that the "Total Tasks" represents the number of partitions:

View Partition Caching Using the UI
When persisting (a.k.a. caching) RDDs, it's useful to understand how many partitions have been stored. The example below is identical to the one prior, except that we'll now cache the RDD prior to processing it. After this completes, we can use the UI to understand what has been stored from this operation.
scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
Note from the screenshot that there are four partitions cached.

Inspect RDD Partitions Programatically
In the Scala API, an RDD holds a reference to it's Array of partitions, which you can use to find out how many partitions there are:
scala> val someRDD = sc.parallelize(1 to 100, 30)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.partitions.size
res0: Int = 30
In the python API, there is a method for explicitly listing the number of partitions:
In [1]: someRDD = sc.parallelize(range(101),30)
In [2]: someRDD.getNumPartitions()
Out[2]: 30
Note in the examples above, the number of partitions was intentionally set to 30 upon initialization.
How Many Partitions Does An RDD Have的更多相关文章
- Spark核心概念之RDD
RDD: Resilient Distributed Dataset RDD的特点: 1.A list of partitions 一系列的分片:比如说64M一片:类似于Hadoop中的s ...
- RDD的依赖关系
RDD的依赖关系 Rdd之间的依赖关系通过rdd中的getDependencies来进行表示, 在提交job后,会通过在DAGShuduler.submitStage-->getMissingP ...
- RDD.scala(源码)
---- map. --- flatMap.fliter.distinct.repartition.coalesce.sample.randomSplit.randomSampleWithRange. ...
- Spark函数详解系列之RDD基本转换
摘要: RDD:弹性分布式数据集,是一种特殊集合 ‚ 支持多种来源 ‚ 有容错机制 ‚ 可以被缓存 ‚ 支持并行操作,一个RDD代表一个分区里的数据集 RDD有两种操作算子: ...
- Spark编程模型及RDD操作
转载自:http://blog.csdn.net/liuwenbo0920/article/details/45243775 1. Spark中的基本概念 在Spark中,有下面的基本概念.Appli ...
- 【原创】大数据基础之Spark(4)RDD原理及代码解析
一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...
- Spark源码系列:RDD repartition、coalesce 对比
在上一篇文章中 Spark源码系列:DataFrame repartition.coalesce 对比 对DataFrame的repartition.coalesce进行了对比,在这篇文章中,将会对R ...
- 【Spark-core学习之二】 RDD和算子
环境 虚拟机:VMware 10 Linux版本:CentOS-6.5-x86_64 客户端:Xshell4 FTP:Xftp4 jdk1.8 scala-2.10.4(依赖jdk1.8) spark ...
- spark 算子之RDD
map map(func) Return a new distributed dataset formed by passing each element of the source through ...
随机推荐
- Disconf使用简单Demo
创建配置文件 在敲Demo之前,需要在Disconf上创建自己的APP,然后在APP的某个环境下创建配置文件,如下面截图中的流程,这里就简单创建了一个redis.properties,内容是redis ...
- Python笔记9-----不等长列表转化成DataFrame
1.不同长度的列表合并成DataFrame. 法1: ntest=['a','b'] ltest=[[1,2],[4,5,6]] 先变成等长的列表:(a:1),(a:2),(b:4),(b:5),(b ...
- C++进阶 STL(3) 第三天 函数对象适配器、常用遍历算法、常用排序算法、常用算数生成算法、常用集合算法、 distance_逆序遍历_修改容器元素
01昨天课程回顾 02函数对象适配器 函数适配器是用来让一个函数对象表现出另外一种类型的函数对象的特征.因为,许多情况下,我们所持有的函数对象或普通函数的参数个数或是返回值类型并不是我们想要的,这时候 ...
- 使用kubeadm在CentOS上搭建Kubernetes1.14.3集群
练习环境说明:参考1 参考2 主机名称 IP地址 部署软件 备注 M-kube12 192.168.10.12 master+etcd+docker+keepalived+haproxy master ...
- NOIP2018提高组省一冲奖班模测训练(五)
NOIP2018提高组省一冲奖班模测训练(五) http://www.51nod.com/Contest/ContestDescription.html#!#contestId=79 今天有点浪…… ...
- redis helloworld
一.启动 redis 服务 [root@MyLinux bin]# ./redis-server redis.conf 二.使用客户端连接服务 [root@MyLinux bin]# ./redis- ...
- BA--无风机冷却塔
无风机冷却塔的优点 1.无运转振动噪音传统冷却塔噪声源为冷却风扇马达运转所产生,其诱发之塔体振动,具有噪音共振加强性,为有效抑制振动之传导,须加装防震铁架及避震器.因冷却方式不同,LFC-N型无风机科 ...
- wifi共享精灵2014.04.25.001已经更新,wifi热点中文名走起!
五一回来后,有个惊喜,wifi共享精灵有了最新动向.不晓得wifi共享精灵是啥的朋友,我来解释下,它就相当于一个无线路由器.说起来,Wifi共享精灵正式版2014.04.25.001(http://w ...
- leetcode_Isomorphic Strings _easy
Given two strings s and t, determine if they are isomorphic. Two strings are isomorphic if the chara ...
- 小米2S电池电量用尽充电无法开机解决方法
背景: 昨晚睡觉前关机,记得电量还有百分之七八十,但早上起床后,指示灯一直红灯闪烁.按开机键和其它键都没反应! ! 解决方法: 扣下电池,用万能充冲电,略微多冲一会,由于 ...