How Many Partitions Does An RDD Have
From https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
For tuning and troubleshooting, it's often necessary to know how many partitions an RDD represents. There are a few ways to find this information:
View Task Execution Against Partitions Using the UI
When a stage executes, you can see the number of partitions for a given stage in the Spark UI. For example, the following simple job creates an RDD of 100 elements across 4 partitions, then runs a dummy map transformation before collecting the elements back to the driver program:
scala> val someRDD = sc.parallelize(1 to 100, 4)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
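As an aside, the second argument to parallelize is what pins the partition count to 4 here; if it's omitted, Spark falls back to spark.default.parallelism. You can check that default from the same shell (the value shown is illustrative and depends on your master and configuration):
scala> sc.defaultParallelism
res4: Int = 8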
In Spark's application UI, you can see from the following screenshot that the "Total Tasks" represents the number of partitions:
[Screenshot: the Spark UI stages page, where "Total Tasks" reads 4 for the collect stage.]
View Partition Caching Using the UI
When persisting (a.k.a. caching) RDDs, it's useful to understand how many partitions have been stored. The example below is identical to the previous one, except that we now cache the RDD before processing it. After the job completes, we can use the UI to see what has been stored by this operation.
scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
[Screenshot: the Spark UI Storage tab showing the cached "toy" RDD.]
Note from the screenshot that there are four partitions cached.
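If you'd rather check this without the UI, the same storage information is available programmatically. A minimal sketch using SparkContext.getRDDStorageInfo, which is marked a developer API and may change between Spark versions:
scala> sc.getRDDStorageInfo.filter(_.name == "toy").foreach { info =>
     |   println(s"${info.numCachedPartitions} of ${info.numPartitions} partitions cached")
     | }
4 of 4 partitions cached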
Inspect RDD Partitions Programmatically
In the Scala API, an RDD holds a reference to its array of partitions, which you can use to find out how many partitions there are:
scala> val someRDD = sc.parallelize(1 to 100, 30)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.partitions.size
res0: Int = 30
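Newer Spark releases (1.6 and later) also add a getNumPartitions method to the Scala RDD itself, mirroring the Python API shown below:
scala> someRDD.getNumPartitions
res1: Int = 30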
In the Python API, there is a method for explicitly retrieving the number of partitions:
In [1]: someRDD = sc.parallelize(range(101),30)
In [2]: someRDD.getNumPartitions()
Out[2]: 30
Note that in the examples above, the number of partitions was intentionally set to 30 when the RDD was created.
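When the count alone isn't enough, you can also check how evenly the elements are spread across those partitions. One way, sketched here against the Scala RDD above, is to glom each partition into an array and collect the per-partition sizes (the exact split depends on how parallelize slices the range):
scala> someRDD.glom().map(_.length).collect()
res2: Array[Int] = Array(3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4)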