From https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html

For tuning and troubleshooting, it's often necessary to know how many paritions an RDD represents. There are a few ways to find this information:

View Task Execution Against Partitions Using the UI

When a stage executes, you can see the number of partitions for a given stage in the Spark UI. For example, the following simple job creates an RDD of 100 elements across 4 partitions, then distributes a dummy map task before collecting the elements back to the driver program:

scala> val someRDD = sc.parallelize(1 to 100, 4)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 scala> someRDD.map(x => x).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

In Spark's application UI, you can see from the following screenshot that the "Total Tasks" represents the number of partitions:

View Partition Caching Using the UI

When persisting (a.k.a. caching) RDDs, it's useful to understand how many partitions have been stored. The example below is identical to the one prior, except that we'll now cache the RDD prior to processing it. After this completes, we can use the UI to understand what has been stored from this operation.

scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12 scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

Note from the screenshot that there are four partitions cached.

Inspect RDD Partitions Programatically

In the Scala API, an RDD holds a reference to it's Array of partitions, which you can use to find out how many partitions there are:

scala> val someRDD = sc.parallelize(1 to 100, 30)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 scala> someRDD.partitions.size
res0: Int = 30

In the python API, there is a method for explicitly listing the number of partitions:

In [1]: someRDD = sc.parallelize(range(101),30)

In [2]: someRDD.getNumPartitions()
Out[2]: 30

Note in the examples above, the number of partitions was intentionally set to 30 upon initialization.

How Many Partitions Does An RDD Have的更多相关文章

  1. Spark核心概念之RDD

    RDD: Resilient Distributed Dataset RDD的特点: 1.A list of partitions       一系列的分片:比如说64M一片:类似于Hadoop中的s ...

  2. RDD的依赖关系

    RDD的依赖关系 Rdd之间的依赖关系通过rdd中的getDependencies来进行表示, 在提交job后,会通过在DAGShuduler.submitStage-->getMissingP ...

  3. RDD.scala(源码)

    ---- map. --- flatMap.fliter.distinct.repartition.coalesce.sample.randomSplit.randomSampleWithRange. ...

  4. Spark函数详解系列之RDD基本转换

    摘要:   RDD:弹性分布式数据集,是一种特殊集合 ‚ 支持多种来源 ‚ 有容错机制 ‚ 可以被缓存 ‚ 支持并行操作,一个RDD代表一个分区里的数据集   RDD有两种操作算子:         ...

  5. Spark编程模型及RDD操作

    转载自:http://blog.csdn.net/liuwenbo0920/article/details/45243775 1. Spark中的基本概念 在Spark中,有下面的基本概念.Appli ...

  6. 【原创】大数据基础之Spark(4)RDD原理及代码解析

    一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...

  7. Spark源码系列:RDD repartition、coalesce 对比

    在上一篇文章中 Spark源码系列:DataFrame repartition.coalesce 对比 对DataFrame的repartition.coalesce进行了对比,在这篇文章中,将会对R ...

  8. 【Spark-core学习之二】 RDD和算子

    环境 虚拟机:VMware 10 Linux版本:CentOS-6.5-x86_64 客户端:Xshell4 FTP:Xftp4 jdk1.8 scala-2.10.4(依赖jdk1.8) spark ...

  9. spark 算子之RDD

    map map(func) Return a new distributed dataset formed by passing each element of the source through ...

随机推荐

  1. 判断list数组里的json对象有无重复,有则去重留1个

    查找有无重复的 var personLength = [{ certType: '2015-10-12', certCode:'Apple'}, { certType: '2015-10-12', c ...

  2. GDI 直线和折线(6)

    设置开始点 MoveToEx 函数用于移动画笔到指定的位置: BOOL MoveToEx( HDC hdc, // 设备环境句柄 int X, // 要移动到的 x 坐标 int Y, // 要移动到 ...

  3. mysql 插入更新在一条sql ON DUPLICATE KEY UPDATE

    有时候需要进行数据操作的,如果有数据则更新数据, 没有数据则插入. 以往的做法是先查询,再根据查询结果进行判断,执行插入或更新操作 其实 有一种 ON DUPLICATE KEY UPDATE 语法, ...

  4. 00065字符串缓冲区_StringBuilder类

    1.StringBuilder类,它也是字符串缓冲区,StringBuilder与它和StringBuffer的有什么不同呢? 它一个可变的字符序列.此类提供一个与 StringBuffer 兼容的 ...

  5. Ajax发送GET和POST请求案例

    使用ajax实现菜单联动 通常情况下,GET请求用于从服务器上获取数据,POST请求用于向服务器发送数据. 需求:选择第一个下拉框的值,根据第一个下拉框的值显示第二个下拉框的值 首先使用GET方式. ...

  6. POJ 1948

    这道题我记得是携程比赛上的一道. 开始时想直接设面积,但发现不可以,改设能否构成三角形.设dp[i][j][k]为前i根木棍构成边长为j和k的三角形,那么转移可以为dp[i][j][k]=dp[i-1 ...

  7. NAT&amp;Port Forwarding&amp;Port Triggering

    NAT     Nat,网络地址转换协议.主要功能是实现局域网内的本地主机与外网通信.     在连接外网时,内部Ip地址须要转换为网关(一般为路由器Ip地址)(port号也须要对应的转换)     ...

  8. Android应用之——自己定义控件ToggleButton

    我们经常会看到非常多优秀的app上面都有一些非常美丽的控件,用户体验非常好.比方togglebutton就是一个非常好的样例,IOS系统以下那个精致的togglebutton现在在android以下也 ...

  9. iis browse的时候,直接通过本地的局域网ip打开页面

    http://www.codepal.co.uk/show/make_IIS_work_with_local_IP_addresses_instead_of_localhost 只需要设置一下webs ...

  10. 使用Chrome插件Postman进行简单的Get/Post测试

    转自:https://blog.csdn.net/dearmorning/article/details/56854236 Postman插件: 一种网页调试与发送网页http请求的chrome插件, ...