From https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html

For tuning and troubleshooting, it's often necessary to know how many partitions an RDD represents. There are a few ways to find this information:

View Task Execution Against Partitions Using the UI

When a stage executes, you can see the number of partitions for a given stage in the Spark UI. For example, the following simple job creates an RDD of 100 elements across 4 partitions, then distributes a dummy map task before collecting the elements back to the driver program:

scala> val someRDD = sc.parallelize(1 to 100, 4)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> someRDD.map(x => x).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

In Spark's application UI, the "Total Tasks" count shown for the stage corresponds to the number of partitions; in this case it is four.
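If you'd rather verify the distribution without opening the UI, here is a small sketch (assuming the same spark-shell session and RDD as above; the exact res numbers may differ) that counts the elements landing in each partition using mapPartitionsWithIndex:

scala> someRDD.mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size))).collect
res2: Array[(Int, Int)] = Array((0,25), (1,25), (2,25), (3,25))

Each tuple is (partition index, element count), so four tuples confirm the four partitions.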

View Partition Caching Using the UI

When persisting (a.k.a. caching) RDDs, it's useful to understand how many partitions have been stored. The example below is identical to the previous one, except that we now cache the RDD before processing it. After this completes, we can use the UI to see what has been stored by this operation.

scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12

scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

In the Storage tab of the UI, you can see that all four partitions of the "toy" RDD have been cached.
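You can also query this storage information from within the shell. A minimal sketch, assuming the cached "toy" RDD from the session above (sc.getRDDStorageInfo is a developer API, so its exact fields may vary between Spark versions):

scala> sc.getRDDStorageInfo.foreach { info =>
     |   println(s"${info.name}: ${info.numCachedPartitions} of ${info.numPartitions} partitions cached")
     | }
toy: 4 of 4 partitions cached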

Inspect RDD Partitions Programmatically

In the Scala API, an RDD holds a reference to its array of partitions, which you can use to find out how many partitions there are:

scala> val someRDD = sc.parallelize(1 to 100, 30)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> someRDD.partitions.size
res0: Int = 30
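In more recent Spark releases (1.6.0 and later), the Scala API also exposes this directly via getNumPartitions, so the same check can be written as the following sketch (assuming such a version):

scala> someRDD.getNumPartitions
res1: Int = 30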

In the Python API, there is a method that explicitly returns the number of partitions:

In [1]: someRDD = sc.parallelize(range(101),30)

In [2]: someRDD.getNumPartitions()
Out[2]: 30

Note that in the examples above, the number of partitions was intentionally set to 30 at initialization.
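If the partition count is not specified, parallelize falls back to the scheduler's default parallelism, which you can also inspect. A short sketch (assuming a spark-shell session; the actual value depends on your master and configuration, e.g. the number of cores in local mode):

scala> sc.defaultParallelism
res4: Int = 8

scala> sc.parallelize(1 to 100).partitions.size
res5: Int = 8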
