How Many Partitions Does An RDD Have
From https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
For tuning and troubleshooting, it's often necessary to know how many paritions an RDD represents. There are a few ways to find this information:
View Task Execution Against Partitions Using the UI
When a stage executes, you can see the number of partitions for a given stage in the Spark UI. For example, the following simple job creates an RDD of 100 elements across 4 partitions, then distributes a dummy map task before collecting the elements back to the driver program:
scala> val someRDD = sc.parallelize(1 to 100, 4)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res1: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
In Spark's application UI, you can see from the following screenshot that the "Total Tasks" represents the number of partitions:

View Partition Caching Using the UI
When persisting (a.k.a. caching) RDDs, it's useful to understand how many partitions have been stored. The example below is identical to the one prior, except that we'll now cache the RDD prior to processing it. After this completes, we can use the UI to understand what has been stored from this operation.
scala> someRDD.setName("toy").cache
res2: someRDD.type = toy ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.map(x => x).collect
res3: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
Note from the screenshot that there are four partitions cached.

Inspect RDD Partitions Programatically
In the Scala API, an RDD holds a reference to it's Array of partitions, which you can use to find out how many partitions there are:
scala> val someRDD = sc.parallelize(1 to 100, 30)
someRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12
scala> someRDD.partitions.size
res0: Int = 30
In the python API, there is a method for explicitly listing the number of partitions:
In [1]: someRDD = sc.parallelize(range(101),30)
In [2]: someRDD.getNumPartitions()
Out[2]: 30
Note in the examples above, the number of partitions was intentionally set to 30 upon initialization.
How Many Partitions Does An RDD Have的更多相关文章
- Spark核心概念之RDD
RDD: Resilient Distributed Dataset RDD的特点: 1.A list of partitions 一系列的分片:比如说64M一片:类似于Hadoop中的s ...
- RDD的依赖关系
RDD的依赖关系 Rdd之间的依赖关系通过rdd中的getDependencies来进行表示, 在提交job后,会通过在DAGShuduler.submitStage-->getMissingP ...
- RDD.scala(源码)
---- map. --- flatMap.fliter.distinct.repartition.coalesce.sample.randomSplit.randomSampleWithRange. ...
- Spark函数详解系列之RDD基本转换
摘要: RDD:弹性分布式数据集,是一种特殊集合 ‚ 支持多种来源 ‚ 有容错机制 ‚ 可以被缓存 ‚ 支持并行操作,一个RDD代表一个分区里的数据集 RDD有两种操作算子: ...
- Spark编程模型及RDD操作
转载自:http://blog.csdn.net/liuwenbo0920/article/details/45243775 1. Spark中的基本概念 在Spark中,有下面的基本概念.Appli ...
- 【原创】大数据基础之Spark(4)RDD原理及代码解析
一 简介 spark核心是RDD,官方文档地址:https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-di ...
- Spark源码系列:RDD repartition、coalesce 对比
在上一篇文章中 Spark源码系列:DataFrame repartition.coalesce 对比 对DataFrame的repartition.coalesce进行了对比,在这篇文章中,将会对R ...
- 【Spark-core学习之二】 RDD和算子
环境 虚拟机:VMware 10 Linux版本:CentOS-6.5-x86_64 客户端:Xshell4 FTP:Xftp4 jdk1.8 scala-2.10.4(依赖jdk1.8) spark ...
- spark 算子之RDD
map map(func) Return a new distributed dataset formed by passing each element of the source through ...
随机推荐
- Linux Shell脚本编程-基础1
概述: shell脚本在Linux系统管理员的运维工作中非常重要.shell脚本能够帮助我们很方便的管理服务器,因为我们可以指定一个任务计划,定时的去执行某一个脚本以满足我们的需求.本篇将从编程基础 ...
- Mysql错误:#1054 - Unknown column '字段名' in 'field list'
# 1054 - Unknown column '字段名' in 'field list' 第一个就是你的表中没有这个字段 另一个就是你的这个字段前后可能有空格!!!,去掉空格即可!
- ActiveMQ 整合 spring
一.添加 jar 包 <dependency> <groupId>org.apache.activemq</groupId> <artifactId>a ...
- POJ 2470 Ambiguous permutations(简单题 理解题意)
[题目简述]:事实上就是依据题目描写叙述:A permutation of the integers 1 to n is an ordering of these integers. So the n ...
- 移动端js手指滑动事件初体验
今天在公司遇到做一个移动端手指滑动的效果,刚開始用了swiper.js插件发现效果不好(文字存在模糊效果).后来查了一些资料,自己手写了一个使用原生js写的滑动效果. 以下直接上代码: <!do ...
- HDU 2732 Leapin' Lizards(拆点+最大流)
HDU 2732 Leapin' Lizards 题目链接 题意:有一些蜥蜴在一个迷宫里面,有一个跳跃力表示能跳到多远的柱子,然后每根柱子最多被跳一定次数,求这些蜥蜴还有多少是不管怎样都逃不出来的. ...
- 【数字图像处理】六.MFC空间几何变换之图像平移、镜像、旋转、缩放具体解释
本文主要讲述基于VC++6.0 MFC图像处理的应用知识,主要结合自己大三所学课程<数字图像处理>及课件进行解说,主要通过MFC单文档视图实现显示BMP图片空间几何变换.包含图像平移.图形 ...
- TeamTalk Android代码分析(业务流程篇)---消息发送和接收的整体逻辑说明
第一次纪录东西,也没有特别的顺序,想到哪里就随手画了一下,后续会继续整理- 6.2消息页面动作流程 6.2.1 消息页面初始化的总体思路 1.页面数据的填充更新直接由页面主线程从本地数据库请求 2.数 ...
- Oozie4.2.0配置安装实战
软件版本号: Oozie4.2.0.Hadoop2.6.0,Spark1.4.1.Hive0.14.Pig0.15.0.Maven3.2.JDK1.7,zookeeper3.4.6.HBase1.1. ...
- jenkins下载
前置条件:需要有java环境的jdk 一.安装使用 下载地址:https://jenkins-ci.org/content/thank-you-downloading-windows-installe ...