[Original] Big Data Basics: Spark (9): How collect and take Are Implemented in Spark

In Spark, two commonly used ways to bring computation results back to the driver are collect and take. How do they differ? Let's look at the code:

org.apache.spark.rdd.RDD

  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

  /**
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.
   *
   * @note This method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   *
   * @note Due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T] = withScope {
    val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)
    if (num == 0) {
      new Array[T](0)
    } else {
      val buf = new ArrayBuffer[T]
      val totalParts = this.partitions.length
      var partsScanned = 0
      while (buf.size < num && partsScanned < totalParts) {
        // The number of partitions to try in this iteration. It is ok for this number to be
        // greater than totalParts because we actually cap it at totalParts in runJob.
        var numPartsToTry = 1L
        if (partsScanned > 0) {
          // If we didn't find any rows after the previous iteration, quadruple and retry.
          // Otherwise, interpolate the number of partitions we need to try, but overestimate
          // it by 50%. We also cap the estimation in the end.
          if (buf.isEmpty) {
            numPartsToTry = partsScanned * scaleUpFactor
          } else {
            // the left side of max is >=1 whenever partsScanned >= 2
            numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
            numPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)
          }
        }
        val left = num - buf.size
        val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
        val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
        res.foreach(buf ++= _.take(num - buf.size))
        partsScanned += p.size
      }
      buf.toArray
    }
  }

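As the code shows, collect simply runs a job over all partitions, turns each partition's result into an array, and then concatenates those arrays into a single array on the driver.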
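The collect flow can be sketched without a cluster. This is only an illustration, not Spark's code: partitions are modeled as in-memory `Seq`s, and `CollectSketch` and its `runJob` helper are hypothetical stand-ins for the RDD and `sc.runJob`.

```scala
import scala.reflect.ClassTag

object CollectSketch {
  // Hypothetical stand-in for sc.runJob: applies `func` to every partition's
  // iterator and returns one result per partition, in partition order.
  def runJob[T, U](partitions: Seq[Seq[T]], func: Iterator[T] => U): Seq[U] =
    partitions.map(p => func(p.iterator))

  // Mirrors the shape of RDD.collect: one array per partition, then a single concat.
  def collect[T: ClassTag](partitions: Seq[Seq[T]]): Array[T] = {
    val results = runJob(partitions, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
}
```

For example, `CollectSketch.collect(Seq(Seq(1, 2), Seq.empty[Int], Seq(3)))` visits all three partitions, including the empty one, before anything is returned. That is why collect always costs a full pass over the data.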

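The implementation of take is more involved. It first computes a single partition, then uses the number of rows it got back to estimate how many more partitions it needs, computes those partitions, and checks again whether it has enough rows. This is an iterative process: the simpler the computation (so that early partitions already yield enough rows) or the smaller the number passed to take, the more likely it is to be satisfied and return within the first few iterations.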
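To make the iteration visible, here is a minimal sketch that replays the same partition-estimation arithmetic on in-memory data and records which partition range each simulated job scans. It assumes partitions are plain `Seq[Int]`s; `TakeSketch` and its `jobs` log are hypothetical names introduced for illustration, and the `p.map(...)` line merely stands in for `sc.runJob`.

```scala
import scala.collection.mutable.ArrayBuffer

object TakeSketch {
  // Returns the taken elements plus the partition range scanned by each simulated job.
  def take(partitions: Seq[Seq[Int]], num: Int, scaleUpFactor: Int = 4): (Seq[Int], Seq[Range]) = {
    val buf = ArrayBuffer.empty[Int]
    val jobs = ArrayBuffer.empty[Range]
    val totalParts = partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      var numPartsToTry = 1L // the first job always scans exactly one partition
      if (partsScanned > 0) {
        if (buf.isEmpty) {
          // Nothing found so far: scale up by the factor (4 by default).
          numPartsToTry = partsScanned.toLong * scaleUpFactor
        } else {
          // Interpolate from the rows seen so far, overestimating by 50%, then cap.
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned.toLong * scaleUpFactor)
        }
      }
      val left = num - buf.size
      val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)
      // Stand-in for sc.runJob: each partition returns at most `left` rows.
      val res = p.map(i => partitions(i).iterator.take(left).toArray)
      res.foreach(a => buf ++= a.take(num - buf.size))
      jobs += p
      partsScanned += p.size
    }
    (buf.toSeq, jobs.toSeq)
  }
}
```

With ten partitions of two rows each and `num = 7`, the sketch runs two jobs: the first scans partition 0, and the second scans partitions 1 to 4 (the estimate `(1.5 * 7 * 1 / 2).toInt - 1 = 4`). Partitions 5 to 9 are never touched, which is the saving take has over collect.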
