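Introduction

When a typical Spark job reads a Parquet external table stored on OSS, how is the parallelism (tasks/partitions) on the source side determined? The question comes up in particular when running TPCH tests: how is the parallelism of the file scan decided? Does one Parquet file map to one partition? Do several Parquet files map to one partition? Or does one Parquet file map to several partitions? This article answers these questions by walking through the source code.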

Analysis

The physical execution node that reads a file data source is FileSourceScanExec. The code that reads the data is as follows:

lazy val inputRDD: RDD[InternalRow] = {
  val readFile: (PartitionedFile) => Iterator[InternalRow] =
    relation.fileFormat.buildReaderWithPartitionValues(
      sparkSession = relation.sparkSession,
      dataSchema = relation.dataSchema,
      partitionSchema = relation.partitionSchema,
      requiredSchema = requiredSchema,
      filters = pushedDownFilters,
      options = relation.options,
      hadoopConf = relation.sparkSession.sessionState.newHadoopConfWithOptions(relation.options))

  val readRDD = if (bucketedScan) {
    createBucketedReadRDD(relation.bucketSpec.get, readFile, dynamicallySelectedPartitions,
      relation)
  } else {
    createReadRDD(readFile, dynamicallySelectedPartitions, relation)
  }
  sendDriverMetrics()
  readRDD
}

We focus on the non-bucketed case. A non-bucketed scan calls the createReadRDD method, which is defined as follows:

/**
 * Create an RDD for non-bucketed reads.
 * The bucketed variant of this function is [[createBucketedReadRDD]].
 *
 * @param readFile a function to read each (part of a) file.
 * @param selectedPartitions Hive-style partition that are part of the read.
 * @param fsRelation [[HadoopFsRelation]] associated with the read.
 */
private def createReadRDD(
    readFile: (PartitionedFile) => Iterator[InternalRow],
    selectedPartitions: Array[PartitionDirectory],
    fsRelation: HadoopFsRelation): RDD[InternalRow] = {
  // Estimated cost of opening a file, expressed as the number of bytes
  // that could be scanned in the same time
  val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
  // Maximum size of a single split
  val maxSplitBytes =
    FilePartition.maxSplitBytes(fsRelation.sparkSession, selectedPartitions)
  logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
    s"open cost is considered as scanning $openCostInBytes bytes.")

  // Filter files with bucket pruning if possible
  val bucketingEnabled = fsRelation.sparkSession.sessionState.conf.bucketingEnabled
  val shouldProcess: Path => Boolean = optionalBucketSet match {
    case Some(bucketSet) if bucketingEnabled =>
      // Do not prune the file if bucket file name is invalid
      filePath => BucketingUtils.getBucketId(filePath.getName).forall(bucketSet.get)
    case _ =>
      _ => true
  }

  // Split the files under each partition, then sort the splits from largest to smallest
  val splitFiles = selectedPartitions.flatMap { partition =>
    partition.files.flatMap { file =>
      // getPath() is very expensive so we only want to call it once in this block:
      val filePath = file.getPath
      if (shouldProcess(filePath)) {
        // Whether the file can be split; Parquet, ORC, and Avro files are all splittable
        val isSplitable = relation.fileFormat.isSplitable(
          relation.sparkSession, relation.options, filePath)
        // Split the file
        PartitionedFileUtil.splitFiles(
          sparkSession = relation.sparkSession,
          file = file,
          filePath = filePath,
          isSplitable = isSplitable,
          maxSplitBytes = maxSplitBytes,
          partitionValues = partition.values
        )
      } else {
        Seq.empty
      }
    }
  }.sortBy(_.length)(implicitly[Ordering[Long]].reverse)

  val partitions =
    FilePartition.getFilePartitions(relation.sparkSession, splitFiles, maxSplitBytes)

  new FileScanRDD(fsRelation.sparkSession, readFile, partitions)
}

As the code shows, the maximum split size maxSplitBytes largely decides how many splits the files are later cut into. Its core logic is as follows:

def maxSplitBytes(
    sparkSession: SparkSession,
    selectedPartitions: Seq[PartitionDirectory]): Long = {
  // Maximum number of bytes to pack into a single partition when reading files;
  // defaults to 128MB, the size of one block
  val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
  // Estimated cost of opening each file; defaults to 4MB
  val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
  // Suggested (not guaranteed) minimum number of split file partitions; unset by default,
  // in which case it falls back to leafNodeDefaultParallelism.
  // Call chain: SparkSession#leafNodeDefaultParallelism -> SparkContext#defaultParallelism
  // -> TaskSchedulerImpl#defaultParallelism -> CoarseGrainedSchedulerBackend#defaultParallelism
  // -> max(total executor cores, 2), so at least 2
  val minPartitionNum = sparkSession.sessionState.conf.filesMinPartitionNum
    .getOrElse(sparkSession.leafNodeDefaultParallelism)
  // Total bytes to read, with the open cost added for every file
  val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
  // Bytes to read per core
  val bytesPerCore = totalBytes / minPartitionNum

  // Resulting split size, never larger than the configured 128MB
  Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
}
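
To make the clamping behaviour easier to see, here is a minimal standalone sketch of the same arithmetic on plain numbers. It is not Spark code: the object name MaxSplitBytesSketch and the parameter list are hypothetical, and the two defaults simply restate the 128MB and 4MB values quoted above (in Spark they come from the settings spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes, with spark.sql.files.minPartitionNum supplying the minimum partition number when it is set).

```scala
object MaxSplitBytesSketch {
  val MB: Long = 1024L * 1024L

  /** Same arithmetic as FilePartition.maxSplitBytes, on plain file sizes (hypothetical helper). */
  def maxSplitBytes(
      fileSizes: Seq[Long],
      minPartitionNum: Int,
      defaultMaxSplitBytes: Long = 128 * MB, // assumed default, as quoted in the article
      openCostInBytes: Long = 4 * MB         // assumed default, as quoted in the article
  ): Long = {
    // Every file is charged its own length plus the open cost.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / minPartitionNum
    // Clamp bytesPerCore into the range [openCostInBytes, defaultMaxSplitBytes].
    math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

Three cases show the three possible outcomes: many tiny files on few cores hit the 128MB ceiling, a single small file on many cores hits the 4MB floor, and eight 16MB files on 10 cores land in between at exactly 16MB, because (8 * (16MB + 4MB)) / 10 = 16MB.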

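The core logic of PartitionedFileUtil#splitFiles is shown below. It is fairly simple: a splittable file larger than the maximum split size is cut into consecutive pieces of at most that size.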

def splitFiles(
    sparkSession: SparkSession,
    file: FileStatus,
    filePath: Path,
    isSplitable: Boolean,
    maxSplitBytes: Long,
    partitionValues: InternalRow): Seq[PartitionedFile] = {
  if (isSplitable) {
    // Cut the file into multiple splits
    (0L until file.getLen by maxSplitBytes).map { offset =>
      val remaining = file.getLen - offset
      val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
      val hosts = getBlockHosts(getBlockLocations(file), offset, size)
      PartitionedFile(partitionValues, filePath.toUri.toString, offset, size, hosts)
    }
  } else {
    Seq(getPartitionedFile(file, filePath, partitionValues))
  }
}
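
The offset arithmetic can be tried in isolation with a small sketch. The Split case class and the splitFile helper below are hypothetical stand-ins for PartitionedFile and splitFiles; block locations and partition values are left out, and maxSplitBytes is assumed to be positive.

```scala
object SplitFilesSketch {
  /** One byte range of a file; a simplified stand-in for Spark's PartitionedFile. */
  final case class Split(path: String, start: Long, length: Long)

  /** Cuts a file of fileLen bytes into ranges of at most maxSplitBytes (must be > 0). */
  def splitFile(
      path: String,
      fileLen: Long,
      maxSplitBytes: Long,
      isSplitable: Boolean): Seq[Split] =
    if (isSplitable) {
      // Offsets 0, maxSplitBytes, 2 * maxSplitBytes, ... strictly below fileLen.
      (0L until fileLen by maxSplitBytes).map { offset =>
        val remaining = fileLen - offset
        // Only the last range can be shorter than maxSplitBytes.
        Split(path, offset, math.min(remaining, maxSplitBytes))
      }
    } else {
      // A non-splittable file is always read as one range.
      Seq(Split(path, 0L, fileLen))
    }
}
```

For example, a 250-byte file with a 100-byte limit becomes the ranges (0, 100), (100, 100), and (200, 50), while the same file marked as non-splittable stays a single 250-byte range.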

Obtaining the Seq[PartitionedFile] list does not complete the planning. FilePartition#getFilePartitions still has to pack the splits into partitions. Its core logic is as follows:

def getFilePartitions(
    sparkSession: SparkSession,
    partitionedFiles: Seq[PartitionedFile],
    maxSplitBytes: Long): Seq[FilePartition] = {
  val partitions = new ArrayBuffer[FilePartition]
  val currentFiles = new ArrayBuffer[PartitionedFile]
  var currentSize = 0L

  /** Close the current partition and move to the next. */
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) {
      // Copy to a new Array.
      // Build a new FilePartition from the accumulated files
      val newPartition = FilePartition(partitions.size, currentFiles.toArray)
      partitions += newPartition
    }
    currentFiles.clear()
    currentSize = 0
  }

  // Estimated cost of opening a file; defaults to 4MB
  val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
  // Assign files to partitions using "Next Fit Decreasing"
  partitionedFiles.foreach { file =>
    if (currentSize + file.length > maxSplitBytes) {
      // If adding this file would exceed the maximum split size, close the current
      // partition; this completes the data assigned to one task
      closePartition()
    }
    // Add the given file to the current partition.
    currentSize += file.length + openCostInBytes
    currentFiles += file
  }
  // Close the last partition, which may be smaller than the others
  closePartition()
  partitions.toSeq
}

This step merges small files into FilePartition objects of up to roughly maxSplitBytes each, which avoids launching a large number of tasks that each read only a tiny file.
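
The packing step can also be sketched on bare split lengths. The helper below is a hypothetical simplification, not Spark's API: it works on Long lengths instead of PartitionedFile objects and performs the descending sort itself, whereas in Spark the sort happens in createReadRDD before getFilePartitions is called.

```scala
import scala.collection.mutable.ArrayBuffer

object FilePackingSketch {
  /** Packs split lengths into partitions with "Next Fit Decreasing" (hypothetical helper). */
  def pack(
      splitLengths: Seq[Long],
      maxSplitBytes: Long,
      openCostInBytes: Long): Seq[Seq[Long]] = {
    val partitions = ArrayBuffer.empty[Seq[Long]]
    val current = ArrayBuffer.empty[Long]
    var currentSize = 0L

    // Emit the current partition (if it holds anything) and start a new one.
    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current.toList
      current.clear()
      currentSize = 0L
    }

    // Largest splits first.
    splitLengths.sortBy(len => -len).foreach { len =>
      // The check uses the raw length; the open cost is only charged once the split is added.
      if (currentSize + len > maxSplitBytes) closePartition()
      currentSize += len + openCostInBytes
      current += len
    }
    closePartition()
    partitions.toList
  }
}
```

With lengths 10, 60, 30, 50, 20, a limit of 100, and an open cost of 5, the result is three partitions: (60), (50, 30), and (20, 10). Because the algorithm is next fit rather than first fit, a closed partition is never revisited, so the 10 cannot go back into the first partition even though it would fit there.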

The parallelism of the resulting FileScanRDD (new FileScanRDD(fsRelation.sparkSession, readFile, partitions)) is the length of partitions, which is also the number of tasks Spark finally generates:

override protected def getPartitions: Array[RDDPartition] = filePartitions.toArray

[Figure: overall flow chart]

[Figure: the split and merge process]

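Practice

Consider the customer Parquet table generated by TPCH at the 10G scale:

https://oss.console.aliyun.com/bucket/oss-cn-hangzhou/fengzetest/object?path=rt_spark_test%2Fcustomer-parquet%2F

It consists of 8 Parquet files with a total size of 113.918MB.

The Spark job is configured as follows, with a single executor that has only 1 core: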

conf spark.driver.resourceSpec=small;
conf spark.executor.instances=1;
conf spark.executor.resourceSpec=small;
conf spark.app.name=Spark SQL Test;
conf spark.adb.connectors=oss;
use tpcd;
select * from customer order by C_CUSTKEY desc limit 100;

Applying the formula above (there is 1 executor core in total, and the default parallelism is at least 2):

defaultMaxSplitBytes = 128MB
openCostInBytes = 4MB
minPartitionNum = max(1, 2) = 2
totalBytes = 113.918MB + 8 * 4MB = 145.918MB
bytesPerCore = 145.918MB / 2 = 72.959MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 72.959MB

maxSplitBytes is therefore 72.959MB, and the logs show the same value.

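Every file is smaller than 72.959MB, so none of them is split. After sorting by size, the file order is (00000, 00001, 00002, 00003, 00004, 00006, 00005, 00007), and merging then produces 3 FilePartition objects:

FilePartition 1: 00000, 00001, 00002
FilePartition 2: 00003, 00004, 00006
FilePartition 3: 00005, 00007

So 3 tasks are generated in total.

The Spark UI confirms that 3 tasks were generated.

The logs also show 3 tasks.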

Now change the Spark job configuration to 5 executors with 10 cores in total:

conf spark.driver.resourceSpec=small;
conf spark.executor.instances=5;
conf spark.executor.resourceSpec=medium;
conf spark.app.name=Spark SQL Test;
conf spark.adb.connectors=oss;
use tpcd;
select * from customer order by C_CUSTKEY desc limit 100;

Applying the formula above:

defaultMaxSplitBytes = 128MB
openCostInBytes = 4MB
minPartitionNum = max(10, 2) = 10
totalBytes = 113.918MB + 8 * 4MB = 145.918MB
bytesPerCore = 145.918MB / 10 = 14.5918MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 14.5918MB

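The logs show the same value.

With maxSplitBytes at 14.5918MB, the source files are now split. Files 00000 through 00006 are each larger than 14.5918MB and are cut into two pieces, while 00007 is smaller than 14.5918MB and is left whole. After PartitionedFileUtil#splitFiles there are 7 * 2 + 1 = 15 PartitionedFile entries in total: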

00000(0 -> 14.5918MB), 00000(14.5918MB -> 15.698MB)
00001(0 -> 14.5918MB), 00001(14.5918MB -> 15.632MB)
00002(0 -> 14.5918MB), 00002(14.5918MB -> 15.629MB)
00003(0 -> 14.5918MB), 00003(14.5918MB -> 15.624MB)
00004(0 -> 14.5918MB), 00004(14.5918MB -> 15.617MB)
00005(0 -> 14.5918MB), 00005(14.5918MB -> 15.536MB)
00006(0 -> 14.5918MB), 00006(14.5918MB -> 15.539MB)
00007(0 -> 4.634MB)

After sorting by size and merging, these yield 10 FilePartition objects:

FilePartition 1: 00000(0 -> 14.5918MB)
FilePartition 2: 00001(0 -> 14.5918MB)
FilePartition 3: 00002(0 -> 14.5918MB)
FilePartition 4: 00003(0 -> 14.5918MB)
FilePartition 5: 00004(0 -> 14.5918MB)
FilePartition 6: 00005(0 -> 14.5918MB)
FilePartition 7: 00006(0 -> 14.5918MB)
FilePartition 8: 00007(0 -> 4.634MB), 00000(14.5918MB -> 15.698MB)
FilePartition 9: 00001(14.5918MB -> 15.632MB), 00002(14.5918MB -> 15.629MB), 00003(14.5918MB -> 15.624MB)
FilePartition 10: 00004(14.5918MB -> 15.617MB), 00005(14.5918MB -> 15.536MB), 00006(14.5918MB -> 15.539MB)

So 10 tasks are generated in total.

The Spark UI also shows that 10 tasks were generated.

The logs confirm that the following splits are each grouped into a single task:

00004(14.5918MB -> 15.617MB), 00005(14.5918MB -> 15.536MB), 00006(14.5918MB -> 15.539MB)

00007(0 -> 4.634MB), 00000(14.5918MB -> 15.698MB)

00001(14.5918MB -> 15.632MB), 00002(14.5918MB -> 15.629MB), 00003(14.5918MB -> 15.624MB)
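
Both runs can be replayed offline with a small simulation that chains the three steps on the file sizes listed above. This is only a sketch under stated assumptions: ScanPlanningSketch, Split, and plan are hypothetical names, no Spark classes are involved, and the byte sizes are reconstructed from the rounded MB figures in this article, so the computed maxSplitBytes differs slightly from the logged value while the grouping of files stays the same.

```scala
import scala.collection.mutable.ArrayBuffer

object ScanPlanningSketch {
  private val BytesPerMB = 1024.0 * 1024.0

  final case class Split(file: String, start: Long, length: Long)

  // File sizes in MB as listed in the article, converted to bytes (approximate).
  val customerFiles: Seq[(String, Long)] = Seq(
    "00000" -> 15.698, "00001" -> 15.632, "00002" -> 15.629, "00003" -> 15.624,
    "00004" -> 15.617, "00005" -> 15.536, "00006" -> 15.539, "00007" -> 4.634
  ).map { case (name, mb) => name -> (mb * BytesPerMB).toLong }

  /** Replays maxSplitBytes, splitFiles, and getFilePartitions on plain data. */
  def plan(
      files: Seq[(String, Long)],
      minPartitionNum: Int,
      defaultMaxSplitBytes: Long = 128L << 20, // assumed default: 128MB
      openCostInBytes: Long = 4L << 20         // assumed default: 4MB
  ): Seq[Seq[Split]] = {
    // Step 1: choose the maximum split size.
    val totalBytes = files.map(_._2 + openCostInBytes).sum
    val maxSplitBytes =
      math.min(defaultMaxSplitBytes, math.max(openCostInBytes, totalBytes / minPartitionNum))

    // Step 2: cut every file into ranges of at most maxSplitBytes, largest first.
    val splits = files.flatMap { case (name, len) =>
      (0L until len by maxSplitBytes).map { offset =>
        Split(name, offset, math.min(len - offset, maxSplitBytes))
      }
    }.sortBy(s => -s.length)

    // Step 3: pack the ranges into partitions with "Next Fit Decreasing".
    val partitions = ArrayBuffer.empty[Seq[Split]]
    val current = ArrayBuffer.empty[Split]
    var currentSize = 0L
    def closePartition(): Unit = {
      if (current.nonEmpty) partitions += current.toList
      current.clear()
      currentSize = 0L
    }
    splits.foreach { s =>
      if (currentSize + s.length > maxSplitBytes) closePartition()
      currentSize += s.length + openCostInBytes
      current += s
    }
    closePartition()
    partitions.toList
  }
}
```

Calling ScanPlanningSketch.plan(ScanPlanningSketch.customerFiles, 2) reproduces the 3 partitions of the first run, and passing 10 reproduces the 10 partitions of the second run, with the last partition holding the tail pieces of 00004, 00006, and 00005.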

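Summary

The source code shows that when Spark plans source-side partitions, it takes into account the sizes of all files under the selected partitions as well as the cost of opening each file. It splits large files and merges small ones, and in the end arrives at a reasonably balanced set of partitions.

Original link: http://click.aliyun.com/m/1000349867/

This article is original content from Alibaba Cloud and may not be reproduced without permission.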
