spark中join有两种，一种是RDD的join，一种是sql中的join，分别来看：

1 RDD join

org.apache.spark.rdd.PairRDDFunctions

  /**

   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each

   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and

   * (k, v2) is in `other`. Performs a hash join across the cluster.

   */

  def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {

    join(other, defaultPartitioner(self, other))

  }

  /**

   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each

   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and

   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.

   */

  def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {

    this.cogroup(other, partitioner).flatMapValues( pair =>

      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)

    )

  }

  /**

   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the

   * list of values for that key in `this` as well as `other`.

   */

  def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)

      : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {

    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {

      throw new SparkException("HashPartitioner cannot partition array keys.")

    }

    val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)

    cg.mapValues { case Array(vs, w1s) =>

      (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])

    }

  }

join操作会返回CoGroupedRDD，CoGroupedRDD构造参数为rdd数组，即多个需要join的rdd，下面看CoGroupedRDD：

org.apache.spark.rdd.CoGroupedRDD

class CoGroupedRDD[K: ClassTag](

    @transient var rdds: Seq[RDD[_ <: Product2[K, _]]],

    part: Partitioner)

  extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) {

  override def getDependencies: Seq[Dependency[_]] = {

    rdds.map { rdd: RDD[_] =>

      if (rdd.partitioner == Some(part)) {

        logDebug("Adding one-to-one dependency with " + rdd)

        new OneToOneDependency(rdd)

      } else {

        logDebug("Adding shuffle dependency with " + rdd)

        new ShuffleDependency[K, Any, CoGroupCombiner](

          rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)

      }

    }

  }

  override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {

    val split = s.asInstanceOf[CoGroupPartition]

    val numRdds = dependencies.length

    // A list of (rdd iterator, dependency number) pairs

    val rddIterators = new ArrayBuffer[(Iterator[Product2[K, Any]], Int)]

    for ((dep, depNum) <- dependencies.zipWithIndex) dep match {

      case oneToOneDependency: OneToOneDependency[Product2[K, Any]] @unchecked =>

        val dependencyPartition = split.narrowDeps(depNum).get.split

        // Read them from the parent

        val it = oneToOneDependency.rdd.iterator(dependencyPartition, context)

        rddIterators += ((it, depNum))

      case shuffleDependency: ShuffleDependency[_, _, _] =>

        // Read map outputs of shuffle

        val it = SparkEnv.get.shuffleManager

          .getReader(shuffleDependency.shuffleHandle, split.index, split.index + 1, context)

          .read()

        rddIterators += ((it, depNum))

    }

    val map = createExternalMap(numRdds)

    for ((it, depNum) <- rddIterators) {

      map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))

    }

    context.taskMetrics().incMemoryBytesSpilled(map.memoryBytesSpilled)

    context.taskMetrics().incDiskBytesSpilled(map.diskBytesSpilled)

    context.taskMetrics().incPeakExecutionMemory(map.peakMemoryUsedBytes)

    new InterruptibleIterator(context,

      map.iterator.asInstanceOf[Iterator[(K, Array[Iterable[_]])]])

  }

  private def createExternalMap(numRdds: Int)

    : ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] = {

    val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {

      val newCombiner = Array.fill(numRdds)(new CoGroup)

      newCombiner(value._2) += value._1

      newCombiner

    }

    val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =

      (combiner, value) => {

      combiner(value._2) += value._1

      combiner

    }

    val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =

      (combiner1, combiner2) => {

        var depNum = 0

        while (depNum < numRdds) {

          combiner1(depNum) ++= combiner2(depNum)

          depNum += 1

        }

        combiner1

      }

    new ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner](

      createCombiner, mergeValue, mergeCombiners)

  }

CoGroupedRDD首先将rdds逐个转化为dependency，然后将所有的dependency转化为rddIterators，最后通过ExternalAppendOnlyMap来实现合并；

如果rdd需要shuffle，是通过ShuffleManager实现，ShuffleManager实现类为SortShuffleManager，shuffle过程详见：https://www.cnblogs.com/barneywill/p/10158457.html

附：spark中dependency结构，即常说的宽依赖、窄依赖：

org.apache.spark.Dependency

Dependency

NarrowDependency

OneToOneDependency

RangeDependency

ShuffleDependency

区别就是shuffle，不需要shuffle就是NarrowDependency，需要就是ShuffleDependency；

2 sql join

sql中的join有一个选择策略：

org.apache.spark.sql.execution.SparkStrategies.JoinSelection

    def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {

      // --- BroadcastHashJoin --------------------------------------------------------------------

      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)

        if canBuildRight(joinType) && canBroadcast(right) =>

        Seq(joins.BroadcastHashJoinExec(

          leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))

      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)

        if canBuildLeft(joinType) && canBroadcast(left) =>

        Seq(joins.BroadcastHashJoinExec(

          leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))

      // --- ShuffledHashJoin ---------------------------------------------------------------------

      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)

         if !conf.preferSortMergeJoin && canBuildRight(joinType) && canBuildLocalHashMap(right)

           && muchSmaller(right, left) ||

           !RowOrdering.isOrderable(leftKeys) =>

        Seq(joins.ShuffledHashJoinExec(

          leftKeys, rightKeys, joinType, BuildRight, condition, planLater(left), planLater(right)))

      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)

         if !conf.preferSortMergeJoin && canBuildLeft(joinType) && canBuildLocalHashMap(left)

           && muchSmaller(left, right) ||

           !RowOrdering.isOrderable(leftKeys) =>

        Seq(joins.ShuffledHashJoinExec(

          leftKeys, rightKeys, joinType, BuildLeft, condition, planLater(left), planLater(right)))

      // --- SortMergeJoin ------------------------------------------------------------

      case ExtractEquiJoinKeys(joinType, leftKeys, rightKeys, condition, left, right)

        if RowOrdering.isOrderable(leftKeys) =>

        joins.SortMergeJoinExec(

          leftKeys, rightKeys, joinType, condition, planLater(left), planLater(right)) :: Nil

...

其中conf.preferSortMergeJoin

org.apache.spark.sql.internal.SQLConf

  val PREFER_SORTMERGEJOIN = SQLConfigBuilder("spark.sql.join.preferSortMergeJoin")

    .internal()

    .doc("When true, prefer sort merge join over shuffle hash join.")

    .booleanConf

    .createWithDefault(true)

配置spark.sql.join.preferSortMergeJoin，默认为true，即是否优先使用SortMergeJoin；

可以看到join实现主要有3种，即BroadcastHashJoinExec、ShuffledHashJoinExec和SortMergeJoinExec，优先级为

1 如果canBroadcast，则BroadcastHashJoinExec；
2 如果spark.sql.join.preferSortMergeJoin=false，则ShuffledHashJoinExec；
3 否则为SortMergeJoinExec；

其中BroadcastHashJoinExec和ShuffledHashJoinExec都会用到HashJoin，先看HashJoin：

2.1 HashJoin

org.apache.spark.sql.execution.joins.HashJoin

  protected def join(

      streamedIter: Iterator[InternalRow],

      hashed: HashedRelation,

      numOutputRows: SQLMetric): Iterator[InternalRow] = {

    val joinedIter = joinType match {

      case _: InnerLike =>

        innerJoin(streamedIter, hashed)

      case LeftOuter | RightOuter =>

        outerJoin(streamedIter, hashed)

      case LeftSemi =>

        semiJoin(streamedIter, hashed)

      case LeftAnti =>

        antiJoin(streamedIter, hashed)

      case j: ExistenceJoin =>

        existenceJoin(streamedIter, hashed)

      case x =>

        throw new IllegalArgumentException(

          s"BroadcastHashJoin should not take $x as the JoinType")

    }

    val resultProj = createResultProjection

    joinedIter.map { r =>

      numOutputRows += 1

      resultProj(r)

    }

  }

  private def innerJoin(

      streamIter: Iterator[InternalRow],

      hashedRelation: HashedRelation): Iterator[InternalRow] = {

    val joinRow = new JoinedRow

    val joinKeys = streamSideKeyGenerator()

    streamIter.flatMap { srow =>

      joinRow.withLeft(srow)

      val matches = hashedRelation.get(joinKeys(srow))

      if (matches != null) {

        matches.map(joinRow.withRight(_)).filter(boundCondition)

      } else {

        Seq.empty

      }

    }

  }

这里只贴出内关联，即innerJoin，代码比较简单，注意这里是内存操作，会在单个partition内部进行；

2.2 BroadcastHashJoinExec

org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

  protected override def doExecute(): RDD[InternalRow] = {

    val numOutputRows = longMetric("numOutputRows")

    val broadcastRelation = buildPlan.executeBroadcast[HashedRelation]()

    streamedPlan.execute().mapPartitions { streamedIter =>

      val hashed = broadcastRelation.value.asReadOnlyCopy()

      TaskContext.get().taskMetrics().incPeakExecutionMemory(hashed.estimatedSize)

      join(streamedIter, hashed, numOutputRows)

    }

  }

这里会将buildPlan广播出去，然后在streamedPlan上通过mapPartitions在1个分区内部进行join，join方法见HashJoin；

2.3 ShuffledHashJoinExec

org.apache.spark.sql.execution.joins.ShuffledHashJoinExec

  protected override def doExecute(): RDD[InternalRow] = {

    val numOutputRows = longMetric("numOutputRows")

    streamedPlan.execute().zipPartitions(buildPlan.execute()) { (streamIter, buildIter) =>

      val hashed = buildHashedRelation(buildIter)

      join(streamIter, hashed, numOutputRows)

    }

  }

join过程为先将两个rdd（streamedPlan和buildPlan）进行zipPartitions，然后在1个partition内部join，join方法见HashJoin；

2.4 SortMergeJoinExec

org.apache.spark.sql.execution.joins.SortMergeJoinExec

  protected override def doExecute(): RDD[InternalRow] = {

    val numOutputRows = longMetric("numOutputRows")

    left.execute().zipPartitions(right.execute()) { (leftIter, rightIter) =>

      val boundCondition: (InternalRow) => Boolean = {

        condition.map { cond =>

          newPredicate(cond, left.output ++ right.output).eval _

        }.getOrElse {

          (r: InternalRow) => true

        }

      }

      // An ordering that can be used to compare keys from both sides.

      val keyOrdering = newNaturalAscendingOrdering(leftKeys.map(_.dataType))

      val resultProj: InternalRow => InternalRow = UnsafeProjection.create(output, output)

      joinType match {

        case _: InnerLike =>

          new RowIterator {

            private[this] var currentLeftRow: InternalRow = _

            private[this] var currentRightMatches: ArrayBuffer[InternalRow] = _

            private[this] var currentMatchIdx: Int = -1

            private[this] val smjScanner = new SortMergeJoinScanner(

              createLeftKeyGenerator(),

              createRightKeyGenerator(),

              keyOrdering,

              RowIterator.fromScala(leftIter),

              RowIterator.fromScala(rightIter)

            )

            private[this] val joinRow = new JoinedRow

            if (smjScanner.findNextInnerJoinRows()) {

              currentRightMatches = smjScanner.getBufferedMatches

              currentLeftRow = smjScanner.getStreamedRow

              currentMatchIdx = 0

            }

            override def advanceNext(): Boolean = {

              while (currentMatchIdx >= 0) {

                if (currentMatchIdx == currentRightMatches.length) {

                  if (smjScanner.findNextInnerJoinRows()) {

                    currentRightMatches = smjScanner.getBufferedMatches

                    currentLeftRow = smjScanner.getStreamedRow

                    currentMatchIdx = 0

                  } else {

                    currentRightMatches = null

                    currentLeftRow = null

                    currentMatchIdx = -1

                    return false

                  }

                }

                joinRow(currentLeftRow, currentRightMatches(currentMatchIdx))

                currentMatchIdx += 1

                if (boundCondition(joinRow)) {

                  numOutputRows += 1

                  return true

                }

              }

              false

            }

            override def getRow: InternalRow = resultProj(joinRow)

          }.toScala

...

和ShuffledHashJoinExec一样，同样先zipPartitions，然后在1个partition内部根据joinType返回不同的RowIterator实现类，上边代码包含内关联实现，大部分工作通过SortMergeJoinScanner实现

org.apache.spark.sql.execution.joins.SortMergeJoinScanner

  final def findNextInnerJoinRows(): Boolean = {

    while (advancedStreamed() && streamedRowKey.anyNull) {

      // Advance the streamed side of the join until we find the next row whose join key contains

      // no nulls or we hit the end of the streamed iterator.

    }

    if (streamedRow == null) {

      // We have consumed the entire streamed iterator, so there can be no more matches.

      matchJoinKey = null

      bufferedMatches.clear()

      false

    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {

      // The new streamed row has the same join key as the previous row, so return the same matches.

      true

    } else if (bufferedRow == null) {

      // The streamed row's join key does not match the current batch of buffered rows and there are

      // no more rows to read from the buffered iterator, so there can be no more matches.

      matchJoinKey = null

      bufferedMatches.clear()

      false

    } else {

      // Advance both the streamed and buffered iterators to find the next pair of matching rows.

      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)

      do {

        if (streamedRowKey.anyNull) {

          advancedStreamed()

        } else {

          assert(!bufferedRowKey.anyNull)

          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)

          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()

          else if (comp < 0) advancedStreamed()

        }

      } while (streamedRow != null && bufferedRow != null && comp != 0)

      if (streamedRow == null || bufferedRow == null) {

        // We have either hit the end of one of the iterators, so there can be no more matches.

        matchJoinKey = null

        bufferedMatches.clear()

        false

      } else {

        // The streamed row's join key matches the current buffered row's join, so walk through the

        // buffered iterator to buffer the rest of the matching rows.

        assert(comp == 0)

        bufferMatchingRows()

        true

      }

    }

  }

  /**

   * Advance the streamed iterator and compute the new row's join key.

   * @return true if the streamed iterator returned a row and false otherwise.

   */

  private def advancedStreamed(): Boolean = {

    if (streamedIter.advanceNext()) {

      streamedRow = streamedIter.getRow

      streamedRowKey = streamedKeyGenerator(streamedRow)

      true

    } else {

      streamedRow = null

      streamedRowKey = null

      false

    }

  }

  /**

   * Advance the buffered iterator until we find a row with join key that does not contain nulls.

   * @return true if the buffered iterator returned a row and false otherwise.

   */

  private def advancedBufferedToRowWithNullFreeJoinKey(): Boolean = {

    var foundRow: Boolean = false

    while (!foundRow && bufferedIter.advanceNext()) {

      bufferedRow = bufferedIter.getRow

      bufferedRowKey = bufferedKeyGenerator(bufferedRow)

      foundRow = !bufferedRowKey.anyNull

    }

    if (!foundRow) {

      bufferedRow = null

      bufferedRowKey = null

      false

    } else {

      true

    }

  }

  /**

   * Called when the streamed and buffered join keys match in order to buffer the matching rows.

   */

  private def bufferMatchingRows(): Unit = {

    assert(streamedRowKey != null)

    assert(!streamedRowKey.anyNull)

    assert(bufferedRowKey != null)

    assert(!bufferedRowKey.anyNull)

    assert(keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)

    // This join key may have been produced by a mutable projection, so we need to make a copy:

    matchJoinKey = streamedRowKey.copy()

    bufferedMatches.clear()

    do {

      bufferedMatches += bufferedRow.copy() // need to copy mutable rows before buffering them

      advancedBufferedToRowWithNullFreeJoinKey()

    } while (bufferedRow != null && keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)

  }

可以看到过程和二路归并排序Binary Merge Sort差不多；

附：RowIterator是一个抽象类，本质是一个接口，是一个常见的Iterator定义，如下：

org.apache.spark.sql.execution.RowIterator

abstract class RowIterator {

  /**

   * Advance this iterator by a single row. Returns `false` if this iterator has no more rows

   * and `true` otherwise. If this returns `true`, then the new row can be retrieved by calling

   * [[getRow]].

   */

  def advanceNext(): Boolean

  /**

   * Retrieve the row from this iterator. This method is idempotent. It is illegal to call this

   * method after [[advanceNext()]] has returned `false`.

   */

  def getRow: InternalRow

  /**

   * Convert this RowIterator into a [[scala.collection.Iterator]].

   */

  def toScala: Iterator[InternalRow] = new RowIteratorToScala(this)

}

【原创】大数据基础之Spark（8）Spark中Join实现原理的更多相关文章

【原创】大数据基础之Hadoop（1）HA实现原理
有些工作只能在一台server上进行,比如master,这时HA(High Availability)首先要求部署多个server,其次要求多个server自动选举出一个active状态server, ...
大数据学习系列之七 ----- Hadoop+Spark+Zookeeper+HBase+Hive集群搭建图文详解
引言在之前的大数据学习系列中,搭建了Hadoop+Spark+HBase+Hive 环境以及一些测试.其实要说的话,我开始学习大数据的时候,搭建的就是集群,并不是单机模式和伪分布式.至于为什么先写单 ...
CentOS6安装各种大数据软件第十章：Spark集群安装和部署
相关文章链接 CentOS6安装各种大数据软件第一章:各个软件版本介绍 CentOS6安装各种大数据软件第二章:Linux各个软件启动命令 CentOS6安装各种大数据软件第三章:Linux基础 ...
大数据平台搭建（hadoop+spark）
大数据平台搭建(hadoop+spark) 一.基本信息 1. 服务器基本信息主机名 ip地址安装服务 spark-master 172.16.200.81 jdk.hadoop.spark.sc ...
大数据系列之并行计算引擎Spark部署及应用
相关博文: 大数据系列之并行计算引擎Spark介绍之前介绍过关于Spark的程序运行模式有三种: 1.Local模式: 2.standalone(独立模式) 3.Yarn/mesos模式本文将介绍 ...
大数据系列之并行计算引擎Spark介绍
相关博文:大数据系列之并行计算引擎Spark部署及应用 Spark: Apache Spark 是专为大规模数据处理而设计的快速通用的计算引擎. Spark是UC Berkeley AMP lab ( ...
【原创】大数据基础之Zookeeper（2）源代码解析
核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...
【原创】大数据基础之Spark（1）Spark Submit即Spark任务提交过程
Spark2.1.1 一 Spark Submit本地解析 1.1 现象提交命令: spark-submit --master local[10] --driver-memory 30g --cla ...
【原创】大数据基础之Hive（5）hive on spark
hive 2.3.4 on spark 2.4.0 Hive on Spark provides Hive with the ability to utilize Apache Spark as it ...

随机推荐

深入剖析Redis系列：Redis数据结构与全局命令概述
前言 Redis 提供了 5 种数据结构.理解每种数据结构的特点,对于 Redis 的开发运维非常重要,同时掌握 Redis 的单线程命令处理机制,会使数据结构和命令的选择事半功倍. ...
java oop 集合框架
集合主要是用来存放数据的,集合与数组的主要区别在于,集合的大小不受限制,而数组的大小受限制,在使用集合增加数据时又常常与泛型联系起来,所以集合和泛型在实际开发过程中总会结合在一起数组致命的缺点是数组 ...
hMailServer配置图文详细教程
https://www.hmailserver.org/viewtopic.php?f=4&t=6
mysql之优化（2）
1.选取最适用的字段属性MySQL可以很好的支持大数据量的存取,但是一般说来,数据库中的表越小,在它上面执行的查询也就会越快.因此,在创建表的时候,为了获得更好的性能,我们可以将表中字段的宽度设得尽可 ...
[转帖]Scanners-Box 指引
作者:杨文链接:https://zhuanlan.zhihu.com/p/26534414来源:知乎著作权归作者所有.商业转载请联系作者获得授权,非商业转载请注明出处. 原文地址:We5ter/Sca ...
Python——字典操作
一.取出字典中所有的key-value student={'name':'xiaoming','age':11,'school':'tsinghua'} for key,value in studen ...
下载图片没有关闭http输入流导致下载超时
在某次接入第三方厂商数据时,需要根据对方提供的URL地址下载图片,当数据量大时会遇到很多的下载图片超时问题,开始以为是第三方厂商的问题,对方排查了很久之后,说是我这边下载数据全部留在缓存区,导致缓存区 ...
ZOJ 4097 Rescue the Princess
在这个物欲横流的社会 oj冷漠无情只有这xx还有些温度越界就越界吧 wrong 怎么回事.... 给出一个图然后给出q次询问问是否存在v和w分别到u的路径且边不重复在边双连通分量中任意两 ...
P1494 [国家集训队]小Z的袜子
题目 P1494 [国家集训队]小Z的袜子解析在区间\([l,r]\)内, 任选两只袜子,有 \[r-l+1\choose2\] \[=\frac{(r-l+1)!}{2!(r-l-1)!}\] ...
【转】Esp8266学习之旅① 搭建开发环境，开始一个“hellow world”串口打印。
@2019-02-28 [小记] Esp8266学习之旅① 搭建开发环境,开始一个“hellow world”串口打印.

【原创】大数据基础之Spark（8）Spark中Join实现原理