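As the analysis below shows, approxSimilarityJoin first computes the hash values, then uses a hash join to bring together the rows whose hash values are equal, then uses a UDF to compute the distance of each candidate pair, and finally keeps only the pairs whose distance satisfies the threshold:

Reference: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala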

  /**
   * Join two datasets to approximately find all pairs of rows whose distance are smaller than
   * the threshold. If the [[outputCol]] is missing, the method will transform the data; if the
   * [[outputCol]] exists, it will use the [[outputCol]]. This allows caching of the transformed
   * data when necessary.
   *
   * @param datasetA One of the datasets to join.
   * @param datasetB Another dataset to join.
   * @param threshold The threshold for the distance of row pairs.
   * @param distCol Output column for storing the distance between each pair of rows.
   * @return A joined dataset containing pairs of rows. The original rows are in columns
   *         "datasetA" and "datasetB", and a column "distCol" is added to show the distance
   *         between each pair.
   */
  def approxSimilarityJoin(
      datasetA: Dataset[_],
      datasetB: Dataset[_],
      threshold: Double,
      distCol: String): Dataset[_] = {
    val leftColName = "datasetA"
    val rightColName = "datasetB"
    val explodeCols = Seq("entry", "hashValue")
    val explodedA = processDataset(datasetA, leftColName, explodeCols)

    // If this is a self join, we need to recreate the inputCol of datasetB to avoid ambiguity.
    // TODO: Remove recreateCol logic once SPARK-17154 is resolved.
    val explodedB = if (datasetA != datasetB) {
      processDataset(datasetB, rightColName, explodeCols)
    } else {
      val recreatedB = recreateCol(datasetB, $(inputCol), s"${$(inputCol)}#${Random.nextString(5)}")
      processDataset(recreatedB, rightColName, explodeCols)
    }

    // Do a hash join on where the exploded hash values are equal.
    val joinedDataset = explodedA.join(explodedB, explodeCols)
      .drop(explodeCols: _*).distinct()

    // Add a new column to store the distance of the two rows.
    val distUDF = udf((x: Vector, y: Vector) => keyDistance(x, y), DataTypes.DoubleType)
    val joinedDatasetWithDist = joinedDataset.select(col("*"),
      distUDF(col(s"$leftColName.${$(inputCol)}"), col(s"$rightColName.${$(inputCol)}")).as(distCol)
    )

    // Filter the joined datasets where the distance are smaller than the threshold.
    joinedDatasetWithDist.filter(col(distCol) < threshold)
  }
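
The four steps of approxSimilarityJoin (explode the hash values, join on equal hash values, compute the exact distance, filter by threshold) can be imitated on a single machine. The following is only a rough pure-Python sketch, not Spark's implementation: the row layout `(row_id, feature)`, the function names, and the toy hash coefficients are all assumptions made for illustration.

```python
from itertools import product


def jaccard_distance(x, y):
    """1 - |x ∩ y| / |x ∪ y| for two non-empty sets."""
    return 1 - len(x & y) / len(x | y)


def toy_hash(feature):
    """Two hypothetical min-hash functions of the form (a * (1 + e) + b) % 101."""
    return [min((a * (1 + e) + b) % 101 for e in feature) for a, b in ((3, 7), (5, 11))]


def approx_similarity_join(rows_a, rows_b, hash_fn, key_distance, threshold):
    """Toy version of the pipeline: explode, hash join, distance, filter."""

    # Step 1: "explode" each row into one (entry, hashValue) key per hash function.
    def explode(rows):
        buckets = {}
        for row_id, feature in rows:
            for entry, hash_value in enumerate(hash_fn(feature)):
                buckets.setdefault((entry, hash_value), []).append((row_id, feature))
        return buckets

    buckets_a = explode(rows_a)
    buckets_b = explode(rows_b)

    # Step 2: hash join on equal (entry, hashValue); the dict keys play the role of distinct().
    candidates = {}
    for key, left_rows in buckets_a.items():
        for (id_a, feat_a), (id_b, feat_b) in product(left_rows, buckets_b.get(key, [])):
            candidates[(id_a, id_b)] = (feat_a, feat_b)

    # Steps 3 and 4: compute the exact distance for candidate pairs only, then filter.
    result = []
    for (id_a, id_b), (feat_a, feat_b) in sorted(candidates.items()):
        dist = key_distance(feat_a, feat_b)
        if dist < threshold:
            result.append((id_a, id_b, dist))
    return result


rows_a = [("a1", frozenset({0, 1, 2})), ("a2", frozenset({10, 11}))]
rows_b = [("b1", frozenset({0, 1, 3})), ("b2", frozenset({20, 21}))]
pairs = approx_similarity_join(rows_a, rows_b, toy_hash, jaccard_distance, 0.6)
print(pairs)
```

The point of the hash join in step 2 is that the expensive distance function runs only on pairs that share at least one hash bucket, instead of on all |A| * |B| pairs. Note that the filter is a strict `<`, as in the Scala code.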

Supplement:

Time complexity of SQL join algorithms

2016-08-26 12:04:34, stevewongbuaa

Reference

Stack Overflow

Notes

The SQL statement is as follows:

SELECT  T1.name, T2.date
FROM    T1, T2
WHERE   T1.id=T2.id
        AND T1.color='red'
        AND T2.type='CAR'

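Suppose T1 has m rows and T2 has n rows. In the naive case, the engine iterates over the id of every row in T1 (m iterations) and, for each one, scans all of T2 (n rows) to find the rows where T2.id = T1.id and join them. The time complexity is therefore O(m*n).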
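
As a rough illustration of this naive strategy, here is a nested-loop join in Python. The tables are modelled as lists of dicts, and the sample rows are invented for the example.

```python
def nested_loop_join(t1, t2):
    """Naive join: every T1 row is compared against every T2 row, O(m*n) in the worst case."""
    result = []
    for r1 in t1:                      # m iterations
        if r1["color"] != "red":
            continue
        for r2 in t2:                  # up to n iterations for each T1 row
            if r2["type"] == "CAR" and r1["id"] == r2["id"]:
                result.append((r1["name"], r2["date"]))
    return result


# Hypothetical sample data
T1 = [
    {"id": 1, "name": "alice", "color": "red"},
    {"id": 2, "name": "bob", "color": "blue"},
    {"id": 3, "name": "carol", "color": "red"},
]
T2 = [
    {"id": 1, "date": "2016-08-01", "type": "CAR"},
    {"id": 3, "date": "2016-08-02", "type": "BIKE"},
    {"id": 3, "date": "2016-08-03", "type": "CAR"},
]
nested_rows = nested_loop_join(T1, T2)
print(nested_rows)
```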

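Without an index, the engine will choose a hash join or a merge join as an optimization.

A hash join works like this:

1. Pick the table to be hashed, usually the smaller one. Let us happily assume that T1 is the smaller one.
2. All records of T1 are scanned. If a record satisfies color='red', it goes into the hash table, with id as the key and name as the value.
3. All records of T2 are scanned. If a record satisfies type='CAR', its id is looked up in the hash table. The name of every hit is returned together with the date of the current record, which joins the two.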

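The time complexity is O(n+m): building the hash table over T1 is O(m), probing it with the rows of T2 is O(n) (assuming expected constant-time lookups), and the two are simply added.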
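
The build and probe phases described above can be sketched in Python as follows. This is a minimal single-machine illustration, not how any particular database engine implements it; the table layout and sample rows are assumptions.

```python
from collections import defaultdict


def hash_join(t1, t2):
    """Hash join of T1 and T2 on id, with the WHERE filters applied during the scans."""
    # Build phase: one pass over T1 (assumed to be the smaller table), O(m).
    table = defaultdict(list)
    for r1 in t1:
        if r1["color"] == "red":
            table[r1["id"]].append(r1["name"])   # key = id, value = name

    # Probe phase: one pass over T2, O(n) expected with constant-time lookups.
    result = []
    for r2 in t2:
        if r2["type"] == "CAR":
            for name in table.get(r2["id"], []):
                result.append((name, r2["date"]))
    return result


# Hypothetical sample data
T1 = [
    {"id": 1, "name": "alice", "color": "red"},
    {"id": 2, "name": "bob", "color": "blue"},
    {"id": 3, "name": "carol", "color": "red"},
]
T2 = [
    {"id": 1, "date": "2016-08-01", "type": "CAR"},
    {"id": 3, "date": "2016-08-02", "type": "BIKE"},
    {"id": 3, "date": "2016-08-03", "type": "CAR"},
]
hash_rows = hash_join(T1, T2)
print(hash_rows)
```

Each table is scanned exactly once, which is where the O(n+m) bound comes from. The hash table stores a list per key so that several T1 rows with the same id are all returned.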

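A merge join works like this:

1. Copy T1(id, name) and sort it by id.
2. Copy T2(id, date) and sort it by id.
3. Point two pointers at the minimum values of the two tables.
4. In a loop, compare the values under the two pointers. If they match, return the record. If they do not match, advance the pointer that points at the smaller value.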

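>1  2<  - no match; the left pointer is smaller, so left pointer++
 2  3
 2  4
 3  5

 1  2<  - match; return the record, both pointers++
>2  3
 2  4
 3  5

 1  2   - match; return the record, both pointers++
 2  3<
 2  4
>3  5

 1  2   - the left pointer is out of range; the query ends
 2  3
 2  4<
 3  5
>

(The trace skips one intermediate state: after the first match the left pointer is on the second 2 and the right pointer is on 3, which do not match, so the left pointer advances once more.)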

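The time complexity is O(n*log(n)+m*log(m)): sorting the two tables costs O(n*log(n)) and O(m*log(m)) respectively, and the two are simply added. The merge pass itself is linear, O(n+m), when the keys are unique, so the sorts dominate.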
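
The two-pointer procedure can be sketched in Python as follows. This is an illustrative version, assuming each table is a list of `(id, value)` pairs. Unlike the simplified trace above, which advances both pointers on a match, it keeps the right pointer at the start of a run of equal keys so that duplicate ids on the left are not lost.

```python
def merge_join(left, right):
    """Sort-merge join of two lists of (key, value) pairs on key."""
    left = sorted(left, key=lambda r: r[0])     # O(m log m)
    right = sorted(right, key=lambda r: r[0])   # O(n log n)

    i = j = 0
    result = []
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1                              # left is smaller: advance left
        elif left_key > right_key:
            j += 1                              # right is smaller: advance right
        else:
            # Match: pair the current left row with the whole run of equal keys on the right.
            k = j
            while k < len(right) and right[k][0] == left_key:
                result.append((left_key, left[i][1], right[k][1]))
                k += 1
            i += 1                              # j stays put for the next left row with this key
    return result


# The same key columns as in the trace above; the values are made up.
left_rows = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
right_rows = [(2, "w"), (3, "x"), (4, "y"), (5, "z")]
merged = merge_join(left_rows, right_rows)
print(merged)
```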

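In this case, can making the query more complex actually make it faster, because fewer rows have to go through the join-level tests?

Of course.

If the original query had no WHERE clause, for example

SELECT  T1.name, T2.date
FROM    T1, T2

it would be simpler, but it would return more results and run longer.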

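Supplement on the hash function:

As the code below shows, hashFunction is computed from the indices of the sparse vector, that is, the positions (subscripts) of its non-zero entries. The distance computation, keyDistance, is based on Jaccard similarity: it returns 1 minus the Jaccard similarity of the two index sets.

From: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala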

/**
 * :: Experimental ::
 *
 * Model produced by [[MinHashLSH]], where multiple hash functions are stored. Each hash function
 * is picked from the following family of hash functions, where a_i and b_i are randomly chosen
 * integers less than prime:
 *    `h_i(x) = ((x \cdot a_i + b_i) \mod prime)`
 *
 * This hash family is approximately min-wise independent according to the reference.
 *
 * Reference:
 * Tom Bohman, Colin Cooper, and Alan Frieze. "Min-wise independent linear permutations."
 * Electronic Journal of Combinatorics 7 (2000): R26.
 *
 * @param randCoefficients Pairs of random coefficients. Each pair is used by one hash function.
 */
@Experimental
@Since("2.1.0")
class MinHashLSHModel private[ml](
    override val uid: String,
    private[ml] val randCoefficients: Array[(Int, Int)])
  extends LSHModel[MinHashLSHModel] {

  /** @group setParam */
  @Since("2.4.0")
  override def setInputCol(value: String): this.type = super.set(inputCol, value)

  /** @group setParam */
  @Since("2.4.0")
  override def setOutputCol(value: String): this.type = super.set(outputCol, value)

  @Since("2.1.0")
  override protected[ml] def hashFunction(elems: Vector): Array[Vector] = {
    require(elems.numNonzeros > 0, "Must have at least 1 non zero entry.")
    val elemsList = elems.toSparse.indices.toList
    val hashValues = randCoefficients.map { case (a, b) =>
      elemsList.map { elem: Int =>
        ((1L + elem) * a + b) % MinHashLSH.HASH_PRIME
      }.min.toDouble
    }
    // TODO: Output vectors of dimension numHashFunctions in SPARK-18450
    hashValues.map(Vectors.dense(_))
  }

  @Since("2.1.0")
  override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
    val xSet = x.toSparse.indices.toSet
    val ySet = y.toSparse.indices.toSet
    val intersectionSize = xSet.intersect(ySet).size.toDouble
    val unionSize = xSet.size + ySet.size - intersectionSize
    assert(unionSize > 0, "The union of two input sets must have at least 1 elements")
    1 - intersectionSize / unionSize
  }

  @Since("2.1.0")
  override protected[ml] def hashDistance(x: Seq[Vector], y: Seq[Vector]): Double = {
    // Since it's generated by hashing, it will be a pair of dense vectors.
    // TODO: This hashDistance function requires more discussion in SPARK-18454
    x.zip(y).map(vectorPair =>
      vectorPair._1.toArray.zip(vectorPair._2.toArray).count(pair => pair._1 != pair._2)
    ).min
  }

  @Since("2.1.0")
  override def copy(extra: ParamMap): MinHashLSHModel = {
    val copied = new MinHashLSHModel(uid, randCoefficients).setParent(parent)
    copyValues(copied, extra)
  }

  @Since("2.1.0")
  override def write: MLWriter = new MinHashLSHModel.MinHashLSHModelWriter(this)
}
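
To make hashFunction and keyDistance concrete, here is a small pure-Python imitation of the two methods. It is a sketch for intuition only: a sparse vector is represented simply by the list of its non-zero indices, the coefficient pairs are made up, and the value used for HASH_PRIME is an assumption, since the excerpt above does not show how MinHashLSH.HASH_PRIME is defined.

```python
# Assumed value of MinHashLSH.HASH_PRIME; the excerpt does not show its definition.
HASH_PRIME = 2038074743


def min_hash(indices, rand_coefficients):
    """For each (a, b) pair, return the minimum of ((1 + e) * a + b) % HASH_PRIME over the indices."""
    if not indices:
        raise ValueError("Must have at least 1 non zero entry.")
    return [
        min(((1 + e) * a + b) % HASH_PRIME for e in indices)
        for a, b in rand_coefficients
    ]


def key_distance(x_indices, y_indices):
    """Jaccard distance between the two index sets: 1 - |intersection| / |union|."""
    x_set, y_set = set(x_indices), set(y_indices)
    intersection_size = len(x_set & y_set)
    union_size = len(x_set) + len(y_set) - intersection_size
    if union_size == 0:
        raise ValueError("The union of two input sets must have at least 1 element")
    return 1 - intersection_size / union_size


# Hypothetical coefficients; Spark draws them at random when the model is fitted.
coefficients = [(3, 7), (5, 11)]
signature = min_hash([0, 2, 5], coefficients)
distance = key_distance([0, 2, 5], [2, 5, 7])
print(signature, distance)
```

Only the indices enter the hash; the values stored at those positions are ignored, which is why the input is effectively treated as a set. Two sets with a large overlap are likely to share the same minimum under a given (a, b), and that shared minimum is what the hash join in approxSimilarityJoin matches on.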

