Join (SQL)

An SQL join clause combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is.

Implementation

Much work in database systems has aimed at the efficient implementation of joins, because relational systems commonly call for joins yet face difficulties in optimising their execution.
The problem arises because inner joins operate both commutatively and associatively. In practice, this means that the user merely supplies the list of tables for joining and the join conditions to use, and the database system has the task of determining the most efficient way to perform the operation.
A query optimizer determines how to execute a query containing joins. A query optimizer has two basic freedoms:
- Join order: Because joins function commutatively and associatively, the order in which the system joins tables does not change the final result set of the query. However, join order can have an enormous impact on the cost of the join operation, so choosing the best join order becomes very important.
- Join method: Given two tables and a join condition, multiple algorithms can produce the result set of the join. Which algorithm runs most efficiently depends on the sizes of the input tables, the number of rows from each table that match the join condition, and the operations required by the rest of the query.
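
As a small illustration of the join-order point, the sketch below joins three made-up tables in two different orders. The tables `a`, `b`, `c`, their column names, and the helpers `join` and `canon` are all hypothetical, and the size of the intermediate result is used as a rough stand-in for cost.

```python
def join(left, right, key):
    """Inner equi-join of two lists of dicts on a shared column name."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def canon(rows):
    """Order-independent form of a result set, used only for comparison."""
    return sorted(tuple(sorted(row.items())) for row in rows)


# Hypothetical tables: a and b share column "x", b and c share column "y".
a = [{"a_id": i, "x": i % 2} for i in range(100)]
b = [{"x": i % 2, "y": i} for i in range(100)]
c = [{"y": 0}]

# Plan 1: (a JOIN b) JOIN c builds a large intermediate result first.
ab = join(a, b, "x")
plan1 = join(ab, c, "y")

# Plan 2: a JOIN (b JOIN c) applies the selective join first.
bc = join(b, c, "y")
plan2 = join(a, bc, "x")

print(len(ab), len(bc))              # 5000 1
print(canon(plan1) == canon(plan2))  # True
```

Both plans return the same 50 rows, but the first materialises 5000 intermediate rows while the second materialises one.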

Join Algorithms

Three fundamental algorithms for performing a join operation exist: nested loop join, sort-merge join and hash join.

Nested loop join

A nested loop join is a naive algorithm that joins two sets by using two nested loops.

Two relations R and S are joined as follows:

For each tuple r in R do
    For each tuple s in S do
        If r and s satisfy the join condition
            Then output the tuple <r,s>
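
The loop above can be sketched in Python roughly as follows. This is a minimal in-memory version, not a disk-based implementation; the function name `nested_loop_join` and the `predicate` parameter are illustrative choices, and relations are assumed to be plain lists of dicts.

```python
def nested_loop_join(r_rows, s_rows, predicate):
    """Return every pair (r, s) for which predicate(r, s) is true."""
    output = []
    for r in r_rows:            # outer loop over R
        for s in s_rows:        # inner loop rescans S for each r
            if predicate(r, s):
                output.append((r, s))
    return output


# Hypothetical sample data.
employees = [{"name": "Ann", "dept": 1}, {"name": "Bob", "dept": 2}, {"name": "Cy", "dept": 3}]
departments = [{"dept": 1, "title": "Eng"}, {"dept": 2, "title": "Ops"}]

pairs = nested_loop_join(employees, departments, lambda r, s: r["dept"] == s["dept"])
for r, s in pairs:
    print(r["name"], s["title"])
```

Because the join condition is an arbitrary predicate, the same function also handles non-equality conditions such as `r["dept"] < s["dept"]`.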

This algorithm will involve n_r*b_s + b_r block transfers and n_r + b_r seeks, where b_r and b_s are the numbers of blocks in relations R and S respectively, and n_r is the number of tuples in relation R.
The algorithm runs in O(|R||S|) I/Os, where |R| and |S| are the numbers of tuples contained in R and S respectively, and can easily be generalized to join any number of relations.
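
To make the cost expressions concrete, the sketch below simply evaluates them. The function name `nested_loop_cost` and the sample figures (5000 tuples in R, 100 blocks in R, 400 blocks in S) are made up for illustration.

```python
def nested_loop_cost(n_r, b_r, b_s):
    """Evaluate the nested loop join cost expressions given in the text.

    n_r: number of tuples in R
    b_r: number of blocks in R
    b_s: number of blocks in S
    """
    block_transfers = n_r * b_s + b_r
    seeks = n_r + b_r
    return block_transfers, seeks


# Hypothetical sizes chosen only to show the arithmetic.
print(nested_loop_cost(n_r=5000, b_r=100, b_s=400))  # (2000100, 5100)
```

The n_r*b_s term dominates, which is why reducing the number of scans of S matters so much.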
The block nested loop join algorithm is a generalization of the simple nested loops algorithm that takes advantage of additional memory to reduce the number of times that the S relation is scanned.
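
A rough sketch of the block variant is shown below. It assumes in-memory lists, uses a hypothetical `block_size` parameter to model how many R tuples fit in the extra memory, and returns a scan counter purely to make the saving visible.

```python
def block_nested_loop_join(r_rows, s_rows, predicate, block_size):
    """Join R and S, scanning S once per block of R rather than once per tuple."""
    output = []
    s_scans = 0
    for start in range(0, len(r_rows), block_size):
        block = r_rows[start:start + block_size]   # R tuples held in memory
        s_scans += 1                               # one pass over S per block
        for s in s_rows:
            for r in block:
                if predicate(r, s):
                    output.append((r, s))
    return output, s_scans


# Hypothetical data: 10 tuples in R, joined in blocks of 4.
r_rows = [{"k": i} for i in range(10)]
s_rows = [{"k": i} for i in range(0, 10, 2)]
pairs, scans = block_nested_loop_join(r_rows, s_rows, lambda r, s: r["k"] == s["k"], 4)
print(len(pairs), scans)  # 5 3
```

With 10 tuples in R, the simple algorithm would scan S ten times; with blocks of 4 it is scanned three times.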

Sort-merge join

The sort-merge join (also known as merge join) is a join algorithm used in the implementation of relational database management systems.
The basic problem of a join algorithm is to find, for each distinct value of the join attribute, the set of tuples in each relation which display that value.
The key idea of the sort-merge algorithm is to first sort the relations by the join attribute, so that interleaved linear scans will encounter these sets at the same time.
In practice, the most expensive part of performing a sort-merge join is arranging for both inputs to the algorithm to be presented in sorted order.
- This can be achieved via an explicit sort operation (often an external sort), or by taking advantage of a pre-existing ordering in one or both of the join relations.
- The latter condition can occur because an input to the join might be produced by an index scan of a tree-based index, another merge join, or some other plan operator that happens to produce output sorted on an appropriate key.
Time complexity: Suppose we have two relations R and S, where R fits in Pr pages of memory and S fits in Ps pages of memory. In the worst case, sort-merge join will run in O(Pr+Ps) I/Os. If R and S are not already ordered, the worst-case cost contains additional terms for sorting time: O(Pr+Ps+Pr log(Pr)+Ps log(Ps)), which equals O(Pr log(Pr)+Ps log(Ps)).

Pseudocode

For simplicity, the algorithm is described in the case of an inner join of two relations on a single attribute. Generalization to other join types, more relations and more keys is straightforward.

 function sortMerge(relation left, relation right, attribute a)
     var relation output
     var list left_sorted := sort(left, a) // Relation left sorted on attribute a
     var list right_sorted := sort(right, a) // Relation right sorted on attribute a
     var attribute left_key, right_key
     var set left_subset, right_subset // These sets discarded except where join predicate is satisfied
     advance(left_subset, left_sorted, left_key, a)
     advance(right_subset, right_sorted, right_key, a)
     while not empty(left_subset) and not empty(right_subset)
         if left_key = right_key // Join predicate satisfied
             add cartesian product of left_subset and right_subset to output
             advance(left_subset, left_sorted, left_key, a)
             advance(right_subset, right_sorted, right_key, a)
         else if left_key < right_key
             advance(left_subset, left_sorted, left_key, a)
         else // left_key > right_key
             advance(right_subset, right_sorted, right_key, a)
     return output

 // Remove tuples from sorted to subset until the sorted[1].a value changes
 function advance(subset out, sorted inout, key out, a in)
     subset := emptySet
     if empty(sorted) // Nothing left to read: leave subset empty so the caller's loop ends
         return
     key := sorted[1].a
     while not empty(sorted) and sorted[1].a = key
         insert sorted[1] into subset
         remove sorted[1]
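
The pseudocode can be approximated in Python as follows. This is a minimal in-memory sketch for an inner join on a single attribute; it uses index ranges instead of removing tuples from the sorted lists, and the function name `sort_merge_join` and the list-of-dicts representation are assumptions made for the example.

```python
def sort_merge_join(left, right, key):
    """Inner equi-join of two lists of dicts on column `key` by sorting and merging."""
    left_sorted = sorted(left, key=lambda row: row[key])
    right_sorted = sorted(right, key=lambda row: row[key])
    output = []
    i = j = 0
    while i < len(left_sorted) and j < len(right_sorted):
        left_key = left_sorted[i][key]
        right_key = right_sorted[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left_sorted) and left_sorted[i_end][key] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right_sorted) and right_sorted[j_end][key] == left_key:
                j_end += 1
            # Emit the Cartesian product of the two runs.
            for l in left_sorted[i:i_end]:
                for r in right_sorted[j:j_end]:
                    output.append((l, r))
            i, j = i_end, j_end
    return output


# Hypothetical data with duplicate keys on both sides.
left = [{"k": 2, "l": "a"}, {"k": 1, "l": "b"}, {"k": 2, "l": "c"}]
right = [{"k": 2, "r": "x"}, {"k": 3, "r": "y"}, {"k": 2, "r": "z"}]
print(len(sort_merge_join(left, right, "k")))  # 4
```

Key 2 appears twice on each side, so the join produces 2 x 2 = 4 pairs; keys 1 and 3 have no partner and produce nothing.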

Hash Join

Hash join resembles nested loop join but is typically faster, because it finds matching rows through a hash table instead of rescanning the other relation. Hash join is used for equi-joins.

Classic hash join

The classic hash join algorithm for an inner join of two relations proceeds as follows:
1. Build phase: prepare a hash table for the smaller relation. The hash table entries consist of the join attribute and its row. (Because the hash table is accessed by applying a hash function to the join attribute, it will be much quicker to find a given join attribute's rows by using this table than by scanning the original relation.)
2. Probe phase: scan the larger relation and find the relevant rows from the smaller relation by looking in the hash table.
This algorithm is simple, but it requires that the smaller join relation fits into memory, which is sometimes not the case.
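
The build and probe phases can be sketched as below, assuming both relations are in-memory lists of dicts and the caller passes the smaller relation as `build`. The names `hash_join`, `build`, and `probe` are illustrative.

```python
from collections import defaultdict


def hash_join(build, probe, key):
    """Inner equi-join: hash the smaller input, then probe with the larger one."""
    # Build phase: map each join-attribute value to the rows that carry it.
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)

    # Probe phase: look up each row of the larger input in the hash table.
    output = []
    for row in probe:
        for match in table.get(row[key], []):
            output.append((match, row))
    return output


# Hypothetical data: a small departments table and a larger employees table.
departments = [{"dept": 1, "title": "Eng"}, {"dept": 2, "title": "Ops"}]
employees = [{"name": "Ann", "dept": 1}, {"name": "Bob", "dept": 2},
             {"name": "Cy", "dept": 1}, {"name": "Di", "dept": 9}]
for d, e in hash_join(departments, employees, "dept"):
    print(e["name"], d["title"])
```

Each probe row costs one hash lookup on average, rather than a scan of the build relation.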
A simple approach to handling a build input that does not fit into memory proceeds as follows:
1. For each tuple r in the build input R
  1. Add r to the in-memory hash table
  2. If the size of the hash table equals the maximum in-memory size:
    1. Scan the probe input S, and add matching join tuples to the output relation
    2. Reset the hash table, and continue scanning the build input R
2. Do a final scan of the probe input S and add the resulting join tuples to the output relation
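
The steps above might look like this in Python. The memory limit is modelled by a hypothetical `max_rows` parameter counting rows rather than bytes, and the function returns the number of probe scans only to show how often S is re-read.

```python
from collections import defaultdict


def probe_once(table, probe, key, output):
    """Scan the probe input once and append matches with the current hash table."""
    for s in probe:
        for r in table.get(s[key], []):
            output.append((r, s))


def chunked_hash_join(build, probe, key, max_rows):
    """Hash join that rescans the probe input whenever the hash table fills up."""
    output = []
    table = defaultdict(list)
    size = 0
    probe_scans = 0
    for r in build:
        table[r[key]].append(r)
        size += 1
        if size == max_rows:          # hash table reached its in-memory limit
            probe_once(table, probe, key, output)
            probe_scans += 1
            table.clear()             # reset and keep reading the build input
            size = 0
    # Final scan for whatever is still in the hash table.
    probe_once(table, probe, key, output)
    probe_scans += 1
    return output, probe_scans


# Hypothetical data: 5 build rows with room for only 2 at a time.
build = [{"k": i} for i in range(5)]
probe = [{"k": i} for i in range(3, 8)]
pairs, scans = chunked_hash_join(build, probe, "k", max_rows=2)
print(len(pairs), scans)  # 2 3
```

The probe input is scanned once per filled hash table plus once at the end, which is the cost the grace hash join is designed to avoid.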

Grace hash join

A better approach is known as the "grace hash join", after the GRACE database machine for which it was first implemented.
This algorithm avoids rescanning the entire S relation as follows:
- It first partitions both R and S via a hash function and writes these partitions out to disk.
- It then loads pairs of partitions into memory, builds a hash table for the smaller partitioned relation, and probes the other relation for matches with the current hash table. Because the partitions were formed by hashing on the join key, any two tuples that join must belong to the same pair of partitions.
- One or more of the partitions may still not fit into the available memory, in which case the algorithm is applied recursively: an additional orthogonal hash function is chosen to hash the large partition into sub-partitions, which are then processed as before. Since this is expensive, the algorithm tries to reduce the chance that it will occur by forming partitions that are as small as reasonably possible during the initial partitioning phase.
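
A simplified sketch of the partition-then-join idea follows. In-memory lists stand in for the on-disk partitions, Python's built-in `hash` stands in for the partitioning hash function, and the recursive re-partitioning step is omitted; `grace_hash_join` and `num_partitions` are names chosen for the example.

```python
from collections import defaultdict


def grace_hash_join(r_rows, s_rows, key, num_partitions):
    """Partition both inputs on the join key, then hash-join matching partitions."""
    # Partitioning phase: lists stand in for partitions written to disk.
    r_parts = [[] for _ in range(num_partitions)]
    s_parts = [[] for _ in range(num_partitions)]
    for row in r_rows:
        r_parts[hash(row[key]) % num_partitions].append(row)
    for row in s_rows:
        s_parts[hash(row[key]) % num_partitions].append(row)

    # Join phase: each pair of partitions is processed independently.
    output = []
    for r_part, s_part in zip(r_parts, s_parts):
        build_is_r = len(r_part) <= len(s_part)
        build, probe = (r_part, s_part) if build_is_r else (s_part, r_part)
        table = defaultdict(list)
        for row in build:                  # hash the smaller partition
            table[row[key]].append(row)
        for row in probe:                  # probe with the other partition
            for match in table.get(row[key], []):
                output.append((match, row) if build_is_r else (row, match))
    return output


# Hypothetical data.
r_rows = [{"k": i, "r": i * 10} for i in range(8)]
s_rows = [{"k": i, "s": i * 100} for i in range(4, 12)]
print(len(grace_hash_join(r_rows, s_rows, "k", num_partitions=3)))  # 4
```

Every output pair is returned as (row of R, row of S) regardless of which side was used to build the hash table for its partition.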

Hybrid hash join

The hybrid hash join algorithm is a refinement of the grace hash join which takes advantage of more available memory.
During the partitioning phase, the hybrid hash join uses the available memory for two purposes:
- To hold the current output buffer page for each of the k partitions.
- To hold an entire partition in-memory, known as "partition 0".
Because partition 0 is never written to or read from disk, the hybrid hash join typically performs fewer I/O operations than the grace hash join.
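
The sketch below illustrates the role of partition 0 under several assumptions not stated in the text: R is taken as the build input, S rows that fall in partition 0 are probed immediately while S is being partitioned, and in-memory lists stand in for the partitions spilled to disk. The `spilled` counter is included only to show how many rows would have gone through disk.

```python
from collections import defaultdict


def hybrid_hash_join(r_rows, s_rows, key, num_partitions):
    """Hash join that keeps partition 0 of R in memory and spills the rest."""
    table0 = defaultdict(list)                       # partition 0, never spilled
    r_spill = [[] for _ in range(num_partitions)]    # index 0 stays unused
    s_spill = [[] for _ in range(num_partitions)]
    output = []

    # Partition R: partition 0 goes straight into an in-memory hash table.
    for row in r_rows:
        p = hash(row[key]) % num_partitions
        if p == 0:
            table0[row[key]].append(row)
        else:
            r_spill[p].append(row)

    # Partition S: rows in partition 0 are joined at once, others are spilled.
    for row in s_rows:
        p = hash(row[key]) % num_partitions
        if p == 0:
            for match in table0.get(row[key], []):
                output.append((match, row))
        else:
            s_spill[p].append(row)

    spilled = sum(len(part) for part in r_spill) + sum(len(part) for part in s_spill)

    # Remaining partitions are joined pairwise, as in the grace hash join.
    for p in range(1, num_partitions):
        table = defaultdict(list)
        for row in r_spill[p]:
            table[row[key]].append(row)
        for row in s_spill[p]:
            for match in table.get(row[key], []):
                output.append((match, row))
    return output, spilled


# Hypothetical data.
r_rows = [{"k": i} for i in range(8)]
s_rows = [{"k": i} for i in range(8)]
pairs, spilled = hybrid_hash_join(r_rows, s_rows, "k", num_partitions=4)
print(len(pairs), spilled < len(r_rows) + len(s_rows))
```

Rows that hash to partition 0 never touch the spill lists, which is where the saving in I/O comes from.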

Hash anti-join

Hash joins can also be evaluated for an anti-join predicate (a predicate selecting values from one table when no related values are found in the other). Depending on the sizes of the tables, different algorithms can be applied:

Hash left anti-join

Prepare a hash table for the NOT IN side of the join.
Scan the other table, selecting any rows where the join attribute hashes to an empty entry in the hash table.
This is more efficient when the NOT IN table is smaller than the FROM table.
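
A minimal sketch of the left anti-join, assuming in-memory lists of dicts and ignoring SQL NULL handling. A Python set plays the role of the hash table, and the names `hash_left_anti_join`, `from_rows`, and `not_in_rows` are illustrative.

```python
def hash_left_anti_join(from_rows, not_in_rows, key):
    """Return rows of the FROM side whose key has no match on the NOT IN side."""
    # Build: hash the join attribute of the (smaller) NOT IN side.
    seen = {row[key] for row in not_in_rows}
    # Probe: keep FROM rows that find no entry.
    return [row for row in from_rows if row[key] not in seen]


# Hypothetical data: customers who have placed no orders.
customers = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
orders = [{"id": 2}, {"id": 4}, {"id": 4}]
print(hash_left_anti_join(customers, orders, "id"))  # [{'id': 1}, {'id': 3}]
```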

Hash right anti-join

Prepare a hash table for the FROM side of the join.
Scan the NOT IN table, removing the corresponding records from the hash table on each hash hit.
Return everything that is left in the hash table.
This is more efficient when the NOT IN table is larger than the FROM table.
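
The right anti-join can be sketched the same way, again assuming in-memory lists of dicts and ignoring SQL NULL handling. Here the hash table is built on the FROM side and entries are deleted as the NOT IN side is scanned; the function and parameter names are illustrative.

```python
from collections import defaultdict


def hash_right_anti_join(from_rows, not_in_rows, key):
    """Return rows of the FROM side whose key has no match on the NOT IN side."""
    # Build: hash the (smaller) FROM side.
    table = defaultdict(list)
    for row in from_rows:
        table[row[key]].append(row)
    # Scan the NOT IN side, deleting the matching entry on each hit.
    for row in not_in_rows:
        table.pop(row[key], None)
    # Whatever survives had no match.
    return [row for rows in table.values() for row in rows]


# Hypothetical data: a small FROM table and a larger NOT IN table.
customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"id": i % 2 + 2} for i in range(10)]   # keys 2 and 3 only
print(hash_right_anti_join(customers, orders, "id"))  # [{'id': 1}]
```

Both anti-join sketches return the same rows; they differ only in which side is hashed, which is why the relative table sizes decide between them.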

Join Indexes

Join indexes are database indexes that facilitate the processing of join queries in data warehouses.
As of 2012, they are available in implementations by Oracle and Teradata.
In the Teradata implementation, specified columns, aggregate functions on columns, or components of date columns from
The Oracle implementation limits itself to using bitmap indexes.
