JOINs are essential operations in relational databases. They create a link between rows based on common values and allow the meaningful combination of these rows. CrateDB supports joins and due to its distributed nature allows you to work with large amounts of data.

In this document we will present the following topics. First, an overview of the existing types of joins and algorithms provided. Then a description of how CrateDB implements them along with the necessary optimizations, which allows us to work with huge datasets.

Table of Contents

A join is a relational operation that merges two data sets based on certain properties. Join Types (Inspired by this article) shows which elements appear in which join.

Join Types

From left to right, top to bottom: left join, right join, inner join, outer join, and cross join of a set L and R.

cross join returns the Cartesian product of two or more relations. The result of the Cartesian product on the relation L and R consists of all possible permutations of each tuple of the relation L with every tuple of the relation R.

An inner join is a join of two or more relations that returns only tuples that satisfy the join condition.

An equi join is a subset of an inner join and a comparison-based join, that uses equality comparisons in the join condition. The equi join of the relation L and R combines tuple l of relation L with a tuple r of the relation R if the join attributes of both tuples are identical.

Outer Join

An outer join returns a relation consisting of tuples that satisfy the join condition and dangling tuples from both or one of the relations, respectively to the outer join type.

An outer join has following types:

  • Left outer join returns tuples of the relation L matching tuples of the relation R and dangling tuples of the relation R padded with null values.
  • Right outer join returns tuples of the relation R matching tuples of the relation L and dangling tuples from the relation L padded with null values.
  • Full outer join returns matching tuples of both relations and dangling tuples produced by left and right outer joins.

CrateDB supports (a) CROSS JOIN, (b) INNER JOIN, (c) EQUI JOIN, (d) LEFT JOIN, (e) RIGHT JOIN and (f) FULL JOIN. All of these join types are executed using the nested loop join algorithm except for the Equi Joinswhich are executed using the hash join algorithm. Special optimizations, according to the specific use cases, are applied to improve execution performance.

The nested loop join is the simplest join algorithm. One of the relations is nominated as the inner relation and the other as the outer relation. Each tuple of the outer relation is compared with each tuple of the inner relation and if the join condition is satisfied, the tuples of the relation L and R are concatenated and added into the returned virtual relation:

for each tuple l ∈ L do
for each tuple r ∈ R do
if l.a Θ r.b
put tuple(l, r) in Q

Listing 1. Nested loop join algorithm.

For joins on some relations, the nested loop operation can be executed directly on the handler node. Specifically for queries involving a CROSS JOIN or joins on system tables /information_schema each shard sends the data to the handler node. Afterwards, this node runs the nested loop, applies limits, etc. and ultimately returns the results. Similarly, joins can be nested, so instead of collecting data from shards the rows can be the result of a previous join or table function.

Relations are usually distributed to different nodes which require the nested loop to acquire the data before being able to join. After finding the locations of the required shards (which is done in the planning stage), the smaller data set (based on the row count) is broadcast amongst all the nodes holding the shards they are joined with. After that, each of the receiving nodes can start running a nested loop on the subset it has just received. Finally, these intermediate results are pushed to the original (handler) node to merge and return the results to the requesting client (see Nodes that are holding the smaller shards broadcast the data to the processing nodes which then return the results to the requesting node.).

Nodes that are holding the smaller shards broadcast the data to the processing nodes which then return the results to the requesting node.

Pre-Ordering and Limits Optimization

Queries can be optimized if they contain (a) ORDER BY, (b) LIMIT, or (c) if INNER/EQUI JOIN. In any of these cases, the nested loop can be terminated earlier:

  • Ordering allows determining whether there are records left
  • Limit states the maximum number of rows that are returned

Consequently, the number of rows is significantly reduced allowing the operation to complete much faster.

The Hash Join algorithm is used to execute certains types of joins in a more perfomant way than Nested Loop.

The operation takes place in one node (the handler node to which the client is connected). The rows of the left relation of the join are read and a hashing algorithm is applied on the fields of the relation which participate in the join condition. The hashing algorithm generates a hash value which is used to store every row of the left relation in the proper position in a hash table.

Then the rows of the right relation are read one-by-one and the same hashing algorithm is applied on the fields that participate in the join condition. The generated hash value is used to make a lookup in the hash table. If no entry is found, the row is skipped and the processing continues with the next row from the right relation. If an entry is found, the join condition is validated (handling hash collisions) and on successful validation the combined tuple of left and right relation is returned.

Basic hash join algorithm

The Hash Join algorithm requires a hash table containing all the rows of the left relation to be stored in memory. Therefore, depending on the size of the relation (number of rows) and the size of each row, the size of this hash table might exceed the available memory of the node executing the hash join. To resolve this limitation the rows of the left relation are loaded into the hash table in blocks.

On every iteration the maximum available size of the hash table is calculated, based on the number of rows and size of each row of the table but also taking into account the available memory for query execution on the node. Once this block-size is calculated the rows of the left relation are processed and inserted into the hash table until the block-size is reached. The operation then starts reading the rows of the right relation, process them one-by-one and performs the lookup and the join condition validation. Once all rows from the right relation are processed the hash table is re-initialized based on a new calculation of the block size and a new iteration starts until all rows of the left relation are processed.

With this algorithm the memory limitation is handled in expense of having to iterate over the rows of the right table multiple times, and it is the default algorithm used for Hash Join execution by CrateDB.

Since the right table can be processed multiple times (number of rows from left / block-size) the right table should be the smaller (in number of rows) of the two relations participating in the join. Therefore, if originally the right relation is larger than the left the query planner performs a switch to take advantage of this detail and execute the hash join with better performance.

Since CrateDB is a distributed database and a standard deployment consists of at least three nodes and in most case of much more, the Hash Join algorithm execution can be further optimized (performance-wise) by executing it in a distributed manner across the CrateDB cluster.

The idea is to have the hash join operation executing in multiple nodes of the cluster in parallel and then merge the intermediate results before returning them to the client.

A hashing algorithm is applied on every row of both the left and right relations. On the integer value generated by this hash, a modulo, by the number of nodes in the cluster, is applied and the resulting number defines the node to which this row should be sent. As a result each node of the cluster receives a subset of the whole data set which is ensured (by the hashing and modulo) to contain all candidate matching rows. Each node in turn performs a block hash join on this subset and sends its result tuples to the handler node (where the client issued the query). Finally, the handler node receives those intermediate results, merges them and applies any pending ORDER BYLIMITand OFFSET and sends the final result to the client.

This algorithm is used by CrateDB for most cases of hash join execution except for joins on complex subqueries that contain LIMIT and/or OFFSET.

Distributed hash join algorithm

Join operations on large relation can be extremely slow especially if the join is executed with a Nested Loop. - which means that the runtime complexity grows quadratically (O(n*m)). Specifically for Cross Joins this results in large amounts of data sent over the network and loaded into memory at the handler node. CrateDB reduces the volume of data transferred by employing Query Then Fetch: First, filtering and ordering are applied (if possible where the data is located) to obtain the required document IDs. Next, as soon as the final data set is ready, CrateDB fetches the selected fields and returns the data to the client.

Complex queries such as Listing 2 require the planner to decide when to filter, sort, and merge in order to efficiently execute the plan. In this case, the query would be split internally into subqueries before running the join. As shown in Figure 5, first filtering (and ordering) is applied to relations L and R on their shards, then the result is directly broadcast to the nodes running the join. Not only will this behavior reduce the number of rows to work with, it also distributes the workload among the nodes so that the (expensive) join operation can run faster.

SELECT L.a, R.x
FROM L, R
WHERE L.id = R.id
AND L.b > 100
AND R.y < 10
ORDER BY L.a

Listing 2. An INNER JOIN on ids (effectively an EQUI JOIN) which can be optimized.

Figure 5

Complex queries are broken down into subqueries that are run on their shards before joining.
 
 
 
 

cratedb joins 原理(官方文档)的更多相关文章

  1. Es官方文档整理-2.分片内部原理

    Es官方文档整理-2.分片内部原理 1.集群      一个运行的Elasticsearch实例被称为一个节点,而集群是有一个或多个拥有相同claster.name配置的节点组成,他们共同承担数据和负 ...

  2. cassandra 3.x官方文档(6)---内部原理之存储引擎

    写在前面 cassandra3.x官方文档的非官方翻译.翻译内容水平全依赖本人英文水平和对cassandra的理解.所以强烈建议阅读英文版cassandra 3.x 官方文档.此文档一半是翻译,一半是 ...

  3. cassandra 3.x官方文档(7)---内部原理之如何读写数据

    写在前面 cassandra3.x官方文档的非官方翻译.翻译内容水平全依赖本人英文水平和对cassandra的理解.所以强烈建议阅读英文版cassandra 3.x 官方文档.此文档一半是翻译,一半是 ...

  4. hbase官方文档(转)

    FROM:http://www.just4e.com/hbase.html Apache HBase™ 参考指南  HBase 官方文档中文版 Copyright © 2012 Apache Soft ...

  5. HBase官方文档

    HBase官方文档 目录 序 1. 入门 1.1. 介绍 1.2. 快速开始 2. Apache HBase (TM)配置 2.1. 基础条件 2.2. HBase 运行模式: 独立和分布式 2.3. ...

  6. Spark SQL 官方文档-中文翻译

    Spark SQL 官方文档-中文翻译 Spark版本:Spark 1.5.2 转载请注明出处:http://www.cnblogs.com/BYRans/ 1 概述(Overview) 2 Data ...

  7. Cassandra 3.x官方文档(1)---关于Cassandra

    写在前面 cassandra3.x官方文档的非官方翻译.翻译内容水平全依赖本人英文水平和对cassandra的理解.所以强烈建议阅读英文版cassandra 3.x 官方文档.此文档一半是翻译,一半是 ...

  8. 从官方文档去学习之FreeMarker

    一.前言 上一篇 <从现在开始,试着学会用官方文档去学习一个技术框架>提倡大家多去从官方文档学习技术,没有讲到具体的实践,本篇就拿一个案例具体的说一说,就是FreeMarker,选择这个框 ...

  9. Sqoop 使用详解(内含对官方文档的解析)

    Sqoop 是 Cloudera 公司创造的一个数据同步工具,现在已经完全开源了. 目前已经是 hadoop 生态环境中数据迁移的首选,另外还有 ali 开发的 DataX 属于同类型工具,由于社区的 ...

随机推荐

  1. "字节跳动杯"2018中国大学生程序设计竞赛-女生专场 Solution

    A - 口算训练 题意:询问 $[L, R]$区间内 的所有数的乘积是否是D的倍数 思路:考虑分解质因数 显然,一个数$x > \sqrt{x} 的质因子只有一个$ 那么我们考虑将小于$\sqr ...

  2. MFC读写EXIF信息,图片非占用

    MFC读写EXIF信息 读取有类库可以直接调用,网络上有直接可以用的:但是写Exif的资料非常少,我花了一点时间研究收集,终于成功. 将相关的资料共享.主要是借助gdi+,需要注意的地方很多   // ...

  3. linux第七周

    可执行程序的装载 一.预处理.编译.链接和目标文件的格式 可执行文件的创建——预处理.编译和链接 cd Code vi hello.c gcc -E -o hello.cpp hello.c -m32 ...

  4. stm32 Flash读写独立函数[库函数]

    一. stm32的FLASH分为 1.主存储块:用于保存具体的程序代码和用户数据,主存储块是以页为单位划分的, 一页大小为1KB.范围为从地址0x08000000开始的128KB内. 2.信息块   ...

  5. Pandas数据分析python环境说明文档

    1. 要求windows系统 2. pycharm编程环境并要求配置好python3.x环境 pycharm可在官网下载,下面是链接. https://www.jetbrains.com/zh/pyc ...

  6. 图解Git命令【转】

    本文转载自:https://github.com/geeeeeeeeek/git-recipes/wiki/4.1-%E5%9B%BE%E8%A7%A3Git%E5%91%BD%E4%BB%A4 此页 ...

  7. UVa 1210 连续素数之和

    https://vjudge.net/problem/UVA-1210 题意: 输入整数n,有多少种方案可以把n写成若干个连续素数之和? 思路: 先素数打表,然后求个前缀和. #include< ...

  8. BZOJ 1269 【AHOI2006】 文本编辑器

    题目链接:文本编辑器 这道题没啥好说的,直接上\(Splay\)就行了,板子题…… 但是我某个地方忘了下放标记导致调了一晚上 听说有个东西叫\(rope\)可以直接过?然而我并不会 保存一发板子: # ...

  9. error: device offline - waiting for device -

    解决方法:重启服务 一.关闭 adb kill-server 二.启动 adb start-server 三.连接 adb connect 192.168.1.10 四.查看设备 adb device ...

  10. Memento(备忘录)

    意图: 在不破坏封装性的前提下,捕获一个对象的内部状态,并在该对象之外保存这个状态.这样以后就可将该对象恢复到原先保存的状态. 适用性: 必须保存一个对象在某一个时刻的(部分)状态, 这样以后需要时它 ...