Parallel I/O and Columnar Storage

We begin with a high-level overview of the system; follow-up posts will discuss specific components in more detail. The target audience is software and systems engineers with an interest in databases and distributed systems.

The Challenge

Let's start off with a challenge: We're given a table with 100TB of web tracking data collected over a period of a few days. A small sample of rows from the table might look something like this (the values are made up for illustration):
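time                  session_id   url
2016-09-02 12:13:04   e9b27a5c     /index
2016-09-02 12:13:09   1ff8b733     /account/signup
2016-09-02 12:13:12   e9b27a5c     /products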

Our goal is to answer queries like "How many people visited the page '/account/signup' in the last 10 days?". There are only two rules: The query is not known beforehand and we must answer it in less than a second. Here is the same example query in SQL:

SELECT count(1)
FROM tracking_data
WHERE url = '/account/signup' AND time > time_at("-10d");

If that doesn't sound like an interesting challenge yet, consider this quick back-of-the-envelope calculation: Assuming the average hard disk can read roughly 200MB per second (sequentially), loading 100TB from disk will take 500k disk-seconds or about 138 disk hours.

Oops. 138 hours is almost 6 days. A long way from "less than a second". We're more than five orders of magnitude off, and that's before we've even started to process any of the data; this is just the time it takes to read it from disk.
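To make the numbers concrete, here is the same back-of-the-envelope calculation as a few lines of Python (the 200MB per second sequential read speed is the assumption from above):

DISK_MB_PER_SECOND = 200               # assumed sequential throughput of a single disk
DATA_SET_MB = 100 * 1000 * 1000        # 100TB expressed in MB

seconds_on_one_disk = DATA_SET_MB / DISK_MB_PER_SECOND
print(seconds_on_one_disk)             # 500000.0 seconds
print(seconds_on_one_disk / 3600)      # ~138.9 hours, i.e. almost 6 days

# To finish in one second we would need this many disks reading in parallel:
print(int(seconds_on_one_disk))        # 500000 disks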


How can we still solve the challenge? We can't make a single disk go any faster but we can break up the data set into smaller pieces, put each piece on its own disk and read all pieces from all disks in parallel. If we distributed the data over 500k individual disks we could read our whole data set in one second.

There is only one small problem with that scheme — half a million disks cost a lot of money.

So even if we use a lot of machines, reading the full data set from disk in less than a second is utterly out of reach.


Can we still solve the challenge? Yes, but I'm afraid we'll have to cheat — maybe we can answer the query without actually reading all the data from disk.

If we could come up with an algorithm which computes the query result after reading only 0.01% of the data (or 10GB) from disk, then we could return an answer in one second using just fifty disks. Fifty disks could be hosted in a dozen servers. Finally, that sounds reasonable!
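The same sanity check for this plan, again assuming 200MB per second of sequential throughput per disk:

DATA_TO_READ_MB = 100 * 1000 * 1000 * 0.0001   # 0.01% of 100TB = 10000 MB = 10GB
DISKS = 50
mb_per_disk = DATA_TO_READ_MB / DISKS          # 200MB for each disk to read
print(mb_per_disk / 200)                       # 1.0 second, with all disks reading in parallel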

But how do we compute the answer for the full dataset while reading only 0.01% of the data from disk? One approach would be to use sampling and probabilistic algorithms. However, sampling would only give us approximately correct results. Approximate results are great for some use cases and anywhere from not so great to unworkable for others.

There is another trick we can use to minimize the amount of data to be read from disk while always giving correct results. The technique is called "column-oriented" or "columnar" storage.

Columnar Storage

To really understand the benefit of columnar storage for data analytics, we first have to look at how regular row-oriented databases store tables on disk:

It turns out they do it pretty much like you'd expect them to. They basically keep a file somewhere which contains all the rows in the table. One row after another. Hence the name "row-oriented".

A row-oriented file conceptually looks something like this (a rough sketch using the columns from our example; real on-disk formats are more involved):
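[ time_1 | session_id_1 | url_1 ][ time_2 | session_id_2 | url_2 ][ time_3 | session_id_3 | url_3 ] ...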

What's problematic about row-oriented databases is that to execute a query, they always have to read the full rows from disk. Even if just a small part of each row is required to answer the query, the database still has to read every row in full. [3]

The not completely intuitive reason for this is that hard disks are only fast if you read a file sequentially, i.e. only if you read one byte after another. Jumping around within a file performs very poorly. So poorly in fact, that reading whole rows is practically always faster than reading partial rows. [4]

Consider the following example query which calculates the number of page views per minute:

SELECT date_trunc("1min", time) AS minute, count(1)
FROM tracking_data
GROUP BY date_trunc("1min", time);

To compute the answer, we only need to know the value of the time column of each row. We're not interested in the session_id, the url or any of the potentially hundreds of other columns of the table.

Ideally, we would only load the time column of each row from disk when executing the query. But we just saw that row-oriented storage can't do that efficiently. We always have to read the full rows no matter what.

Depending on the table and query, we could be spending 99% of the time reading data from disk which we're not going to need to answer the query.


This is the problem column-oriented storage tries to solve. The basic idea is that instead of storing one row after another, we can break up the row into the individual columns and then store one column after another.

If, for example, our table contained one thousand rows with three columns each (time, session_id and url), we would first store an array of a thousand time values, then another array of a thousand session_id values and finally an array of a thousand url values.
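A minimal sketch in Python (not how a real database encodes anything, just to make the difference in layout visible):

ROWS = [
    ("time_1", "session_id_1", "url_1"),
    ("time_2", "session_id_2", "url_2"),
    ("time_3", "session_id_3", "url_3"),
]

# Row-oriented layout: each row is written out in full, one row after another.
row_oriented = [value for row in ROWS for value in row]
# ['time_1', 'session_id_1', 'url_1', 'time_2', 'session_id_2', 'url_2', ...]

# Column-oriented layout: all time values, then all session_id values, then all urls.
column_oriented = [row[column] for column in range(3) for row in ROWS]
# ['time_1', 'time_2', 'time_3', 'session_id_1', 'session_id_2', ...]

# A query that only needs the time column can stop after the first third of the
# column-oriented file instead of touching every row.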

You might have come across the concept before under a different name: What we're doing is basically vectorization [5].

Storing each column separately has two desirable properties. The first and most obvious one is that it allows us to also read each column separately — we don't have to load all the extraneous columns from disk anymore.

The second and less obvious upside of columnar storage is that we can compress the data very efficiently. The compression further reduces the number of bytes we actually have to fetch from disk to read a row.

To see why compression in columnar files can be very significant, imagine our table had a fourth is_customer column. Sadly, we don't have any customers yet so the field is always false.

In a row-oriented database, storing the is_customer field for one million rows would take at least 1MB (one byte per boolean). In a columnar database we can store all one million values in a single byte — a 1000000x improvement. [6]
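How this is actually done is the topic of a later post [6]; as a rough illustration, even a toy run-length encoding already collapses such a column into almost nothing:

from itertools import groupby

is_customer = [False] * 1000000                # one million identical boolean values

# Toy run-length encoding: store each run of equal values as a (value, count) pair.
encoded = [(value, sum(1 for _ in run)) for value, run in groupby(is_customer)]
print(encoded)                                 # [(False, 1000000)] -- a single tiny pair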


Lastly, it should be noted that columnar storage also has a big downside: It's less efficient to perform updates on columnar files than on row-oriented files. So you probably won't see the classical OLTP databases like MySQL switching to columnar any time soon. Still, for analytical queries on large data sets columnar is clearly the way to go.

Are we there yet?

Looks like we have finally put together a scheme which will allow us to execute a SQL query on 100TB of data in less than a second, even though just reading the data from disk would have taken many days.

Let's recapitulate our approach: We're going to split up the data into many small pieces, then distribute the pieces among a dozen or so machines. On each machine, we will cheat by storing the rows in columnar format and only actually reading a small subset of the compressed data to answer the query.

Of course, we have only scratched the surface of the problem so far. In the next post we will discuss how exactly we're going to split up the data into pieces.

You can subscribe to email updates for upcoming posts or to the RSS feed in the sidebar.

[0] Andrew Lamb et al. (2012) The Vertica Analytic Database: C-Store 7 Years Later (The 38th International Conference on Very Large Data Bases) — http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf

[1] Sergey Melnik et al. (2010) Dremel: Interactive Analysis of Web-Scale Datasets (The 36th International Conference on Very Large Data Bases) — http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf

[2] EventQL (2016) An open-source SQL database for large-scale event analytics — http://eventql.io

[3] Yes, you can put an index on the column. It turns out that indexes in traditional row-oriented databases are fundamentally columnar representations of the data. Another way to look at it would be that a columnar table behaves like a traditional table with an automatic index on all columns.

[4] Of course this is an oversimplification. However, even with buffering and speculative read-ahead you won't realistically be faster than reading the rows in full.

[5] "Vectorization" on Wikipedia — https://en.wikipedia.org/wiki/Vectorization

[6] We'll reveal how later in this series. If you can't wait until then check out this excellent paper on the topic [1].
