4.1 Storage requirements for the master dataset

To determine the requirements for data storage, you must consider how your data will be written and how it will be read. The role of the batch layer within the Lambda Architecture affects both.

In chapter 2 we emphasized two key properties of data: data is immutable and eternally true. Consequently, each piece of your data will be written once and only once. There is no need to ever alter your data – the only write operation will be to add a new data unit to your dataset. The storage solution must therefore be optimized to handle a large, constantly growing set of data.

The batch layer is also responsible for computing functions on the dataset to produce the batch views. This means the batch layer storage system needs to be good at reading lots of data at once. In particular, random access to individual pieces of data is not required.

With this “write once, bulk read many times” paradigm in mind, we can create a checklist of requirements for the data storage.

4.2 Choosing a storage solution for the batch layer

With the requirements checklist in hand, you can now consider options for batch layer storage. With such loose requirements – not even needing random access to the data – it seems like you could use pretty much any distributed database for the master dataset.

4.2.1 Using a key/value store for the master dataset

We haven't discussed distributed key/value stores yet, but you can essentially think of them as giant persistent hashmaps that are distributed among many machines. If you're storing a master dataset on a key/value store, the first thing you have to figure out is what the keys should be and what the values should be.

What a value should be is obvious – it's a piece of data you want to store – but what should a key be? There's no natural key in the data model, nor is one necessary because the data is meant to be consumed in bulk. So you immediately hit an impedance mismatch between the data model and how key/value stores work. The only really viable idea is to generate a UUID to use as a key.

But this is only the start of the problems with using key/value stores for a master dataset. Because key/value stores need fine-grained access to key/value pairs to do random reads and writes, you can't compress multiple key/value pairs together. So you're severely limited in tuning the trade-off between storage costs and processing costs.
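To make the compression point concrete, here is a minimal sketch that compares compressing records one at a time, as a store with per-key random access would need to, against compressing them together as one block. The record contents and the choice of Python's zlib module are illustrative assumptions, not something the text prescribes.

```python
import zlib

# Hypothetical records: similar login events, as a master dataset might hold.
records = [
    f'{{"user": "user{i % 50}", "ip": "10.0.0.{i % 256}", "ts": {1_600_000_000 + i}}}'.encode()
    for i in range(1000)
]

# Per-record compression: every value must be independently decodable.
individual_size = sum(len(zlib.compress(r)) for r in records)

# Bulk compression: all records are compressed together as a single block.
bulk_blob = zlib.compress(b"\n".join(records))
bulk_size = len(bulk_blob)

print(f"compressed individually: {individual_size} bytes")
print(f"compressed together:     {bulk_size} bytes")
```

Compressing in bulk lets the compressor exploit redundancy across records, which is exactly the tuning knob a key/value store takes away.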

Key/value stores are meant to be used as mutable stores, which is a problem when enforcing immutability is so crucial for the master dataset. Unless you modify the code of the key/value store you're using, you typically can't disable the ability to modify existing key/value pairs.

The biggest problem, though, is that a key/value store has a lot of things you don't need: random reads, random writes, and all the machinery behind making those work. In fact, most of the implementation of a key/value store is dedicated to these features you don't need at all. This means the tool is enormously more complex than it needs to be to meet your requirements, making it much more likely you'll have a problem with it. Additionally, the key/value store indexes your data and provides unneeded services, which will increase your storage costs and lower your performance when reading and writing data.

4.2.2 Distributed filesystems

It turns out there's a type of technology that you're already intimately familiar with that's a perfect fit for batch layer storage: filesystems.

Files are sequences of bytes, and the most efficient way to consume them is by scanning through them. They're stored sequentially on disk (sometimes they're split into blocks, but reading and writing is still essentially sequential). You have full control over the bytes of a file, and you have the full freedom to compress them however you want. Unlike a key/value store, a filesystem gives you exactly what you need and no more, while also not limiting your ability to tune storage cost versus processing cost. On top of that, filesystems implement fine-grained permission systems, which are perfect for enforcing immutability.

The problem with a regular filesystem is that it exists on just a single machine, so you can only scale to the storage limits and processing power of that one machine. But it turns out that there's a class of technologies called distributed filesystems that is quite similar to the filesystems you're familiar with, except they spread their storage across a cluster of computers. They scale by adding more machines to the cluster. Distributed filesystems are designed so that you have fault tolerance when a machine goes down, meaning that if you lose one machine, all your files and data will still be accessible.

There are some differences between distributed filesystems and regular filesystems. The operations you can do with a distributed filesystem are often more limited than what you can do with a regular filesystem. For instance, you may not be able to write to the middle of a file or even modify a file at all after creation. Oftentimes having small files can be inefficient, so you want to make sure you keep your file sizes relatively large to make use of the distributed filesystem properly (the details depend on the tool, but 64 MB is a good rule of thumb).

4.3 How distributed filesystems work

It's tough to talk in the abstract about how any distributed filesystem works, so we'll ground our explanation with a specific tool: the Hadoop Distributed File System (HDFS). We feel the design of HDFS is sufficiently representative of how distributed filesystems work to demonstrate how such a tool can be used for the batch layer.

HDFS and Hadoop MapReduce are the two prongs of the Hadoop project: a Java framework for distributed storage and distributed processing of large amounts of data. Hadoop is deployed across multiple servers, typically called a cluster, and HDFS is a distributed and scalable filesystem that manages how data is stored across the cluster. Hadoop is a project of significant size and depth, so we'll only provide a high-level description.

In an HDFS cluster, there are two types of nodes: a single namenode and multiple datanodes. When you upload a file to HDFS, the file is first chunked into blocks of a fixed size, typically between 64 MB and 256 MB. Each block is then replicated across multiple datanodes (typically three) that are chosen at random. The namenode keeps track of the file-to-block mapping and where each block is located. Distributing a file in this way across many nodes allows it to be easily processed in parallel. When a program needs to access a file stored in HDFS, it contacts the namenode to determine which datanodes host the file contents.
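The chunk-and-replicate scheme can be modeled in a few lines. The following is a toy simulation only, not the HDFS API: the function name, the 128 MB block size, the datanode names, and the file path are all assumptions made for illustration, and it mirrors the simplified "chosen at random" description above rather than the placement policy of a real cluster.

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # typical replication factor

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE,
                 replication=REPLICATION, rng=random):
    """Return one entry per block: the datanodes that hold its replicas."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    # sample() picks distinct datanodes, so replicas never share a node.
    return [rng.sample(datanodes, replication) for _ in range(num_blocks)]

# Toy "namenode" metadata: file path -> placement of each of its blocks.
datanodes = [f"datanode-{i}" for i in range(10)]
namenode = {"/logins/part-0.txt": place_blocks(1_000_000_000, datanodes)}

# A client asks the namenode where the blocks live, then reads from datanodes.
for index, replicas in enumerate(namenode["/logins/part-0.txt"]):
    print(f"block {index}: {replicas}")
```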

Additionally, with each block replicated across multiple nodes, your data remains available even when individual nodes are offline. Of course, there are limits to this fault tolerance: if you have a replication factor of three, three nodes go down at once, and you're storing millions of blocks, chances are that some blocks happened to exist on exactly those three nodes and will be unavailable.
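A rough back-of-the-envelope calculation shows why losing data becomes likely at that scale. This sketch assumes replicas are placed uniformly at random and that exactly as many nodes fail as the replication factor; the cluster size of 100 nodes and the 10 million blocks are made-up figures for illustration.

```python
from math import comb

def expected_lost_blocks(num_nodes, num_blocks, replication=3):
    """Expected number of blocks stored entirely on the failed nodes."""
    # Under uniform random placement, a block sits on one specific set of
    # `replication` nodes with probability 1 / C(num_nodes, replication).
    return num_blocks / comb(num_nodes, replication)

def prob_some_block_lost(num_nodes, num_blocks, replication=3):
    """Probability that at least one block is unavailable."""
    p = 1 / comb(num_nodes, replication)
    return 1 - (1 - p) ** num_blocks

print(expected_lost_blocks(100, 10_000_000))
print(prob_some_block_lost(100, 10_000_000))
```

With these assumed numbers, dozens of blocks are expected to be unavailable, so the simultaneous failure of three nodes is almost certain to affect some data.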

4.4 Storing a master dataset with a distributed filesystem

Distributed filesystems vary in the kinds of operations they permit. Some distributed filesystems let you modify existing files, and others don't. Some allow you to append to existing files, and some don't have the feature.

Clearly, with unmodifiable files you can't store the entire master dataset in a single file. What you can do instead is spread the master dataset among many files, and store all those files in the same folder. Each file would contain many serialized data objects.

4.5 Vertical partitioning

Although the batch layer is built to run functions on the entire dataset, many computations don't require looking at all the data. For example, you may have a computation that only requires information collected during the past two weeks. The batch storage should allow you to partition your data so that a function only accesses data relevant to its computation. This process is called vertical partitioning, and it can greatly contribute to making the batch layer more efficient. While it's not strictly necessary for the batch layer, as the batch layer is capable of looking at all the data at once and filtering out what it doesn't need, vertical partitioning enables large performance gains, so it's important to know how to use the technique.

Vertically partitioning data on a distributed filesystem can be done by sorting your data into separate folders. For example, suppose you're storing login information on a distributed filesystem. Each login contains a username, IP address, and timestamp. To vertically partition by day, you can create a separate folder for each day of data. Each day folder would have many files containing the logins for that day.

Now if you only want to look at a particular subset of your dataset, you can just look at the files in those particular folders and ignore the other files.
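The login example can be sketched as follows. This is a minimal illustration that uses a local temporary folder in place of a distributed filesystem; the function names, the JSON-lines file format, the day-folder naming scheme, and the sample logins are all assumptions.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

def day_folder(root, timestamp):
    """Folder for the UTC day that a login's Unix timestamp falls in."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")
    return Path(root) / day

def write_logins(root, logins):
    """Group logins by day and write each group to a new file in its day folder."""
    groups = {}
    for login in logins:
        groups.setdefault(day_folder(root, login["timestamp"]), []).append(login)
    for folder, batch in groups.items():
        folder.mkdir(parents=True, exist_ok=True)
        path = folder / f"logins-{uuid.uuid4().hex}.jsonl"
        path.write_text("".join(json.dumps(login) + "\n" for login in batch))

def read_day(root, day):
    """Read only the files in one day folder; other partitions are never opened."""
    return [json.loads(line)
            for path in sorted((Path(root) / day).glob("*.jsonl"))
            for line in path.read_text().splitlines()]

root = tempfile.mkdtemp()  # local folder standing in for a distributed filesystem
write_logins(root, [
    {"username": "alex", "ip": "192.168.12.125", "timestamp": 1351123200},
    {"username": "bob", "ip": "32.22.245.150", "timestamp": 1351126800},
    {"username": "charlie", "ip": "112.33.46.13", "timestamp": 1351209600},
])
print(sorted(p.name for p in Path(root).iterdir()))
print([login["username"] for login in read_day(root, "2012-10-25")])
```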

4.6 Low-level nature of distributed filesystems

While distributed filesystems provide the storage and fault-tolerance properties you need for storing a master dataset, you'll find using their APIs directly too low-level for the tasks you need to run. We'll illustrate this using regular Unix filesystem operations and show the difficulties you can get into when doing tasks like appending to a master dataset or vertically partitioning a master dataset.

Let's start with appending to a master dataset. Suppose your master dataset is in the folder /master and you have a folder of data in /new-data that you want to put inside your master dataset. The most obvious approach is to mv each file in /new-data into /master. Unfortunately, this approach has serious problems. If the master dataset folder contains any files of the same name, then the mv operation will fail. To do it correctly, you have to rename each file to a random filename to avoid conflicts.
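One way to handle the naming conflict is sketched below, using local temporary folders in place of /new-data and /master. The helper name append_to_master and the UUID-based naming scheme are assumptions for illustration; the text only requires that the new names be random enough to avoid collisions.

```python
import shutil
import tempfile
import uuid
from pathlib import Path

def append_to_master(new_data, master):
    """Move every file from new_data into master under a fresh random name."""
    master = Path(master)
    master.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(new_data).iterdir()):
        if src.is_file():
            # A UUID-based name avoids clashing with files already in master.
            dest = master / f"{uuid.uuid4().hex}{src.suffix}"
            shutil.move(str(src), str(dest))

# Demo: both folders contain a file with the same name.
new_data = Path(tempfile.mkdtemp())
master = Path(tempfile.mkdtemp())
(new_data / "part-0.txt").write_text("new records\n")
(master / "part-0.txt").write_text("old records\n")

append_to_master(new_data, master)
print(sorted(p.name for p in master.iterdir()))
```

Even this small helper already carries logic that a plain move does not, which hints at how much bookkeeping the raw filesystem API leaves to you.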

There's another problem. One of the core requirements of storage for the master dataset is the ability to tune the trade-offs between storage costs and processing costs. When storing a master dataset on a distributed filesystem, you choose a file format and compression format that makes the trade-off you desire. What if the files in /new-data are of a different format than in /master? Then the mv operation won't work at all – you instead need to copy the records out of /new-data and into a brand new file with the file format used in /master.

Let's now take a look at doing the same operation but with a vertically partitioned master dataset.

Just putting the files from /new-data into the root of /master is wrong because it wouldn't respect the vertical partitioning of /master. Either the append operation should be disallowed – because /new-data isn't correctly vertically partitioned – or /new-data should be vertically partitioned as part of the append operation. But when you're just using a files-and-folders API directly, it's very easy to make a mistake and break the vertical partitioning constraints of a dataset.
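The first of those two options, disallowing an append whose layout does not match, might look like the sketch below. It assumes a partitioning scheme of one folder per day named YYYY-MM-DD; that scheme, the helper name append_partitioned, and the sample file are hypothetical, and local temporary folders stand in for /new-data and /master.

```python
import re
import shutil
import tempfile
import uuid
from pathlib import Path

DAY_FOLDER = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed partition naming scheme

def append_partitioned(new_data, master):
    """Append new_data to master only if it follows the same day-folder layout."""
    new_data, master = Path(new_data), Path(master)
    entries = list(new_data.iterdir())
    # Reject the append before moving anything if the layout does not match.
    bad = [e.name for e in entries
           if not (e.is_dir() and DAY_FOLDER.fullmatch(e.name))]
    if bad:
        raise ValueError(f"not partitioned by day: {sorted(bad)}")
    for day_dir in entries:
        target = master / day_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for src in day_dir.iterdir():
            if src.is_file():
                # Random names avoid clashes within the target partition.
                dest = target / f"{uuid.uuid4().hex}{src.suffix}"
                shutil.move(str(src), str(dest))

new_data = Path(tempfile.mkdtemp())
master = Path(tempfile.mkdtemp())
(new_data / "2012-10-25").mkdir()
(new_data / "2012-10-25" / "logins.txt").write_text("alex 192.168.12.125\n")

append_partitioned(new_data, master)
print(sorted(str(p.relative_to(master)) for p in master.rglob("*")))
```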

All the operations and checks that need to happen to get these operations working correctly strongly indicate that files and folders are too low-level of an abstraction for manipulating datasets.

4.7 Storing the SuperWebAnalytics.com master dataset on a distributed filesystem

Let's now look at how you can make use of a distributed filesystem to store the master dataset for SuperWebAnalytics.com.

When you last left this project, you had created a graph schema to represent the dataset. Every edge and property is represented via its own independent DataUnit.

A key observation is that a graph schema provides a natural vertical partitioning of the data. You can store all edge and property types in their own folders. Vertically partitioning the data this way lets you efficiently run computations that only look at certain properties and edges.
