Amazon Redshift and Massively Parellel Processing

Today, Yelp held a tech talk in Columbia University about the data warehouse adopted by Yelp.

Yelp used Amazon Redshift as data warehouse.

There are several features for Redshift:

1. Massively Parellel Processing

2. SQL access

3. Column-based Datastore

Benefits are:

1. Data is structured, accessible and well documented.
2. Architecture allows for easy extensibility and sharing across teams.
3. Allows use of entire SQL-compatible tool ecosystem.

Details:

Massively Parellel Processing (MMP)

Traditional BigData always uses Hadoop + MapReduce. MapReduce's native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL(Structural Query Language). You can refer detail here.

Below is the structure for implementing MMP.

Similarly, Data is distributed across each segment database to achieve data and processing parallelism. This is achieved by creating a database table with DISTRIBUTED BY clause. By using this clause data is automatically distributed across segment databases. (referrence: Introduction to MMP)

Typical query sentence in MMP

Column-based Datastore

Enables sparse table definitions
Enables compact storage
Improve scanning/filtering

(Benefits: wiki)

Column-based Datastore

Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.

In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).

Amazon Redshift and Massively Parellel Processing的更多相关文章

Amazon Redshift数据库
Amazon Redshift介绍 Amazon Redshift是一种可轻松扩展的完全托管型PB级数据仓库,它通过使用列存储技术和并行化多个节点的查询来提供快速的查询性能,使您能够更高效的分析现有数 ...
Power BI连接至Amazon Redshift
一直在使用Power BI连接至MongoDB中,但效果一直不是太理想,今天使用另一种方法,将MongoDB中的数据通过Azure Data Factory转入Amazon Redshift中,而在P ...
amazon redshift 分析型数据库特点——本质还是列存储
Amazon Redshift 是一种快速且完全托管的 PB 级数据仓库,使您可以使用现有的商业智能工具经济高效地轻松分析您的所有数据.从最低 0.25 USD 每小时 (不承担任何义务) 直到每年每 ...
Amazon Redshift数据迁移到MaxCompute
Amazon Redshift数据迁移到MaxCompute Amazon Redshift 中的数据迁移到MaxCompute中经常需要先卸载到S3中,再到阿里云对象存储OSS中,大数据计算服务Ma ...
POWER BI 基于 ODBC 数据源的配置刷新-以Amazon Redshift为例
POWER BI 基于 ODBC 数据源的配置刷新-以Amazon Redshift为例 Powerbi 有多种数据源连接,可以使用它们连接到不同数据源. 如果在 Power BI Desktop 的 ...
Amazon Redshift and the Case for Simpler Data Warehouses
Redshift是Amazon一个商业产品上的进化但并不是技术的进化,他使用的无非都是传统数仓领域的技术如果说创新,就是大量使用Amazon本身的云服务的云原生架构,大大提升的产品的迭代速度,可维 ...
Python 如何连接并操作 Aws 上 PB 级云数据仓库 Redshift
Python 如何连接并操作 Aws 上 PB 级云数据仓库 Redshift 一.简介 Amazon Redshift 是一个快速.可扩展的数据仓库,可以简单.经济高效地分析数据仓库和数据湖中的所有 ...
Qwiklab'实验-DynamoDB, Redshift, Elasticsearch'
title: AWS之Qwiklab subtitle: 4. Qwiklab'实验-Amazon DynamoDB, Amazon Redshift, Elasticsearch Service' ...
Massively parallel supercomputer
A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures ba ...

随机推荐

Quartus DSE 初步应用
介绍 Design Space Explorer (DSE) is a program that automates the process of finding the optimal collec ...
Textarea - 百度富文本编辑器插件UEditor
UEditor各种实例演示 Ueditor 是百度推出的一款开源在线 HTML 编辑器. 主要特点: 轻量级:代码精简,加载迅速. 定制化:全新的分层理念,满足多元化的需求.采用三层架构:1. 核心层 ...
MyCat部署运行（Windows环境）与使用步骤详解
目录(?)[+] 1.MyCat概念 1.1 总体架构 MyCAT的架构如下图所示: MyCAT使用MySQL的通讯协议模拟成一个MySQL服务器,并建立了完整的Schema(数据库).Tab ...
再谈cacheAsBitmap
cacheAsBitmap这个属性很多人都知道,但少有人明白它到底是如何生效的.虽然看名字是转换为位图处理,但用起来的时候感觉却也不过如此.所以,不少人最终选择自己转换Bitmap. 当然,自己转Bi ...
Android应用程序资源管理器（Asset Manager）的创建过程分析
文章转载至CSDN社区罗升阳的安卓之旅,原文地址:http://blog.csdn.net/luoshengyang/article/details/8791064 在前面一篇文章中,我们分析了And ...
ibatisnet框架使用说明
ibatis配置文件主要包括三个 sqlmap.config,providers.config,database.config,注意所有文件生成操作都为嵌入的资源.其中database.config主 ...
IOS 实现QQ好友分组展开关闭功能
贴出核心代码主要讲一下思路. - (void)nameBtnClick:(myButton *)sender { //获取当前点击的分组对应的section self.clickIndex = s ...
OD调试3--reverseMe
OD调试3:reverseMe.exe(reverse就是逆向的意思) 运行效果图: 1关于寄存器寄存器就好比是CPU身上的口袋,方便CPU随时从里边拿出需要的东西来使用.今天的程序中涉及到九个寄存 ...
jsp生成html
这几天公司需要生成静态的HTML页面以减小数据库与服务器的压力和负担,于是在网络上一阵狂搜,找到几篇相当不错的文章和一些相当有用的资料.为了方便,我整理在自己的BLOG,以供参考! 在接下来的应用中, ...
Spark IDEA开发环境构建
本文档基于IEDA构建spark maven应用. date: 2016/8/1 author: wangxl 1.下载IDEA https://www.jetbrains.com/idea/ 2.安 ...

Amazon Redshift and Massively Parellel Processing

Amazon Redshift and Massively Parellel Processing的更多相关文章

随机推荐

热门专题