Today, Yelp held a tech talk in Columbia University about the data warehouse adopted by Yelp.

Yelp used Amazon Redshift as data warehouse.

There are several features for Redshift:

1. Massively Parellel Processing

2. SQL access

3. Column-based Datastore

Benefits are:

1. Data is structured, accessible and well documented.
2. Architecture allows for easy extensibility and sharing across teams.
3. Allows use of entire SQL-compatible tool ecosystem.

Details:

Massively Parellel Processing (MMP)

Traditional BigData always uses Hadoop + MapReduce. MapReduce's native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL(Structural Query Language). You can refer detail here.

Below is the structure for implementing MMP.

Similarly, Data is distributed across each segment database to achieve data and processing parallelism. This is achieved by creating a database table with DISTRIBUTED BY clause. By using this clause data is automatically distributed across segment databases. (referrence: Introduction to MMP)

Typical query sentence in MMP

Column-based Datastore

Enables sparse table definitions
Enables compact storage
Improve scanning/filtering

(Benefits: wiki)

Column-based Datastore

  1. Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
  2. Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
  3. Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
  4. Row-oriented organizations are more efficient when writing a new row if all of the row data is supplied at the same time, as the entire row can be written with a single disk seek.

In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).

Amazon Redshift and Massively Parellel Processing的更多相关文章

  1. Amazon Redshift数据库

    Amazon Redshift介绍 Amazon Redshift是一种可轻松扩展的完全托管型PB级数据仓库,它通过使用列存储技术和并行化多个节点的查询来提供快速的查询性能,使您能够更高效的分析现有数 ...

  2. Power BI连接至Amazon Redshift

    一直在使用Power BI连接至MongoDB中,但效果一直不是太理想,今天使用另一种方法,将MongoDB中的数据通过Azure Data Factory转入Amazon Redshift中,而在P ...

  3. amazon redshift 分析型数据库特点——本质还是列存储

    Amazon Redshift 是一种快速且完全托管的 PB 级数据仓库,使您可以使用现有的商业智能工具经济高效地轻松分析您的所有数据.从最低 0.25 USD 每小时 (不承担任何义务) 直到每年每 ...

  4. Amazon Redshift数据迁移到MaxCompute

    Amazon Redshift数据迁移到MaxCompute Amazon Redshift 中的数据迁移到MaxCompute中经常需要先卸载到S3中,再到阿里云对象存储OSS中,大数据计算服务Ma ...

  5. POWER BI 基于 ODBC 数据源的配置刷新-以Amazon Redshift为例

    POWER BI 基于 ODBC 数据源的配置刷新-以Amazon Redshift为例 Powerbi 有多种数据源连接,可以使用它们连接到不同数据源. 如果在 Power BI Desktop 的 ...

  6. Amazon Redshift and the Case for Simpler Data Warehouses

    Redshift是Amazon一个商业产品上的进化 但并不是技术的进化,他使用的无非都是传统数仓领域的技术 如果说创新,就是大量使用Amazon本身的云服务的云原生架构,大大提升的产品的迭代速度,可维 ...

  7. Python 如何连接并操作 Aws 上 PB 级云数据仓库 Redshift

    Python 如何连接并操作 Aws 上 PB 级云数据仓库 Redshift 一.简介 Amazon Redshift 是一个快速.可扩展的数据仓库,可以简单.经济高效地分析数据仓库和数据湖中的所有 ...

  8. Qwiklab'实验-DynamoDB, Redshift, Elasticsearch'

    title: AWS之Qwiklab subtitle: 4. Qwiklab'实验-Amazon DynamoDB, Amazon Redshift, Elasticsearch Service' ...

  9. Massively parallel supercomputer

    A novel massively parallel supercomputer of hundreds of teraOPS-scale includes node architectures ba ...

随机推荐

  1. hdu5772-String problem(最大权闭合子图问题)

    解析: 多校标答 第一类:Pij 表示第i个点和第j个点组合的点,那么Pij的权值等于w[i][j]+w[j][i](表示得分)第二类:原串中的n个点每个点拆出一个点,第i个点权值为 –a[s[i]] ...

  2. REVERSE关键字之REVERSE索引

    昨天说到REVERSE关键字可以指REVERSE函数和REVERSE索引,简单介绍了下REVERSE函数的含义,今天简单整理下REVERSE索引. REVERSE索引也是一种B树索引,但它物理上将按照 ...

  3. IO之流程与buffer概览

    为了说明这个流程,还是用图来描述一下比较直观. 中间过程请参考 <IO之内核buffer----"buffer cache"> <IO之标准C库buffer> ...

  4. PHP批量审核ajax jquery

    var jQuery = $.noConflict(); // alert(jQuery); jQuery(document).ready(function() { /*批量审核*/ jQuery(' ...

  5. iOS面试知识点

    1 iOS基础 1.1 父类实现深拷贝时,子类如何实现深度拷贝.父类没有实现深拷贝时,子类如何实现深度拷贝. 深拷贝同浅拷贝的区别:浅拷贝是指针拷贝,对一个对象进行浅拷贝,相当于对指向对象的指针进行复 ...

  6. XCode中在提示窗体中对已弃用的API接口画上红线

    当我们在XCode中写程序时会不断的出现相关API提示窗体,那敲起来是一个爽啊. 有时候会看到一些API已经弃用了被画上红色的横线.说明该接口已经被弃用,仍保留,但不建议使用,对弃用API实现画横线事 ...

  7. 2013年全球ERP市场格局(Gartner)

    Gartner于5月5日公布了全球ERP市场的分析报告,报告称全球ERP软件销售额2013年整体增长了3.8%(从2012年$244亿美元到2013年$258亿美元),全球前五位ERP厂商座次例如以下 ...

  8. sql语句中BEGIN TRAN...COMMIT TRAN

    BEGIN TRAN标记事务開始  COMMIT TRAN 提交事务  一般把DML语句(select ,delete,update,insert语句)放在BEGIN TRAN...COMMIT TR ...

  9. flashback database操作步骤

    默认情况数据库的flashback database是关闭的. 启用Flashback Database 步骤:1.配置Flash Recovery Area 检查是否启动了flash recover ...

  10. NET基础课--泛型(NET之美)

    1.泛型,类型或方法的一种抽象概括. 2.泛型类:在类型名后面加一个<>,其中传递占位符,也就是类型参数.where是类型约束 可以再查资料 public class SortHelper ...