Understanding HBase and BigTable
Hbase is a distributed data storage systems.
A Bigtable is spare , distributed , persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp. each value in the map is an uninterrupted arry of bytes.
HBase uses a data model very similar to that of Bigtable. Users store data rows in labelled tables. A data row has a sortable key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have crazily-varying columns, if the user likes.
map
At its core, HBase/BigTable is a map. From the wikipedia article, a map is "an abstract data type composed of a collection of keys and a collection of values, where each key is associated with one value."
Using JavaScript Object Notation, here's an example of a simple map where all the values are just strings:
{
"zzzzz" : "woot",
"xyz" : "hello",
"aaaab" : "world",
"1" : "x",
"aaaaa" : "y"
}
persistent
Persistence merely means that the data you put in this special map "persists" after the program that created or accessed it is finished.
distributed
HBase and BigTable are built upon distributed filesystems so that the underlying file storage can be spread out among an array of independent machines.
HBase sits atop either HDFS or Amazon's S3, while a BigTable makes use of the GFS.
Data is replicated across a number of participating nodes in an analogous manner to how data is striped across discs in a RAID system.
sorted
Unlike most map implementations, in HBase/BigTable the key/value pairs are kept in strict alphabetical order. That is to say that the row for the key "aaaaa" should be right next to the row with key "aaaab" and very far from the row with key "zzzzz".
Continuing our JSON example, the sorted version looks like this:
{
"1" : "x",
"aaaaa" : "y",
"aaaab" : "world",
"xyz" : "hello",
"zzzzz" : "woot"
}
Because these systems tend to be so huge and distributed, this sorting feature is actually very important. The spacial propinquity of rows with like keys ensures that when you must scan the table, the items of greatest interest to you are near each other.
This is important when choosing a row key convention. For example, consider a table whose keys are domain names. It makes the most sense to list them in reverse notation (so "com.jimbojw.www" rather than "www.jimbojw.com") so that rows about a subdomain will be near the parent domain row.
Continuing the domain example, the row for the domain "mail.jimbojw.com" would be right next to the row for "www.jimbojw.com" rather than say "mail.xyz.com" which would happen if the keys were regular domain notation.
It's important to note that the term "sorted" when applied to HBase/BigTable does not mean that "values" are sorted. There is no automatic indexing of anything other than the keys, just as it would be in a plain-old map implementation.
multidimensional
It is easier to think about this like a multidimensional map - a map of maps if you will. Adding one dimension to our running JSON example gives us this:
{
"1" : {
"A" : "x",
"B" : "z"
},
"aaaaa" : {
"A" : "y",
"B" : "w"
},
"aaaab" : {
"A" : "world",
"B" : "ocean"
},
"xyz" : {
"A" : "hello",
"B" : "there"
},
"zzzzz" : {
"A" : "woot",
"B" : "1337"
}
}
In the above example, you'll notice now that each key points to a map with exactly two keys: "A" and "B". From here forward, we'll refer to the top-level key/map pair as a "row". Also, in BigTable/HBase nomenclature, the "A" and "B" mappings would be called "Column Families".
A table's column families are specified when the table is created, and are difficult or impossible to modify later. It can also be expensive to add new column families, so it's a good idea to specify all the ones you'll need up front.
Fortunately, a column family may have any number of columns, denoted by a column "qualifier" or "label". Here's a subset of our JSON example again, this time with the column qualifier dimension built in:
{
// ...
"aaaaa" : {
"A" : {
"foo" : "y",
"bar" : "d"
},
"B" : {
"" : "w"
}
},
"aaaab" : {
"A" : {
"foo" : "world",
"bar" : "domination"
},
"B" : {
"" : "ocean"
}
},
// ...
}
Notice that in the two rows shown, the "A" column family has two columns: "foo" and "bar", and the "B" column family has just one column whose qualifier is the empty string ("").
When asking HBase/BigTable for data, you must provide the full column name in the form "<family>:<qualifier>". So for example, both rows in the above example have three columns: "A:foo", "A:bar" and "B:".
Note that although the column families are static, the columns themselves are not. Consider this expanded row:
{
// ...
"zzzzz" : {
"A" : {
"catch_phrase" : "woot",
}
}
}
In this case, the "zzzzz" row has exactly one column, "A:catch_phrase". Because each row may have any number of different columns, there's no built-in way to query for a list of all columns in all rows. To get that information, you'd have to do a full table scan. You can however query for a list of all column families since these are immutable (more-or-less).
The final dimension represented in HBase/BigTable is time. All data is versioned either using an integer timestamp (seconds since the epoch), or another integer of your choice. The client may specify the timestamp when inserting data.
Consider this updated example utilizing arbitrary integral timestamps:
{
// ...
"aaaaa" : {
"A" : {
"foo" : {
15 : "y",
4 : "m"
},
"bar" : {
15 : "d",
}
},
"B" : {
"" : {
6 : "w"
3 : "o"
1 : "w"
}
}
},
// ...
}
Each column family may have its own rules regarding how many versions of a given cell to keep (a cell is identified by its rowkey/column pair) In most cases, applications will simply ask for a given cell's data, without specifying a timestamp. In that common case, HBase/BigTable will return the most recent version since it stores these in reverse chronological order.
If an application asks for a given row at a given timestamp, HBase will return cell data where the timestamp is less than or equal to the one provided.
Using our imaginary HBase table, querying for the row/column of "aaaaa"/"A:foo" will return "y" while querying for the row/column/timestamp of "aaaaa"/"A:foo"/10 will return "m". Querying for a row/column/timestamp of "aaaaa"/"A:foo"/2 will return a null result.
sparse
The last keyword is sparse. As already mentioned, a given row can have any number of columns in each column family, or none at all. The other type of sparseness is row-based gaps, which merely means that there may be gaps between keys.
Understanding HBase and BigTable的更多相关文章
- HBase vs. BigTable Comparison - HBase对比BigTable
HBase vs. BigTable Comparison HBase is an open-source implementation of the Google BigTable architec ...
- 理解Hbase和BigTable(转)
add by zhj: 这篇文章写的通俗易懂,介绍了HBase最重要的几点特性. 英文原文:https://dzone.com/articles/understanding-hbase-and-big ...
- HBase 数据模型(Data Model)
HBase Data Model--HBase 数据模型(翻译) 在HBase中,数据是存储在有行有列的表格中.这是与关系型数据库重复的术语,并不是有用的类比.相反,HBase可以被认为是一个多维度的 ...
- (zz) 谷歌技术"三宝"之BigTable
006年的OSDI有两篇google的论文,分别是BigTable和Chubby.Chubby是一个分布式锁服务,基于Paxos算法:BigTable是一个用于管理结构化数据的分布式存储系统,构建在G ...
- [转载] 谷歌技术"三宝"之BigTable
转载自http://blog.csdn.net/opennaive/article/details/7532589 2006年的OSDI有两篇google的论文,分别是BigTable和Chubby. ...
- HBase 数据模型
在HBase中,数据是存储在有行有列的表格中.这是与关系型数据库重复的术语,并不是有用的类比.相反,HBase可以被认为是一个多维度的映射. HBase数据模型术语 Table(表格) 一个HBase ...
- Hbase学习03
第3章 Hbase数据存储模型与工作组件 Data格式设计的的总体原则是按照需求要求,依据Hbase性能的相关标准规范和文件,并遵循“统一规范.统一数据模型.统一规划集群.分步实施”的原则,注重实际应 ...
- 谷歌技术"三宝"之BigTable
转自:https://blog.csdn.net/OpenNaive/article/details/7532589 2006年的OSDI有两篇google的论文,分别是BigTable和Chubby ...
- HBase 架构与工作原理1 - HBase 的数据模型
本文系转载,如有侵权,请联系我:likui0913@gmail.com 一.应用场景 HBase 与 Google 的 BigTable 极为相似,可以说 HBase 就是根据 BigTable 设计 ...
随机推荐
- ZOJ Monthly, January 2018
A 易知最优的方法是一次只拿一颗,石头数谁多谁赢,一样多后手赢 #include <map> #include <set> #include <ctime> #in ...
- phpstorm快捷键大全
前言:这段时间换了编辑器,所以挺多命令也改变了 转载来自:https://www.jianshu.com/p/ffb24d61000d?utm_campaign=maleskine&utm_c ...
- dubbo核心要点及下载(dubbo二)
一.dubbo核心要点 1):服务是围绕服务提供方和服务消费方的,服务提供方实现服务,服务消费方调用服务. 2):服务注册 对于服务提供方它需要发布服务,而由于应用系统的复杂性,服务的数量.类型不断的 ...
- NOI-OJ 1.13 ID:23 区间内的真素数
整体思路 这里需要大量使用素数,必须能够想到只求出M到N之间的素数是不够的,因为M到N之间数字的反序有可能是大于M或小于N的数字,例如M=2,N=20,那么19的反序91大于20,所以使用埃拉拖色尼算 ...
- 第一节:框架前期准备篇之Log4Net日志详解
一. Log4Net简介 Log4net是从Java中的Log4j迁移过来的一个.Net版的开源日志框架,它的功能很强大,可以将日志分为不同的等级,以不同的格式输出到不同的存储介质中,比如:数据库.t ...
- 使用 MERGE 语句实现增删改
Ø 简介 在平常编写增删改的 SQL 语句时,我们用的最多的就是 INSERT.UPDATE 和 DELETE 语句,这是最基本的增删改语句.其实,SQL Server 中还有另外一个可以实现增删改 ...
- luogu P5293 [HNOI2019]白兔之舞
传送门 关于这题答案,因为在所有行,往后跳到任意一行的\(w_{i,j}\)都是一样的,所以可以算出跳\(x\)步的答案然后乘上\(\binom{l}{x}\),也就是枚举跳到了哪些行 如果记跳x步的 ...
- vue学习之组件
组件从注册方式分为全局组件和局部组件. 从功能类型又可以分为偏视图表现的(presentational)和偏逻辑的(动态生成dom),推荐在前者中使用模板,在后者中使用 JSX 或渲染函数动态生成组件 ...
- JAVA进阶5
间歇性混吃等死,持续性踌躇满志系列-------------第5天 1.IDEA常用快捷键 2.简单方法的使用 package cn.intcast.day05.demo01; public clas ...
- django中的反向解析
1,定义: 随着功能的增加会出现更多的视图,可能之前配置的正则表达式不够准确,于是就要修改正则表达式,但是正则表达式一旦修改了,之前所有对应的超链接都要修改,真是一件麻烦的事情,而且可能还会漏掉一些超 ...