Search is the act of locating information you care about: for example, searching for pages in a textbook that contain the topic you want to read about, or for web pages that have the information you’re looking for. Searching for documents containing particular terms requires looking up indexes that map terms to the documents that contain them. To enable search, you have to build these indexes. This is precisely what Google and other search engines do. Their document corpus is the entire internet; the search terms are whatever you type in the search box.
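To make the idea of an index that maps terms to documents concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not how Google or any real search engine builds its indexes. The names `build_index` and `search` and the three-page corpus are assumptions invented for this example.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical corpus: document id -> document text.
corpus = {
    "page1": "HBase stores rows in sorted order",
    "page2": "Bigtable inspired HBase",
    "page3": "MapReduce builds the search index",
}

index = build_index(corpus)
print(sorted(search(index, "hbase")))  # ['page1', 'page2']
```

The index is built once, up front. Answering a query is then a few dictionary lookups and set intersections rather than a scan of every document.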

Bigtable, and by extension HBase, provides storage for this corpus of documents. Bigtable supports row-level access so crawlers can insert and update documents individually. The search index can be generated efficiently via MapReduce directly against Bigtable. Individual document results can be retrieved directly. Support for all these access patterns was key in influencing the design of Bigtable. Figure 1.1 illustrates the critical role of Bigtable in the web-search application.

NOTE: In the interest of brevity, this look at Bigtable doesn’t do the original authors justice. We highly recommend the three papers on Google File System, MapReduce, and Bigtable as required reading for anyone curious about these technologies. You won’t be disappointed.

Figure 1.1 Providing web-search results using Bigtable, simplified. The crawlers (applications that collect web pages) store their data in Bigtable. A MapReduce process scans the table to produce the search index. Search results are queried from Bigtable to display to the user.

1. Crawlers constantly scour the internet for new pages. Those pages are stored as individual records in Bigtable.

2. A MapReduce job runs over the entire table, generating search indexes for the web search application.

3. The user initiates a web search request.

4. The web search application queries the search indexes and retrieves matching documents directly from Bigtable.

5. Search results are presented to the user.
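The five steps above can be sketched in a few lines of Python. This is a simplified, single-process illustration under stated assumptions. A plain dictionary named `table` stands in for the Bigtable table, and ordinary functions stand in for the MapReduce job. None of the names below (`crawl_store`, `map_phase`, `reduce_phase`, `build_search_index`, `web_search`) come from Bigtable, HBase, or Hadoop, and the URLs are made up.

```python
from collections import defaultdict

# Stand-in for a Bigtable/HBase table: row key (URL) -> row (column -> value).
table = {}


def crawl_store(url, content):
    """Step 1: a crawler inserts or updates a single row, keyed by URL."""
    table[url] = {"content": content}


def map_phase(url, row):
    """Map: emit one (term, url) pair per distinct term in the row."""
    for term in set(row["content"].lower().split()):
        yield term, url


def reduce_phase(pairs):
    """Reduce: group URLs by term to form the search index."""
    grouped = defaultdict(list)
    for term, url in pairs:
        grouped[term].append(url)
    return {term: sorted(urls) for term, urls in grouped.items()}


def build_search_index():
    """Step 2: run the map and reduce phases over the entire table."""
    pairs = (pair for url, row in table.items() for pair in map_phase(url, row))
    return reduce_phase(pairs)


def web_search(index, term):
    """Steps 3-5: query the index, then fetch each matching row by its key."""
    return [(url, table[url]["content"]) for url in index.get(term.lower(), [])]


crawl_store("example.com/a", "HBase is modeled on Bigtable")
crawl_store("example.com/b", "Crawlers store pages in Bigtable")
# Re-crawling a page updates its row in place rather than adding a new one.
crawl_store("example.com/a", "HBase is modeled on Bigtable by Google")

index = build_search_index()
for url, content in web_search(index, "bigtable"):
    print(url, "->", content)
```

The sketch exercises the three access patterns the text describes: row-level writes by the crawler, a full-table scan to build the index, and direct retrieval of individual rows at query time.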

With the canonical HBase example covered, let’s look at other places where HBase has found purchase. The adoption of HBase has grown rapidly over the last couple of years. This has been fueled by the system becoming more reliable and performant, due in large part to the engineering effort invested by the various companies backing and using it. As more commercial vendors provide support, users are increasingly confident in using the system for critical applications. A technology designed to store a continuously updated copy of the internet turns out to be pretty good at other internet-related things. HBase has found a home filling a variety of roles in and around social-networking companies. From storing communications between individuals to communication analytics, HBase has become critical infrastructure at Facebook, Twitter, and StumbleUpon, to name a few.

HBase has been used in three major types of use cases, but it’s not limited to those. In the interest of keeping this chapter short and sweet, we’ll cover the major use cases here.
