An interesting trend has been developing in the IT landscape over the past few years.  Many new technologies emerge and immediately latch onto the “Big Data” buzzword.  And as older technologies add “Big Data” features in an attempt to keep up
with the Joneses, we are seeing a blurring of the boundaries between various technologies.  Say you have a search engine such as ElasticSearch or Solr storing JSON documents, MongoDB storing JSON documents, or a pile of JSON documents stored in HDFS on a Hadoop
cluster. Interestingly enough, you can fulfill many of the same use cases with any of these three configurations.

ElasticSearch as a NoSQL database?

Superficially, that doesn’t sound right, but nonetheless it is a valid scenario.  Likewise, MongoDB with support for MapReduce over sharded collections can accomplish many of the same things as Hadoop.  And, of
course, with the many tools that you can layer on top of a Hadoop base (Hive, HBase, Pig, and the like) you can query data from your Hadoop cluster in a multitude of ways.

Given that, can we now say that Hadoop, MongoDB and ElasticSearch are all exactly equivalent?  Of course not.  Each tool still has a niche for which it is best suited, but each has enough flexibility to fulfill multiple roles.  The question
now becomes “What is the ideal use for each of these technologies?”, and that, my friends, is what we will explore now.

ElasticSearch has begun to spread beyond its roots as a “pure” search engine and now adds some features for analytics and visualization, but at its core it remains primarily a full-text search engine for locating documents by keyword.  ElasticSearch
builds on top of Lucene and supports extremely fast lookup and a rich query syntax.  If you have millions (or more) of text documents and you need to locate documents by keywords located in that text, ElasticSearch fits the bill perfectly.  Yes, if your documents
are JSON you can treat ElasticSearch as a sort of lightweight “NoSQL database”.  But ElasticSearch is not quite a “database engine” and provides less support for complex calculations and aggregation as part of a query - although the “statistics” facet does
provide some ability to retrieve calculated statistical information scoped to the given query.   Facets in ElasticSearch are intended mainly to support a “faceted navigation” facility.

If you are looking to return a (usually small) collection of documents in response to a keyword query, and want the ability to support faceted navigation around those documents, then ElasticSearch is probably your best, first choice.  If you need
to perform more complex calculations, run server-side scripts against your data, and easily run MapReduce jobs on your data, then MongoDB or Hadoop enter the picture.
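
To make "keyword lookup plus faceted navigation" concrete, here is a minimal sketch in plain Python. It is not ElasticSearch's API or implementation: the sample documents, field names and the `build_inverted_index` and `search` helpers are all hypothetical, and the tokenizer is a bare whitespace split. It only illustrates the idea of an inverted index that maps terms to documents, with per-category counts computed over the hits.

```python
from collections import Counter, defaultdict

# Hypothetical sample documents; the field names are illustrative only.
DOCS = [
    {"id": 1, "title": "Scaling search with Lucene", "category": "search"},
    {"id": 2, "title": "Lucene internals explained", "category": "search"},
    {"id": 3, "title": "Sharding a document database", "category": "database"},
]


def build_inverted_index(docs, field):
    """Map each lowercase term in `field` to the ids of documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc[field].lower().split():
            index[term].add(doc["id"])
    return index


def search(index, docs, keyword, facet_field):
    """Return the documents matching `keyword` and facet counts over those hits."""
    hit_ids = index.get(keyword.lower(), set())
    hits = [doc for doc in docs if doc["id"] in hit_ids]
    facets = Counter(doc[facet_field] for doc in hits)
    return hits, dict(facets)


index = build_inverted_index(DOCS, "title")
hits, facets = search(index, DOCS, "lucene", "category")
print([doc["id"] for doc in hits])  # ids of the matching documents
print(facets)                       # how many hits fall into each category
```

The facet counts are what a user interface would show next to the results so that a reader can narrow the hit list by category.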


MongoDB is a “NoSQL” database which is designed from the ground up to be highly scalable, with automatic sharding support and a number of additional performance optimizations.  MongoDB is a document-oriented database which stores documents in a “JSON-like”
format (technically BSON) with some extensions beyond plain JSON - for example, a native date type.  MongoDB provides a text index type for supporting full-text search against fields which contain text, so we can see that there is overlap between what
you can do with ElasticSearch and MongoDB, in terms of basic keyword search against a collection of documents.

Where MongoDB goes beyond ElasticSearch is its support for features like server-side scripts in JavaScript, aggregation pipelines, MapReduce support and capped collections.   With MongoDB, you can use aggregation pipelines to process documents in
a collection, streaming them through a sequence of pipeline operators where each operator transforms the documents passing through it.  Pipeline operators can generate entirely new documents or remove documents from the final output.   This is a very powerful facility for filtering,
processing and transforming data as it is retrieved.   MongoDB also supports running map/reduce jobs over the data in a collection, using custom JavaScript functions for the map and reduce phases of the operation.  This allows for great flexibility in performing
almost any type of calculation or transformation on the selected data.
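
The pipeline idea can be sketched in plain Python. This is an imitation of the concept, not MongoDB's API: the `match`, `project`, `group_sum` and `run_pipeline` helpers and the sample order documents are hypothetical, loosely modeled on what MongoDB's `$match`, `$project` and `$group` operators do. Each stage consumes a stream of documents and yields a new stream, so a stage can drop documents or emit entirely new ones.

```python
from collections import defaultdict

# Hypothetical order documents; field names are illustrative only.
ORDERS = [
    {"customer": "alice", "status": "shipped", "total": 30},
    {"customer": "bob", "status": "pending", "total": 15},
    {"customer": "alice", "status": "shipped", "total": 20},
    {"customer": "bob", "status": "shipped", "total": 5},
]


def match(docs, **criteria):
    """Filter stage: pass through only documents whose fields equal the criteria."""
    for doc in docs:
        if all(doc.get(key) == value for key, value in criteria.items()):
            yield doc


def project(docs, *fields):
    """Transform stage: emit a new document containing only the named fields."""
    for doc in docs:
        yield {field: doc[field] for field in fields}


def group_sum(docs, key, value):
    """Grouping stage: emit one new document per key with the summed value."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc[key]] += doc[value]
    for group_key in sorted(totals):
        yield {key: group_key, "sum": totals[group_key]}


def run_pipeline(docs, stages):
    """Stream the documents through each stage in order and collect the output."""
    stream = iter(docs)
    for stage in stages:
        stream = stage(stream)
    return list(stream)


result = run_pipeline(ORDERS, [
    lambda docs: match(docs, status="shipped"),
    lambda docs: project(docs, "customer", "total"),
    lambda docs: group_sum(docs, "customer", "total"),
])
print(result)
```

Because the filter and projection stages are generators, documents flow through them one at a time; only the grouping stage has to see the whole stream before it can emit anything.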


Another extremely powerful feature in MongoDB is known as “capped collections”.  With the capped collections facility, a user can define a maximum size for a collection - after which the collection can simply be written to blindly, and it will roll over
data as necessary to maintain the specified size limit.   This feature is extremely useful for capturing logs and other streaming data for analysis.


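A quick primer on capped collections

Characteristics (some of these restrictions depend on the MongoDB version):

1. Documents can only be inserted and updated, not deleted individually; drop() can be used to remove the whole collection.

2. Old data is evicted automatically. Once the collection reaches its configured size, the oldest documents are pushed out to make room for newly inserted ones. This resembles the LRU lists used in many places in Oracle, although eviction here follows insertion order rather than how recently a document was read. Note that an update cannot evict other documents to make room for the updated data, which is reasonable.

3. Documents are kept in insertion order. An ordinary collection always has an index on _id, whereas older MongoDB releases did not create one for capped collections by default.

4. Queries and inserts are fast. If writes outnumber reads, avoid creating indexes; otherwise every write consumes considerable extra resources.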
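
The roll-over behavior of a capped collection can be imitated with a small in-memory buffer. The `CappedLog` class below is a hypothetical toy, not MongoDB's storage engine or API: it assumes entries are plain strings, measures size as UTF-8 bytes, and evicts the oldest entries first whenever a new insert would exceed the limit.

```python
from collections import deque


class CappedLog:
    """Toy imitation of a size-capped collection: a FIFO buffer bounded by total bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = deque()

    def insert(self, entry):
        """Append an entry, evicting the oldest entries until it fits."""
        size = len(entry.encode("utf-8"))
        if size > self.max_bytes:
            raise ValueError("entry is larger than the whole collection")
        while self.used_bytes + size > self.max_bytes:
            oldest = self._entries.popleft()
            self.used_bytes -= len(oldest.encode("utf-8"))
        self._entries.append(entry)
        self.used_bytes += size

    def find(self):
        """Return the entries in insertion order, oldest first."""
        return list(self._entries)


log = CappedLog(max_bytes=12)
for line in ["aaaa", "bbbb", "cccc", "dddd"]:
    log.insert(line)
print(log.find())  # the oldest entry has been rolled over
```

The writer never has to check how full the buffer is or clean up old data, which is the property that makes this pattern convenient for logs and other streaming data.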

As you can see, while ElasticSearch and MongoDB have some overlap in possible use cases, they are not the same tool.  But what about Hadoop? Isn’t Hadoop “just MapReduce”, which is supported by MongoDB anyway? Is there really a use case for Hadoop
where MongoDB isn’t just as suitable?

In a word, yes.  Hadoop is the grandfather of MapReduce-based cluster computing frameworks.  Hadoop provides probably the most flexible and powerful overall environment for processing large amounts of data, and definitely fits niches for which
you would not use ElasticSearch or MongoDB.

To understand why this is true, look at how Hadoop abstracts storage - via HDFS - from the associated computational facility.   With data stored in HDFS, any arbitrary job can be run against that data, using either Java code written to the core
MapReduce API, or arbitrary code written in native languages using Hadoop Streaming.   And starting with Hadoop 2 and YARN, even the core programming model is abstracted so that you aren’t limited to MapReduce.  With YARN you can, for example, implement MPI
on top of Hadoop and write jobs in that style.
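
As an illustration of the Hadoop Streaming style mentioned above, here is a minimal word-count sketch in Python. In a real streaming job the map and reduce steps would be separate scripts that read lines from standard input and write tab-separated key/value records to standard output, with the framework sorting records by key in between. The `mapper`, `reducer` and `run_locally` names are hypothetical, and `run_locally` is only a stand-in for that sort step so that the example runs without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit a 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce phase: sum the counts of consecutive records that share a key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key (the shuffle), then reduce."""
    return list(reducer(sorted(mapper(lines))))


print(run_locally(["big data big cluster", "data pipeline"]))
```

The reducer relies on its input being sorted by key, which is why it can total each word with a single pass and constant memory per key.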

Additionally, the Hadoop ecosystem provides a staggering array of tools that build on top of HDFS and core MapReduce to query, analyze and process data.  Hive provides a “SQL like” language that allows business analysts to query data using a syntax
they are already familiar with.  HBase provides a column-oriented database on top of Hadoop.  Pig and Sizzle provide two more alternative programming models for querying Hadoop data.  With data stored in HDFS using Hadoop, you inherit the ability to simply
plug in Apache Mahout to use advanced machine learning algorithms on your data.  And with RHadoop, it is straightforward to use the R statistical language to perform advanced statistical analyses on Hadoop data.

So while Hadoop and MongoDB also have some overlapping use cases, and share some useful functionality (seamless horizontal scalability, for example) it remains the case that each tool serves a specific purpose in enterprise computing.  If you simply
want to locate documents by keyword and perform simple analytics, then ElasticSearch may fit the bill.  If you need to query documents that can be modeled as JSON and perform moderately more sophisticated analysis, then MongoDB becomes a compelling choice.
 And if you have a huge quantity of data that needs a wide variety of different types of complex processing and analysis, then Hadoop provides the broadest range of tools and the most flexibility.

As always, it is important to choose the right tool(s) for the job at hand.  And in the “Big Data” space the sheer number of technologies and the blurry lines between them can make this difficult.  As we can see, there are specific scenarios which best suit
each of these technologies and, more importantly, the differences do matter.  The best news of all, though, is that you are not limited to using only one of these tools.   Depending on the details of your use case, it may actually make sense to build a combination
platform.  For example, ElasticSearch and Hadoop are known to work well together, with ElasticSearch providing rapid keyword search, and Hadoop jobs powering the more complicated analytics.

In the end, it takes ample research and careful analysis to make the best choices for your computing environment.   Before selecting any technology or platform, take the time to evaluate it carefully, understand what scenarios it was designed to
optimize for, and what tradeoffs and sacrifices it makes.  Start with a small pilot project to “kick the tires” before converting your entire enterprise to a new platform, and slowly grow into the new stack.

Follow these steps and you can successfully navigate the maze of “Big Data” technologies and reap the associated benefits.

Reposted from: http://www.osintegrators.com/opensoftwareintegrators%7CChoosing-Between-ElasticSearch-MongoDB-%2526-Hadoop

