Big Data Ingestion and streaming product introduction
Flume
Flume isdistributed system for collecting log data from many sources, aggregating it,and writing it to HDFS. It is designed to be reliable and highly available, whileproviding a simple, flexible, and intuitive programming model based onstreaming data flows. Flume provides extensibility for online analyticapplications that process data stream in situ. Flume and Chukwa share similar goalsand features. However, there are some notable differences. Flume maintains acentral list of ongoing data flows, stored redundantly in Zookeeper. Incontrast, Chukwa distributes this information more broadly among its services.Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machineare responsible for deciding what data to send.
Chukwa
Log processing wasone of the original purposes of MapReduce. Unfortunately, Hadoop is hard to usefor this purpose. Writing MapReduce jobs to process logs is somewhat tediousand the batch nature of MapReduce makes it difficult to use with logs that aregenerated incrementally across many machines. Furthermore, HDFS stil does notsupport appending to existing files. Chukwa is a Hadoop subproject that bridgesthat gap between log handling and MapReduce. It provides a scalable distributedsystem for monitoring and analysis of log-based data. Some of the durabilityfeatures include agent-side replying of data to recover from errors. See alsoFlume.
Sqoop
Apache Sqoop is atool designed for efficiently transferring bulk data between Apache Hadoop andstructured datastores such as relational databases. It offers two-wayreplication with both snapshots and incremental updates.
Kafka
Apache Kafka is adistributed publishes-subscribe messaging system. It is designed to providehigh throughput persistent messaging that’s scalable and allows for paralleldata loads into Hadoop. Its features include the use of compression to optimizeIO performance and mirroring to improve availability, scalability and tooptimize performance in multiple-cluster scenarios.
Storm
Hadoop is ideal forbatch-mode processing over massive data sets, but it doesn’t supportevent-stream (a.k.a. message-stream) processing, i.e., responding to individualevents within a reasonable time frame. (For limited scenarios, you could use aNoSQL database like HBase to capture incoming data in the form of appendupdates.) Storm is a general-purpose, event-processing system that is growingin popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses acluster of services for scalability and reliability. In Storm terminology youcreate a topology that runs continuously over a stream of incoming data, whichis analogous to a Hadoop job that runs as a batch process over a fixed data setand then terminates. An apt analogy is a continuous stream of water flowingthrough plumbing. The data sources for the topology are called spouts and eachprocessing node is called a bolt. Bolts can perform arbitrarily sophisticatedcomputations on the data, including output to data stores and other services.It is common for organizations to run a combination of Hadoop and Stormservices to gain the best features of both platforms.
Big Data Ingestion and streaming product introduction的更多相关文章
- timer Compliant Controller project (1)--Product introduction meeting
Last week ,I lead the meeting for new project. i'm very excited. The meeting is divided into the fo ...
- [Data Structures and Algorithms - 1] Introduction & Mathematics
References: 1. Stanford University CS97SI by Jaehyun Park 2. Introduction to Algorithms 3. Kuangbin' ...
- An Introduction to Text Mining using Twitter Streaming
Text mining is the application of natural language processing techniques and analytical methods to t ...
- (转)Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning
Introduction Optimization is always the ultimate goal whether you are dealing with a real life probl ...
- [转]Efficiently Paging Through Large Amounts of Data
本文转自:http://msdn.microsoft.com/en-us/library/bb445504.aspx Scott Mitchell April 2007 Summary: This i ...
- Spark Streaming官方文档学习--下
Accumulators and Broadcast Variables 这些不能从checkpoint重新恢复 如果想启动检查点的时候使用这两个变量,就需要创建这写变量的懒惰的singleton实例 ...
- 【Repost】A Practical Intro to Data Science
Are you a interested in taking a course with us? Learn about our programs or contact us at hello@zip ...
- 100 open source Big Data architecture papers for data professionals
zhuan :https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan Big Da ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
随机推荐
- struts2 通配符简化配置
在struts映射中反复出现的模式 动作方法 描写叙述 下一个动作方法 add 为save准备网页 save save 提交INSERT list edit 为update准备网页 update up ...
- SQL_修改表结构
***********************************************声明*************************************************** ...
- [连载]Java程序设计(04)---任务驱动的方法:工资结算系统
任务:或在公司,该公司将其分为三类人员:部门经理.销售员.在发工资的时候,部门经理拿固定月薪8000元.技术人员按每小时100元领取月薪.销售人员依照500元底薪加当月销售额的4%进行提成.设计并实现 ...
- yii中登录后跳转回登录前请求的页面
当我们请求一个经过权限控制的请求不通过时,会跳转到一个地方请求权限,请求结束后需要跳转回之前的页面.比如我们请求一个需要登录的action,会被跳转到login页面,我们希望登录成功后跳转到我们之前希 ...
- ArcGIS课程:表面数据转换成矢量数据
虽然TIN (TIN) 和 terrain 数据收集被认为是载体表面.但它们实际上包括基于其他信息元素.并且该信息是在图象点.线或多边形原始格这可能是更实用的公式.在 ArcGIS 在,你可以很容易的 ...
- JTAG应该如何接线
下面是某个ARM9评估板的原理图: 注意: 1. Vref和Vtarget可以直接连在一起,由被调试板提供3.3V或5V电源: 2. nTRST,最好上拉: 3. TDI,最好上拉 4. TMS,最好 ...
- js关闭当前页面不弹出提示的方法
js关闭当前页面不弹出提示的方法 js关闭当前页面不弹出提示的方法 "window.opener=null;window.open('','_self','');window.close() ...
- navigator获取参数
<script type="text/javascript" language="javascript"> document.write(" ...
- C#事件与委托的区别
C#事件与委托的区别 1. 委托 事件是利用委托来定义的,因此先解释委托.委托是一个类,它与其他类如int,string等没有本质区别,int代表的是所有的整形,而string代表的是字符串,委托则代 ...
- twisted学习笔记 No.1
原创博文,转载请注明出处 . 1.安装twisted ,然后安装PyOpenSSL(一个Python开源OpenSSL库),这个软件包用于给Twisted提供加密传输支持(SSL).最后,安装PyCr ...