Big Data Ingestion and Streaming Product Introduction
Flume
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features. However, there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a "hop-by-hop" model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
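To make the flow concrete, here is a minimal sketch using Flume's Java client SDK (`RpcClientFactory`); it assumes an agent with an Avro source listening on the hypothetical host and port shown, whose channels and sinks then move the event hop by hop toward HDFS:

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeLogSender {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to a Flume agent's Avro source; host and port are assumptions
        // and must match the agent's configuration.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent.example.com", 41414);
        try {
            // Wrap one log line in a Flume event and hand it to the agent; the
            // agent's channels and sinks forward it (e.g., to an HDFS sink).
            Event event = EventBuilder.withBody("app started", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```

The application only talks to its local agent; where the data ultimately lands is decided by the agent configuration, not by the sending code.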
Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Its durability features include agent-side replay of data to recover from errors. See also Flume.
Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates.
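As a rough illustration, the sketch below assumes Sqoop 1.x's programmatic entry point `org.apache.sqoop.Sqoop.runTool`, which takes the same arguments as the command line; the JDBC URL, table, directories, and last-value are placeholders. It runs a full snapshot import followed by an incremental append keyed on a monotonically increasing column:

```java
import org.apache.sqoop.Sqoop;

public class OrdersImport {
    public static void main(String[] args) {
        // Snapshot: copy the whole table into HDFS (all names are placeholders).
        int snapshot = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--table", "orders",
            "--target-dir", "/data/orders"
        });

        // Incremental update: only rows whose id exceeds the last imported value.
        int incremental = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--table", "orders",
            "--target-dir", "/data/orders_delta",
            "--incremental", "append",
            "--check-column", "id",
            "--last-value", "1000000"
        });

        System.exit(snapshot == 0 && incremental == 0 ? 0 : 1);
    }
}
```

The same flags can be run directly from the `sqoop` command line; `sqoop export` reverses the direction, moving HDFS data back into the relational database.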
Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput, persistent messaging that is scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize IO performance and mirroring to improve availability, scalability, and performance in multi-cluster scenarios.
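For illustration, the minimal sketch below uses Kafka's Java producer API to publish compressed messages to a topic; the broker address and topic name are assumptions. Downstream consumers, including Hadoop loaders, would read from the same topic in parallel:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("compression.type", "gzip"); // batch compression to reduce IO, as noted above

        // Publish one message to the (assumed) "page-views" topic; closing the
        // producer flushes any buffered records.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "user-42", "{\"url\":\"/home\"}"));
        }
    }
}
```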
Storm
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn't support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose, event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts, and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
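The sketch below shows the basic shape of a topology, assuming Storm 2.x's `org.apache.storm` Java API; the spout, bolt, and topology names are made up for illustration. A toy spout emits log lines and a single bolt transforms them; a real spout would read from a queue or a system such as Kafka:

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class LogTopology {

    // Toy spout: emits a constant log line every 100 ms.
    public static class SampleLogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("GET /index.html 200"));
            Utils.sleep(100);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // Toy bolt: upper-cases each line. Bolts can hold arbitrarily complex
    // logic, including writes to databases or other services.
    public static class UppercaseBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getStringByField("line").toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("log-spout", new SampleLogSpout());
        builder.setBolt("uppercase", new UppercaseBolt()).shuffleGrouping("log-spout");

        // Run in-process for testing; a production deployment would submit the
        // same topology to a Storm cluster via StormSubmitter instead.
        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("log-topology", new Config(), builder.createTopology());
            Utils.sleep(10_000); // let the topology run briefly, then shut down
        }
    }
}
```

Unlike a MapReduce job, this topology has no natural end: it keeps consuming tuples until it is explicitly killed or the local cluster is shut down.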