Big Data Ingestion and Streaming Product Introduction
Flume
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume also provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features, but there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper; in contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
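To make the data-flow model above concrete, here is a minimal sketch of a single-agent Flume configuration in Flume's standard properties format; the agent name, log file path, and HDFS URL are assumptions for the example, not values from the text.

```properties
# One agent ("a1") with one source, one channel, and one sink (names are hypothetical)
a1.sources  = tailSrc
a1.channels = memCh
a1.sinks    = hdfsSink

# Source: tail an application log file (path is a placeholder)
a1.sources.tailSrc.type     = exec
a1.sources.tailSrc.command  = tail -F /var/log/myapp/app.log
a1.sources.tailSrc.channels = memCh

# Channel: buffer events in memory between source and sink
a1.channels.memCh.type     = memory
a1.channels.memCh.capacity = 10000

# Sink: write the aggregated events to HDFS (NameNode URL is a placeholder)
a1.sinks.hdfsSink.type                   = hdfs
a1.sinks.hdfsSink.channel                = memCh
a1.sinks.hdfsSink.hdfs.path              = hdfs://namenode:8020/flume/logs/%Y-%m-%d
a1.sinks.hdfsSink.hdfs.fileType          = DataStream
a1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
```

An agent like this is typically started with something like `flume-ng agent --name a1 --conf-file agent.conf`; the “hop-by-hop” pattern chains agents by pointing one agent's Avro sink at the next agent's Avro source.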
Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose: writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable, distributed system for monitoring and analyzing log-based data. Its durability features include agent-side replaying of data to recover from errors. See also Flume.
Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates.
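As a sketch of how such a transfer is usually invoked, the commands below import a relational table into HDFS, first as a full snapshot and then incrementally; the JDBC URL, credentials, table name, and target directory are hypothetical.

```bash
# Full snapshot import of one table into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders

# Incremental import: fetch only rows whose id is greater than the last imported value
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  --incremental append \
  --check-column id \
  --last-value 100000
```

The reverse direction (HDFS back into the database) uses `sqoop export` with an `--export-dir` pointing at the data to push.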
Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput, persistent messaging that is scalable and allows for parallel data loads into Hadoop. Its features include the use of compression to optimize I/O performance and mirroring to improve availability, scalability, and performance in multi-cluster scenarios.
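To make the publish-subscribe model concrete, below is a minimal sketch of a producer written against Kafka's standard Java client (org.apache.kafka.clients); the broker address, topic name, and message contents are placeholders, not values from the text.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address and topic are hypothetical for this sketch
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        // Compression is one of the throughput features mentioned above
        props.put("compression.type", "gzip");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one log line, keyed by host, to the "web-logs" topic
            producer.send(new ProducerRecord<>("web-logs", "host1", "GET /index.html 200"));
        }
    }
}
```

Consumers subscribe to the same topic independently, which is what allows one stream of events to feed Hadoop loads and other downstream systems in parallel.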
Storm
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
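The sketch below shows the topology/spout/bolt structure in Java, assuming a recent Storm release with the org.apache.storm package layout; the spout simply emits a constant line to stand in for a real data source, and the component names and parallelism hints are made up for the example.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class LogTopology {

    // Spout: the data source of the topology. Here it emits a fixed log line
    // once per second as a stand-in for a queue, socket, or log tailer.
    public static class FixedLineSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("GET /index.html 200"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }

    // Bolt: a processing node. Real bolts can do arbitrarily sophisticated work,
    // including writing results to data stores and other services.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("event: " + tuple.getStringByField("line"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("lines", new FixedLineSpout(), 2);
        builder.setBolt("printer", new PrintBolt(), 4).shuffleGrouping("lines");

        // Unlike a Hadoop job, this topology runs until it is explicitly killed.
        StormSubmitter.submitTopology("log-topology", new Config(), builder.createTopology());
    }
}
```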