Big Data Ingestion and streaming product introduction
Flume
Flume isdistributed system for collecting log data from many sources, aggregating it,and writing it to HDFS. It is designed to be reliable and highly available, whileproviding a simple, flexible, and intuitive programming model based onstreaming data flows. Flume provides extensibility for online analyticapplications that process data stream in situ. Flume and Chukwa share similar goalsand features. However, there are some notable differences. Flume maintains acentral list of ongoing data flows, stored redundantly in Zookeeper. Incontrast, Chukwa distributes this information more broadly among its services.Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machineare responsible for deciding what data to send.
Chukwa
Log processing wasone of the original purposes of MapReduce. Unfortunately, Hadoop is hard to usefor this purpose. Writing MapReduce jobs to process logs is somewhat tediousand the batch nature of MapReduce makes it difficult to use with logs that aregenerated incrementally across many machines. Furthermore, HDFS stil does notsupport appending to existing files. Chukwa is a Hadoop subproject that bridgesthat gap between log handling and MapReduce. It provides a scalable distributedsystem for monitoring and analysis of log-based data. Some of the durabilityfeatures include agent-side replying of data to recover from errors. See alsoFlume.
Sqoop
Apache Sqoop is atool designed for efficiently transferring bulk data between Apache Hadoop andstructured datastores such as relational databases. It offers two-wayreplication with both snapshots and incremental updates.
Kafka
Apache Kafka is adistributed publishes-subscribe messaging system. It is designed to providehigh throughput persistent messaging that’s scalable and allows for paralleldata loads into Hadoop. Its features include the use of compression to optimizeIO performance and mirroring to improve availability, scalability and tooptimize performance in multiple-cluster scenarios.
Storm
Hadoop is ideal forbatch-mode processing over massive data sets, but it doesn’t supportevent-stream (a.k.a. message-stream) processing, i.e., responding to individualevents within a reasonable time frame. (For limited scenarios, you could use aNoSQL database like HBase to capture incoming data in the form of appendupdates.) Storm is a general-purpose, event-processing system that is growingin popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses acluster of services for scalability and reliability. In Storm terminology youcreate a topology that runs continuously over a stream of incoming data, whichis analogous to a Hadoop job that runs as a batch process over a fixed data setand then terminates. An apt analogy is a continuous stream of water flowingthrough plumbing. The data sources for the topology are called spouts and eachprocessing node is called a bolt. Bolts can perform arbitrarily sophisticatedcomputations on the data, including output to data stores and other services.It is common for organizations to run a combination of Hadoop and Stormservices to gain the best features of both platforms.
Big Data Ingestion and streaming product introduction的更多相关文章
- timer Compliant Controller project (1)--Product introduction meeting
Last week ,I lead the meeting for new project. i'm very excited. The meeting is divided into the fo ...
- [Data Structures and Algorithms - 1] Introduction & Mathematics
References: 1. Stanford University CS97SI by Jaehyun Park 2. Introduction to Algorithms 3. Kuangbin' ...
- An Introduction to Text Mining using Twitter Streaming
Text mining is the application of natural language processing techniques and analytical methods to t ...
- (转)Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning
Introduction Optimization is always the ultimate goal whether you are dealing with a real life probl ...
- [转]Efficiently Paging Through Large Amounts of Data
本文转自:http://msdn.microsoft.com/en-us/library/bb445504.aspx Scott Mitchell April 2007 Summary: This i ...
- Spark Streaming官方文档学习--下
Accumulators and Broadcast Variables 这些不能从checkpoint重新恢复 如果想启动检查点的时候使用这两个变量,就需要创建这写变量的懒惰的singleton实例 ...
- 【Repost】A Practical Intro to Data Science
Are you a interested in taking a course with us? Learn about our programs or contact us at hello@zip ...
- 100 open source Big Data architecture papers for data professionals
zhuan :https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan Big Da ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
随机推荐
- 转载+自练(莫喷)怎样在cocos2d 2.1.4里面使用动画和Texture Packer
本文实践自 Ray Wenderlich.Tony Dahbura 的文章<How to Use Animations and Sprite Sheets in Cocos2D 2.X>, ...
- jQuery圆形统计图实战开发
今天我给大家介绍一款圆形统计图circliful,它基于HTML5的画布和jQuery,无需使用图像轻松实现圆形统计图,而且有很多属性设置,使用起来非常方便. 首先我们需要将jquery库文件和jqu ...
- Spring之SpringMVC的Controller(源码)分析
说明: 例子就不举了,还是直接进入主题,本文主要是以SpringMVC的Controller接口为入点,来分析SpringMVC中C的具体实现和处理过程. 1.Controller接口 public ...
- Mac OSX操作系统安装和配置Zend Server 6教程(3)
Zend Server安装好以后,在php.ini文件中,没有默认时区.就是导致很多警告信息出现的根本. 接下来,我们看看如果修改这个文件. 首先,进入php.ini文件.此文件在目录zend/etc ...
- DbUtility Ex
扩展 DbUtility (1) 2014-05-22 21:48 by Ivony..., 234 阅读, 3 评论, 收藏, 编辑 本文原始路径: https://www.zybuluo.com/ ...
- [置顶] 如何使用c3p0+spring连接oracle数据库
1. 首先是jdbc.properties属性文件的编写,便于数据库移植: datasource.driverClassName=oracle.jdbc.driver.OracleDriver dat ...
- iOS基础 - Copy
copy和mutableCopy 一个对象使用copy或mutableCopy方法可以创建对象的副本 copy – 需要先实现NSCoppying协议,创建的是不可变副本(如NSString.NSAr ...
- Eclipse添加Web和java EE插件
1.在Eclipse中菜单help选项中选择install new software选项 2.在work with 栏中输入 Juno - http://download.eclipse.org/re ...
- 企业架构研究总结(30)——TOGAF架构内容框架之内容元模型(上)
2. 内容元模型(Content Metamodel) 在TOGAF的眼中,企业架构是以一系列架构构建块为基础的,并将目录.矩阵和图形作为其具体展现方式.如果我们把这些表述方式看作为构建块的语法,那么 ...
- SQL Server跨网段(跨机房)FTP复制
SQL Server跨网段(跨机房)FTP复制 2013-09-24 17:53 by 听风吹雨, 273 阅读, 0 评论, 收藏, 编辑 一. 背景 搭建SQL Server复制的时候,如果网络环 ...