Big Data Ingestion and streaming product introduction
Flume
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ. Flume and Chukwa share similar goals and features. However, there are some notable differences. Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper. In contrast, Chukwa distributes this information more broadly among its services. Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machine are responsible for deciding what data to send.
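To make the “hop-by-hop” idea concrete, here is a minimal sketch in plain Python. It is not Flume's API: the `Hop` class and the agent/collector/sink names are hypothetical. It also assumes that "hop-by-hop" means each stage keeps an event buffered until the next stage acknowledges it.

```python
from collections import deque


class Hop:
    """Toy relay stage (hypothetical, not a Flume class)."""

    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.buffer = deque()   # events received but not yet handed on
        self.delivered = []     # used only by the last hop, standing in for HDFS

    def receive(self, event):
        """Accept an event and acknowledge it to the previous hop."""
        self.buffer.append(event)
        return True

    def flush(self):
        """Forward buffered events; drop each one only after the next hop acknowledges it."""
        while self.buffer:
            event = self.buffer[0]
            if self.downstream is None:
                self.delivered.append(event)
            elif not self.downstream.receive(event):
                break  # keep the event buffered and retry on a later flush
            self.buffer.popleft()


# Three hops: agent -> collector -> sink
sink = Hop("sink")
collector = Hop("collector", downstream=sink)
agent = Hop("agent", downstream=collector)

for line in ["GET /a", "GET /b"]:
    agent.receive(line)

for hop in (agent, collector, sink):
    hop.flush()

print(sink.delivered)
```

Each hop is responsible only for the next one, so a failure between two stages leaves the data buffered at the earlier stage rather than lost.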
Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, when Chukwa was designed, HDFS did not support appending to existing files. Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analysis of log-based data. Its durability features include agent-side replaying of data to recover from errors. See also Flume.
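Agent-side replay can be pictured with a small sketch. This is not Chukwa's code: `ReplayingAgent` and `flaky_collector` are hypothetical names. It assumes the agent remembers the last position a collector acknowledged and resends everything after that position when a send fails.

```python
class ReplayingAgent:
    """Toy agent that resends unacknowledged log lines (hypothetical)."""

    def __init__(self, lines):
        self.lines = lines
        self.committed = 0  # index of the first line not yet acknowledged

    def send_pending(self, collector):
        """Send every unacknowledged line; stop at the first failure."""
        for i in range(self.committed, len(self.lines)):
            if not collector(self.lines[i]):
                return False  # checkpoint is unchanged, so this line is replayed later
            self.committed = i + 1
        return True


received = []
fail_once = {"armed": True}


def flaky_collector(line):
    """Stand-in collector that rejects 'line-2' exactly once."""
    if line == "line-2" and fail_once["armed"]:
        fail_once["armed"] = False
        return False
    received.append(line)
    return True


agent = ReplayingAgent(["line-1", "line-2", "line-3"])
first_try = agent.send_pending(flaky_collector)   # stops at line-2
second_try = agent.send_pending(flaky_collector)  # replays from line-2 onward

print(first_try, second_try, received)
```

Because the checkpoint only advances on acknowledgement, a failed send costs a retry but no data.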
Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. It offers two-way replication with both snapshots and incremental updates.
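The difference between a snapshot and an incremental update can be sketched with Python's built-in `sqlite3` module. This does not invoke Sqoop. The `orders` table and the `snapshot` and `incremental` helpers are made up for illustration, and the example assumes new rows can be detected by a monotonically increasing `id` column.

```python
import sqlite3

# In-memory database standing in for the relational source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("pen",), ("ink",)])


def snapshot(conn):
    """Full copy: read every row in the table."""
    return conn.execute("SELECT id, item FROM orders ORDER BY id").fetchall()


def incremental(conn, last_id):
    """Incremental update: read only rows added after the last imported id."""
    return conn.execute(
        "SELECT id, item FROM orders WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()


full = snapshot(conn)
last_id = full[-1][0]  # remember the high-water mark of the snapshot

conn.execute("INSERT INTO orders (item) VALUES (?)", ("paper",))
delta = incremental(conn, last_id)

print(full)
print(delta)
```

A snapshot is simple but rereads everything. The incremental query transfers only the new rows, at the cost of tracking the high-water mark between runs.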
Kafka
Apache Kafka is a distributed publish-subscribe messaging system. It is designed to provide high-throughput persistent messaging that is scalable and allows for parallel data loads into Hadoop. Its features include compression to optimize I/O performance, and mirroring to improve availability and scalability and to optimize performance in multiple-cluster scenarios.
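The following sketch illustrates persistent publish-subscribe messaging in plain Python. It is not the Kafka client API: `TopicLog`, `publish`, `poll`, and the group names are hypothetical. It models only the core idea that messages stay in an append-only log while each subscriber tracks its own read position.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only topic with per-subscriber offsets (hypothetical)."""

    def __init__(self):
        self.messages = []               # retained after being read
        self.offsets = defaultdict(int)  # next position to read, per subscriber group

    def publish(self, message):
        """Append a message and return its position in the log."""
        self.messages.append(message)
        return len(self.messages) - 1

    def poll(self, group, max_messages=10):
        """Return the next batch for this group and advance only its offset."""
        start = self.offsets[group]
        batch = self.messages[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for m in ["click:1", "click:2", "click:3"]:
    log.publish(m)

loader = log.poll("hadoop-loader", max_messages=2)  # batch load, two at a time
dashboard = log.poll("dashboard")                   # independent subscriber sees everything
loader_rest = log.poll("hadoop-loader")             # resumes where it left off

print(loader, dashboard, loader_rest)
```

Because reading does not remove messages, a slow batch loader and a fast online consumer can share the same stream without interfering with each other.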
Storm
Hadoop is ideal for batch-mode processing over massive data sets, but it doesn’t support event-stream (a.k.a. message-stream) processing, i.e., responding to individual events within a reasonable time frame. (For limited scenarios, you could use a NoSQL database like HBase to capture incoming data in the form of append updates.) Storm is a general-purpose, event-processing system that is growing in popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses a cluster of services for scalability and reliability. In Storm terminology you create a topology that runs continuously over a stream of incoming data, which is analogous to a Hadoop job that runs as a batch process over a fixed data set and then terminates. An apt analogy is a continuous stream of water flowing through plumbing. The data sources for the topology are called spouts and each processing node is called a bolt. Bolts can perform arbitrarily sophisticated computations on the data, including output to data stores and other services. It is common for organizations to run a combination of Hadoop and Storm services to gain the best features of both platforms.
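The spout-and-bolt structure can be sketched with Python generators. This is not Storm's API, and it runs in a single process over a finite input, whereas a real topology runs continuously on a cluster. The names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical.

```python
from collections import Counter


def sentence_spout():
    """Spout: the data source. A real spout would emit events indefinitely."""
    for sentence in ["the cow jumped", "the moon"]:
        yield sentence


def split_bolt(stream):
    """Bolt: split each incoming sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep a running count and emit an update for every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wire the "topology": spout -> split bolt -> count bolt
results = list(count_bolt(split_bolt(sentence_spout())))
print(results)
```

Each event flows through the whole pipeline as soon as it arrives, so the counts are updated per event instead of after a batch job finishes.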