Big Data Ingestion and streaming product introduction
Flume
Flume isdistributed system for collecting log data from many sources, aggregating it,and writing it to HDFS. It is designed to be reliable and highly available, whileproviding a simple, flexible, and intuitive programming model based onstreaming data flows. Flume provides extensibility for online analyticapplications that process data stream in situ. Flume and Chukwa share similar goalsand features. However, there are some notable differences. Flume maintains acentral list of ongoing data flows, stored redundantly in Zookeeper. Incontrast, Chukwa distributes this information more broadly among its services.Flume adopts a “hop-by-hop” model, while in Chukwa the agents on each machineare responsible for deciding what data to send.
Chukwa
Log processing wasone of the original purposes of MapReduce. Unfortunately, Hadoop is hard to usefor this purpose. Writing MapReduce jobs to process logs is somewhat tediousand the batch nature of MapReduce makes it difficult to use with logs that aregenerated incrementally across many machines. Furthermore, HDFS stil does notsupport appending to existing files. Chukwa is a Hadoop subproject that bridgesthat gap between log handling and MapReduce. It provides a scalable distributedsystem for monitoring and analysis of log-based data. Some of the durabilityfeatures include agent-side replying of data to recover from errors. See alsoFlume.
Sqoop
Apache Sqoop is atool designed for efficiently transferring bulk data between Apache Hadoop andstructured datastores such as relational databases. It offers two-wayreplication with both snapshots and incremental updates.
Kafka
Apache Kafka is adistributed publishes-subscribe messaging system. It is designed to providehigh throughput persistent messaging that’s scalable and allows for paralleldata loads into Hadoop. Its features include the use of compression to optimizeIO performance and mirroring to improve availability, scalability and tooptimize performance in multiple-cluster scenarios.
Storm
Hadoop is ideal forbatch-mode processing over massive data sets, but it doesn’t supportevent-stream (a.k.a. message-stream) processing, i.e., responding to individualevents within a reasonable time frame. (For limited scenarios, you could use aNoSQL database like HBase to capture incoming data in the form of appendupdates.) Storm is a general-purpose, event-processing system that is growingin popularity for addressing this gap in Hadoop. Like Hadoop, Storm uses acluster of services for scalability and reliability. In Storm terminology youcreate a topology that runs continuously over a stream of incoming data, whichis analogous to a Hadoop job that runs as a batch process over a fixed data setand then terminates. An apt analogy is a continuous stream of water flowingthrough plumbing. The data sources for the topology are called spouts and eachprocessing node is called a bolt. Bolts can perform arbitrarily sophisticatedcomputations on the data, including output to data stores and other services.It is common for organizations to run a combination of Hadoop and Stormservices to gain the best features of both platforms.
Big Data Ingestion and streaming product introduction的更多相关文章
- timer Compliant Controller project (1)--Product introduction meeting
Last week ,I lead the meeting for new project. i'm very excited. The meeting is divided into the fo ...
- [Data Structures and Algorithms - 1] Introduction & Mathematics
References: 1. Stanford University CS97SI by Jaehyun Park 2. Introduction to Algorithms 3. Kuangbin' ...
- An Introduction to Text Mining using Twitter Streaming
Text mining is the application of natural language processing techniques and analytical methods to t ...
- (转)Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning
Introduction Optimization is always the ultimate goal whether you are dealing with a real life probl ...
- [转]Efficiently Paging Through Large Amounts of Data
本文转自:http://msdn.microsoft.com/en-us/library/bb445504.aspx Scott Mitchell April 2007 Summary: This i ...
- Spark Streaming官方文档学习--下
Accumulators and Broadcast Variables 这些不能从checkpoint重新恢复 如果想启动检查点的时候使用这两个变量,就需要创建这写变量的懒惰的singleton实例 ...
- 【Repost】A Practical Intro to Data Science
Are you a interested in taking a course with us? Learn about our programs or contact us at hello@zip ...
- 100 open source Big Data architecture papers for data professionals
zhuan :https://www.linkedin.com/pulse/100-open-source-big-data-architecture-papers-anil-madan Big Da ...
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN
Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Inp ...
随机推荐
- C语言库函数大全及应用实例十三
原文:C语言库函数大全及应用实例十三 [编程资料]C语言库函数大全及应用实例十三 函数名: stat 功 能: 读取打 ...
- ASP.NET MVC局部视图
使用ASP.NET MVC局部视图避免JS拼接HTML,编写易于维护的HTML页面 以前使用ASP.NET WebForm开发时,喜欢使用Repeater控件嵌套的方式开发前台页面,这样就不用JS ...
- C# 编译器选项 /platform(指定输出平台)32位程序运行到x64平台的问题
如果说你编译的exe运行时报错: “尝试读取或写入受保护的内存.这通常指示其他内存已损坏” 这很有可能是你是以非托管的方式错误地引用了64位的API中去. 为什么会这样? 那你就要考虑VS的编译器选项 ...
- Android学习路径——Android的四个组成部分activity(一)
一.什么是Activity? Activity简单的说就是一个接口.我们是Android手机上看到的每个界面就是一个activity. 二.Activity的创建 1.定义一个类继承activity, ...
- 【转】Android实现推送方式解决方案
本文介绍在Android中实现推送方式的基础知识及相关解决方案.推送功能在手机开发中应用的场景是越来起来了,不说别的,就我们手机上的新闻客户端就时不j时的推送过来新的消息,很方便的阅读最新的新闻信息. ...
- Windows服务、批处理项目实战
一周一话题之三(Windows服务.批处理项目实战) -->目录导航 一. Windows服务 1. windows service介绍 2. 使用步骤 3. 项目实例--数据上传下载服务 ...
- GMap.Net
GMap.Net开发之在WinForm和WPF中使用GMap.Net地图插件 GMap.NET是什么? 来看看它的官方说明:GMap.NET is great and Powerful, Free ...
- [Usaco2008 Feb]Meteor Shower流星雨[BFS]
Description 去年偶们湖南遭受N年不遇到冰冻灾害,现在芙蓉哥哥则听说另一个骇人听闻的消息: 一场流星雨即将袭击整个霸中,由于流星体积过大,它们无法在撞击到地面前燃烧殆尽, 届时将会对它撞到的 ...
- 统计学习方法(三)——K近邻法
/*先把标题给写了.这样就能经常提醒自己*/ 1. k近邻算法 k临近算法的过程,即对一个新的样本,找到特征空间中与其最近的k个样本,这k个样本多数属于某个类,就把这个新的样本也归为这个类. 算法 ...
- Log4j、Log4j 2、Logback、SFL4J、JUL、JCL的比较
Log4j.Log4j 2.Logback.SFL4J.JUL.JCL的比较 之前就知道有好几种日志框架,但是一直都是听别人讲,在什么时候该用何种logger,哪种logger比较好……一直对Log4 ...