原文地址:http://www.javacodegeeks.com/2015/02/streaming-big-data-storm-spark-samza.html

There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences.

Apache Storm

In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline.

Apache Spark

Spark Streaming (an extension of the core Spark API) doesn’t process streams one at a time like Storm. Instead, it slices them in small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations).

Apache Samza

Samza ’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a Dstream, but a message. Streams are divided into partitions and each partition is an ordered sequence of read-only messages with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza`s Execution & Streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka.

Common Ground

All three real-time computation systems are open-source, low-latencydistributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations.

The three frameworks use different vocabularies for similar concepts:

Comparison Matrix

A few of the differences are summarized in the table below:

There are three general categories of delivery patterns:

  1. At-most-once: messages may be lost. This is usually the least desirable outcome.
  2. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases.
  3. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases.

Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident.

Use Cases

All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines.

If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching.

A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel...

Speaking of micro-batching, if you must have stateful computations, exactly-once delivery and don’t mind a higher latency, you could consider Spark Streaming…specially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQLMLlibGraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real-time.

A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu…

If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza ‘s fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects.

A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale…

Conclusion

We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.

Streaming Big Data: Storm, Spark and Samza--转载的更多相关文章

  1. 实时流Streaming大数据:Storm,Spark和Samza

    当前有许多分布式计算系统能够实时处理大数据,这篇文章是对Apache的三个框架进行比较,试图提供一个快速的高屋建瓴地异同性总结. Apache Storm 在Storm中,你设计的实时计算图称为top ...

  2. 论文阅读计划1(Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming & An Enforcement of Real Time Scheduling in Spark Streaming & StyleBank: An Explicit Representation for Neural Ima)

    Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming[1] 简介:雅虎发布的一份各种流处理引擎的基准 ...

  3. Spark Streaming概念学习系列之Spark Streaming的竞争对手

    不多说,直接上干货! Spark Streaming的竞争对手 Storm 在Storm中,先要设计一个用于实时计算的图状结构,我们称之为拓扑(topology).这个拓扑将会被提交给集群,由集群中的 ...

  4. [CDH] Process data: integrate Spark with Spring Boot

    c 一.Spark 统计计算 简单统计后写入Redis. /** * 订单统计和乘车人数统计 */ object OrderStreamingProcessor { def main(args: Ar ...

  5. [转载]流式大数据处理的三种框架:Storm,Spark和Samza

    许多分布式计算系统都可以实时或接近实时地处理大数据流.本文将对三种Apache框架分别进行简单介绍,然后尝试快速.高度概述其异同. Apache Storm 在Storm中,先要设计一个用于实时计算的 ...

  6. 流式大数据处理的三种框架:Storm,Spark和Samza

    许多分布式计算系统都可以实时或接近实时地处理大数据流.本文将对三种Apache框架分别进行简单介绍,然后尝试快速.高度概述其异同. Apache Storm 在Storm中,先要设计一个用于实时计算的 ...

  7. 大数据处理的三种框架:Storm,Spark和Samza

    许多分布式计算系统都可以实时或接近实时地处理大数据流.下面对三种Apache框架分别进行简单介绍,然后尝试快速.高度概述其异同. Apache Storm 在Storm中,先要设计一个用于实时计算的图 ...

  8. 三个大数据处理框架:Storm,Spark和Samza 介绍比较

    转自:http://www.open-open.com/lib/view/open1426065900123.html 许多分布式计算系统都可以实时或接近实时地处理大数据流.本文将对三种Apache框 ...

  9. Storm,Spark和Samza

    http://www.csdn.net/article/2015-03-09/2824135 Apache Storm 在Storm中,先要设计一个用于实时计算的图状结构,我们称之为拓扑(topolo ...

随机推荐

  1. 20155308&20155316 2017-2018-1 《信息安全系统设计基础》实验一

    20155308&20155316 2017-2018-1 <信息安全系统设计基础>实验一 此次实验我和黄月同学一起做了1.2.3.5项,第4项在实验课上做完了,但是没有按时提交. ...

  2. 3110: [Zjoi2013]K大数查询

    3110: [Zjoi2013]K大数查询 https://lydsy.com/JudgeOnline/problem.php?id=3110 分析: 整体二分+线段树. 两种操作:区间加入一个数,区 ...

  3. 四、Django设置相关

    1.全局设置 setttings文件 import os import sys # Build paths inside the project like this: os.path.join(BAS ...

  4. 网络基础知识-bps、Bps、pps的区别

    在计算机科学中,bit是表示信息的最小单位,叫做二进制位:一般用0和1表示.Byte叫做字节,由8个位(8bit)组成一个字节(1Byte),用于表示计算机中的一个字符.bit(比特)与Byte(字节 ...

  5. .net mvc 使用ueditor的开发(官网没有net版本?)

    1.ueditor的下载导入 官网下载地址:https://ueditor.baidu.com/website/download.html · 介绍 有两种,一种开发版,一种Mini版,分别长这样: ...

  6. html5新特性localStorage和sessionStorage

    HTML5 提供了两种在客户端存储数据的新方法: localStorage: (1)它的生命周期是永久的,关闭页面或浏览器之后localStorage中的数据也不会消失. (2)它的容量大小是5M作用 ...

  7. Windows10 Oracle ODBC安装配置

    项目紧迫,需在短时间内交付成果,新团队成员,吐嘈之前数据库设计太low,很难看懂数据库表结构间的关系,为了使新同事更好的了解数据库表结构,特意使用powerDesigner对oracle.mysql数 ...

  8. 2018 ACM-ICPC World Finals - Beijing F.Go with the Flow

    先枚举所有的列长度 对于每种列长度,然后里面用dp算 #include <algorithm> #include <cmath> #include <cstdio> ...

  9. 华为云分布式缓存服务DCS与开源服务差异对比

    华为云分布式缓存DCS提供单机.主备.集群等丰富的实例类型,满足用户高读写性能及快速数据访问的业务诉求.支持丰富的实例管理操作,帮助用户省去运维烦恼.用户可以聚焦于业务逻辑本身,而无需过多考虑部署.监 ...

  10. VPS挂机赚美刀详细介绍–Alexamaster操作流程

    跟 vps 主机打交道时间长了,手里也渐渐积累了些闲置的 vps.让它们这么闲着吧,感觉有些浪费资源:用起来吧,暂时又没有好的项目.一直听说通过 vps挂机可以赚回主机成本,甚至可以盈利.正好这两天有 ...