Tasks & executors relation

Q1. However I'm a bit confused by the concept of "task". Is a task an running instance of the component(spout or bolt) ? An executor having multiple tasks actually is saying the same component is executed for multiple times by the executor, am I correct ?

A1: Yes, and yes

task只是某个component(spout或者bolt)的实例。Executor线程在执行期间会调用该task的nextTuple或execute方法

Q2. Moreover in a general parallelism sense, Storm will spawn a dedicated thread(executor) for a spout or bolt, but what is contributed to the parallelism by an executor(thread) having multiple tasks ?

A2: Running more than one task per executor does not increase the level of parallelism -- an executor always has one thread that it uses for all of its tasks, which means that tasks run serially on an executor.

运行多个task并不会增加并行度,因为一个executor只是一个线程,这意味着它会顺序执行所有的task

  • The number of executor threads can be changed after the topology has been started (see storm rebalance command).
  • The number of tasks of a topology is static.

And by definition, there is the invariant of #executors <= #tasks.

一个topology的task个数是固定的,但是executor数(线程数)是可以动态改变的。默认的,executor数 <= tasks数

So one reason for having 2+ tasks per executor thread is to give you the flexibility to expand/scale up the topology through the storm rebalance command in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. Here you could opt for running the topology at the anticipated parallelism level of 25 machines already on the 15 initial boxes (which is, of course, slower than 25 boxes). Once the additional 10 boxes are integrated you can then storm rebalancethe topology to make full use of all 25 boxes without any downtime.

Another reason to run 2+ tasks per executor is for (primarily functional) testing. For instance, if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside all the other stuff running on the machine, you can still run 30 tasks (here: 15 per executor) to see whether code such as your custom Storm grouping is working as expected.

一个executor运行2+task数的情况通常有:

  • 为了给topology运行提供多大的灵活度,在运行中可以扩展并发度
  • 为了功能测试

In practice we normally we run 1 task per executor.

PS: Note that Storm will actually spawn a few more threads behind the scenes. For instance, each executor has its own "send thread" that is responsible for handling outgoing tuples. There are also "system-level" background threads for e.g. acking tuples that run alongside "your" threads. IIRC the Storm UI counts those acking threads in addition to "your" threads.

实际上我们通常是 executors数 = task数

Storm如何partition

我们知道 A stream grouping defines how that stream should be partitioned among the bolt's tasks. 简单理解,partition就是定义数据如何分给多个每个task。Storm给出了spout/bolt之间的数据partition算法

  1. Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
  2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
  3. All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
  4. Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
  5. None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
  6. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the [emitDirect](javadocs/backtype/storm/task/OutputCollector.html#emitDirect(int, int, java.util.List) methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method inOutputCollector (which returns the task ids that the tuple was sent to).
  7. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.

然而,如何对进入Spout的数据做partition呢?这个问题storm并没有回答。有几种思路

  1. 用户自定义partition的方式,比如基于某个字段进行shuffle

  2. 干脆不考虑partition,spout的并发设置为1,task个数也为1

  3. 还有一种是spout有并发度,多线程抢占式的消费数据,这时候可能需要有一个保持一致性状态的锁。

Reference

http://stackoverflow.com/questions/17257448/what-is-the-task-in-storm-parallelism

http://www.cnblogs.com/yufengof/p/storm-worker-executor-task.html

http://storm.apache.org/releases/0.9.6/Understanding-the-parallelism-of-a-Storm-topology.html

[Storm] 并发度的理解的更多相关文章

  1. Storm并发度和Grouping方式

    Storm并发度和Grouping方式 .note-content {font-family: "Helvetica Neue",Arial,"Hiragino Sans ...

  2. 关于Storm 中Topology的并发度的理解

    来自:https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html htt ...

  3. storm并发度理解

    1. 核心原理 一个运行中的拓扑是由什么组成的:worker进程,executors和tasks.Storm是按照下面3种主要的部分来区分Storm集群中一个实际运行的拓扑的:Worker进程.Exe ...

  4. storm基础系列之一----storm并发度概念剖析

    前言: 学了几天storm的基础,发现如果有hadoop基础,再理解起概念来,容易的多.不过,涉及到一些独有的东西,如调度,如并发度,还是很麻烦.那么,从这一篇开始,力争清晰的梳理这些知识. 在正式学 ...

  5. storm源码之理解Storm中Worker、Executor、Task关系 + 并发度详解

    本文导读: 1 Worker.Executor.task详解 2 配置拓扑的并发度 3 拓扑示例 4 动态配置拓扑并发度 Worker.Executor.Task详解: Storm在集群上运行一个To ...

  6. 用实例的方式去理解storm的并发度

    什么是storm的并发度 一个topology(拓扑)在storm集群上最总是以executor和task的形式运行在suppervisor管理的worker节点上.而worker进程都是运行在jvm ...

  7. 理解Storm并发

    作者:Jack47 PS:如果喜欢我写的文章,欢迎关注我的微信公众账号程序员杰克,两边的文章会同步,也可以添加我的RSS订阅源. 注:本文主要内容翻译自understanding-the-parall ...

  8. Storm基本概念以及Topology的并发度

    Spouts,流的源头 Spout是Storm里面特有的名词,Stream的源头,通常是从外部数据源读取tuples,并emit到topology Spout可以同时emit多个tupic strea ...

  9. Storm并发机制详解

    本文可作为 <<Storm-分布式实时计算模式>>一书1.4节的读书笔记 在Storm中,一个task就可以理解为在集群中某个节点上运行的一个spout或者bolt实例. 记住 ...

随机推荐

  1. linux下解压.tar.gz .tar.bz2

     从网络上下载到的源码包, 最常见的是 .tar.gz 包, 还有一部分是 .tar.bz2包要解压很简单 :.tar.gz     格式解压命令为          tar   -zxvpf   x ...

  2. ascii、unicode、utf、gb等编码详解

    很久很久以前,有一群人,他们决定用8个可以开合的晶体管来组合成不同的状态,以表示世界上的万物.他们看到8个开关状态是好的,于是他们把这称为"字节".再后来,他们又做了一些可以处理这 ...

  3. 【原】移动web资源整理

    2013年初接触移动端,简单做下总结,首先了解下移动web带来的问题 设备更新换代快——低端机遗留下问题.高端机带来新挑战 浏览器厂商不统一——兼容问题多 网络更复杂——弱网络,页面打开慢 低端机性能 ...

  4. bug注意事项记录

    在此记录开发中需要注意的点: UI开发中注意: 1.多按钮同时点击的问题: 2.按钮连续点击的问题(按钮冷却) 3.刷新时注意数据可变性:拆分可变和不变的数据,确保只刷新可变的数据 非UI注意问题: ...

  5. oracle add_months函数

    oracle add_months函数 add_months 函数主要是对日期函数进行操作,举例子进行说明 add_months 有两个参数,第一个参数是日期,第二个参数是对日期进行加减的数字(以月为 ...

  6. Linux下的ctrl常用组合键

    在linux的命令模式下使用ctrl组合键能让操作更便捷. ctrl + k -- 剪切光标及其后边的内容: ctrl + u -- 剪切光标之前的内容: ctrl + y -- 在光标处粘贴上两个命 ...

  7. [LeetCode] Find Right Interval 找右区间

    Given a set of intervals, for each of the interval i, check if there exists an interval j whose star ...

  8. [LeetCode] Copy List with Random Pointer 拷贝带有随机指针的链表

    A linked list is given such that each node contains an additional random pointer which could point t ...

  9. 千万级高并发负载均衡软件HAproxy

    1负载均衡产品介绍 基于硬件的负载均衡设备例如F5,Big-IP,基于软件的负载均衡产品HAproxy,LVS,nginx在这些软件产品中,又分为基于操作系统的软负载实现和基于第三方应用的软负载实现. ...

  10. 冰冻三尺非一日之寒--web框架Django(三)

      第二十章: django(三,多对多)   1.Django请求的生命周期         路由系统 -> 视图函数(获取模板+数据-->渲染) -> 字符串返回给用户   2. ...