超好资料:

英文:https://github.com/xetorthio/getting-started-with-storm/blob/master/ch03Topologies.asc

中文:http://ifeve.com/getting-started-with-storm-3/

下面具体讲下:storm的几种groupping 策略的例子

Storm Grouping

  1. shuffleGrouping

    将流分组定义为混排。这种混排分组意味着来自Spout的输入将混排,或随机分发给此Bolt中的任务。shuffle grouping对各个task的tuple分配的比较均匀。

  2. fieldsGrouping

    这种grouping机制保证相同field值的tuple会去同一个task,这对于WordCount来说非常关键,如果同一个单词不去同一个task,那么统计出来的单词次数就不对了。

  3. All grouping

    广播发送, 对于每一个tuple将会复制到每一个bolt中处理。

  4. Global grouping

    Stream中的所有的tuple都会发送给同一个bolt任务处理,所有的tuple将会发送给拥有最小task_id的bolt任务处理。

  5. None grouping

    不关注并行处理负载均衡策略时使用该方式,目前等同于shuffle grouping,另外storm将会把bolt任务和他的上游提供数据的任务安排在同一个线程下。

  6. Direct grouping

    由tuple的发射单元直接决定tuple将发射给那个bolt,一般情况下是由接收tuple的bolt决定接收哪个bolt发射的Tuple。这是一种比较特别的分组方法,用这种分组意味着消息的发送者指定由消息接收者的哪个task处理这个消息。 只有被声明为Direct Stream的消息流可以声明这种分组方法。而且这种消息tuple必须使用emitDirect方法来发射。消息处理者可以通过TopologyContext来获取处理它的消息的taskid (OutputCollector.emit方法也会返回taskid)

In this chapter, we’ll see how to pass tuples between the different components of a Storm topology, and how to deploy a topology into a running Storm cluster

Stream Grouping

One of the most important things that we need to do when designing a topology is to define how data is exchanged between components (how streams are consumed by the bolts). A Stream Grouping specifies which stream(s) are consumed by each bolt and how the stream will be consumed.

Tip
A node can emit more than one stream of data. A stream grouping allows us to choose which stream to receive.

The stream grouping is set when the topology is defined, as we saw in chapter 2, Getting Started:

....
builder.setBolt("word-normalizer", new WordNormalizer())
.shuffleGrouping("word-reader");
....

Here a bolt is set on the topology builder, and then a source is set using the shuffle stream grouping. A stream grouping normally takes the source component id as a parameter, and optionally other parameters as well, depending on the kind of stream grouping.

Tip
There can be more than one source per InputDeclarer, and each source can be grouped with a different stream grouping.

Shuffle Grouping

Shuffle Grouping is the most commonly used grouping. It takes a single parameter (the source component), and sends each tuple, emitted by the source, to a randomly chosen bolt warranting that each consumer will receive the same number of tuples .

The shuffle grouping is useful for doing atomic operations. For example, a math operation. However if the operation can’t be randomically distributed, such as the example in chapter 2 where we needed to count words, we should considerate the use of other grouping.

Fields Grouping

Fields Grouping allows us to control how tuples are sent to bolts, based on one or more fields of the tuple. It guarantees that a given set of values, for a combination of fields, is always sent to the same bolt. Coming back to the word count example, if we group the stream by the word field, the word-normalizer bolt will always send tuples with a given word to the same instance of the word-counter bolt.

....
builder.setBolt("word-counter", new WordCounter(),2)
.fieldsGrouping("word-normalizer", new Fields("word"));
....
Tip
All fields set in the fields grouping must exist in the sources’s field declaration.

All Grouping

All Grouping sends a single copy of each tuple to all instances of the receiving bolt. This kind of grouping is used to sendsignals to bolts, for example if we need to refresh a cache we can send a refresh cache signal to all bolts. In the word-count example, we could use an all grouping to add the ability to clear the counter cache (see Topologies Example)

    public void execute(Tuple input) {
String str = null;
try{
if(input.getSourceStreamId().equals("signals")){
str = input.getStringByField("action");
if("refreshCache".equals(str))
counters.clear();
}
}catch (IllegalArgumentException e) {
//Do nothing
}
....
}

We’ve added an if to check the stream source. Storm give us the posibility to declare named streams (if we don’t send a tuple to a named stream the stream is "default") it’s an excelent way to identify the source of the tuples like this case where we want to identify the signals

In the topology definition, we add a second stream to the word-counter bolt that sends each tuple from the signals-spout stream to all instances of the bolt.

builder.setBolt("word-counter", new WordCounter(),2)
.fieldsGrouping("word-normalizer", new Fields("word"))
.allGrouping("signals-spout","signals");

The implementation of signals-spout can be found at git repository.

Custom Grouping

We can create our own custom stream grouping by implementing the backtype.storm.grouping.CustomStreamGroupinginterface. This gives us the power to decide which bolt(s) will receive each tuple.

Let’s modify the word count example, to group tuples so that all words that start with the same letter will be received by the same bolt.

public class ModuleGrouping implements CustomStreamGrouping, Serializable{

    int numTasks = 0;

    @Override
public List<Integer> chooseTasks(List<Object> values) {
List<Integer> boltIds = new ArrayList();
if(values.size()>0){
String str = values.get(0).toString();
if(str.isEmpty())
boltIds.add(0);
else
boltIds.add(str.charAt(0) % numTasks);
}
return boltIds;
} @Override
public void prepare(TopologyContext context, Fields outFields,
List<Integer> targetTasks) {
numTasks = targetTasks.size();
}
}

Here we can see a simple implementation of CustomStreamGrouping, where we use the amount of tasks to take the modulus of the integer value of the first character of the word, thus selecting which bolt will receive the tuple.

To use this grouping in our example we should change the word-normalizer grouping by the next:

       builder.setBolt("word-normalizer", new WordNormalizer())
.customGrouping("word-reader", new ModuleGrouping());

Direct Grouping

This is a special grouping where the source decides which component will receive the tuple. Similarly to the previous example, the source will decide which bolt receives the tuple based on the first letter of the word. To use direct grouping, in the WordNormalizer bolt we use the emitDirect method instead of emit.

    public void execute(Tuple input) {
....
for(String word : words){
if(!word.isEmpty()){
....
collector.emitDirect(getWordCountIndex(word),new Values(word));
}
}
// Acknowledge the tuple
collector.ack(input);
} public Integer getWordCountIndex(String word) {
word = word.trim().toUpperCase();
if(word.isEmpty())
return 0;
else
return word.charAt(0) % numCounterTasks;
}

We work out the number of target tasks in the prepare method:

    public void prepare(Map stormConf, TopologyContext context,
OutputCollector collector) {
this.collector = collector;
this.numCounterTasks = context.getComponentTasks("word-counter");
}

And in the topology definition, we specify that the stream will be grouped directly:

    builder.setBolt("word-counter", new WordCounter(),2)
.directGrouping("word-normalizer");

Global grouping

Global Grouping sends tuples generated by all instances of the source to a single target instance (specifically, the task with lowest id).

None grouping

At the time of writing (storm version 0.7.1), using this grouping is the same as using Shuffle Grouping. In other words, when using this grouping, we don’t care how streams are grouped

storm学习-storm入门的更多相关文章

  1. storm学习之入门篇(一)

    海量数据处理使用的大多是鼎鼎大名的hadoop或者hive,作为一个批处理系统,hadoop以其吞吐量大.自动容错等优点,在海量数据处理上得到了广泛的使用.但是,hadoop不擅长实时计算,因为它天然 ...

  2. storm学习之入门篇(二)

    Strom的简单实现 Spout的实现 对文件的改变进行分开的监听,并监视目录下有无新日志文件添加. 在数据得到了字段的说明后,将其转换成tuple. 声明Spout和Bolt之间的分组,并决定tup ...

  3. storm学习路线指南

    关于对storm的介绍已经有很多了,我这里不做过多的介绍,我介绍一下我自己的学习路线,希望能帮助一些造轮子的同学走一些捷径,毕竟我也是站在前人总结整理的基础上学习了,如果有不足之处,还请大家不要喷我. ...

  4. 【原】Storm学习资料推荐

    4.Storm学习资料推荐 书籍: 英文: Learning Storm: Ankit Jain, Anand Nalya: 9781783981328: Amazon.com: Books Gett ...

  5. Storm学习笔记 - 消息容错机制

    Storm学习笔记 - 消息容错机制 文章来自「随笔」 http://jsynk.cn/blog/articles/153.html 1. Storm消息容错机制概念 一个提供了可靠的处理机制的spo ...

  6. Storm学习笔记 - Storm初识

    Storm学习笔记 - Storm初识 1. Strom是什么? Storm是一个开源免费的分布式计算框架,可以实时处理大量的数据流. 2. Storm的特点 高性能,低延迟. 分布式:可解决数据量大 ...

  7. 大数据-storm学习资料视频

    storm学习资料视频 https://pan.baidu.com/s/18iQPoVFNHF1NCRBhXsMcWQ

  8. Storm 学习之路(二)—— Storm核心概念详解

    一.Storm核心概念 1.1 Topologies(拓扑) 一个完整的Storm流处理程序被称为Storm topology(拓扑).它是一个是由Spouts 和Bolts通过Stream连接起来的 ...

  9. storm学习初步

    本文根据自己的了解,对学习storm所需的一些知识进行汇总,以备之后详细了解. maven工具 参考书目 Maven权威指南 官方文档 Vagrant 分布式开发环境 博客 storm 参考书目 Ge ...

随机推荐

  1. POJ 3187 杨辉三角+枚举排列 好题

    如果给出一个由1~n组成的序列,我们可以每相邻2个数求和,得到一个新的序列,不断重复,最后得到一个数sum, 现在输入n,sum,要求输出一个这样的排列,如果有多种情况,输出字典序最小的那一个. 刚开 ...

  2. Zend Guard Run-time support missing问题的解决

    Zend Guard不仅可以实现对PHP应用的脚本进行加密保护和对PHP应用的产品进行商业许可证管理,还可以为许多软件生产商.IT服务提供商提供完善的加密和安全的产品发布系统. 虽然现在可以成功加密p ...

  3. js实现的新闻列表垂直滚动实现详解

    js实现的新闻列表垂直滚动实现详解:新闻列表垂直滚动效果在大量的网站都有应用,有点自然是不言而喻的,首先由于网页的空间有限,使用滚动代码可以使用最小的空间提供更多的信息量,还有让网页有了动态的效果,更 ...

  4. GCD信号量并发控制

    /** *  当我们在处理一系列线程的时候,当数量达到一定量,在以前我们可能会选择使用NSOperationQueue来处理并发控制,但如何在GCD中快速的控制并发呢?答案就是dispatch_sem ...

  5. AutoResetEvent和ManualResetEvent理解 z

    AutoResetEvent和ManualResetEvent用于多线程之间代码执行顺序的控制,它们继承自WaitHandle,API相同,但在使用中还是有区别的. 每次使用时虽然理解了,但由于没有去 ...

  6. 如何为PHP贡献代码

    PHP在之前把源代码迁移到了git下管理, 同时也在github(https://github.com/php/php-src)上做了镜像, 这样一来, 就方便了更多的开发者为PHP来贡献代码. 今天 ...

  7. [ActionScript 3.0] AS3.0 马赛克效果

    var bmpd:BitmapData; var matrix:Matrix; var bmp:Bitmap; var size:Number = 5; /** * @author:Frost.Yen ...

  8. Javascript同源策略对context.getImageData的影响

    在本机测试HTML5 Canvas程序的时候,如果用context.drawImage()后再用context.getImageData()获取图片像素数据的时候会抛出错:SECURITY_ERR: ...

  9. C# 数据回滚

    public int GetExecteQuery(string strAddSql, string strUpdateSql, string strDelSql) { SqlConnection c ...

  10. Centos 7配置LAMP

    因为安装zabbix需要LAMP环境,特记录如下. LAMP指的Linux(操作系统).Apache HTTP 服务器,MySQL(有时也指MariaDB,数据库软件)和PHP(有时也是指Perl或P ...