(6) storm-kafka source code walkthrough: PartitionManager

PartitionManager is arguably the core class of storm-kafka, so let's walk through it briefly. As before, the metrics code is not covered here.

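PartitionManager is mainly responsible for emitting messages and handling failures, so it maintains three collections:

_pending: the offsets of messages that have been queued for emission but not yet acked, a TreeSet<Long>()
failed: the offsets of messages whose tuples failed, a TreeSet<Long>()
_waitingToEmit: the messages waiting to be emitted, a LinkedList. Every message read from Kafka is placed here first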

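A few other members are worth mentioning:

Long _emittedToOffset; the offset read up to from Kafka. Messages read from Kafka are put into the _waitingToEmit list, and anything in that list is assumed to be emitted eventually, so _emittedToOffset can be treated as the offset read up to from Kafka
Long _committedTo; the offset that has been written to ZooKeeper
lastCompletedOffset() the offset up to which messages have been processed successfully. Messages are processed inside Storm and may fail there, so the offsets still being processed are cached in _pending

If _pending is empty, lastCompletedOffset = _emittedToOffset

If _pending is not empty, lastCompletedOffset is the first (smallest) offset in _pending, because that offset, and possibly later ones, are still waiting for an ack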

public long lastCompletedOffset() {

        if (_pending.isEmpty()) {

            return _emittedToOffset;

        } else {

            return _pending.first();

        }

    }
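The relationship between _pending, _emittedToOffset and lastCompletedOffset() can be tried out in isolation. The following is only a minimal sketch: the class PendingOffsetsSketch and its emit/ack helpers are hypothetical and not part of storm-kafka, and it assumes that the next offset is always the current offset plus one.

```java
import java.util.TreeSet;

// Hypothetical stand-in for the offset bookkeeping described above.
public class PendingOffsetsSketch {
    private final TreeSet<Long> pending = new TreeSet<>();
    private long emittedToOffset;

    public PendingOffsetsSketch(long startOffset) {
        this.emittedToOffset = startOffset;
    }

    // Record that the message at `offset` was queued for emission.
    // Assumption: nextOffset == offset + 1.
    public void emit(long offset) {
        pending.add(offset);
        emittedToOffset = Math.max(offset + 1, emittedToOffset);
    }

    // Record that the tuple for `offset` was fully processed.
    public void ack(long offset) {
        pending.remove(offset);
    }

    // Same rule as in the walkthrough: smallest pending offset, or
    // emittedToOffset when nothing is pending.
    public long lastCompletedOffset() {
        return pending.isEmpty() ? emittedToOffset : pending.first();
    }

    public static void main(String[] args) {
        PendingOffsetsSketch s = new PendingOffsetsSketch(10);
        s.emit(10);
        s.emit(11);
        s.emit(12);
        s.ack(11);
        System.out.println(s.lastCompletedOffset()); // 10 is still pending
        s.ack(10);
        System.out.println(s.lastCompletedOffset()); // 12 is now the smallest pending offset
        s.ack(12);
        System.out.println(s.lastCompletedOffset()); // nothing pending, falls back to emittedToOffset
    }
}
```

Note that acking offset 11 before offset 10 does not move the completed offset forward; it only advances once the smallest pending offset is acked.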

Now the initialization:

     public PartitionManager(DynamicPartitionConnections connections, String topologyInstanceId, ZkState state, Map stormConf, SpoutConfig spoutConfig, Partition id) {

        _partition = id;

        _connections = connections;

        _spoutConfig = spoutConfig;

        _topologyInstanceId = topologyInstanceId;

        _consumer = connections.register(id.host, id.partition);

        _state = state;

        _stormConf = stormConf;

        numberAcked = numberFailed = 0;

        String jsonTopologyId = null;

        Long jsonOffset = null;

        String path = committedPath();

        try {

            Map<Object, Object> json = _state.readJSON(path); // read the offset from ZooKeeper

            LOG.info("Read partition information from: " + path +  "  --> " + json );

            if (json != null) {

                jsonTopologyId = (String) ((Map<Object, Object>) json.get("topology")).get("id");

                jsonOffset = (Long) json.get("offset");

            }

        } catch (Throwable e) {

            LOG.warn("Error reading and/or parsing at ZkNode: " + path, e);

        }

		/**

		 * Read the offset according to the user's startOffsetTime setting (-2: from the earliest offset in Kafka; -1: from the latest; 0: none, start from ZooKeeper)

		 **/

        Long currentOffset = KafkaUtils.getOffset(_consumer, spoutConfig.topic, id.partition, spoutConfig);

        if (jsonTopologyId == null || jsonOffset == null) { // failed to parse JSON?

            _committedTo = currentOffset;

            LOG.info("No partition information found, using configuration to determine offset");

        } else if (!topologyInstanceId.equals(jsonTopologyId) && spoutConfig.forceFromStart) {

            _committedTo = KafkaUtils.getOffset(_consumer, spoutConfig.topic, id.partition, spoutConfig.startOffsetTime);

            LOG.info("Topology change detected and reset from start forced, using configuration to determine offset");

        } else {

            _committedTo = jsonOffset;

            LOG.info("Read last commit offset from zookeeper: " + _committedTo + "; old topology_id: " + jsonTopologyId + " - new topology_id: " + topologyInstanceId );

        }

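		/**

		 * The if below handles the case where the offset just read differs from the one committed to ZooKeeper by more than maxOffsetBehind (Long.MAX_VALUE by default now). A large range of messages is then assumed to have been emitted without being committed, and that range is dropped to avoid replaying it

		 * by setting _committedTo = currentOffset. The default is a recent bug fix: maxOffsetBehind used to be 100000 (if I remember correctly), which was too small. The issue is STORM-399

		 **/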

        if (currentOffset - _committedTo > spoutConfig.maxOffsetBehind || _committedTo <= 0) {

            LOG.info("Last commit offset from zookeeper: " + _committedTo);

            _committedTo = currentOffset;

            LOG.info("Commit offset " + _committedTo + " is more than " +

                    spoutConfig.maxOffsetBehind + " behind, resetting to startOffsetTime=" + spoutConfig.startOffsetTime);

        }

        LOG.info("Starting Kafka " + _consumer.host() + ":" + id.partition + " from offset " + _committedTo);

        _emittedToOffset = _committedTo;

        _fetchAPILatencyMax = new CombinedMetric(new MaxMetric());

        _fetchAPILatencyMean = new ReducedMetric(new MeanReducer());

        _fetchAPICallCount = new CountMetric();

        _fetchAPIMessageCount = new CountMetric();

    }
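The branching in the constructor is easier to follow as a pure function. This is a sketch under stated assumptions, not the storm-kafka code: the class StartOffsetSketch and the method chooseCommittedTo are hypothetical, and the two KafkaUtils.getOffset calls are replaced by plain parameters (currentOffset and configuredStartOffset).

```java
// Hypothetical reduction of the constructor's offset selection to a pure function.
public class StartOffsetSketch {

    static long chooseCommittedTo(String zkTopologyId, Long zkOffset,
                                  String currentTopologyId, boolean forceFromStart,
                                  long currentOffset, long configuredStartOffset,
                                  long maxOffsetBehind) {
        long committedTo;
        if (zkTopologyId == null || zkOffset == null) {
            // Nothing usable in ZooKeeper: fall back to the configured offset.
            committedTo = currentOffset;
        } else if (!currentTopologyId.equals(zkTopologyId) && forceFromStart) {
            // A new topology instance was deployed and a reset was forced.
            committedTo = configuredStartOffset;
        } else {
            // Normal restart: resume from the offset stored in ZooKeeper.
            committedTo = zkOffset;
        }
        // Too far behind, or a non-positive stored offset: jump to currentOffset.
        if (currentOffset - committedTo > maxOffsetBehind || committedTo <= 0) {
            committedTo = currentOffset;
        }
        return committedTo;
    }

    public static void main(String[] args) {
        // Resume from ZooKeeper when the stored state is usable.
        System.out.println(chooseCommittedTo("t1", 400L, "t2", false, 500, 100, Long.MAX_VALUE));
        // With a small maxOffsetBehind the stored offset is discarded.
        System.out.println(chooseCommittedTo("t1", 400L, "t1", false, 500000, 0, 100000));
    }
}
```

The second call shows the behaviour that the STORM-399 comment refers to: with a limit of 100000, a stored offset of 400 is more than the limit behind 500000 and is replaced by currentOffset.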

At the start, messages have to be read and put into _waitingToEmit. This is the fill step; here is the code:

private void fill() {

        long start = System.nanoTime();

        long offset;

		// First check whether there are failed offsets. If so, messages are read again starting from that offset, so some messages may be re-emitted here

        final boolean had_failed = !failed.isEmpty(); 

        // Are there failed tuples? If so, fetch those first.

        if (had_failed) {

            offset = failed.first(); // take the smallest failed offset

        } else {

            offset = _emittedToOffset;

        }

        ByteBufferMessageSet msgs = KafkaUtils.fetchMessages(_spoutConfig, _consumer, _partition, offset);

        long end = System.nanoTime();

        long millis = (end - start) / 1000000;

        _fetchAPILatencyMax.update(millis);

        _fetchAPILatencyMean.update(millis);

        _fetchAPICallCount.incr();

        if (msgs != null) {

            int numMessages = 0;

            for (MessageAndOffset msg : msgs) {

                final Long cur_offset = msg.offset();

                if (cur_offset < offset) {

                    // Skip any old offsets.

                    continue;

                }

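				/**

				 * If nothing has failed, or the failed set contains this offset (there may be many failed messages; we only start reading from the smallest failed offset),

				 * put the message into the list of messages waiting to be emitted

				 **/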

                if (!had_failed || failed.contains(cur_offset)) {

                    numMessages += 1;

                    _pending.add(cur_offset);

                    _waitingToEmit.add(new MessageAndRealOffset(msg.message(), cur_offset));

                    _emittedToOffset = Math.max(msg.nextOffset(), _emittedToOffset);

                    if (had_failed) { // this offset is about to be re-emitted, so remove it from the failed set

                        failed.remove(cur_offset);

                    }

                }

            }

            _fetchAPIMessageCount.incrBy(numMessages);

        }

    }
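The replay filter inside fill() can be reduced to a small function over offsets. The sketch below is only an illustration: FillReplaySketch and selectForEmit are hypothetical names, and the Kafka fetch is replaced by a plain list of offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical reduction of fill(): which fetched offsets get queued for emission?
public class FillReplaySketch {

    static List<Long> selectForEmit(List<Long> fetchedOffsets, TreeSet<Long> failed,
                                    long emittedToOffset) {
        boolean hadFailed = !failed.isEmpty();
        // Start from the smallest failed offset if there is one.
        long start = hadFailed ? failed.first() : emittedToOffset;
        List<Long> toEmit = new ArrayList<>();
        for (long cur : fetchedOffsets) {
            if (cur < start) {
                continue; // skip old offsets
            }
            // Without failures everything is queued; with failures only failed offsets are.
            if (!hadFailed || failed.contains(cur)) {
                toEmit.add(cur);
                if (hadFailed) {
                    failed.remove(cur); // it is being re-emitted now
                }
            }
        }
        return toEmit;
    }

    public static void main(String[] args) {
        TreeSet<Long> failed = new TreeSet<>(Arrays.asList(3L, 5L));
        // The fetch starts at offset 3 and returns 3..6, but only 3 and 5 are replayed.
        System.out.println(selectForEmit(Arrays.asList(3L, 4L, 5L, 6L), failed, 7L));
        System.out.println(failed.isEmpty());
    }
}
```

Offsets 4 and 6 are fetched again but not re-emitted, which is the difference from the older rollback approach shown at the end of this post.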

=================== Latest code on GitHub ======================

The following line

ByteBufferMessageSet msgs = KafkaUtils.fetchMessages(_spoutConfig, _consumer, _partition, offset);

has been changed to:

ByteBufferMessageSet msgs = null;

        try {

            msgs = KafkaUtils.fetchMessages(_spoutConfig, _consumer, _partition, offset);

        } catch (UpdateOffsetException e) {

            _emittedToOffset = KafkaUtils.getOffset(_consumer, _spoutConfig.topic, _partition.partition, _spoutConfig);

            LOG.warn("Using new offset: {}", _emittedToOffset);

            // fetch failed, so don't update the metrics

            return;

        }

====================2014-11-10========================

The fetchMessages process

public static ByteBufferMessageSet fetchMessages(KafkaConfig config, SimpleConsumer consumer, Partition partition, long offset) {

        ByteBufferMessageSet msgs = null;

        String topic = config.topic;

        int partitionId = partition.partition;

        for (int errors = 0; errors < 2 && msgs == null; errors++) { // at most two attempts

            FetchRequestBuilder builder = new FetchRequestBuilder();

            FetchRequest fetchRequest = builder.addFetch(topic, partitionId, offset, config.fetchSizeBytes).

                    clientId(config.clientId).maxWait(config.fetchMaxWait).build();

            FetchResponse fetchResponse;

            try {

                fetchResponse = consumer.fetch(fetchRequest);

            } catch (Exception e) {

                if (e instanceof ConnectException ||

                        e instanceof SocketTimeoutException ||

                        e instanceof IOException ||

                        e instanceof UnresolvedAddressException

                        ) {

                    LOG.warn("Network error when fetching messages:", e);

                    throw new FailedFetchException(e);

                } else {

                    throw new RuntimeException(e);

                }

            }

            if (fetchResponse.hasError()) { // mainly handles the offset-out-of-range case: getOffset re-reads from earliest or latest

                KafkaError error = KafkaError.getError(fetchResponse.errorCode(topic, partitionId));

                if (error.equals(KafkaError.OFFSET_OUT_OF_RANGE) && config.useStartOffsetTimeIfOffsetOutOfRange && errors == 0) {

                    long startOffset = getOffset(consumer, topic, partitionId, config.startOffsetTime);

                    LOG.warn("Got fetch request with offset out of range: [" + offset + "]; " +

                            "retrying with default start offset time from configuration. " +

                            "configured start offset time: [" + config.startOffsetTime + "] offset: [" + startOffset + "]");

                    offset = startOffset;

                } else {

                    String message = "Error fetching data from [" + partition + "] for topic [" + topic + "]: [" + error + "]";

                    LOG.error(message);

                    throw new FailedFetchException(message);

                }

            } else {

                msgs = fetchResponse.messageSet(topic, partitionId);

            }

        }

        return msgs;

    }

====================== fetchMessages has since been changed on GitHub. The modified code is as follows ====================================

public static ByteBufferMessageSet fetchMessages(KafkaConfig config, SimpleConsumer consumer, Partition partition, long offset) throws UpdateOffsetException {

        ByteBufferMessageSet msgs = null;

        String topic = config.topic;

        int partitionId = partition.partition;

        FetchRequestBuilder builder = new FetchRequestBuilder();

        FetchRequest fetchRequest = builder.addFetch(topic, partitionId, offset, config.fetchSizeBytes).

                clientId(config.clientId).maxWait(config.fetchMaxWait).build();

        FetchResponse fetchResponse;

        try {

            fetchResponse = consumer.fetch(fetchRequest);

        } catch (Exception e) {

            if (e instanceof ConnectException ||

                    e instanceof SocketTimeoutException ||

                    e instanceof IOException ||

                    e instanceof UnresolvedAddressException

                    ) {

                LOG.warn("Network error when fetching messages:", e);

                throw new FailedFetchException(e);

            } else {

                throw new RuntimeException(e);

            }

        }

        if (fetchResponse.hasError()) {

            KafkaError error = KafkaError.getError(fetchResponse.errorCode(topic, partitionId));

            if (error.equals(KafkaError.OFFSET_OUT_OF_RANGE) && config.useStartOffsetTimeIfOffsetOutOfRange) {

                LOG.warn("Got fetch request with offset out of range: [" + offset + "]; " +

                        "retrying with default start offset time from configuration. " +

                        "configured start offset time: [" + config.startOffsetTime + "]");

                throw new UpdateOffsetException();

            } else {

                String message = "Error fetching data from [" + partition + "] for topic [" + topic + "]: [" + error + "]";

                LOG.error(message);

                throw new FailedFetchException(message);

            }

        } else {

            msgs = fetchResponse.messageSet(topic, partitionId);

        }

        return msgs;

    }

The retry loop has been removed. For the reason behind this change, see the STORM-511 issue, which describes it as follows:

With default behaviour (KafkaConfig.useStartOffsetTimeIfOffsetOutOfRange == true) when Kafka returns the error about offset being out of range, storm.kafka.KafkaUtils.fetchMessages tries to fix offset in local scope and retry fetch request. But if there are no more messages appeared under that specified partition it will never update the PartitionManager, but keep sending tons of requests with invalid offset to Kafka broker. On both sides Storm and Kafka logs grow extremely quick during that time.
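The problem described in the issue can be reproduced with a toy simulation. Everything in the sketch below is hypothetical (FakeBroker, fetchOld, fetchNew and the nested UpdateOffsetException are stand-ins, not storm-kafka or Kafka classes); it only counts how many out-of-range requests reach the broker when the caller's offset is stale and no new messages arrive.

```java
// Toy simulation of the STORM-511 scenario; not the real storm-kafka classes.
public class OffsetResetSketch {

    // Hypothetical stand-in for the exception used to signal an offset reset.
    static class UpdateOffsetException extends Exception {
    }

    // Fake broker that accepts offsets in [earliest, latest] and counts bad requests.
    static class FakeBroker {
        final long earliest;
        final long latest;
        int outOfRangeRequests = 0;

        FakeBroker(long earliest, long latest) {
            this.earliest = earliest;
            this.latest = latest;
        }

        boolean inRange(long offset) {
            boolean ok = offset >= earliest && offset <= latest;
            if (!ok) {
                outOfRangeRequests++;
            }
            return ok;
        }
    }

    // Old style: the offset is corrected locally, so the caller never learns about it.
    static void fetchOld(FakeBroker broker, long offset) {
        if (!broker.inRange(offset)) {
            offset = broker.latest; // only the local copy changes
            broker.inRange(offset); // retry; returns no messages in this scenario
        }
    }

    // New style: the error is propagated so the caller can fix its own offset.
    static void fetchNew(FakeBroker broker, long offset) throws UpdateOffsetException {
        if (!broker.inRange(offset)) {
            throw new UpdateOffsetException();
        }
    }

    static int simulateOld(int calls) {
        FakeBroker broker = new FakeBroker(100, 200);
        long emittedToOffset = 5; // stale offset held by the caller
        for (int i = 0; i < calls; i++) {
            fetchOld(broker, emittedToOffset);
        }
        return broker.outOfRangeRequests;
    }

    static int simulateNew(int calls) {
        FakeBroker broker = new FakeBroker(100, 200);
        long emittedToOffset = 5;
        for (int i = 0; i < calls; i++) {
            try {
                fetchNew(broker, emittedToOffset);
            } catch (UpdateOffsetException e) {
                emittedToOffset = broker.latest; // the caller updates its own state
            }
        }
        return broker.outOfRangeRequests;
    }

    public static void main(String[] args) {
        System.out.println(simulateOld(10)); // every call sends an invalid offset
        System.out.println(simulateNew(10)); // only the first call does
    }
}
```

In this simulation the old style sends one invalid request per call, while the new style sends a single invalid request and then continues from the corrected offset.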

========================================== Added 2014-11-10 =====================================================

Next, the next method. This is the method that KafkaSpout's nextTuple actually calls.

//returns false if it's reached the end of current batch

    public EmitState next(SpoutOutputCollector collector) {

        if (_waitingToEmit.isEmpty()) {

            fill(); // fetch messages when nothing is queued

        }

        while (true) {

            MessageAndRealOffset toEmit = _waitingToEmit.pollFirst(); // take one message at a time

            if (toEmit == null) {

                return EmitState.NO_EMITTED;

            }

			// If you have forgotten this, go back to the post on custom schemes: http://blog.csdn.net/wzhg0508/article/details/40874155

            Iterable<List<Object>> tups = KafkaUtils.generateTuples(_spoutConfig, toEmit.msg);

            if (tups != null) {

                for (List<Object> tup : tups) { // this was mentioned in the post on custom Scheme

                    collector.emit(tup, new KafkaMessageId(_partition, toEmit.offset));

                }

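                break; // after each message is emitted successfully, break and return the EmitState to KafkaSpout's nextTuple, which checks it and periodically commits the successfully processed offset to ZooKeeper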

            } else {

                ack(toEmit.offset); // ack does the cleanup

            }

        }

        if (!_waitingToEmit.isEmpty()) {

            return EmitState.EMITTED_MORE_LEFT;

        } else {

            return EmitState.EMITTED_END;

        }

    }

Now the ack operation:

public void ack(Long offset) {

        if (!_pending.isEmpty() && _pending.first() < offset - _spoutConfig.maxOffsetBehind) {

            // Too many things pending!

            _pending.headSet(offset - _spoutConfig.maxOffsetBehind).clear();

        }

        _pending.remove(offset);

        numberAcked++;

    }
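The pruning step in ack relies on TreeSet.headSet returning a live view, so clearing the view removes the entries from the underlying set. A minimal sketch, with the hypothetical class AckPruneSketch standing in for the relevant part of PartitionManager:

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical reduction of ack(): prune very old pending offsets, then remove the acked one.
public class AckPruneSketch {

    static void ack(TreeSet<Long> pending, long offset, long maxOffsetBehind) {
        if (!pending.isEmpty() && pending.first() < offset - maxOffsetBehind) {
            // headSet(x) is a view of all elements strictly less than x;
            // clearing it drops them from `pending` as well.
            pending.headSet(offset - maxOffsetBehind).clear();
        }
        pending.remove(offset);
    }

    public static void main(String[] args) {
        TreeSet<Long> pending = new TreeSet<>(Arrays.asList(1L, 2L, 50L, 100L));
        ack(pending, 100L, 60L);
        System.out.println(pending); // 1 and 2 are more than 60 behind 100 and are dropped
    }
}
```

With maxOffsetBehind set to Long.MAX_VALUE the condition never holds for non-negative offsets, so no pending offset is dropped this way.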

Next, the commit operation, which is also called from KafkaSpout:

public void commit() {

        long lastCompletedOffset = lastCompletedOffset();

        if (_committedTo != lastCompletedOffset) {

            LOG.debug("Writing last completed offset (" + lastCompletedOffset + ") to ZK for " + _partition + " for topology: " + _topologyInstanceId);

            Map<Object, Object> data = (Map<Object, Object>) ImmutableMap.builder()

                    .put("topology", ImmutableMap.of("id", _topologyInstanceId,

                            "name", _stormConf.get(Config.TOPOLOGY_NAME)))

                    .put("offset", lastCompletedOffset)

                    .put("partition", _partition.partition)

                    .put("broker", ImmutableMap.of("host", _partition.host.host,

                            "port", _partition.host.port))

                    .put("topic", _spoutConfig.topic).build();

            _state.writeJSON(committedPath(), data);

            _committedTo = lastCompletedOffset;

            LOG.debug("Wrote last completed offset (" + lastCompletedOffset + ") to ZK for " + _partition + " for topology: " + _topologyInstanceId);

        } else {

            LOG.debug("No new offset for " + _partition + " for topology: " + _topologyInstanceId);

        }

    }

commit is straightforward and needs no further explanation.

Now for the important part, fail handling:

public void fail(Long offset) {

        if (offset < _emittedToOffset - _spoutConfig.maxOffsetBehind) {

            LOG.info(

                    "Skipping failed tuple at offset=" + offset +

                            " because it's more than maxOffsetBehind=" + _spoutConfig.maxOffsetBehind +

                            " behind _emittedToOffset=" + _emittedToOffset

            );

        } else {

            LOG.debug("failing at offset=" + offset + " with _pending.size()=" + _pending.size() + " pending and _emittedToOffset=" + _emittedToOffset);

            failed.add(offset);

            numberFailed++;

            if (numberAcked == 0 && numberFailed > _spoutConfig.maxOffsetBehind) {

                throw new RuntimeException("Too many tuple failures");

            }

        }

    }

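The earlier storm-kafka-0.8-plus version looked like this (quoted from a storm-kafka-0.8-plus source code analysis):

First, the author does not cache messages, only offsets.

So on a fail the message cannot be replayed directly. The author's comments say this was done to avoid blowing up memory.

Instead, when an offset fails, _emittedToOffset is rolled back to the failed offset.

The next fetch from Kafka then starts reading at _emittedToOffset. The advantage is that replay relies on Kafka; the drawback is that messages get duplicated.

So before using it, be sure to consider whether duplicates are acceptable.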

public void fail(Long offset) {

        //TODO: should it use in-memory ack set to skip anything that's been acked but not committed???

        // things might get crazy with lots of timeouts

        if (_emittedToOffset > offset) {

            _emittedToOffset = offset;

            _pending.tailSet(offset).clear();

        }

    }
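To make the comparison concrete, here is a minimal sketch of both strategies on plain offset sets. FailStrategySketch, failOld and failNew are hypothetical names, and the maxOffsetBehind check and the failure counters of the newer version are left out.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical side-by-side of the two fail strategies.
public class FailStrategySketch {

    // Older version: roll _emittedToOffset back and forget everything pending from there on.
    // Returns the new emittedToOffset.
    static long failOld(TreeSet<Long> pending, long emittedToOffset, long offset) {
        if (emittedToOffset > offset) {
            pending.tailSet(offset).clear(); // tailSet(x) is a view of elements >= x
            return offset;
        }
        return emittedToOffset;
    }

    // Newer version: only remember the failed offset; fill() replays just that one.
    static void failNew(TreeSet<Long> failed, long offset) {
        failed.add(offset);
    }

    public static void main(String[] args) {
        TreeSet<Long> pending = new TreeSet<>(Arrays.asList(10L, 11L, 12L, 13L));
        long emittedTo = failOld(pending, 14L, 11L);
        // Old: 11, 12 and 13 will all be fetched and emitted again.
        System.out.println(emittedTo + " " + pending);

        TreeSet<Long> failed = new TreeSet<>();
        failNew(failed, 11L);
        // New: only 11 is scheduled for replay.
        System.out.println(failed);
    }
}
```

In the old strategy offsets 12 and 13 are emitted a second time even if their tuples succeed, which is the duplication mentioned above.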

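Compare the two versions yourself.

I will keep updating the parts above that were not explained clearly; thanks for bearing with me.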

reference

http://www.cnblogs.com/fxjwind/p/3808346.html
