Message Delivery Semantics
4.6 Message Delivery Semantics
Now that we understand a little about how producers and consumers work, let's discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:
- At most once—Messages may be lost but are never redelivered.
- At least once—Messages are never lost but may be redelivered.
- Exactly once—this is what people actually want, each message is delivered once and only once.
It's worth noting that this breaks down into two problems: the durability guarantees for publishing a message and the guarantees when consuming a message.
Many systems claim to provide "exactly once" delivery semantics, but it is important to read the fine print, most of these claims are misleading (i.e. they don't translate to the case where consumers or producers can fail, cases where there are multiple consumer processes, or cases where data written to disk can be lost).
Kafka's semantics are straight-forward. When publishing a message we have a notion of the message being "committed" to the log. Once a published message is committed it will not be lost as long as one broker that replicates the partition to which this message was written remains "alive". The definition of committed message, alive partition as well as a description of which types of failures we attempt to handle will be described in more detail in the next section. For now let's assume a perfect, lossless broker and try to understand the guarantees to the producer and consumer. If a producer attempts to publish a message and experiences a network error it cannot be sure if this error happened before or after the message was committed. This is similar to the semantics of inserting into a database table with an autogenerated key.
Prior to 0.11.0.0, if a producer failed to receive a response indicating that a message was committed, it had little choice but to resend the message. This provides at-least-once delivery semantics since the message may be written to the log again during resending if the original request had in fact succeeded. Since 0.11.0.0, the Kafka producer also supports an idempotent delivery option which guarantees that resending will not result in duplicate entries in the log. To achieve this, the broker assigns each producer an ID and deduplicates messages using a sequence number that is sent by the producer along with every message. Also beginning with 0.11.0.0, the producer supports the ability to send messages to multiple topic partitions using transaction-like semantics: i.e. either all messages are successfully written or none of them are. The main use case for this is exactly-once processing between Kafka topics (described below).
Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait on the message being committed this can take on the order of 10 ms. However the producer can also specify that it wants to perform the send completely asynchronously or that it wants to wait only until the leader (but not necessarily the followers) have the message.
Now let's describe the semantics from the point-of-view of the consumer. All replicas have the exact same log with the same offsets. The consumer controls its position in this log. If the consumer never crashed it could just store this position in memory, but if the consumer fails and we want this topic partition to be taken over by another process the new process will need to choose an appropriate position from which to start processing. Let's say the consumer reads some messages -- it has several options for processing the messages and updating its position.
- It can read the messages, then save its position in the log, and finally process the messages. In this case there is a possibility that the consumer process crashes after saving its position but before saving the output of its message processing. In this case the process that took over processing would start at the saved position even though a few messages prior to that position had not been processed. This corresponds to "at-most-once" semantics as in the case of a consumer failure messages may not be processed.
- It can read the messages, process the messages, and finally save its position. In this case there is a possibility that the consumer process crashes after processing messages but before saving its position. In this case when the new process takes over the first few messages it receives will already have been processed. This corresponds to the "at-least-once" semantics in the case of consumer failure. In many cases messages have a primary key and so the updates are idempotent (receiving the same message twice just overwrites a record with another copy of itself).
So what about exactly once semantics (i.e. the thing you actually want)? When consuming from a Kafka topic and producing to another topic (as in a Kafka Streams application), we can leverage the new transactional producer capabilities in 0.11.0.0 that were mentioned above. The consumer's position is stored as a message in a topic, so we can write the offset to Kafka in the same transaction as the output topics receiving the processed data. If the transaction is aborted, the consumer's position will revert to its old value and the produced data on the output topics will not be visible to other consumers, depending on their "isolation level." In the default "read_uncommitted" isolation level, all messages are visible to consumers even if they were part of an aborted transaction, but in "read_committed," the consumer will only return messages from transactions which were committed (and any messages which were not part of a transaction).
When writing to an external system, the limitation is in the need to coordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage of the consumer position and the storage of the consumers output. But this can be handled more simply and generally by letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, consider a Kafka Connect connector which populates data in HDFS along with the offsets of the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
So effectively Kafka supports exactly-once delivery in Kafka Streams, and the transactional producer/consumer can be used generally to provide exactly-once delivery when transfering and processing data between Kafka topics. Exactly-once delivery for other destination systems generally requires cooperation with such systems, but Kafka provides the offset which makes implementing this feasible (see also Kafka Connect). Otherwise, Kafka guarantees at-least-once delivery by default, and allows the user to implement at-most-once delivery by disabling retries on the producer and committing offsets in the consumer prior to processing a batch of messages.
总之,Kafka 默认保证 At least once,并且允许通过设置 producer 异步提交来实现 At most once(见文章《kafka consumer防止数据丢失》)。而 Exactly once 要求与外部存储系统协作,幸运的是 kafka 提供的 offset 可以非常直接非常容易得使用这种方式。
Message Delivery Semantics的更多相关文章
- Kafka消息delivery可靠性保证(Message Delivery Semantics)
原文见:http://kafka.apache.org/documentation.html#semantics kafka在生产者和消费者之间的传输是如何保证的,我们可以知道有这么几种可能提供的de ...
- Apache Kafka(八)- Kafka Delivery Semantics for Consumers
Kafka Delivery Semantics 在Kafka Consumer中,有3种delivery semantics,分别为:至多一次(at most once).至少一次(at least ...
- kafka学习笔记:知识点整理
一.为什么需要消息系统 1.解耦: 允许你独立的扩展或修改两边的处理过程,只要确保它们遵守同样的接口约束. 2.冗余: 消息队列把数据进行持久化直到它们已经被完全处理,通过这一方式规避了数据丢失风险. ...
- Kafka基本原理
简介 Apache Kafka是分布式发布-订阅消息系统.它最初由LinkedIn公司开发,之后成为Apache项目的一部分.Kafka是一种快速.可扩展的.设计内在就是分布式的,分区的和可复制的提交 ...
- Flume官方文档翻译——Flume 1.7.0 User Guide (unreleased version)(一)
Flume 1.7.0 User Guide Introduction(简介) Overview(综述) System Requirements(系统需求) Architecture(架构) Data ...
- kafka producer源码
producer接口: /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor l ...
- Kafka 0.10.0
2.1 Producer API We encourage all new development to use the new Java producer. This client is produ ...
- Apache kafka原理与特性(0.8V)
前言: kafka是一个轻量级的/分布式的/具备replication能力的日志采集组件,通常被集成到应用系统中,收集"用户行为日志"等,并可以使用各种消费终端(consumer) ...
- kafka概念
一.结构与概念解释 1.基础概念 topics: kafka通过topics维护各类信息. producer:发布消息到Kafka topic的进程. consumer:订阅kafka topic进程 ...
随机推荐
- PHP:微信小程序调用【统一下单】【微信支付】【支付回调】API;XML转Array,Array转XML方法(通用)
1.微信公众号.微信小程序开发过程中,第三方服务器与微信服务器数据交互,需要进行数据转换,必须用到这两个函数: 分别是xml_to_array.array_to_xml ; /** * 输出xml字符 ...
- StringBuilder.AppendFormat(String, Object, Object) 方法
将通过处理复合格式字符串(包含零个或零个以上格式项)返回的字符串追加到此实例. 每个格式项都替换为这两个参数中任意一个参数的字符串表示形式. 说明: public StringBuilder Appe ...
- HDUOJ----2647Reward
Reward Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others)Total Subm ...
- 阿里云ECS服务器Linux环境下配置php服务器(二)--phpMyAdmin篇
上一篇讲了PHP服务器的基本配置,我们安装了apache,php,还有MySQL,最后还跑通了一个非常简单的php页面,有兴趣的朋友可以看我的这篇博客: 阿里云ECS服务器Linux环境下配置php服 ...
- Redis基本操作——List
Redis基本操作——List(原理篇) 学习过数据结构的同学,一定对链表(Linked List)十分的熟悉.相信我们自己也曾经使用过这种数据结构. 链表分为很多种:单向链表,双向链表,循环链表,块 ...
- 使用 HTML5 History 新特性增强 Ajax 的体验(转)
一. 场景再现 如大家熟知,Ajax 可以实现页面的无刷新操作,但会造成两个与普通页面操作(有刷新地改变页面)有着明显差别的问题—— URL 没有修改以及无法使用前进.后退按钮.例如常见的 Ajax ...
- 【js】正则表达式(I)
正则表达式是由英文词语regular expression翻译过来的,就是符合某种规则的表达式.正则表达式在软件开发中应用非常广泛,例如,找出网页中的超链接,找出网页中的email地址,找出网页中的手 ...
- 深度解析(一六)Floyd算法
Floyd算法(一)之 C语言详解 本章介绍弗洛伊德算法.和以往一样,本文会先对弗洛伊德算法的理论论知识进行介绍,然后给出C语言的实现.后续再分别给出C++和Java版本的实现. 目录 1. 弗洛伊德 ...
- Python modf() 函数
描述 modf() 方法返回x的整数部分与小数部分,两部分的数值符号与x相同,整数部分以浮点型表示. 语法 以下是 modf() 方法的语法: import math math.modf( x ) 注 ...
- mysqld_safe与mysqld区别详解
mysqld_safe与mysqld区别,直接运行mysqld程序来启动MySQL服务的方法很少见,mysqld_safe脚本会在启动MySQL服务器后继续监控其运行情况,并在其死机时重新启动它. 用 ...