在生产中需要将一些数据发到kafka,而且需要做到EXACTLY_ONCE,kafka使用的版本为1.1.0,flink的版本为1.8.0,但是会很经常因为提交事务引起错误,甚至导致任务重启

kafka producer的配置如下

  def getKafkaProducer(kafkaAddr: String, targetTopicName: String, kafkaProducersPoolSize: Int): FlinkKafkaProducer[String] = {
val properties = new Properties()
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaAddr)
properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 6000 * 6 + "")
// 设置了retries参数,可以在Kafka的Partition发生leader切换时,Flink不重启,而是做5次尝试:
properties.setProperty(ProducerConfig.RETRIES_CONFIG, "5")
properties.setProperty(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, String.valueOf(1048576 * 5))
val serial = new KeyedSerializationSchemaWrapper(new SimpleStringSchema())
//val producer = new FlinkKafkaProducer011[String](targetTopicName, serial, properties, Optional.of(new KafkaProducerPartitioner[String]()), Semantic.EXACTLY_ONCE, kafkaProducersPoolSize)
val producer = new FlinkKafkaProducer[String](targetTopicName, serial, properties, Optional.of(new KafkaProducerPartitioner[String]()), FlinkKafkaProducer.Semantic.EXACTLY_ONCE, kafkaProducersPoolSize)
producer.setWriteTimestampToKafka(true)
producer
}

Flink env如下

    val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(60 * 1000 * 1, CheckpointingMode.EXACTLY_ONCE)
val config = env.getCheckpointConfig
//RETAIN_ON_CANCELLATION在job canceled的时候会保留externalized checkpoint state
config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
//用于指定checkpoint coordinator上一个checkpoint完成之后最小等多久可以出发另一个checkpoint,当指定这个参数时,maxConcurrentCheckpoints的值为1
config.setMinPauseBetweenCheckpoints(3000)
//用于指定运行中的checkpoint最多可以有多少个,如果有设置了minPauseBetweenCheckpoints,则maxConcurrentCheckpoints这个参数就不起作用了(大于1的值不起作用)
config.setMaxConcurrentCheckpoints(1)
//指定checkpoint执行的超时时间(单位milliseconds),超时没完成就会被abort掉
config.setCheckpointTimeout(30000)
//用于指定在checkpoint发生异常的时候,是否应该fail该task,默认为true,如果设置为false,则task会拒绝checkpoint然后继续运行
//https://issues.apache.org/jira/browse/FLINK-11662
config.setFailOnCheckpointingErrors(false)

然后经常会出现事务失效的问题,报错有很多种,大概为以下

java.lang.RuntimeException: Error while confirming checkpoint
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1218)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.util.FlinkRuntimeException: Committing one of transactions failed, logging first encountered failure
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.notifyCheckpointComplete(TwoPhaseCommitSinkFunction.java:296)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.notifyCheckpointComplete(AbstractUdfStreamOperator.java:130)
at org.apache.flink.streaming.runtime.tasks.StreamTask.notifyCheckpointComplete(StreamTask.java:684)
at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1213)
... 5 more
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
org.apache.flink.streaming.connectors.kafka.FlinkKafkaException: Failed to send data to Kafka: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:1002)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:619)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:97)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:228)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
org.apache.kafka.common.KafkaException: Cannot perform send because at least one previous transactional or idempotent request has failed with errors.
at org.apache.kafka.clients.producer.internals.TransactionManager.failIfNotReadyForSend(TransactionManager.java:278)
at org.apache.kafka.clients.producer.internals.TransactionManager.maybeAddPartitionToTransaction(TransactionManager.java:263)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:804)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:760)
at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaInternalProducer.send(FlinkKafkaInternalProducer.java:105)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:650)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:97)
at org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction.invoke(TwoPhaseCommitSinkFunction.java:228)
at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:56)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.
Checkpoint failed: Failed to send data to Kafka: Producer attempted an operation with an old epoch. Either there is a newer producer with the same transactionalId, or the producer's transaction has been expired by the broker.

Checkpoint failed: Could not complete snapshot 11 for operator Sink: data_Sink (2/2).

这些错误基本涉及到两阶段提交、事务、checkpoint。

查看kafka documentation和研究ProducerConfig这个类后发现 kafka producer 在使用EXACTLY_ONCE的时候需要增加一些配置

the transaction timeout must be larger than the checkpoint interval, but smaller than the broker transaction.max.timeout.ms.

在getKafkaProducer增加以下配置后,出现原来的错误减少

    //checkpoint 间隔时间<TRANSACTION_TIMEOUT_CONFIG<kafka transaction.max.timeout.ms (默认900秒)
properties.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 1000 * 60 * 3 + "")
properties.setProperty(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1")
properties.setProperty(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")

问题得到缓解。

参考:

https://www.cnblogs.com/wangzhuxing/p/10111831.html

http://www.heartthinkdo.com/?p=2040

http://romanmarkunas.com/web/blog/kafka-transactions-in-practice-1-producer/

Flink生产数据到Kafka频繁出现事务失效导致任务重启的更多相关文章

  1. kafka没配置好,导致服务器重启之后,topic丢失,topic里面的消息也丢失

    转,原文:https://blog.csdn.net/zfszhangyuan/article/details/53389916 ----------------------------------- ...

  2. Kafka科普系列 | Kafka中的事务是什么样子的?

    事务,对于大家来说可能并不陌生,比如数据库事务.分布式事务,那么Kafka中的事务是什么样子的呢? 在说Kafka的事务之前,先要说一下Kafka中幂等的实现.幂等和事务是Kafka 0.11.0.0 ...

  3. flink⼿手动维护kafka偏移量量

    flink对接kafka,官方模式方式是自动维护偏移量 但并没有考虑到flink消费kafka过程中,如果出现进程中断后的事情! 如果此时,进程中段: 1:数据可能丢失 从获取了了数据,但是在执⾏行行 ...

  4. java面试记录二:spring加载流程、springmvc请求流程、spring事务失效、synchronized和volatile、JMM和JVM模型、二分查找的实现、垃圾收集器、控制台顺序打印ABC的三种线程实现

    注:部分答案引用网络文章 简答题 1.Spring项目启动后的加载流程 (1)使用spring框架的web项目,在tomcat下,是根据web.xml来启动的.web.xml中负责配置启动spring ...

  5. Mysql引起的spring事务失效

    老项目加新功能,导致出现service调用service的情况..一共2张表有数据的添加删除.然后测试了一下事务,表A和表B,我在表B中抛了异常,但结果发现,表B回滚正常,但是表A并没有回滚.显示事务 ...

  6. spring声明式事务 同一类内方法调用事务失效

    只要避开Spring目前的AOP实现上的限制,要么都声明要事务,要么分开成两个类,要么直接在方法里使用编程式事务 [问题] Spring的声明式事务,我想就不用多介绍了吧,一句话“自从用了Spring ...

  7. SpringMvc配置 导致实事务失效

    SpringMVC回归MVC本质,简简单单的Restful式函数,没有任何基类之后,应该是传统Request-Response框架中最好用的了. Tips 1.事务失效的惨案 Spring MVC最打 ...

  8. Spring component-scan 的逻辑 、单例模式下多实例问题、事务失效

    原创内容,转发请保留:http://www.cnblogs.com/iceJava/p/6930118.html,谢谢 之前遇到该问题,今天查看了下 spring 4.x 的代码 一,先理解下 con ...

  9. spring事务失效情况分析

    详见:http://blog.yemou.net/article/query/info/tytfjhfascvhzxcyt113 <!--[if !supportLists]-->一.&l ...

  10. kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition...Error: NETWORK_EXCEPTION

      kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition... ...

随机推荐

  1. VIM、VI编辑中一个Tab设置为4个空格

    配置方式 配置方式主要两种: 当前用户目录下创建或修改~/.vimrc Root用户下修改/etc/virc 和 /etc/vimrc 在文件末尾添加如下内容: set ts=4 set softta ...

  2. 初学银河麒麟linux笔记 第一章 虚拟机、麒麟系统、QT安装与运行

    由于手头一个项目的QT软件开发需要在银河麒麟系统上运行,借此机会开始从头学习linux系统 首先下载虚拟机VMware 16和麒麟系统iso,这里参考的 https://blog.51cto.com/ ...

  3. fread()函数读文本文件重复读最后一个字符问题【已解决】

    对文本文件读写时遇到一个问题,fread()读所有内容的时候文件的最后一个字符总会重复读,我的代码如下: FILE* file = nullptr; fopen_s(&file, " ...

  4. IO流(1)

    IO流(1) 目录 IO流(1) 文件 创建文件 获取文件信息 目录的操作和文件删除 文件 文件流 文件在程序中以流的形式来操作 输入流:数据从数据源(文件)到程序(内存)的路径 输出流:数据从程序( ...

  5. Docker 安装 PHP+Nginx

    安装Nginx docker pull nginx 安装PHP docker pull php:7.3.5-fpm 启动PHP-FPM docker run --name myphpfpm -v /d ...

  6. 云函数调用云函数 openid不存在

    最近发现了一个问题 就是我使用云函数调用云函数的时候,将openid存入数据库发现为空,怎么回事呢,一查, 原来是这样 还是没有仔细看文档的错啊

  7. uniapp离线打包安卓未配置appkey或配置错误

    按照这4步检查都没问题 1.查看签名文件是否配置到了主APP的build.gradle. signingConfigs { config { keyAlias 'newPt' keyPassword ...

  8. Hadoop集群搭建(详细简单粗暴

    搭建所用Hadoop java版本 hadoop-3.1.3.tar.gz jdk-8u212-linux-x64.tar.gz 安装包链接:Hadoop及jdk安装包提取码:icn6 首先,我们先下 ...

  9. SQL server数据库 账户SA登录失败,提示错误:18456

    在我们使用数据库的时候,偶尔会遇到一些登录上的错误提示.比如,在数据库配置上没有正确开启用户的登录策略以及服务器身份验证模式时,就会提示"用户'sa'登录失败.(Microsoft SQL ...

  10. Enhancement S_ALR_87011964 Asset Balance Report to add custom column

    Enhancement S_ALR_87011964 Asset Balance Report to add custom column - SAP Tutorial Include own fiel ...