Kafka producer异步发送在某些情况会阻塞主线程，使用时候慎重

最近发现一个Kafka producer异步发送在某些情况会阻塞主线程，后来在排查解决问题过程中发现这可以算是Kafka的一个说明不恰当的地方。

问题说明

在很多场景下我们会使用异步方式来发送Kafka的消息，会使用KafkaProducer中的以下方法：

public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {}

根据文档的说明它是一个异步的发送方法，按道理不管如何它都不应该阻塞主线程，但实际中某些情况下会出现阻塞线程，比如broker未正确运行，topic未创建等情况，有些时候我们不需要对发送的结果做保证，但是如果出现阻塞的话，会影响其他业务逻辑。

问题出现点

从KafkaProducer send这个方法声明上看并没有什么问题，那么我们来看一下她的具体实现：

public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {

    // intercept the record, which can be potentially modified; this method does not throw exceptions

    ProducerRecord<K, V> interceptedRecord = this.interceptors.onSend(record);

    return doSend(interceptedRecord, callback);

}

/**

  * Implementation of asynchronously send a record to a topic.

  */

private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {

    TopicPartition tp = null;

    try {

        throwIfProducerClosed();

        // first make sure the metadata for the topic is available

        ClusterAndWaitTime clusterAndWaitTime;

        try {

            clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);  //出现问题的地方

        } catch (KafkaException e) {

            if (metadata.isClosed())

                throw new KafkaException("Producer closed while send in progress", e);

            throw e;

        }

        ...

    } catch (ApiException e) {

        ...

    }

}

private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long maxWaitMs) throws InterruptedException {

    // add topic to metadata topic list if it is not there already and reset expiry

    Cluster cluster = metadata.fetch();

    if (cluster.invalidTopics().contains(topic))

        throw new InvalidTopicException(topic);

    metadata.add(topic);

    Integer partitionsCount = cluster.partitionCountForTopic(topic);

    // Return cached metadata if we have it, and if the record's partition is either undefined

    // or within the known partition range

    if (partitionsCount != null && (partition == null || partition < partitionsCount))

        return new ClusterAndWaitTime(cluster, 0);

    long begin = time.milliseconds();

    long remainingWaitMs = maxWaitMs;

    long elapsed;

    //一直获取topic的元数据信息，直到获取成功，若获取时间超过maxWaitMs，则抛出异常

    do {

        if (partition != null) {

            log.trace("Requesting metadata update for partition {} of topic {}.", partition, topic);

        } else {

            log.trace("Requesting metadata update for topic {}.", topic);

        }

        metadata.add(topic);

        int version = metadata.requestUpdate();

        sender.wakeup();

        try {

            metadata.awaitUpdate(version, remainingWaitMs);

        } catch (TimeoutException ex) {

            // Rethrow with original maxWaitMs to prevent logging exception with remainingWaitMs

            throw new TimeoutException(

                    String.format("Topic %s not present in metadata after %d ms.",

                            topic, maxWaitMs));

        }

        cluster = metadata.fetch();

        elapsed = time.milliseconds() - begin;

        if (elapsed >= maxWaitMs) {  //判断执行时间是否大于maxWaitMs

            throw new TimeoutException(partitionsCount == null ?

                    String.format("Topic %s not present in metadata after %d ms.",

                            topic, maxWaitMs) :

                    String.format("Partition %d of topic %s with partition count %d is not present in metadata after %d ms.",

                            partition, topic, partitionsCount, maxWaitMs));

        }

        metadata.maybeThrowException();

        remainingWaitMs = maxWaitMs - elapsed;

        partitionsCount = cluster.partitionCountForTopic(topic);

    } while (partitionsCount == null || (partition != null && partition >= partitionsCount));

    return new ClusterAndWaitTime(cluster, elapsed);

}

从它的实现我们可以看出，会导致线程阻塞的原因在于以下这个逻辑：

private ClusterAndWaitTime waitOnMetadata(String topic, Integer partition, long maxWaitMs) throws InterruptedException

通过KafkaProducer 执行send的过程中需要先获取Metadata，而这是一个不断循环的操作，直到获取成功，或者抛出异常。

其实Kafka本意这么实现并没有问题，因为你要发送消息的前提就是能获取到border和topic的信息，问题在于这个send对外暴露的是Future的方法，但是内部实现却是有阻塞的，那么在有些时候没有考虑到这种情况，一旦出现border或者topic异常，将会阻塞系统线程，导致系统响应变慢，直到奔溃。

问题解决

其实解决这个问题很简单，就是单独创建几个线程用于消息发送，这样即使遇到意外情况，也只会阻塞几个线程，不会引起系统线程大面积阻塞，不可用，具体实现：

import java.util.concurrent.Callable

import java.util.concurrent.ExecutorService

import java.util.concurrent.Executors

import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

class ProducerF[K,V](kafkaProducer: KafkaProducer[K,V]) {

  val executor: ExecutorService = Executors.newScheduledThreadPool(1)

  def sendAsync(producerRecord: ProducerRecord[K,V], callback: Callback) = {

    executor.submit(new Callable[RecordMetadata]() {

      def call = kafkaProducer.send(producerRecord, callback).get()

    })

  }

}