kafka producer 源码总结

kafka producer可以总体上分为两个部分：

producer调用send方法，将消息存放到内存中
sender线程轮询的从内存中将消息通过NIO发送到网络中

1 调用send方法

其实在调用new KafkaProducer初始化一个producer实例的时候，已经初始化了一个sender线程在后台轮询，不过为了方便理解，我们先分析send方法，即producer如何将消息放到内存队列中的。

1.1 producer存储结构

producer的整体存储结构如下图

1.2 整体流程

kafka在发送消息的时候，首先会连接一台broker来获取metadata信息，从metadata中可以知道要发送的topic一共有几个partiton、partiton的leader所在broker等信息。获取到metadata信息后，会通过获取到metadata信息并通过消息的key来计算消息被分配到哪个partiton（注意：消息被分配到哪个partiton是在客户端被计算好的）。然后会将消息按照partiton分组，放到对应的RecordBatch中，如果RecordBatch大于batch.size的大小，则新建一个RecordBatch放在list末尾。

doSend流程

private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {

        TopicPartition tp = null;

            // 阻塞并唤醒sender线程，等待sender线程获取到metadata

            long waitedOnMetadataMs = waitOnMetadata(record.topic(), this.maxBlockTimeMs);

            long remainingWaitMs = Math.max(0, this.maxBlockTimeMs - waitedOnMetadataMs);

            byte[] serializedKey;

            // 序列化key和value

            serializedKey = keySerializer.serialize(record.topic(), record.key());

            byte[] serializedValue;

            serializedValue = valueSerializer.serialize(record.topic(), record.value());

           // 计算消息应该放在哪个partition分区，如1.3详细介绍

            int partition = partition(record, serializedKey, serializedValue, metadata.fetch());

            int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);

            ensureValidRecordSize(serializedSize);

            tp = new TopicPartition(record.topic(), partition);

            long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();

            Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);

            // 根据计算到的分区，将消息追加到对应分区所在的Deque<RecordBatch>中，如1.4详细介绍

            RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);

            // 如果是新创建batch或者batch满了，那么就唤醒sender线程。

            if (result.batchIsFull || result.newBatchCreated) {

                log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);

                this.sender.wakeup();

            }

            return result.future;

            // handling exceptions and record the errors;

            // for API exceptions return them in the future,

            // for other exceptions throw directly

            //...

    }

1.3 选择分区

在doSend方法中，调用partition 来计算消息的分区。如果没有特别指定的话，会使用默认的分区方法：

如果消息含有key，则计算方式是 key的绝对值 % （partiton个数 - 1)
如果不包含key的话，则采用round-robin方式发送。

1.4 消息放到内存中

accumulator#append(..)中会首先尝试将消息放到分区所对应的Deque<RecordBatch> 的最后一个batch中，如果添加失败（比如RecordBatch已经满了），则会使用BufferPool从内存中申请一块大小为batch.size的内存出来（如果消息体大于batch.size，则申请消息体大小的内存）,将消息放到新的batch中，并将新的batch添加到Deque<RecordBatch>中。

append

public RecordAppendResult append(TopicPartition tp,

                                     long timestamp,

                                     byte[] key,

                                     byte[] value,

                                     Callback callback,

                                     long maxTimeToBlock) throws InterruptedException {

        appendsInProgress.incrementAndGet();

        try {

            Deque<RecordBatch> dq = getOrCreateDeque(tp);

            // 注意，这里操作是加锁的。加锁的原因是

            // 1.producer是可以多线程访问的

            // 2.sender线程也会操作Deque<RecordBatch>

            synchronized (dq) {

                if (closed)

                    throw new IllegalStateException("Cannot send after the producer is closed.");

                RecordBatch last = dq.peekLast();

                if (last != null) {

                    FutureRecordMetadata future = last.tryAppend(timestamp, key, value, callback, time.milliseconds());

                    if (future != null)

                        return new RecordAppendResult(future, dq.size() > 1 || last.records.isFull(), false);

                }

            }

            // 申请内存，大小为消息体和batch.size的最大值,另外buffer中其实只缓存batch.size大小的内存，只有batch.size大小的内存申请才会从buffer中获取，大于batch.size会重新开辟空间，

            // 所以合理规划batch.size和消息体大小可以有效提供客户端内存使用效率

            int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));

            // 从池子中申请

            ByteBuffer buffer = free.allocate(size, maxTimeToBlock);

            // 注意，这里又用到锁，不将两个锁合并成一个锁原因是减少锁的粒度

            synchronized (dq) {

                if (closed)

                    throw new IllegalStateException("Cannot send after the producer is closed.");

                RecordBatch last = dq.peekLast();

                if (last != null) {

                    FutureRecordMetadata future = last.tryAppend(timestamp, key, value, callback, time.milliseconds());

                    if (future != null) {

                        free.deallocate(buffer);

                        return new RecordAppendResult(future, dq.size() > 1 || last.records.isFull(), false);

                    }

                }

                MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);

                RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());

                FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));

                dq.addLast(batch);

                incomplete.add(batch);

                return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);

            }

        } finally {

            appendsInProgress.decrementAndGet();

        }

    }

2 sender线程

sender线程在new出一个Kafka Producer实例后就已经开始运行了.

消息放到内存中是按照partiton来进行分组的，但是sender线程发送的时候是按照broker的node节点来发送，这点需要注意。

sender线程的整体逻辑如下：

void run(long now) {

        Cluster cluster = metadata.fetch();

        // ready用来获取已经ready的节点，注意是节点，所谓ready节点是指partiton满足以下条件之一后，partition的leader所在的节点为ready

        // 1. Deque<RecordBatch> size大于1，说明已经有一个Batch满了，可以发送

        // 2. 内存池已经耗尽，这时候需要发送写消息，来释放内存

        // 3. linger.ms 时间到了，表示可以发送了

        // 4. 调用了close方法，需要将内存中消息发送出去

        // 如果partiton满足以上条件之一，那么parttion所在的leader节点就算准备好了

        RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);

        if (result.unknownLeadersExist)

            this.metadata.requestUpdate();

        Iterator<Node> iter = result.readyNodes.iterator();

        long notReadyTimeout = Long.MAX_VALUE;

        while (iter.hasNext()) {

            Node node = iter.next();

            if (!this.client.ready(node, now)) {

                iter.remove();

                notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));

            }

        }

        // 由于RecordBatch是按照partiton来组织的，而sender线程是按照节点来发送的，所以drain的作用就是将RecordBatch转换为按照节点来组织的方式。drain只会获取每个分区的第一个BatchRecord，而不是将一个分区的所有BatchRecord都发送，主要是避免饥饿

        Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,

                                                                         result.readyNodes,

                                                                         this.maxRequestSize,

                                                                         now);

        if (guaranteeMessageOrder) {

            for (List<RecordBatch> batchList : batches.values()) {

                for (RecordBatch batch : batchList)

                    this.accumulator.mutePartition(batch.topicPartition);

            }

        }

        List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);

        for (RecordBatch expiredBatch : expiredBatches)

            this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);

        sensors.updateProduceRequestMetrics(batches);

        // 一个节点只会产生一个request

        List<ClientRequest> requests = createProduceRequests(batches, now);

        long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);

        if (result.readyNodes.size() > 0) {

            log.trace("Nodes with data ready to send: {}", result.readyNodes);

            log.trace("Created {} produce requests: {}", requests.size(), requests);

            pollTimeout = 0;

        }

        for (ClientRequest request : requests)

            client.send(request, now);

        // 真正的发送

        this.client.poll(pollTimeout, now);

    }

3 一些细节总结

batch.size和linger.ms满足其中之一，sender线程便会被激活进行发送消息
sender每次只拿出一个partiton的一个RecordBatch进行发送，即便该partiton已经有多个RecordBatch满了，这样做主要为了避免其他parttion饥饿, 详见RecordAccumulator#drain(..)
RecordAccumulator#drain(..)后，被drain的RecordBatch会被close，不可写；同时从相应的Deque<RecordBatch>中移除。