kafka-rest:A Comprehensive, Open Source REST Proxy for Kafka
Ewen Cheslack-Postava March 25, 2015 时间有点久,但讲的还是很清楚的

As part of Confluent Platform 1.0 released about a month ago, we included a new Kafka REST Proxy to allow more flexibility for developers and to significantly broaden the number of systems and languages that can access Apache Kafka clusters. In this post, I’ll explain the REST Proxy’s features, how it works, and why we built it.
What is the REST Proxy and why do you need one?
The REST Proxy is an open source HTTP-based proxy for your Kafka cluster. The API supports many interactions with your cluster, including producing and consuming messages and accessing cluster metadata such as the set of topics and mapping of partitions to brokers. Just as with Kafka, it can work with arbitrary binary data, but also includes first-class support for Avro and integrates well with Confluent’s Schema Registry. And it is scalable, designed to be deployed in clusters and work with a variety of load balancing solutions.
We built the REST Proxy first and foremost to meet the growing demands of many organizations that want to use Kafka, but also want more freedom to select languages beyond those for which stable native clients exist today. However, it also includes functionality beyond traditional clients, making it useful for building tools for managing your Kafka cluster. See the documentation for a more detailed description of the included features.
A quick example
If you’ve used the Confluent Platform Quickstart to start a local test cluster, starting the REST Proxy for your local Kafka cluster should be as simple as running
$ kafka-rest-start
To use it with a real cluster, you only need to specify a few connection settings. The proxy includes good default settings so you can start using it without any need for customization.
The complete API provides too much functionality to cover in this blog post, but as an example I’ll show a couple of the most common use cases. To keep things language agnostic, I’ll just show the cURL commands. Producing messages is as easy as:$ curl -i -X POST -H "Content-Type: application/vnd.kafka.avro.v1+json" --data '{ "value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"username\", \"type\": \"string\"}]}", "records": [ {"value": {"username": "testUser"}}, {"value": {"username": "testUser2"}} ]}' \http://localhost:8082/topics/avrotest
This sends an HTTP request using the POST method to the endpoint http://localhost:8082/topics/avrotest, which is a resource representing the topic avrotest. The content type, application/vnd.kaka.avro.v1+json, indicates the data in the request is for the Kafka proxy (application/vnd.kafka), contains Avro keys and values (.avro), using the first API version (.v1), and JSON encoding (+json). The payload contains a value schema to specify the format of the data (records with a single field username) and a set of records. Records can specify a key, value, and partition, but in this case we only include the value. The values are just JSON objects because the REST Proxy can automatically translate Avro’s JSON encoding, which is more convenient for your applications, to the more efficient binary encoding you want to store in Kafka.
The server will respond with:HTTP/1.1 200 OKContent-Length: 209Content-Type: application/vnd.kafka.v1+jsonServer: Jetty(8.1.16.v20140903){ "key_schema_id": null, "value_schema_id": 1, "offsets": [ {"partition": 0, "offset":0, "error_code": null, "error": null}, {"partition": 0, "offset":1, "error_code": null, "error": null} ]}
This indicates the request was successful (200 OK) and returns some information about the messages. Schema IDs are included, which can be used as shorthand for the same schema in future requests. Information about any errors for individual messages (error and error_code) are provided in case of failure, or are null in case of success. Successfully recorded messages include the partition and offset of the message.
Consuming messages requires a bit more setup and cleanup, but is also easy. Here we’ll consume the messages we just produced to the topic avrotest. Start by creating the consumer, which will return a base URI you use for all subsequent requests:$ curl -i -X POST -H "Content-Type: application/vnd.kafka.v1+json" \--data '{"format": "avro", "auto.offset.reset": "smallest"}' \http://localhost:8082/consumers/my_avro_consumer
HTTP/1.1 200 OKContent-Length: 121Content-Type: application/vnd.kafka.v1+jsonServer: Jetty(8.1.16.v20140903)
{ "instance_id": "rest-consumer-1", "base_uri": "http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1"}
Notice that we specified that we’ll consume Avro messages and that we want to read from the beginning of topics we subscribe to. Next, just GET a topic resource under that consumer to start receiving messages:$ curl -i -X GET -H "Accept: application/vnd.kafka.avro.v1+json" \http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1/topics/avrotest
HTTP/1.1 200 OKContent-Length: 134Content-Type: application/vnd.kafka.avro.v1+jsonServer: Jetty(8.1.16.v20140903)
[ { "key": null, "value": {"username": "testUser"}, "partition": 0, "offset": 0 }, { "key": null, "value": {"username": "testUser2"}, "partition": 0, "offset": 1 }]
When you’re done consuming or need to shut down this instance, you should try to clean up the consumer so other instances in the same group can quickly pick up the partitions that had been assigned to this consumer:$ curl -i -X DELETE \http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1
HTTP/1.1 204 No ContentServer: Jetty(8.1.16.v20140903)
This is only a short example of the API, but hopefully shows how simple it is to work with. The example above can quickly be adapted to any language with a good HTTP library in just a few lines of code. Using the API in real applications is just as simple.
How does the REST Proxy work?
I’ll leave a detailed explanation of the REST Proxy implementation for another post, but I do want to highlight some high level design decisions.
HTTP wrapper of Java libraries – At it’s core, the REST Proxy simply wraps the existing libraries provided with the Apache Kafka open source project. This includes not only the producer and consumer you would expect, but also access to cluster metadata and admin operations. Currently this means using some internal (but publicly visible) interfaces that may require some ongoing maintenance because that code has no compatibility guarantees; therefore the REST Proxy depends on specific versions of the Kafka libraries and the Confluent Platform packaging ensures the REST Proxy uses a compatible version. As the Kafka code standardizes some interfaces (and protocols) for these operations we’ll be able to rely on those public interfaces and therefore be less tied to particular Kafka versions.
JSON with flexible embedded data – Good JSON libraries are available just about everywhere, so this was an easy choice. However, REST Proxy requests need to include embedded data — the serialized key and value data that Kafka deals with. To make this flexible, we use vendor specific content types in Content-Type and Accept headers to make the format of the data explicit. We recommend using Avro to help your organization build a manageable and scalable stream data platform. Avro has a JSON encoding, so you can embed your data in a natural, readable way, as the introductory example demonstrated. Schemas need to be included with every request so they can be registered and validated against the Confluent Schema Registry and the data can be serialized using the more efficient binary encoding. To avoid the overhead of sending schemas with every request, API responses include the schema ID that the Schema Registry returns, which can subsequently be used in place of the full schema. If you opt to use raw binary data, it cannot be embedded directly in JSON, so the API uses a string containing the base64 encoded data.
Stateful consumers – Consumers are stateful and tied to a particular proxy instance. This is actually a violation of REST principles, which state that requests should be stateless and should contain all the information necessary to serve them. There are ways to make the REST Proxy layer stateless, e.g., by moving consumer state to a separate service, just as a separate database is used to manage state for a RESTful API. However, we felt the added complexity and the cost of extra network hops and latency wasn’t warranted in this case. The approach we use is simpler and the consumer implementation already provides fault tolerance: if one instance of the proxy fails, other consumers in the same group automatically pick up the slack until the failed consumers can be restarted.
Distributed and load balanced – Despite consumers being tied to the proxy instance that created them, the entire REST Proxy API is designed to be accessed via any load balancing mechanism (e.g. discovery services, round robin DNS, or software load balancing proxy). To support consumers, each instance must be individually addressable by clients, but all non-consumer requests can be handled by any REST Proxy instance.
Why a REST Proxy?
In addition to the official Java clients, there are more than a dozen community-contributed clients across a variety of languages listed on the wiki. So why bother writing the REST Proxy at all?
It turns out that writing a feature complete, high performance Kafka client is currently a pretty difficult endeavor for a few reasons:
Flexible consumer group model – Kafka has a consumer group model, as shown in the figure below, that generalizes a few different messaging models, including both queuing and publish-subscribe semantics. However, the protocol currently leaves much of the work to clients. This has some significant benefits for brokers, e.g., minimizing the work they do and making performance very predictable. But it also means Kafka consumers are “thick clients” that have to implement complex algorithms for features like partition load balancing and offset management. As a result, it’s common to find Kafka client libraries supporting only a fraction of the Java client’s functionality, sometime omitting high-level consumer support entirely.

Low-level networking – Even a small Kafka cluster on moderate hardware can support very high throughput, but writing a cluster to take advantage of this requires low level networking code. Producers need to efficiently batch messages to achieve high throughput while minimizing latency, manage many connections to brokers (usually using an async I/O API), and gracefully handle a wide variety of edge cases and errors. It’s a substantial time investment to write, debug, and maintain these libraries.
Protocol evolution – The 0.8 releases come with a promise of compatibility and a new versioned protocol to enable this. The community has also implemented a new approach to making user visible changes, both to catch compatibility problems before they occur and to act as documentation for users. As the protocol evolves and new features are added, client implementations need to add support for new features and remove deprecated ones. By leveraging the official Java clients, the REST Proxy picks up most of these changes and fixes for free. Some new features will require REST API changes, but careful API design that anticipates new features will help insulate applications from these changes.
The REST Proxy is a simple solution to these problems: instead of waiting for a good, feature-complete client for your language of choice, you can use a simple HTTP-based interface that should be easily accessible from just about any language.
But there were already a few REST proxies for Kafka out there, so why build another? Although the existing proxies were good — and informed the design of the Confluent REST Proxy — it was clear that they were written to address a few specific use cases. We wanted to provide a truly comprehensive interface to Kafka, which includes administrative actions in addition to complete producer and consumer functionality. We haven’t finished yet — today we provide read-only access to cluster metadata but plan to add support for administrative actions such as partition assignment — but soon you should be able to both operate and use your cluster entirely through the REST Proxy.
Finally, it’s worth pointing out that the goal of the REST Proxy is not to replace existing clients. For example, if you’re using C or Python and only need to produce or consume messages, you may be better served by the existing high quality librdkafka library for C/C++ or kafka-python library for Python. Confluent is invested in development of native clients for a variety of popular languages. However, the REST Proxy provides a broad-reaching solution very quickly and supports languages where no developer has yet taken on the task of developing a client.
Tradeoffs
Of course, there are some tradeoffs to using the REST Proxy instead of native clients for your language, but in many cases these tradeoffs are acceptable when you consider the capabilities you get with the REST Proxy.
The most obvious tradeoff is complexity: the REST Proxy is yet another service you need to run and maintain. We’ve tried to make this as simple as possible, but adding any service to your stack has a cost.
Another cost is in performance. The REST Proxy adds extra processing: clients construct and make HTTP requests, the REST Proxy needs to parse requests, transform data between formats both for produce and consume requests, and handle all the interaction with Kafka itself, at least doubling bandwidth usage. And the performance costs aren’t only as simple as doubling costs because the REST Proxy can’t use all the same optimizations that the Kafka broker can (e.g., sendfile is not an option for the proxy consumer responses). Still, the REST Proxy can achieve high throughput. I’ll discuss some benchmarking numbers in more detail in a future post, but the most recent version of the proxy achieves about ⅔ the throughput of the Java clients when producing binary messages and about ⅕ the rate when consuming binary messages. Some of this performance can be regained with higher performance CPUs on your REST Proxy instances because a significant fraction of the performance reduction is due to increased CPU usage (parsing JSON, decoding embedded content in produce requests or encoding embedded content in responses, serializing JSON responses).
The REST Proxy also exposes a very limited subset of the normal client configuration options to applications. It is designed to act as shared infrastructure, sitting between your Kafka cluster and many of your applications. Limiting configuration of many settings to the operator of the proxy in this multi-tenant environment is important to maintaining stability. One good example where this is critical is consumer memory settings. To achieve good throughput, consumer responses should include a relatively large amount of data to amortize per-request overheads over many messages. But the proxy needs to buffer that amount of data for each consumer, so it would be risky to give complete control of that setting to each application that creates consumers. One misbehaving consumer could affect all your applications. Future versions should offer more control, but these options need to be added to the API with considerable care.
Finally, unless you have a language-specific wrapper of the API, you won’t get an idiomatic API for your language. This can have a significant impact on the verbosity and readability of your code. However, this is also the simplest problem to fix, and at relatively low cost. For example, to evaluate and validate the API, I wrote a thin node.js wrapper of the API. The entire library clocks in at about 650 lines of non-comment non-whitespace code, and a simple example that ingests data from Twitter’s public stream and computes trending hashtags is only about 200 lines of code.
Conclusion
This first version of the REST Proxy provides basic metadata, producer, and consumer support using Avro or arbitrary binary data. We hope it will enable broader adoption of Kafka by making all of Kafka’s features accessible, no matter what language or technology stack you’re working with.
We’re just getting started with the REST Proxy and have a lot more functionality planned for future versions. You can get some idea of where we’re heading by looking at the features we’ve noted are missing right now, but we also want to hear from you about what features you would find most useful, either on our community mailing list, on the open source project’s issue tracker, or through pull requests.
kafka-rest:A Comprehensive, Open Source REST Proxy for Kafka的更多相关文章
- 一次flume exec source采集日志到kafka因为单条日志数据非常大同步失败的踩坑带来的思考
本次遇到的问题描述,日志采集同步时,当单条日志(日志文件中一行日志)超过2M大小,数据无法采集同步到kafka,分析后,共踩到如下几个坑.1.flume采集时,通过shell+EXEC(tail -F ...
- 【kafka】安装部署kafka集群(kafka版本:kafka_2.12-2.3.0)
3.2.1 下载kafka并安装kafka_2.12-2.3.0.tgz tar -zxvf kafka_2.12-2.3.0.tgz 3.2.2 配置kafka集群 在config/server.p ...
- kafka入门:简介、使用场景、设计原理、主要配置及集群搭建(转)
问题导读: 1.zookeeper在kafka的作用是什么? 2.kafka中几乎不允许对消息进行"随机读写"的原因是什么? 3.kafka集群consumer和producer状 ...
- Akka(17): Stream:数据流基础组件-Source,Flow,Sink简介
在大数据程序流行的今天,许多程序都面临着共同的难题:程序输入数据趋于无限大,抵达时间又不确定.一般的解决方法是采用回调函数(callback-function)来实现的,但这样的解决方案很容易造成“回 ...
- kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition...Error: NETWORK_EXCEPTION
kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition... ...
- Kafka#4:存储设计 分布式设计 源码分析
https://sites.google.com/a/mammatustech.com/mammatusmain/kafka-architecture/4-kafka-detailed-archite ...
- Kafka剖析:Kafka背景及架构介绍
<Kafka剖析:Kafka背景及架构介绍> <Kafka设计解析:Kafka High Availability(上)> <Kafka设计解析:Kafka High A ...
- Kafka分布式:ZooKeeper扩展
[ZooKeeper] 服务注册.服务发现.客户端负载均衡.Offset偏移量分布式存储. kafka使用zookeeper来实现动态的集群扩展,不需要更改客户端(producer和consumer) ...
- kafka之一:Windows上搭建Kafka运行环境
搭建环境 1. 安装JDK 1.1 安装文件:http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-213315 ...
随机推荐
- shell从入门到精通进阶之一:Shell基础知识
1.1 简介 Shell是一个C语言编写的脚本语言,它是用户与Linux的桥梁,用户输入命令交给Shell处理,Shell将相应的操作传递给内核(Kernel),内核把处理的结果输出给用户. 下面是处 ...
- mysql数据插入前判断是否存在
今天在对一些抓取到的数据做插入的时候,因为使用了定时器,每间隔几分钟会抓取一次,导致很多数据插入的是重复数据,为了解决这个问题, 一般是在插入之前先通过一个标识去查询表数据看是否已经有了,没有再执行插 ...
- 在windows中创建.gitignore文件
1.先任意创建一个文件,例如:1.txt 2.打开cmd命令行窗口,到1.txt目录下 windows7/8输入ren 1.txt .gitignore修改成功 windows10输入mv 1.txt ...
- Asp.Net路由重写为用户名或者ID
有一个需求如下:指定某个Area的路由(Area:Wx)在其后面添加用户名或者ID作为URL参数,即像下面的样子: /Wx/xiaoming/ /Wx/xiaoming/photo /Wx/xiaom ...
- 使用NOPI写入Excel基础代码
using NPOI.XSSF.UserModel; using System; using System.Collections.Generic; using System.IO; using Sy ...
- 从零开始学安全(三十六)●利用python 爆破form表单
import sys import requests from requests.auth import HTTPBasicAuth def Brute_Force_Web(postData): re ...
- Linux万能快捷键与命令
tab键:补全命令 \ :命令折行写 Ctrl+C :结束命令 --help :查看命令详细信息 man :类似于help 比help更加详细. sudo :临时以管理员权限执行命令. 还有吗?
- Java学习--使用 Date 和 SimpleDateFormat 类表示时间
使用 Date 和 SimpleDateFormat 类表示时间 在程序开发中,经常需要处理日期和时间的相关数据,此时我们可以使用 java.util 包中的 Date 类.这个类最主要的作用就是获取 ...
- Java开发笔记(五十九)Java8之后的扩展接口
前面介绍了接口的基本用法,有心的朋友可能注意到这么一句话“在Java8以前,接口内部的所有方法都必须是抽象方法”,如此说来,在Java8之后,接口的内部方法也可能不是抽象方法了吗?之所以Java8对接 ...
- Qt Creator的下载和安装
原文:https://blog.csdn.net/weixin_38090427/article/details/83827678 一,Qt和Qt Creator的区别 Qt是C++的一个库,或者说是 ...