Ewen Cheslack-Postava  March 25, 2015  时间有点久,但讲的还是很清楚的

As part of Confluent Platform 1.0 released about a month ago, we included a new Kafka REST Proxy to allow more flexibility for developers and to significantly broaden the number of systems and languages that can access Apache Kafka clusters. In this post, I’ll explain the REST Proxy’s features, how it works, and why we built it.

What is the REST Proxy and why do you need one?

The REST Proxy is an open source HTTP-based proxy for your Kafka cluster. The API supports many interactions with your cluster, including producing and consuming messages and accessing cluster metadata such as the set of topics and mapping of partitions to brokers. Just as with Kafka, it can work with arbitrary binary data, but also includes first-class support for Avro and integrates well with Confluent’s Schema Registry. And it is scalable, designed to be deployed in clusters and work with a variety of load balancing solutions.

We built the REST Proxy first and foremost to meet the growing demands of many organizations that want to use Kafka, but also want more freedom to select languages beyond those for which stable native clients exist today. However, it also includes functionality beyond traditional clients, making it useful for building tools for managing your Kafka cluster. See the documentation for a more detailed description of the included features.

A quick example

If you’ve used the Confluent Platform Quickstart to start a local test cluster, starting the REST Proxy for your local Kafka cluster should be as simple as running

$ kafka-rest-start

To use it with a real cluster, you only need to specify a few connection settings. The proxy includes good default settings so you can start using it without any need for customization.

The complete API provides too much functionality to cover in this blog post, but as an example I’ll show a couple of the most common use cases. To keep things language agnostic, I’ll just show the cURL commands. Producing messages is as easy as:
$ curl -i -X POST -H "Content-Type: application/vnd.kafka.avro.v1+json" --data '{
   "value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"username\", \"type\": \"string\"}]}",
   "records": [
     {"value": {"username": "testUser"}},
     {"value": {"username": "testUser2"}}
   ]
}' \
http://localhost:8082/topics/avrotest

This sends an HTTP request using the POST method to the endpoint http://localhost:8082/topics/avrotest, which is a resource representing the topic avrotest. The content type, application/vnd.kaka.avro.v1+json, indicates the data in the request is for the Kafka proxy (application/vnd.kafka), contains Avro keys and values (.avro), using the first API version (.v1), and JSON encoding (+json). The payload contains a value schema to specify the format of the data (records with a single field username) and a set of records. Records can specify a key, value, and partition, but in this case we only include the value. The values are just JSON objects because the REST Proxy can automatically translate Avro’s JSON encoding, which is more convenient for your applications, to the more efficient binary encoding you want to store in Kafka.

The server will respond with:
HTTP/1.1 200 OK
Content-Length: 209
Content-Type: application/vnd.kafka.v1+json
Server: Jetty(8.1.16.v20140903)
{
   "key_schema_id": null,
   "value_schema_id": 1,
   "offsets": [
     {"partition": 0, "offset":0, "error_code": null, "error": null},
     {"partition": 0, "offset":1, "error_code": null, "error": null}
   ]
}

This indicates the request was successful (200 OK) and returns some information about the messages. Schema IDs are included, which can be used as shorthand for the same schema in future requests. Information about any errors for individual messages (error and error_code) are provided in case of failure, or are null in case of success. Successfully recorded messages include the partition and offset of the message.

Consuming messages requires a bit more setup and cleanup, but is also easy. Here we’ll consume the messages we just produced to the topic avrotest. Start by creating the consumer, which will return a base URI you use for all subsequent requests:
$ curl -i -X POST -H "Content-Type: application/vnd.kafka.v1+json" \
--data '{"format": "avro", "auto.offset.reset": "smallest"}' \
http://localhost:8082/consumers/my_avro_consumer

HTTP/1.1 200 OK
Content-Length: 121
Content-Type: application/vnd.kafka.v1+json
Server: Jetty(8.1.16.v20140903)

{
  "instance_id": "rest-consumer-1",
   "base_uri": "http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1"
}

Notice that we specified that we’ll consume Avro messages and that we want to read from the beginning of topics we subscribe to. Next, just GET a topic resource under that consumer to start receiving messages:
$ curl -i -X GET -H "Accept: application/vnd.kafka.avro.v1+json" \
http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1/topics/avrotest

HTTP/1.1 200 OK
Content-Length: 134
Content-Type: application/vnd.kafka.avro.v1+json
Server: Jetty(8.1.16.v20140903)

[
  {
     "key": null,
     "value": {"username": "testUser"},
     "partition": 0,
     "offset": 0
   },
   {
     "key": null,
     "value": {"username": "testUser2"},
     "partition": 0,
     "offset": 1
   }
]

When you’re done consuming or need to shut down this instance, you should try to clean up the consumer so other instances in the same group can quickly pick up the partitions that had been assigned to this consumer:
$ curl -i -X DELETE \
http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1

HTTP/1.1 204 No Content
Server: Jetty(8.1.16.v20140903)

This is only a short example of the API, but hopefully shows how simple it is to work with. The example above can quickly be adapted to any language with a good HTTP library in just a few lines of code. Using the API in real applications is just as simple.

How does the REST Proxy work?

I’ll leave a detailed explanation of the REST Proxy implementation for another post, but I do want to highlight some high level design decisions.

HTTP wrapper of Java libraries – At it’s core, the REST Proxy simply wraps the existing libraries provided with the Apache Kafka open source project. This includes not only the producer and consumer you would expect, but also access to cluster metadata and admin operations. Currently this means using some internal (but publicly visible) interfaces that may require some ongoing maintenance because that code has no compatibility guarantees; therefore the REST Proxy depends on specific versions of the Kafka libraries and the Confluent Platform packaging ensures the REST Proxy uses a compatible version. As the Kafka code standardizes some interfaces (and protocols) for these operations we’ll be able to rely on those public interfaces and therefore be less tied to particular Kafka versions.

JSON with flexible embedded data – Good JSON libraries are available just about everywhere, so this was an easy choice. However, REST Proxy requests need to include embedded data — the serialized key and value data that Kafka deals with. To make this flexible, we use vendor specific content types in Content-Type and Accept headers to make the format of the data explicit. We recommend using Avro to help your organization build a manageable and scalable stream data platform. Avro has a JSON encoding, so you can embed your data in a natural, readable way, as the introductory example demonstrated. Schemas need to be included with every request so they can be registered and validated against the Confluent Schema Registry and the data can be serialized using the more efficient binary encoding. To avoid the overhead of sending schemas with every request, API responses include the schema ID that the Schema Registry returns, which can subsequently be used in place of the full schema. If you opt to use raw binary data, it cannot be embedded directly in JSON, so the API uses a string containing the base64 encoded data.

Stateful consumers – Consumers are stateful and tied to a particular proxy instance. This is actually a violation of REST principles, which state that requests should be stateless and should contain all the information necessary to serve them. There are ways to make the REST Proxy layer stateless, e.g., by moving consumer state to a separate service, just as a separate database is used to manage state for a RESTful API. However, we felt the added complexity and the cost of extra network hops and latency wasn’t warranted in this case. The approach we use is simpler and the consumer implementation already provides fault tolerance: if one instance of the proxy fails, other consumers in the same group automatically pick up the slack until the failed consumers can be restarted.

Distributed and load balanced – Despite consumers being tied to the proxy instance that created them, the entire REST Proxy API is designed to be accessed via any load balancing mechanism (e.g. discovery services, round robin DNS, or software load balancing proxy). To support consumers, each instance must be individually addressable by clients, but all non-consumer requests can be handled by any REST Proxy instance.

Why a REST Proxy?

In addition to the official Java clients, there are more than a dozen community-contributed clients across a variety of languages listed on the wiki. So why bother writing the REST Proxy at all?

It turns out that writing a feature complete, high performance Kafka client is currently a pretty difficult endeavor for a few reasons:

Flexible consumer group model – Kafka has a consumer group model, as shown in the figure below, that generalizes a few different messaging models, including both queuing and publish-subscribe semantics. However, the protocol currently leaves much of the work to clients. This has some significant benefits for brokers, e.g., minimizing the work they do and making performance very predictable. But it also means Kafka consumers are “thick clients” that have to implement complex algorithms for features like partition load balancing and offset management. As a result, it’s common to find Kafka client libraries supporting only a fraction of the Java client’s functionality, sometime omitting high-level consumer support entirely.

Kafka’s consumer group model supports multiple consumers on the same topic, organized into groups of variable and dynamic size, and supports offset management. This is very flexible, scalable, and fault tolerant, but means non-Java clients have to implement more functionality to achieve feature parity with the Java clients.

Low-level networking – Even a small Kafka cluster on moderate hardware can support very high throughput, but writing a cluster to take advantage of this requires low level networking code. Producers need to efficiently batch messages to achieve high throughput while minimizing latency, manage many connections to brokers (usually using an async I/O API), and gracefully handle a wide variety of edge cases and errors. It’s a substantial time investment to write, debug, and maintain these libraries.

Protocol evolution – The 0.8 releases come with a promise of compatibility and a new versioned protocol to enable this. The community has also implemented a new approach to making user visible changes, both to catch compatibility problems before they occur and to act as documentation for users. As the protocol evolves and new features are added, client implementations need to add support for new features and remove deprecated ones. By leveraging the official Java clients, the REST Proxy picks up most of these changes and fixes for free. Some new features will require REST API changes, but careful API design that anticipates new features will help insulate applications from these changes.

The REST Proxy is a simple solution to these problems: instead of waiting for a good, feature-complete client for your language of choice, you can use a simple HTTP-based interface that should be easily accessible from just about any language.

But there were already a few REST proxies for Kafka out there, so why build another? Although the existing proxies were good — and informed the design of the Confluent REST Proxy — it was clear that they were written to address a few specific use cases. We wanted to provide a truly comprehensive interface to Kafka, which includes administrative actions in addition to complete producer and consumer functionality. We haven’t finished yet — today we provide read-only access to cluster metadata but plan to add support for administrative actions such as partition assignment — but soon you should be able to both operate and use your cluster entirely through the REST Proxy.

Finally, it’s worth pointing out that the goal of the REST Proxy is not to replace existing clients. For example, if you’re using C or Python and only need to produce or consume messages, you may be better served by the existing high quality librdkafka library for C/C++ or kafka-python library for Python. Confluent is invested in development of native clients for a variety of popular languages. However, the REST Proxy provides a broad-reaching solution very quickly and supports languages where no developer has yet taken on the task of developing a client.

Tradeoffs

Of course, there are some tradeoffs to using the REST Proxy instead of native clients for your language, but in many cases these tradeoffs are acceptable when you consider the capabilities you get with the REST Proxy.

The most obvious tradeoff is complexity: the REST Proxy is yet another service you need to run and maintain. We’ve tried to make this as simple as possible, but adding any service to your stack has a cost.

Another cost is in performance. The REST Proxy adds extra processing: clients construct and make HTTP requests, the REST Proxy needs to parse requests, transform data between formats both for produce and consume requests, and handle all the interaction with Kafka itself, at least doubling bandwidth usage. And the performance costs aren’t only as simple as doubling costs because the REST Proxy can’t use all the same optimizations that the Kafka broker can (e.g., sendfile is not an option for the proxy consumer responses). Still, the REST Proxy can achieve high throughput. I’ll discuss some benchmarking numbers in more detail in a future post, but the most recent version of the proxy achieves about ⅔ the throughput of the Java clients when producing binary messages and about ⅕ the rate when consuming binary messages. Some of this performance can be regained with higher performance CPUs on your REST Proxy instances because a significant fraction of the performance reduction is due to increased CPU usage (parsing JSON, decoding embedded content in produce requests or encoding embedded content in responses, serializing JSON responses).

The REST Proxy also exposes a very limited subset of the normal client configuration options to applications. It is designed to act as shared infrastructure, sitting between your Kafka cluster and many of your applications. Limiting configuration of many settings to the operator of the proxy in this multi-tenant environment is important to maintaining stability. One good example where this is critical is consumer memory settings. To achieve good throughput, consumer responses should include a relatively large amount of data to amortize per-request overheads over many messages. But the proxy needs to buffer that amount of data for each consumer, so it would be risky to give complete control of that setting to each application that creates consumers. One misbehaving consumer could affect all your applications. Future versions should offer more control, but these options need to be added to the API with considerable care.

Finally, unless you have a language-specific wrapper of the API, you won’t get an idiomatic API for your language. This can have a significant impact on the verbosity and readability of your code. However, this is also the simplest problem to fix, and at relatively low cost. For example, to evaluate and validate the API, I wrote a thin node.js wrapper of the API. The entire library clocks in at about 650 lines of non-comment non-whitespace code, and a simple example that ingests data from Twitter’s public stream and computes trending hashtags is only about 200 lines of code.

Conclusion

This first version of the REST Proxy provides basic metadata, producer, and consumer support using Avro or arbitrary binary data. We hope it will enable broader adoption of Kafka by making all of Kafka’s features accessible, no matter what language or technology stack you’re working with.

We’re just getting started with the REST Proxy and have a lot more functionality planned for future versions. You can get some idea of where we’re heading by looking at the features we’ve noted are missing right now, but we also want to hear from you about what features you would find most useful, either on our community mailing list, on the open source project’s issue tracker, or through pull requests.

kafka-rest:A Comprehensive, Open Source REST Proxy for Kafka的更多相关文章

  1. 一次flume exec source采集日志到kafka因为单条日志数据非常大同步失败的踩坑带来的思考

    本次遇到的问题描述,日志采集同步时,当单条日志(日志文件中一行日志)超过2M大小,数据无法采集同步到kafka,分析后,共踩到如下几个坑.1.flume采集时,通过shell+EXEC(tail -F ...

  2. 【kafka】安装部署kafka集群(kafka版本:kafka_2.12-2.3.0)

    3.2.1 下载kafka并安装kafka_2.12-2.3.0.tgz tar -zxvf kafka_2.12-2.3.0.tgz 3.2.2 配置kafka集群 在config/server.p ...

  3. kafka入门:简介、使用场景、设计原理、主要配置及集群搭建(转)

    问题导读: 1.zookeeper在kafka的作用是什么? 2.kafka中几乎不允许对消息进行"随机读写"的原因是什么? 3.kafka集群consumer和producer状 ...

  4. Akka(17): Stream:数据流基础组件-Source,Flow,Sink简介

    在大数据程序流行的今天,许多程序都面临着共同的难题:程序输入数据趋于无限大,抵达时间又不确定.一般的解决方法是采用回调函数(callback-function)来实现的,但这样的解决方案很容易造成“回 ...

  5. kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition...Error: NETWORK_EXCEPTION

      kafka producer生产数据到kafka异常:Got error produce response with correlation id 16 on topic-partition... ...

  6. Kafka#4:存储设计 分布式设计 源码分析

    https://sites.google.com/a/mammatustech.com/mammatusmain/kafka-architecture/4-kafka-detailed-archite ...

  7. Kafka剖析:Kafka背景及架构介绍

    <Kafka剖析:Kafka背景及架构介绍> <Kafka设计解析:Kafka High Availability(上)> <Kafka设计解析:Kafka High A ...

  8. Kafka分布式:ZooKeeper扩展

    [ZooKeeper] 服务注册.服务发现.客户端负载均衡.Offset偏移量分布式存储. kafka使用zookeeper来实现动态的集群扩展,不需要更改客户端(producer和consumer) ...

  9. kafka之一:Windows上搭建Kafka运行环境

    搭建环境 1. 安装JDK 1.1 安装文件:http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-213315 ...

随机推荐

  1. 如何实现登录、URL和页面按钮的访问控制?

    用户权限管理一般是对用户页面.按钮的访问权限管理.Shiro框架是一个强大且易用的Java安全框架,执行身份验证.授权.密码和会话管理,对于Shiro的介绍这里就不多说.本篇博客主要是了解Shiro的 ...

  2. RDIFramework.NET ━ .NET快速信息化系统开发框架 V3.2->Web版本模块管理界面新增模块排序功能

    模块(菜单)的排序是每个系统都必须要有的功能,我们框架模块的排序在业务逻辑中已经体现. WinForm版本可以直接在界面上对模块进行排序以控制模块展示的顺序.Web版本在3.2版本中也新增了直接可以模 ...

  3. RDIFramework.NET ━ .NET快速信息化系统开发框架 V3.2-新增锁定用户与解除锁定用户的功能

    锁定用户功能在现实应用场景中得到了大量的应用,当我们需要限制某用户的登录,又不能删除这个用户时就可以使用锁定功能,如:未授权的用户尝试错误密码错误过多可以尝试的用户进行锁定,又如ATM机上取钱时密码错 ...

  4. Jenkins+Maven+Gitlab+Tomcat 自动化构建打包、部署

    一.环境需求 本帖针对的是Linux环境,Windows或其他系统也可借鉴.具体只讲述Jenkins配置以及整个流程的实现. 1.JDK(或JRE)及Java环境变量配置,我用的是JDK1.8.0_1 ...

  5. -1-6 java 异常简单介绍 java异常 异常体系 Throwable 分类 throws和throw 异常处理 自定义异常

      异常   异常:异常就是Java程序在运行过程中出现的错误. 异常由来:问题也是现实生活中一个具体事务,也可以通过java 的类的形式进行描述,并封装成对象. 其实就是Java对不正常情况进行描述 ...

  6. Spring Boot 2.x (八):日志框架的使用

    我们为啥要用日志? 最初我们开始接触Java的时候,我们通常会使用System.out.println()将我们想要知道的信息打印到控制台. 但是,如果在服务器上我们去运行我们的Java程序,这个时候 ...

  7. Spring Boot 2.x(四):整合Mybatis的四种方式

    前言 目前的大环境下,使用Mybatis作为持久层框架还是占了绝大多数的,下面我们来说一下使用Mybatis的几种姿势. 姿势一:零配置注解开发 第一步:引入依赖 首先,我们需要在pom文件中添加依赖 ...

  8. Perl系列文章

    0.Perl书籍推荐 Perl书籍下载 密码:kkqx 下面是一些我学习Perl过程中读过完整的或部分章节的觉得好的书. 入门级别1:<Perl语言入门>即小骆驼 入门级别2:<In ...

  9. 解读经典《C#高级编程》第七版 Page94-100.继承.Chapter4

    前言 今天,我们开始进入第四章的解读.本章讲的是继承.要做稍微复杂一些的开发,便不可避免的会使用到继承.本篇文章我们主要解读"实现继承". 另外,从本文开始,我开始使用Markdo ...

  10. 表达式目录树(Expression)

    一:什么是表达式树 Expression我们称为是表达式树,是一种数据结构体,用于存储需要计算,运算的一种结构,这种结构可以只是存储,而不进行运算.通常表达式目录树是配合Lambda一起来使用的,la ...