Avro User Guide
1. Overview
Data serialization is the process of converting data into a binary or text format so that it can be stored or transmitted. Several systems exist for this purpose, and Apache Avro is one of them.
Avro is a language-independent, schema-based data serialization library. It uses a schema to perform serialization and deserialization, and the schema itself is written in JSON, which makes the data structure easy to read and to share between programs.
In this tutorial, we'll explore the Avro setup, the Java API for serialization and deserialization, and briefly how Avro compares with other data serialization systems.
We'll focus primarily on schema creation, which is the base of the whole system.
2. Apache Avro
Avro is a language-independent serialization library. To achieve this, Avro relies on a schema, which is one of its core components. It stores the schema in a file for further data processing.
Avro is a good fit for Big Data processing. It's popular in the Hadoop and Kafka ecosystems because its compact format is fast to process.
Avro creates a data file in which it keeps the data along with the schema in a metadata section. It also provides rich data structures, which is one reason it's widely adopted.
To use Avro for serialization, we need to follow the steps described in the sections below.
3. Problem Statement
Let's start by defining a class called AvroHttpRequest that we'll use for our examples. The class contains primitive as well as complex type attributes:
class AvroHttpRequest {
    private long requestTime;
    private ClientIdentifier clientIdentifier;
    private List<String> employeeNames;
    private Active active;
}
Here, requestTime is a primitive value. ClientIdentifier is another class, which represents a complex type. We also have employeeNames, which is again a complex type. Active is an enum that describes whether the given list of employees is active or not.
Our objective is to serialize and deserialize the AvroHttpRequest class using Apache Avro.
4. Avro Data Types
Before proceeding further, let's discuss the data types supported by Avro.
Avro supports two types of data:
- Primitive types: Avro supports eight primitive types (null, boolean, int, long, float, double, bytes, and string). We use the primitive type name to define the type of a given field. For example, a value that holds a String should be declared as {"type": "string"} in the schema
- Complex types: Avro supports six kinds of complex types: records, enums, arrays, maps, unions, and fixed
For example, in our problem statement, ClientIdentifier is a record.
In that case, the schema for ClientIdentifier should look like this:
{
   "type":"record",
   "name":"ClientIdentifier",
   "namespace":"com.baeldung.avro",
   "fields":[
      {
         "name":"hostName",
         "type":"string"
      },
      {
         "name":"ipAddress",
         "type":"string"
      }
   ]
}
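As a minimal sketch, the JSON above can be loaded with Avro's Schema.Parser and inspected at runtime. The class name ClientIdentifierSchemaDemo and the helper parseClientIdentifier are made up for this illustration; only the schema itself comes from the text.

```java
import org.apache.avro.Schema;

public class ClientIdentifierSchemaDemo {

    // The ClientIdentifier schema from the text, inlined as a string
    static final String SCHEMA_JSON =
        "{\"type\":\"record\","
      + "\"name\":\"ClientIdentifier\","
      + "\"namespace\":\"com.baeldung.avro\","
      + "\"fields\":["
      + "{\"name\":\"hostName\",\"type\":\"string\"},"
      + "{\"name\":\"ipAddress\",\"type\":\"string\"}"
      + "]}";

    // Hypothetical helper: parses the inlined JSON into a Schema object
    static Schema parseClientIdentifier() {
        // A fresh parser per call, since a parser remembers the names it has already defined
        return new Schema.Parser().parse(SCHEMA_JSON);
    }

    public static void main(String[] args) {
        Schema schema = parseClientIdentifier();
        System.out.println(schema.getFullName());
        // List every field together with its Avro type
        for (Schema.Field field : schema.getFields()) {
            System.out.println(field.name() + " -> " + field.schema().getType());
        }
    }
}
```

The parser rejects malformed schemas with an exception, so loading a schema this way is also a cheap validity check.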
5. Using Avro
To start with, let's add the Maven dependencies we'll need to our pom.xml file.
We should include the following dependencies:
- Apache Avro – core components
- Compiler – Apache Avro compilers for Avro IDL and the Avro Specific Java API
- Tools – which includes Apache Avro command line tools and utilities
- Apache Avro Maven Plugin for Maven projects
We're using version 1.8.2 for this tutorial.
However, it's always advised to find the latest version on [Maven Central](https://search.maven.org/classic/#search|ga|1|a%3A"avro" AND g%3A"org.apache.avro"):
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-compiler</artifactId>
    <version>1.8.2</version>
</dependency>
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.8.2</version>
</dependency>
After adding the Maven dependencies, the next steps will be:
- Creating the schema
- Reading the schema in our program
- Serializing our data using Avro
- Finally, deserializing the data
6. Schema Creation
Avro describes its schema using a JSON format. There are four main attributes for a given Avro schema:
- Type – describes the type of the schema, whether it's a complex type or a primitive value
- Namespace – describes the namespace to which the given schema belongs
- Name – the name of the schema
- Fields – describes the fields associated with a given schema. Fields can be of primitive as well as complex types.
One way of creating the schema is to write the JSON representation, as we saw in the previous sections.
We can also create a schema using SchemaBuilder, which is often a more convenient and less error-prone way to create it.
6.1. SchemaBuilder Utility
The class org.apache.avro.SchemaBuilder is useful for creating the Schema.
First of all, let's create the schema for ClientIdentifier:
Schema clientIdentifier = SchemaBuilder.record("ClientIdentifier")
  .namespace("com.baeldung.avro")
  .fields().requiredString("hostName").requiredString("ipAddress")
  .endRecord();
Now, let's use this for creating an avroHttpRequest schema:
Schema avroHttpRequest = SchemaBuilder.record("AvroHttpRequest")
  .namespace("com.baeldung.avro")
  .fields().requiredLong("requestTime")
  .name("clientIdentifier")
    .type(clientIdentifier)
    .noDefault()
  .name("employeeNames")
    .type()
    .array()
    .items()
    .stringType()
    // no default: null is not a valid default value for an array type
    .noDefault()
  .name("active")
    .type()
    .enumeration("Active")
    .symbols("YES","NO")
    .noDefault()
  .endRecord();
It's important to note here that we've passed clientIdentifier as the type for the clientIdentifier field. The clientIdentifier used to define the type is the same schema we created before.
Later, we can call the toString method to get the JSON representation of the schema.
Schema files are saved with the .avsc extension. Let's save our generated schema to the "src/main/resources/avroHttpRequest-schema.avsc" file.
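The build-then-save step can be sketched as follows. This is only an illustration: the class name SchemaFileDemo and the helpers buildAvroHttpRequestSchema and saveSchema are hypothetical, and the sketch declares employeeNames with noDefault() as a simplifying choice of its own.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class SchemaFileDemo {

    // Rebuilds the AvroHttpRequest schema with SchemaBuilder
    static Schema buildAvroHttpRequestSchema() {
        Schema clientIdentifier = SchemaBuilder.record("ClientIdentifier")
            .namespace("com.baeldung.avro")
            .fields().requiredString("hostName").requiredString("ipAddress")
            .endRecord();
        return SchemaBuilder.record("AvroHttpRequest")
            .namespace("com.baeldung.avro")
            .fields().requiredLong("requestTime")
            .name("clientIdentifier").type(clientIdentifier).noDefault()
            .name("employeeNames").type().array().items().stringType().noDefault()
            .name("active").type().enumeration("Active").symbols("YES", "NO").noDefault()
            .endRecord();
    }

    // Writes the pretty-printed JSON form of the schema to the given path
    static void saveSchema(Schema schema, Path target) throws IOException {
        if (target.getParent() != null) {
            Files.createDirectories(target.getParent());
        }
        // toString(true) produces indented JSON, which is easier to review in an .avsc file
        Files.write(target, schema.toString(true).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path target = Paths.get("src/main/resources/avroHttpRequest-schema.avsc");
        Schema schema = buildAvroHttpRequestSchema();
        saveSchema(schema, target);
        // Reading the file back should yield an equivalent schema
        Schema reloaded = new Schema.Parser().parse(target.toFile());
        System.out.println("Round trip equal: " + reloaded.equals(schema));
    }
}
```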
7. Reading the Schema
Reading a schema is more or less about creating Avro classes for the given schema. Once the Avro classes are created, we can use them to serialize and deserialize objects.
There are two ways to create Avro classes:
- Programmatically generating Avro classes: classes can be generated using SpecificCompiler from the avro-compiler module. There are a couple of APIs which we can use for generating Java classes. We can find the code for generating classes on GitHub.
- Using Maven to generate classes
There's a Maven plugin that does the job well. We need to include the plugin and run mvn clean install.
Let's add the plugin to our pom.xml file:
<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>${avro.version}</version>
    <executions>
        <execution>
            <id>schemas</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
                <goal>protocol</goal>
                <goal>idl-protocol</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>
8. Serialization and Deserialization with Avro
Now that we've generated the schema and the classes, let's continue with the serialization part.
Avro supports two data serialization formats: JSON and binary.
First, we'll focus on the JSON format and then we'll discuss the binary format.
Before proceeding further, we should go through a few key interfaces. We can use the interfaces and classes below for serialization:
DatumWriter: We use this to write data according to a given schema. We'll be using the SpecificDatumWriter implementation in our example; however, DatumWriter has other implementations as well: GenericDatumWriter, Json.Writer, ProtobufDatumWriter, ReflectDatumWriter, and ThriftDatumWriter.
Encoder: An Encoder is used for defining the output format, as previously mentioned. EncoderFactory provides two types of encoders: a binary encoder and a JSON encoder.
DatumReader: The single interface for deserialization. Again, it has multiple implementations, but we'll be using SpecificDatumReader in our example. Other implementations are GenericDatumReader, Json.ObjectReader, Json.Reader, ProtobufDatumReader, ReflectDatumReader, and ThriftDatumReader.
Decoder: A Decoder is used while deserializing the data. DecoderFactory provides two types of decoders: a binary decoder and a JSON decoder.
Next, let's see how serialization and deserialization happen in Avro.
8.1. Serialization
We'll take the example of AvroHttpRequest class and try to serialize it using Avro.
First of all, let's serialize it in JSON format:
public byte[] serealizeAvroHttpRequestJSON(
  AvroHttpRequest request) {

    DatumWriter<AvroHttpRequest> writer = new SpecificDatumWriter<>(
      AvroHttpRequest.class);
    byte[] data = new byte[0];
    ByteArrayOutputStream stream = new ByteArrayOutputStream();
    Encoder jsonEncoder = null;
    try {
        // The JSON encoder needs the schema to write field names
        jsonEncoder = EncoderFactory.get().jsonEncoder(
          AvroHttpRequest.getClassSchema(), stream);
        writer.write(request, jsonEncoder);
        // Flush buffered output into the stream before reading the bytes
        jsonEncoder.flush();
        data = stream.toByteArray();
    } catch (IOException e) {
        logger.error("Serialization error:" + e.getMessage());
    }
    // An empty array is returned if serialization failed
    return data;
}
Let's have a look at a test case for this method:
@Test
public void whenSerialized_UsingJSONEncoder_ObjectGetsSerialized(){
    byte[] data = serealizer.serealizeAvroHttpRequestJSON(request);
    assertTrue(Objects.nonNull(data));
    assertTrue(data.length > 0);
}
Here, we've used the jsonEncoder method and passed the schema to it.
If we want to use a binary encoder instead, we need to replace the jsonEncoder() call with binaryEncoder():
Encoder jsonEncoder = EncoderFactory.get().binaryEncoder(stream, null);
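To make the difference between the two encoders concrete, here is a small sketch that writes the same ClientIdentifier record with both. It uses GenericRecord and GenericDatumWriter instead of the generated AvroHttpRequest class so that it runs without code generation. The class name EncoderComparisonDemo, the helpers sampleRecord and encode, and the sample host values are assumptions made for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class EncoderComparisonDemo {

    static final Schema SCHEMA = SchemaBuilder.record("ClientIdentifier")
        .namespace("com.baeldung.avro")
        .fields().requiredString("hostName").requiredString("ipAddress")
        .endRecord();

    // Hypothetical sample data, not taken from the tutorial
    static GenericRecord sampleRecord() {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("hostName", "localhost");
        record.put("ipAddress", "127.0.0.1");
        return record;
    }

    // Serializes the record with either the JSON or the binary encoder
    static byte[] encode(GenericRecord record, boolean asJson) throws IOException {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        Encoder encoder = asJson
            ? EncoderFactory.get().jsonEncoder(SCHEMA, stream)
            : EncoderFactory.get().binaryEncoder(stream, null);
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);
        writer.write(record, encoder);
        // Both encoders buffer, so flush before reading the stream
        encoder.flush();
        return stream.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] json = encode(sampleRecord(), true);
        byte[] binary = encode(sampleRecord(), false);
        System.out.println(new String(json, StandardCharsets.UTF_8));
        System.out.println("JSON bytes: " + json.length + ", binary bytes: " + binary.length);
    }
}
```

The binary form carries no field names, only the values in schema order, which is why it comes out smaller and why a reader needs the schema to interpret it.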
8.2. Deserialization
To do this, we'll be using the above-mentioned DatumReader and Decoder interfaces.
As we used EncoderFactory to get an Encoder, similarly we'll use DecoderFactory to get a Decoder object.
Let's deserialize the data using the JSON format:
public AvroHttpRequest deSerealizeAvroHttpRequestJSON(byte[] data) {
    DatumReader<AvroHttpRequest> reader
      = new SpecificDatumReader<>(AvroHttpRequest.class);
    Decoder decoder = null;
    try {
        // Decode the bytes as UTF-8, the charset used by the JSON encoder
        decoder = DecoderFactory.get().jsonDecoder(
          AvroHttpRequest.getClassSchema(),
          new String(data, StandardCharsets.UTF_8));
        return reader.read(null, decoder);
    } catch (IOException e) {
        logger.error("Deserialization error:" + e.getMessage());
    }
    // Reached only when decoding failed
    return null;
}
And let's see the test case:
@Test
public void whenDeserializeUsingJSONDecoder_thenActualAndExpectedObjectsAreEqual(){
    byte[] data = serealizer.serealizeAvroHttpRequestJSON(request);
    AvroHttpRequest actualRequest = deSerealizer
      .deSerealizeAvroHttpRequestJSON(data);
    // JUnit takes the expected value first, then the actual one
    assertEquals(request, actualRequest);
    assertTrue(actualRequest.getRequestTime()
      .equals(request.getRequestTime()));
}
Similarly, we can use a binary decoder:
Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
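For completeness, a binary round trip might look like the sketch below. It again relies on GenericRecord, GenericDatumWriter, and GenericDatumReader rather than the generated classes, purely so the example is self-contained. BinaryRoundTripDemo, serialize, and deserialize are hypothetical names, and the record values are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class BinaryRoundTripDemo {

    static final Schema SCHEMA = SchemaBuilder.record("ClientIdentifier")
        .namespace("com.baeldung.avro")
        .fields().requiredString("hostName").requiredString("ipAddress")
        .endRecord();

    // Writes the record in Avro's binary format
    static byte[] serialize(GenericRecord record) throws IOException {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(stream, null);
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(SCHEMA);
        writer.write(record, encoder);
        encoder.flush();
        return stream.toByteArray();
    }

    // Reads the bytes back; the reader must be given the schema used for writing
    static GenericRecord deserialize(byte[] data) throws IOException {
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(SCHEMA);
        Decoder decoder = DecoderFactory.get().binaryDecoder(data, null);
        return reader.read(null, decoder);
    }

    public static void main(String[] args) throws IOException {
        GenericRecord original = new GenericData.Record(SCHEMA);
        original.put("hostName", "localhost");
        original.put("ipAddress", "127.0.0.1");

        GenericRecord decoded = deserialize(serialize(original));
        // String fields come back as Avro Utf8 objects, so compare via toString()
        System.out.println(decoded.get("hostName").toString());
        System.out.println(decoded.get("ipAddress").toString());
    }
}
```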
9. Conclusion
Apache Avro is especially useful when dealing with big data. It offers data serialization in both binary and JSON formats, and we can choose between them depending on the use case.
The Avro serialization process is fast and space-efficient. Avro doesn't store field type information with each field; instead, it keeps that metadata in the schema.
Last but not least, Avro has bindings for a wide range of programming languages, which gives it an edge.