Key implementation points of the kafka-connect-hive sink
The kafka-connect-hive sink plugin writes data into Hive tables as either ORC or Parquet files. The connector periodically polls data from Kafka and writes it to HDFS; the data from each Kafka topic is partitioned by the configured partition fields and divided into chunks, and each chunk is represented as an HDFS file whose name is built from the topic name, the partition number and the offset. If no partitioning is specified in the configuration, the default partitioning is used. The size of each chunk is determined by the length of the file already written to HDFS, the time spent writing to HDFS, and the number of records not yet written to HDFS.
While reading the plugin's source code I found a lot worth learning, so I am summarizing it here for future reference.
1. Partitioning strategies
The plugin supports two partitioning strategies (a short selection sketch follows the list):
STRICT: requires that all partitions have already been created
DYNAMIC: creates partitions according to the fields specified by PARTITIONBY
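For illustration only, here is a minimal sketch of how the configured policy could be mapped onto one of the two handlers. The PartitionPolicySelector helper and its string argument are hypothetical; the real connector derives the policy from its own configuration parsing:

import com.landoop.streamreactor.connect.hive.sink.partitioning.{DynamicPartitionHandler, PartitionHandler, StrictPartitionHandler}

// Hypothetical helper: pick a PartitionHandler from a policy name.
object PartitionPolicySelector {
  def fromConfig(policy: String): PartitionHandler = policy.trim.toUpperCase match {
    case "STRICT"  => StrictPartitionHandler         // every partition must already exist in the metastore
    case "DYNAMIC" => new DynamicPartitionHandler()  // missing partitions are created on the fly
    case other     => throw new IllegalArgumentException(s"Unknown partitioning policy: $other")
  }
}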
The STRICT strategy
The implementation, with comments, is as follows:
package com.landoop.streamreactor.connect.hive.sink.partitioning

import com.landoop.streamreactor.connect.hive.{DatabaseName, Partition, TableName}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hive.metastore.IMetaStoreClient

import scala.collection.JavaConverters._
import scala.util.control.NonFatal
import scala.util.{Failure, Success, Try}

/**
 * A [[PartitionHandler]] that requires any partition
 * to already exist in the metastore.
 */
object StrictPartitionHandler extends PartitionHandler {

  override def path(partition: Partition,
                    db: DatabaseName,
                    tableName: TableName)
                   (client: IMetaStoreClient,
                    fs: FileSystem): Try[Path] = {
    try {
      // Look up the partition's storage location in the Hive metastore and return it on success
      val part = client.getPartition(db.value, tableName.value, partition.entries.map(_._2).toList.asJava)
      Success(new Path(part.getSd.getLocation))
    } catch {
      // The partition does not exist, so fail: the strict policy requires it to be created up front
      case NonFatal(e) =>
        Failure(new RuntimeException(s"Partition '${partition.entries.map(_._2).toList.mkString(",")}' does not exist and strict policy requires upfront creation", e))
    }
  }
}
The DYNAMIC strategy
The implementation, with comments, is as follows:
package com.landoop.streamreactor.connect.hive.sink.partitioning

import com.landoop.streamreactor.connect.hive.{DatabaseName, Partition, TableName}
import com.typesafe.scalalogging.slf4j.StrictLogging
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hive.metastore.IMetaStoreClient
import org.apache.hadoop.hive.metastore.api.{StorageDescriptor, Table}

import scala.collection.JavaConverters._
import scala.util.{Failure, Success, Try}

/**
 * A [[PartitionHandler]] that creates partitions
 * on the fly as required.
 *
 * The path of the partition is determined by the given
 * [[PartitionPathPolicy]] parameter. By default this will
 * be an implementation that uses the standard hive
 * paths of key1=value1/key2=value2.
 */
class DynamicPartitionHandler(pathPolicy: PartitionPathPolicy = DefaultMetastorePartitionPathPolicy)
  extends PartitionHandler with StrictLogging {

  override def path(partition: Partition,
                    db: DatabaseName,
                    tableName: TableName)
                   (client: IMetaStoreClient,
                    fs: FileSystem): Try[Path] = {

    def table: Table = client.getTable(db.value, tableName.value)

    def create(path: Path, table: Table): Unit = {
      logger.debug(s"New partition will be created at $path")

      // Copy the table's storage descriptor and point it at the new partition location
      val sd = new StorageDescriptor(table.getSd)
      sd.setLocation(path.toString)

      val params = new java.util.HashMap[String, String]
      // Collect the partition key values and a creation timestamp (seconds)
      val values = partition.entries.map(_._2).toList.asJava
      val ts = (System.currentTimeMillis / 1000).toInt

      // Build the new partition and register it in the metastore
      val p = new org.apache.hadoop.hive.metastore.api.Partition(values, db.value, tableName.value, ts, 0, sd, params)
      logger.debug(s"Updating hive metastore with partition $p")
      client.add_partition(p)
      logger.info(s"Partition has been created in metastore [$partition]")
    }

    // Try to fetch the partition from the metastore
    Try(client.getPartition(db.value, tableName.value, partition.entries.toList.map(_._2).asJava)) match {
      case Success(p) => Try {
        // The partition already exists, so return its location
        new Path(p.getSd.getLocation)
      }
      case Failure(_) => Try {
        // The partition is missing: derive a path from the path policy, create it, and return the path
        val t = table
        val tableLocation = new Path(t.getSd.getLocation)
        val path = pathPolicy.path(tableLocation, partition)
        create(path, t)
        path
      }
    }
  }
}
This strategy creates partitions using the standard Hive partition-path convention, i.e. partitionField=partitionValue.
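To make the key1=value1/key2=value2 convention concrete, here is a small, self-contained sketch. It is not the plugin's actual DefaultMetastorePartitionPathPolicy; it assumes the partition entries are available as plain key/value string pairs:

import org.apache.hadoop.fs.Path

// Illustrative only: build a Hive-style partition path under the table location,
// e.g. /warehouse/mydb.db/mytable/year=2019/month=02 for entries Seq("year" -> "2019", "month" -> "02").
object HiveStylePartitionPath {
  def apply(tableLocation: Path, entries: Seq[(String, String)]): Path =
    entries.foldLeft(tableLocation) { case (parent, (key, value)) =>
      new Path(parent, s"$key=$value")
    }
}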
2. File naming and size control
The connector polls data from Kafka and writes it to HDFS; the data from each Kafka topic is partitioned by the configured partition fields and divided into chunks, and each chunk is represented as an HDFS file. Two details are involved here:
how files are named
how files are chunked, and how file size and count are controlled
Let's look at the relevant implementations one by one. The file-naming part is implemented as follows:
package com.landoop.streamreactor.connect.hive.sink.staging

import com.landoop.streamreactor.connect.hive.{Offset, Topic}

import scala.util.Try

trait FilenamePolicy {
  val prefix: String
}

object DefaultFilenamePolicy extends FilenamePolicy {
  val prefix = "streamreactor"
}

object CommittedFileName {

  private val Regex = s"(.+)_(.+)_(\\d+)_(\\d+)_(\\d+)".r

  def unapply(filename: String): Option[(String, Topic, Int, Offset, Offset)] = {
    filename match {
      case Regex(prefix, topic, partition, start, end) =>
        // Return the prefix, topic name, partition number, and the start and end offsets
        Try((prefix, Topic(topic), partition.toInt, Offset(start.toLong), Offset(end.toLong))).toOption
      case _ => None
    }
  }
}
From the code above we can see that the file name is built from the prefix, the topic name, the partition number and the offset; note that the CommittedFileName regex actually captures five groups — prefix, topic, partition, start offset and end offset — so a committed file records the offset range it covers. For example, if the file prefix is streamreactor, the topic name is hive_sink_orc, the partition number is 0 and the current maximum offset is 1168, the resulting file name is streamreactor_hive_sink_orc_0_1168.
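As a quick illustration of the extractor (the file name below is made up, and uses a topic without underscores so the regex groups resolve unambiguously):

import com.landoop.streamreactor.connect.hive.sink.staging.CommittedFileName

// Pattern-match a committed file name back into its components.
"streamreactor_mytopic_0_100_200" match {
  case CommittedFileName(prefix, topic, partition, start, end) =>
    // prefix = "streamreactor", topic = Topic("mytopic"), partition = 0,
    // start = Offset(100), end = Offset(200)
    println(s"$prefix $topic $partition $start $end")
  case other =>
    println(s"Not a committed file name: $other")
}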
Next, let's look at how file size is controlled. It is mainly determined by three configuration options of the sink plugin:
WITH_FLUSH_INTERVAL: long; the interval, in milliseconds, at which files are committed
WITH_FLUSH_SIZE: long; the length, in bytes, that the file written to HDFS must reach before a commit is performed
WITH_FLUSH_COUNT: long; the number of records not yet committed to HDFS that triggers a commit; each message counts as one record
These parameters are used by the CommitPolicy trait; the trait and its default implementation are shown below:
package com.landoop.streamreactor.connect.hive.sink.staging

import com.landoop.streamreactor.connect.hive.TopicPartitionOffset
import com.typesafe.scalalogging.slf4j.StrictLogging
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.kafka.connect.data.Struct

import scala.concurrent.duration.FiniteDuration

/**
 * The [[CommitPolicy]] is responsible for determining when
 * a file should be flushed (closed on disk, and moved to be visible).
 *
 * Typical implementations will flush based on number of records,
 * file size, or time since the file was opened.
 */
trait CommitPolicy {

  /**
   * This method is invoked after a file has been written.
   *
   * If the output file should be committed at this time, then this
   * method should return true, otherwise false.
   *
   * Once a commit has taken place, a new file will be opened
   * for the next record.
   *
   * @param tpo   the [[TopicPartitionOffset]] of the last record written
   * @param path  the path of the file that the struct was written to
   * @param count the number of records written thus far to the file
   */
  def shouldFlush(struct: Struct, tpo: TopicPartitionOffset, path: Path, count: Long)
                 (implicit fs: FileSystem): Boolean
}

/**
 * Default implementation of [[CommitPolicy]] that will flush the
 * output file under the following circumstances:
 * - file size reaches the limit
 * - time since the file was created reaches the limit
 * - number of records written reaches the limit
 *
 * @param interval the flush interval, in millis
 */
case class DefaultCommitPolicy(fileSize: Option[Long],
                               interval: Option[FiniteDuration],
                               fileCount: Option[Long]) extends CommitPolicy with StrictLogging {

  require(fileSize.isDefined || interval.isDefined || fileCount.isDefined)

  override def shouldFlush(struct: Struct, tpo: TopicPartitionOffset, path: Path, count: Long)
                          (implicit fs: FileSystem): Boolean = {
    // stat.getLen is the current file length in bytes;
    // stat.getModificationTime is the last modification time in milliseconds
    val stat = fs.getFileStatus(path)
    val open_time = System.currentTimeMillis() - stat.getModificationTime // how long the file has been open
    fileSize.exists(_ <= stat.getLen) || interval.exists(_.toMillis <= open_time) || fileCount.exists(_ <= count)
  }
}
Let's walk through the logic of DefaultCommitPolicy:
It first fetches the status of the file on HDFS and computes how long the file has been open, then uses exists to evaluate the following three conditions:
fileSize.exists(_ <= stat.getLen): has the length of the file written so far, stat.getLen, reached the configured size threshold fileSize?
interval.exists(_.toMillis <= open_time): has the time the file has been open, open_time, reached the configured interval threshold?
fileCount.exists(_ <= count): has the number of records written but not yet committed to HDFS, count, reached the configured record-count threshold fileCount?
If any one of these conditions holds, shouldFlush returns true and a flush is performed, moving the file into its target directory on HDFS. This keeps both the size and the number of files under control and avoids producing a large number of small files.
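Restating the flush condition outside of the HDFS types may help; the following standalone function is illustrative only and takes plain parameters in place of the FileStatus fields:

import scala.concurrent.duration._

// Flush when any configured threshold has been reached.
def shouldFlush(fileSize: Option[Long],           // size threshold in bytes
                interval: Option[FiniteDuration], // maximum time the file may stay open
                fileCount: Option[Long],          // record-count threshold
                currentLength: Long,              // bytes written so far (stat.getLen)
                openMillis: Long,                 // millis since the file was last modified
                recordCount: Long): Boolean =     // records written to the file so far
  fileSize.exists(_ <= currentLength) ||
    interval.exists(_.toMillis <= openMillis) ||
    fileCount.exists(_ <= recordCount)

// For example, with a 1 MB size limit already exceeded this returns true:
// shouldFlush(Some(1000000L), Some(10.minutes), Some(5000L), 1048576L, 30000L, 1200L)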
3. Error-handling strategies
If exceptions are handled poorly, service availability suffers directly and the damage can be hard to predict. By default, kafka-connect simply rethrows exceptions raised while reading or writing data, and exceptions of this kind tend to stop the task doing the work. An example is shown below:
[2019-02-25 11:03:56,170] ERROR WorkerSinkTask{id=hive-sink-example-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:177)
MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Operation timed out (Connection timed out)
at org.apache.thrift.transport.TSocket.open(TSocket.java:226)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:477)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:285)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:210)
at com.landoop.streamreactor.connect.hive.sink.HiveSinkTask.start(HiveSinkTask.scala:56)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:302)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:191)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Operation timed out (Connection timed out)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.thrift.transport.TSocket.open(TSocket.java:221)
... 13 more
)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:525)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:285)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:210)
at com.landoop.streamreactor.connect.hive.sink.HiveSinkTask.start(HiveSinkTask.scala:56)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:302)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:191)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:175)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:219)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2019-02-25 11:03:56,172] ERROR WorkerSinkTask{id=hive-sink-example-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:178)
From the exception above we can see that the task was killed because the connection to the Hive metastore timed out, and it will not recover until it is restarted manually. This is only one of the exceptions kafka-connect can hit at runtime; for failures that can easily stop a task like this, an error-handling strategy needs to be configured. The sink plugin defines three strategies:
NOOP: when an error occurs, ignore it and keep working
THROW: when an error occurs, rethrow it, which stops the service
RETRY: when an error occurs, retry; a maximum number of retries must be configured to avoid retrying forever
Based on these three strategies, the related implementation classes used by the sink plugin are as follows:
/*
 * Copyright 2017 Datamountaineer.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.datamountaineer.streamreactor.connect.errors

import java.util.Date

import com.datamountaineer.streamreactor.connect.errors.ErrorPolicyEnum.ErrorPolicyEnum
import com.typesafe.scalalogging.slf4j.StrictLogging
import org.apache.kafka.connect.errors.RetriableException

/**
 * Created by andrew@datamountaineer.com on 19/05/16.
 * kafka-connect-common
 */
object ErrorPolicyEnum extends Enumeration {
  type ErrorPolicyEnum = Value
  val NOOP, THROW, RETRY = Value
}

case class ErrorTracker(retries: Int, maxRetries: Int, lastErrorMessage: String, lastErrorTimestamp: Date, policy: ErrorPolicy)

trait ErrorPolicy extends StrictLogging {
  def handle(error: Throwable, sink: Boolean = true, retryCount: Int = 0)
}

object ErrorPolicy extends StrictLogging {
  def apply(policy: ErrorPolicyEnum): ErrorPolicy = {
    policy match {
      case ErrorPolicyEnum.NOOP => NoopErrorPolicy()
      case ErrorPolicyEnum.THROW => ThrowErrorPolicy()
      case ErrorPolicyEnum.RETRY => RetryErrorPolicy()
    }
  }
}

/**
 * NOOP policy: log the error and carry on.
 */
case class NoopErrorPolicy() extends ErrorPolicy {
  override def handle(error: Throwable, sink: Boolean = true, retryCount: Int = 0) {
    logger.warn(s"Error policy NOOP: ${error.getMessage}. Processing continuing.")
  }
}

/**
 * THROW policy: wrap and rethrow the error, which stops the task.
 */
case class ThrowErrorPolicy() extends ErrorPolicy {
  override def handle(error: Throwable, sink: Boolean = true, retryCount: Int = 0) {
    throw new RuntimeException(error)
  }
}

/**
 * RETRY policy: throw a RetriableException while retries remain,
 * otherwise give up and rethrow as a RuntimeException.
 */
case class RetryErrorPolicy() extends ErrorPolicy {
  override def handle(error: Throwable, sink: Boolean = true, retryCount: Int) = {
    if (retryCount == 0) {
      throw new RuntimeException(error)
    }
    else {
      logger.warn(s"Error policy set to RETRY. Remaining attempts $retryCount")
      throw new RetriableException(error)
    }
  }
}
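A hedged sketch of how a caller might drive these policies; the retry bookkeeping below is illustrative, whereas the real connectors track retries through their own ErrorTracker and Kafka Connect's retry mechanism:

import com.datamountaineer.streamreactor.connect.errors.{ErrorPolicy, ErrorPolicyEnum}
import org.apache.kafka.connect.errors.RetriableException

object ErrorHandlingSketch {
  // Illustrative only: apply the configured policy to a failure and spend one retry attempt
  // each time a RetriableException is raised, letting Kafka Connect redeliver the batch.
  val policy: ErrorPolicy = ErrorPolicy(ErrorPolicyEnum.RETRY)
  var remainingRetries = 3

  def onError(t: Throwable): Unit =
    try {
      policy.handle(t, sink = true, retryCount = remainingRetries)
    } catch {
      case e: RetriableException =>
        remainingRetries -= 1
        throw e
    }
}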
4. Summary
When building a data-synchronization plugin on top of kafka-connect, make as much use of Kafka's topic information as possible and handle exceptions appropriately; only then can the plugin stay extensible and highly available.