1. Goal

Read data from Kafka with Flink and write it to a Hive table in real time.

2. Environment

EMR environment: Hadoop 3.3.3, Hive 3.1.3, Flink 1.16.0

 

According to the official documentation:

https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/table/hive/overview/

Flink 1.16.0 supports Hive 3.1.3. For development, add the following dependencies:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-hive_2.12</artifactId>
    <version>1.16.0</version>
    <scope>provided</scope>
</dependency>

<!-- Hive dependencies -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>3.1.3</version>
</dependency>

3. Hive table

Before reading from or writing to a Hive table, a Hive catalog must be registered:

// set hive dialect
tableEnv.getConfig().setSqlDialect(SqlDialect.HIVE)

// set hive catalog
tableEnv.executeSql("CREATE CATALOG myhive WITH (" +
  "'type' = 'hive'," +
  "'default-database' = 'default'," +
  "'hive-conf-dir' = 'hiveconf'" +
  ")")
tableEnv.executeSql("use catalog myhive")

Then create the Hive table:

// hive table
tableEnv.executeSql("CREATE TABLE IF NOT EXISTS hive_table (" +
"id string," +
"`value` float," +
"hashdata string," +
"num integer," +
"token string," +
"info string," +
"ts timestamp " +
") " +
"PARTITIONED BY (dt string, hr string) STORED AS ORC TBLPROPERTIES (" +
// "'path'='hive-output'," +
"'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00'," +
"'sink.partition-commit.policy.kind'='metastore,success-file'," +
"'sink.partition-commit.trigger'='partition-time'," +
"'sink.partition-commit.delay'='0 s'" +
" )")

4. Consuming Kafka and writing to the Hive table

See the official documentation:

https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/connectors/datastream/kafka/

Add the corresponding dependency:

<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka</artifactId>
<version>${flink.version}</version>
</dependency>

Reference Flink SQL code:

package com.tang.hive

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.SqlDialect
import org.apache.flink.table.api.bridge.scala.StreamTableEnvironment

object Kafka2Hive {

  /**
   * Create the Hive catalog and the Hive sink table.
   * @param tbl_env     the table environment
   * @param drop        whether to drop the table first
   * @param hiveConfDir directory containing hive-site.xml
   * @param database    default database for the Hive catalog
   * @param tableName   name of the Hive table
   * @param dbLocation  base location of the table (e.g. an S3 path)
   */
  def buildHiveTable(tbl_env: StreamTableEnvironment,
                     drop: Boolean,
                     hiveConfDir: String,
                     database: String,
                     tableName: String,
                     dbLocation: String) = {
    // set hive dialect
    tbl_env.getConfig.setSqlDialect(SqlDialect.HIVE)
    // set hive catalog
    tbl_env.executeSql("CREATE CATALOG myhive WITH (" +
      "'type' = 'hive'," +
      "'default-database' = '" + database + "'," +
      "'hive-conf-dir' = '" + hiveConfDir + "'" +
      ")")
    tbl_env.executeSql("use catalog myhive")
    // whether to drop the hive table first
    if (drop) {
      // drop first
      tbl_env.executeSql("drop table if exists " + tableName)
    }
    val sql = "CREATE TABLE IF NOT EXISTS " + tableName + "(" +
      "id string," +
      "`value` float," +
      "hashdata string," +
      "num integer," +
      "token string," +
      "info string," +
      "ts timestamp " +
      ") " +
      "PARTITIONED BY (dt string, hr string) STORED AS ORC " +
      "LOCATION '" + dbLocation + "/" + tableName + "' TBLPROPERTIES (" +
      "'partition.time-extractor.timestamp-pattern'='$dt $hr:00:00'," +
      "'sink.partition-commit.policy.kind'='metastore,success-file'," +
      "'sink.partition-commit.trigger'='partition-time'," +
      "'sink.partition-commit.watermark-time-zone'='Asia/Shanghai'," +
      "'sink.partition-commit.delay'='0 s'," +
      "'auto-compaction'='true'" +
      " )"
    // hive table
    tbl_env.executeSql(sql)
  }

  /**
   * Create the Kafka source table.
   * @param tbl_env          the table environment
   * @param drop             whether to drop the table first
   * @param bootstrapServers Kafka bootstrap servers
   * @param topic            Kafka topic to consume
   * @param groupId          Kafka consumer group id
   * @param tableName        name of the Kafka table
   */
  def buildKafkaTable(tbl_env: StreamTableEnvironment,
                      drop: Boolean,
                      bootstrapServers: String,
                      topic: String,
                      groupId: String,
                      tableName: String) = {
    // set to default dialect
    tbl_env.getConfig.setSqlDialect(SqlDialect.DEFAULT)
    if (drop) {
      tbl_env.executeSql("drop table if exists " + tableName)
    }
    // kafka table
    tbl_env.executeSql("CREATE TABLE IF NOT EXISTS " + tableName + " (" +
      "id string," +
      "`value` float," +
      "hashdata string," +
      "num integer," +
      "token string," +
      "info string," +
      "created_timestamp bigint," +
      "ts AS TO_TIMESTAMP( FROM_UNIXTIME(created_timestamp) ), " +
      "WATERMARK FOR ts AS ts - INTERVAL '5' SECOND " +
      " )" +
      "with (" +
      " 'connector' = 'kafka'," +
      " 'topic' = '" + topic + "'," +
      " 'properties.bootstrap.servers' = '" + bootstrapServers + "'," +
      " 'properties.group.id' = '" + groupId + "'," +
      " 'scan.startup.mode' = 'latest-offset'," +
      " 'format' = 'json'," +
      " 'json.fail-on-missing-field' = 'false'," +
      " 'json.ignore-parse-errors' = 'true'" +
      ")")
  }

  def main(args: Array[String]): Unit = {
    val senv = StreamExecutionEnvironment.getExecutionEnvironment
    val tableEnv = StreamTableEnvironment.create(senv)
    // set checkpoint (left to the -D flags passed at submission time)
    // senv.enableCheckpointing(60000)
    // senv.getCheckpointConfig.setCheckpointStorage("file://flink-hive-chk")

    // get parameters
    val tool: ParameterTool = ParameterTool.fromArgs(args)
    val hiveConfDir = tool.get("hive.conf.dir", "src/main/resources")
    val database = tool.get("database", "default")
    val hiveTableName = tool.get("hive.table.name", "hive_tbl")
    val kafkaTableName = tool.get("kafka.table.name", "kafka_tbl")
    val bootstrapServers = tool.get("bootstrap.servers", "b-2.cdc.62vm9h.c4.kafka.ap-northeast-1.amazonaws.com:9092,b-1.cdc.62vm9h.c4.kafka.ap-northeast-1.amazonaws.com:9092,b-3.cdc.62vm9h.c4.kafka.ap-northeast-1.amazonaws.com:9092")
    val groupId = tool.get("group.id", "flinkConsumer")
    val reset = tool.getBoolean("tables.reset", false)
    val topic = tool.get("kafka.topic", "cider")
    val hiveDBLocation = tool.get("hive.db.location", "s3://tang-emr-tokyo/flink/kafka2hive/")

    buildHiveTable(tableEnv, reset, hiveConfDir, database, hiveTableName, hiveDBLocation)
    buildKafkaTable(tableEnv, reset, bootstrapServers, topic, groupId, kafkaTableName)

    // select from the kafka table and write to the hive table
    tableEnv.executeSql("insert into " + hiveTableName +
      " select id, `value`, hashdata, num, token, info, ts, DATE_FORMAT(ts, 'yyyy-MM-dd'), DATE_FORMAT(ts, 'HH') from " + kafkaTableName)
  }
}

Format of the data written to Kafka:

{"id": "35f1c5a8-ec19-4dc3-afa5-84ef6bc18bd8", "value": 1327.12, "hashdata": "0822c055f097f26f85a581da2c937895c896200795015e5f9e458889", "num": 3, "token": "800879e1ef9a356cece14e49fb6949c1b8c1862107468dc682d406893944f2b6", "info": "valentine", "created_timestamp": 1690165700}
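
For testing, records in this shape can be generated with a small standalone producer. The sketch below is illustrative only and is not part of the original job; the topic name cider and the broker address are taken from the defaults in the code above, and it assumes the kafka-clients dependency is on the classpath:

package com.tang.hive

import java.util.{Properties, UUID}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Illustrative test-data producer: writes one JSON record matching the schema expected by
// the kafka_tbl definition (created_timestamp is in seconds, as FROM_UNIXTIME expects).
object TestDataProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "b-1.cdc.62vm9h.c4.kafka.ap-northeast-1.amazonaws.com:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val value =
      s"""{"id": "${UUID.randomUUID()}", "value": 1327.12, "hashdata": "demo", """ +
        s""""num": 3, "token": "demo", "info": "valentine", """ +
        s""""created_timestamp": ${System.currentTimeMillis() / 1000}}"""
    producer.send(new ProducerRecord[String, String]("cider", value))
    producer.flush()
    producer.close()
  }
}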


5. Configuration notes

Some of the Hive table settings:

'sink.partition-commit.policy.kind'='metastore,success-file'
=》 How downstream consumers are notified that "the partition data is ready to read" once a partition has been fully written. Currently metastore and success-file are supported.

'sink.partition-commit.trigger'='partition-time'
=》 When to trigger the partition commit. partition-time means the partition is committed once the watermark passes "partition time" + "delay".

'sink.partition-commit.delay'='0 s'
=》 Wait this long before committing the partition.

'sink.partition-commit.watermark-time-zone'='Asia/Shanghai'
=》 The time zone must match that of the data timestamps.

'auto-compaction'='true'
=》 Enable file compaction, merging files before they are committed.

The flush frequency is driven by checkpointing:

senv.enableCheckpointing(60000)

With this configuration, a checkpoint is taken every minute, which flushes the files to S3. The checkpoint also triggers auto-compaction, so in the end one ORC file is produced per minute.
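
For reference, the same checkpoint behaviour can also be configured in code rather than through the -D flags used at submission time (section 6). This is only a minimal sketch mirroring the values of that command; the RocksDB state backend settings are omitted here:

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val senv = StreamExecutionEnvironment.getExecutionEnvironment
// checkpoint every 60 s; each checkpoint flushes/commits files to S3
senv.enableCheckpointing(60000)
senv.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
senv.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
senv.getCheckpointConfig.setCheckpointStorage("s3://tang-emr-tokyo/flink/kafka2hive/checkpoints")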

6. Submitting the job

Per the Flink documentation, flink-table-planner-loader-1.16.0.jar must be moved out of the lib directory and replaced with flink-table-planner_2.12-1.16.0.jar:

cd /usr/lib/flink/lib
sudo mv flink-table-planner-loader-1.16.0.jar ../
sudo wget https://repo1.maven.org/maven2/org/apache/flink/flink-table-planner_2.12/1.16.0/flink-table-planner_2.12-1.16.0.jar
sudo chown flink:flink flink-table-planner_2.12-1.16.0.jar
sudo chmod +x flink-table-planner_2.12-1.16.0.jar

Then run on the master node:

sudo cp /usr/lib/hive/lib/antlr-runtime-3.5.2.jar /usr/lib/flink/lib
sudo cp /usr/lib/hive/lib/hive-exec-3.1.3*.jar /usr/lib/flink/lib
sudo cp /usr/lib/hive/lib/libfb303-0.9.3.jar /usr/lib/flink/lib
sudo cp /usr/lib/flink/opt/flink-connector-hive_2.12-1.16.0.jar /usr/lib/flink/lib
sudo chmod 755 /usr/lib/flink/lib/antlr-runtime-3.5.2.jar
sudo chmod 755 /usr/lib/flink/lib/hive-exec-3.1.3*.jar
sudo chmod 755 /usr/lib/flink/lib/libfb303-0.9.3.jar
sudo chmod 755 /usr/lib/flink/lib/flink-connector-hive_2.12-1.16.0.jar

Upload the Hive configuration file to HDFS:

hdfs dfs -mkdir /user/hadoop/hiveconf/
hdfs dfs -put /etc/hive/conf/hive-site.xml /user/hadoop/hiveconf/hive-site.xml

Submit the job from the EMR master node:

flink run-application \
-t yarn-application \
-c com.tang.hive.Kafka2Hive \
-p 8 \
-D state.backend=rocksdb \
-D state.checkpoint-storage=filesystem \
-D state.checkpoints.dir=s3://tang-emr-tokyo/flink/kafka2hive/checkpoints \
-D execution.checkpointing.interval=60000 \
-D state.checkpoints.num-retained=5 \
-D execution.checkpointing.mode=EXACTLY_ONCE \
-D execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION \
-D state.backend.incremental=true \
-D execution.checkpointing.max-concurrent-checkpoints=1 \
-D rest.flamegraph.enabled=true \
flink-tutorial.jar \
--hive.conf.dir hdfs:///user/hadoop/hiveconf \
--tables.reset true

7. Test results

7.1. Number and size of files

Looking at what is written to the S3-backed Hive table, roughly two files are produced per minute (the default rolling policy caps files at 128 MB, so once that size is exceeded an additional file is written). Files that have not yet been compacted are not visible to downstream readers.
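
If fewer, larger files per checkpoint are preferred, the rolling policy can be tuned. The sketch below is an assumption on my part rather than part of the original setup: it relies on the sink.rolling-policy.* options documented for the Flink filesystem/Hive sink and on the Hive dialect's ALTER TABLE ... SET TBLPROPERTIES support, and the values are examples only:

// Hypothetical tuning, not part of the original job: raise the rolling thresholds so that
// fewer, larger ORC files are produced per checkpoint (defaults: 128MB file size, 30 min rollover).
tableEnv.getConfig.setSqlDialect(SqlDialect.HIVE)
tableEnv.executeSql(
  "ALTER TABLE hive_tbl SET TBLPROPERTIES (" +
    "'sink.rolling-policy.file-size'='256MB'," +
    "'sink.rolling-policy.rollover-interval'='60 min'" +
  ")")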

7.2. Hive partition registration

On the Hive side, new partitions are registered in the Hive metastore automatically after data is written.

S3 path:

Hive partitions:

7.3. Most recent visible data

Querying from Hive shows that downstream can see data up to roughly one minute old:

select current_timestamp, ts from hive_tbl order by ts desc limit 10;

2023-07-27 09:25:24.193 2023-07-27 09:24:24
2023-07-27 09:25:24.193 2023-07-27 09:24:24
2023-07-27 09:25:24.193 2023-07-27 09:24:24
2023-07-27 09:25:24.193 2023-07-27 09:24:24
