Practice of Reading and Writing Hudi with the Flink DataStream API on FusionInsight MRS
Abstract: Hudi's Flink integration currently supports reading and writing data only through Flink SQL, but in real project development some customers need to read and write Hudi with the Flink DataStream API.
This article is shared from the Huawei Cloud community post "Practice of Reading and Writing Hudi with the Flink DataStream API on FusionInsight MRS", by yangxiao_mrs.
Hudi's Flink integration currently supports reading and writing data only through Flink SQL, but in real project development some customers need to read and write Hudi with the Flink DataStream API.
This practice consists of three parts:
1) HoodiePipeline.java: wraps the Hudi core read and write interfaces and exposes them as a Hudi DataStream API.
2) WriteIntoHudi.java: writes data into Hudi with the DataStream API.
3) ReadFromHudi.java: reads Hudi data with the DataStream API.
1. HoodiePipeline.java wraps the Hudi core read and write interfaces and exposes them as a Hudi DataStream API. Key implementation logic:
Step 1: Take the column names, primary key, and partition key set on the Hudi streaming table and concatenate them into a CREATE TABLE statement with a StringBuilder (a sample of the generated DDL is shown right after these steps).
Step 2: Register the Hudi streaming table in the catalog.
Step 3: Convert the DynamicTable into a DataStreamProvider, then produce or consume the data.
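To make step 1 concrete: for the uuid/name/age/ts/p table used in the demos below (primary key uuid, partitioned by p, base path from the demo), the DDL string that getCreateHoodieTableDDL builds and prints to stdout would look roughly like this; every additional entry in the options map is appended as another 'key' = 'value' line inside the with clause:

create table hudiSinkTable(
 uuid VARCHAR(20),
 name VARCHAR(10),
 age INT,
 ts TIMESTAMP(3),
 p VARCHAR(20),
 PRIMARY KEY(uuid) NOT ENFORCED
)
PARTITIONED BY (`p`)
with ('connector' = 'hudi',
 'path' = 'hdfs://hacluster/tmp/flinkHudi/hudiTable'
)

This statement is executed on a temporary TableEnvironment only to register the table and obtain its CatalogTable (step 2); the actual data movement happens through the DataStream sink/source providers (step 3).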
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.ReadableConfig;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.internal.TableEnvironmentImpl;
import org.apache.flink.table.catalog.Catalog;
import org.apache.flink.table.catalog.CatalogTable;
import org.apache.flink.table.catalog.ObjectIdentifier;
import org.apache.flink.table.catalog.ObjectPath;
import org.apache.flink.table.catalog.exceptions.TableNotExistException;
import org.apache.flink.table.connector.sink.DataStreamSinkProvider;
import org.apache.flink.table.connector.source.DataStreamScanProvider;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.factories.DynamicTableFactory;
import org.apache.flink.table.runtime.connector.sink.SinkRuntimeProviderContext;
import org.apache.flink.table.runtime.connector.source.ScanRuntimeProviderContext;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.table.HoodieTableFactory;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * A tool class to construct hoodie flink pipeline.
 *
 * <p>How to use ?</p>
 * Method {@link #builder(String)} returns a pipeline builder. The builder
 * can then define the hudi table columns, primary keys and partitions.
 *
 * <p>An example:</p>
 * <pre>
 *   HoodiePipeline.Builder builder = HoodiePipeline.builder("myTable");
 *   DataStreamSink<?> sinkStream = builder
 *       .column("f0 int")
 *       .column("f1 varchar(10)")
 *       .column("f2 varchar(20)")
 *       .pk("f0,f1")
 *       .partition("f2")
 *       .sink(input, false);
 * </pre>
 */
public class HoodiePipeline {

  /**
   * Returns the builder for hoodie pipeline construction.
   */
  public static Builder builder(String tableName) {
    return new Builder(tableName);
  }

  /**
   * Builder for hudi source/sink pipeline construction.
   */
  public static class Builder {
    private final String tableName;
    private final List<String> columns;
    private final Map<String, String> options;

    private String pk;
    private List<String> partitions;

    private Builder(String tableName) {
      this.tableName = tableName;
      this.columns = new ArrayList<>();
      this.options = new HashMap<>();
      this.partitions = new ArrayList<>();
    }

    /**
     * Add a table column definition.
     *
     * @param column the column format should be in the form like 'f0 int'
     */
    public Builder column(String column) {
      this.columns.add(column);
      return this;
    }

    /**
     * Add primary keys.
     */
    public Builder pk(String... pks) {
      this.pk = String.join(",", pks);
      return this;
    }

    /**
     * Add partition fields.
     */
    public Builder partition(String... partitions) {
      this.partitions = new ArrayList<>(Arrays.asList(partitions));
      return this;
    }

    /**
     * Add a config option.
     */
    public Builder option(ConfigOption<?> option, Object val) {
      this.options.put(option.key(), val.toString());
      return this;
    }

    public Builder option(String key, Object val) {
      this.options.put(key, val.toString());
      return this;
    }

    public Builder options(Map<String, String> options) {
      this.options.putAll(options);
      return this;
    }

    public DataStreamSink<?> sink(DataStream<RowData> input, boolean bounded) {
      TableDescriptor tableDescriptor = getTableDescriptor();
      return HoodiePipeline.sink(input, tableDescriptor.getTableId(), tableDescriptor.getCatalogTable(), bounded);
    }

    public TableDescriptor getTableDescriptor() {
      EnvironmentSettings environmentSettings = EnvironmentSettings
          .newInstance()
          .build();
      TableEnvironmentImpl tableEnv = TableEnvironmentImpl.create(environmentSettings);
      String sql = getCreateHoodieTableDDL(this.tableName, this.columns, this.options, this.pk, this.partitions);
      tableEnv.executeSql(sql);
      String currentCatalog = tableEnv.getCurrentCatalog();
      CatalogTable catalogTable = null;
      String defaultDatabase = null;
      try {
        Catalog catalog = tableEnv.getCatalog(currentCatalog).get();
        defaultDatabase = catalog.getDefaultDatabase();
        catalogTable = (CatalogTable) catalog.getTable(new ObjectPath(defaultDatabase, this.tableName));
      } catch (TableNotExistException e) {
        throw new HoodieException("Create table " + this.tableName + " exception", e);
      }
      ObjectIdentifier tableId = ObjectIdentifier.of(currentCatalog, defaultDatabase, this.tableName);
      return new TableDescriptor(tableId, catalogTable);
    }

    public DataStream<RowData> source(StreamExecutionEnvironment execEnv) {
      TableDescriptor tableDescriptor = getTableDescriptor();
      return HoodiePipeline.source(execEnv, tableDescriptor.tableId, tableDescriptor.getCatalogTable());
    }
  }

  private static String getCreateHoodieTableDDL(
      String tableName,
      List<String> fields,
      Map<String, String> options,
      String pkField,
      List<String> partitionField) {
    StringBuilder builder = new StringBuilder();
    builder.append("create table ")
        .append(tableName)
        .append("(\n");
    for (String field : fields) {
      builder.append(" ")
          .append(field)
          .append(",\n");
    }
    builder.append(" PRIMARY KEY(")
        .append(pkField)
        .append(") NOT ENFORCED\n")
        .append(")\n");
    if (!partitionField.isEmpty()) {
      String partitons = partitionField
          .stream()
          .map(partitionName -> "`" + partitionName + "`")
          .collect(Collectors.joining(","));
      builder.append("PARTITIONED BY (")
          .append(partitons)
          .append(")\n");
    }
    builder.append("with ('connector' = 'hudi'");
    options.forEach((k, v) -> builder
        .append(",\n")
        .append(" '")
        .append(k)
        .append("' = '")
        .append(v)
        .append("'"));
    builder.append("\n)");
    System.out.println(builder.toString());
    return builder.toString();
  }

  /**
   * Returns the data stream sink with given catalog table.
   *
   * @param input        The input datastream
   * @param tablePath    The table path to the hoodie table in the catalog
   * @param catalogTable The hoodie catalog table
   * @param isBounded    A flag indicating whether the input data stream is bounded
   */
  private static DataStreamSink<?> sink(DataStream<RowData> input, ObjectIdentifier tablePath, CatalogTable catalogTable, boolean isBounded) {
    DefaultDynamicTableContext context = new DefaultDynamicTableContext(tablePath, catalogTable,
        Configuration.fromMap(catalogTable.getOptions()), Thread.currentThread().getContextClassLoader(), false);
    HoodieTableFactory hoodieTableFactory = new HoodieTableFactory();
    return ((DataStreamSinkProvider) hoodieTableFactory.createDynamicTableSink(context)
        .getSinkRuntimeProvider(new SinkRuntimeProviderContext(isBounded)))
        .consumeDataStream(input);
  }

  /**
   * Returns the data stream source with given catalog table.
   *
   * @param execEnv      The execution environment
   * @param tablePath    The table path to the hoodie table in the catalog
   * @param catalogTable The hoodie catalog table
   */
  private static DataStream<RowData> source(StreamExecutionEnvironment execEnv, ObjectIdentifier tablePath, CatalogTable catalogTable) {
    DefaultDynamicTableContext context = new DefaultDynamicTableContext(tablePath, catalogTable,
        Configuration.fromMap(catalogTable.getOptions()), Thread.currentThread().getContextClassLoader(), false);
    HoodieTableFactory hoodieTableFactory = new HoodieTableFactory();
    DataStreamScanProvider dataStreamScanProvider = (DataStreamScanProvider) ((ScanTableSource) hoodieTableFactory
        .createDynamicTableSource(context))
        .getScanRuntimeProvider(new ScanRuntimeProviderContext());
    return dataStreamScanProvider.produceDataStream(execEnv);
  }

  /**
   * A POJO that contains tableId and resolvedCatalogTable.
   */
  public static class TableDescriptor {
    private ObjectIdentifier tableId;
    private CatalogTable catalogTable;

    public TableDescriptor(ObjectIdentifier tableId, CatalogTable catalogTable) {
      this.tableId = tableId;
      this.catalogTable = catalogTable;
    }

    public ObjectIdentifier getTableId() {
      return tableId;
    }

    public CatalogTable getCatalogTable() {
      return catalogTable;
    }
  }

  private static class DefaultDynamicTableContext implements DynamicTableFactory.Context {

    private final ObjectIdentifier objectIdentifier;
    private final CatalogTable catalogTable;
    private final ReadableConfig configuration;
    private final ClassLoader classLoader;
    private final boolean isTemporary;

    DefaultDynamicTableContext(
        ObjectIdentifier objectIdentifier,
        CatalogTable catalogTable,
        ReadableConfig configuration,
        ClassLoader classLoader,
        boolean isTemporary) {
      this.objectIdentifier = objectIdentifier;
      this.catalogTable = catalogTable;
      this.configuration = configuration;
      this.classLoader = classLoader;
      this.isTemporary = isTemporary;
    }

    @Override
    public ObjectIdentifier getObjectIdentifier() {
      return objectIdentifier;
    }

    @Override
    public CatalogTable getCatalogTable() {
      return catalogTable;
    }

    @Override
    public ReadableConfig getConfiguration() {
      return configuration;
    }

    @Override
    public ClassLoader getClassLoader() {
      return classLoader;
    }

    @Override
    public boolean isTemporary() {
      return isTemporary;
    }
  }
}
2. WriteIntoHudi.java writes data into Hudi with the DataStream API. Key implementation logic:
Step 1: The demo's data comes from a datagen connector table.
Step 2: Convert the Table into a DataStream with toAppendStream.
Step 3: Build the Hudi sink stream and write into Hudi.
In real projects a DataStream source can also be written into Hudi directly (see the sketch after the demo code below).
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.hudi.common.model.HoodieTableType;
import org.apache.hudi.configuration.FlinkOptions;

import java.util.HashMap;
import java.util.Map;

public class WriteIntoHudi {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
    env.getCheckpointConfig().setCheckpointInterval(10000);

    tableEnv.executeSql("CREATE TABLE datagen (\n"
        + " uuid varchar(20),\n"
        + " name varchar(10),\n"
        + " age int,\n"
        + " ts timestamp(3),\n"
        + " p varchar(20)\n"
        + ") WITH (\n"
        + " 'connector' = 'datagen',\n"
        + " 'rows-per-second' = '5'\n"
        + ")");

    Table table = tableEnv.sqlQuery("SELECT * FROM datagen");
    DataStream<RowData> dataStream = tableEnv.toAppendStream(table, RowData.class);

    String targetTable = "hudiSinkTable";
    String basePath = "hdfs://hacluster/tmp/flinkHudi/hudiTable";

    Map<String, String> options = new HashMap<>();
    options.put(FlinkOptions.PATH.key(), basePath);
    options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name());
    options.put(FlinkOptions.PRECOMBINE_FIELD.key(), "ts");
    options.put(FlinkOptions.INDEX_BOOTSTRAP_ENABLED.key(), "true");

    HoodiePipeline.Builder builder = HoodiePipeline.builder(targetTable)
        .column("uuid VARCHAR(20)")
        .column("name VARCHAR(10)")
        .column("age INT")
        .column("ts TIMESTAMP(3)")
        .column("p VARCHAR(20)")
        .pk("uuid")
        .partition("p")
        .options(options);

    builder.sink(dataStream, false); // the second parameter indicates whether the input data stream is bounded
    env.execute("Api_Sink");
  }
}
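As noted in the steps above, the datagen table and toAppendStream are only for the demo; any DataStream<RowData> can be handed to the same builder directly. Below is a minimal sketch (not from the original article) that replaces the datagen/Table part of WriteIntoHudi.java with a couple of hand-built records; the sample values, the bounded flag, and the use of generic RowData type information are illustrative assumptions.

// Additional imports assumed for this sketch:
// import org.apache.flink.api.common.typeinfo.TypeInformation;
// import org.apache.flink.table.data.GenericRowData;
// import org.apache.flink.table.data.StringData;
// import org.apache.flink.table.data.TimestampData;

// Build a small bounded DataStream<RowData> by hand and sink it with the same builder
// (same uuid/name/age/ts/p schema and options as in WriteIntoHudi.java above).
DataStream<RowData> dataStream = env
    .fromElements("id1,Danny,23,par1", "id2,Stephen,33,par2")
    .map(line -> {
      String[] f = line.split(",");
      GenericRowData row = new GenericRowData(5);
      row.setField(0, StringData.fromString(f[0]));                               // uuid
      row.setField(1, StringData.fromString(f[1]));                               // name
      row.setField(2, Integer.parseInt(f[2]));                                    // age
      row.setField(3, TimestampData.fromEpochMillis(System.currentTimeMillis())); // ts
      row.setField(4, StringData.fromString(f[3]));                               // p (partition)
      return (RowData) row;
    })
    .returns(TypeInformation.of(RowData.class)); // generic RowData type info is enough for this sketch

builder.sink(dataStream, true); // true: this hand-built input is bounded
env.execute("Api_Sink_DataStream");

In a real job the stream would typically come from a connector such as Kafka (see section 4), with the same map-to-RowData step in between.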
3. ReadFromHudi.java reads Hudi data with the DataStream API. Key implementation logic:
Step 1: Build the Hudi source stream and read the Hudi data.
Step 2: Convert the stream into a Table with fromDataStream.
Step 3: Print the Hudi table data with the print connector.
In real projects the Hudi data can also be read and written to a sink DataStream directly (see the sketch after the demo code below).
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.hudi.common.model.HoodieTableType;
import org.apache.hudi.configuration.FlinkOptions;

import java.util.HashMap;
import java.util.Map;

public class ReadFromHudi {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    String targetTable = "hudiSourceTable";
    String basePath = "hdfs://hacluster/tmp/flinkHudi/hudiTable";

    Map<String, String> options = new HashMap<>();
    options.put(FlinkOptions.PATH.key(), basePath);
    options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.MERGE_ON_READ.name());
    options.put(FlinkOptions.READ_AS_STREAMING.key(), "true"); // this option enables the streaming read
    options.put("read.streaming.start-commit", "20210316134557"); // specifies the start commit instant time

    HoodiePipeline.Builder builder = HoodiePipeline.builder(targetTable)
        .column("uuid VARCHAR(20)")
        .column("name VARCHAR(10)")
        .column("age INT")
        .column("ts TIMESTAMP(3)")
        .column("p VARCHAR(20)")
        .pk("uuid")
        .partition("p")
        .options(options);

    DataStream<RowData> rowDataDataStream = builder.source(env);

    StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
    Table table = tableEnv.fromDataStream(rowDataDataStream, "uuid, name, age, ts, p");
    tableEnv.registerTable("hudiSourceTable", table);

    tableEnv.executeSql("CREATE TABLE print("
        + " uuid varchar(20),\n"
        + " name varchar(10),\n"
        + " age int,\n"
        + " ts timestamp(3),\n"
        + " p varchar(20)\n"
        + ") WITH (\n"
        + " 'connector' = 'print'\n"
        + ")");

    tableEnv.executeSql("insert into print select * from hudiSourceTable");
    env.execute("Api_Source");
  }
}
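As mentioned in the steps above, converting back to a Table and using the print connector is only one option; the DataStream<RowData> returned by builder.source(env) can also be consumed with plain DataStream operators. Below is a minimal sketch (not from the original article) that replaces the Table/print part of ReadFromHudi.java, assuming the field positions 0-4 declared in the builder and non-null fields for brevity.

// Consume the Hudi source stream directly; positions follow the builder schema:
// 0 = uuid, 1 = name, 2 = age, 3 = ts (precision 3), 4 = p.
DataStream<RowData> rowDataDataStream = builder.source(env);
rowDataDataStream
    .map(row -> String.join(",",
        row.getString(0).toString(),
        row.getString(1).toString(),
        String.valueOf(row.getInt(2)),
        row.getTimestamp(3, 3).toString(),
        row.getString(4).toString()))
    .returns(String.class)
    .print();
env.execute("Api_Source_DataStream");

From here any DataStream sink (Kafka, files, a custom SinkFunction) can be attached instead of print().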
4. If a project needs to parse complex JSON from Kafka, there are two options:
1) Use FlinkSQL: https://bbs.huaweicloud.com/forum/thread-153494-1-1.html
2) Implement the parsing with a Flink DataStream MapFunction, as sketched below.
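A minimal sketch of option 2, written against the uuid/name/age/ts/p schema used above. The class name JsonToRowData, the JSON field names, the assumption that ts arrives as epoch milliseconds, and the use of Jackson's ObjectMapper (which must be on the job classpath) are all illustrative, not part of MRS or Hudi.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.table.data.TimestampData;

// Illustrative only: turn a Kafka JSON string into RowData that matches the
// uuid/name/age/ts/p schema, so it can be passed to HoodiePipeline's sink.
public class JsonToRowData implements MapFunction<String, RowData> {
  // ObjectMapper is not serializable, so create it lazily on the task side.
  private transient ObjectMapper mapper;

  @Override
  public RowData map(String json) throws Exception {
    if (mapper == null) {
      mapper = new ObjectMapper();
    }
    JsonNode node = mapper.readTree(json);
    GenericRowData row = new GenericRowData(5);
    row.setField(0, StringData.fromString(node.get("uuid").asText()));
    row.setField(1, StringData.fromString(node.get("name").asText()));
    row.setField(2, node.get("age").asInt());
    row.setField(3, TimestampData.fromEpochMillis(node.get("ts").asLong())); // ts as epoch millis
    row.setField(4, StringData.fromString(node.get("p").asText()));
    return row;
  }
}

A Kafka source stream of JSON strings can then be converted with kafkaJsonStream.map(new JsonToRowData()) and written with the same HoodiePipeline builder as in WriteIntoHudi.java.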