[BitSail] Connector Development in Detail, Part 3: SourceReader
Source Connector
SourceReader
The SourceReader interface
public interface SourceReader<T, SplitT extends SourceSplit> extends Serializable, AutoCloseable {

  void start();

  void pollNext(SourcePipeline<T> pipeline) throws Exception;

  void addSplits(List<SplitT> splits);

  /**
   * Check whether the source reader has more elements.
   */
  boolean hasMoreElements();

  /**
   * No more splits will be sent to this source reader.
   * The source reader may exit after processing all assigned splits.
   */
  default void notifyNoMoreSplits() {
  }

  /**
   * Process all events sent from {@link SourceSplitCoordinator}.
   */
  default void handleSourceEvent(SourceEvent sourceEvent) {
  }

  /**
   * Store the splits in the external system so they can be recovered when a task fails.
   */
  List<SplitT> snapshotState(long checkpointId);

  /**
   * Invoked to notify that the checkpoint is complete once all tasks have finished their snapshot.
   */
  default void notifyCheckpointComplete(long checkpointId) throws Exception {
  }

  interface Context {
    TypeInfo<?>[] getTypeInfos();
    String[] getFieldNames();
    int getIndexOfSubtask();
    void sendSplitRequest();
  }
}
Constructor
Example
public RocketMQSourceReader(BitSailConfiguration readerConfiguration,
                            Context context,
                            Boundedness boundedness) {
  this.readerConfiguration = readerConfiguration;
  this.boundedness = boundedness;
  this.context = context;
  this.assignedRocketMQSplits = Sets.newHashSet();
  this.finishedRocketMQSplits = Sets.newHashSet();
  this.deserializationSchema = new RocketMQDeserializationSchema(
      readerConfiguration,
      context.getTypeInfos(),
      context.getFieldNames());
  this.noMoreSplits = false;

  cluster = readerConfiguration.get(RocketMQSourceOptions.CLUSTER);
  topic = readerConfiguration.get(RocketMQSourceOptions.TOPIC);
  consumerGroup = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_GROUP);
  consumerTag = readerConfiguration.get(RocketMQSourceOptions.CONSUMER_TAG);
  pollBatchSize = readerConfiguration.get(RocketMQSourceOptions.POLL_BATCH_SIZE);
  pollTimeout = readerConfiguration.get(RocketMQSourceOptions.POLL_TIMEOUT);
  commitInCheckpoint = readerConfiguration.get(RocketMQSourceOptions.COMMIT_IN_CHECKPOINT);
  accessKey = readerConfiguration.get(RocketMQSourceOptions.ACCESS_KEY);
  secretKey = readerConfiguration.get(RocketMQSourceOptions.SECRET_KEY);
}
The start method
Examples
// RocketMQ: create and start the pull consumer.
public void start() {
  try {
    if (StringUtils.isNotEmpty(accessKey) && StringUtils.isNotEmpty(secretKey)) {
      AclClientRPCHook aclClientRPCHook = new AclClientRPCHook(
          new SessionCredentials(accessKey, secretKey));
      consumer = new DefaultMQPullConsumer(aclClientRPCHook);
    } else {
      consumer = new DefaultMQPullConsumer();
    }

    consumer.setConsumerGroup(consumerGroup);
    consumer.setNamesrvAddr(cluster);
    consumer.setInstanceName(String.format(SOURCE_READER_INSTANCE_NAME_TEMPLATE,
        cluster, topic, consumerGroup, UUID.randomUUID()));
    consumer.setConsumerPullTimeoutMillis(pollTimeout);
    consumer.start();
  } catch (Exception e) {
    throw BitSailException.asBitSailException(RocketMQErrorCode.CONSUMER_CREATE_FAILED, e);
  }
}

// ClickHouse: open the connection and prepare the query statement.
public void start() {
  this.connection = connectionHolder.connect();

  // Construct statement.
  String baseSql = ClickhouseJdbcUtils.getQuerySql(dbName, tableName, columnInfos);
  String querySql = ClickhouseJdbcUtils.decorateSql(baseSql, splitField, filterSql, maxFetchCount, true);
  try {
    this.statement = connection.prepareStatement(querySql);
  } catch (SQLException e) {
    throw new RuntimeException("Failed to prepare statement.", e);
  }

  LOG.info("Task {} started.", subTaskId);
}

// FTP: log in to the server and read the skip-first-line setting.
public void start() {
  this.ftpHandler.loginFtpServer();
  if (this.ftpHandler.getFtpConfig().getSkipFirstLine()) {
    this.skipFirstLine = true;
  }
}
The addSplits method
Example
public void addSplits(List<RocketMQSplit> splits) {
  LOG.info("Subtask {} received {}(s) new splits, splits = {}.",
      context.getIndexOfSubtask(),
      CollectionUtils.size(splits),
      splits);

  assignedRocketMQSplits.addAll(splits);
}
The hasMoreElements method
public boolean hasMoreElements() {
  if (boundedness == Boundedness.UNBOUNDEDNESS) {
    return true;
  }
  if (noMoreSplits) {
    return CollectionUtils.size(assignedRocketMQSplits) != 0;
  }
  return true;
}
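To make the three branches above concrete, here is a minimal, self-contained sketch of the same decision together with a driver loop that honors it. The class HasMoreElementsSketch, its nested Boundedness enum, and the drain loop are hypothetical stand-ins written for illustration. They are not BitSail classes, and the real engine may call the reader in a different order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for illustration only; not a BitSail class.
final class HasMoreElementsSketch {
  enum Boundedness { BOUNDEDNESS, UNBOUNDEDNESS }

  private final Boundedness boundedness;
  private final Deque<String> assignedSplits = new ArrayDeque<>();
  private boolean noMoreSplits = false;

  HasMoreElementsSketch(Boundedness boundedness) {
    this.boundedness = boundedness;
  }

  void addSplits(List<String> splits) {
    assignedSplits.addAll(splits);
  }

  void notifyNoMoreSplits() {
    noMoreSplits = true;
  }

  // Same decision as the example above: a streaming reader never finishes;
  // a batch reader finishes once no splits are left and none will arrive.
  boolean hasMoreElements() {
    if (boundedness == Boundedness.UNBOUNDEDNESS) {
      return true;
    }
    if (noMoreSplits) {
      return !assignedSplits.isEmpty();
    }
    return true;
  }

  // Simplified pollNext: consumes one whole split per call.
  void pollNext(List<String> output) {
    String split = assignedSplits.poll();
    if (split != null) {
      output.add("record-from-" + split);
    }
  }

  // Assumed driver loop: keep polling while the reader reports more elements.
  static List<String> drain(HasMoreElementsSketch reader) {
    List<String> output = new ArrayList<>();
    while (reader.hasMoreElements()) {
      reader.pollNext(output);
    }
    return output;
  }
}
```

Note that drain only terminates for a bounded reader after notifyNoMoreSplits has been called. For an unbounded reader, hasMoreElements always returns true, so the loop would run indefinitely.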
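The pollNext method
- Reading data from splits
  - Read the data from the splits that have already been constructed.
- Converting data types
  - Convert the external data into BitSail's Row type.
Example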
public void pollNext(SourcePipeline<Row> pipeline) throws Exception {
  for (RocketMQSplit rocketmqSplit : assignedRocketMQSplits) {
    MessageQueue messageQueue = rocketmqSplit.getMessageQueue();
    // Pull one batch from the queue, starting at the split's current offset.
    PullResult pullResult = consumer.pull(messageQueue,
        consumerTag,
        rocketmqSplit.getStartOffset(),
        pollBatchSize,
        pollTimeout);

    if (Objects.isNull(pullResult) || CollectionUtils.isEmpty(pullResult.getMsgFoundList())) {
      continue;
    }

    for (MessageExt message : pullResult.getMsgFoundList()) {
      // Convert the raw message body into a BitSail Row and emit it.
      Row deserialize = deserializationSchema.deserialize(message.getBody());
      pipeline.output(deserialize);
      // Mark the split as finished once its start offset has reached the end offset.
      if (rocketmqSplit.getStartOffset() >= rocketmqSplit.getEndOffset()) {
        LOG.info("Subtask {} rocketmq split {} in end of stream.",
            context.getIndexOfSubtask(),
            rocketmqSplit);
        finishedRocketMQSplits.add(rocketmqSplit);
        break;
      }
    }

    // Advance the split to the offset at which the next pull should begin.
    rocketmqSplit.setStartOffset(pullResult.getNextBeginOffset());
    // Unless offsets are committed at checkpoint time, commit right away.
    if (!commitInCheckpoint) {
      consumer.updateConsumeOffset(messageQueue, pullResult.getMaxOffset());
    }
  }
  assignedRocketMQSplits.removeAll(finishedRocketMQSplits);
}
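The two responsibilities listed for pollNext, reading from the assigned splits and converting each external record into a row, can also be sketched without any RocketMQ dependency. Everything below is hypothetical and for illustration only: Split models an offset range over an in-memory list, an Object[] stands in for BitSail's Row, and a plain List stands in for the SourcePipeline.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for illustration only; not BitSail or RocketMQ classes.
final class PollNextSketch {
  // A split covers offsets [startOffset, endOffset) of an in-memory "queue".
  static final class Split {
    final List<String> queue;
    long startOffset;
    final long endOffset;

    Split(List<String> queue, long startOffset, long endOffset) {
      this.queue = queue;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  private final List<Split> assignedSplits = new ArrayList<>();
  private final int pollBatchSize;

  PollNextSketch(int pollBatchSize) {
    this.pollBatchSize = pollBatchSize;
  }

  void addSplits(List<Split> splits) {
    assignedSplits.addAll(splits);
  }

  boolean hasAssignedSplits() {
    return !assignedSplits.isEmpty();
  }

  // Data type conversion: turn one external record ("id,name") into a row.
  static Object[] deserialize(String raw) {
    String[] parts = raw.split(",", 2);
    return new Object[] {Integer.parseInt(parts[0].trim()), parts[1].trim()};
  }

  // Split reading: read at most pollBatchSize records from every assigned split,
  // advance its offset, and drop the split once it reaches its end offset.
  void pollNext(List<Object[]> pipeline) {
    Iterator<Split> it = assignedSplits.iterator();
    while (it.hasNext()) {
      Split split = it.next();
      long upper = Math.min(split.endOffset, split.startOffset + pollBatchSize);
      for (long offset = split.startOffset; offset < upper; offset++) {
        pipeline.add(deserialize(split.queue.get((int) offset)));
      }
      split.startOffset = upper;
      if (split.startOffset >= split.endOffset) {
        it.remove();
      }
    }
  }
}
```

Unlike the RocketMQ example, this sketch removes a finished split through the iterator instead of collecting finished splits in a second set. Both approaches avoid modifying the collection while it is being iterated.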
Common ways to convert to the BitSail Row type
A custom RowDeserializer class
public class ClickhouseRowDeserializer {

  interface FiledConverter {
    Object apply(ResultSet resultSet) throws SQLException;
  }

  private final List<FiledConverter> converters;
  private final int fieldSize;

  public ClickhouseRowDeserializer(TypeInfo<?>[] typeInfos) {
    this.fieldSize = typeInfos.length;
    this.converters = new ArrayList<>();
    for (int i = 0; i < fieldSize; ++i) {
      // JDBC column indexes start at 1.
      converters.add(initFieldConverter(i + 1, typeInfos[i]));
    }
  }

  public Row convert(ResultSet resultSet) {
    Row row = new Row(fieldSize);
    try {
      for (int i = 0; i < fieldSize; ++i) {
        row.setField(i, converters.get(i).apply(resultSet));
      }
    } catch (SQLException e) {
      throw BitSailException.asBitSailException(ClickhouseErrorCode.CONVERT_ERROR, e.getCause());
    }
    return row;
  }

  private FiledConverter initFieldConverter(int index, TypeInfo<?> typeInfo) {
    if (!(typeInfo instanceof BasicTypeInfo)) {
      throw BitSailException.asBitSailException(CommonErrorCode.UNSUPPORTED_COLUMN_TYPE,
          typeInfo.getTypeClass().getName() + " is not supported yet.");
    }

    Class<?> curClass = typeInfo.getTypeClass();
    if (TypeInfos.BYTE_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getByte(index);
    }
    if (TypeInfos.SHORT_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getShort(index);
    }
    if (TypeInfos.INT_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getInt(index);
    }
    if (TypeInfos.LONG_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getLong(index);
    }
    if (TypeInfos.BIG_INTEGER_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> {
        BigDecimal dec = resultSet.getBigDecimal(index);
        return dec == null ? null : dec.toBigInteger();
      };
    }
    if (TypeInfos.FLOAT_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getFloat(index);
    }
    if (TypeInfos.DOUBLE_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getDouble(index);
    }
    if (TypeInfos.BIG_DECIMAL_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getBigDecimal(index);
    }
    if (TypeInfos.STRING_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getString(index);
    }
    if (TypeInfos.SQL_DATE_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getDate(index);
    }
    if (TypeInfos.SQL_TIMESTAMP_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getTimestamp(index);
    }
    if (TypeInfos.SQL_TIME_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getTime(index);
    }
    if (TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> resultSet.getBoolean(index);
    }
    if (TypeInfos.VOID_TYPE_INFO.getTypeClass() == curClass) {
      return resultSet -> null;
    }
    throw new UnsupportedOperationException("Unsupported data type: " + typeInfo);
  }
}
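The pattern in the class above is to resolve one converter per column once, in the constructor, so that converting a record only runs the pre-built converters. The following is a reduced sketch of that pattern using only the JDK. FieldConverterSketch is a hypothetical name, plain Class objects stand in for TypeInfo, a String[] stands in for the ResultSet, and an Object[] stands in for Row.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for illustration only; not a BitSail class.
final class FieldConverterSketch {
  private final List<Function<String, Object>> converters = new ArrayList<>();

  // Resolve the converter for every column once, up front.
  FieldConverterSketch(Class<?>[] columnTypes) {
    for (Class<?> type : columnTypes) {
      converters.add(initFieldConverter(type));
    }
  }

  // Per record, only the pre-built converters run; the type dispatch is not repeated.
  Object[] convert(String[] rawFields) {
    if (rawFields.length != converters.size()) {
      throw new IllegalArgumentException(
          "Expected " + converters.size() + " fields, got " + rawFields.length);
    }
    Object[] row = new Object[rawFields.length];
    for (int i = 0; i < rawFields.length; i++) {
      // Null inputs are passed through as null fields.
      row[i] = rawFields[i] == null ? null : converters.get(i).apply(rawFields[i]);
    }
    return row;
  }

  private static Function<String, Object> initFieldConverter(Class<?> type) {
    if (type == Integer.class) {
      return s -> Integer.valueOf(s.trim());
    }
    if (type == Long.class) {
      return s -> Long.valueOf(s.trim());
    }
    if (type == BigDecimal.class) {
      return s -> new BigDecimal(s.trim());
    }
    if (type == Boolean.class) {
      return s -> Boolean.valueOf(s.trim());
    }
    if (type == String.class) {
      return s -> s;
    }
    // Fail at construction time, before any record is read.
    throw new UnsupportedOperationException("Unsupported data type: " + type.getName());
  }
}
```

Because unsupported types are rejected in the constructor, a misconfigured column list fails when the reader is created rather than partway through a job.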
Implementing the DeserializationSchema interface
public class TextInputFormatDeserializationSchema implements DeserializationSchema<Writable, Row> {

  private BitSailConfiguration deserializationConfiguration;
  private TypeInfo<?>[] typeInfos;
  private String[] fieldNames;
  private transient DeserializationSchema<byte[], Row> deserializationSchema;

  public TextInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration,
                                              TypeInfo<?>[] typeInfos,
                                              String[] fieldNames) {
    this.deserializationConfiguration = deserializationConfiguration;
    this.typeInfos = typeInfos;
    this.fieldNames = fieldNames;

    ContentType contentType = ContentType.valueOf(
        deserializationConfiguration.getNecessaryOption(
            HadoopReaderOptions.CONTENT_TYPE, HadoopErrorCode.REQUIRED_VALUE).toUpperCase());
    // Delegate to the format-specific schema selected by the content type.
    switch (contentType) {
      case CSV:
        this.deserializationSchema =
            new CsvDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames);
        break;
      case JSON:
        this.deserializationSchema =
            new JsonDeserializationSchema(deserializationConfiguration, typeInfos, fieldNames);
        break;
      default:
        throw BitSailException.asBitSailException(HadoopErrorCode.UNSUPPORTED_ENCODING,
            "unsupported parser type: " + contentType);
    }
  }

  @Override
  public Row deserialize(Writable message) {
    return deserializationSchema.deserialize((message.toString()).getBytes());
  }

  @Override
  public boolean isEndOfStream(Row nextElement) {
    return false;
  }
}
public class MapredParquetInputFormatDeserializationSchema implements DeserializationSchema<Writable, Row> {

  private final BitSailConfiguration deserializationConfiguration;

  private final transient DateTimeFormatter localDateTimeFormatter;
  private final transient DateTimeFormatter localDateFormatter;
  private final transient DateTimeFormatter localTimeFormatter;
  private final int fieldSize;
  private final TypeInfo<?>[] typeInfos;
  private final String[] fieldNames;
  private final List<DeserializationConverter> converters;

  public MapredParquetInputFormatDeserializationSchema(BitSailConfiguration deserializationConfiguration,
                                                       TypeInfo<?>[] typeInfos,
                                                       String[] fieldNames) {
    this.deserializationConfiguration = deserializationConfiguration;
    this.typeInfos = typeInfos;
    this.fieldNames = fieldNames;
    this.localDateTimeFormatter = DateTimeFormatter.ofPattern(
        deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_TIME_PATTERN));
    this.localDateFormatter = DateTimeFormatter
        .ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.DATE_PATTERN));
    this.localTimeFormatter = DateTimeFormatter
        .ofPattern(deserializationConfiguration.get(CommonOptions.DateFormatOptions.TIME_PATTERN));
    this.fieldSize = typeInfos.length;
    this.converters = Arrays.stream(typeInfos).map(this::createTypeInfoConverter).collect(Collectors.toList());
  }

  @Override
  public Row deserialize(Writable message) {
    int arity = fieldNames.length;
    Row row = new Row(arity);
    Writable[] writables = ((ArrayWritable) message).get();
    for (int i = 0; i < fieldSize; ++i) {
      row.setField(i, converters.get(i).convert(writables[i].toString()));
    }
    return row;
  }

  @Override
  public boolean isEndOfStream(Row nextElement) {
    return false;
  }

  private interface DeserializationConverter extends Serializable {
    Object convert(String input);
  }

  private DeserializationConverter createTypeInfoConverter(TypeInfo<?> typeInfo) {
    Class<?> typeClass = typeInfo.getTypeClass();
    if (typeClass == TypeInfos.VOID_TYPE_INFO.getTypeClass()) {
      return field -> null;
    }
    if (typeClass == TypeInfos.BOOLEAN_TYPE_INFO.getTypeClass()) {
      return this::convertToBoolean;
    }
    if (typeClass == TypeInfos.INT_TYPE_INFO.getTypeClass()) {
      return this::convertToInt;
    }
    throw BitSailException.asBitSailException(CsvFormatErrorCode.CSV_FORMAT_COVERT_FAILED,
        String.format("Csv format converter not support type info: %s.", typeInfo));
  }

  private boolean convertToBoolean(String field) {
    return Boolean.parseBoolean(field.trim());
  }

  private int convertToInt(String field) {
    return Integer.parseInt(field.trim());
  }
}
The snapshotState method
Example
public List<RocketMQSplit> snapshotState(long checkpointId) {
  LOG.info("Subtask {} start snapshotting for checkpoint id = {}.", context.getIndexOfSubtask(), checkpointId);
  // When committing in checkpoints, commit each split's current offset now.
  if (commitInCheckpoint) {
    for (RocketMQSplit rocketMQSplit : assignedRocketMQSplits) {
      try {
        consumer.updateConsumeOffset(rocketMQSplit.getMessageQueue(), rocketMQSplit.getStartOffset());
        LOG.debug("Subtask {} committed message queue = {} in checkpoint id = {}.", context.getIndexOfSubtask(),
            rocketMQSplit.getMessageQueue(),
            checkpointId);
      } catch (MQClientException e) {
        throw new RuntimeException(e);
      }
    }
  }
  return Lists.newArrayList(assignedRocketMQSplits);
}
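The point of returning the assigned splits from snapshotState is that each split carries its current offset, so a restarted reader can receive those splits again and resume from where the snapshot was taken. The sketch below illustrates that round trip under simplifying assumptions: SnapshotSketch and its nested Split are hypothetical classes, consume merely simulates reading progress, and restoring is modelled as passing the snapshot to addSplits on a fresh reader, which may not match how the real framework restores state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not BitSail classes.
final class SnapshotSketch {
  static final class Split {
    final String id;
    long startOffset;

    Split(String id, long startOffset) {
      this.id = id;
      this.startOffset = startOffset;
    }
  }

  private final List<Split> assignedSplits = new ArrayList<>();

  void addSplits(List<Split> splits) {
    assignedSplits.addAll(splits);
  }

  // Pretend to consume `count` records from every assigned split.
  void consume(long count) {
    for (Split split : assignedSplits) {
      split.startOffset += count;
    }
  }

  // Return copies so that later progress cannot alter a snapshot already taken.
  List<Split> snapshotState(long checkpointId) {
    List<Split> state = new ArrayList<>();
    for (Split split : assignedSplits) {
      state.add(new Split(split.id, split.startOffset));
    }
    return state;
  }
}
```

The sketch copies every split, whereas the RocketMQ example returns a new list that holds the same split objects. Whether a deep copy is needed depends on when the framework serializes the returned list, which this sketch does not model.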
The hasMoreElements method
Example
public boolean hasMoreElements() {
  if (noMoreSplits) {
    return CollectionUtils.size(assignedHadoopSplits) != 0;
  }
  return true;
}
The notifyNoMoreSplits method
Example
public void notifyNoMoreSplits() {
  LOG.info("Subtask {} received no more split signal.", context.getIndexOfSubtask());
  noMoreSplits = true;
}