Using Dataset groupBy/agg and join in Spark Structured Streaming (Java API)
Dataset groupBy/agg example:
// For each (enodeb_id, ecell_id) group, keep the first value of every column and count the rows,
// then rename the aggregated columns back to their original names.
Dataset<Row> resultDs = dsParsed
        .groupBy("enodeb_id", "ecell_id")
        .agg(
                functions.first("scan_start_time").alias("scan_start_time1"),
                functions.first("insert_time").alias("insert_time1"),
                functions.first("mr_type").alias("mr_type1"),
                functions.first("mr_ltescphr").alias("mr_ltescphr1"),
                functions.first("mr_ltescpuschprbnum").alias("mr_ltescpuschprbnum1"),
                functions.count("enodeb_id").alias("rows1"))
        .selectExpr(
                "ecell_id",
                "enodeb_id",
                "scan_start_time1 as scan_start_time",
                "insert_time1 as insert_time",
                "mr_type1 as mr_type",
                "mr_ltescphr1 as mr_ltescphr",
                "mr_ltescpuschprbnum1 as mr_ltescpuschprbnum",
                "rows1 as rows");
Dataset join example:
Dataset<Row> ncRes = sparkSession.read()
        .option("delimiter", "|")
        .option("header", true)
        .csv("/user/csv");
Dataset<Row> mro = sparkSession.sql("...");

Dataset<Row> ncJoinMro = ncRes
        .join(mro,
              mro.col("id").equalTo(ncRes.col("id"))
                 .and(mro.col("calid").equalTo(ncRes.col("calid"))),
              "left_outer")
        .select(ncRes.col("id").as("int_id"),
                mro.col("vendor_id")
                // ...
        );
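A usage note, not from the original post: when both inputs share column names (here id and calid), downstream code can avoid ambiguous-column references by renaming one side before the join. The mro_id / mro_calid names below are hypothetical:

Dataset<Row> mroRenamed = mro
        .withColumnRenamed("id", "mro_id")        // hypothetical renamed columns
        .withColumnRenamed("calid", "mro_calid");
Dataset<Row> joined = ncRes.join(mroRenamed,
        mroRenamed.col("mro_id").equalTo(ncRes.col("id"))
                  .and(mroRenamed.col("mro_calid").equalTo(ncRes.col("calid"))),
        "left_outer");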
Another way to express the join condition:
leftDfWithWatermark.join(
        rightDfWithWatermark,
        functions.expr(
                "leftDfId = rightDfId AND " +
                "leftDfTime >= rightDfTime AND " +
                "leftDfTime <= rightDfTime + interval 1 hour"),
        "leftOuter");
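For reference, a sketch of how the two inputs above could get their watermarks. Only the time column names come from the join condition; the source Datasets leftDf/rightDf and the delay thresholds are assumptions:

Dataset<Row> leftDfWithWatermark  = leftDf.withWatermark("leftDfTime", "2 hours");
Dataset<Row> rightDfWithWatermark = rightDf.withWatermark("rightDfTime", "3 hours");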
BroadcastHashJoin example:
package com.dx.testbroadcast;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

import java.io.*;

public class Test {
    public static void main(String[] args) {
        String personPath = "E:\\person.csv";
        String personOrderPath = "E:\\personOrder.csv";
        //writeToPersion(personPath);
        //writeToPersionOrder(personOrderPath);

        SparkConf conf = new SparkConf();
        SparkSession sparkSession = SparkSession.builder()
                .config(conf)
                .appName("test-broadcast-app")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> person = sparkSession.read()
                .option("header", "true")
                .option("inferSchema", "true") // automatically infer column types
                .option("delimiter", ",")
                .csv(personPath)
                .as("person");
        person.printSchema();

        Dataset<Row> personOrder = sparkSession.read()
                .option("header", "true")
                .option("inferSchema", "true") // automatically infer column types
                .option("delimiter", ",")
                .csv(personOrderPath)
                .as("personOrder");
        personOrder.printSchema();

        // Join type defaults to `inner`. Must be one of: `inner`, `cross`, `outer`, `full`, `full_outer`,
        // `left`, `left_outer`, `right`, `right_outer`, `left_semi`, `left_anti`.
        Dataset<Row> resultDs = personOrder.join(
                functions.broadcast(person),
                personOrder.col("personid").equalTo(person.col("id")),
                "left");
        resultDs.explain();
        resultDs.show(10);
    }

    private static void writeToPersion(String personPath) {
        BufferedWriter personWriter = null;
        try {
            personWriter = new BufferedWriter(new FileWriter(personPath));
            personWriter.write("id,name,age,address\r\n");
            // The loop bounds were lost from the original listing; 10000 rows is an assumed value.
            for (int i = 0; i < 10000; i++) {
                personWriter.write("" + i + ",person-" + i + "," + i + ",address-address-address-address-address-address-address" + i + "\r\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (personWriter != null) {
                try {
                    personWriter.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    private static void writeToPersionOrder(String personOrderPath) {
        BufferedWriter personWriter = null;
        try {
            personWriter = new BufferedWriter(new FileWriter(personOrderPath));
            personWriter.write("personid,name,age,address\r\n");
            // The loop bounds were lost from the original listing; 10000 rows is an assumed value.
            for (int i = 0; i < 10000; i++) {
                personWriter.write("" + i + ",person-" + i + "," + i + ",address-address-address-address-address-address-address" + i + "\r\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (personWriter != null) {
                try {
                    personWriter.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Output:
== Physical Plan ==
*() BroadcastHashJoin [personid#], [id#], LeftOuter, BuildRight
:- *() FileScan csv [personid#,name#,age#,address#] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/E:/personOrder.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<personid:int,name:string,age:int,address:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[, int, true] as bigint)))
+- *() Project [id#, name#, age#, address#]
+- *() Filter isnotnull(id#)
+- *() FileScan csv [id#,name#,age#,address#] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/E:/person.csv], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int,name:string,age:int,address:string> +--------+--------+---+--------------------+---+--------+---+--------------------+
|personid| name|age| address| id| name|age| address|
+--------+--------+---+--------------------+---+--------+---+--------------------+
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
| |person-| |address-address-a...| |person-| |address-address-a...|
+--------+--------+---+--------------------+---+--------+---+--------------------+
only showing top 10 rows
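The BroadcastExchange node in the plan confirms the hint took effect. Note that even without functions.broadcast(...), Spark broadcasts the smaller side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A sketch of adjusting that setting (the 100 MB value is chosen purely for illustration):

// Raise the automatic broadcast threshold to 100 MB:
sparkSession.conf().set("spark.sql.autoBroadcastJoinThreshold", String.valueOf(100 * 1024 * 1024));
// Or disable automatic broadcast joins entirely:
// sparkSession.conf().set("spark.sql.autoBroadcastJoinThreshold", "-1");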
Spark SQL BroadcastHashJoin (SQL hint) example:
person.createOrReplaceTempView("temp_person");
personOrder.createOrReplaceTempView("temp_person_order");

Dataset<Row> sqlResult = sparkSession.sql(
        " SELECT /*+ BROADCAST (t11) */" +
        " t11.id, t11.name, t11.age, t11.address," +
        " t10.personid as person_id, t10.name as persion_order_name" +
        " FROM temp_person_order as t10" +
        " inner join temp_person as t11" +
        " on t11.id = t10.personid ");
sqlResult.show();
sqlResult.explain();
Output log:
+---+--------+---+--------------------+---------+------------------+
| id| name|age| address|person_id|persion_order_name|
+---+--------+---+--------------------+---------+------------------+
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
| |person-| |address-address-a...| | person-|
+---+--------+---+--------------------+---------+------------------+
only showing top 20 rows

// :: INFO FileSourceStrategy: Pruning directories with:
// :: INFO FileSourceStrategy: Post-Scan Filters: isnotnull(personid#)
// :: INFO FileSourceStrategy: Output Data Schema: struct<personid: int, name: string>
// :: INFO FileSourceScanExec: Pushed Filters: IsNotNull(personid)
// :: INFO FileSourceStrategy: Pruning directories with:
// :: INFO FileSourceStrategy: Post-Scan Filters: isnotnull(id#)
// :: INFO FileSourceStrategy: Output Data Schema: struct<id: int, name: string, age: int, address: string ... more fields>
// :: INFO FileSourceScanExec: Pushed Filters: IsNotNull(id)
== Physical Plan ==
*() Project [id#, name#, age#, address#, personid# AS person_id#, name# AS persion_order_name#]
+- *() BroadcastHashJoin [personid#], [id#], Inner, BuildRight
:- *() Project [personid#, name#]
: +- *() Filter isnotnull(personid#)
: +- *() FileScan csv [personid#,name#] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/E:/personOrder.csv], PartitionFilters: [], PushedFilters: [IsNotNull(personid)], ReadSchema: struct<personid:int,name:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[, int, true] as bigint)))
+- *() Project [id#, name#, age#, address#]
+- *() Filter isnotnull(id#)
+- *() FileScan csv [id#,name#,age#,address#] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/E:/person.csv], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:int,name:string,age:int,address:string>
// :: INFO SparkContext: Invoking stop() from shutdown hook
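As a side note (not from the original post), since Spark 2.2 the same broadcast hint can also be attached on the Dataset side instead of in the SQL text:

Dataset<Row> hinted = personOrder.join(
        person.hint("broadcast"),
        personOrder.col("personid").equalTo(person.col("id")),
        "inner");
hinted.explain(); // should likewise show a BroadcastHashJoin with BuildRight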