Spark Structured streaming API支持的输出源有：Console、Memory、File和Foreach。其中Console在前两篇博文中已有详述，而Memory使用非常简单。本文着重介绍File和Foreach两种方式，并介绍如何在源码基本扩展新的输出方式。

1. File

　　Structured Streaming支持将数据以File形式保存起来，其中支持的文件格式有四种:json、text、csv和parquet。其使用方式也非常简单只需设置checkpointLocation和path即可。checkpointLocation是检查点保存的路径，而path是真实数据保存的路径。

如下所示的测试例子：

// Create DataFrame representing the stream of input lines from connection to host:port

val lines = spark.readStream

.format("socket")

.option("host", host)

.option("port", port)

.load()

// Split the lines into words

val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count

val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console

val query = wordCounts.writeStream

.format("json")

.option("checkpointLocation","root/jar")

.option("path","/root/jar")

.start()

注意：

File形式不能设置"compelete"模型，只能设置"Append"模型。由于Append模型不能有聚合操作，所以将数据保存到外部File时，不能有聚合操作。

2. Foreach

　　foreach输出方式只需要实现ForeachWriter抽象类，并实现三个方法，当Structured Streaming接收到数据就会执行其三个方法，如下的测试示例：

* Licensed to the Apache Software Foundation (ASF) under one or more

* contributor license agreements. See the NOTICE file distributed with

* this work for additional information regarding copyright ownership.

* The ASF licenses this file to You under the Apache License, Version 2.0

* (the "License"); you may not use this file except in compliance with

* the License. You may obtain a copy of the License at

* http://www.apache.org/licenses/LICENSE-2.0

* Unless required by applicable law or agreed to in writing, software

* distributed under the License is distributed on an "AS IS" BASIS,

* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

* See the License for the specific language governing permissions and

* limitations under the License.

// scalastyle:off println

package org.apache.spark.examples.sql.streaming

import org.apache.spark.sql.SparkSession

/**

* Counts words in UTF8 encoded, '\n' delimited text received from the network.

* Usage: StructuredNetworkWordCount <hostname> <port>

* <hostname> and <port> describe the TCP server that Structured Streaming

* would connect to receive data.

* To run this on your local machine, you need to first run a Netcat server

* `$ nc -lk 9999`

* and then run the example

* `$ bin/run-example sql.streaming.StructuredNetworkWordCount

* localhost 9999`

object StructuredNetworkWordCount {

def main(args: Array[String]) {

if (args.length < 2) {

System.err.println("Usage: StructuredNetworkWordCount <hostname> <port>")

System.exit(1)

}

val host = args(0)

val port = args(1).toInt

val spark = SparkSession

.builder

.appName("StructuredNetworkWordCount")

.getOrCreate()

import spark.implicits._

// Create DataFrame representing the stream of input lines from connection to host:port

val lines = spark.readStream

.format("socket")

.option("host", host)

.option("port", port)

.load()

// Start running the query that prints the running counts to the console

val query = wordCounts.writeStream

.outputMode("append")

.foreach(new ForearchWriter[Row]{

override def open(partitionId:Long,version:Long):Boolean={

println("open")

return true

}

override def process(value:Row):Unit={

val spark = SparkSession.builder.getOrCreate()

val seq = value.mkString.split(" ")

val row = Row.fromSeq(seq)

val rowRDD:RDD[Row] = sparkContext.getOrCreate().parallelize[Row](Seq(row))

val userSchema = new StructType().add("name","String").add("age","String")

val peopleDF = spark.createDataFrame(rowRDD,userSchema)

peopleDF.createOrReplaceTempView(myTable)

spark.sql("select * from myTable").show()

}

override def close(errorOrNull:Throwable):Unit={

println("close")

}

})

.start()

query.awaitTermination()

}

// scalastyle:on println

　　上述程序是直接继承ForeachWriter类的接口，并实现了open()、process()、close()三个方法。若采用显示定义一个类来实现，需要注意Scala的泛型设计，如下所示：

class myForeachWriter[T<:Row](stream:CatalogTable) extends ForearchWriter[T]{

override def open(partionId:Long,version:Long):Boolean={

println("open")

true

}

override def process(value:T):Unit={

println(value)

}

override def close(errorOrNull:Throwable):Unit={

println("close")

}

3. 自定义

　　若上述Spark Structured Streaming API提供的数据输出源仍不能满足要求，那么还有一种方法可以使用：修改源码。

如下通过实现一种自定义的Console来介绍这种使用方式：

3.1 ConsoleSink

　　Spark有一个Sink接口，用户可以实现该接口的addBatch方法，其中的data参数是接收的数据，如下所示直接将其输出到控制台：

class ConsoleSink(streamName:String) extends Sink{

override def addBatch(batchId:Long, data;DataFrame):Unit = {

data.show()

}

3.2 DataStreamWriter

　　在用户自定义的输出形式时，并调用start()方法后，Spark框架会去调用DataStreamWriter类的start()方法。所以用户可以直接在该方法中添加自定义的输出方式，如我们向其传递上述创建的ConsoleSink类示例，如下所示：

def start():StreamingQuery={

if(source == "memory"){

...

}else if(source=="foreach"){

...

}else if(source=="consoleSink"){

val streamName:String = extraOption.get("streamName") mathc{

case Some(str):str

case None=>throw new AnalysisException("streamName option must be specified for Sink")

}

val sink = new consoleSink(streamName)

df.sparkSession.sessionState.streamingQueryManager.startQuery(

extraOption.get("queryName"),

extraOption.get("checkpointLocation"),

df,

sink,

outputMode,

useTempCheckpointLocaltion = true,

recoverFromCheckpointLocation = false,

trigger = trigger

)

}else{

...

}

3.3 Structured Streaming

　　在前两部修改和实现完成后，用户就可以按正常的Structured Streaming API方式使用了，唯一不同的是在输出形式传递的参数是"consoleSink"字符串，如下所示：

def execute(stream:CatalogTable):Unit={

val spark = SparkSession

.builder

.appName("StructuredNetworkWordCount")

.getOrCreate()

/**1. 获取数据对象DataFrame*/

val lines = spark.readStream

.format("socket")

.option("host", "localhost")

.option("port", 9999)

.load()

/**2. 启动Streaming开始接受数据源的信息*/

val query:StreamingQuery = lines.writeStream

.outputMode("append")

.format("consoleSink")

.option("streamName","myStream")

.start()

query.awaitTermination()

}

4. 参考文献

[1]. Structured Streaming Programming Guide.

[2]. Kafka Integration Guide.

Spark Structured Streaming框架(3)之数据输出源详解的更多相关文章

Spark Structured Streaming框架(2)之数据输入源详解
Spark Structured Streaming目前的2.1.0版本只支持输入源:File.kafka和socket. 1. Socket Socket方式是最简单的数据输入源,如Quick ex ...
Spark Structured Streaming框架（2）之数据输入源详解
Spark Structured Streaming目前的2.1.0版本只支持输入源:File.kafka和socket. 1. Socket Socket方式是最简单的数据输入源,如Quick ex ...
Spark Structured streaming框架（1）之基本使用
Spark Struntured Streaming是Spark 2.1.0版本后新增加的流计算引擎,本博将通过几篇博文详细介绍这个框架.这篇是介绍Spark Structured Streamin ...
Spark Structured Streaming框架(1)之基本用法
Spark Struntured Streaming是Spark 2.1.0版本后新增加的流计算引擎,本博将通过几篇博文详细介绍这个框架.这篇是介绍Spark Structured Streamin ...
Spark Structured Streaming框架(4)之窗口管理详解
1. 结构 1.1 概述 Structured Streaming组件滑动窗口功能由三个参数决定其功能:窗口时间.滑动步长和触发时间. 窗口时间:是指确定数据操作的长度: 滑动步长:是指窗口每次向前移 ...
Spark Structured Streaming框架(5)之进程管理
Structured Streaming提供一些API来管理Streaming对象.用户可以通过这些API来手动管理已经启动的Streaming,保证在系统中的Streaming有序执行. 1. St ...
Kafka：ZK+Kafka+Spark Streaming集群环境搭建（二十九）：推送avro格式数据到topic，并使用spark structured streaming接收topic解析avro数据
推送avro格式数据到topic 源代码:https://github.com/Neuw84/structured-streaming-avro-demo/blob/master/src/main/j ...
Kafka：ZK+Kafka+Spark Streaming集群环境搭建（十一）定制一个arvo格式文件发送到kafka的topic，通过Structured Streaming读取kafka的数据
将arvo格式数据发送到kafka的topic 第一步:定制avro schema: { "type": "record", "name": ...
DataFlow编程模型与Spark Structured streaming
流式(streaming)和批量( batch):流式数据,实际上更准确的说法应该是unbounded data(processing),也就是无边界的连续的数据的处理:对应的批量计算,更准确的说法是 ...

随机推荐

oc 跳转控制方法
1.presentViewController - (void)presentViewController:(UIViewController *)viewControllerToPresent an ...
cmd命令速查手册
CMD命令速查手册ASSOC显示或修改文件扩展名关联AT 计划在计算机上运行的命令和程序ATTRIB 显示或更改文件属性BREAK 设置或清除扩展式 CTRL+C检查CACLS显示或修改文件的访问控制 ...
Android中database所在文件夹路径（9.6）
1 sd----->data---->对应app---->databases----->创建的db 2 push到pc上,可以使用GUI工具SQLiteSpy直接查看datab ...
flume配置和说明(转)
Flume是什么收集.聚合事件流数据的分布式框架通常用于log数据采用ad-hoc方案,明显优点如下: 可靠的.可伸缩.可管理.可定制.高性能声明式配置,可以动态更新配置提供上下文路由功能 ...
ros之串口通信---imu
1.sudo apt-get install ros-kinetic-rosserial 或者sudo git clonegit://github.com/wjwwood/serial.git (开 ...
script 标签幼儿园级别的神坑。居然还让我踩到了。
这样的写法,会导致页面出现问题,就类似被中断了一样,百思不得其解还以为是代码出了问题. <script src="./Components/ProcessLine/ProcessLin ...
IntelliJ IDEA 、genymotion模拟器、Android开发环境搭建
首先打开IDEA,看到该界面,如果没有该界面,请在User/用户名/IntelliJIDEAProjects/下删除所有项目文件夹.然后重启IDEA即可看到接着开始配置jdk和sdk 然后在Proj ...
SpringCloud系列三：将微服务注册到Eureka Server上
1. 回顾通过上篇博客的讲解,我们知道硬编码提供者地址的方式有不少问题.要想解决这些问题,服务消费者需要一个强大的服务发现机制,服务消费者使用这种机制获取服务提供者的网络信息.不仅如此,即使服务提供 ...
java消息中间件之ActiveMQ初识
目录消息中间件简介解耦合和异步可靠性和高效性 JMS P2P Pub/Sub AMQP JMS和AMQP对比常见消息中间件 ActiveMQ RabbitMQ Kafka 综合比较标签(空格 ...
支付宝开放平台配置RSA(SHA1)密钥 OpenSSL配置公钥私钥对
进入到第一次配置支付宝支付服务了配置支付宝服务,需要去支付宝的开放平台申请服务需要设置一些参数其中需要在后台设置配置RSA(SHA1)密钥(公钥(注意这个子读"yao")) ...

Spark Structured Streaming框架(3)之数据输出源详解

1. File

2. Foreach

3. 自定义

3.1 ConsoleSink

3.2 DataStreamWriter

3.3 Structured Streaming

4. 参考文献

Spark Structured Streaming框架(3)之数据输出源详解的更多相关文章

随机推荐

热门专题