Flink - [03] API
Use Scala to write Flink jobs that read data from different sources, process it as unbounded or bounded streams, and finally sink the processed results to the corresponding targets.
1. Maven configuration
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>flinkapi</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.7.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
            <version>1.7.2</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.44</version>
        </dependency>
        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.11</artifactId>
            <version>1.0</version>
        </dependency>
        <!--<dependency>-->
        <!--<groupId>org.apache.flink</groupId>-->
        <!--<artifactId>flink-connector-elasticsearch6_2.11</artifactId>-->
        <!--<version>1.7.2</version>-->
        <!--</dependency>-->
        <!--<dependency>-->
        <!--<groupId>org.apache.flink</groupId>-->
        <!--<artifactId>flink-statebackend-rocksdb_2.11</artifactId>-->
        <!--<version>1.7.2</version>-->
        <!--</dependency>-->
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <!--<arg>-make:transitive</arg>-->
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>WordCount</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
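Note that every Flink artifact above carries the _2.11 suffix, so the project must be compiled against Scala 2.11 for the dependencies to match.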
2. WordCount
(1) Prepare a wordcount.txt file
hello world
dog fish
hadoop
spark
hello world
dog fish
hadoop
spark
hello world
dog fish
hadoop
spark
(2) Stateful computation on a bounded data stream (Scala)
package com.harley.test

import org.apache.flink.api.scala._

object Demo01_ExecutionEnvironment {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val ds: DataSet[String] = env.readTextFile("src/main/resources/wordcount.txt")
    val res = ds.flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(0) // group by the first element of the tuple (the word)
      .sum(1)     // sum the counts
    // print the result
    res.print()
  }
}
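With the sample file above, each word occurs three times, so the job prints (hello,3), (world,3), (dog,3), (fish,3), (hadoop,3) and (spark,3); the order of the output lines may vary.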
(3) Stateful computation on an unbounded data stream
package com.harley.test

import org.apache.flink.streaming.api.scala._

object Demo02_StreamExecutionEnvironment {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val ds: DataStream[String] = env.socketTextStream("localhost", 9999)
    val res = ds.flatMap(_.split(" "))
      .filter(_ != "a") // drop the word "a"
      .map((_, 1))
      .keyBy(0)
      .sum(1)
    // set the parallelism of the print sink
    res.print("Stream print").setParallelism(2)
    env.execute("Stream job")
  }
}
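To try the unbounded example, first start a local socket server, for example with netcat: nc -lk 9999. Words typed into that terminal are counted as they arrive; because the print sink's parallelism is 2, each output line is prefixed with the sink name and the subtask index (e.g. Stream print:1>).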
3. Integrating different data sources
3.1 Using a collection as the source
import org.apache.flink.streaming.api.scala._

object Demo01_Source {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val ds = env.fromCollection(List(
      SensorReader("sensor_1", 1547718199, 35.80018327300259),
      SensorReader("sensor_6", 1547718201, 15.402984393403084),
      SensorReader("sensor_7", 1547718202, 6.720945201171228),
      SensorReader("sensor_10", 1547718205, 38.101067604893444)
    ))
    ds.print("collection source")
    env.execute("source job")
  }

  case class SensorReader(id: String, timestamp: Long, temperature: Double)
}
3.2 Kafka as the source
(1) Start the Kafka service, create a topic, and start a console producer to send data to it
nohup bin/kafka-server-start.sh config/server.properties &
bin/kafka-console-producer.sh --bootstrap-server node7-2:9092 --topic test
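If the test topic does not exist yet, create it before starting the producer; with a recent Kafka distribution the command would be, for example:
bin/kafka-topics.sh --create --bootstrap-server node7-2:9092 --replication-factor 1 --partitions 1 --topic test
(older releases use --zookeeper instead of --bootstrap-server).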
(2) Consume the data in the topic with the Flink API
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010

object Demo02_KafkaSource {

  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    val props = new Properties()
    props.setProperty("bootstrap.servers", "node7-2:9092")
    props.setProperty("group.id", "kafkasource")
    props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("auto.offset.reset", "latest")
    val stream = env.addSource(new FlinkKafkaConsumer010[String](
      "test", // topic
      new SimpleStringSchema(),
      props
    ))
    stream.print("kafka source")
    env.execute("job")
  }
}
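With the console producer from step (1) running, each line sent to the test topic shows up in the job's standard output prefixed with "kafka source".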
3.3 Custom source
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._

import scala.util.Random

object Demo03_DefinedSource {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val stream = env.addSource(new SensorSource())
    stream.print("defined source")
    env.execute("defined job")
  }

  class SensorSource extends SourceFunction[Sensor1] {

    var running: Boolean = true

    override def run(sourceContext: SourceFunction.SourceContext[Sensor1]): Unit = {
      val random = new Random()
      // initial readings for 10 sensors, centered around 60
      var currentTemp = 1.to(10).map(
        i => ("sensor_" + i, random.nextGaussian() * 20 + 60)
      )
      // emit one reading per sensor every second until the job is cancelled
      while (running) {
        // random walk: add gaussian noise to each sensor's previous value
        currentTemp = currentTemp.map(tuple => {
          (tuple._1, tuple._2 + random.nextGaussian())
        })
        val ts = System.currentTimeMillis()
        currentTemp.foreach(tuple => {
          sourceContext.collect(Sensor1(tuple._1, ts, tuple._2))
        })
        Thread.sleep(1000)
      }
    }

    // called by Flink when the job is cancelled; lets the while loop in run() exit
    override def cancel(): Unit = running = false
  }

  case class Sensor1(id: String, timestamp: Long, tempreture: Double)
}
4. Flink operations
4.1 Transform
(1) Prepare a sensor.txt text file
sensor_1,1600828704738,62.00978962180007
sensor_2,1600828704738,88.07800632412795
sensor_3,1600828704738,63.85113916269769
sensor_4,1600828704738,88.11328700513668
sensor_5,1600828704738,104.80491942566778
sensor_6,1600828704738,57.27152286624301
sensor_7,1600828704738,42.542439944867574
sensor_8,1600828704738,59.3964933103558
sensor_9,1600828704738,59.967837312594796
sensor_10,1600828704738,77.23695484678282
sensor_1,1600828705745,62.5
sensor_1,1600828705745,88.86940686134874
sensor_3,1600828705745,64.9
sensor_1,1600828705745,87.8
sensor_5,1600828705745,104.33176752272263
sensor_6,1600828705745,56.14405735923403
(2) Apply basic transformations (map, flatMap, filter, keyBy, reduce) to the sensor data
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._

object Demo04_Transform {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")
    // map: parse each line into a Sensor1
    val value = stream.map(line => {
      val splitedArrs = line.split(",")
      Sensor1(splitedArrs(0), splitedArrs(1).trim.toLong, splitedArrs(2).trim.toDouble)
    })
    // flatMap: split each line into individual fields (built only to demonstrate the operator)
    val value1 = stream.flatMap(_.split(","))
    // filter: keep only lines belonging to sensor_1 (result not used further)
    stream.filter(_.split(",")(0) == "sensor_1")
    // keyBy + reduce: rolling reduce per sensor id, keeping the first timestamp
    // and adding 100 to the latest temperature
    value.keyBy("id")
      .reduce((s1, s2) => Sensor1(s1.id, s1.timestamp, s2.tempreture + 100))
      .print("keyby")
    // keyBy by tuple position returns a KeyedStream keyed by a java Tuple
    val value2: KeyedStream[Sensor1, Tuple] = value.keyBy(0)
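    // A more typical keyed aggregation, added here as a sketch using the field
    // names from the example above: keep a rolling maximum temperature per sensor.
    value.keyBy("id")
      .maxBy("tempreture")
      .print("max temperature")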
    env.execute("job")
  }

  case class Sensor1(id: String, timestamp: Long, tempreture: Double)
}
4.2 Split
import com.jinghang.day13.Demo04_Transform.Sensor1
import org.apache.flink.streaming.api.scala._

object Demo05_Split {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")
    val ds: DataStream[Sensor1] = stream.map(line => {
      val splited = line.split(",")
      Sensor1(splited(0), splited(1).trim.toLong, splited(2).trim.toDouble)
    })
    // split the stream into "high" and "low" branches by temperature
    val splitStream: SplitStream[Sensor1] = ds.split(
      sensor => {
        if (sensor.tempreture > 60)
          Seq("high")
        else
          Seq("low")
      })
    // select() pulls the named branches back out of the SplitStream
    val highStream: DataStream[Sensor1] = splitStream.select("high")
    val lowStream: DataStream[Sensor1] = splitStream.select("low")
    val allStream: DataStream[Sensor1] = splitStream.select("high", "low")
    // connect keeps the two inputs' types separate; union requires identical types
    val connStream: ConnectedStreams[Sensor1, Sensor1] = highStream.connect(lowStream)
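    // The connected stream is not consumed in the original example; as a sketch,
    // a CoMap can process the two sides with separate functions:
    connStream.map(
      high => (high.id, "high temperature"),
      low => (low.id, "normal temperature")
    ).print("connected")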
    val unionStream: DataStream[Sensor1] = highStream.union(lowStream).union(allStream)
    highStream.print("high")
    lowStream.print("low")
    allStream.print("all")
    env.execute("job")
  }
}
4.3 UDF
import com.jinghang.day13.Demo04_Transform.Sensor1
import org.apache.flink.api.common.functions.{FilterFunction, RichFilterFunction, RichFlatMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object Demo06_UDF {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")
    val ds: DataStream[Sensor1] = stream.map(line => {
      val splitedArrs = line.split(",")
      Sensor1(splitedArrs(0), splitedArrs(1).trim.toLong, splitedArrs(2).trim.toDouble)
    })
    // filter with a user-defined FilterFunction class
    val dataStream = ds.filter(new UDFFilter())
    // filter with an anonymous RichFilterFunction
    ds.filter(new RichFilterFunction[Sensor1] {
      override def filter(value: Sensor1): Boolean = {
        value.id == "sensor_1"
      }
      // open() can be used for setup work before the first record is processed
      override def open(parameters: Configuration): Unit = super.open(parameters)
    })
    dataStream.print("filter")
    env.execute("filter job")
  }

  class UDFFilter() extends FilterFunction[Sensor1] {
    override def filter(value: Sensor1): Boolean = {
      value.id == "sensor_1"
    }
  }
  class UDFFlatMap extends RichFlatMapFunction[String, String] {
    override def flatMap(value: String, out: Collector[String]): Unit = {
      // example body (left empty in the original): split the line and emit each field
      value.split(",").foreach(out.collect)
    }
  }
}
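The Rich* variants (RichFilterFunction, RichFlatMapFunction) differ from the plain function interfaces in that they provide lifecycle hooks (open/close) and access to the runtime context, which is why the anonymous RichFilterFunction above can override open() for one-time setup.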
4.4 JDBC Sink
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.flink.streaming.api.scala._

object Demo01_JdbcSink {

  def main(args: Array[String]): Unit = {
    // create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)
    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")
    val dataStream: DataStream[Sensor] = stream.map(line => {
      val splitedArrs = line.split(",")
      Sensor(splitedArrs(0), splitedArrs(1).trim.toLong, splitedArrs(2).trim.toDouble)
    })
    dataStream.addSink(new JDBCSink())
    env.execute("job")
  }

  case class Sensor(id: String, timestamp: Long, temperature: Double)

  class JDBCSink extends RichSinkFunction[Sensor] {

    var connection: Connection = _
    var insertStatement: PreparedStatement = _
    var updateStatement: PreparedStatement = _

    /**
     * Initialization: open the JDBC connection and prepare the statements
     * @param parameters
     */
    override def open(parameters: Configuration): Unit = {
      // driver class matching the mysql-connector-java 5.1.x dependency in the pom
      // (com.mysql.cj.jdbc.Driver is the Connector/J 8.x class name)
      Class.forName("com.mysql.jdbc.Driver")
      connection = DriverManager.getConnection("jdbc:mysql:///test?serverTimezone=UTC", "root", "123456")
      insertStatement = connection.prepareStatement("insert into temperatures (sensor,temps) values (?,?)")
      println(insertStatement)
      updateStatement = connection.prepareStatement("update temperatures set temps = ? where sensor = ?")
      println(updateStatement)
    }

    override def invoke(value: Sensor, context: SinkFunction.Context[_]): Unit = {
      // try to update first; insert only if no existing row was updated
      updateStatement.setDouble(1, value.temperature)
      updateStatement.setString(2, value.id)
      updateStatement.execute()
      if (updateStatement.getUpdateCount == 0) {
        insertStatement.setString(1, value.id)
        insertStatement.setDouble(2, value.temperature)
        insertStatement.execute()
      }
    }

    override def close(): Unit = {
      insertStatement.close()
      updateStatement.close()
      connection.close()
    }
  }
}
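The sink implements a simple upsert: each record first attempts an UPDATE, and only when no row is affected does it fall back to an INSERT. It assumes a table named temperatures with columns sensor and temps already exists in the test database on the local MySQL server.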
— Mastery of work comes from diligence and is lost in idleness; deeds are achieved through thought and ruined by carelessness —