Using the Flink API in Scala, read data from different data sources (the source side), process it as unbounded/bounded streams, and finally sink the processed data to the corresponding target systems.

一、Maven configuration

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>flinkapi</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>1.7.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>1.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
            <version>1.7.2</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.44</version>
        </dependency>
        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.11</artifactId>
            <version>1.0</version>
        </dependency>
        <!--<dependency>-->
            <!--<groupId>org.apache.flink</groupId>-->
            <!--<artifactId>flink-connector-elasticsearch6_2.11</artifactId>-->
            <!--<version>1.7.2</version>-->
        <!--</dependency>-->
        <!--<dependency>-->
            <!--<groupId>org.apache.flink</groupId>-->
            <!--<artifactId>flink-statebackend-rocksdb_2.11</artifactId>-->
            <!--<version>1.7.2</version>-->
        <!--</dependency>-->
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <!--<arg>-make:transitive</arg>-->
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>WordCount</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

二、WordCount

(1) Prepare the wordcount.txt text file

hello world
dog fish
hadoop
spark
hello world
dog fish
hadoop
spark
hello world
dog fish
hadoop
spark

(2) Stateful computation on a bounded data stream (Scala)

package com.harley.test

import org.apache.flink.api.scala._

object Demo01_ExecutionEnvironment {

    def main(args: Array[String]): Unit = {

        val env = ExecutionEnvironment.getExecutionEnvironment

        val ds: DataSet[String] = env.readTextFile("src/main/resources/wordcount.txt")

        val res = ds.flatMap(_.split(" "))
          .map((_, 1))
          .groupBy(0) // group by the first element of the tuple
          .sum(1)

        // print the result
        res.print()
    }
}

(3) Stateful computation on an unbounded data stream

package com.harley.test

import org.apache.flink.streaming.api.scala._

object Demo02_StreamExecutionEnvironment {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val ds: DataStream[String] = env.socketTextStream("localhost", 9999)

    val res = ds.flatMap(_.split(" "))
      .filter(_ != "a")
      .map((_, 1))
      .keyBy(0)
      .sum(1)

    // set the parallelism of the print sink to 2
    res.print("Stream print").setParallelism(2)

    env.execute("Stream job")
  }
}

三、Integrating different data sources

3.1、Using a collection as the data source

import org.apache.flink.streaming.api.scala._

object Demo01_Source{

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val ds = env.fromCollection(List(
      SensorReader("sensor_1", 1547718199, 35.80018327300259),
      SensorReader("sensor_6", 1547718201, 15.402984393403084),
      SensorReader("sensor_7", 1547718202, 6.720945201171228),
      SensorReader("sensor_10", 1547718205, 38.101067604893444)
    ))

    ds.print("collection source")

    env.execute("source job")
  }

  case class SensorReader(id: String, timestamp: Long, temperature: Double)
}

3.2、Kafka as the data source

(1) Start the Kafka service, create a topic, and start a producer to send data into it (a topic-creation command is sketched after the commands below)

nohup bin/kafka-server-start.sh config/server.properties &
bin/kafka-console-producer.sh --bootstrap-server node7-2:9092 --topic test
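The topic-creation step itself has no command in the original. Assuming a Kafka release new enough to accept --bootstrap-server on this tool (older releases use --zookeeper here), and with the partition/replication settings only as an example, it would look roughly like:

bin/kafka-topics.sh --create --bootstrap-server node7-2:9092 --topic test --partitions 1 --replication-factor 1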

(2) Consume the data in the topic through the Flink API

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

object Demo02_KafkaSource {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)

    val props = new Properties()
    props.setProperty("bootstrap.servers", "node7-2:9092")
    props.setProperty("group.id", "kafkasource")
    props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.setProperty("auto.offset.reset", "latest")

    // the consumer class matches the flink-connector-kafka-0.11 dependency declared in the pom
    val stream = env.addSource(new FlinkKafkaConsumer011[String](
      "test", // topic
      new SimpleStringSchema(),
      props
    ))

    stream.print("kafka source")
    env.execute("job")
  }
}
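For the sink side of the pipeline, the same 0.11 connector also ships a producer. The sketch below is only an illustration of that path, not part of the original: the test-out topic name and the socket source are assumptions.

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011

object Demo02b_KafkaSink {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream: DataStream[String] = env.socketTextStream("localhost", 9999)

    // write every incoming line to the (assumed) Kafka topic "test-out"
    stream.addSink(new FlinkKafkaProducer011[String](
      "node7-2:9092",          // broker list
      "test-out",              // target topic (assumed name)
      new SimpleStringSchema() // serializes each String record
    ))

    env.execute("kafka sink job")
  }
}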

3.3、Custom source

import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._

import scala.util.Random

object Demo03_DefinedSource {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource())

    stream.print("defined source")
    env.execute("defined job")
  }

  class SensorSource extends SourceFunction[Sensor1] {

    var running: Boolean = true

    override def run(sourceContext: SourceFunction.SourceContext[Sensor1]): Unit = {
      val random = new Random()
      // initial temperature for 10 sensors
      var currentTemp = 1.to(10).map(
        i => ("sensor_" + i, random.nextGaussian() * 20 + 60)
      )

      while (running) {
        // random-walk each sensor's temperature and emit a reading per second
        currentTemp = currentTemp.map(tuple => {
          (tuple._1, tuple._2 + random.nextGaussian())
        })

        val ts = System.currentTimeMillis()
        currentTemp.foreach(tuple => {
          sourceContext.collect(Sensor1(tuple._1, ts, tuple._2))
        })

        Thread.sleep(1000)
      }
    }

    override def cancel(): Unit = running = false
  }

  case class Sensor1(id: String, timestamp: Long, tempreture: Double)
}

四、Flink operations

4.1、Transform

(1) Prepare the text file sensor.txt

sensor_1,1600828704738,62.00978962180007
sensor_2,1600828704738,88.07800632412795
sensor_3,1600828704738,63.85113916269769
sensor_4,1600828704738,88.11328700513668
sensor_5,1600828704738,104.80491942566778
sensor_6,1600828704738,57.27152286624301
sensor_7,1600828704738,42.542439944867574
sensor_8,1600828704738,59.3964933103558
sensor_9,1600828704738,59.967837312594796
sensor_10,1600828704738,77.23695484678282
sensor_1,1600828705745,62.5
sensor_1,1600828705745,88.86940686134874
sensor_3,1600828705745,64.9
sensor_1,1600828705745,87.8
sensor_5,1600828705745,104.33176752272263
sensor_6,1600828705745,56.14405735923403

(2) Basic transformation operators (map, flatMap, filter, keyBy, reduce)

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._

object Demo04_Transform {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")

    // map: parse each CSV line into a Sensor1
    val value = stream.map(line => {
      val splitedArrs = line.split(",")
      Sensor1(splitedArrs(0), splitedArrs(1).trim.toLong, splitedArrs(2).trim.toDouble)
    })

    // flatMap: split each line into individual fields
    val value1 = stream.flatMap(_.split(","))

    // filter: keep only sensor_1 records
    stream.filter(_.split(",")(0) == "sensor_1")

    // keyBy a field name, then reduce within each key
    value.keyBy("id")
      .reduce((s1, s2) => Sensor1(s1.id, s1.timestamp, s2.tempreture + 100))
      .print("keyby")

    // keyBy by tuple position returns a KeyedStream keyed by a Tuple
    val value2: KeyedStream[Sensor1, Tuple] = value.keyBy(0)

    env.execute("job")
  }

  case class Sensor1(id: String, timestamp: Long, tempreture: Double)
}

4.2、Split

import com.jinghang.day13.Demo04_Transform.Sensor1
import org.apache.flink.streaming.api.scala._

object Demo05_Split {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")

    val ds: DataStream[Sensor1] = stream.map(line => {
      val splited = line.split(",")
      Sensor1(splited(0), splited(1).trim.toLong, splited(2).trim.toDouble)
    })

    // tag each record as "high" or "low" depending on its temperature
    val splitStream: SplitStream[Sensor1] = ds.split(sensor => {
      if (sensor.tempreture > 60)
        Seq("high")
      else
        Seq("low")
    })

    val highStream: DataStream[Sensor1] = splitStream.select("high")
    val lowStream: DataStream[Sensor1] = splitStream.select("low")
    val allStream: DataStream[Sensor1] = splitStream.select("high", "low")

    // connect keeps the two streams' types separate; union requires the same type
    val connStream: ConnectedStreams[Sensor1, Sensor1] = highStream.connect(lowStream)
    val unionStream: DataStream[Sensor1] = highStream.union(lowStream).union(allStream)

    highStream.print("high")
    lowStream.print("low")
    allStream.print("all")

    env.execute("job")
  }
}

4.3、UDF

import com.jinghang.day13.Demo04_Transform.Sensor1
import org.apache.flink.api.common.functions.{FilterFunction, RichFilterFunction, RichFlatMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object Demo06_UDF {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")

    val ds: DataStream[Sensor1] = stream.map(line => {
      val splitedArrs = line.split(",")
      Sensor1(splitedArrs(0), splitedArrs(1).trim.toLong, splitedArrs(2).trim.toDouble)
    })

    // use a named filter function class
    val dataStream = ds.filter(new UDFFilter())

    // or an anonymous rich function
    ds.filter(new RichFilterFunction[Sensor1] {
      override def filter(value: Sensor1): Boolean = {
        value.id == "sensor_1"
      }

      // open() is the place for setup work
      override def open(parameters: Configuration): Unit = super.open(parameters)
    })

    dataStream.print("filter")
    env.execute("filter job")
  }

  class UDFFilter() extends FilterFunction[Sensor1] {
    override def filter(value: Sensor1): Boolean = {
      value.id == "sensor_1"
    }
  }

  class UDFFlatMap extends RichFlatMapFunction[String, String] {
    override def flatMap(value: String, out: Collector[String]): Unit = {
      // e.g. emit every comma-separated field of the input line
      value.split(",").foreach(out.collect)
    }
  }
}
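UDFFlatMap is defined but never applied in main; assuming the same raw stream of lines, wiring it in would look like:

stream.flatMap(new UDFFlatMap()).print("flatmap")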

4.4、JDBC Sink

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.flink.streaming.api.scala._

object Demo01_JdbcSink {

  def main(args: Array[String]): Unit = {
    // create the stream execution environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    // set the global parallelism
    env.setParallelism(1)

    val stream: DataStream[String] = env.readTextFile("src/main/resources/sensor.txt")

    val dataStream: DataStream[Sensor] = stream.map(line => {
      val splitedArrs = line.split(",")
      Sensor(splitedArrs(0), splitedArrs(1).trim.toLong, splitedArrs(2).trim.toDouble)
    })

    dataStream.addSink(new JDBCSink())
    env.execute("job")
  }

  case class Sensor(id: String, timestamp: Long, temperature: Double)

  class JDBCSink extends RichSinkFunction[Sensor] {

    var connection: Connection = _
    var insertStatement: PreparedStatement = _
    var updateStatement: PreparedStatement = _

    /**
     * Initialization: open the connection and prepare the statements.
     * The driver class matches the mysql-connector-java 5.1.x dependency in the pom.
     */
    override def open(parameters: Configuration): Unit = {
      Class.forName("com.mysql.jdbc.Driver")
      connection = DriverManager.getConnection("jdbc:mysql:///test?serverTimezone=UTC", "root", "123456")
      insertStatement = connection.prepareStatement("insert into temperatures (sensor, temps) values (?, ?)")
      println(insertStatement)
      updateStatement = connection.prepareStatement("update temperatures set temps = ? where sensor = ?")
      println(updateStatement)
    }

    // try an update first; if no row was touched, insert a new one
    override def invoke(value: Sensor, context: SinkFunction.Context[_]): Unit = {
      updateStatement.setDouble(1, value.temperature)
      updateStatement.setString(2, value.id)
      updateStatement.execute()

      if (updateStatement.getUpdateCount == 0) {
        insertStatement.setString(1, value.id)
        insertStatement.setDouble(2, value.temperature)
        insertStatement.execute()
      }
    }

    override def close(): Unit = {
      insertStatement.close()
      updateStatement.close()
      connection.close()
    }
  }
}
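The pom also pulls in Bahir's flink-connector-redis, which is never used above. A minimal sketch of a Redis sink for the same sensor records follows; the Redis host/port and the hash key sensor_temperature are assumptions for illustration, not part of the original.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

object Demo02_RedisSink {

  case class Sensor(id: String, timestamp: Long, temperature: Double)

  // maps a Sensor to an HSET on the (assumed) hash "sensor_temperature"
  class SensorRedisMapper extends RedisMapper[Sensor] {
    override def getCommandDescription: RedisCommandDescription =
      new RedisCommandDescription(RedisCommand.HSET, "sensor_temperature")

    override def getKeyFromData(data: Sensor): String = data.id

    override def getValueFromData(data: Sensor): String = data.temperature.toString
  }

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val dataStream: DataStream[Sensor] = env
      .readTextFile("src/main/resources/sensor.txt")
      .map(line => {
        val fields = line.split(",")
        Sensor(fields(0), fields(1).trim.toLong, fields(2).trim.toDouble)
      })

    // assumed local Redis instance
    val conf = new FlinkJedisPoolConfig.Builder()
      .setHost("localhost")
      .setPort(6379)
      .build()

    dataStream.addSink(new RedisSink[Sensor](conf, new SensorRedisMapper()))

    env.execute("redis sink job")
  }
}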

