Flink real-time project, day 04: 1. Case study: counting the users and events that click on or participate in an activity. 2. Multi-dimensional activity metrics (custom Redis sink)
1. Case study
Input fields: user ID, activity ID, time, event type, province
u001,A1,2019-09-02 10:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,2,北京市
u002,A1,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,2,北京市
Event types:
0: impression
1: click
2: participation
Requirement: for each activity and event type (click, participation), count both the number of distinct users and the total number of events.
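Before writing the Flink job, the target numbers can be checked offline with plain Java collections. This small sketch (not part of the original course code) aggregates the sample lines by (activity ID, event type) into a distinct-user count and an event count, which is exactly what the streaming jobs below compute incrementally:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class OfflineCheck {

    // Aggregates CSV lines (uid,aid,time,type,province) into
    // (aid,type) -> [distinct user count, event count].
    public static Map<String, int[]> aggregate(String[] lines) {
        Map<String, Set<String>> users = new TreeMap<>();
        Map<String, int[]> result = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            String key = f[1] + "," + f[3]; // activity ID + event type
            Set<String> set = users.computeIfAbsent(key, k -> new HashSet<>());
            set.add(f[0]);                  // dedupe user IDs
            int[] agg = result.computeIfAbsent(key, k -> new int[2]);
            agg[0] = set.size();            // distinct users
            agg[1] += 1;                    // raw event count
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "u001,A1,2019-09-02 10:10:11,1,北京市",
            "u001,A1,2019-09-02 14:10:11,1,北京市",
            "u001,A1,2019-09-02 14:10:11,2,北京市",
            "u002,A1,2019-09-02 14:10:11,1,北京市",
            "u002,A2,2019-09-02 14:10:11,1,北京市",
            "u002,A2,2019-09-02 15:10:11,1,北京市",
            "u002,A2,2019-09-02 15:10:11,2,北京市",
        };
        for (Map.Entry<String, int[]> e : aggregate(lines).entrySet()) {
            System.out.println(e.getKey() + " -> users=" + e.getValue()[0]
                    + ", count=" + e.getValue()[1]);
        }
        // A1,1 -> users=2, count=3
        // A1,2 -> users=1, count=1
        // A2,1 -> users=1, count=2
        // A2,2 -> users=1, count=1
    }
}
```

So for the sample data, activity A1 should report 2 distinct users and 3 events for clicks; these are the numbers the Flink jobs should reproduce.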
- Option 1: use ValueState combined with a HashSet
The full code is as follows.
ActivityCountAdv1

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;

public class ActivityCountAdv1 {

    public static void main(String[] args) throws Exception {
        LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        DataStreamSource<String> lines = env.socketTextStream("feng05", 8888);
        // Split and normalize the input lines
        SingleOutputStreamOperator<Tuple5<String, String, String, Integer, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, Integer, String>>() {
            @Override
            public Tuple5<String, String, String, Integer, String> map(String line) throws Exception {
                String[] fields = line.split(",");
                String uid = fields[0];
                String activityID = fields[1];
                String date = fields[2];
                Integer type = Integer.parseInt(fields[3]);
                String province = fields[4];
                return Tuple5.of(uid, activityID, date, type, province);
            }
        });
        // Key by activity ID (field 1) and event type (field 3)
        KeyedStream<Tuple5<String, String, String, Integer, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);
        keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, Integer, String>, Tuple4<String, Integer, Integer, Integer>>() {

            // HashSet of deduplicated user IDs, kept in keyed state
            private transient ValueState<HashSet<String>> uidState;
            // Event count (not deduplicated)
            private transient ValueState<Integer> countState;

            @Override
            public void open(Configuration parameters) throws Exception {
                // State descriptor for the user-ID set
                ValueStateDescriptor<HashSet<String>> stateDescriptor1 = new ValueStateDescriptor<HashSet<String>>(
                        "uid-state",
                        TypeInformation.of(new TypeHint<HashSet<String>>(){})
                );
                // State descriptor for the event count
                ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
                        "count-state",
                        Integer.class
                );
                // Obtain the state handles
                uidState = getRuntimeContext().getState(stateDescriptor1);
                countState = getRuntimeContext().getState(stateDescriptor2);
            }

            @Override
            public void processElement(Tuple5<String, String, String, Integer, String> value, Context ctx, Collector<Tuple4<String, Integer, Integer, Integer>> out) throws Exception {
                String uid = value.f0;
                String aid = value.f1;
                Integer type = value.f3;
                // Deduplicate with the HashSet, then write it back to uidState
                HashSet<String> hashSet = uidState.value();
                if (hashSet == null) {
                    hashSet = new HashSet<>();
                }
                hashSet.add(uid);
                uidState.update(hashSet);
                // Accumulate the (non-deduplicated) event count
                Integer count = countState.value();
                if (count == null) {
                    count = 0;
                }
                count += 1;
                countState.update(count);
                // hashSet.size() is the distinct-user count, count is the event count
                out.collect(Tuple4.of(aid, type, hashSet.size(), count));
            }
        }).print();
        env.execute();
    }
}
With HashSet-based deduplication, every user ID is kept in state. When the number of users is large, this consumes a great deal of memory, degrades performance, and can even cause an OutOfMemoryError.
- Option 2: an improvement — store the user IDs in a BloomFilter instead. A BloomFilter can tell with certainty that an element has NOT been seen (it has no false negatives, only a small false-positive rate) and uses very little memory. However, a BloomFilter has no counter of its own, so an additional piece of state must be defined to store the deduplicated user count.
ActivityCountAdv2

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.hash.BloomFilter;
import org.apache.flink.shaded.guava18.com.google.common.hash.Funnels;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ActivityCountAdv2 {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Sample input: u001,A1,2019-09-02 10:10:11,1,北京市
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
        // Split and normalize the input lines
        SingleOutputStreamOperator<Tuple5<String, String, String, String, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, String, String>>() {
            @Override
            public Tuple5<String, String, String, String, String> map(String line) throws Exception {
                String[] fields = line.split(",");
                String uid = fields[0];
                String aid = fields[1];
                String time = fields[2];
                String type = fields[3];
                String province = fields[4];
                return Tuple5.of(uid, aid, time, type, province);
            }
        });
        // Key by activity ID (field 1) and event type (field 3)
        KeyedStream<Tuple5<String, String, String, String, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);
        keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, String, String>, Tuple4<String, String, Integer, Integer>>() {

            // BloomFilter of the user IDs seen so far, kept in keyed state
            private transient ValueState<BloomFilter> uidState;
            // Deduplicated user count (the BloomFilter itself cannot count)
            private transient ValueState<Integer> uidCountState;
            // Event count (not deduplicated)
            private transient ValueState<Integer> countState;

            @Override
            public void open(Configuration parameters) throws Exception {
                // State descriptor for the BloomFilter
                ValueStateDescriptor<BloomFilter> stateDescriptor1 = new ValueStateDescriptor<BloomFilter>(
                        "uid-state",
                        TypeInformation.of(new TypeHint<BloomFilter>(){})
                );
                // State descriptor for the event count
                ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
                        "count-state",
                        Integer.class
                );
                // State descriptor for the deduplicated user count
                ValueStateDescriptor<Integer> stateDescriptor3 = new ValueStateDescriptor<Integer>(
                        "uid-count-state",
                        Integer.class
                );
                // Obtain the state handles
                uidState = getRuntimeContext().getState(stateDescriptor1);
                countState = getRuntimeContext().getState(stateDescriptor2);
                uidCountState = getRuntimeContext().getState(stateDescriptor3);
            }

            @Override
            public void processElement(Tuple5<String, String, String, String, String> value, Context ctx, Collector<Tuple4<String, String, Integer, Integer>> out) throws Exception {
                String uid = value.f0;
                String aid = value.f1;
                String type = value.f3;
                // Deduplicate with the BloomFilter instead of a HashSet
                BloomFilter bloomFilter = uidState.value();
                Integer uidCount = uidCountState.value(); // distinct users
                Integer count = countState.value();       // total events
                if (count == null) {
                    count = 0;
                }
                if (bloomFilter == null) {
                    bloomFilter = BloomFilter.create(Funnels.unencodedCharsFunnel(), 10000000);
                    uidCount = 0;
                }
                // mightContain == false means the uid is definitely new
                if (!bloomFilter.mightContain(uid)) {
                    bloomFilter.put(uid); // add it to the BloomFilter
                    uidCount += 1;
                }
                count += 1;
                countState.update(count);
                uidState.update(bloomFilter);
                uidCountState.update(uidCount);
                out.collect(Tuple4.of(aid, type, uidCount, count));
            }
        }).print();
        env.execute();
    }
}
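The BloomFilter above comes from the Guava library shaded inside the Flink distribution. To make its "no false negatives, occasional false positives" trade-off concrete, here is a minimal JDK-only Bloom filter sketch (an illustration of the idea, not the Guava implementation; the double-hashing scheme is a common simplification):

```java
import java.util.BitSet;

public class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public TinyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k bit positions from two base hashes (Kirsch-Mitzenmacher style)
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, size);
    }

    // Set all k bits for the value
    public void put(String value) {
        for (int i = 0; i < hashes; i++) {
            bits.set(position(value, i));
        }
    }

    // false -> the value was definitely never put() (no false negatives)
    // true  -> the value was probably put() (false positives possible)
    public boolean mightContain(String value) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because a membership test can return a false positive, a few genuinely new users may occasionally be missed by the count; this is the accuracy-for-memory trade, and it is why the job keeps `uidCountState` as a separate counter rather than asking the filter for a size.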
2. Multi-dimensional activity metrics
This approach needs several keyBy operations (one per dimension combination), which is fairly tedious. Because the aggregated results are written out to Redis, there is no need to define Flink state by hand here; see the code below.
ActivityCountWithMultiDimension

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ActivityCountWithMultiDimension {

    public static void main(String[] args) throws Exception {
        // Redis connection settings are read from a properties file passed as args[0]
        ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().setGlobalJobParameters(parameters);
        // Sample input: u001,A1,2019-09-02 10:10:11,1,北京市
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
        SingleOutputStreamOperator<ActivityBean> beanStream = lines.map(new MapFunction<String, ActivityBean>() {
            @Override
            public ActivityBean map(String line) throws Exception {
                String[] fields = line.split(",");
                String uid = fields[0];
                String aid = fields[1];
                String date = fields[2].split(" ")[0];
                String type = fields[3];
                String province = fields[4];
                return ActivityBean.of(uid, aid, date, type, province);
            }
        });
        // One keyBy + sum per dimension combination
        SingleOutputStreamOperator<ActivityBean> res1 = beanStream.keyBy("aid", "type").sum("count");
        SingleOutputStreamOperator<ActivityBean> res2 = beanStream.keyBy("aid", "type", "date").sum("count");
        SingleOutputStreamOperator<ActivityBean> res3 = beanStream.keyBy("aid", "type", "date", "province").sum("count");
        res1.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
            @Override
            public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
                return Tuple3.of(Constant.ACTIVITY_COUNT + "-" + value.aid, value.type, value.count.toString());
            }
        }).addSink(new MyRedisSink());
        res2.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
            @Override
            public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
                return Tuple3.of(Constant.DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date, value.type, value.count.toString());
            }
        }).addSink(new MyRedisSink());
        res3.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
            @Override
            public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
                return Tuple3.of(Constant.PROVINCE_DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date + "-" + value.province, value.type, value.count.toString());
            }
        }).addSink(new MyRedisSink());
        env.execute();
    }
}
Constant

package cn._51doit.flink.day08;
public class Constant {
public static final String ACTIVITY_COUNT = "ACTIVITY_COUNT";
public static final String DAILY_ACTIVITY_COUNT = "DAILY_ACTIVITY_COUNT";
public static final String PROVINCE_DAILY_ACTIVITY_COUNT = "PROVINCE_DAILY_ACTIVITY_COUNT";
}
ActivityBean

package cn._51doit.flink.day08;
public class ActivityBean {
public String uid;
public String aid;
public String date;
public String type;
public String province;
public Long count = 1L;
public ActivityBean() {}
public ActivityBean(String uid, String aid, String date, String type, String province) {
this.uid = uid;
this.aid = aid;
this.date = date;
this.type = type;
this.province = province;
}
public static ActivityBean of(String uid, String aid, String date, String type, String province) {
return new ActivityBean(uid, aid, date, type, province);
}
@Override
public String toString() {
return "ActivityBean{" +
"uid='" + uid + '\'' +
", aid='" + aid + '\'' +
", date='" + date + '\'' +
", type='" + type + '\'' +
", province='" + province + '\'' +
'}';
}
}
MyRedisSink

package cn._51doit.flink.day08;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

public class MyRedisSink extends RichSinkFunction<Tuple3<String, String, String>> {

    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) throws Exception {
        ParameterTool params = (ParameterTool) getRuntimeContext()
                .getExecutionConfig()
                .getGlobalJobParameters();
        String host = params.getRequired("redis.host");
        String password = params.getRequired("redis.password");
        int port = params.getInt("redis.port", 6379);
        int db = params.getInt("redis.db", 0);
        Jedis jedis = new Jedis(host, port);
        jedis.auth(password);
        jedis.select(db);
        this.jedis = jedis;
    }

    @Override
    public void invoke(Tuple3<String, String, String> value, Context context) throws Exception {
        if (!jedis.isConnected()) {
            jedis.connect();
        }
        jedis.hset(value.f0, value.f1, value.f2);
    }

    @Override
    public void close() throws Exception {
        jedis.close();
    }
}
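The sink stores each result with `hset(key, field, value)`, so every dimension combination becomes a Redis hash whose key is built from a constant prefix plus the dimension values, with the event type as the field. A small helper (the `key` method here is illustrative; the key formats follow the three map functions above) makes the layout explicit:

```java
public class RedisKeys {

    // Builds the hash key the sink writes to: PREFIX-part1-part2-...
    public static String key(String prefix, String... parts) {
        StringBuilder sb = new StringBuilder(prefix);
        for (String p : parts) {
            sb.append('-').append(p);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Each line corresponds to one hset key, with the event type as field
        System.out.println(key("ACTIVITY_COUNT", "A1"));
        // ACTIVITY_COUNT-A1
        System.out.println(key("DAILY_ACTIVITY_COUNT", "A1", "2019-09-02"));
        // DAILY_ACTIVITY_COUNT-A1-2019-09-02
        System.out.println(key("PROVINCE_DAILY_ACTIVITY_COUNT", "A1", "2019-09-02", "北京市"));
        // PROVINCE_DAILY_ACTIVITY_COUNT-A1-2019-09-02-北京市
    }
}
```

With this layout, reading the daily click count for activity A1 is a single `HGET DAILY_ACTIVITY_COUNT-A1-2019-09-02 1` in redis-cli.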