Flink real-time project, day 04: 1. Case study: counting the users and events for clicks and participation per activity; 2. Multi-dimensional activity metrics (custom Redis sink)
1. Case study
Sample records (user ID, activity ID, time, event type, province):
u001,A1,2019-09-02 10:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,1,北京市
u001,A1,2019-09-02 14:10:11,2,北京市
u002,A1,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 14:10:11,1,北京市
u002,A2,2019-09-02 15:10:11,1,北京市
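u002,A2,2019-09-02 15:10:11,2,北京市

Event types:
0: impression
1: click
2: participation

Requirement: for each activity, count the number of distinct users and the number of events, separately for clicks and for participation.
- Approach 1: ValueState combined with a HashSet
The code is as follows.
ActivityCountAdv1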

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;

public class ActivityCountAdv1 {
    public static void main(String[] args) throws Exception {
        LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        DataStreamSource<String> lines = env.socketTextStream("feng05", 8888);
        // Split each line into its fields
        SingleOutputStreamOperator<Tuple5<String, String, String, Integer, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, Integer, String>>() {
            @Override
            public Tuple5<String, String, String, Integer, String> map(String line) throws Exception {
                String[] fields = line.split(",");
                String uid = fields[0];
                String activityID = fields[1];
                String date = fields[2];
                Integer type = Integer.parseInt(fields[3]);
                String province = fields[4];
                return Tuple5.of(uid, activityID, date, type, province);
            }
        });
        // Key by activity ID (field 1) and event type (field 3)
        KeyedStream<Tuple5<String, String, String, Integer, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);

        keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, Integer, String>, Tuple4<String, Integer, Integer, Integer>>() {
            // HashSet holding the deduplicated user IDs
            private transient ValueState<HashSet<String>> uidState;
            // Number of events (not deduplicated)
            private transient ValueState<Integer> countState;

            @Override
            public void open(Configuration parameters) throws Exception {
                // State descriptor for the user-ID set
                ValueStateDescriptor<HashSet<String>> stateDescriptor1 = new ValueStateDescriptor<HashSet<String>>(
                        "uid-state",
                        TypeInformation.of(new TypeHint<HashSet<String>>(){})
                );
                // State descriptor for the event counter
                ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
                        "count-state",
                        Integer.class
                );
                // Obtain the state handles
                uidState = getRuntimeContext().getState(stateDescriptor1);
                countState = getRuntimeContext().getState(stateDescriptor2);
            }

            @Override
            public void processElement(Tuple5<String, String, String, Integer, String> value, Context ctx, Collector<Tuple4<String, Integer, Integer, Integer>> out) throws Exception {
                String uid = value.f0;
                String aid = value.f1;
                Integer type = value.f3;
                // Deduplicate with the HashSet and update uidState
                HashSet<String> hashSet = uidState.value();
                if (hashSet == null) {
                    hashSet = new HashSet<>();
                }
                hashSet.add(uid);
                uidState.update(hashSet);
                // Count the events
                Integer count = countState.value();
                if (count == null) {
                    count = 0;
                }
                count += 1;
                countState.update(count);
                // Emit (activity ID, event type, distinct users, events)
                out.collect(Tuple4.of(aid, type, hashSet.size(), count));
            }
        }).print();
        env.execute();
    }
}
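Deduplicating with a HashSet becomes expensive when there are many users: every user ID is kept in state, which consumes a lot of resources, slows the job down, and can even cause an out-of-memory error.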
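To make the per-key bookkeeping of Approach 1 easier to follow, here is a minimal plain-Java sketch with no Flink involved. It keeps one HashSet and one counter per (activity ID, event type) pair, which is roughly what the keyed state does for each key. The class name `ActivityCountSketch` and the helper `run` are hypothetical, and "Beijing" is used as an ASCII stand-in for the province field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch (no Flink): one HashSet and one counter per (activity ID, event type) key.
final class ActivityCountSketch {

    // Hypothetical helper: returns one "aid,type,distinctUsers,events" line per input record.
    static List<String> run(List<String> lines) {
        Map<String, Set<String>> usersByKey = new HashMap<>();
        Map<String, Integer> eventsByKey = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split(",");
            String key = f[1] + "," + f[3];            // activity ID + event type, like keyBy(1, 3)
            Set<String> users = usersByKey.computeIfAbsent(key, k -> new HashSet<>());
            users.add(f[0]);                           // a set silently ignores duplicate user IDs
            int events = eventsByKey.merge(key, 1, Integer::sum);
            out.add(key + "," + users.size() + "," + events);
        }
        return out;
    }

    public static void main(String[] args) {
        // "Beijing" is an ASCII stand-in for the province column of the sample data.
        List<String> sample = Arrays.asList(
                "u001,A1,2019-09-02 10:10:11,1,Beijing",
                "u001,A1,2019-09-02 14:10:11,1,Beijing",
                "u001,A1,2019-09-02 14:10:11,2,Beijing",
                "u002,A1,2019-09-02 14:10:11,1,Beijing");
        run(sample).forEach(System.out::println);
    }
}
```

Every user ID ever seen for a key stays in that key's set, which is exactly why memory grows with the number of users.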
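- Approach 2 (improved): store the user IDs in a BloomFilter. A Bloom filter uses very little memory and can tell with certainty that a user has not been seen before. A positive answer may be a false positive, so the distinct-user figure is an approximation that can slightly undercount. Because a BloomFilter has no element counter, an extra state must be defined to hold the deduplicated user count.
ActivityCountAdv2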

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.hash.BloomFilter;
import org.apache.flink.shaded.guava18.com.google.common.hash.Funnels;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ActivityCountAdv2 {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // u001,A1,2019-09-02 10:10:11,1,北京市
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        // Split each line into its fields
        SingleOutputStreamOperator<Tuple5<String, String, String, String, String>> tpDataStream = lines.map(new MapFunction<String, Tuple5<String, String, String, String, String>>() {
            @Override
            public Tuple5<String, String, String, String, String> map(String line) throws Exception {
                String[] fields = line.split(",");
                String uid = fields[0];
                String aid = fields[1];
                String time = fields[2];
                String type = fields[3];
                String province = fields[4];
                return Tuple5.of(uid, aid, time, type, province);
            }
        });

        // Key by activity ID (field 1) and event type (field 3)
        KeyedStream<Tuple5<String, String, String, String, String>, Tuple> keyed = tpDataStream.keyBy(1, 3);

        keyed.process(new KeyedProcessFunction<Tuple, Tuple5<String, String, String, String, String>, Tuple4<String, String, Integer, Integer>>() {

            // BloomFilter recording which user IDs have been seen
            private transient ValueState<BloomFilter<CharSequence>> uidState;
            // Number of distinct users (deduplicated)
            private transient ValueState<Integer> uidCountState;
            // Number of events (not deduplicated)
            private transient ValueState<Integer> countState;

            @Override
            public void open(Configuration parameters) throws Exception {
                // State descriptor for the BloomFilter
                ValueStateDescriptor<BloomFilter<CharSequence>> stateDescriptor1 = new ValueStateDescriptor<BloomFilter<CharSequence>>(
                        "uid-state",
                        TypeInformation.of(new TypeHint<BloomFilter<CharSequence>>(){})
                );
                // State descriptor for the event counter
                ValueStateDescriptor<Integer> stateDescriptor2 = new ValueStateDescriptor<Integer>(
                        "count-state",
                        Integer.class
                );
                // State descriptor for the distinct-user counter
                ValueStateDescriptor<Integer> stateDescriptor3 = new ValueStateDescriptor<Integer>(
                        "uid-count-state",
                        Integer.class
                );
                // Obtain the state handles
                uidState = getRuntimeContext().getState(stateDescriptor1);
                countState = getRuntimeContext().getState(stateDescriptor2);
                uidCountState = getRuntimeContext().getState(stateDescriptor3);
            }

            @Override
            public void processElement(Tuple5<String, String, String, String, String> value, Context ctx, Collector<Tuple4<String, String, Integer, Integer>> out) throws Exception {
                String uid = value.f0;
                String aid = value.f1;
                String type = value.f3;
                // Deduplicate with the BloomFilter
                BloomFilter<CharSequence> bloomFilter = uidState.value();
                Integer uidCount = uidCountState.value(); // distinct users
                Integer count = countState.value();       // events
                if (count == null) {
                    count = 0;
                }
                if (bloomFilter == null) {
                    bloomFilter = BloomFilter.create(Funnels.unencodedCharsFunnel(), 10000000);
                    uidCount = 0;
                }
                if (!bloomFilter.mightContain(uid)) {
                    bloomFilter.put(uid); // record the user ID in the BloomFilter
                    uidCount += 1;
                }
                count += 1;
                countState.update(count);
                uidState.update(bloomFilter);
                uidCountState.update(uidCount);
                // Emit (activity ID, event type, distinct users, events)
                out.collect(Tuple4.of(aid, type, uidCount, count));
            }
        }).print();

        env.execute();
    }
}
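The reason Approach 2 needs a separate distinct-user counter can be shown with a toy Bloom filter. The sketch below is built on `java.util.BitSet` and is not Guava's implementation. The class `TinyBloomFilter`, its sizing parameters, and the double-hashing scheme are assumptions chosen for illustration only.

```java
import java.util.BitSet;

// Toy Bloom filter on top of java.util.BitSet (an illustration, not Guava's BloomFilter).
final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;      // number of bits
    private final int hashes;    // number of bit positions set per value
    private int distinct;        // separate counter: the bit set itself cannot be counted

    TinyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derives the i-th bit position from two base hashes (simple double hashing).
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1;                 // odd step so successive positions differ
        return Math.floorMod(h1 + i * h2, size);
    }

    // false: the value was definitely never added. true: it was probably added.
    boolean mightContain(String value) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Adds the value and bumps the distinct counter only if the filter had not seen it.
    void add(String value) {
        if (!mightContain(value)) {
            distinct++;
        }
        for (int i = 0; i < hashes; i++) {
            bits.set(position(value, i));
        }
    }

    int distinctCount() {
        return distinct;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3);
        filter.add("u001");
        filter.add("u001");   // duplicate, not counted again
        filter.add("u002");
        System.out.println("distinct users (approximate): " + filter.distinctCount());
    }
}
```

A false positive makes the filter skip a genuinely new user, so the counter can only err on the low side. A larger bit array lowers that risk at the cost of memory.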
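2. Multi-dimensional activity metrics
Counting along several dimensions requires one keyBy per dimension, which is rather tedious. Here the results are written to Redis, so the job defines no custom Flink state (the built-in sum aggregation keeps the running totals). See the code for details.
ActivityCountWithMultiDimension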

package cn._51doit.flink.day08;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ActivityCountWithMultiDimension {

    public static void main(String[] args) throws Exception {

        // args[0] is the path of a properties file holding the Redis settings
        ParameterTool parameters = ParameterTool.fromPropertiesFile(args[0]);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Make the parameters available to every operator, including the Redis sink
        env.getConfig().setGlobalJobParameters(parameters);

        // u001,A1,2019-09-02 10:10:11,1,北京市
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<ActivityBean> beanStream = lines.map(new MapFunction<String, ActivityBean>() {
            @Override
            public ActivityBean map(String line) throws Exception {
                String[] fields = line.split(",");
                String uid = fields[0];
                String aid = fields[1];
                String date = fields[2].split(" ")[0]; // keep only the date part
                String type = fields[3];
                String province = fields[4];
                return ActivityBean.of(uid, aid, date, type, province);
            }
        });

        // One keyBy per dimension combination
        SingleOutputStreamOperator<ActivityBean> res1 = beanStream.keyBy("aid", "type").sum("count");
        SingleOutputStreamOperator<ActivityBean> res2 = beanStream.keyBy("aid", "type", "date").sum("count");
        SingleOutputStreamOperator<ActivityBean> res3 = beanStream.keyBy("aid", "type", "date", "province").sum("count");

        // Each result becomes (Redis key, hash field, value)
        res1.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
            @Override
            public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
                return Tuple3.of(Constant.ACTIVITY_COUNT + "-" + value.aid, value.type, value.count.toString());
            }
        }).addSink(new MyRedisSink());

        res2.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
            @Override
            public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
                return Tuple3.of(Constant.DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date, value.type, value.count.toString());
            }
        }).addSink(new MyRedisSink());

        res3.map(new MapFunction<ActivityBean, Tuple3<String, String, String>>() {
            @Override
            public Tuple3<String, String, String> map(ActivityBean value) throws Exception {
                return Tuple3.of(Constant.PROVINCE_DAILY_ACTIVITY_COUNT + "-" + value.aid + "-" + value.date + "-" + value.province, value.type, value.count.toString());
            }
        }).addSink(new MyRedisSink());

        env.execute();
    }
}
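The three map functions above turn every result into a (Redis key, hash field, value) triple. The following sketch shows the resulting layout with an in-memory map standing in for Redis hashes. `RedisLayoutSketch` and its helper methods are hypothetical names, and no Redis client is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory stand-in for Redis hashes, used only to illustrate the key layout.
final class RedisLayoutSketch {
    private final Map<String, Map<String, String>> store = new LinkedHashMap<>();

    // Mirrors the effect of HSET key field value: a later write replaces the earlier one.
    void hset(String key, String field, String value) {
        store.computeIfAbsent(key, k -> new LinkedHashMap<>()).put(field, value);
    }

    String hget(String key, String field) {
        Map<String, String> hash = store.get(key);
        return hash == null ? null : hash.get(field);
    }

    // Key builders following the same concatenation as the three map functions.
    static String activityKey(String aid) {
        return "ACTIVITY_COUNT-" + aid;
    }

    static String dailyKey(String aid, String date) {
        return "DAILY_ACTIVITY_COUNT-" + aid + "-" + date;
    }

    static String provinceDailyKey(String aid, String date, String province) {
        return "PROVINCE_DAILY_ACTIVITY_COUNT-" + aid + "-" + date + "-" + province;
    }

    public static void main(String[] args) {
        RedisLayoutSketch redis = new RedisLayoutSketch();
        // Two click events (type "1") for activity A1: the running total overwrites itself.
        redis.hset(activityKey("A1"), "1", "1");
        redis.hset(activityKey("A1"), "1", "2");
        redis.hset(dailyKey("A1", "2019-09-02"), "1", "2");
        System.out.println(redis.hget(activityKey("A1"), "1"));
        System.out.println(dailyKey("A1", "2019-09-02"));
    }
}
```

Because the sum operator emits the updated running total for every record, overwriting the hash field each time leaves the latest total in Redis.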
Constant

package cn._51doit.flink.day08;

public class Constant {
    public static final String ACTIVITY_COUNT = "ACTIVITY_COUNT";
    public static final String DAILY_ACTIVITY_COUNT = "DAILY_ACTIVITY_COUNT";
    public static final String PROVINCE_DAILY_ACTIVITY_COUNT = "PROVINCE_DAILY_ACTIVITY_COUNT";
}
ActivityBean

package cn._51doit.flink.day08;

// POJO with public fields and a no-argument constructor, so Flink can key and sum by field name
public class ActivityBean {
    public String uid;
    public String aid;
    public String date;
    public String type;
    public String province;
    public Long count = 1L; // each record counts as one event

    public ActivityBean() {}

    public ActivityBean(String uid, String aid, String date, String type, String province) {
        this.uid = uid;
        this.aid = aid;
        this.date = date;
        this.type = type;
        this.province = province;
    }

    public static ActivityBean of(String uid, String aid, String date, String type, String province) {
        return new ActivityBean(uid, aid, date, type, province);
    }

    @Override
    public String toString() {
        return "ActivityBean{" +
                "uid='" + uid + '\'' +
                ", aid='" + aid + '\'' +
                ", date='" + date + '\'' +
                ", type='" + type + '\'' +
                ", province='" + province + '\'' +
                ", count=" + count +
                '}';
    }
}
MyRedisSink

package cn._51doit.flink.day08;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

public class MyRedisSink extends RichSinkFunction<Tuple3<String, String, String>> {

    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Read the Redis settings from the global job parameters
        ParameterTool params = (ParameterTool) getRuntimeContext()
                .getExecutionConfig()
                .getGlobalJobParameters();
        String host = params.getRequired("redis.host");
        String password = params.getRequired("redis.password");
        int port = params.getInt("redis.port", 6379);
        int db = params.getInt("redis.db", 0);
        Jedis jedis = new Jedis(host, port);
        jedis.auth(password);
        jedis.select(db);
        this.jedis = jedis;
    }

    @Override
    public void invoke(Tuple3<String, String, String> value, Context context) throws Exception {
        // Reconnect if the connection has been dropped
        if (!jedis.isConnected()) {
            jedis.connect();
        }
        // f0 = Redis key, f1 = hash field, f2 = value
        jedis.hset(value.f0, value.f1, value.f2);
    }

    @Override
    public void close() throws Exception {
        // open() may have failed before the connection was created
        if (jedis != null) {
            jedis.close();
        }
    }
}
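The sink expects four settings in the properties file passed as `args[0]`: `redis.host` and `redis.password` are required, while `redis.port` and `redis.db` fall back to 6379 and 0. The sketch below reproduces that lookup logic with `java.util.Properties` instead of Flink's ParameterTool, purely to illustrate the expected file contents. The class `RedisConfigSketch` and the sample values (`localhost`, `secret`) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch of the settings MyRedisSink looks up, using java.util.Properties (not ParameterTool).
final class RedisConfigSketch {
    final String host;
    final String password;
    final int port;
    final int db;

    private RedisConfigSketch(String host, String password, int port, int db) {
        this.host = host;
        this.password = password;
        this.port = port;
        this.db = db;
    }

    // Parses properties-file text; required keys must be present, optional keys have defaults.
    static RedisConfigSketch parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new RedisConfigSketch(
                required(props, "redis.host"),
                required(props, "redis.password"),
                Integer.parseInt(props.getProperty("redis.port", "6379")),
                Integer.parseInt(props.getProperty("redis.db", "0")));
    }

    private static String required(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("Missing required key: " + key);
        }
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical file contents; host and password are placeholders.
        String file = "redis.host=localhost\n"
                + "redis.password=secret\n"
                + "redis.db=1\n";
        RedisConfigSketch config = parse(file);
        System.out.println(config.host + ":" + config.port + " db=" + config.db);
    }
}
```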