Counting the current day's UV and PV with Flink


Test environment:

    Flink 1.7.2

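1. Data flow

    a. Generate mock data and send it to Kafka (JSON format)

    b. Flink reads the data and counts it

    c. Write the result to Kafka (a copy is also printed to the console for easy inspection)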

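2. Mock data generator

    The data format is: {"id" : 1, "createTime" : "2019-05-24 10:36:43.707"}

    id is the record's sequence number (incremented for each record), and createTime is the record's timestamp (by default, the time the record was generated).

The generator code is as follows: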

import java.text.SimpleDateFormat
import java.util.{Calendar, Date}

import com.venn.common.Common
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
// assumed: the printed format matches scala.util.parsing.json.JSONObject
import scala.util.parsing.json.JSONObject

/**
  * test data maker
  */
object CurrentDayMaker {

  val calendar: Calendar = Calendar.getInstance()
  val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")

  /**
    * A real day is too long to watch, so the timestamps start at the current
    * time and advance 10 minutes per record. One day then takes only 144
    * iterations, i.e. 144 seconds.
    *
    * @return the next simulated creation time
    */
  def getCreateTime(): String = {
    calendar.add(Calendar.MINUTE, 10)
    sdf.format(calendar.getTime)
  }

  def main(args: Array[String]): Unit = {
    val producer = new KafkaProducer[String, String](Common.getProp)
    // start the simulated clock at the current time
    calendar.setTime(new Date())
    println(sdf.format(calendar.getTime))
    var i = 0
    while (true) {
      // val map = Map("id" -> i, "createTime" -> sdf.format(System.currentTimeMillis()))
      val map = Map("id" -> i, "createTime" -> getCreateTime())
      val jsonObject: JSONObject = new JSONObject(map)
      println(jsonObject.toString())
      // topic: current_day
      val msg = new ProducerRecord[String, String]("current_day", jsonObject.toString())
      producer.send(msg)
      producer.flush()
      // throttle to one record per second
      Thread.sleep(1000)
      i = i + 1
    }
  }
}

The generated data looks like this:

{"id" : 0, "createTime" : "2019-05-24 18:02:26.292"}

{"id" : 1, "createTime" : "2019-05-24 18:12:26.292"}

{"id" : 2, "createTime" : "2019-05-24 18:22:26.292"}

{"id" : 3, "createTime" : "2019-05-24 18:32:26.292"}

{"id" : 4, "createTime" : "2019-05-24 18:42:26.292"}

3. Flink program

package com.venn.stream.api.dayWindow

import java.io.File
import java.text.SimpleDateFormat

import com.venn.common.Common
// presumably the author's patched copy of TumblingEventTimeWindows (see section 6), not Flink's own class
import com.venn.source.TumblingEventTimeWindows
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.formats.json.JsonNodeDeserializationSchema
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

/**
  * Created by venn on 19-5-23.
  *
  * Uses TumblingEventTimeWindows to count the current day's PV.
  * For testing, the day window can be shortened to a minute window:
  *
  * .windowAll(TumblingEventTimeWindows.of(Time.minutes(1), Time.seconds(0)))
  *
  * TumblingEventTimeWindows counts the events of exactly one minute,
  * and each window starts at second 0 (e.g. 00:00:00 to 00:00:59).
  */
object CurrentDayPvCount {

  def main(args: Array[String]): Unit = {
    // scratch calculation left over from experimenting with the window-start formula
    println(1558886400000L - (1558886400000L - 8 + 86400000) % 86400000)

    // environment
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.setParallelism(1)

    // note: File.pathSeparator is ";" on Windows and ":" elsewhere, so this condition
    // is never true and checkpointing stays disabled; File.separator is the constant
    // that equals "\" on Windows, if detecting Windows was the intent
    if ("\\".equals(File.pathSeparator)) {
      val rock = new RocksDBStateBackend(Common.CHECK_POINT_DATA_DIR)
      env.setStateBackend(rock)
      // checkpoint interval
      env.enableCheckpointing(10000)
    }

    val topic = "current_day"
    val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
    val kafkaSource = new FlinkKafkaConsumer[ObjectNode](topic, new JsonNodeDeserializationSchema(), Common.getProp)
    val sink = new FlinkKafkaProducer[String](topic + "_out", new SimpleStringSchema(), Common.getProp)
    sink.setWriteTimestampToKafka(true)

    val stream = env.addSource(kafkaSource)
      .map(node => {
        Event(node.get("id").asText(), node.get("createTime").asText())
      })
      // .assignAscendingTimestamps(event => sdf.parse(event.createTime).getTime)
      .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[Event](Time.seconds(60)) {
        override def extractTimestamp(element: Event): Long = {
          sdf.parse(element.createTime).getTime
        }
      })
      // one-minute window, starting at second 0
      // .windowAll(TumblingEventTimeWindows.of(Time.minutes(1), Time.seconds(0)))
      // one-hour window, starting at second 0. Note: event-time windows are driven by
      // incoming events; there may be no data when a window ends, and by the time data
      // arrives it already belongs to the next window
      // .windowAll(TumblingEventTimeWindows.of(Time.hours(1), Time.seconds(0)))
      // one-day window aligned to UTC midnight
      // .windowAll(TumblingEventTimeWindows.of(Time.days(1)))
      // one-day window starting at 00:00 in UTC+08:00. Known bug (FLINK-11326):
      // Flink 1.7.2 rejects a negative offset; fixed in 1.8
      .windowAll(TumblingEventTimeWindows.of(Time.days(1), Time.hours(-8)))
      // fire every 3800 seconds of event time
      // .trigger(ContinuousEventTimeTrigger.of(Time.seconds(3800)))
      // fire every 10 seconds of processing time
      // .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(10)))
      // fire on every event, emitting the current value
      // .trigger(CountTrigger.of(1))
      .reduce(new ReduceFunction[Event] {
        override def reduce(event1: Event, event2: Event): Event = {
          // keep the id of the window's first record in id and the id of the
          // latest record in createTime, and add up the counts
          new Event(event1.id, event2.id, event1.count + event2.count)
        }
      })
    // format the output event: join min and max id, add the current timestamp
    // .map(event => Event(event.id + "-" + event.createTime, sdf.format(System.currentTimeMillis()), event.count))

    stream.print("result : ")
    // the original listing created the sink but never attached it; this line
    // writes the result to Kafka as described in step 1c
    stream.map(_.toString).addSink(sink)

    // execute job
    env.execute("CurrentDayCount")
  }
}

case class Event(id: String, createTime: String, count: Int = 1) {}
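The comment in the program about event-time windows needing events to fire can be illustrated with a small simulation. The sketch below is a simplified model, not Flink code: it assumes the watermark is the largest timestamp seen so far minus the 60-second bound used above, and that a window fires once the watermark reaches the window end minus 1 ms. In Flink the watermark is emitted periodically, so the real firing happens at the next periodic watermark after that record. The class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class WatermarkDemo {
    // Returns the index of the first event that pushes the watermark up to the
    // window's last timestamp (windowEnd - 1), or -1 if no event does.
    static int firstFiringEvent(long[] eventTimes, long maxOutOfOrderness, long windowEnd) {
        long currentMax = Long.MIN_VALUE;
        for (int i = 0; i < eventTimes.length; i++) {
            currentMax = Math.max(currentMax, eventTimes[i]);
            long watermark = currentMax - maxOutOfOrderness;
            if (watermark >= windowEnd - 1) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ZoneOffset zone = ZoneOffset.ofHours(8);
        // Same shape as the test data: one record every 10 minutes.
        LocalDateTime first = LocalDateTime.of(2019, 5, 24, 20, 29, 49, 102_000_000);
        long[] eventTimes = new long[30];
        for (int id = 0; id < eventTimes.length; id++) {
            eventTimes[id] = first.plusMinutes(10L * id).toInstant(zone).toEpochMilli();
        }
        // End of the day window for 2019-05-24 in UTC+08:00.
        long windowEnd = LocalDateTime.of(2019, 5, 25, 0, 0).toInstant(zone).toEpochMilli();
        System.out.println("window fires on id " + firstFiringEvent(eventTimes, 60_000L, windowEnd));
    }
}
```

Under these assumptions the window for the 24th is only emitted once a record from the 25th arrives. If the stream stopped at 23:59:49, the result for that day would stay pending.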

4. Results

The test data is as follows:

{"id" : 0, "createTime" : "2019-05-24 20:29:49.102"}
{"id" : 1, "createTime" : "2019-05-24 20:39:49.102"}
...
{"id" : 20, "createTime" : "2019-05-24 23:49:49.102"}
{"id" : 21, "createTime" : "2019-05-24 23:59:49.102"}
{"id" : 22, "createTime" : "2019-05-25 00:09:49.102"}
{"id" : 23, "createTime" : "2019-05-25 00:19:49.102"}
...
{"id" : 163, "createTime" : "2019-05-25 23:39:49.102"}
{"id" : 164, "createTime" : "2019-05-25 23:49:49.102"}
{"id" : 165, "createTime" : "2019-05-25 23:59:49.102"}
{"id" : 166, "createTime" : "2019-05-26 00:09:49.102"}
...
{"id" : 308, "createTime" : "2019-05-26 23:49:49.102"}
{"id" : 309, "createTime" : "2019-05-26 23:59:49.102"}
{"id" : 310, "createTime" : "2019-05-27 00:09:49.102"}

ids 0 - 21 fall on the 24th

ids 22 - 165 fall on the 25th

ids 166 - 309 fall on the 26th
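The id ranges above can be checked without running Flink. The following sketch is only an illustration: it replays the 10-minute test timestamps, assigns each one to a one-day window with an offset of -8 hours using the same arithmetic as Flink's getWindowStartWithOffset, and records the first id, last id and count per day. The class name, the bucket helper and the record count of 311 are assumptions made for this example.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.TreeMap;

class DayBucketDemo {
    static final ZoneOffset ZONE = ZoneOffset.ofHours(8);

    // Same arithmetic as TimeWindow.getWindowStartWithOffset.
    static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    // Maps each local day (UTC+08:00) to {first id, last id, count}.
    static Map<String, int[]> bucket(LocalDateTime first, int records) {
        long day = 24L * 60 * 60 * 1000;
        long offset = -8L * 60 * 60 * 1000;
        Map<String, int[]> result = new TreeMap<>();
        for (int id = 0; id < records; id++) {
            long ts = first.plusMinutes(10L * id).toInstant(ZONE).toEpochMilli();
            long start = getWindowStartWithOffset(ts, offset, day);
            String localDay = LocalDateTime.ofEpochSecond(start / 1000, 0, ZONE).toLocalDate().toString();
            int[] agg = result.get(localDay);
            if (agg == null) {
                agg = new int[] {id, id, 0};
                result.put(localDay, agg);
            }
            agg[1] = id; // latest id seen in this window
            agg[2]++;    // running count, like the reduce function's count field
        }
        return result;
    }

    public static void main(String[] args) {
        LocalDateTime first = LocalDateTime.of(2019, 5, 24, 20, 29, 49, 102_000_000);
        bucket(first, 311).forEach((d, a) ->
                System.out.println(d + ": ids " + a[0] + "-" + a[1] + ", count " + a[2]));
    }
}
```

With the first record at 2019-05-24 20:29:49.102, this grouping yields 22 records for the 24th and 144 records for each full day, which agrees with the ranges listed above.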

In the output, the reduce method puts the ids of the first and last records of each window into the Event, and the results match the test data above.

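5. Notes

Many people mistakenly assume that a window starts at the time the program starts (is initialized). In fact, a window assigner (taking TumblingEventTimeWindows as the example) is created through one of two overloaded of methods: one takes only the window size, and the other takes the window size and the window offset (which defaults to 0).

Source of org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows (this copy includes the relaxed offset check and a few debug print statements):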

@PublicEvolving
public class TumblingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
    private static final long serialVersionUID = 1L;

    private final long size;

    private final long offset;

    protected TumblingEventTimeWindows(long size, long offset) {
        if (Math.abs(offset) >= size) {
            throw new IllegalArgumentException("TumblingEventTimeWindows parameters must satisfy abs(offset) < size");
        }
        this.size = size;
        this.offset = offset;
    }

    @Override
    public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
        if (timestamp > Long.MIN_VALUE) {
            // Long.MIN_VALUE is currently assigned when no timestamp is present
            long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
            System.out.println("start : " + start + ", end : " + (start+size));
            String startStr =new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(start);
            String endStar =new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(start + size);
            System.out.println("window start: " + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(start));
            System.out.println("window end: " + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(start + size));
            return Collections.singletonList(new TimeWindow(start, start + size));
        } else {
            throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). " +
                    "Is the time characteristic set to 'ProcessingTime', or did you forget to call " +
                    "'DataStream.assignTimestampsAndWatermarks(...)'?");
        }
    }

    /**
     * Creates a new {@code TumblingEventTimeWindows} {@link WindowAssigner} that assigns
     * elements to time windows based on the element timestamp.
     *
     * @param size The size of the generated windows.
     * @return The time policy.
     */
    public static TumblingEventTimeWindows of(Time size) {
        return new TumblingEventTimeWindows(size.toMilliseconds(), 0);
    }

    /**
     * Creates a new {@code TumblingEventTimeWindows} {@link WindowAssigner} that assigns
     * elements to time windows based on the element timestamp and offset.
     *
     * <p>For example, if you want window a stream by hour,but window begins at the 15th minutes
     * of each hour, you can use {@code of(Time.hours(1),Time.minutes(15))},then you will get
     * time windows start at 0:15:00,1:15:00,2:15:00,etc.
     *
     * <p>Rather than that,if you are living in somewhere which is not using UTC±00:00 time,
     * such as China which is using UTC+08:00,and you want a time window with size of one day,
     * and window begins at every 00:00:00 of local time,you may use {@code of(Time.days(1),Time.hours(-8))}.
     * The parameter of offset is {@code Time.hours(-8))} since UTC+08:00 is 8 hours earlier than UTC time.
     *
     * @param size The size of the generated windows.
     * @param offset The offset which window start would be shifted by.
     * @return The time policy.
     */
    public static TumblingEventTimeWindows of(Time size, Time offset) {
        return new TumblingEventTimeWindows(size.toMilliseconds(), offset.toMilliseconds());
    }
}

The assignWindows method is called for every record.

The window start is computed by this function:

public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}
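As a worked example of this formula, the sketch below applies it to the first test timestamp (2019-05-24 20:29:49.102 at UTC+08:00) with a one-day window, once with offset 0 and once with an offset of -8 hours. The class name and the toBeijing helper are made up for this illustration; only the arithmetic is taken from the function above.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class WindowStartDemo {
    // Same arithmetic as TimeWindow.getWindowStartWithOffset quoted above.
    static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    // Renders epoch milliseconds as UTC+08:00 wall-clock time (zone chosen for this example).
    static String toBeijing(long epochMillis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.ofHours(8)).toString();
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long minus8h = -8L * 60 * 60 * 1000;
        long ts = LocalDateTime.of(2019, 5, 24, 20, 29, 49, 102_000_000)
                .toInstant(ZoneOffset.ofHours(8)).toEpochMilli();

        // offset 0: windows are aligned to UTC midnight, which is 08:00 in UTC+08:00
        System.out.println("offset 0   -> " + toBeijing(getWindowStartWithOffset(ts, 0L, day)));
        // offset -8h: windows are aligned to local midnight in UTC+08:00
        System.out.println("offset -8h -> " + toBeijing(getWindowStartWithOffset(ts, minus8h, day)));
    }
}
```

The window start depends only on the record's timestamp, the offset and the window size, never on when the job was started.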


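6. Special note

    In Flink 1.6.3/1.7.1/1.7.2 the TumblingEventTimeWindows constructor has a bug: it rejects an offset less than 0, even though the Javadoc of the of method says you can use of(Time.days(1),Time.hours(-8)) for a one-day window that starts at 00:00 China time.

JIRA: FLINK-11326, marked as fixed in 1.8.0. (I was about to file the bug myself, but someone got there first.)

As a workaround, create your own class with the same package and class name and change the offending check.

Flink 1.7.2 source: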

protected TumblingEventTimeWindows(long size, long offset) {
    if (offset < 0 || offset >= size) {
        throw new IllegalArgumentException("TumblingEventTimeWindows parameters must satisfy 0 <= offset < size");
    }
    this.size = size;
    this.offset = offset;
}
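To make the difference between the two constructor checks concrete, here is a small stand-alone comparison. It only restates the two conditions quoted in this post as boolean helpers; the class and method names are hypothetical and nothing here calls Flink.

```java
class OffsetCheckDemo {
    // Condition from the Flink 1.7.2 constructor: 0 <= offset < size.
    static boolean acceptedBy172(long size, long offset) {
        return !(offset < 0 || offset >= size);
    }

    // Condition from the patched constructor shown in section 5: abs(offset) < size.
    static boolean acceptedByPatched(long size, long offset) {
        return !(Math.abs(offset) >= size);
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long minus8h = -8L * 60 * 60 * 1000;
        System.out.println("1.7.2 accepts -8h:   " + acceptedBy172(day, minus8h));
        System.out.println("patched accepts -8h: " + acceptedByPatched(day, minus8h));
    }
}
```

Both checks still reject an offset whose magnitude reaches the window size; the patched one merely allows negative values inside that range.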
