https://calcite.apache.org/docs/stream.html

 

Calcite’s SQL is an extension to standard SQL, not another ‘SQL-like’ language. The distinction is important, for several reasons:

  • Streaming SQL is easy to learn for anyone who knows regular SQL.
  • The semantics are clear, because we aim to produce the same results on a stream as if the same data were in a table.
  • You can write queries that combine streams and tables (or the history of a stream, which is basically an in-memory table).
  • Lots of existing tools can generate standard SQL.

If you don’t use the STREAM keyword, you are back in regular standard SQL.

It is only an extension of standard SQL; streaming SQL merely adds the STREAM keyword.

 

An example schema

Our streaming SQL examples use the following schema:

  • Orders (rowtime, productId, orderId, units) - a stream and a table
  • Products (rowtime, productId, name) - a table
  • Shipments (rowtime, orderId) - a stream

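The examples use a simple model of orders, products, and shipments.

As the schema shows, streams and static tables can be handled together. Orders is both a stream and a table: real-time data is in the stream, while historical data is in the table.

Adding STREAM queries the Orders stream, and the result is unbounded; without STREAM the query runs against the static table, and the result is bounded.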
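As a minimal sketch of that difference, using the Orders schema above (the two statements are illustrative and not tied to a particular Calcite version):

```sql
-- Streaming query: reads incoming Orders rows and never terminates (unbounded).
SELECT STREAM *
FROM Orders;

-- Regular query: reads the Orders rows stored so far and then finishes (bounded).
SELECT *
FROM Orders;
```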

 

Window support

  • tumbling window (GROUP BY)
  • hopping window (multi GROUP BY)
  • sliding window (window functions)
  • cascading window (window functions)

 

Tumbling windows

In SQL, a window is simply a grouping on time.

For example, a query can group rows by hour to form one-hour windows.
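A sketch of such an hourly window, assuming the Orders stream from the example schema; the output aliases c and units are chosen here for illustration:

```sql
-- One output row per product per hour.
-- CEIL(rowtime TO HOUR) rounds each timestamp up to the next hour boundary,
-- so every row in the same hour lands in the same group.
SELECT STREAM CEIL(rowtime TO HOUR) AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY CEIL(rowtime TO HOUR), productId;
```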

How did Calcite know that the 10:00:00 sub-totals were complete at 11:00:00, so that it could emit them? It knows that rowtime is increasing, and it knows that CEIL(rowtime TO HOUR) is also increasing. So, once it has seen a row at or after 11:00:00, it will never see a row that will contribute to a 10:00:00 total.

A column or expression that is increasing or decreasing is said to be monotonic.

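One question is how to know that all data before 11:00 has arrived. This relies on rowtime being monotonically increasing.

So when using GROUP BY on a stream, at least one grouping column or expression must be monotonic.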

If a column or expression has values that are slightly out of order, and the stream has a mechanism (such as punctuation or watermarks) to declare that a particular value will never be seen again, then the column or expression is said to be quasi-monotonic.

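Of course, rowtime may not be strictly monotonic, so a watermark can be used to bound the out-of-order range; at that granularity the values are monotonic. This is called quasi-monotonic.

More elegantly, the TUMBLE keyword can be used.

For example, TUMBLE can define a 30-minute window that does not start on the hour but at a 12-minute offset, the alignment time.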
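The TUMBLE form of that 30-minute window with a 12-minute offset might look like the following. It follows the syntax on the Calcite streaming page linked at the top; the three-argument form with an alignment time is an assumption here and may not be available in every Calcite version.

```sql
-- 30-minute tumbling windows aligned to 12 minutes past the hour
-- (windows end at 0:12, 0:42, 1:12, ...).
SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '30' MINUTE, TIME '0:12') AS rowtime,
  productId,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '30' MINUTE, TIME '0:12'),
  productId;
```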

 

Hopping windows

Hopping windows are a generalization of tumbling windows that allow data to be kept in a window for longer than the emit interval.

In effect, this is a sliding window.

For example, the window can slide by 1 hour with a window size of 3 hours.
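A sketch of that 1-hour slide over a 3-hour window, using the HOP syntax from the Calcite streaming page; treat the exact function signatures as an assumption for your Calcite version:

```sql
-- Emit once per hour; each emitted row aggregates the 3 hours ending at that time.
SELECT STREAM
  HOP_END(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR) AS rowtime,
  COUNT(*) AS c,
  SUM(units) AS units
FROM Orders
GROUP BY HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '3' HOUR);
```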

 

HAVING

Filtering after aggregation.
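As an illustration, a HAVING clause can filter the hourly groups after they are aggregated. The thresholds (more than 2 orders, more than 10 units) are arbitrary values chosen for this sketch:

```sql
-- Keep only the (hour, product) groups that pass the aggregate filter.
SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId
HAVING COUNT(*) > 2 OR SUM(units) > 10;
```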

 

Sliding windows

Sliding windows that are not based on GROUP BY:

Standard SQL features so-called “analytic functions” that can be used in the SELECT clause.
Unlike GROUP BY, these do not collapse records. For each record that goes in, one record comes out. But the aggregate function is based on a window of many rows.

SELECT STREAM rowtime,
  productId,
  units,
  SUM(units) OVER (ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) unitsLastHour
FROM Orders;

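The difference from GROUP BY is when the window fires.

With GROUP BY, the window fires on the hour boundary and collapses the records in the window into a single value.

With OVER, computation is record by record: every incoming record triggers a calculation. In the query above, each record triggers a SUM over the preceding hour.

A more complex case:

First declare a window named product, with ORDER BY rowtime and PARTITION BY productId.

Then, based on product, use OVER to compute AVG(units) over 7 days and over 10 minutes.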
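A sketch of that named-window query follows. The aliases m10 and d7 are made up for this example, and the frame is written in the standard `OVER (name frame)` form; whether a given Calcite version accepts this refinement on a stream is an assumption.

```sql
-- Declare the window once, then reuse it with two different time ranges.
SELECT STREAM rowtime,
  productId,
  units,
  AVG(units) OVER (product RANGE INTERVAL '10' MINUTE PRECEDING) AS m10,
  AVG(units) OVER (product RANGE INTERVAL '7' DAY PRECEDING) AS d7
FROM Orders
WINDOW product AS (
  PARTITION BY productId
  ORDER BY rowtime);
```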

 

 

Sub-queries

The previous HAVING query can be expressed using a WHERE clause on a sub-query:

HAVING can also be implemented as a sub-query.
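A sketch of a HAVING-style filter rewritten as a WHERE clause on a sub-query. The aliases c and su and the thresholds are illustrative:

```sql
-- The inner query aggregates per hour and product;
-- the outer query filters on the aggregated columns.
SELECT STREAM rowtime, productId
FROM (
  SELECT TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
    productId,
    COUNT(*) AS c,
    SUM(units) AS su
  FROM Orders
  GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId)
WHERE c > 2 OR su > 10;
```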

 

Since then, SQL has become a mathematically closed language, which means that any operation you can perform on a table can also be performed on a query.

The closure property of SQL is extremely powerful. Not only does it render HAVING obsolete (or, at least, reduce it to syntactic sugar), it makes views possible:

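SQL has the closure property: any operation that can be performed on a table can also be performed on a query, because the result of a query is itself a relation.

So the sub-query can also be defined with CREATE VIEW.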
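For instance, the hourly aggregation could be wrapped in a view and then queried like a table. The view name HourlyOrderTotals and its columns c and su are names assumed for this sketch:

```sql
-- Define the aggregation once as a view.
CREATE VIEW HourlyOrderTotals (rowtime, productId, c, su) AS
  SELECT TUMBLE_END(rowtime, INTERVAL '1' HOUR),
    productId,
    COUNT(*),
    SUM(units)
  FROM Orders
  GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;

-- Query the view as a stream.
SELECT STREAM rowtime, productId
FROM HourlyOrderTotals
WHERE c > 2 OR su > 10;
```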

Many people find that nested queries and views are even more useful on streams than they are on relations.

Streaming queries are pipelines of operators all running continuously, and often those pipelines get quite long. Nested queries and views help to express and manage those pipelines.

Nested queries are very useful for streaming, because a stream is essentially a pipeline of operators; expressing it as nested queries or views is convenient.

 

And, by the way, a WITH clause can accomplish the same as a sub-query or a view:

The WITH keyword can be used to implement a sub-query or a view.
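A sketch of the same filter expressed with WITH, again using the assumed name HourlyOrderTotals and illustrative thresholds:

```sql
-- The WITH clause names the aggregation for the duration of one statement.
WITH HourlyOrderTotals (rowtime, productId, c, su) AS (
  SELECT TUMBLE_END(rowtime, INTERVAL '1' HOUR),
    productId,
    COUNT(*),
    SUM(units)
  FROM Orders
  GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId)
SELECT STREAM rowtime, productId
FROM HourlyOrderTotals
WHERE c > 2 OR su > 10;
```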

 

Sorting

 

Joining streams to tables

A stream-to-table join is straightforward if the contents of the table are not changing.
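A minimal sketch of such a join, enriching each order with the product name from the Products table in the example schema:

```sql
-- Each incoming order is matched against the current contents of Products.
SELECT STREAM o.rowtime, o.productId, o.orderId, o.units, p.name
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId;
```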

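This is straightforward, but there is a catch: the table can change. When a record arrives on the stream, it should be joined against the table as it was when the record occurred; if the table has changed since then, only the latest value is available.

To solve this, create a versioned table for the static table that keeps the version in effect at each point in time.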

One way to implement this is to have a table that keeps every version with a start and end effective date, ProductVersions in the following example:

The query then looks up, in ProductVersions, the version whose effective range contains the record's rowtime.
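A sketch of that versioned join. The columns startDate, endDate, and name on ProductVersions are assumptions made for this example:

```sql
-- Pick the product version that was in effect when the order occurred.
SELECT STREAM o.rowtime, o.productId, o.orderId, o.units, p.name
FROM Orders AS o
JOIN ProductVersions AS p
  ON o.productId = p.productId
  AND o.rowtime BETWEEN p.startDate AND p.endDate;
```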

 

Joining streams to streams

 

DML

It’s not only queries that make sense against streams; it also makes sense to run DML statements (INSERT, UPDATE, DELETE, and also their rarer cousins UPSERT and REPLACE) against streams.

 

DML is useful because it allows you to materialize streams or tables based on streams, and therefore save effort when values are used often.
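As a hedged illustration, a frequently used filtered stream could be materialized with INSERT. The target stream LargeOrders is hypothetical and assumed to exist already with matching columns, and the threshold of 1000 units is arbitrary:

```sql
-- Continuously copy large orders into a separate stream for reuse.
INSERT INTO LargeOrders
SELECT STREAM rowtime, productId, orderId, units
FROM Orders
WHERE units > 1000;
```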
