Where did we come from?

With the 0.9.0-milestone1 release, Apache Flink added the Table API, an API to process relational data with SQL-like expressions. The central concept of this API is a Table, a structured data set or stream on which relational operations can be applied. The Table API is tightly integrated with the DataSet and DataStream API. A Table can easily be created from a DataSet or DataStream and can also be converted back into a DataSet or DataStream, as the following example shows:

Starting with 0.9, the Table API was introduced to support relational operations.

val execEnv = ExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(execEnv)

// obtain a DataSet from somewhere
val tempData: DataSet[(String, Long, Double)] = ...

// convert the DataSet to a Table
val tempTable: Table = tempData.toTable(tableEnv, 'location, 'time, 'tempF)

// compute your result
val avgTempCTable: Table = tempTable
  .where('location.like("room%"))
  .select(
    ('time / (3600 * 24)) as 'day,
    'location as 'room,
    (('tempF - 32) * 0.556) as 'tempC
  )
  .groupBy('day, 'room)
  .select('day, 'room, 'tempC.avg as 'avgTempC)

// convert result Table back into a DataSet and print it
avgTempCTable.toDataSet[Row].print()

As you can see, a DataSet can be converted into a Table very simply, just by specifying its field metadata.

All kinds of relational operations can then be applied to the Table,

and finally the Table can be converted back into a DataSet.

Although the example shows Scala code, there is also an equivalent Java version of the Table API. The following picture depicts the original architecture of the Table API.

The relational operations on a Table are ultimately translated, via code generation, into DataSet logic.
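
To make that concrete, here is a rough hand-written DataSet equivalent of the Table query above (assuming the tempData DataSet from that example). This is only an illustrative sketch of what the translation boils down to, not the code Flink actually generates:

import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

// illustrative only: the Table query above, written by hand as a
// plain DataSet program (not the actual generated code)
val avgTempC: DataSet[(Long, String, Double)] = tempData
  .filter(_._1.startsWith("room"))                            // 'location.like("room%")
  .map(t => (t._2 / (3600 * 24), t._1, (t._3 - 32) * 0.556))  // day, room, tempC
  .groupBy(0, 1)                                              // groupBy('day, 'room)
  .reduceGroup {
    (rows: Iterator[(Long, String, Double)],
     out: Collector[(Long, String, Double)]) =>
      val buf = rows.toVector
      out.collect((buf.head._1, buf.head._2,
        buf.map(_._3).sum / buf.size))                        // 'tempC.avg
  }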


Table API joining forces with SQL

The community was also well aware of the multitude of dedicated “SQL-on-Hadoop” solutions in the open source landscape (Apache Hive, Apache Drill, Apache Impala, Apache Tajo, just to name a few).

Given these alternatives, we figured that time would be better spent improving Flink in other ways than implementing yet another SQL-on-Hadoop solution.

What we came up with was a revised architecture for a Table API that supports SQL (and Table API) queries on streaming and static data sources.

We did not want to reinvent the wheel and decided to build the new Table API on top of Apache Calcite, a popular SQL parser and optimizer framework. Apache Calcite is used by many projects including Apache Hive, Apache Drill, Cascading, and many more. Moreover, the Calcite community put SQL on streams on their roadmap which makes it a perfect fit for Flink’s SQL interface.

Although the open source community already offers plenty of SQL-on-Hadoop solutions, Flink would rather spend its time on more valuable improvements than implement yet another one.

However, the demand for SQL is very strong, so SQL support is being implemented on top of a revised Table API.
The SQL support builds on Calcite, and since Calcite already has SQL on streams on its roadmap, it stands a good chance of becoming the standard for streaming SQL.

Calcite is central in the new design as the following architecture sketch shows:

The new architecture features two integrated APIs to specify relational queries, the Table API and SQL.

Queries of both APIs are validated against a catalog of registered tables and converted into Calcite’s representation for logical plans.

In this representation, stream and batch queries look exactly the same.

Next, Calcite’s cost-based optimizer applies transformation rules and optimizes the logical plans.

Depending on the nature of the sources (streaming or static) we use different rule sets.

Finally, the optimized plan is translated into a regular Flink DataStream or DataSet program. This step involves again code generation to compile relational expressions into Flink functions.
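
As a rough illustration of that last step, a compiled WHERE predicate behaves like an ordinary Flink function. The following is a hand-written sketch of that behavior only; the code Flink actually generates is compiled Java, not this Scala class:

import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.api.table.Row

// sketch of what a generated filter for WHERE location LIKE 'room%'
// effectively does (not the actual generated code)
class RoomFilter extends FilterFunction[Row] {
  override def filter(row: Row): Boolean = {
    val location = row.productElement(0).asInstanceOf[String]
    location != null && location.startsWith("room")
  }
}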

Here, both Table API and SQL queries are uniformly converted into Calcite logical plans, which are then optimized by the Calcite optimizer and finally turned into Flink functions via code generation.
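
A small sketch of that unification, with the batch table from the first example registered under a made-up name: the same query written against either API is validated against the catalog and arrives at one and the same Calcite logical plan.

// register the Table from the first example ("tempSensors" is a
// hypothetical name)
tableEnv.registerTable("tempSensors", tempTable)

// the same filter via the Table API ...
val viaTableApi: Table = tempTable.where('location.like("room%"))

// ... and via SQL; both end up as roughly the same logical plan:
//   LogicalFilter(condition=[LIKE($0, 'room%')])
//     LogicalTableScan(table=[[tempSensors]])
val viaSql: Table = tableEnv.sql(
  "SELECT * FROM tempSensors WHERE location LIKE 'room%'")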

With this effort, we are adding SQL support for both streaming and static data to Flink.

However, we do not want to see this as a competing solution to dedicated, high-performance SQL-on-Hadoop solutions, such as Impala, Drill, and Hive.

Instead, we see the sweet spot of Flink’s SQL integration primarily in providing access to streaming analytics to a wider audience.

In addition, it will facilitate integrated applications that use Flink’s APIs as well as SQL while being executed on a single runtime engine.

To reiterate: SQL support is not meant to build yet another dedicated SQL-on-Hadoop solution; it is meant to make Flink accessible to more people. Frankly speaking, this is not the current strategic focus.


What will Flink’s SQL on streams look like?

So far we discussed the motivation for and architecture of Flink’s stream SQL interface, but what will it actually look like?

// get environments
val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(execEnv)

// configure Kafka connection
val kafkaProps = ...

// define a JSON encoded Kafka topic as external table
val sensorSource = new KafkaJsonSource[(String, Long, Double)](
  "sensorTopic",
  kafkaProps,
  ("location", "time", "tempF"))

// register external table
tableEnv.registerTableSource("sensorData", sensorSource)

// define query in external table
val roomSensors: Table = tableEnv.sql(
  "SELECT STREAM time, location AS room, (tempF - 32) * 0.556 AS tempC " +
  "FROM sensorData " +
  "WHERE location LIKE 'room%'"
)

// define a JSON encoded Kafka topic as external sink
val roomSensorSink = new KafkaJsonSink(...)

// define sink for room sensor data and execute query
roomSensors.toSink(roomSensorSink)
execEnv.execute()

Compared with the Table API, the corresponding operations can here be expressed in pure SQL.
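
For comparison, the same streaming query can also be written with the Table API; a sketch, using the ingest call that appears again below:

// Table API version of the SQL query above (a sketch)
val roomSensorsViaApi: Table = tableEnv.ingest("sensorData")
  .where('location.like("room%"))
  .select('time, 'location as 'room, (('tempF - 32) * 0.556) as 'tempC)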


At the moment, Flink's SQL does not support window aggregations,

but Calcite's streaming SQL does, for example:

SELECT STREAM
  TUMBLE_END(time, INTERVAL '1' DAY) AS day,
  location AS room,
  AVG((tempF - 32) * 0.556) AS avgTempC
FROM sensorData
WHERE location LIKE 'room%'
GROUP BY TUMBLE(time, INTERVAL '1' DAY), location

The same query can, however, be implemented with the Table API:

val avgRoomTemp: Table = tableEnv.ingest("sensorData")
  .where('location.like("room%"))
  .partitionBy('location)
  .window(Tumbling every Days(1) on 'time as 'w)
  .select('w.end, 'location, (('tempF - 32) * 0.556).avg as 'avgTempCs)


What’s up next?

The Flink community is actively working on SQL support for the next minor version, Flink 1.1.0. In the first version, SQL (and Table API) queries on streams will be limited to selection, filter, and union operators. Compared to Flink 1.0.0, the revised Table API will support many more scalar functions and be able to read tables from external sources and write them back to external sinks. A lot of work went into reworking the architecture of the Table API and integrating Apache Calcite.
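
Within that initial scope, a stream SQL query would stay with selections, filters, and unions, along the lines of this sketch (the two table names are made up):

// a sketch within the initial Flink 1.1 stream SQL scope:
// selection, filter, and union ("sensorsFloor1"/"sensorsFloor2"
// are hypothetical registered tables)
val allRooms: Table = tableEnv.sql(
  "SELECT STREAM time, location, tempF FROM sensorsFloor1 " +
  "WHERE location LIKE 'room%' " +
  "UNION ALL " +
  "SELECT STREAM time, location, tempF FROM sensorsFloor2 " +
  "WHERE location LIKE 'room%'")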

In Flink 1.2.0, the feature set of SQL on streams will be significantly extended. Among other things, we plan to support different types of window aggregates and maybe also streaming joins. For this effort, we want to closely collaborate with the Apache Calcite community and help extend Calcite's support for relational operations on streaming data when necessary.

Flink 1.2 will bring window aggregates and streaming joins, which is worth looking forward to...
