[Translation] Flink Table API & SQL: SQL Client (Beta)

This article is translated from the official documentation: SQL Client Beta https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/sqlClient.html

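Flink's Table & SQL API makes it possible to work with queries written in SQL, but these queries need to be embedded within a table program written in either Java or Scala. Moreover, these programs need to be packaged with a build tool before being submitted to a cluster. This more or less limits the usage of Flink to Java/Scala programmers.

The SQL Client aims to provide an easy way of writing, debugging, and submitting table programs to a Flink cluster without a single line of Java or Scala code. The SQL Client CLI allows for retrieving and visualizing real-time results from the running distributed application on the command line.

Note: the original page shows an animated demo here; see the source page.

Attention: The SQL Client is in an early development phase. Even though the application is not production-ready yet, it can be a quite useful tool for prototyping and playing around with Flink SQL. In the future, the community plans to extend its functionality by providing a REST-based SQL Client Gateway.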

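Getting Started

This section describes how to set up and run your first Flink SQL program from the command line.

The SQL Client is bundled in the regular Flink distribution and thus runnable out of the box. It requires only a running Flink cluster where table programs can be executed. For more information about setting up a Flink cluster, see the "Cluster & Deployment" section. If you simply want to try out the SQL Client, you can also start a local cluster with one worker using the following command: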

./bin/start-cluster.sh

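Starting the SQL Client CLI

The SQL Client scripts are also located in the binary directory of Flink. In the future, a user will have two possibilities of starting the SQL Client CLI: either by starting an embedded standalone process or by connecting to a remote SQL Client Gateway. At the moment only the embedded mode is supported. You can start the CLI by calling: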

./bin/sql-client.sh embedded

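By default, the SQL Client reads its configuration from the environment file located at ./conf/sql-client-defaults.yaml. See the Configuration section for more information about the structure of environment files.

Running SQL Queries

Once the CLI has been started, you can use the HELP command to list all available SQL statements. To validate your setup and cluster connection, enter your first SQL query and press the Enter key to execute it: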

SELECT 'Hello World';

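This query requires no table source and produces a single row result. The CLI retrieves the result from the cluster and visualizes it. You can close the result view by pressing the Q key.

The CLI supports two modes for maintaining and visualizing results.

The table mode materializes results in memory and visualizes them in a regular, paginated table representation. It can be enabled by executing the following command in the CLI: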

SET execution.result-mode=table;

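The changelog mode does not materialize results; it visualizes the result stream produced by a continuous query, which consists of insertions (+) and retractions (-).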

SET execution.result-mode=changelog;

You can use the following query to see both result modes in action:

SELECT name, COUNT(*) AS cnt FROM (VALUES ('Bob'), ('Alice'), ('Greg'), ('Bob')) AS NameTable(name) GROUP BY name;

This query performs a bounded word count example.

In changelog mode, the visualized changelog should be similar to:

+ Bob, 1

+ Alice, 1

+ Greg, 1

- Bob, 1

+ Bob, 2

In table mode, the visualized result table is continuously updated until the table program ends with:

Bob, 2

Alice, 1

Greg, 1

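Both result modes can be useful during the prototyping of SQL queries. In both modes, results are stored in the Java heap memory of the SQL Client. In order to keep the CLI interface responsive, the changelog mode only shows the latest 1000 changes. The table mode allows for navigating through bigger results that are only limited by the available main memory and the configured maximum number of rows (max-table-result-rows).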
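The relationship between the two result modes can be sketched in a few lines of Python. This is only a conceptual model, not the SQL Client's implementation: the function name `materialize` and the tuple encoding of changes are assumptions made for illustration. It replays the changelog shown above and keeps the rows that remain, which is what the table mode displays at the end.

```python
from collections import Counter

def materialize(changelog):
    """Apply insertions (+) and retractions (-) in order; return the remaining rows."""
    table = Counter()
    for op, row in changelog:
        if op == "+":
            table[row] += 1
        elif op == "-":
            table[row] -= 1
            if table[row] == 0:
                del table[row]
        else:
            raise ValueError(f"unknown change type: {op!r}")
    return list(table.elements())

# The changelog produced by the word count query above.
changelog = [
    ("+", ("Bob", 1)),
    ("+", ("Alice", 1)),
    ("+", ("Greg", 1)),
    ("-", ("Bob", 1)),
    ("+", ("Bob", 2)),
]

result = materialize(changelog)
print(sorted(result))
```

The retraction of (Bob, 1) followed by the insertion of (Bob, 2) is how an updated count appears in changelog mode, while table mode only ever shows the current row.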

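Attention: Queries that are executed in a batch environment can only be retrieved using the table result mode.

After a query is defined, it can be submitted to the cluster as a long-running, detached Flink job. For this, a target system that stores the results needs to be specified using the INSERT INTO statement. The Configuration section explains how to declare table sources for reading data, how to declare table sinks for writing data, and how to configure other table program properties.

Configuration

The SQL Client can be started with the following optional CLI options. They are discussed in detail in the subsequent paragraphs.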

./bin/sql-client.sh embedded --help

Mode "embedded" submits Flink jobs from the local machine.

  Syntax: embedded [OPTIONS]

  "embedded" mode options:

     -d,--defaults <environment file>      The environment properties with which

                                           every new session is initialized.

                                           Properties might be overwritten by

                                           session properties.

     -e,--environment <environment file>   The environment properties to be

                                           imported into the session. It might

                                           overwrite default environment

                                           properties.

     -h,--help                             Show the help message with

                                           descriptions of all options.

     -j,--jar <JAR file>                   A JAR file to be imported into the

                                           session. The file might contain

                                           user-defined classes needed for the

                                           execution of statements such as

                                           functions, table sources, or sinks.

                                           Can be used multiple times.

     -l,--library <JAR directory>          A JAR file directory with which every

                                           new session is initialized. The files

                                           might contain user-defined classes

                                           needed for the execution of

                                           statements such as functions, table

                                           sources, or sinks. Can be used

                                           multiple times.

     -s,--session <session identifier>     The identifier for a session.

                                           'default' is the default identifier.

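Environment Files

A SQL query needs a configuration environment in which it is executed. The so-called environment files define available catalogs, table sources and sinks, user-defined functions, and other properties required for execution and deployment.

Every environment file is a regular YAML file. An example of such a file is presented below.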

# Define tables here such as sources, sinks, views, or temporal tables.

tables:

  - name: MyTableSource

    type: source-table

    update-mode: append

    connector:

      type: filesystem

      path: "/path/to/something.csv"

    format:

      type: csv

      fields:

        - name: MyField1

          type: INT

        - name: MyField2

          type: VARCHAR

      line-delimiter: "\n"

      comment-prefix: "#"

    schema:

      - name: MyField1

        type: INT

      - name: MyField2

        type: VARCHAR

  - name: MyCustomView

    type: view

    query: "SELECT MyField2 FROM MyTableSource"

# Define user-defined functions here.

functions:

  - name: myUDF

    from: class

    class: foo.bar.AggregateUDF

    constructor:

      - 7.6

      - false

# Define available catalogs

catalogs:

   - name: catalog_1

     type: hive

     property-version: 1

     hive-conf-dir: ...

   - name: catalog_2

     type: hive

     property-version: 1

     default-database: mydb2

     hive-conf-dir: ...

     hive-version: 1.2.1

# Properties that change the fundamental execution behavior of a table program.

execution:

  planner: old                      # optional: either 'old' (default) or 'blink'

  type: streaming                   # required: execution mode either 'batch' or 'streaming'

  result-mode: table                # required: either 'table' or 'changelog'

  max-table-result-rows: 1000000    # optional: maximum number of maintained rows in

                                    #   'table' mode (1000000 by default, smaller 1 means unlimited)

  time-characteristic: event-time   # optional: 'processing-time' or 'event-time' (default)

  parallelism: 1                    # optional: Flink's parallelism (1 by default)

  periodic-watermarks-interval: 200 # optional: interval for periodic watermarks (200 ms by default)

  max-parallelism: 16               # optional: Flink's maximum parallelism (128 by default)

  min-idle-state-retention: 0       # optional: table program's minimum idle state time

  max-idle-state-retention: 0       # optional: table program's maximum idle state time

  current-catalog: catalog_1        # optional: name of the current catalog of the session ('default_catalog' by default)

  current-database: mydb1           # optional: name of the current database of the current catalog

                                    #   (default database of the current catalog by default)

  restart-strategy:                 # optional: restart strategy

    type: fallback                  #   "fallback" to global restart strategy by default

# Configuration options for adjusting and tuning table programs.

# A full list of options and their default values can be found

# on the dedicated "Configuration" page.

configuration:

  table.optimizer.join-reorder-enabled: true

  table.exec.spill-compression.enabled: true

  table.exec.spill-compression.block-size: 128kb

# Properties that describe the cluster to which table programs are submitted to.

deployment:

  response-timeout: 5000

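This configuration:

defines an environment with a table source MyTableSource that reads from a CSV file,
defines a view MyCustomView that declares a virtual table using a SQL query,
defines a user-defined function myUDF that can be instantiated using the class name and two constructor parameters,
connects to two Hive catalogs and uses catalog_1 as the current catalog with mydb1 as the current database of that catalog,
uses the old planner in streaming mode for running statements with event-time characteristic and a parallelism of 1,
runs exploratory queries in the table result mode,
and makes some planner adjustments around join reordering and spilling via configuration options.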

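Depending on the use case, a configuration can be split into multiple files. Therefore, environment files can be created for general purposes (defaults environment file, using --defaults) as well as on a per-session basis (session environment file, using --environment). Every CLI session is initialized with the default properties followed by the session properties. For example, the defaults environment file could specify all table sources that should be available for querying in every session, whereas the session environment file only declares a specific state retention time and parallelism. Both the defaults and the session environment file can be passed when starting the CLI application. If no defaults environment file has been specified, the SQL Client searches for ./conf/sql-client-defaults.yaml in Flink's configuration directory.

Attention: Properties that have been set within a CLI session (e.g., using the SET command) have the highest precedence: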

CLI commands > session environment file > defaults environment file
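The precedence chain can be modeled as successive dictionary updates. The sketch below is an illustration only: the helper name `effective_properties` and the flattened key `execution.parallelism` are assumptions (only `execution.result-mode` appears in this flattened form above), and the real client merges whole environment files rather than flat dictionaries.

```python
def effective_properties(defaults, session_env, cli_overrides):
    """Merge property layers; later layers win over earlier ones."""
    merged = dict(defaults)          # lowest precedence: defaults environment file
    merged.update(session_env)       # overrides from the session environment file
    merged.update(cli_overrides)     # highest precedence: SET commands in the CLI
    return merged

defaults = {
    "execution.type": "streaming",
    "execution.result-mode": "table",
    "execution.parallelism": 1,      # assumed flattened key, for illustration
}
session_env = {"execution.parallelism": 4}
cli_overrides = {"execution.result-mode": "changelog"}

props = effective_properties(defaults, session_env, cli_overrides)
print(props)
```

Here the session file changes only the parallelism, the SET command changes only the result mode, and everything else falls through to the defaults.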

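Restart Strategies

Restart strategies control how Flink jobs are restarted in case of a failure. Similar to the global restart strategies of a Flink cluster, a more fine-grained restart configuration can be declared in an environment file.

The following strategies are supported: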

execution:

  # falls back to the global strategy defined in flink-conf.yaml

  restart-strategy:

    type: fallback

  # job fails directly and no restart is attempted

  restart-strategy:

    type: none

  # attempts a given number of times to restart the job

  restart-strategy:

    type: fixed-delay

    attempts: 3      # retries before job is declared as failed (default: Integer.MAX_VALUE)

    delay: 10000     # delay in ms between retries (default: 10 s)

  # attempts as long as the maximum number of failures per time interval is not exceeded

  restart-strategy:

    type: failure-rate

    max-failures-per-interval: 1   # retries in interval until failing (default: 1)

    failure-rate-interval: 60000   # measuring interval in ms for failure rate

    delay: 10000                   # delay in ms between retries (default: 10 s)
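To make the meaning of attempts and delay concrete, here is a small Python model of the fixed-delay strategy. It is a sketch under assumptions, not Flink's scheduler: `run_with_fixed_delay` and `flaky_job` are hypothetical names, and the `sleep` parameter exists only so the example can record waits instead of actually sleeping.

```python
import time

def run_with_fixed_delay(job, attempts, delay_ms, sleep=time.sleep):
    """Run job(); after a failure, retry up to `attempts` times with a fixed delay."""
    retries = 0
    while True:
        try:
            return job()
        except Exception:
            if retries >= attempts:
                raise                      # retries exhausted: the job is declared failed
            retries += 1
            sleep(delay_ms / 1000)         # wait before the next restart

calls = {"n": 0}

def flaky_job():
    """Hypothetical job that fails twice and then succeeds."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RuntimeError("simulated failure")
    return "ok"

waits = []                                 # record delays instead of sleeping
outcome = run_with_fixed_delay(flaky_job, attempts=3, delay_ms=10000, sleep=waits.append)
print(outcome, calls["n"], waits)
```

With attempts: 3 and delay: 10000 as in the listing, the job may be restarted three times, ten seconds apart, before it is reported as failed.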

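Dependencies

The SQL Client does not require you to set up a Java project using Maven or SBT. Instead, you can pass the dependencies as regular JAR files that get submitted to the cluster. You can either specify each JAR file separately (using --jar) or define entire library directories (using --library). For connectors to external systems (such as Apache Kafka) and corresponding data formats (such as JSON), Flink provides ready-to-use JAR bundles. These JAR files can be downloaded for each release from the Maven central repository.

A full list of the provided SQL JARs and documentation about how to use them can be found on the "Connect to External Systems" page.

The following example shows an environment file that defines a table source reading JSON data from Apache Kafka.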

tables:

  - name: TaxiRides

    type: source-table

    update-mode: append

    connector:

      property-version: 1

      type: kafka

      version: "0.11"

      topic: TaxiRides

      startup-mode: earliest-offset

      properties:

        - key: zookeeper.connect

          value: localhost:2181

        - key: bootstrap.servers

          value: localhost:9092

        - key: group.id

          value: testGroup

    format:

      property-version: 1

      type: json

      schema: "ROW<rideId LONG, lon FLOAT, lat FLOAT, rideTime TIMESTAMP>"

    schema:

      - name: rideId

        type: LONG

      - name: lon

        type: FLOAT

      - name: lat

        type: FLOAT

      - name: rowTime

        type: TIMESTAMP

        rowtime:

          timestamps:

            type: "from-field"

            from: "rideTime"

          watermarks:

            type: "periodic-bounded"

            delay: "60000"

      - name: procTime

        type: TIMESTAMP

        proctime: true

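The resulting schema of the TaxiRides table contains most of the fields of the JSON schema. Furthermore, it adds a rowtime attribute rowTime and a processing-time attribute procTime.

Both connector and format allow defining a property version (currently version 1) for future backwards compatibility.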
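The periodic-bounded watermark setting in the rowtime definition can be illustrated with a short Python sketch. It assumes the usual bounded-out-of-orderness rule (watermark = highest timestamp seen so far minus the delay) and, for simplicity, computes a watermark after every record; in Flink, watermarks are emitted periodically. The function name and the sample timestamps are made up for illustration.

```python
def bounded_watermarks(timestamps_ms, delay_ms):
    """Return the watermark after each record: max timestamp seen so far minus delay."""
    max_ts = None
    watermarks = []
    for ts in timestamps_ms:
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermarks.append(max_ts - delay_ms)
    return watermarks

# Hypothetical rideTime values in milliseconds; the third record arrives out of order.
ride_times = [100_000, 160_000, 130_000, 200_000]
watermarks = bounded_watermarks(ride_times, delay_ms=60_000)
print(watermarks)
```

A delay of 60000 therefore tolerates records that arrive up to one minute behind the latest rideTime seen; the out-of-order record above does not move the watermark backwards.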

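User-defined Functions

The SQL Client allows users to create custom, user-defined functions to be used in SQL queries. Currently, these functions are restricted to being defined programmatically in Java/Scala classes.

In order to provide a user-defined function, you first need to implement and compile a function class that extends ScalarFunction, AggregateFunction, or TableFunction (see User-defined Functions). One or more functions can then be packaged into a dependency JAR for the SQL Client.

All functions must be declared in an environment file before being called. For each item in the list of functions, one must specify

a name under which the function is registered,
the source of the function using from (restricted to class for now),
the class, which indicates the fully qualified class name of the function, and an optional list of constructor parameters for instantiation.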

functions:

  - name: ...               # required: name of the function

    from: class             # required: source of the function (can only be "class" for now)

    class: ...              # required: fully qualified class name of the function

    constructor:            # optional: constructor parameters of the function class

      - ...                 # optional: a literal parameter with implicit type

      - class: ...          # optional: full class name of the parameter

        constructor:        # optional: constructor parameters of the parameter's class

          - type: ...       # optional: type of the literal parameter

            value: ...      # optional: value of the literal parameter

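Make sure that the order and types of the specified parameters strictly match one of the constructors of your function class.

Constructor Parameters

Depending on the user-defined function, it might be necessary to parameterize the implementation before using it in SQL statements.

As shown in the example before, when declaring a user-defined function, a class can be configured using constructor parameters in one of the following three ways:

A literal value with implicit type: The SQL Client automatically derives the type from the literal value itself. Currently, only values of BOOLEAN, INT, DOUBLE, and VARCHAR are supported. If the automatic derivation does not work as expected (e.g., you need a VARCHAR false), use explicit types instead.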

- true         # -> BOOLEAN (case sensitive)

- 42           # -> INT

- 1234.222     # -> DOUBLE

- foo          # -> VARCHAR

A literal value with explicit type: Explicitly declare the parameter with type and value properties for type safety.

- type: DECIMAL

  value: 11111111111111111

The table below illustrates the supported Java parameter types and the corresponding SQL type strings.

Java type	SQL type
`java.math.BigDecimal`	`DECIMAL`
`java.lang.Boolean`	`BOOLEAN`
`java.lang.Byte`	`TINYINT`
`java.lang.Double`	`DOUBLE`
`java.lang.Float`	`REAL`, `FLOAT`
`java.lang.Integer`	`INTEGER`, `INT`
`java.lang.Long`	`BIGINT`
`java.lang.Short`	`SMALLINT`
`java.lang.String`	`VARCHAR`

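More types (e.g., TIMESTAMP or ARRAY), primitive types, and null are not supported yet.

A (nested) class instance: Besides literal values, you can also create (nested) class instances for constructor parameters by specifying the class and constructor properties. This process can be performed recursively until all constructor parameters are represented by literal values.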

- class: foo.bar.paramClass

  constructor:

    - StarryName

    - class: java.lang.Integer

      constructor:

        - class: java.lang.String

          constructor:

            - type: VARCHAR

              value: 3
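The three parameter forms above (implicit literal, explicitly typed literal, nested class instance) can be resolved with one recursive function. The following Python sketch only mirrors the idea: it is not how the SQL Client instantiates Java classes. `ParamClass` is a hypothetical stand-in for foo.bar.paramClass, and the `CLASSES` and `LITERAL_TYPES` mappings are assumptions covering just the names used in this example.

```python
from collections import namedtuple
from decimal import Decimal

# Hypothetical stand-in for the user class foo.bar.paramClass from the example.
ParamClass = namedtuple("ParamClass", ["name", "number"])

# Assumed mapping from class names to Python callables (illustration only).
CLASSES = {
    "foo.bar.paramClass": ParamClass,
    "java.lang.Integer": int,
    "java.lang.String": str,
}

# Converters for a few of the explicit SQL type strings listed in the table.
LITERAL_TYPES = {
    "VARCHAR": str,
    "INT": int,
    "BIGINT": int,
    "DOUBLE": float,
    "DECIMAL": Decimal,
}

def resolve(param):
    """Turn one constructor parameter declaration into a concrete value."""
    if isinstance(param, dict) and "class" in param:
        # Nested class instance: resolve its own constructor parameters first.
        args = [resolve(p) for p in param.get("constructor", [])]
        return CLASSES[param["class"]](*args)
    if isinstance(param, dict) and "type" in param:
        # Literal with explicit type: convert the value to the declared type.
        return LITERAL_TYPES[param["type"]](param["value"])
    # Literal with implicit type: keep the value as it was parsed.
    return param

# The nested declaration from the YAML example, written as parsed data.
declaration = {
    "class": "foo.bar.paramClass",
    "constructor": [
        "StarryName",
        {
            "class": "java.lang.Integer",
            "constructor": [
                {
                    "class": "java.lang.String",
                    "constructor": [{"type": "VARCHAR", "value": 3}],
                }
            ],
        },
    ],
}

instance = resolve(declaration)
print(instance)
```

The recursion bottoms out at literal values, which matches the rule that nesting may continue until every constructor parameter is expressed as a literal.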

Catalogs

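Catalogs can be defined as a set of YAML properties and are automatically registered to the environment upon starting the SQL Client.

Users can specify which catalog to use as the current catalog in the SQL CLI, and which database of that catalog to use as the current database.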

catalogs:

   - name: catalog_1

     type: hive

     property-version: 1

     default-database: mydb2

     hive-version: 1.2.1

     hive-conf-dir: <path of Hive conf directory>

   - name: catalog_2

     type: hive

     property-version: 1

     hive-conf-dir: <path of Hive conf directory>

execution:

   ...

   current-catalog: catalog_1

   current-database: mydb1

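For more information about catalogs, see Catalogs.

Detached SQL Queries

In order to define end-to-end SQL pipelines, SQL's INSERT INTO statement can be used for submitting long-running, detached queries to a Flink cluster. These queries produce their results into an external system instead of the SQL Client. This allows for dealing with higher parallelism and larger amounts of data. The CLI itself does not have any control over a detached query after submission.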

INSERT INTO MyTableSink SELECT * FROM MyTableSource

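The table sink MyTableSink has to be declared in the environment file. See the connection page for more information about supported external systems and their configuration. An example for an Apache Kafka table sink is shown below.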

tables:

  - name: MyTableSink

    type: sink-table

    update-mode: append

    connector:

      property-version: 1

      type: kafka

      version: "0.11"

      topic: OutputTopic

      properties:

        - key: zookeeper.connect

          value: localhost:2181

        - key: bootstrap.servers

          value: localhost:9092

        - key: group.id

          value: testGroup

    format:

      property-version: 1

      type: json

      derive-schema: true

    schema:

      - name: rideId

        type: LONG

      - name: lon

        type: FLOAT

      - name: lat

        type: FLOAT

      - name: rideTime

        type: TIMESTAMP

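The SQL Client makes sure that a statement is successfully submitted to the cluster. Once the query is submitted, the CLI shows information about the Flink job.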

[INFO] Table update statement has been successfully submitted to the cluster:

Cluster ID: StandaloneClusterId

Job ID: 6f922fe5cba87406ff23ae4a7bb79044

Web interface: http://localhost:8081

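Attention: The SQL Client does not track the status of the running Flink job after submission. The CLI process can be shut down after the submission without affecting the detached query. Flink's restart strategy takes care of fault tolerance. A query can be cancelled using Flink's web interface, command line, or REST API.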
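As a rough illustration of cancelling through the REST API, the Python sketch below builds the request for the job shown in the CLI output above. It assumes the job termination endpoint of Flink's REST API (PATCH on /jobs/<job id> with mode=cancel); check the REST API documentation of your Flink version before relying on it. The helper name `build_cancel_request` is made up, and the request is only constructed here, not sent.

```python
import urllib.request

def build_cancel_request(web_interface, job_id):
    """Build (but do not send) a request that asks the cluster to cancel a job."""
    url = f"{web_interface.rstrip('/')}/jobs/{job_id}?mode=cancel"
    return urllib.request.Request(url, method="PATCH")

# Values taken from the CLI output above.
request = build_cancel_request(
    "http://localhost:8081",
    "6f922fe5cba87406ff23ae4a7bb79044",
)
print(request.get_method(), request.full_url)

# Against a running cluster, the request could then be sent with:
# urllib.request.urlopen(request)
```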

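SQL Views

Views allow defining virtual tables from SQL queries. The view definition is parsed and validated immediately. However, the actual execution happens when the view is accessed during the submission of a general INSERT INTO or SELECT statement.

Views can be defined either in environment files or within the CLI session.

The following example shows how to define multiple views in a file. The views are registered in the order in which they are defined in the environment file. Reference chains such as view A depends on view B depends on view C are supported.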

tables:

  - name: MyTableSource

    # ...

  - name: MyRestrictedView

    type: view

    query: "SELECT MyField2 FROM MyTableSource"

  - name: MyComplexView

    type: view

    query: >

      SELECT MyField2 + 42, CAST(MyField1 AS VARCHAR)

      FROM MyTableSource

      WHERE MyField2 > 200

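Similar to table sources and sinks, views defined in a session environment file have the highest precedence.

Views can also be created within a CLI session using the CREATE VIEW statement: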

CREATE VIEW MyNewView AS SELECT MyField2 FROM MyTableSource;

Views created within a CLI session can also be removed again using the DROP VIEW statement:

DROP VIEW MyNewView;

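Attention: The definition of views in the CLI is limited to the syntax mentioned above. Defining a schema for views or escaping whitespace in table names will be supported in future versions.

Temporal Tables

A temporal table allows for a (parameterized) view on a changing history table that returns the content of a table at a specific point in time. This is especially useful for joining a table with the content of another table at a particular timestamp. More information can be found on the temporal table joins page.

The following example shows how to define a temporal table SourceTemporalTable: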

tables:

  # Define the table source (or view) that contains updates to a temporal table

  - name: HistorySource

    type: source-table

    update-mode: append

    connector: # ...

    format: # ...

    schema:

      - name: integerField

        type: INT

      - name: stringField

        type: VARCHAR

      - name: rowtimeField

        type: TIMESTAMP

        rowtime:

          timestamps:

            type: from-field

            from: rowtimeField

          watermarks:

            type: from-source

  # Define a temporal table over the changing history table with time attribute and primary key

  - name: SourceTemporalTable

    type: temporal-table

    history-table: HistorySource

    primary-key: integerField

    time-attribute: rowtimeField  # could also be a proctime field

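As shown in the example, definitions of table sources, views, and temporal tables can be mixed with each other. They are registered in the order in which they are defined in the environment file. For example, a temporal table can reference a view, which can depend on another view or table source.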
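What a temporal table returns "at a specific point in time" can be shown with a small Python model. This is a conceptual sketch, not Flink's implementation: the function `temporal_table` and the sample rows are invented for illustration, and the field names follow the HistorySource schema above with plain integers standing in for timestamps.

```python
def temporal_table(history, primary_key, time_attribute):
    """Return a function giving the latest row per key as of a point in time."""
    def as_of(t):
        latest = {}
        for row in sorted(history, key=lambda r: r[time_attribute]):
            if row[time_attribute] <= t:
                latest[row[primary_key]] = row   # newer versions replace older ones
        return list(latest.values())
    return as_of

# Hypothetical contents of HistorySource (append-only updates).
history = [
    {"integerField": 1, "stringField": "a", "rowtimeField": 10},
    {"integerField": 2, "stringField": "b", "rowtimeField": 20},
    {"integerField": 1, "stringField": "c", "rowtimeField": 30},
]

source_temporal_table = temporal_table(
    history, primary_key="integerField", time_attribute="rowtimeField"
)

at_25 = source_temporal_table(25)
at_30 = source_temporal_table(30)
print(at_25)
print(at_30)
```

At time 25 the key 1 still maps to the row with stringField "a"; at time 30 it maps to "c", while key 2 is unchanged. This is the per-key, versioned lookup that a temporal table join relies on.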

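Limitations & Future

The current SQL Client implementation is in a very early development stage and might change in the future as part of the bigger Flink Improvement Proposal 24 (FLIP-24). Feel free to join the discussion and open issues about bugs and features that you find useful.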
