Hive integration table engine

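The Hive engine lets you run SELECT queries on Hive tables stored in HDFS. It currently supports the following input formats:

- Text: supports only simple scalar column types, except binary
- ORC: supports simple scalar column types, except char; the only complex type supported is array
- Parquet: supports all simple scalar column types; the only complex type supported is array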

Creating a table

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [ALIAS expr1],
    name2 [type2] [ALIAS expr2],
    ...
) ENGINE = Hive('thrift://host:port', 'database', 'table')
PARTITION BY expr

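The table structure can differ from the structure of the original Hive table:

- Column names must match those of the original Hive table, but you can use only some of the columns, in any order, and you can also define alias columns computed from other columns.
- Column types must be consistent with those of the original Hive table.
- The PARTITION BY expression must be consistent with the original Hive table, and the columns it references must be present in the table structure.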
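
As a minimal sketch of these rules, the statement below declares only some of the Hive columns, in a different order, and adds one alias column. It assumes the Hive table test.test_orc used in the ORC example later in this article and a Metastore at thrift://localhost:9083. The table name test_orc_subset and the alias column f_int_doubled are hypothetical names chosen for illustration.

```sql
-- Sketch only: test_orc_subset and f_int_doubled are hypothetical names.
CREATE TABLE test.test_orc_subset
(
    `day` String,                            -- partition column, must be declared
    `f_string` String,                       -- same name as in the Hive table
    `f_int` Int32,
    `f_int_doubled` Int64 ALIAS f_int * 2    -- computed from f_int, not read from Hive
)
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_orc')
PARTITION BY day
```

By default, alias columns are not returned by SELECT *, so f_int_doubled has to be named explicitly in a query.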

Engine parameters

- thrift://host:port: the Hive Metastore address.
- database: the remote database name.
- table: the remote table name.

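Usage examples

How to use the local cache for the HDFS file system

We strongly recommend enabling the local cache for remote file systems. Benchmarks show that reads are about twice as fast with the cache enabled.

Before using the cache, add the following to config.xml: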

<local_cache_for_remote_fs>
    <enable>true</enable>
    <root_dir>local_cache</root_dir>
    <limit_size>559096952</limit_size>
    <bytes_read_before_flush>1048576</bytes_read_before_flush>
</local_cache_for_remote_fs>

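- enable: when set to true, ClickHouse maintains a local cache for the remote file system (HDFS).
- root_dir: required. The root directory in which local cache files for the remote file system are stored.
- limit_size: required. The maximum total size of the local cache files, in bytes.
- bytes_read_before_flush: the number of bytes read from the remote file system before they are flushed to the local file system while a file is being downloaded. The default is 1 MB.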

Even when ClickHouse is started with the local cache for remote file systems enabled, a query can opt out of it by setting use_local_cache_for_remote_storage = 0. The default value of use_local_cache_for_remote_storage is 1.
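
For example, a single query can bypass the cache while other queries keep using it. This sketch assumes the test.test_orc table from the ORC example later in this article; only the setting name comes from the text above.

```sql
-- Read directly from HDFS for this query only; the server-wide cache stays enabled.
SELECT count()
FROM test.test_orc
SETTINGS use_local_cache_for_remote_storage = 0
```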

Querying a Hive table with ORC input format

Creating the table in Hive

hive > CREATE TABLE `test`.`test_orc`(
  `f_tinyint` tinyint,
  `f_smallint` smallint,
  `f_int` int,
  `f_integer` int,
  `f_bigint` bigint,
  `f_float` float,
  `f_double` double,
  `f_decimal` decimal(10,0),
  `f_timestamp` timestamp,
  `f_date` date,
  `f_string` string,
  `f_varchar` varchar(100),
  `f_bool` boolean,
  `f_binary` binary,
  `f_array_int` array<int>,
  `f_array_string` array<string>,
  `f_array_float` array<float>,
  `f_array_array_int` array<array<int>>,
  `f_array_array_string` array<array<string>>,
  `f_array_array_float` array<array<float>>)
PARTITIONED BY (
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://testcluster/data/hive/test.db/test_orc'
OK
Time taken: 0.51 seconds

hive > insert into test.test_orc partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));
OK
Time taken: 36.025 seconds

hive > select * from test.test_orc;
OK
1   2   3   4   5   6.11    7.22    8   2021-11-05 12:38:16.314 2021-11-05  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18
Time taken: 0.295 seconds, Fetched: 1 row(s)

Creating the table in ClickHouse

The ClickHouse table below reads data from the Hive table created above:

CREATE TABLE test.test_orc
(
    `f_tinyint` Int8,
    `f_smallint` Int16,
    `f_int` Int32,
    `f_integer` Int32,
    `f_bigint` Int64,
    `f_float` Float32,
    `f_double` Float64,
    `f_decimal` Float64,
    `f_timestamp` DateTime,
    `f_date` Date,
    `f_string` String,
    `f_varchar` String,
    `f_bool` Bool,
    `f_binary` String,
    `f_array_int` Array(Int32),
    `f_array_string` Array(String),
    `f_array_float` Array(Float32),
    `f_array_array_int` Array(Array(Int32)),
    `f_array_array_string` Array(Array(String)),
    `f_array_array_float` Array(Array(Float32)),
    `day` String
)
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_orc')
PARTITION BY day

SELECT * FROM test.test_orc settings input_format_orc_allow_missing_columns = 1\G

SELECT *
FROM test.test_orc
SETTINGS input_format_orc_allow_missing_columns = 1

Query id: c3eaffdc-78ab-43cd-96a4-4acc5b480658

Row 1:
──────
f_tinyint:            1
f_smallint:           2
f_int:                3
f_integer:            4
f_bigint:             5
f_float:              6.11
f_double:             7.22
f_decimal:            8
f_timestamp:          2021-12-04 04:00:44
f_date:               2021-12-03
f_string:             hello world
f_varchar:            hello world
f_bool:               true
f_binary:             hello world
f_array_int:          [1,2,3]
f_array_string:       ['hello world','hello world']
f_array_float:        [1.1,1.2]
f_array_array_int:    [[1,2],[3,4]]
f_array_array_string: [['a','b'],['c','d']]
f_array_array_float:  [[1.11,2.22],[3.33,4.44]]
day:                  2021-09-18

1 rows in set. Elapsed: 0.078 sec.
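
Because day is the partition column of both the Hive table and the ClickHouse table, a typical query restricts it and reads only the columns it needs. The following is a minimal sketch against the table above; the filter value is the partition loaded earlier, and how much data the filter lets the engine skip depends on the server version and is not claimed here.

```sql
-- Read three columns from a single Hive partition.
SELECT f_int, f_string, f_array_int
FROM test.test_orc
WHERE day = '2021-09-18'
SETTINGS input_format_orc_allow_missing_columns = 1
```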

Querying a Hive table with Parquet input format

Creating the table in Hive

hive > CREATE TABLE `test`.`test_parquet`(
  `f_tinyint` tinyint,
  `f_smallint` smallint,
  `f_int` int,
  `f_integer` int,
  `f_bigint` bigint,
  `f_float` float,
  `f_double` double,
  `f_decimal` decimal(10,0),
  `f_timestamp` timestamp,
  `f_date` date,
  `f_string` string,
  `f_varchar` varchar(100),
  `f_char` char(100),
  `f_bool` boolean,
  `f_binary` binary,
  `f_array_int` array<int>,
  `f_array_string` array<string>,
  `f_array_float` array<float>,
  `f_array_array_int` array<array<int>>,
  `f_array_array_string` array<array<string>>,
  `f_array_array_float` array<array<float>>)
PARTITIONED BY (
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://testcluster/data/hive/test.db/test_parquet'
OK
Time taken: 0.51 seconds

hive > insert into test.test_parquet partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));
OK
Time taken: 36.025 seconds

hive > select * from test.test_parquet;
OK
1   2   3   4   5   6.11    7.22    8   2021-12-14 17:54:56.743 2021-12-14  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18
Time taken: 0.766 seconds, Fetched: 1 row(s)

Creating the table in ClickHouse

The ClickHouse table below reads data from the Hive table created above:

CREATE TABLE test.test_parquet
(
    `f_tinyint` Int8,
    `f_smallint` Int16,
    `f_int` Int32,
    `f_integer` Int32,
    `f_bigint` Int64,
    `f_float` Float32,
    `f_double` Float64,
    `f_decimal` Float64,
    `f_timestamp` DateTime,
    `f_date` Date,
    `f_string` String,
    `f_varchar` String,
    `f_char` String,
    `f_bool` Bool,
    `f_binary` String,
    `f_array_int` Array(Int32),
    `f_array_string` Array(String),
    `f_array_float` Array(Float32),
    `f_array_array_int` Array(Array(Int32)),
    `f_array_array_string` Array(Array(String)),
    `f_array_array_float` Array(Array(Float32)),
    `day` String
)
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_parquet')
PARTITION BY day

SELECT * FROM test.test_parquet settings input_format_parquet_allow_missing_columns = 1\G

SELECT *
FROM test.test_parquet
SETTINGS input_format_parquet_allow_missing_columns = 1

Query id: 4e35cf02-c7b2-430d-9b81-16f438e5fca9

Row 1:
──────
f_tinyint:            1
f_smallint:           2
f_int:                3
f_integer:            4
f_bigint:             5
f_float:              6.11
f_double:             7.22
f_decimal:            8
f_timestamp:          2021-12-14 17:54:56
f_date:               2021-12-14
f_string:             hello world
f_varchar:            hello world
f_char:               hello world
f_bool:               true
f_binary:             hello world
f_array_int:          [1,2,3]
f_array_string:       ['hello world','hello world']
f_array_float:        [1.1,1.2]
f_array_array_int:    [[1,2],[3,4]]
f_array_array_string: [['a','b'],['c','d']]
f_array_array_float:  [[1.11,2.22],[3.33,4.44]]
day:                  2021-09-18

1 rows in set. Elapsed: 0.357 sec.
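
Since the array columns are declared as Array(...) in the ClickHouse table, the usual ClickHouse array functions can be applied to them. The sketch below uses the built-in functions length, arraySum, and arrayJoin on the table above; the result aliases are illustrative names.

```sql
SELECT
    length(f_array_string) AS string_count,       -- number of elements in the array
    arraySum(f_array_int) AS int_sum,              -- sum of the Int32 elements
    arrayJoin(f_array_array_int) AS inner_array    -- one output row per inner array
FROM test.test_parquet
SETTINGS input_format_parquet_allow_missing_columns = 1
```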

Querying a Hive table with text input format

Creating the table in Hive

hive > CREATE TABLE `test`.`test_text`(
  `f_tinyint` tinyint,
  `f_smallint` smallint,
  `f_int` int,
  `f_integer` int,
  `f_bigint` bigint,
  `f_float` float,
  `f_double` double,
  `f_decimal` decimal(10,0),
  `f_timestamp` timestamp,
  `f_date` date,
  `f_string` string,
  `f_varchar` varchar(100),
  `f_char` char(100),
  `f_bool` boolean,
  `f_binary` binary,
  `f_array_int` array<int>,
  `f_array_string` array<string>,
  `f_array_float` array<float>,
  `f_array_array_int` array<array<int>>,
  `f_array_array_string` array<array<string>>,
  `f_array_array_float` array<array<float>>)
PARTITIONED BY (
  `day` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://testcluster/data/hive/test.db/test_text'
Time taken: 0.1 seconds, Fetched: 34 row(s)

hive > insert into test.test_text partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));
OK
Time taken: 36.025 seconds

hive > select * from test.test_text;
OK
1   2   3   4   5   6.11    7.22    8   2021-12-14 18:11:17.239 2021-12-14  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18
Time taken: 0.624 seconds, Fetched: 1 row(s)

Creating the table in ClickHouse

The ClickHouse table below reads data from the Hive table created above:

CREATE TABLE test.test_text
(
    `f_tinyint` Int8,
    `f_smallint` Int16,
    `f_int` Int32,
    `f_integer` Int32,
    `f_bigint` Int64,
    `f_float` Float32,
    `f_double` Float64,
    `f_decimal` Float64,
    `f_timestamp` DateTime,
    `f_date` Date,
    `f_string` String,
    `f_varchar` String,
    `f_char` String,
    `f_bool` Bool,
    `day` String
)
ENGINE = Hive('thrift://localhost:9083', 'test', 'test_text')
PARTITION BY day

SELECT * FROM test.test_text settings input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort'\G

SELECT *
FROM test.test_text
SETTINGS input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort'

Query id: 55b79d35-56de-45b9-8be6-57282fbf1f44

Row 1:
──────
f_tinyint:   1
f_smallint:  2
f_int:       3
f_integer:   4
f_bigint:    5
f_float:     6.11
f_double:    7.22
f_decimal:   8
f_timestamp: 2021-12-14 18:11:17
f_date:      2021-12-14
f_string:    hello world
f_varchar:   hello world
f_char:      hello world
f_bool:      true
day:         2021-09-18
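
The ClickHouse table above leaves out the f_binary and array columns of the Hive table, in line with the text format supporting only simple scalar types other than binary. A query on the remaining columns might look like the following sketch, which reuses the settings from the example above; the column selection and aliases are illustrative.

```sql
-- Per-partition row count and latest timestamp from the text-format table.
SELECT
    day,
    count() AS row_count,
    max(f_timestamp) AS latest_timestamp
FROM test.test_text
GROUP BY day
SETTINGS input_format_skip_unknown_fields = 1,
         input_format_with_names_use_header = 1,
         date_time_input_format = 'best_effort'
```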
