Hive集成表引擎

Hive引擎允许对HDFS Hive表执行 SELECT 查询。目前它支持如下输入格式:

-文本:只支持简单的标量列类型，除了 Binary

ORC:支持简单的标量列类型，除了char; 只支持 array 这样的复杂类型
Parquet:支持所有简单标量列类型;只支持 array 这样的复杂类型

创建表

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]

(

    name1 [type1] [ALIAS expr1],

    name2 [type2] [ALIAS expr2],

    ...

) ENGINE = Hive('thrift://host:port', 'database', 'table');

PARTITION BY expr

表的结构可以与原来的Hive表结构有所不同:

列名应该与原来的Hive表相同，但你可以使用这些列中的一些，并以任何顺序，你也可以使用一些从其他列计算的别名列。
列类型与原Hive表的列类型保持一致。
“Partition by expression”应与原Hive表保持一致，“Partition by expression”中的列应在表结构中。

引擎参数

thrift://host:port — Hive Metastore 地址
database — 远程数据库名.
table — 远程数据表名.

使用示例

如何使用HDFS文件系统的本地缓存

我们强烈建议您为远程文件系统启用本地缓存。基准测试显示，如果使用缓存，它的速度会快两倍。

在使用缓存之前，请将其添加到 config.xml

<local_cache_for_remote_fs>

    <enable>true</enable>

    <root_dir>local_cache</root_dir>

    <limit_size>559096952</limit_size>

    <bytes_read_before_flush>1048576</bytes_read_before_flush>

</local_cache_for_remote_fs>

enable: 开启后，ClickHouse将为HDFS (远程文件系统)维护本地缓存。
root_dir: 必需的。用于存储远程文件系统的本地缓存文件的根目录。
limit_size: 必需的。本地缓存文件的最大大小(单位为字节)。
bytes_read_before_flush: 从远程文件系统下载文件时，刷新到本地文件系统前的控制字节数。缺省值为1MB。

当ClickHouse为远程文件系统启用了本地缓存时，用户仍然可以选择不使用缓存，并在查询中设置 use_local_cache_for_remote_storage = 0, use_local_cache_for_remote_storage 默认为 1。

查询 ORC 输入格式的Hive 表

在 Hive 中建表

hive > CREATE TABLE `test`.`test_orc`(

  `f_tinyint` tinyint,

  `f_smallint` smallint,

  `f_int` int,

  `f_integer` int,

  `f_bigint` bigint,

  `f_float` float,

  `f_double` double,

  `f_decimal` decimal(10,0),

  `f_timestamp` timestamp,

  `f_date` date,

  `f_string` string,

  `f_varchar` varchar(100),

  `f_bool` boolean,

  `f_binary` binary,

  `f_array_int` array<int>,

  `f_array_string` array<string>,

  `f_array_float` array<float>,

  `f_array_array_int` array<array<int>>,

  `f_array_array_string` array<array<string>>,

  `f_array_array_float` array<array<float>>)

PARTITIONED BY (

  `day` string)

ROW FORMAT SERDE

  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

STORED AS INPUTFORMAT

  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'

OUTPUTFORMAT

  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

LOCATION

  'hdfs://testcluster/data/hive/test.db/test_orc'

OK

Time taken: 0.51 seconds

hive > insert into test.test_orc partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));

OK

Time taken: 36.025 seconds

hive > select * from test.test_orc;

OK

1   2   3   4   5   6.11    7.22    8   2021-11-05 12:38:16.314 2021-11-05  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18

Time taken: 0.295 seconds, Fetched: 1 row(s)

在 ClickHouse 中建表

ClickHouse中的表，从上面创建的Hive表中获取数据:

CREATE TABLE test.test_orc

(

    `f_tinyint` Int8,

    `f_smallint` Int16,

    `f_int` Int32,

    `f_integer` Int32,

    `f_bigint` Int64,

    `f_float` Float32,

    `f_double` Float64,

    `f_decimal` Float64,

    `f_timestamp` DateTime,

    `f_date` Date,

    `f_string` String,

    `f_varchar` String,

    `f_bool` Bool,

    `f_binary` String,

    `f_array_int` Array(Int32),

    `f_array_string` Array(String),

    `f_array_float` Array(Float32),

    `f_array_array_int` Array(Array(Int32)),

    `f_array_array_string` Array(Array(String)),

    `f_array_array_float` Array(Array(Float32)),

    `day` String

)

ENGINE = Hive('thrift://localhost:9083', 'test', 'test_orc')

PARTITION BY day

SELECT * FROM test.test_orc settings input_format_orc_allow_missing_columns = 1\G

SELECT *

FROM test.test_orc

SETTINGS input_format_orc_allow_missing_columns = 1

Query id: c3eaffdc-78ab-43cd-96a4-4acc5b480658

Row 1:

──────

f_tinyint:            1

f_smallint:           2

f_int:                3

f_integer:            4

f_bigint:             5

f_float:              6.11

f_double:             7.22

f_decimal:            8

f_timestamp:          2021-12-04 04:00:44

f_date:               2021-12-03

f_string:             hello world

f_varchar:            hello world

f_bool:               true

f_binary:             hello world

f_array_int:          [1,2,3]

f_array_string:       ['hello world','hello world']

f_array_float:        [1.1,1.2]

f_array_array_int:    [[1,2],[3,4]]

f_array_array_string: [['a','b'],['c','d']]

f_array_array_float:  [[1.11,2.22],[3.33,4.44]]

day:                  2021-09-18

1 rows in set. Elapsed: 0.078 sec.

查询 Parquest 输入格式的Hive 表

在 Hive 中建表

hive >

CREATE TABLE `test`.`test_parquet`(

  `f_tinyint` tinyint,

  `f_smallint` smallint,

  `f_int` int,

  `f_integer` int,

  `f_bigint` bigint,

  `f_float` float,

  `f_double` double,

  `f_decimal` decimal(10,0),

  `f_timestamp` timestamp,

  `f_date` date,

  `f_string` string,

  `f_varchar` varchar(100),

  `f_char` char(100),

  `f_bool` boolean,

  `f_binary` binary,

  `f_array_int` array<int>,

  `f_array_string` array<string>,

  `f_array_float` array<float>,

  `f_array_array_int` array<array<int>>,

  `f_array_array_string` array<array<string>>,

  `f_array_array_float` array<array<float>>)

PARTITIONED BY (

  `day` string)

ROW FORMAT SERDE

  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'

STORED AS INPUTFORMAT

  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'

OUTPUTFORMAT

  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

LOCATION

  'hdfs://testcluster/data/hive/test.db/test_parquet'

OK

Time taken: 0.51 seconds

hive >  insert into test.test_parquet partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));

OK

Time taken: 36.025 seconds

hive > select * from test.test_parquet;

OK

1   2   3   4   5   6.11    7.22    8   2021-12-14 17:54:56.743 2021-12-14  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18

Time taken: 0.766 seconds, Fetched: 1 row(s)

在 ClickHouse 中建表

ClickHouse 中的表，从上面创建的Hive表中获取数据:

CREATE TABLE test.test_parquet

(

    `f_tinyint` Int8,

    `f_smallint` Int16,

    `f_int` Int32,

    `f_integer` Int32,

    `f_bigint` Int64,

    `f_float` Float32,

    `f_double` Float64,

    `f_decimal` Float64,

    `f_timestamp` DateTime,

    `f_date` Date,

    `f_string` String,

    `f_varchar` String,

    `f_char` String,

    `f_bool` Bool,

    `f_binary` String,

    `f_array_int` Array(Int32),

    `f_array_string` Array(String),

    `f_array_float` Array(Float32),

    `f_array_array_int` Array(Array(Int32)),

    `f_array_array_string` Array(Array(String)),

    `f_array_array_float` Array(Array(Float32)),

    `day` String

)

ENGINE = Hive('thrift://localhost:9083', 'test', 'test_parquet')

PARTITION BY day

SELECT * FROM test.test_parquet settings input_format_parquet_allow_missing_columns = 1\G

SELECT *

FROM test_parquet

SETTINGS input_format_parquet_allow_missing_columns = 1

Query id: 4e35cf02-c7b2-430d-9b81-16f438e5fca9

Row 1:

──────

f_tinyint:            1

f_smallint:           2

f_int:                3

f_integer:            4

f_bigint:             5

f_float:              6.11

f_double:             7.22

f_decimal:            8

f_timestamp:          2021-12-14 17:54:56

f_date:               2021-12-14

f_string:             hello world

f_varchar:            hello world

f_char:               hello world

f_bool:               true

f_binary:             hello world

f_array_int:          [1,2,3]

f_array_string:       ['hello world','hello world']

f_array_float:        [1.1,1.2]

f_array_array_int:    [[1,2],[3,4]]

f_array_array_string: [['a','b'],['c','d']]

f_array_array_float:  [[1.11,2.22],[3.33,4.44]]

day:                  2021-09-18

1 rows in set. Elapsed: 0.357 sec.

查询文本输入格式的Hive表

在Hive 中建表

hive >

CREATE TABLE `test`.`test_text`(

  `f_tinyint` tinyint,

  `f_smallint` smallint,

  `f_int` int,

  `f_integer` int,

  `f_bigint` bigint,

  `f_float` float,

  `f_double` double,

  `f_decimal` decimal(10,0),

  `f_timestamp` timestamp,

  `f_date` date,

  `f_string` string,

  `f_varchar` varchar(100),

  `f_char` char(100),

  `f_bool` boolean,

  `f_binary` binary,

  `f_array_int` array<int>,

  `f_array_string` array<string>,

  `f_array_float` array<float>,

  `f_array_array_int` array<array<int>>,

  `f_array_array_string` array<array<string>>,

  `f_array_array_float` array<array<float>>)

PARTITIONED BY (

  `day` string)

ROW FORMAT SERDE

  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

  'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

  'hdfs://testcluster/data/hive/test.db/test_text'

Time taken: 0.1 seconds, Fetched: 34 row(s)

hive >  insert into test.test_text partition(day='2021-09-18') select 1, 2, 3, 4, 5, 6.11, 7.22, 8.333, current_timestamp(), current_date(), 'hello world', 'hello world', 'hello world', true, 'hello world', array(1, 2, 3), array('hello world', 'hello world'), array(float(1.1), float(1.2)), array(array(1, 2), array(3, 4)), array(array('a', 'b'), array('c', 'd')), array(array(float(1.11), float(2.22)), array(float(3.33), float(4.44)));

OK

Time taken: 36.025 seconds

hive > select * from test.test_text;

OK

1   2   3   4   5   6.11    7.22    8   2021-12-14 18:11:17.239 2021-12-14  hello world hello world hello world                                                                                             true    hello world [1,2,3] ["hello world","hello world"]   [1.1,1.2]   [[1,2],[3,4]]   [["a","b"],["c","d"]]   [[1.11,2.22],[3.33,4.44]]   2021-09-18

Time taken: 0.624 seconds, Fetched: 1 row(s)

在 ClickHouse 中建表

ClickHouse中的表，从上面创建的Hive表中获取数据:

CREATE TABLE test.test_text

(

    `f_tinyint` Int8,

    `f_smallint` Int16,

    `f_int` Int32,

    `f_integer` Int32,

    `f_bigint` Int64,

    `f_float` Float32,

    `f_double` Float64,

    `f_decimal` Float64,

    `f_timestamp` DateTime,

    `f_date` Date,

    `f_string` String,

    `f_varchar` String,

    `f_char` String,

    `f_bool` Bool,

    `day` String

)

ENGINE = Hive('thrift://localhost:9083', 'test', 'test_text')

PARTITION BY day

SELECT * FROM test.test_text settings input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort'\G

SELECT *

FROM test.test_text

SETTINGS input_format_skip_unknown_fields = 1, input_format_with_names_use_header = 1, date_time_input_format = 'best_effort'

Query id: 55b79d35-56de-45b9-8be6-57282fbf1f44

Row 1:

──────

f_tinyint:   1

f_smallint:  2

f_int:       3

f_integer:   4

f_bigint:    5

f_float:     6.11

f_double:    7.22

f_decimal:   8

f_timestamp: 2021-12-14 18:11:17

f_date:      2021-12-14

f_string:    hello world

f_varchar:   hello world

f_char:      hello world

f_bool:      true

day:         2021-09-18

资料分享

ClickHouse经典中文文档分享

参考文章

ClickHouse(19)ClickHouse集成Hive表引擎详细解析的更多相关文章

ClickHouse(10)ClickHouse合并树MergeTree家族表引擎之ReplacingMergeTree详细解析
目录建表语法数据处理策略资料分享参考文章 MergeTree拥有主键,但是它的主键却没有唯一键的约束.这意味着即便多行数据的主键相同,它们还是能够被正常写入.在某些使用场合,用户并不希望数据表 ...
ClickHouse(11)ClickHouse合并树MergeTree家族表引擎之SummingMergeTree详细解析
目录建表语法数据处理汇总的通用规则 AggregateFunction 列中的汇总嵌套结构数据的处理资料分享参考文章 SummingMergeTree引擎继承自MergeTree.区别在于 ...
ClickHouse(12)ClickHouse合并树MergeTree家族表引擎之AggregatingMergeTree详细解析
目录建表语法查询和插入数据数据处理逻辑 ClickHouse相关资料分享 AggregatingMergeTree引擎继承自 MergeTree,并改变了数据片段的合并逻辑.ClickHouse ...
ClickHouse(13)ClickHouse合并树MergeTree家族表引擎之CollapsingMergeTree详细解析
目录建表折叠数据算法资料分享参考文章该引擎继承于MergeTree,并在数据块合并算法中添加了折叠行的逻辑.CollapsingMergeTree会异步的删除(折叠)这些除了特定列Sig ...
ClickHouse入门：表引擎-HDFS
前言插件及服务器版本服务器:ubuntu 16.04Hadoop:2.6ClickHouse:20.9.3.45 文章目录简介引擎配置 HDFS表引擎的两种使用形式引用简介 ClickHous ...
Clickhouse表引擎之MergeTree
1.概述在Clickhouse中有多种表引擎,不同的表引擎拥有不同的功能,它直接决定了数据如何读写.是否能够并发读写.是否支持索引.数据是否可备份等等.本篇博客笔者将为大家介绍Clickhouse中 ...
UniqueMergeTree：支持实时更新删除的 ClickHouse 表引擎
UniqueMergeTree 开发的业务背景首先,我们看一下哪些场景需要用到实时更新. 我们总结了三类场景: 第一类是业务需要对它的交易类数据进行实时分析,需要把数据流同步到 ClickHouse ...
Clickhouse表引擎探究-ReplacingMergeTree
作者:耿宏宇 1 表引擎简述 1.1 官方描述 MergeTree 系列的引擎被设计用于插入极大量的数据到一张表当中.数据可以以数据片段的形式一个接着一个的快速写入,数据片段在后台按照一定的规则进行合 ...
ClickHouse(07)ClickHouse数据库引擎解析
目录 Atomic 建表语句特性 Table UUID RENAME TABLES DROP/DETACH TABLES EXCHANGE TABLES ReplicatedMergeTree in ...
clickhouse 19.14.m.n简单测试
ClickHouse is a column-oriented database management system (DBMS) for online analytical processing o ...

随机推荐

java循环自动生成简单图片
import java.awt.*; import java.awt.font.FontRenderContext; import java.awt.geom.Rectangle2D; import ...
如何创建可引导的 macOS Sonoma 安装介质
2023 年 9 月 26 日(北京时间 27 日凌晨)macOS Sonoma 正式版现已发布. 如何创建可引导的 macOS Sonoma 安装介质如何创建可引导的 macOS 安装器 | 如何 ...
linux常用命令(八) tar 打包、压缩、解包、解压缩
Linux 常用的压缩与解压缩命令有:tar.gzip.gunzip.bzip2.bunzip2.compress .uncompress. zip. unzip.rar.unrar 等. 首先要弄清 ...
Android历史版本
目录 [隐藏] 1 测试版 2 版本列表 2.1 Android 1.0 2.2 Android 1.1 2.3 Android 1.5 Cupcake 2.4 Android 1.6 Donut ...
解密Prompt系列17. LLM对齐方案再升级 WizardLM & BackTranslation & SELF-ALIGN
话接上文的指令微调的样本优化方案,上一章是通过多样性筛选和质量过滤,对样本量进行缩减,主打经济实惠.这一章是通过扩写,改写,以及回译等半监督样本挖掘方案对种子样本进行扩充,提高种子指令样本的多样性和复 ...
使用 Kubernetes 简化平台工程
平台工程在现代应用程序开发和部署中发挥的作用至关重要.随着软件应用程序变得越来越复杂和分散,对稳健且可扩展的基础设施的需求变得越来越重要.这就是平台工程的作用所在,它是支持整个软件开发生命周期的支柱. ...
Fox and Minimal path 题解
Fox and Minimal path 题目大意构造一张无向图,使得从 \(1\) 到 \(2\) 的最短路数量为 \(k\). 思路分析我们首先可以发现当 \(k = 2^t\) 时的构造方式 ...
《最新出炉》系列初窥篇-Python+Playwright自动化测试-22-处理select下拉框-上篇
1.简介在实际自动化测试过程中,我们也避免不了会遇到下拉框选择的测试,因此宏哥在这里直接分享和介绍一下,希望小伙伴或者童鞋们在以后工作中遇到可以有所帮助.今天,我们讲下playwright的下拉框怎 ...
JS深入之内存详解，数据结构，存储方式
理解了本文,就知道深拷贝和浅拷贝的底层,了解赋值的底层原理. 可以结合另一篇文章一起食用:深拷贝与浅拷贝的区别,实现深拷贝的方法介绍. 以下是正文: 栈数据结构栈的结构就是后进先出(LIFO),如果 ...
如何在Notepad++中轻松删除包含指定文本的字符串
如果你需要在大量文本中删除指定模式的字符串,可以使用Notepad++中的正则表达式功能.下面是一个示例,让你可以快速学会如何删除包含指定文本的字符串.我们将使用以下示例字符串: This is a ...

ClickHouse(19)ClickHouse集成Hive表引擎详细解析

Hive集成表引擎

创建表

使用示例

如何使用HDFS文件系统的本地缓存

查询 ORC 输入格式的Hive 表

在 Hive 中建表

在 ClickHouse 中建表

查询 Parquest 输入格式的Hive 表

在 Hive 中建表

在 ClickHouse 中建表

查询文本输入格式的Hive表

在Hive 中建表

在 ClickHouse 中建表

资料分享

参考文章

ClickHouse(19)ClickHouse集成Hive表引擎详细解析的更多相关文章

随机推荐

热门专题