Hive 表数据的存储和压缩格式

SerDe

* 按行存储

* 按列存储

file_format:

  :

  | SEQUENCEFILE	序列化(行存储)

  | TEXTFILE  		文本格式(行存储)- (Default, depending on hive.default.fileformat configuration)

  | RCFILE    		列式存储文件(列存储)- (Note: Available in Hive 0.6.0 and later)

  | ORC 	  常用 	优化列式存储文件(列存储)- (Note: Available in Hive 0.11.0 and later)

  | PARQUET   常用	(列存储)- (Note: Available in Hive 0.13.0 and later)

  | AVRO     		- (Note: Available in Hive 0.14.0 and later)

  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

ORC列存储结构如图所示，个人理解：每一个字段存为一列，index data 保存了最大值和最小值，只有当索引范围合适时，才去扫描对应的列数据查找匹配项

As shown in the diagram, each stripe in an ORC file holds index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations. Row data is used in table scans.

Index data includes min and max values for each column and the row positions within each column.索引数据包含了对应的列(column)数据的最大值和最小值，以及列中数据的相应行的位置。 Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.

来自 hortonworks 的性能评估

二、各种数据存储格式比较

For example, creating an ORC stored table without compression:

官网例子

create table Addresses (

  name string,

  street string,

  city string,

  state string,

  zip int

) stored as orc tblproperties ("orc.compress"="NONE");

个人测试用例：TEXTFILE

create table bf_log(

user_ip string,

a string,

time string,

get2host string,

status1 string,

status2 string,

b string,

from_url string,

equipment_info string,

c string,

url_now string

) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' stored as TEXTFILE;

导入数据

load data local inpath '/opt/datas/xxx.log' into table bf_log;

数据大小

2.0 M  /user/hive/warehouse/bf_log_orc/moodle.ibeifeng.access.log

使用 ORC 存储

create table bf_log_orc(

user_ip string,

a string,

time string,

get2host string,

status1 string,

status2 string,

b string,

from_url string,

equipment_info string,

c string,

url_now string

) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' stored as orc;

导入数据

insert into table bf_log_orc select * from bf_log;

数据大小

62.5 K  /user/hive/warehouse/bf_log_orc/000000_0

使用 PARQUET

create table bf_log_parquet(

user_ip string,

a string,

time string,

get2host string,

status1 string,

status2 string,

b string,

from_url string,

equipment_info string,

c string,

url_now string

) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' stored as PARQUET;

导入数据

insert into table bf_log_parquet select * from bf_log;

数据大小

360.1 K  /user/hive/warehouse/bf_log_parquet/000000_0

三、进行数据文件的压缩(主要只支持三种，详情见底部图片)

1.查看支持的数据压缩格式

cen@hostname-ubuntu:/opt/softwares/hadoop-2.8.0$ bin/hadoop checknative

17/07/17 05:34:32 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native

17/07/17 05:34:32 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

Native library checking:

hadoop:  true /opt/softwares/hadoop-2.8.0/lib/native/libhadoop.so.1.0.0

zlib:    true /lib/x86_64-linux-gnu/libz.so.1

snappy:  true /usr/lib/x86_64-linux-gnu/libsnappy.so.1

lz4:     true revision:10301

bzip2:   true /lib/x86_64-linux-gnu/libbz2.so.1

openssl: true /usr/lib/x86_64-linux-gnu/libcrypto.so

2.使用 snappy 压缩

create table bf_log_orc_snappy(

user_ip string,

a string,

time string,

get2host string,

status1 string,

status2 string,

b string,

from_url string,

equipment_info string,

c string,

url_now string

) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' stored as PARQUET tblproperties ("orc.compress"="snappy");

导入数据

insert into table bf_log_orc_snappy select * from bf_log;

数据大小

360.1 K  /user/hive/warehouse/bf_log_orc_snappy/000000_0

#！！！什么？居然搞大了，为什么？

#因为原本默认使用的是 ZLIB (接近bzip2)压缩

图片来源

四、总结

最常用的解决方案：

数据格式：orc | parquet
压缩格式：sanppy