1. Generate the test data
Download the dbgen tool from the TPC-H website (http://www.tpc.org/tpch/) and use it to generate the data: http://www.tpc.org/tpch/spec/tpch_2_17_0.zip

[root@ip---- tpch]# wget http://www.tpc.org/tpch/spec/tpch_2_17_0.zip

Unzip it, change into the dbgen directory, copy makefile.suite to makefile, and make the following changes:

[root@ip---- tpch]# yum install unzip
[root@ip---- tpch]# unzip tpch_2_17_0.zip 
[root@ip---- tpch]# ls
__MACOSX tpch_2_17_0 tpch_2_17_0.zip
[root@ip---- tpch]# cd tpch_2_17_0
[root@ip---- tpch_2_17_0]# ls
dbgen dev-tools ref_data
[root@ip---- tpch_2_17_0]# cd dbgen/
[root@ip---- dbgen]# ls
BUGS README bcd2.h check_answers dbgen.dsp dss.ddl dsstypes.h permute.c qgen.c reference rnd.h shared.h text.c tpch.sln variants
HISTORY answers bm_utils.c column_split.sh dists.dss dss.h load_stub.c permute.h qgen.vcproj release.h rng64.c speed_seed.c tpcd.h tpch.vcproj varsub.c
PORTING.NOTES bcd2.c build.c config.h driver.c dss.ri makefile.suite print.c queries rnd.c rng64.h tests tpch.dsw update_release.sh
[root@ip---- dbgen]# cp makefile.suite makefile

[root@ip-172-31-10-151 dbgen]# vi makefile

################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
# SQLSERVER, SYBASE, ORACLE, VECTORWISE
# Current values for MACHINE are: ATT, DOS, HP, IBM, ICL, MVS,
# SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are: TPCH
DATABASE= ORACLE
MACHINE = LINUX
WORKLOAD = TPCH
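The same three edits can be scripted instead of done in vi. A minimal sketch: the heredoc creates a stand-in makefile.suite so the snippet runs anywhere; in the real dbgen directory you would skip it and use the shipped file.

```shell
# Sketch: apply the CC/DATABASE/MACHINE/WORKLOAD edits non-interactively.
# The heredoc is only a stand-in for the real makefile.suite.
cat > makefile.suite <<'EOF'
CC =
DATABASE =
MACHINE =
WORKLOAD =
EOF
cp makefile.suite makefile
sed -i -e 's/^CC *=.*/CC = gcc/' \
       -e 's/^DATABASE *=.*/DATABASE = ORACLE/' \
       -e 's/^MACHINE *=.*/MACHINE = LINUX/' \
       -e 's/^WORKLOAD *=.*/WORKLOAD = TPCH/' makefile
grep -E '^(CC|DATABASE|MACHINE|WORKLOAD)' makefile
```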

Compile the code:

make

After compilation a dbgen binary is produced in the current directory.

Run ./dbgen -help to see how to use it:

jfp4-:/mnt/disk1/tpch_2_17_0/dbgen # ./dbgen -help
TPC-H Population Generator (Version 2.17. build )
Copyright Transaction Processing Performance Council -
USAGE:
dbgen [-{vf}][-T {pcsoPSOL}]
[-s <scale>][-C <procs>][-S <step>]
dbgen [-v] [-O m] [-s <scale>] [-U <updates>]
Basic Options
===========================
-C <n> -- separate data set into <n> chunks (requires -S, default: )
-f -- force. Overwrite existing files
-h -- display this message
-q -- enable QUIET mode
-s <n> -- set Scale Factor (SF) to <n> (default: )
-S <n> -- build the <n>th step of the data/update set (used with -C or -U)
-U <n> -- generate <n> update sets
-v -- enable VERBOSE mode
Advanced Options
===========================
-b <s> -- load distributions for <s> (default: dists.dss)
-d <n> -- split deletes between <n> files (requires -U)
-i <n> -- split inserts between <n> files (requires -U)
-T c -- generate customers ONLY
-T l -- generate nation/region ONLY
-T L -- generate lineitem ONLY
-T n -- generate nation ONLY
-T o -- generate orders/lineitem ONLY
-T O -- generate orders ONLY
-T p -- generate parts/partsupp ONLY
-T P -- generate parts ONLY
-T r -- generate region ONLY
-T s -- generate suppliers ONLY
-T S -- generate partsupp ONLY
To generate the SF= (1GB), validation database population, use:
dbgen -vf -s
To generate updates for a SF= (1GB), use:
dbgen -v -U -s

Run ./dbgen -s 1024 to generate 1 TB of data (scale factor 1024 ≈ 1 TB):
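At large scale factors, the -C/-S options shown in the help above let the generation be split into chunks and run in parallel. A dry-run sketch that only prints the commands (CHUNKS=8 is an arbitrary example value):

```shell
# Dry run: print one dbgen command per chunk. Append ' &' to each and run
# them from the dbgen directory to generate the chunks in parallel.
CHUNKS=8
for step in $(seq 1 "$CHUNKS"); do
  echo "./dbgen -f -s 1024 -C $CHUNKS -S $step"
done
```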

jfp4-:/mnt/disk1/tpch_2_17_0/dbgen # ll *.tbl
-rw-r--r-- root root Jul : customer.tbl
-rw-r--r-- root root Jul : lineitem.tbl
-rw-r--r-- root root Jul : nation.tbl
-rw-r--r-- root root Jul : orders.tbl
-rw-r--r-- root root Jul : part.tbl
-rw-r--r-- root root Jul : partsupp.tbl
-rw-r--r-- root root Jul : region.tbl
-rw-r--r-- root root Jul : supplier.tbl

Move the generated data into a separate directory:

mkdir ../data1024g

mv *.tbl ../data1024g
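To load the .tbl files into the per-table HDFS directories used below (/shaochen/tpch/<table>), a loop like the following can help. This is a dry run that only echoes the hdfs commands (drop the echo to execute), with a few stand-in files created so it runs as-is:

```shell
# Dry run: print the hdfs commands that would upload each table file into
# its own HDFS directory. The touched .tbl files are stand-ins.
mkdir -p data1024g
touch data1024g/customer.tbl data1024g/lineitem.tbl data1024g/nation.tbl
for f in data1024g/*.tbl; do
  table=$(basename "$f" .tbl)
  echo hdfs dfs -mkdir -p "/shaochen/tpch/$table"
  echo hdfs dfs -put "$f" "/shaochen/tpch/$table/"
done
```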

2. Download the Impala version of the TPC-H scripts

Create the raw lineitem table as an external table over the text files (total size: 776 GB):

jfp4-:/mnt/disk1/tpch_2_17_0/dbgen # hdfs dfs -du /shaochen/tpch
/shaochen/tpch/customer
/shaochen/tpch/lineitem
/shaochen/tpch/nation
/shaochen/tpch/orders
/shaochen/tpch/part
/shaochen/tpch/partsupp
/shaochen/tpch/region
/shaochen/tpch/supplier
Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'  LOCATION '/shaochen/tpch/lineitem';

Count the rows in the raw text table:

[jfp4-:] > select count(*) from lineitem;
Query: select count(*) from lineitem
+------------+
| count(*) |
+------------+
| |
+------------+
Returned row(s) in .47s

While this ran, the average cluster disk I/O was close to 1 GB/s. The raw data is 776 GB and the operation is I/O-bound, so the estimated run time is 776 GB / 1 GB/s ≈ 800 s, which matched what we observed.

Save the lineitem table in parquet format:

[jfp4-:] > insert overwrite lineitem_parquet select * from lineitem;
Query: insert overwrite lineitem_parquet select * from lineitem
Inserted rows in .52s

This statement involves conversion to parquet plus Snappy compression, so it is a mixed workload (I/O-bound and CPU-bound). The average cluster disk read rate was about 210 MB/s, giving an estimate of roughly 776 GB / 0.2 GB/s ≈ 3800 s, which matched expectations.
Given a write rate of about 140 MB/s, the parquet data written is roughly 3800 s × 0.14 GB/s ≈ 532 GB; dividing by the HDFS replication factor of 3 gives about 180 GB.
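The back-of-envelope arithmetic above can be checked with awk (rates rounded to 0.2 GB/s read and 0.14 GB/s write, as in the text):

```shell
# Verify the estimate: scan time for 776 GB at ~0.2 GB/s, total GB written
# at ~0.14 GB/s, and the per-copy size after dividing by replication factor 3.
awk 'BEGIN {
  secs     = 776 / 0.2      # seconds of scanning
  written  = secs * 0.14    # GB written cluster-wide, all replicas
  per_copy = written / 3    # HDFS replication factor 3
  printf "%.0f s, %.0f GB written, %.0f GB per copy\n", secs, written, per_copy
}'
# → 3880 s, 543 GB written, 181 GB per copy
```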

jfp4-:/mnt/disk1/tpch_2_17_0/dbgen # hdfs dfs -du -h /user/hive/warehouse/tpch.db
200.9 G /user/hive/warehouse/tpch.db/lineitem_parquet
/user/hive/warehouse/tpch.db/q1_pricing_summary_report

The actual parquet size is about 200 GB, in line with the estimate.

Count the rows again, this time on the parquet table:

[jfp4-:] > select count(*) from lineitem_parquet;
Query: select count(*) from lineitem_parquet
+------------+
| count(*) |
+------------+
| |
+------------+
Returned row(s) in .04s

Run Q1 against the text-format table:

[jfp4-:] > -- the query
> INSERT OVERWRITE TABLE q1_pricing_summary_report
> SELECT
> L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int)
> FROM
> lineitem
> WHERE
> L_SHIPDATE<='1998-09-02'
> GROUP BY L_RETURNFLAG, L_LINESTATUS
> ORDER BY L_RETURNFLAG, L_LINESTATUS
> LIMIT ;
Query: INSERT OVERWRITE TABLE q1_pricing_summary_report SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int) FROM lineitem WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS LIMIT
^C[jfp4-:] > INSERT OVERWRITE TABLE q1_pricing_summary_report
> SELECT
> L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int)
> FROM
> lineitem
> WHERE
> L_SHIPDATE<='1998-09-02'
> GROUP BY L_RETURNFLAG, L_LINESTATUS
> ORDER BY L_RETURNFLAG, L_LINESTATUS
> LIMIT ;
Query: insert OVERWRITE TABLE q1_pricing_summary_report SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int) FROM lineitem WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS LIMIT
Inserted rows in .57s

Examine the query plan:

[jfp4-:] > explain INSERT OVERWRITE TABLE q1_pricing_summary_report
> SELECT
> L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int)
> FROM
> lineitem
> WHERE
> L_SHIPDATE<='1998-09-02'
> GROUP BY L_RETURNFLAG, L_LINESTATUS
> ORDER BY L_RETURNFLAG, L_LINESTATUS
> LIMIT ;
Query: explain INSERT OVERWRITE TABLE q1_pricing_summary_report SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)), SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)), AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT), cast(COUNT(1) as int) FROM lineitem WHERE L_SHIPDATE<='1998-09-02' GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS LIMIT
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Explain String |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=.13GB VCores= |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| tpch.lineitem |
| |
| WRITE TO HDFS [tpch.q1_pricing_summary_report, OVERWRITE=true] |
| | partitions= |
| | |
| :TOP-N [LIMIT=] |
| | order by: L_RETURNFLAG ASC, L_LINESTATUS ASC |
| | |
| :EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| :TOP-N [LIMIT=] |
| | order by: L_RETURNFLAG ASC, L_LINESTATUS ASC |
| | |
| :AGGREGATE [MERGE FINALIZE] |
| | output: sum(sum(L_QUANTITY)), sum(sum(L_EXTENDEDPRICE)), sum(sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT))), sum(sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT) * (1.0 + L_TAX))), sum(count(L_QUANTITY)), sum(count(L_EXTENDEDPRICE)), sum(sum(L_DISCOUNT)), sum(count(L_DISCOUNT)), sum(count()) |
| | group by: L_RETURNFLAG, L_LINESTATUS |
| | |
| :EXCHANGE [PARTITION=HASH(L_RETURNFLAG,L_LINESTATUS)] |
| | |
| :AGGREGATE |
| | output: sum(L_QUANTITY), sum(L_EXTENDEDPRICE), sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT)), sum(L_EXTENDEDPRICE * (1.0 - L_DISCOUNT) * (1.0 + L_TAX)), count(L_QUANTITY), count(L_EXTENDEDPRICE), sum(L_DISCOUNT), count(L_DISCOUNT), count() |
| | group by: L_RETURNFLAG, L_LINESTATUS |
| | |
| :SCAN HDFS [tpch.lineitem] |
| partitions=/ size=.30GB |
| predicates: L_SHIPDATE <= '1998-09-02' |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Returned row(s) in .15s

Compute statistics for the table:

[jfp4-:] > compute stats lineitem;
Query: compute stats lineitem
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated partition(s) and column(s). |
+------------------------------------------+
Returned row(s) in .34s

The result shows that compute stats is surprisingly time-consuming! During the first 15 minutes, disk I/O was very high, around 900 MB/s, with essentially every disk in the cluster reading at full throughput; after that, I/O stayed around 130 MB/s. Clearly compute stats is an expensive operation.

Compute stats on the parquet table as well:

[jfp4-1:21000] > compute stats lineitem_parquet;
Query: compute stats lineitem_parquet
Query aborted.
[jfp4-1:21000] > SET
> NUM_SCANNER_THREADS=2
> ;
NUM_SCANNER_THREADS set to 2
[jfp4-1:21000] > compute stats lineitem_parquet;
Query: compute stats lineitem_parquet
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 1 partition(s) and 16 column(s). |
+------------------------------------------+
Returned 1 row(s) in 5176.29s
[jfp4-1:21000] >

Note that NUM_SCANNER_THREADS must be set for the statement to succeed.

Examine the effect of snappy compression on parquet table size and query efficiency:

[jfp4-:] > set PARQUET_COMPRESSION_CODEC=snappy;
PARQUET_COMPRESSION_CODEC set to snappy
[jfp4-:] > create table lineitem_parquet_snappy (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) STORED AS PARQUET;
Query: create table lineitem_parquet_snappy (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) STORED AS PARQUET
Returned row(s) in .30s

[jfp4-1:21000] > insert overwrite lineitem_parquet_snappy select * from lineitem;
Query: insert overwrite lineitem_parquet_snappy select * from lineitem
Inserted 6144008876 rows in 3836.99s

Check the size of the snappy table:

jfp4-:~ # hdfs dfs -du -h /user/hive/warehouse/tpch.db
200.9 G /user/hive/warehouse/tpch.db/lineitem_parquet
200.9 G /user/hive/warehouse/tpch.db/lineitem_parquet_snappy
/user/hive/warehouse/tpch.db/q1_pricing_summary_report

lineitem_parquet_snappy and lineitem_parquet have exactly the same size, so by default Impala's parquet tables are already snappy-compressed. For comparison, the data was also written into an uncompressed table, lineitem_parquet_raw (presumably created after setting PARQUET_COMPRESSION_CODEC=none):

[jfp4-:] > insert overwrite lineitem_parquet_raw select * from lineitem;
Query: insert overwrite lineitem_parquet_raw select * from lineitem
Inserted rows in .22s

snappy + parquet still saves some time over uncompressed parquet when writing data!

Check the size of the uncompressed (raw) parquet table:

jfp4-:~ # hdfs dfs -du -h /user/hive/warehouse/tpch.db
200.9 G /user/hive/warehouse/tpch.db/lineitem_parquet
319.2 G /user/hive/warehouse/tpch.db/lineitem_parquet_raw
200.9 G /user/hive/warehouse/tpch.db/lineitem_parquet_snappy
/user/hive/warehouse/tpch.db/q1_pricing_summary_report

Now check the effect of gzip + parquet:

[jfp4-:] > set PARQUET_COMPRESSION_CODEC=gzip;
PARQUET_COMPRESSION_CODEC set to gzip
[jfp4-:] > create table lineitem_parquet_gzip (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) STORED AS PARQUET;
Query: create table lineitem_parquet_gzip (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING) STORED AS PARQUET Returned row(s) in .26s
[jfp4-:] > insert overwrite lineitem_parquet_gzip select * from lineitem;
Query: insert overwrite lineitem_parquet_gzip select * from lineitem
Inserted rows in .71s
jfp4-:~ # hdfs dfs -du -h /user/hive/warehouse/tpch.db
200.9 G /user/hive/warehouse/tpch.db/lineitem_parquet
155.1 G /user/hive/warehouse/tpch.db/lineitem_parquet_gzip
319.2 G /user/hive/warehouse/tpch.db/lineitem_parquet_raw
200.9 G /user/hive/warehouse/tpch.db/lineitem_parquet_snappy
/user/hive/warehouse/tpch.db/q1_pricing_summary_report
[jfp4-:] > select count(*) from lineitem_parquet_gzip;
Query: select count(*) from lineitem_parquet_gzip
+------------+
| count(*) |
+------------+
| |
+------------+
Returned row(s) in .54s
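Putting the measured sizes together, the compression ratios relative to the 776 GB text source work out as follows (a quick awk check using the du figures above):

```shell
# Compression ratio of each parquet variant versus the 776 GB raw text data.
awk 'BEGIN {
  text = 776.0
  printf "parquet (uncompressed): %.1fx\n", text / 319.2
  printf "parquet + snappy      : %.1fx\n", text / 200.9
  printf "parquet + gzip        : %.1fx\n", text / 155.1
}'
# → parquet (uncompressed): 2.4x
# → parquet + snappy      : 3.9x
# → parquet + gzip        : 5.0x
```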
