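Installing Hive

Unlike many tutorials that start with concepts, I prefer to get things installed first and then introduce the concepts through examples. So let's install Hive first.

First check whether the corresponding yum repository is already installed. If it is not, install the CDH yum repository as described in this tutorial: http://blog.csdn.net/nsrainbow/article/details/36629339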

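What is Hive

Hive gives you a way to query data with SQL. It is best not to use Hive for real-time queries, though: Hive works by translating an SQL statement into multiple MapReduce jobs, so it is very slow. The official documentation says Hive is suited to high-latency scenarios, and it is also resource-hungry.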
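
To make the "SQL becomes MapReduce" idea more concrete, here is a minimal Python sketch of how a query such as SELECT name, COUNT(*) ... GROUP BY name could be split into a map phase and a reduce phase. It is only an illustration under simplified assumptions: the function names (map_phase, shuffle, reduce_phase) and the sample rows are made up, and this is not Hive's actual implementation.

```python
from collections import defaultdict

def map_phase(rows):
    # Emit one (key, 1) pair per input row, keyed by the grouping column.
    for _id, name in rows:
        yield name, 1

def shuffle(pairs):
    # Group all values that share a key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Collapse each key's values into a single count.
    return {key: sum(values) for key, values in grouped.items()}

rows = [(1, "peter"), (2, "paul"), (3, "peter")]  # hypothetical sample data
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)
```

Even this tiny query needs a full pass over the data plus a grouping step, which hints at why Hive is a poor fit for real-time lookups.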

As a simple example, you can run a query like this:

hive> select * from h_employee;

OK

      peter

      paul

Time taken: 9.289 seconds, Fetched:  row(s)

Note that h_employee is not necessarily a database table.

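metastore

All tables created in Hive are called metastore tables. These tables do not actually store the data; they define the mapping between the real data and Hive, much like table meta information in a traditional database, hence the name metastore. When it comes to actual storage, there are four storage modes you can define:

Internal table (the default)
Partitioned table
Bucketed table
External table

For example, this is a statement that creates an internal table: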

CREATE TABLE worker(id INT, name STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

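This statement creates an internal table named worker. Internal is the default table type, so the storage mode does not need to be spelled out. Fields are stored with a comma as the delimiter ('\054' is the octal ASCII code for a comma).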
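
The delimiter clause can be sketched in a few lines of Python: the snippet below checks that octal 054 really is a comma and splits one stored line into the (id INT, name STRING) columns. The helper name parse_row and the sample line are assumptions made for illustration; Hive's own row parsing is more involved.

```python
DELIMITER = chr(0o54)  # '\054' is octal 54, i.e. decimal 44, the comma

def parse_row(line):
    # Split a stored text line into the (id INT, name STRING) columns.
    raw_id, name = line.rstrip("\n").split(DELIMITER)
    return int(raw_id), name

print(repr(DELIMITER))
print(parse_row("1,jack\n"))  # hypothetical sample line
```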

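Types supported by the CREATE TABLE statement

Primitive data types
tinyint / smallint / int / bigint
float / double
boolean
string

Complex data types
Array / Map / Struct

There is no datetime type. (Hive does provide TIMESTAMP, and newer releases also provide DATE.)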

Where do the created tables live?

Under /user/hive/warehouse. You can use hdfs to see where the created tables are located:

$ hdfs dfs -ls /user/hive/warehouse

Found  items

drwxrwxrwt   - root     supergroup           -- : /user/hive/warehouse/h_employee

drwxrwxrwt   - root     supergroup           -- : /user/hive/warehouse/h_employee2

drwxrwxrwt   - wlsuser  supergroup           -- : /user/hive/warehouse/h_employee_export

drwxrwxrwt   - root     supergroup           -- : /user/hive/warehouse/h_http_access_logs

drwxrwxrwt   - root     supergroup           -- : /user/hive/warehouse/hbase_apache_access_log

drwxrwxrwt   - username supergroup           -- : /user/hive/warehouse/hbase_table_1

drwxrwxrwt   - username supergroup           -- : /user/hive/warehouse/hbase_table_2

drwxrwxrwt   - username supergroup           -- : /user/hive/warehouse/hive_apache_accesslog

drwxrwxrwt   - root     supergroup           -- : /user/hive/warehouse/hive_employee

Each directory corresponds to one metastore table.

Using the different Hive table types

CREATE TABLE workers( id INT, name STRING)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';

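This statement creates an internal table named workers with a comma as the delimiter; \054 is the ASCII code for a comma, written in octal.
We can use show tables; to see which tables exist. Many Hive statements are modeled on MySQL, so when you don't know a statement, the MySQL version will usually work. The exception is limit, which is a bit odd; more on that later.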

hive> show tables;

OK

h_employee

h_employee2

h_employee_export

h_http_access_logs

hive_employee

workers

Time taken: 0.371 seconds, Fetched: 6 row(s)

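Once the table exists, let's try inserting some rows. Note that the Hive version used here does not support single-row inserts; data must be loaded in bulk, so don't expect a statement like insert into workers values (1,'jack') to work. (INSERT ... VALUES was only added in later Hive releases.) Hive supports two ways of inserting data:

Reading the data from a file
Reading the data from another table and inserting it (insert from select)

Here I load the data from a file. First create a file named workers.csv: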

$ cat workers.csv

,jack

,terry

,michael

Use LOAD DATA to import it into the Hive table:

hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;

Copying data from file:/home/alex/workers.csv

Copying file: file:/home/alex/workers.csv

Loading data to table default.workers

Table default.workers stats: [num_partitions: , num_files: , num_rows: , total_size: , raw_data_size: ]

OK

Time taken: 0.655 seconds

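Be careful not to leave out LOCAL. The difference between LOAD DATA LOCAL INPATH and LOAD DATA INPATH is that the former looks for the source file on your local disk, while the latter looks for it on HDFS. Adding OVERWRITE clears the table before importing, for example LOAD DATA LOCAL INPATH '/home/alex/workers.csv' OVERWRITE INTO TABLE workers;

Now query the data: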

hive> select * from workers;

OK

   jack

   terry

   michael

Time taken: 0.177 seconds, Fetched:  row(s)

Let's see how the imported data is stored inside the Hive internal table:

# hdfs dfs -ls /user/hive/warehouse/workers/

Found  items

-rwxrwxrwt    root supergroup          -- : /user/hive/warehouse/workers/workers.csv

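So the file is simply copied in as-is. It really is that crude! As an experiment, let's add another file, workers2.txt (I changed the extension on purpose; Hive does not look at the extension):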

# cat workers2.txt

,peter

,kate

,ted

Import it:

hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers;

Copying data from file:/home/alex/workers2.txt

Copying file: file:/home/alex/workers2.txt

Loading data to table default.workers

Table default.workers stats: [num_partitions: , num_files: , num_rows: , total_size: , raw_data_size: ]

OK

Time taken: 0.79 seconds

Look at the file layout again:

# hdfs dfs -ls /user/hive/warehouse/workers/

Found  items

-rwxrwxrwt    root supergroup          -- : /user/hive/warehouse/workers/workers.csv

-rwxrwxrwt    root supergroup          -- : /user/hive/warehouse/workers/workers2.txt

There is now an additional workers2.txt. Query again with SQL:

hive> select * from workers;

OK

   jack

   terry

   michael

   peter

   kate

   ted

Time taken: 0.144 seconds, Fetched:  row(s)
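
The behaviour seen so far (a table is a directory, loading copies a file into it, and a query reads every file regardless of extension) can be mimicked with a small Python sketch. Everything here is a stand-in: a temporary local directory plays the role of the warehouse, load_data and select_all are hypothetical helpers, and the ids in the sample files are invented.

```python
import shutil
import tempfile
from pathlib import Path

def load_data(src, table_dir, overwrite=False):
    # Mimic LOAD DATA LOCAL INPATH: copy the file into the table directory unchanged.
    if overwrite:
        for old in table_dir.iterdir():
            old.unlink()
    shutil.copy(src, table_dir / src.name)

def select_all(table_dir):
    # Mimic SELECT *: read every file in the directory, whatever its extension.
    rows = []
    for data_file in sorted(table_dir.iterdir()):
        for line in data_file.read_text().splitlines():
            raw_id, name = line.split(",")
            rows.append((int(raw_id), name))
    return rows

work = Path(tempfile.mkdtemp())
table_dir = work / "warehouse" / "workers"
table_dir.mkdir(parents=True)

first = work / "workers.csv"
first.write_text("1,jack\n2,terry\n")   # hypothetical ids
second = work / "workers2.txt"
second.write_text("3,peter\n")

load_data(first, table_dir)
load_data(second, table_dir)
rows = select_all(table_dir)
print(rows)
```

Passing overwrite=True to load_data empties the directory first, which is the sketch's counterpart of the OVERWRITE keyword.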

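Partitioned tables

Partitioned tables are used to speed up queries. Say you have a very large amount of data, but your use case is a daily report built on it. You can then partition by day, so that producing the report for 2014-05-05 only requires loading the data for that one day. Let's create a partitioned table and take a look: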

create table partition_employee(id int, name string)

partitioned by(daytime string)

row format delimited fields TERMINATED BY '\054';

As you can see, the partition attribute is not one of the table's columns. Let's first create two test data files, one for each of two days:

# cat --

,kitty

,lily

# cat --

,sami

,micky

Import them into the partitioned table:

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05');

Copying data from file:/home/alex/--

Copying file: file:/home/alex/--

Loading data to table default.partition_employee partition (daytime=--)

Partition default.partition_employee{daytime=--} stats: [num_files: , num_rows: , total_size: , raw_data_size: ]

Table default.partition_employee stats: [num_partitions: , num_files: , num_rows: , total_size: , raw_data_size: ]

OK

Time taken: 1.154 seconds

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06');

Copying data from file:/home/alex/--

Copying file: file:/home/alex/--

Loading data to table default.partition_employee partition (daytime=--)

Partition default.partition_employee{daytime=--} stats: [num_files: , num_rows: , total_size: , raw_data_size: ]

Table default.partition_employee stats: [num_partitions: , num_files: , num_rows: , total_size: , raw_data_size: ]

OK

Time taken: 0.763 seconds

On import, the partition is specified with partition.
On query, you select the partition by filtering on it:

hive> select * from partition_employee where daytime='2014-05-05';

OK

  kitty   --

  lily    --

Time taken: 0.173 seconds, Fetched:  row(s)

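My query uses no special syntax; Hive automatically checks whether the where clause contains the partition field. Comparison operators such as greater-than and less-than work as well: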

hive> select * from partition_employee where daytime>='2014-05-05';

OK

  kitty   --

  lily    --

  sami    --

  micky   2014-05-06

Time taken: 0.273 seconds, Fetched:  row(s)

Let's look at the storage layout:

# hdfs dfs -ls /user/hive/warehouse/partition_employee

Found  items

drwxrwxrwt   - root supergroup           -- : /user/hive/warehouse/partition_employee/daytime=--

drwxrwxrwt   - root supergroup           -- : /user/hive/warehouse/partition_employee/daytime=--
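
Why partitioning speeds things up can be shown with a rough Python sketch: each partition value becomes a daytime=<value> subdirectory, and a query that filters on the partition column only opens the matching directories. The helpers load_partition and select_where, the file name data, and the ids are all assumptions for illustration; the real pruning happens inside Hive's query planner.

```python
import tempfile
from pathlib import Path

def load_partition(table_dir, daytime, lines):
    # Each partition value gets its own daytime=<value> subdirectory.
    part_dir = table_dir / f"daytime={daytime}"
    part_dir.mkdir(parents=True, exist_ok=True)
    (part_dir / "data").write_text("\n".join(lines) + "\n")

def select_where(table_dir, predicate):
    # Only directories whose partition value passes the predicate are read;
    # the partition value is appended to each row as an extra column.
    rows, scanned = [], []
    for part_dir in sorted(table_dir.iterdir()):
        daytime = part_dir.name.split("=", 1)[1]
        if not predicate(daytime):
            continue  # pruned: files in this directory are never opened
        scanned.append(part_dir.name)
        for data_file in sorted(part_dir.iterdir()):
            for line in data_file.read_text().splitlines():
                raw_id, name = line.split(",")
                rows.append((int(raw_id), name, daytime))
    return rows, scanned

table_dir = Path(tempfile.mkdtemp()) / "partition_employee"
load_partition(table_dir, "2014-05-05", ["1,kitty", "2,lily"])   # ids are made up
load_partition(table_dir, "2014-05-06", ["3,sami", "4,micky"])

rows, scanned = select_where(table_dir, lambda d: d == "2014-05-05")
print(scanned)
print(rows)
```

Because the partition value lives in the directory name rather than in the data files, it shows up as an extra column in the result without being stored in any row.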

Let's try a two-dimensional partitioned table:

create table p_student(id int, name string)

partitioned by(daytime string,country string)

row format delimited fields TERMINATED BY '\054';

Insert some data:

# cat ---CN

,tammy

,eric

# cat ---CN

,paul

,jolly

# cat ---EN

,ivan

,billy

Import into Hive:

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN');

Copying data from file:/home/alex/---CN

Copying file: file:/home/alex/---CN

Loading data to table default.p_student partition (daytime=--, country=CN)

Partition default.p_student{daytime=--, country=CN} stats: [num_files: , num_rows: , total_size: , raw_data_size: ]

Table default.p_student stats: [num_partitions: , num_files: , num_rows: , total_size: , raw_data_size: ]

OK

Time taken: 0.736 seconds

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN');

Copying data from file:/home/alex/---CN

Copying file: file:/home/alex/---CN

Loading data to table default.p_student partition (daytime=--, country=CN)

Partition default.p_student{daytime=--, country=CN} stats: [num_files: , num_rows: , total_size: , raw_data_size: ]

Table default.p_student stats: [num_partitions: , num_files: , num_rows: , total_size: , raw_data_size: ]

OK

Time taken: 0.691 seconds

hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN');

Copying data from file:/home/alex/---EN

Copying file: file:/home/alex/---EN

Loading data to table default.p_student partition (daytime=--, country=EN)

Partition default.p_student{daytime=--, country=EN} stats: [num_files: , num_rows: , total_size: , raw_data_size: ]

Table default.p_student stats: [num_partitions: , num_files: , num_rows: , total_size: , raw_data_size: ]

OK

Time taken: 0.622 seconds

Look at the storage layout:

# hdfs dfs -ls /user/hive/warehouse/p_student

Found  items

drwxr-xr-x   - root supergroup           -- : /user/hive/warehouse/p_student/daytime=--

drwxr-xr-x   - root supergroup           -- : /user/hive/warehouse/p_student/daytime=--

# hdfs dfs -ls /user/hive/warehouse/p_student/daytime=--

Found  items

drwxr-xr-x   - root supergroup           -- : /user/hive/warehouse/p_student/daytime=--/country=CN

Query the data:

hive> select * from p_student;

OK

   tammy   --  CN

   eric    --  CN

   paul    --  CN

   jolly   --  CN

  ivan    --  EN

  billy   --  EN

Time taken: 0.228 seconds, Fetched:  row(s)

hive> select * from p_student where daytime='2014-09-10' and country='EN';

OK

  ivan    --  EN

  billy   --  EN

Time taken: 0.224 seconds, Fetched:  row(s)
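
With two partition columns the directories simply nest, one key=value level per column, in the order given in partitioned by. The following Python sketch builds such a path and recovers the partition values from it; partition_path and partition_values are hypothetical helper names, not Hive functions.

```python
from pathlib import PurePosixPath

def partition_path(table_dir, spec):
    # Build the nested directory for an ordered partition spec,
    # one key=value level per partition column.
    path = PurePosixPath(table_dir)
    for key, value in spec:
        path = path / f"{key}={value}"
    return str(path)

def partition_values(path, table_dir):
    # Recover the partition columns from the directory names alone.
    relative = PurePosixPath(path).relative_to(table_dir)
    return dict(part.split("=", 1) for part in relative.parts)

table_dir = "/user/hive/warehouse/p_student"
path = partition_path(table_dir, [("daytime", "2014-09-10"), ("country", "EN")])
print(path)
print(partition_values(path, table_dir))
```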

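Bucketed tables

A bucketed table distributes rows into different "buckets" based on the hash of a column. People in the West have a habit of sorting things by setting out a few buckets with different labels on them, which is where the vivid name comes from. Bucketed tables are intended specifically for sampling.
The example below is taken from the official tutorial. Partitioning and bucketing can be used together, so it uses both features at once: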

CREATE TABLE b_student(id INT, name STRING)

PARTITIONED BY(dt STRING, country STRING)

CLUSTERED BY(id) SORTED BY(name) INTO  BUCKETS

row format delimited

    fields TERMINATED BY '\054';

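This means the hash is computed on id, and the rows within each bucket are stored sorted by name. I won't repeat how the data is prepared and imported; this is the data after import: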

hive> select * from b_student;

OK

   tammy   --  CN

   eric    --  CN

   paul    --  CN

   jolly   --  CN

  allen   --  EN

Time taken: 0.727 seconds, Fetched:  row(s)

Sample one bucket's worth of data out of 4 buckets:

hive> select * from b_student tablesample(bucket  out of  on id);

Total MapReduce jobs =

Launching Job  out of

Number of reduce tasks is set to  since there's no reduce operator

Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/

Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1406097234796_0041

Hadoop job information for Stage-: number of mappers: ; number of reducers:

-- ::, Stage- map = %,  reduce = %

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 2.9 sec

-- ::, Stage- map = %,  reduce = %, Cumulative CPU 2.9 sec

MapReduce Total cumulative CPU time:  seconds  msec

Ended Job = job_1406097234796_0041

MapReduce Jobs Launched:

Job : Map:    Cumulative CPU: 2.9 sec   HDFS Read:  HDFS Write:  SUCCESS

Total MapReduce CPU Time Spent:  seconds  msec

OK

   jolly   --  CN
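
How a row ends up in a bucket, and what TABLESAMPLE(BUCKET x OUT OF y ON id) selects, can be approximated in Python. This sketch rests on two loudly stated assumptions: that an INT column hashes to its own value, so the bucket is just id modulo the bucket count (newer Hive versions may use a different hash function), and that the ids below, which are invented, resemble the real data. bucket_of and tablesample are hypothetical helpers.

```python
NUM_BUCKETS = 4  # assumed here, following the "out of 4 buckets" sampling in the text

def bucket_of(row_id, num_buckets=NUM_BUCKETS):
    # Assumption: an INT column hashes to its own value, so the bucket index
    # is simply the id modulo the number of buckets (0-based).
    return row_id % num_buckets

def tablesample(rows, bucket, out_of):
    # TABLESAMPLE(BUCKET x OUT OF y ON id): keep rows that land in bucket x,
    # where x is counted from 1.
    return [row for row in rows if bucket_of(row[0], out_of) == bucket - 1]

rows = [(1, "tammy"), (2, "eric"), (3, "paul"), (4, "jolly"), (5, "allen")]  # ids are made up
print(tablesample(rows, bucket=1, out_of=4))
```

Sampling one bucket this way touches roughly a quarter of the rows when the ids are evenly spread, which is the reason bucketing suits sampling.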

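External tables

An external table is one whose storage is not managed by Hive. It can rely on HBase for storage, for instance, with Hive providing only a mapping. I'll use HBase as the example.
First create an HBase table named employee: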

hbase(main)::> create 'employee','info'

 row(s) in 0.4740 seconds  

=> Hbase::Table - employee

hbase(main)::> put 'employee',,'info:id',

 row(s) in 0.2080 seconds  

hbase(main)::> scan 'employee'

ROW                                      COLUMN+CELL

                                        column=info:id, timestamp=, value=

 row(s) in 0.0610 seconds  

hbase(main)::> put 'employee',,'info:name','peter'

 row(s) in 0.0220 seconds  

hbase(main)::> scan 'employee'

ROW                                      COLUMN+CELL

                                        column=info:id, timestamp=, value=

                                        column=info:name, timestamp=, value=peter

 row(s) in 0.0450 seconds  

hbase(main)::> put 'employee',,'info:id',

 row(s) in 0.0370 seconds  

hbase(main)::> put 'employee',,'info:name','paul'

 row(s) in 0.0180 seconds  

hbase(main)::> scan 'employee'

ROW                                      COLUMN+CELL

                                        column=info:id, timestamp=, value=

                                        column=info:name, timestamp=, value=peter

                                        column=info:id, timestamp=, value=

                                        column=info:name, timestamp=, value=paul

 row(s) in 0.0440 seconds

Create the external table that maps onto it:

hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string)

    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'

    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name")

    > TBLPROPERTIES ("hbase.table.name" = "employee");

OK

Time taken: 0.324 seconds

hive> select * from h_employee;

OK

      peter

      paul

Time taken: 1.129 seconds, Fetched:  row(s)
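
The hbase.columns.mapping string pairs Hive columns with HBase sources purely by position. A small Python sketch of that pairing is shown here; parse_mapping and to_hive_row are hypothetical helpers, the row key and cell values are invented, and values are left as strings rather than converted to Hive types.

```python
def parse_mapping(mapping, hive_columns):
    # Pair each Hive column with its HBase source, in declaration order.
    sources = [part.strip() for part in mapping.split(",")]
    if len(sources) != len(hive_columns):
        raise ValueError("mapping and column list differ in length")
    return list(zip(hive_columns, sources))

def to_hive_row(row_key, cells, pairs):
    # ":key" means the HBase row key; everything else is a family:qualifier cell.
    return tuple(row_key if src == ":key" else cells.get(src) for _col, src in pairs)

pairs = parse_mapping(":key, info:id,info:name", ["key", "id", "name"])
# Hypothetical HBase row: the row key and cell values are invented for illustration.
row = to_hive_row("1", {"info:id": "1", "info:name": "peter"}, pairs)
print(pairs)
print(row)
```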

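Query syntax

For the full syntax, see the official manual at https://cwiki.apache.org/confluence/display/Hive/Tutorial. I will only cover a few of the odd points.

Limiting the number of rows

To show x rows you still use limit, for example: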

hive> select * from h_employee limit

    > ;

OK

      peter

Time taken: 0.284 seconds, Fetched:  row(s)

However, a starting position such as offset is not supported in the Hive version used here.
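
A common way around a missing offset is to number the rows and filter on that number, for instance with a window function such as ROW_NUMBER() if your Hive version provides one. The Python sketch below only illustrates the idea; limit and limit_with_offset are hypothetical helpers and the result set is made up.

```python
from itertools import islice

def limit(rows, n):
    # LIMIT n: stop after the first n rows.
    return list(islice(rows, n))

def limit_with_offset(rows, offset, n):
    # Workaround idea: number the rows, then keep only those whose row number
    # falls in (offset, offset + n].
    numbered = enumerate(rows, start=1)
    return [row for number, row in numbered if offset < number <= offset + n]

rows = ["peter", "paul", "kitty", "lily"]  # hypothetical result set
print(limit(rows, 1))
print(limit_with_offset(rows, 1, 2))
```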

(Reposted from: http://www.2cto.com/database/201412/359250.html)
