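Hive Table Partitions

A partition of a Hive table is simply a directory under the table's directory. A partition column must not repeat the name of any regular column of the table.

Create a partitioned table: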

create table tb_partition(id string, name string)

PARTITIONED BY (month string)

row format delimited fields terminated by '\t';

Loading data into a partitioned Hive table

Method 1: load a local file with load data

load data local inpath '/home/hadoop/files/nameinfo.txt' overwrite into table tb_partition partition(month='');

Method 2: insert ... select

insert overwrite table tb_partition partition(month='201707') select id, name from name;

hive> insert into table tb_partition partition(month='201707') select id, name from name;

Query ID = hadoop_20170918222525_7d074ba1-bff9-44fc-a664-508275175849

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Method 3: upload the file into the partition directory by hand

hdfs dfs -mkdir /user/hive/warehouse/tb_partition/month=201710
hdfs dfs -put nameinfo.txt /user/hive/warehouse/tb_partition/month=201710

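With method 3 the file sits in the partition directory, but a query on the table does not return its rows, because the metastore does not know about the new partition yet. The metadata has to be updated first.

Two ways to update the metadata:

Method 1: msck repair table table_name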

hive> msck repair table tb_partition;

OK

Partitions not in metastore:    tb_partition:month=201710

Repair: Added partition to metastore tb_partition:month=201710

Time taken: 0.265 seconds, Fetched: 2 row(s)
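
Conceptually, msck repair table compares the partition directories found under the table's location with the partitions registered in the metastore, and registers the ones that are missing. The Python sketch below only models that comparison with plain sets. The function name and its inputs are hypothetical, and nothing here calls Hive or HDFS.

```python
def partitions_missing_from_metastore(hdfs_dirs, metastore_partitions):
    """Return partition specs that exist as directories but are not registered.

    hdfs_dirs: directory names found under the table location (assumed input).
    metastore_partitions: partition specs the metastore already knows about.
    """
    # Only key=value directories count as partitions; hidden staging dirs are skipped.
    on_disk = {d for d in hdfs_dirs if "=" in d and not d.startswith(".")}
    return sorted(on_disk - set(metastore_partitions))


# Illustrative state: month=201710 was uploaded by hand and is not registered yet.
dirs = ["month=201709", "month=201710"]
registered = ["month=201709"]

missing = partitions_missing_from_metastore(dirs, registered)
print(missing)
```

In this model the only unregistered directory is month=201710, which is the partition the msck repair output above reports as added.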

Method 2: alter table tb_partition add partition(month='201708');

hive> alter table tb_partition add partition(month='201708');

OK

Time taken: 0.126 seconds

Query the table:

hive> select * from tb_partition;

OK

1       Lily    201708

2       Andy    201708

3       Tom     201708

1       Lily    201709

2       Andy    201709

3       Tom     201709

1       Lily    201710

2       Andy    201710

3       Tom     201710

Time taken: 0.161 seconds, Fetched: 9 row(s)

List the partitions: show partitions table_name

hive> show partitions tb_partition;

OK

month=201708

month=201709

month=201710

Time taken: 0.154 seconds, Fetched: 3 row(s)

Check the file layout in HDFS:

[hadoop@node11 files]$ hdfs dfs -ls /user/hive/warehouse/tb_partition/

17/09/18 22:33:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Found 4 items

drwxr-xr-x   - hadoop supergroup          0 2017-09-18 22:25 /user/hive/warehouse/tb_partition/month=201707

drwxr-xr-x   - hadoop supergroup          0 2017-09-18 22:15 /user/hive/warehouse/tb_partition/month=201708

drwxr-xr-x   - hadoop supergroup          0 2017-09-18 05:55 /user/hive/warehouse/tb_partition/month=201709

drwxr-xr-x   - hadoop supergroup          0 2017-09-18 22:03 /user/hive/warehouse/tb_partition/month=201710

Creating multi-level partitions

create table tb_mul_partition(id string, name string)

PARTITIONED BY (month string, code string)

row format delimited fields terminated by '\t';

Load data:

load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(month='201709',code='10000');

load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(month='201710',code='10000');

Query the data:

hive> select * from tb_mul_partition where code='10000';

OK

1       Lily    201709  10000

2       Andy    201709  10000

3       Tom     201709  10000

1       Lily    201710  10000

2       Andy    201710  10000

3       Tom     201710  10000

Time taken: 0.208 seconds, Fetched: 6 row(s)

Test what happens when only one of the partition columns is specified:

hive> load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(month='201708');

FAILED: SemanticException [Error 10006]: Line 1:95 Partition not found ''201708''

hive> load data local inpath '/home/hadoop/files/nameinfo.txt' into table tb_mul_partition partition(code='20000');

FAILED: SemanticException [Error 10006]: Line 1:95 Partition not found ''20000''

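Because the table is partitioned on two levels, a load has to specify both partition columns; naming only one of them fails.

The storage layout in HDFS: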

[hadoop@node11 files]$ hdfs dfs -ls /user/hive/warehouse/tb_mul_partition/month=201710

drwxr-xr-x   - hadoop supergroup          0 2017-09-18 22:36 /user/hive/warehouse/tb_mul_partition/month=201710/code=10000
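
The listing shows how a two-level partition maps to nested key=value directories, in the order the partition columns were declared. The following Python sketch models that mapping and the rejection of a partial partition spec. The function partition_path is hypothetical, and the ValueError only stands in for Hive's SemanticException.

```python
def partition_path(table_root, partition_cols, spec):
    """Build the directory for a fully specified partition.

    partition_cols: partition columns in declaration order.
    spec: mapping of partition column -> value.
    """
    missing = [c for c in partition_cols if c not in spec]
    if missing:
        # Models the failed loads above: a partial spec is rejected.
        raise ValueError(f"partition columns not specified: {missing}")
    parts = [f"{col}={spec[col]}" for col in partition_cols]
    return "/".join([table_root.rstrip("/")] + parts)


root = "/user/hive/warehouse/tb_mul_partition"
cols = ["month", "code"]

full = partition_path(root, cols, {"month": "201710", "code": "10000"})
print(full)

try:
    partition_path(root, cols, {"month": "201708"})
except ValueError as err:
    print("rejected:", err)
```

The first call yields the same month=201710/code=10000 directory that appears in the HDFS listing.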

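Dynamic partitions

Recall the earlier insert into a partition:

insert overwrite table tb_partition partition(month='201707') select id, name from name;

That statement has to name the target partition ('201707') explicitly. With dynamic partitioning, Hive takes the partition value of each row from the query result instead.

Create a new table: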

hive> create table tb_copy_partition like tb_partition;

OK

Time taken: 0.118 seconds

Check the table structure:

hive> desc tb_copy_partition;

OK

id                      string

name                    string

month                   string                                      

# Partition Information

# col_name              data_type               comment             

month                   string

Time taken: 0.127 seconds, Fetched: 8 row(s)

Next, insert data into tb_copy_partition using dynamic partitioning:

insert into table tb_copy_partition partition(month) select id, name, month from tb_partition;

Note that the partition column month must come last in the select list.

hive> insert into table tb_copy_partition partition(month) select id, name, month from tb_partition;

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

The statement fails: with no static partition column, dynamic partitioning requires hive.exec.dynamic.partition.mode=nonstrict.

Set it as the error message suggests:

hive> set hive.exec.dynamic.partition.mode=nonstrict;

Read the setting back to confirm that it took effect:

hive> set hive.exec.dynamic.partition.mode;

hive.exec.dynamic.partition.mode=nonstrict
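
The rule behind error 10096 can be stated as a small check: in strict mode, a partition clause that contains dynamic columns must also contain at least one static one. The Python sketch below models only that rule. The function name and the use of None to mark a dynamic column are assumptions made for illustration, not Hive internals.

```python
def check_dynamic_partition_mode(partition_spec, mode="strict"):
    """Validate a partition clause against the dynamic partition mode.

    partition_spec: partition column -> static value, or None if dynamic.
    mode: value of hive.exec.dynamic.partition.mode (strict in this session).
    """
    has_static = any(v is not None for v in partition_spec.values())
    has_dynamic = any(v is None for v in partition_spec.values())
    if has_dynamic and mode == "strict" and not has_static:
        raise ValueError("strict mode requires at least one static partition column")
    return True


# partition(month): every partition column is dynamic.
all_dynamic = {"month": None}

try:
    check_dynamic_partition_mode(all_dynamic, mode="strict")
except ValueError as err:
    print("rejected:", err)

print(check_dynamic_partition_mode(all_dynamic, mode="nonstrict"))
```

A mixed clause such as partition(month='201710', code) would pass this check even in strict mode, because month is static.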

Run the insert again:

hive> insert into table tb_copy_partition partition(month) select id, name, month from tb_partition;

Query ID = hadoop_20170918230808_0bf202da-279f-4df3-a153-ece0e457c905

Total jobs = ...

Launching Job ... out of ...

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1505785612206_0002, Tracking URL = http://node11:8088/proxy/application_1505785612206_0002/

Kill Command = /home/hadoop/app/hadoop-2.6.-cdh5.10.0/bin/hadoop job  -kill job_1505785612206_0002

...

Ended Job = job_1505785612206_0002

...

Moving data to: hdfs://cluster1/user/hive/warehouse/tb_copy_partition/.hive-staging_hive_2017-09-18_23-08-01_475_7542657053989652968-1/-ext-10000

Loading data to table default.tb_copy_partition partition (month=null)

        Loading partition {month=...}

        Loading partition {month=...}

        Loading partition {month=...}

        Loading partition {month=...}

...

MapReduce Jobs Launched:

Stage-Stage-...: Map: ...   Cumulative CPU: 3.63 sec   HDFS Read: ...  HDFS Write: ...  SUCCESS

...

OK

Time taken: 28.932 seconds

Query the data:

hive> select * from tb_copy_partition;

OK

1       Lily    201707

2       Andy    201707

3       Tom     201707

1       Lily    201708

2       Andy    201708

3       Tom     201708

1       Lily    201709

2       Andy    201709

3       Tom     201709

1       Lily    201710

2       Andy    201710

3       Tom     201710

Time taken: 0.121 seconds, Fetched: 12 row(s)
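
The result shows each row landing in the partition named by its month value. Hive determines dynamic partition values by position: the trailing columns of the select list are used, which is why month has to come last. The Python sketch below models that routing on a few of the rows. The function route_rows_to_partitions is hypothetical and does not reflect how Hive implements the write.

```python
from collections import defaultdict


def route_rows_to_partitions(rows, num_partition_cols=1):
    """Group rows by their trailing partition values.

    rows: tuples in select-list order, with partition values last.
    """
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_cols]
        part_values = row[-num_partition_cols:]
        partitions[part_values].append(data)
    return dict(partitions)


# A few rows as produced by: select id, name, month from tb_partition
selected = [
    ("1", "Lily", "201707"),
    ("2", "Andy", "201707"),
    ("1", "Lily", "201708"),
]

routed = route_rows_to_partitions(selected)
for part, data_rows in routed.items():
    print(f"month={part[0]}", data_rows)
```

Only the id and name values are stored in the data files. The month value survives as the directory name, as in the static case.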

Done.
