hive操作简单总结

Hive DDL、DML操作
背景介绍
• 一、DDL操作（数据定义语言）包括：Create、Alter、Show、Drop等。
• create database- 创建新数据库
• alter database - 修改数据库
• drop database - 删除数据库
• create table - 创建新表
• alter table - 变更（改变）数据库表
• drop table - 删除表
• create index - 创建索引（搜索键）
• drop index - 删除索引
• show table - 查看表
• 二、DML操作（数据操作语言）包括：Load 、Insert、Update、Delete、Merge。
• load data - 加载数据
• insert into - 插入数据
• insert overwrite - 覆盖数据（insert ... values从Hive 0.14开始可用。）
• update table - 更新表（update在Hive 0.14开始可用，并且只能在支持ACID的表上执行）
• delete from table where id = 1; - 删除表中ID等于1的数据（delete在Hive 0.14开始可用，并且只能在支持ACID的表上执行）
• merge - 合并（MERGE在Hive 2.2开始可用，并且只能在支持ACID的表上执行）
注意：频繁的update和delete操作已经违背了Hive的初衷。不到万不得已的情况，还是使用增量添加的方式最好。

1.实验环境准备
1）新建目录，并下载安装包
mkdir -p /data/hive2
cd /data/hive2
wget http://192.168.1.100:60000/allfiles/hive2/cat_group
wget http://192.168.1.100:60000/allfiles/hive2/goods
2）检查hadoop、Mysql、Hive是否启动，若无则启动
jps
cd /apps/hadoop/sbin
./start-all.sh
sudo service mysql status
sudo service mysql start
cd /apps/hive/bin
./hive

2.Hive数据仓库的操作
1）创建一个数据仓库，名为DB
create database DB;
create database if not exists DB; #优化命令
2）查看数据仓库DB的信息及路径
describe database DB;
3）删除名为DB的数据仓库
drop database if exists DB;

3.Hive数据表的操作
1）查看已存在的表
show tables;
2）创建一个名为cat的内部表
create table cat(cat_id string,cat_name string);
show tables;
3）创建一个外部表，表明为cat2
create external table if not exists cat2(cat_id string, cat_name string);
show tables;
4）修改cat表的表结构
alter table cat add columns(group_id string,cat_code string);
desc cat; #查看表结构
alter table cat2 rename to cat3;
5）删除
drop table cat3;
show tables;
6）创建相同结构的表
create table cat4 like cat;
show tables;

4.四种Hive中数据的导入方式
1）从本地文件系统中导入数据到Hive表
＃首先创建一个表，包含两个字段，以“\t”为分隔符
create table cat_group(group_id string,group_name string)
row format delimited fields terminated by '\t' stored as textfile;
show tables;
＃然后将Linux本地/data/hive2目录下的文件导入到表中
load data local inpath '/data/hive2/cat_group' into table cat_group;
＃查看数据
select * from cat_group limit 10;
2）将HDFS上的数据导入到Hive中
＃首先在另一个终端窗口，在HDFS上创建/myhive2目录
hadoop fs -mkdir /myhive2
＃将本地/data/hive2/下的表上传到HDFS的/myhive2上，并查看是否创建成功
hadoop fs -put /data/hive2/cat_group /myhive2
hadoop fs -ls /myhive2
＃在Hive中创建表
create table cat_group1(group_id string,group_name string)
row format delimited fields terminated by '\t' stored as textfile;
＃将HDFS中的表导入到Hive中，并查看结果
load data inpath ‘/myhive2/cat_group’ into table cat_group1;
select * from cat_group1 limit 10;
3）从别的表中查询出相应的数据导入到Hive中
＃首先创建一个表
create table cat_group2(group_id string,group_name string)
row format delimited fields terminated by '\t' stored as textfile;
＃将上一步得到的表的数据导入到新表中
insert into(或者用overwrite) table cat_group2 select * from cat_group1;
＃查看数据
select * from cat_group2 limit 10;
4）在建新表时，从别的表中查询出相应的数据插入到正在创建的表中
create table cat_group3 as select *from cat_group2;

5.三种Hive中数据的导出方式
1）导出到本地文件系统
＃首先在本地新建/data/hive2/out目录
mkdir -p /data/hive2/out
＃将Hive的cat_group表导出到上述目录
insert overwrite local directory ‘/data/hive2/out’ select * from cat_group;
＃在本地切换到/data/hive2/out目录，cat命令查询
ls
cat 000000_0
#刚刚查看的文件字符间没有分割，以下为导出改进方法
insert overwrite local directory ‘/data/hive2/out’ select group_id,concat(‘\t’,group_name) from cat_group;
2）导出到HDFS中
＃在HDFS上创建/myhive2/out目录
hadoop fs -mkdir /myhive2/out
＃将Hive的cat_group表导出到上述目录
insert overwrite directory ‘/myhive2/out’ select group_id,concat(‘\t’,group_name) from cat_group;
＃查看数据
hadoop fs -ls /myhive2/out
3）导出到Hive的另一个表中
＃首先在Hive中创建一个表
create table cat_group4(group_id string,group_name string)
row format delimited fields terminated by '\t' stored as textfile;
＃将旧表的数据导出到新表中
insert into table cat_group4 select * from cat_group;
＃查看数据
select * from cat_group4 limit 10;

6.Hive分区表的操作
1）创建分区表goods，并查看结构
create table goods(goods_id string,goods_status string) partitioned
by (cat_id string) row format delimited fields terminated by '\t';
desc goods;
2）向分区表插入数据
＃首先在Hive创建一个非分区表goods_1，用于存储本地表goods的数据
create table goods_1(goods_id string,goods_status string,cat_id string)
partitioned by (cat_id string) row format delimited fields terminated by '\t';
＃将本地表goods的数据导入到上述新表中
load data local inpath ‘/data/hive2/goods’ into table goods_1;
＃再将表goods_1的数据导入到分区表goods中
insert into table goods partition(cat_id=‘52052’) select goods_id,goods_status from goods_1 where cat_id=‘52052’;
＃查看数据
select * from goods limit 10;
3）查看表goods的分区
show partitions goods;
4）修改表分区
alter table goods partition(cat_id=52052) rename to partition(cat_id=52051);
show partitions goods;
5）删除表分区
＃先备份出一个goods_2表
create table goods_2(goods_id string,goods_status string)
partitioned by (cat_id string) row format delimited fields terminated by '\t';
insert into table goods_2 partition(cat_id=‘52052’) select goods_id,goods_status from goods_1 where cat_id=‘52052’;
＃删除goods表中的cat_id分区
alter table goods drop if exists partition (cat_id=’52051’)
show partitions goods;

7.Hive桶的操作
1）创建桶
＃创建goods_t表，按照一列聚类另一列排序，划分成两个桶
create table goods_t(goods_id string,goods_status string)
partitioned by (cat_id string) clustered by(goods_status)
sorted by (goods_id) into 2 buckets;
＃设置环境变量
set hive.enforce.bucketing=true;
2）向goods_t表中插入goods_2表中的数据
from goods_2 insert overwrite table goods_t partition(cat_id=‘52063’) select goods_id,goods_status;
3）查看结果
select * from goods_t tablesample(bucket 1 out of 2 on goods_id);

tablesample(bucket x out of y)

y必须是table总bucket数的倍数或者因子。hive根据y的大小，决定抽样的比例。例如，table总共分了64份，当y=32时，抽取(64/32=)2个bucket的数据，当y=128时，抽取(64/128=)1/2个bucket的数据。
x表示从哪个bucket开始抽取。例如，table总bucket数为32，tablesample(bucket 3 out of 16)，表示总共抽取（32/16=）2个bucket的数据，分别为第3个bucket和第（3+16=）19个bucket的数据。

Hive HQL操作
1.环境准备
1）首先检查Hadoop是否启动，若无则启动
jps
cd /apps/hadoop/sbin
./start-all.sh
2）启动Mysql库，用于存放Hive的元数据
sudo service mysql start
3）启动Hive命令行
hive
4）打开新的命令行，创建新目录用于存放原始文件
mkdir -p /data/hive3
cd /data/hive3
wget ……
5）在Hive命令行，创建两个表，以‘\t’为分隔符
create table buyer_log(id string,buyer_id string,dt string,ip string,opt_type string)
row format delimited fields terminated by '\t' stored as textfile;
create table buyer_favorite(buyer_id string,goods_id string,dt string)
row format delimited fields terminated by '\t' stored as textfile;
6）将下载到本地的表中数据导入到Hive中
load data local inpath '/data/hive3/buyer_log' into table buyer_log;
load data local inpath '/data/hive3/buyer_favorite' into table buyer_favorite;

2.查询操作
1）普通查询
select * from buyer_log limit 10;
2）别名查询
select b.id,b.ip from buyer_log b limit 10;
3）限定查询（where）
select buyer_id from buyer_log where opt_type=1 limit 10;
4）两表或多表联合查询
select l.dt,f.goods_id from buyer_log l,buyer_favorite f where l.buyer_id=f.buyer_id limit 10;
5）多表插入
＃先创建两个新表
create table buyer_log1 like buyer_log;
create table buyer_log2 like buyer_log;
＃再将buyer_log的数据插入到两个新表中
from buyer_log
insert overwrite table buyer_log1 select
insert overwrite table buyer_log2 select ;
6）多目录输出文件，将同一文件输出到本地不同文件夹
from buyer_log
insert overwrite local directory ‘/data/hive3/out’ select
insert overwrite local directory ‘/data/hive3/out1’ select ;
＃切入到本地目录/data/hive3查询是否成功输出
cd /data/hive3
ls out
ls out1

3.使用shell脚本调用Hive查询语句
1）编写脚本sh1
cd /data/hive3
vim sh1
#!/bin/bash
cd /apps/hive/bin;
hive -e ‘show tables;’
2）赋予执行权限，并执行
chmod +x sh1
./sh1

4.排序
1）Order by＝全局排序
select * from goods_visit order by click_num desc limit 10;
2）Sort by＝局部排序
＃设置Reduce个数为3
set mapred.reduce.tasks=3;
3）Group by＝分组操作

4）Distribute by＝分发为不同文件导出
＃设置Reduce个数为3
set mapred.reduce.tasks=3;
＃按照buyer_id做分发，输出到本地/data/hive4/out目录中
insert overwrite local directory ‘/data/hive4/out’ select * from buyer_favorite distribute by buyer_id;
＃结果是：在/data/hive4/out目录下按照buyer_id分成了3个文件
cd /data/hive4/out
ls
5）Cluster by＝distribute＋sort
＃设置Reduce个数为3
set mapred.reduce.tasks=3;
＃按照buyer_id分成三个文件，并按buyer_id排序
select * from buyer_favorite cluster by buyer_id

hive操作简单总结的更多相关文章

Hive 的简单使用及调优参考文档
Hive 的简单使用及调优参考文档 HIVE的使用命令行界面使用一下命令查看hive的命令行页面, hive --help --service cli 简化命令为hive –h 会输出下面的这 ...
Hive 文件格式 & Hive操作（外部表、内部表、区、桶、视图、索引、join用法、内置操作符与函数、复合类型、用户自定义函数UDF、查询优化和权限控制）
本博文的主要内容如下: Hive文件存储格式 Hive 操作之表操作:创建外.内部表 Hive操作之表操作:表查询 Hive操作之表操作:数据加载 Hive操作之表操作:插入单表.插入多表 Hive语 ...
吴裕雄--天生自然HADOOP操作实验学习笔记：使用hive操作hbase
实验目的熟悉hive和hbase的操作熟悉hadoop.hbase.hive.zookeeper的关系熟练大数据环境的搭建学会分析日志排除问题实验原理 1.hive整合hbase原理前面大 ...
开源：矿Android新闻client，快、小、支持离线阅读、操作简单、内容丰富，形式多样展示、的信息量、全功能等待（离开码邮箱)
分享:矿Android新闻client.快.小.支持离线阅读.操作简单.内容丰富,形式多样展示.的信息量.全功能等待(离开码邮箱) 历时30天我为了开发这个新闻clientAPP,下面简称觅闻 ht ...
Hadoop+Hive 操作mongodb数据
Hadoop+Hive 操作mongodb数据 1.版本概述 hadoop-2.7.3.hive-2.2 下载响应的jar包:http://mvnrepository.com/,直接搜索想要的jar包 ...
hive操作记录
hive是依赖于hdfs和yarn的一个数据仓库数据库和数据仓库的区别: 数据库在存储数据的同时,允许实时的增删改查等操作数据仓库在存储数据的同时还执行着计算和分析数据的工作,但是并不能实时的进行 ...
转 PV操作简单理解
传送门 PV操作简单理解进程通常分为就绪.运行和阻塞三个工作状态.三种状态在某些条件下可以转换,三者之间的转换关系如下: 进程三个状态之间的转换就是靠PV操作来控制的.PV操作主要就是P操作.V操作 ...
mysqli数据库操作简单实例
mysqli数据库操作简单实例代码结果
Spark2.0集成Hive操作的相关配置与注意事项
前言已完成安装Apache Hive,具体安装步骤请参照,Linux基于Hadoop2.8.0集群安装配置Hive2.1.1及基础操作补充说明 Hive中metastore(元数据存储)的三种方式 ...

随机推荐

一统江湖的大前端（8）- velocity.js 运动的姿势(上)
[摘要] 介绍CSS动画和JS动画的基本特点,以及轻量级动画库velocity.js的基本用法. 示例代码托管在:http://www.github.com/dashnowords/blogs 博客园 ...
关于Block内部要不要使用weakSelf的几种情况
本文转载自http://www.jianshu.com/p/c6ca540861d9 关于Block内部要不要使用weakSelf的几种情况我们知道当对block使用不当时会造成循环引用导致内存泄露 ...
IOS中获取各个文件的目录路径的方法和NSFileManager类
转自:http://blog.sina.com.cn/s/blog_5fb39f910101di92.html IOS中获取各种文件的目录路径的方法 iphone沙箱模型的有四个文件夹,分别是什么,永 ...
POJ 1966 Cable TV NETWORK(网络流-最小点割集)
Cable TV NETWORK The interconnection of the relays in a cable TV net ...
ARTS-S golang panic返回默认值
package main import "fmt" func fn_test_panic() (a int) { a = 2 panic("This is panic&q ...
Day 06 流程控制和爬虫基础2
目录 if 判断单分支结构双分支结构多分支结构 for循环 for循环的基本用法 for循环嵌套 break continue 爬虫基础2 爬取豆瓣TOP250 爬取豆瓣数据接口(异步数据) 爬 ...
Codeforces Global Round 6
久违的写篇博客吧 A. Competitive Programmer 题目链接:https://codeforces.com/contest/1266/problem/A 题意: 给你一个只包含数字 ...
《Java基础知识》Java 反射详解
定义 JAVA反射机制是在运行状态中,对于任意一个类,都能够知道这个类的所有属性和方法:对于任意一个对象,都能够调用它的任意方法和属性:这种动态获取信息以及动态调用对象方法的功能称为java语言的反射 ...
Caffe源码-Solver类
Solver类简介 Net类中实现了网络的前向/反向计算和参数更新,而Solver类中则是对此进行进一步封装,包含可用于逐次训练网络的Step()函数,和用于求解网络的优化解的Solve()函数,同时 ...
基于C# WPF框架的贪吃蛇
游戏开始界面游戏开始共有两条蛇,吃到红色食物加1分,吃到绿色毒食物减1分,知道0不减: 碰到墙壁游戏结束,碰到对方游戏结束,碰到自己游戏结束此游戏通过Canvas画布布局,通过C#代码实现游戏 ...

hive操作简单总结

tablesample(bucket x out of y)

hive操作简单总结的更多相关文章

随机推荐

热门专题