大数据入门到精通18--sqoop 导入关系库到hdfs中和hive表中

一，选择数据库，这里使用标准mysql sakila数据库

mysql -u root -D sakila -p

二。首先尝试把表中的数据导入到hdfs文件中，这样后续就可以使用spark来dataframe或者rdd来处理数据

sqoop import --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --table rental --target-dir "SqoopImport/rental" --num-mappers 1

\\SqoopImport 目录必须有，rental 目录可以不存在

三。如果要导入到hive里面，要使用 --warehouse参数。

sqoop import --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --table rental --warehouse-dir "/user/hive/warehouse/sakila.db" --num-mappers 2

\\因为之前我们已经全表导入过一次了，会提示文件已经存在的错误

hadoop fs -mv /user/hive/warehouse/sakila.db/rental /user/hive/warehouse/sakila.db/rental2
\\把原来的目录移走
sqoop import --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --table rental --warehouse-dir "/user/hive/warehouse/sakila.db" --num-mappers 2

四。也可以通过sqoop命令来查看hive的元数据库。

1.查看多少个数据库

sqoop list-databases --connect "jdbc:mysql://host03.xyy:3306" --username root --password root

2.查看多少给表

sqoop list-tables --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root

3.sqoop执行select语句。

sqoop eval --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --query "select * from rental limit 10"

五。导入hive或者hdfs中使用追加模式

sqoop import --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --table rental --where "date(return_date) < '2005-07-30'" --warehouse-dir "/user/hive/warehouse/sakila.db" --append --num-mappers 2

Total MapReduce CPU Time Spent: 9 seconds 250 msec
OK
23191
Time taken: 30.055 seconds, Fetched: 1 row(s)
hive>
\\原来数据hive里面的表格数据是16044条，重新append一批数据以后百年城23191条
\\apend 也可以应用 hdfs文件中，和target-dir配合使用

六。导入hdfs和hive中使用删除模式

OK
7147
Time taken: 32.115 seconds, Fetched: 1 row(s)
hive>
\\--delete-target-dir 是删除模式导入，清空原来的数据，这个命令也可以在导入hdfs下使用

注意以上例子中都是使用了where条件的导入。

七。关于导入数据的并行数量

前面几个例子都是使用了--num-mapper 2 也就是两个并行。

实际上默认是因为原来的mysql表中有主键，如果没有主键是不能直接指定并行为2 的。因为系统不知道怎么切割数据。

如果要并行需要使用另外一个参数

在mysql中执行复制一个表格
create table customer_copy like customer;
insert into customer_copy select * from customer

sqoop import --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --table customer_copy --warehouse-dir "/user/hive/warehouse/sakila.db" --delete-target-dir -split-by address_id --num-mappers 2

八。sqoop全表导入

//导入数据库mysql到hive

sqoop import-all-tables --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --hive-import --hive-database sakila --m 2

如果其中部分表格没有主键并行就有问题。需要使用一个参数 --autoreset-to-one-mapper

sqoop import-all-tables --connect "jdbc:mysql://host03.xyy:3306/sakila" --username root --password root --warehouse-dir "SqoopImport/sakila" --autoreset-to-one-mapper --m 2

这样对于没有主键的自动变成一个map去处理

九。文件格式

通过参数决定每个表存入hdfs中的格式

--as-textfile （default）
--as-avrodatafile
--as-sequencefile
--as-parquetfile

10.sqoop import参数列表

Argument	Description
`--append`	Append data to an existing dataset in HDFS
`--as-avrodatafile`	Imports data to Avro Data Files
`--as-sequencefile`	Imports data to SequenceFiles
`--as-textfile`	Imports data as plain text (default)
`--as-parquetfile`	Imports data to Parquet Files
`--boundary-query <statement>`	Boundary query to use for creating splits
`--columns <col,col,col…>`	Columns to import from table
`--delete-target-dir`	Delete the import target directory if it exists
`--direct`	Use direct connector if exists for the database
`--fetch-size <n>`	Number of entries to read from database at once.
`--inline-lob-limit <n>`	Set the maximum size for an inline LOB
`-m,--num-mappers <n>`	Use n map tasks to import in parallel
`-e,--query <statement>`	Import the results of `statement`.
`--split-by <column-name>`	Column of the table used to split work units. Cannot be used with `--autoreset-to-one-mapper` option.
`--split-limit <n>`	Upper Limit for each split size. This only applies to Integer and Date columns. For date or timestamp fields it is calculated in seconds.
`--autoreset-to-one-mapper`	Import should use one mapper if a table has no primary key and no split-by column is provided. Cannot be used with `--split-by <col>`option.
`--table <table-name>`	Table to read
`--target-dir <dir>`	HDFS destination dir
`--temporary-rootdir <dir>`	HDFS directory for temporary files created during import (overrides default "_sqoop")
`--warehouse-dir <dir>`	HDFS parent for table destination
`--where <where clause>`	WHERE clause to use during import
`-z,--compress`	Enable compression
`--compression-codec <c>`	Use Hadoop codec (default gzip)
`--null-string <null-string>`	The string to be written for a null value for string columns
`--null-non-string <null-string>`	The string to be written for a null value for non-string columns

大数据入门到精通18--sqoop 导入关系库到hdfs中和hive表中的更多相关文章

大数据入门到精通1--大数据环境下的基础文件HDFS 操作
1.使用hdfs用户或者hadoop用户登录 2.在linux shell下执行命令 hadoop fs -put '本地文件名' hadoop fs - put '/home/hdfs/sample ...
大数据入门到精通19--mysql 数据导入到hive数据中
一.正常按照数据库和表导入 \\前面介绍了通过底层文件得形式导入到hive的表中,或者直接导入到hdfs中,\\现在介绍通过hive的database和table命令来从上层操作.sqoop impo ...
大数据入门到精通13--为后续和MySQL数据库准备
We will be using the sakila database extensively inside the rest of the course and it would be great ...
大数据入门到精通2--spark rdd 获得数据的三种方法
通过hdfs或者spark用户登录操作系统,执行spark-shell spark-shell 也可以带参数,这样就覆盖了默认得参数 spark-shell --master yarn --num-e ...
大数据学习之路之sqoop导入
按照网上的代码导入 hadoop(十九)-Sqoop数据清洗 - 简书 (jianshu.com) ./sqoop import --connect "jdbc:mysql://192.16 ...
大数据入门到精通16--hive 的条件语句和聚合函数
一.条件表达 case when ... then when .... then ... when ... then ...end select film_id,rpad(title,20," ...
大数据入门到精通12--spark dataframe 注册成hive 的临时表
一.获得最初的数据并形成dataframe val ny= sc.textFile("data/new_york/")val header=ny.firstval filterNY ...
大数据入门到精通11-spark dataframe 基础操作
// dataframe is the topic 一.获得基础数据.先通过rdd的方式获得数据 val ny= sc.textFile("data/new_york/")val ...
大数据入门到精通10--spark rdd groupbykey的使用
//groupbykey 一.准备数据val flights=sc.textFile("data/Flights/flights.csv")val sampleFlights=sc ...

随机推荐

ROS tf
一.节点中使用(cpp,python) 1. ros wiki 提供的tutorials 2. https://blog.csdn.net/start_from_scratch/article/det ...
工控随笔_06_西门子_Step7归档项目无法备份的解决方法
在一次备份Step7项目时,突然发现无法进行备份而是报错,具体的报错内容如下所示: 图 step7 归档程序时报pkzipc.exe 应用程序错误内存不能为"read" 一.s ...
Django框架连接Oracle -ServerName方式报错
连接前: 修改后:
关于栈、队列、优先队列的应用——UVa11995
这本来是上一篇博客里的内容,但不知道什么原因breakdown了……我就简单放上一道题好了题意:这道题的题目是“猜猜数据结构”,题意就是给你一些输入输出数据,让你根据这些数据判断是什么数据结构.要猜 ...
廖雪峰Java9正则表达式-1正则表达式入门-2正则表达式匹配规则
正则表达式的匹配规则: 从左到右按规则匹配匹配规则及示例可以匹配不能匹配 "abc" "abc" 不能匹配:"ab", "A ...
oracle dg 报错提示涉及硬盘错误
###oracle dg 报错提示涉及硬盘错误 Dec 23 03:28:01 xhisdg rsyslogd: [origin software="rsyslogd" swVe ...
爬虫系列1：Requests+Xpath 爬取豆瓣电影TOP
爬虫1:Requests+Xpath 爬取豆瓣电影TOP [抓取]:参考前文爬虫系列1:https://www.cnblogs.com/yizhiamumu/p/9451093.html [分页]: ...
sqlserver 游标使用
文章来源:https://blog.csdn.net/farmwang/article/details/78661326 --声明一个游标 DECLARE MyCursor CURSOR FOR SE ...
在MyEclipse使用Git新建分支，并上传分支---图文教程
1.选中项目,右键--->Team--->Switch To--->New Branch: 2.在弹出的窗口中,填写新建的分支名称,如下图 3.当前分支就会变成新建分支“test” ...
Java 公平锁与非公平锁学习研究
最近学习研究了一下Java中关于公平锁与非公平锁的底层实现原理,总结了一下. 首先呢,通过其字面意思,公平与非公平的评判标准就是付出与收获成正比(和社会中的含义差不多一个意思).放到程序中,尤其是在 ...

大数据入门到精通18--sqoop 导入关系库到hdfs中和hive表中

大数据入门到精通18--sqoop 导入关系库到hdfs中和hive表中的更多相关文章

随机推荐

热门专题