使用sqoop把mysql数据导入hive

export HADOOP_COMMON_HOME=/hadoop

export HADOOP_MAPRED_HOME=/hadoop

 

cp /hive/lib/mysql-connector-java-5.1.25-bin.jar     /sqoop/lib/

share表第一列为自增主键 share_id，share_id<1000的数据共有999条：

mysql> SELECT COUNT(*) FROM share WHERE share_id<1000;

+----------+

| COUNT(*) |

+----------+

|      999 |

+----------+

1 ROW IN SET (0.00 sec)

使用sqoop导入到hive

[hduser@www lib]$ /sqoop/bin/sqoop import --connect jdbc:mysql://localhost/shipincon --table share --username root --password xxx --hive-import --where "share_id<1000"  -- --default-character-set=utf8

 

 

Warning: /usr/lib/hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.

		--password 密码   比较危险，可以改成 -P   (Read password from console)

13/07/12 18:42:58 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

		默认是用hive的分隔符 ctrl+a 可以使用 --fields-terminated-by 来更改

		插一句：在console vim中如何输入ctrl+a：  在输入模式下先按ctrl+v，再按ctrl+a即可，会显示成^A

13/07/12 18:42:58 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override

13/07/12 18:42:58 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.

13/07/12 18:42:58 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.

13/07/12 18:42:58 INFO tool.CodeGenTool: Beginning code generation

13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1

13/07/12 18:42:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1

13/07/12 18:42:58 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /hadoop

Note: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.java uses or overrides a deprecated API.

Note: Recompile with -Xlint:deprecation for details.

13/07/12 18:43:00 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/720a3072de33c8a826cdf96d9eced686/share.jar

13/07/12 18:43:00 WARN manager.MySQLManager: It looks like you are importing from mysql.

	如果没有blob clob longvarbinary字段，可以使用 --direct 速度也许会快一些。    text字段不清楚。

13/07/12 18:43:00 WARN manager.MySQLManager: This transfer can be faster! Use the --direct

13/07/12 18:43:00 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.

13/07/12 18:43:00 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)

13/07/12 18:43:00 INFO mapreduce.ImportJobBase: Beginning import of share

13/07/12 18:43:02 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`share_id`), MAX(`share_id`) FROM `share` WHERE ( share_id<1000 )

13/07/12 18:43:02 INFO mapred.JobClient: Running job: job_201307121751_0001

13/07/12 18:43:03 INFO mapred.JobClient:  map 0% reduce 0%

13/07/12 18:43:12 INFO mapred.JobClient:  map 50% reduce 0%

13/07/12 18:43:17 INFO mapred.JobClient:  map 100% reduce 0%

13/07/12 18:43:18 INFO mapred.JobClient: Job complete: job_201307121751_0001

13/07/12 18:43:18 INFO mapred.JobClient: Counters: 18

13/07/12 18:43:18 INFO mapred.JobClient:   Job Counters

13/07/12 18:43:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=20723

13/07/12 18:43:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0

13/07/12 18:43:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0

		################################ map tasks  =4 就会生成四个文件

13/07/12 18:43:18 INFO mapred.JobClient:     Launched map tasks=4

13/07/12 18:43:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0

13/07/12 18:43:18 INFO mapred.JobClient:   File Output Format Counters

13/07/12 18:43:18 INFO mapred.JobClient:     Bytes Written=202094

13/07/12 18:43:18 INFO mapred.JobClient:   FileSystemCounters

13/07/12 18:43:18 INFO mapred.JobClient:     HDFS_BYTES_READ=455

13/07/12 18:43:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=260292

13/07/12 18:43:18 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=202094

13/07/12 18:43:18 INFO mapred.JobClient:   File Input Format Counters

13/07/12 18:43:18 INFO mapred.JobClient:     Bytes Read=0

13/07/12 18:43:18 INFO mapred.JobClient:   Map-Reduce Framework

13/07/12 18:43:18 INFO mapred.JobClient:     Map input records=999

13/07/12 18:43:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=179216384

13/07/12 18:43:18 INFO mapred.JobClient:     Spilled Records=0

13/07/12 18:43:18 INFO mapred.JobClient:     CPU time spent (ms)=3360

13/07/12 18:43:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=65011712

13/07/12 18:43:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1519448064

13/07/12 18:43:18 INFO mapred.JobClient:     Map output records=999

13/07/12 18:43:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=455

13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Transferred 197.3574 KB in 18.0516 seconds (10.9329 KB/sec)

		共999行记录

13/07/12 18:43:18 INFO mapreduce.ImportJobBase: Retrieved 999 records.

13/07/12 18:43:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `share` AS t LIMIT 1

13/07/12 18:43:19 WARN hive.TableDefWriter: Column add_time had to be cast to a less precise type in Hive

		add_time的精度变小了。。为嘛？  mysql中 `add_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,

mysql> select add_time from share limit 1;

+---------------------+

| add_time            |

+---------------------+

| 2012-04-08 17:17:13 |

+---------------------+

1 row in set (0.00 sec)

到了hive中会变成：

2012-04-08 17:17:13.0

	貌似精度是变大了。

不过据说某些数据库的timestamp字段支持到了毫秒的精度。

13/07/12 18:43:19 INFO hive.HiveImport: Removing temporary files from import process: hdfs://localhost:54310/user/hduser/share/_logs

13/07/12 18:43:19 INFO hive.HiveImport: Loading uploaded data into Hive

13/07/12 18:43:21 INFO hive.HiveImport: WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.

13/07/12 18:43:22 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties

13/07/12 18:43:22 INFO hive.HiveImport: Hive history file=/tmp/hduser/hive_job_log_hduser_201307121843_1382245494.txt

13/07/12 18:43:25 INFO hive.HiveImport: OK

13/07/12 18:43:25 INFO hive.HiveImport: Time taken: 2.389 seconds

13/07/12 18:43:25 INFO hive.HiveImport: Loading data to table default.share

13/07/12 18:43:26 INFO hive.HiveImport: Table default.share stats: [num_partitions: 0, num_files: 5, num_rows: 0, total_size: 202094, raw_data_size: 0]

13/07/12 18:43:26 INFO hive.HiveImport: OK

13/07/12 18:43:26 INFO hive.HiveImport: Time taken: 0.726 seconds

13/07/12 18:43:26 INFO hive.HiveImport: Hive import complete.

13/07/12 18:43:26 INFO hive.HiveImport: Export directory is empty, removing it.

[hduser@www lib]$

 

 

 

--default-character-set=utf8之前还有 --

 

Arguments to mysqldump and other subprograms may be supplied

after a '--' on the command line.

去hive中看看数据：

[hduser@www lib]$ /hive/bin/hive

WARNING: org.apache.hadoop.metrics.jvm.EventCounter IS deprecated. Please USE org.apache.hadoop.log.metrics.EventCounter IN ALL the log4j.properties files.

Logging initialized USING configuration IN jar:file:/usr/LOCAL/hive/lib/hive-common-0.10.0.jar!/hive-log4j.properties

Hive history file=/tmp/hduser/hive_job_log_hduser_201307121854_1566000179.txt

hive> USE shipincon;

FAILED: Error IN metadata: ERROR: The DATABASE shipincon does NOT exist.

FAILED: Execution Error, RETURN code 1 FROM org.apache.hadoop.hive.ql.EXEC.DDLTask

hive> SHOW DATABASES;

OK

DEFAULT

TIME taken: 0.244 seconds

hive> USE DEFAULT;

OK

TIME taken: 0.018 seconds

hive> SHOW TABLES;

OK

a

cleanmaster

cm

login_info

login_interm

share

TIME taken: 0.155 seconds

hive> SELECT COUNT(*) FROM sahre;

FAILED: SemanticException [Error 10001]: Line 1:21 TABLE NOT found 'sahre'

hive> SELECT COUNT(*) FROM share

    > ;

Total MapReduce jobs = 1

Launching Job 1 OUT OF 1

NUMBER OF reduce tasks determined at compile TIME: 1

IN ORDER TO CHANGE the average LOAD FOR a reducer (IN bytes):

  SET hive.EXEC.reducers.bytes.per.reducer=<number>

IN ORDER TO LIMIT the maximum NUMBER OF reducers:

  SET hive.EXEC.reducers.MAX=<number>

IN ORDER TO SET a constant NUMBER OF reducers:

  SET mapred.reduce.tasks=<number>

Starting Job = job_201307121751_0002, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201307121751_0002

KILL Command = /usr/LOCAL/hadoop/libexec/../bin/hadoop job  -KILL job_201307121751_0002

Hadoop job information FOR Stage-1: NUMBER OF mappers: 1; NUMBER OF reducers: 1

2013-07-12 18:54:37,120 Stage-1 map = 0%,  reduce = 0%

2013-07-12 18:54:40,143 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:41,154 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:42,162 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:43,168 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:44,175 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:45,181 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:46,192 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:47,199 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.82 sec

2013-07-12 18:54:48,208 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 0.82 sec

2013-07-12 18:54:49,219 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.74 sec

2013-07-12 18:54:50,238 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.74 sec

MapReduce Total cumulative CPU TIME: 1 seconds 740 msec

Ended Job = job_201307121751_0002

MapReduce Jobs Launched:

Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.74 sec   HDFS READ: 202547 HDFS WRITE: 4 SUCCESS

Total MapReduce CPU TIME Spent: 1 seconds 740 msec

OK

999

TIME taken: 18.573 seconds

hive> SHOW CREATE TABLE share;

OK

CREATE  TABLE share(

  share_id BIGINT,

  add_time string,

  tag string)

COMMENT 'Imported by sqoop on 2013/07/12 19:48:33'

ROW FORMAT DELIMITED

  FIELDS TERMINATED BY '\u0001'

  LINES TERMINATED BY '\n'

STORED AS INPUTFORMAT

  'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

  'hdfs://localhost:54310/user/hive/warehouse2/share'

TBLPROPERTIES (

  'numPartitions'='0',

  'numFiles'='4',

  'transient_lastDdlTime'='1373629961',

  'totalSize'='227161',

  'numRows'='0',

  'rawDataSize'='0')

TIME taken: 0.292 seconds

 

hive>

再去dfs里面看看：

[hduser@www lib]$ /hadoop/bin/hadoop dfs -ls /user/hive/warehouse2/share

Found 5 items

-rw-r--r--   1 hduser supergroup          0 2013-07-12 18:43 /user/hive/warehouse2/share/_SUCCESS

-rw-r--r--   1 hduser supergroup      53226 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00000

-rw-r--r--   1 hduser supergroup      50185 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00001

-rw-r--r--   1 hduser supergroup      50634 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00002

-rw-r--r--   1 hduser supergroup      48049 2013-07-12 18:43 /user/hive/warehouse2/share/part-m-00003

乍一看，像是4个partition。可是细看文件名，并不是partition的那种结构：
比如：/user/hive/warehouse2/cm/country_name=VN_en

把文件复制到本地：

/hadoop/bin/hadoop dfs -get /user/hive/warehouse2/share  ~/

[hduser@www share]$ cd ~/share

[hduser@www share]$ wc -l *

   250 part-m-00000

   250 part-m-00001

   249 part-m-00002

   250 part-m-00003

     0 _SUCCESS

   999 total

–hive-overwrite 可以覆盖hive中的现有数据。
–hive-table
Sets the table name to use when importing to Hive.
可以把数据import到某个分区中：
–hive-partition-key Name of a hive field to partition are sharded on
–hive-partition-value String-value that serves as partition key for this imported into hive in this job.

可以写个cron，周期性地把share表的新增数据导入到hive中了: –append

参考：

http://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html

http://www.mysqlperformanceblog.com/2013/07/11/mysql-and-hadoop/

使用sqoop把mysql数据导入hive的更多相关文章

python脚本用sqoop把mysql数据导入hive
转:https://blog.csdn.net/wulantian/article/details/53064123 用python把mysql数据库的数据导入到hive中,该过程主要是通过pytho ...
使用 sqoop 将mysql数据导入到hive表（import）
Sqoop将mysql数据导入到hive表中先在mysql创建表 CREATE TABLE `sqoop_test` ( `id` ) DEFAULT NULL, `name` varchar() ...
使用sqoop将mysql数据导入到hive中
首先准备工具环境:hadoop2.7+mysql5.7+sqoop1.4+hive3.1 准备一张数据库表: 接下来就可以操作了... 一.将MySQL数据导入到hdfs 首先我测试将zhaopin表 ...
Sqoop将mysql数据导入hbase的血与泪
Sqoop将mysql数据导入hbase的血与泪(整整搞了大半天) 版权声明:本文为yunshuxueyuan原创文章.如需转载请标明出处: https://my.oschina.net/yunsh ...
使用 sqoop 将mysql数据导入到hdfs（import）
Sqoop 将mysql 数据导入到hdfs(import) 1.创建mysql表 CREATE TABLE `sqoop_test` ( `id` ) DEFAULT NULL, `name` va ...
sqoop将mysql数据导入hbase、hive的常见异常处理
原创不易,如需转载,请注明出处https://www.cnblogs.com/baixianlong/p/10700700.html,否则将追究法律责任!!! 一.需求: 1.将以下这张表(test_ ...
使用sqoop将mysql数据导入到hadoop
hadoop的安装配置这里就不讲了. Sqoop的安装也很简单. 完成sqoop的安装后,可以这样测试是否可以连接到mysql(注意:mysql的jar包要放到 SQOOP_HOME/lib 下): ...
使用sqoop将mysql中表导入hive中报错
[hdfs@node1 root]$ sqoop import --connect jdbc:mysql://node2:3306/cm?charset-utf8 --username root -- ...
sqoop导出mysql数据进入hive错误
看mr的运行显示:sqoop job可以获得的select max(xxx)结果,但是当mr开始时却显示大片错误,就是连接超时,和连接重置等问题, 最后去每个节点ping mysql的ip地址,发现 ...

随机推荐

PAT006 Tree Traversals Again
题目: An inorder binary tree traversal can be implemented in a non-recursive way with a stack. For exa ...
逃不掉的mysql数据库安装方式大全yum rpm 源码
数据库虽然也不是天天安装,但每次安装都要找来找去挺烦,特整理记录在此. 系统基于:Centos 7.x 数据库版本: MySQL 5.7.x 转载请注明出处 Yum 安装方式 1.下载 yum rep ...
Origin8.0使用心得（不定时更新）
如何画双Y轴坐标? 网上的教程很多,有点鱼龙混杂的感觉,今天整理一下最简单的方法,备忘. 有3组数据,1组自变量,2组因变量.如何在同一张图下使用同一个X轴,但左边的Y轴与右边的Y轴的刻度不一样? 最 ...
android webview css z-index属性无效
在做android的web页面嵌入的时候,当使用css的z-index设置重叠div失败: 查询google说设置 -webkit-transform:translateZ(0) canvas{ -w ...
Python_selenium之获取当前页面的href属性，id属性，图片信息和截全屏
Python_selenium之获取当前页面的href属性,id属性,图片信息和截全屏一. 获取当前页面的全部信息 1. 图片信息包括图片名称.图片大小等信息 2. 只需将图片信息打印出来(ima ...
iOS开发之--制作属于自己的frameWork
开发的时候,有时候,我们会遇到协同开发,在协同开发的时候,每个开发者都会创建自己的工具类,还有就是当一个项目需要嵌套到另一个项目里面,这些时候,如果能把所需的部分打包成framework,会方便很多, ...
NHibernate VS IbatisNet
NHibernate 是当前最流行的 Java O/R mapping 框架Hibernate 的移植版本,当前版本是 1.0 .2 .它出身于sf.net..IbatisNet 是另外一种优秀的 ...
python 解析 XML文件
如下使用xml.etree.ElementTree模块来解析XML文件.ElementTree模块中提供了两个类用来完成这个目的: ElementTree表示整个XML文件(一个树形结构) Eleme ...
【BZOJ4259】残缺的字符串 FFT
[BZOJ4259]残缺的字符串 Description 很久很久以前,在你刚刚学习字符串匹配的时候,有两个仅包含小写字母的字符串A和B,其中A串长度为m,B串长度为n.可当你现在再次碰到这两个串时, ...
160622、详解JavaScript变量提升
变量在程序中随处可见.它们是一些始终在相互影响,相互作用的的数据和逻辑.正是这些互动使应用程序活了起来. 在JavaScript中使用变量很重要的一方面就是变量的提升 —— 它决定了一个变量何时可以被 ...

使用sqoop把mysql数据导入hive

使用sqoop把mysql数据导入hive的更多相关文章

随机推荐

热门专题