1. Scenario description

When I used Sqoop to import a MySQL table into Hive, the job failed with the following error:

// :: WARN hcat.SqoopHCatUtilities: The Sqoop job can fail if types are not  assignment compatible
// :: WARN hcat.SqoopHCatUtilities: The HCatalog field submername has type string. Expected = varchar based on database column type : VARCHAR
// :: WARN hcat.SqoopHCatUtilities: The Sqoop job can fail if types are not assignment compatible
// :: INFO mapreduce.DataDrivenImportJob: Configuring mapper for HCatalog import job
// :: INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
// :: INFO client.RMProxy: Connecting to ResourceManager at hadoop-namenode01/192.168.1.101:
// :: WARN conf.HiveConf: HiveConf of name hive.server2.webui.host.port does not exist
// :: INFO Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
// :: INFO db.DBInputFormat: Using read commited transaction isolation
// :: INFO mapreduce.JobSubmitter: number of splits:
// :: INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1562229385371_50086
// :: INFO impl.YarnClientImpl: Submitted application application_1562229385371_50086
// :: INFO mapreduce.Job: The url to track the job: http://hadoop-namenode01:8088/proxy/application_1562229385371_50086/
// :: INFO mapreduce.Job: Running job: job_1562229385371_50086
// :: INFO hive.metastore: Closed a connection to metastore, current connections:
// :: INFO mapreduce.Job: Job job_1562229385371_50086 running in uber mode : false
// :: INFO mapreduce.Job: map % reduce %
// :: INFO mapreduce.Job: Task Id : attempt_1562229385371_50086_m_000000_0, Status : FAILED
Error: GC overhead limit exceeded

Why does the Sqoop import throw this exception?
By default, the MySQL JDBC driver (not Sqoop itself) retrieves the whole result set in one shot and holds all of it in the memory of the map task. On a large table this exhausts the heap, and the task fails with "GC overhead limit exceeded". To avoid this, tell the driver to fetch the data in batches. Appending the parameters "?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true" to the JDBC connection string makes the driver fetch 10000 rows per batch.

The import script, sqoop_order_detail.sh, is as follows:

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect jdbc:mysql://lenmom-mysql:3306/inventory \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m

The source MySQL table has 10 billion records.

2. Solution

2.1 Solution 1

Make the MySQL driver read the data as a stream by appending the following to the JDBC URL:

?dontTrackOpenResources=true&defaultFetchSize=&useCursorFetch=true

The defaultFetchSize value can be adjusted to suit your situation. In my case, the whole script is:

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true\&defaultFetchSize=10000\&useCursorFetch=true\&useUnicode=yes\&characterEncoding=utf8 \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m

Don't forget to escape each & in the shell script. Alternatively, wrap the JDBC URL in double quotes so that no escaping is needed:

#!/bin/bash

/home/lenmom/sqoop-1.4./bin/sqoop import \
--connect "jdbc:mysql://lenmom-mysql:3306/inventory?dontTrackOpenResources=true&defaultFetchSize=10000&useCursorFetch=true&useUnicode=yes&characterEncoding=utf8" \
--username root --password root \
--driver com.mysql.jdbc.Driver \
--table order_detail \
--hcatalog-database orc \
--hcatalog-table order_detail \
--hcatalog-partition-keys pt_log_d \
--hcatalog-partition-values \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
-m
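
To keep the fetch size adjustable without retyping or re-escaping the URL, the connection string can be assembled in one place. The following is only a minimal sketch: the function name build_jdbc_url is hypothetical, and the host, database, and fetch size are the values used in the scripts above.

```shell
#!/bin/bash

# Hypothetical helper (not part of Sqoop): build a MySQL JDBC URL that
# enables cursor-based fetching.
# Arguments: host:port, database name, fetch size (rows per batch).
build_jdbc_url() {
  local host_port="$1" database="$2" fetch_size="$3"
  printf 'jdbc:mysql://%s/%s?dontTrackOpenResources=true&defaultFetchSize=%s&useCursorFetch=true\n' \
    "$host_port" "$database" "$fetch_size"
}

# Values taken from the scripts above.
url="$(build_jdbc_url lenmom-mysql:3306 inventory 10000)"
echo "$url"
```

Passing the result as --connect "$url" keeps the & characters away from the shell, because the variable is expanded inside double quotes.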

2.2 Solution 2

sqoop import -Dmapreduce.map.memory.mb= -Dmapreduce.map.java.opts=-Xmx1600m -Dmapreduce.task.io.sort.mb=

The parameters above need to be tuned to the data volume for the Sqoop pull to succeed.
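
When tuning these values, the mapper heap (-Xmx) has to fit inside the YARN container (mapreduce.map.memory.mb). The sketch below assumes a common rule of thumb of giving the heap about 80% of the container. That ratio, the helper name heap_mb_for_container, and the 2048 MB container size are illustrative assumptions, not values from the original job.

```shell
#!/bin/bash

# Hypothetical helper: derive a mapper heap size in MB from the container
# size, assuming the heap takes about 80% of the container (rule of thumb).
heap_mb_for_container() {
  local container_mb="$1"
  echo $(( container_mb * 80 / 100 ))
}

container_mb=2048   # assumed value, for illustration only
heap_mb="$(heap_mb_for_container "$container_mb")"

# Print the generic options; on the command line they go right after
# "sqoop import" and before the tool-specific arguments such as --connect.
echo "-Dmapreduce.map.memory.mb=${container_mb} -Dmapreduce.map.java.opts=-Xmx${heap_mb}m"
```

The remaining part of the container is left for non-heap JVM memory; if tasks are still killed, lower the ratio or raise the container size.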

2.3 Solution 3

Increase the number of mappers (the default is 4; it should not be greater than the number of DataNodes):

sqoop job --exec lenmom-job -- --num-mappers ;

Reference:

https://stackoverflow.com/questions/26484873/cloudera-settings-sqoop-import-gives-java-heap-space-error-and-gc-overhead-limit
