DataX Tuning
Tags: ETL
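I. DataX tuning directions
DataX tuning breaks down into several parts (note: "task machine" here means the machine that runs the DataX job).
1. the impact of hardware factors such as network bandwidth;
2. DataX's own parameters;
3. the path from the source to the task machine;
4. the path from the task machine to the destination.
When DataX transfers feel slow, start troubleshooting from these four areas.
1. Tuning network bandwidth and other hardware factors
This part is mainly about understanding the network itself: how much bandwidth is available from the source to the destination (actual bandwidth calculation formula), and how heavily the link is normally used, so that you can judge whether the network is what makes the transfer slow. A few approaches:
1. Use scp from the source to the destination, a Python HTTP server, nethogs, and similar tools to observe the actual network and NIC throughput;
2. Use monitoring to see how busy the network is during the window in which the job runs, and decide whether the job should be moved away from network peak hours;
3. Watch the load on the task machine, especially network and disk I/O, to see whether it has become the bottleneck.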
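The comparison between measured speed and link capacity can be sketched in a few lines of Python. This is only an illustration: the function names and the sample numbers are assumptions, and the byte count and elapsed time would come from whatever measurement you ran (an scp transfer, for example).

```python
def throughput_mb_per_s(num_bytes, seconds):
    """Measured throughput in MiB/s for a transfer of num_bytes."""
    return num_bytes / seconds / (1024 * 1024)


def link_utilization(num_bytes, seconds, link_mbit_per_s):
    """Fraction of the nominal link capacity used by the transfer.

    link_mbit_per_s is the nominal bandwidth in megabits per second
    (1 Mbit = 1,000,000 bits).
    """
    bits_per_s = num_bytes * 8 / seconds
    return bits_per_s / (link_mbit_per_s * 1000000)


if __name__ == "__main__":
    # Hypothetical measurement: 1.25 GB copied in 100 s over a 1000 Mbit/s link.
    print(throughput_mb_per_s(1250000000, 100))
    print(link_utilization(1250000000, 100, 1000))
```

If the utilization is already close to 1, tuning DataX parameters is unlikely to help and the network itself is the limit.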
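2. Tuning DataX's own parameters
Global
{
"core":{
"transport":{
"channel":{
"speed":{
"channel": 2, ## concurrency of the data import; tune this to the server hardware
"record":-1, ## removes the limit on the number of records read
"byte":-1, ## removes the limit on bytes
"batchSize":2048 ## size of each read batch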
}
}
}
},
"job":{
...
}
}
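Local (per job)
{
"job": {
"setting": {
"speed": {
"channel": 2,
"record":-1,
"byte":-1,
"batchSize":2048
}
}
}
}
# When channel is increased, the datax.py file of the DataX tool needs to be modified to prevent OOM.
# As shown below, raise -Xms and -Xmx according to the actual configuration of the task machine to prevent OOM.
# A larger channel count is not always better; setting it too high hurts the performance of the host.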
DEFAULT_JVM = "-Xms1g -Xmx1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME)
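JVM tuning
python datax.py --jvm="-Xms3G -Xmx3G" ../job/test.json
Tune this according to the server's configuration. Do not set it too large, or the job fails immediately with an Exception.
The tuning above can presumably be applied separately for each JSON job file.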
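As a guard against setting the heap too large, a small helper can build the --jvm option and refuse values that exceed a share of the machine's memory. This is a sketch under assumptions: the helper name and the 50% ceiling are illustrative choices, not DataX recommendations.

```python
def build_jvm_option(heap_gb, machine_memory_gb, max_fraction=0.5):
    """Return the --jvm argument for datax.py, or raise if the heap is too big.

    max_fraction is an assumed safety ceiling (share of physical memory);
    pick a value that suits the task machine.
    """
    if heap_gb <= 0:
        raise ValueError("heap size must be positive")
    if heap_gb > machine_memory_gb * max_fraction:
        raise ValueError("heap size exceeds the allowed share of machine memory")
    return '--jvm="-Xms{0}G -Xmx{0}G"'.format(heap_gb)


if __name__ == "__main__":
    # Hypothetical task machine with 16 GB of RAM.
    print("python datax.py " + build_jvm_option(3, 16) + " ../job/test.json")
```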
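3. Functional and performance testing
Quick start: https://github.com/alibaba/DataX/blob/master/userGuid.md
3.1 Passing parameters dynamically
If there are many tables to import and they all share the same format, one JSON file can be reused. A simple example: python datax.py -p "-Dsdbname=test -Dstable=test" ../job/test.json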
"column": ["*"],
"connection": [
{
"jdbcUrl": "jdbc:mysql://xxx:xx/${sdbname}?characterEncoding=utf-8",
"table": ["${stable}"]
}
],
The example above can be combined with shell scripts on Linux.
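The same reuse can be driven from Python instead of shell. The sketch below only builds and prints one command per table; the table list and the helper name are hypothetical, and running the commands (for example with subprocess.run) is left out so the snippet has no side effects.

```python
import shlex


def build_datax_command(job_file, params, datax_py="datax.py"):
    """Build the argument list for one DataX run with -D style parameters."""
    param_string = " ".join("-D{}={}".format(k, v) for k, v in params.items())
    return ["python", datax_py, "-p", param_string, job_file]


if __name__ == "__main__":
    # Hypothetical tables that share the same layout.
    tables = ["test", "test_archive"]
    for table in tables:
        cmd = build_datax_command(
            "../job/test.json", {"sdbname": "test", "stable": table}
        )
        # Print a copy-pasteable shell command for each table.
        print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the command as a list keeps the whole "-D..." string as a single argument, which is what the quoted -p value in the shell form achieves.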
3.2 mysql -> hdfs
Example 1: full import
# 1. View the configuration template
python datax.py -r mysqlreader -w hdfswriter
# 2. Create and edit the configuration file
vim custom/mysql2hdfs.json
{
"job":{
"setting":{
"speed":{
"channel":1
}
},
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"username":"xxx",
"password":"xxx",
"column":["id","name","age","birthday"],
"connection":[
{
"table":[
"tt_user"
],
"jdbcUrl":[
"jdbc:mysql://192.168.1.96:3306/test"
]
}
]
}
},
"writer":{
"name":"hdfswriter",
"parameter":{
"defaultFS":"hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop":
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"tt_user",
"column":[
{"name":"id", "type":"INT"},
{"name":"name", "type":"VARCHAR"},
{"name":"age", "type":"INT"},
{"name":"birthday", "type":"date"}
],
"writeMode":"append",
"fieldDelimiter":"\t",
"compress":"GZIP"
}
}
}
]
}
}
# 3. Start the import process
python datax.py custom/mysql2hdfs.json
# 4. Log output
2018-11-23 14:37:58.056 [job-0] INFO JobContainer -
Job start time : 2018-11-23 14:37:45
Job end time : 2018-11-23 14:37:58
Total elapsed time : 12s
Average throughput : 9B/s
Record write speed : 0rec/s
Total records read : 7
Total read/write failures : 0
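A missing comma or brace in a job file is easy to overlook, so it can help to check that the file parses as JSON before launching DataX. The following is a minimal sketch; the function name is hypothetical, and it only checks JSON syntax and the presence of job.content, not plugin-specific parameters.

```python
import json


def check_job_text(text):
    """Return None if text looks like a DataX job, else a short error message."""
    try:
        job = json.loads(text)
    except json.JSONDecodeError as err:
        return "invalid JSON at line {}, column {}: {}".format(
            err.lineno, err.colno, err.msg
        )
    if not isinstance(job, dict) or not isinstance(job.get("job"), dict):
        return "missing job section"
    if "content" not in job["job"]:
        return "missing job.content"
    return None


if __name__ == "__main__":
    # The second object lacks a comma between its two keys.
    print(check_job_text('{"job": {"content": []}}'))
    print(check_job_text('{"job": {"setting": {} "content": []}}'))
```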
Example 2: incremental import (table splitting)
{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "admin",
"password": "qweasd123",
"column": [
"id",
"name",
"age",
"birthday"
],
"splitPk": "id",
"where": "id<10",
"connection": [{
"table": [
"tt_user",
"ttt_user"
],
"jdbcUrl": [
"jdbc:mysql://hadoop01:3306/test"
]
}]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType": "text",
"path": "/tmp/test/user",
"fileName": "mysql_test_user",
"column": [{
"name": "id",
"type": "INT"
},
{
"name": "name",
"type": "VARCHAR"
},
{
"name": "age",
"type": "INT"
},
{
"name": "birthday",
"type": "date"
}
],
"writeMode": "append",
"fieldDelimiter": "\t"
}
}
}]
}
}
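Note: machines outside the cluster's network must communicate over external IPs; if hostname-based access is not configured, access fails.
This can be solved by configuring hdfs-site.xml: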
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
<description>only config in clients</description>
</property>
Or by configuring the Java client:
Configuration conf=new Configuration();
conf.set("dfs.client.use.datanode.hostname", "true");
Or through the DataX job configuration:
"hadoopConfig": {
"dfs.client.use.datanode.hostname":"true",
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS00018:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS00019:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
This corresponds to the following part of the source code:
hadoopConf = new org.apache.hadoop.conf.Configuration();
Configuration hadoopSiteParams = taskConfig.getConfiguration(Key.HADOOP_CONFIG);
JSONObject hadoopSiteParamsAsJsonObject = JSON.parseObject(taskConfig.getString(Key.HADOOP_CONFIG));
if (null != hadoopSiteParams) {
Set<String> paramKeys = hadoopSiteParams.getKeys();
for (String each : paramKeys) {
hadoopConf.set(each, hadoopSiteParamsAsJsonObject.getString(each));
}
}
hadoopConf.set(HDFS_DEFAULTFS_KEY, taskConfig.getString(Key.DEFAULT_FS));
Example 3: incremental import (SQL query)
mysql2hdfs-condition.json
{
"job": {
"setting": {
"speed": {
"channel":1
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "xxx",
"password": "xxx",
"connection": [
{
"querySql": [
"select id,name,age,birthday from tt_user where id <= 5"
],
"jdbcUrl": [
"jdbc:mysql://192.168.1.96:3306/test"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter":{
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop":
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"tt_user",
"column":[
{"name":"id", "type":"INT"},
{"name":"name", "type":"VARCHAR"},
{"name":"age", "type":"INT"},
{"name":"birthday", "type":"date"}
],
"writeMode":"append",
"fieldDelimiter":"\t"
}
}
}
]
}
}
hdfs -> mysql
# 1. View the configuration template
python datax.py -r hdfsreader -w mysqlwriter
# 2. Create and edit the configuration file
vim custom/hdfs2mysql.json
{
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [{
"index": "0",
"type": "long"
},
{
"index": "1",
"type": "string"
},
{
"index": "2",
"type": "long"
},
{
"index": "3",
"type": "date"
}
],
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"encoding": "UTF-8",
"fileType": "text",
"path": "/tmp/test/tt_user*",
"fieldDelimiter": "\t"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [
"id",
"name",
"age",
"birthday"
],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.96:3306/test",
"table": ["ttt_user"]
}],
"username": "zhangqingli",
"password": "xxx",
"preSql": [
"select * from ttt_user",
"select name from ttt_user"
],
"session": [
"set session sql_mode='ANSI'"
],
"writeMode": "insert"
}
}
}]
}
}
# 3. Start the import process
python datax.py custom/hdfs2mysql.json
# 4. Log output
Job start time : 2018-11-23 14:44:54
Job end time : 2018-11-23 14:45:06
Total elapsed time : 12s
Average throughput : 9B/s
Record write speed : 0rec/s
Total records read : 7
Total read/write failures : 0
mongo -> hdfs
Example 1: full import
{
"job": {
"setting": {
"speed": {
"channel": 1
}
},
"content": [{
"reader": {
"name": "mongodbreader",
"parameter": {
"address": ["192.168.1.96:27017"],
"userName": "xxxx",
"userPassword": "xxxx",
"dbName": "test",
"collectionName": "student",
"column": [
{"name": "_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "age", "type": "double"},
{"name": "clazz", "type": "double"},
{"name": "hobbies", "type": "Array"},
{"name": "ss", "type": "Array"}
],
"splitter": ","
}
},
"writer": {
"name": "hdfswriter",
"parameter":{
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType":"text",
"path":"/tmp/test01",
"fileName":"mongo_student",
"column":[
{"name": "_id", "type": "string"},
{"name": "name", "type": "string"},
{"name": "age", "type": "double"},
{"name": "clazz", "type": "double"},
{"name": "hobbies", "type": "string"},
{"name": "ss", "type": "string"}
],
"writeMode":"append",
"fieldDelimiter":"\u0001"
}
}
}]
}
}
Example 2: incremental import from MongoDB
{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "mongodbreader",
"parameter": {
"address": ["<address>"],
"userName": "<username>",
"userPassword": "<password>",
"dbName": "<database name>",
"collectionName": "<collection name>",
"query":"{created:{ $gte: ISODate('1990-01-01T16:00:00.000Z'), $lte: ISODate('2010-01-01T16:00:00.000Z') }}",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "owner", "type": "string" },
{ "name": "contributor", "type": "string" },
{ "name": "type", "type": "string" },
{ "name": "amount", "type": "int" },
{ "name": "divided", "type": "double" },
{ "name": "orderId", "type": "string" },
{ "name": "orderPrice", "type": "int" },
{ "name": "created", "type": "date" },
{ "name": "updated", "type": "date" },
{ "name": "hobbies", "type": "Array"}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"fileType": "text",
"path": "/user/hive/warehouse/aries.db/ods_goldsystem_mdaccountitems/accounting_day=$dt",
"fileName": "filenamexxx",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "owner", "type": "string" },
{ "name": "contributor", "type": "string" },
{ "name": "type", "type": "string" },
{ "name": "amount", "type": "int" },
{ "name": "divided", "type": "double" },
{ "name": "orderId", "type": "string" },
{ "name": "orderPrice", "type": "int" },
{ "name": "created", "type": "date" },
{ "name": "updated", "type": "date" },
{ "name": "hobbies", "type": "string"}
],
"writeMode": "append",
"fieldDelimiter": "\t"
}
}
}]
}
}
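The writer path in this example ends in accounting_day=$dt. Assuming the $dt placeholder is filled through -p in the same way as the parameters in section 3.1, the value for a daily run could be computed as sketched below. The helper name, the one-day offset, and the YYYY-MM-DD format are assumptions, not something the original job specifies.

```python
from datetime import date, timedelta


def partition_param(run_date, days_back=1):
    """Return the -D argument for the partition that lies days_back before run_date."""
    dt = run_date - timedelta(days=days_back)
    return "-Ddt={}".format(dt.isoformat())


if __name__ == "__main__":
    # For a run on 2018-11-23 this selects the previous day's partition.
    print(partition_param(date(2018, 11, 23)))
```

The resulting string would be passed as the -p value, for example python datax.py -p "-Ddt=2018-11-22" followed by the job file.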
hdfs -> mongo
{
"job": {
"setting": {
"speed": {
"channel": 2
}
},
"content": [{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [
{ "index": 0, "type": "String" },
{ "index": 1, "type": "String" },
{ "index": 2, "type": "Long" },
{ "index": 3, "type": "Date" }
],
"defaultFS": "hdfs://flashHadoop",
"hadoopConfig": {
"dfs.nameservices": "flashHadoop",
"dfs.ha.namenodes.flashHadoop": "nn1,nn2",
"dfs.namenode.rpc-address.flashHadoop.nn1": "VECS01118:8020",
"dfs.namenode.rpc-address.flashHadoop.nn2": "VECS01119:8020",
"dfs.client.failover.proxy.provider.flashHadoop": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
},
"encoding": "UTF-8",
"fieldDelimiter": "\t",
"fileType": "text",
"path": "/tmp/test/mongo_student*"
}
},
"writer": {
"name": "mongodbwriter",
"parameter": {
"address": [
"192.168.1.96:27017"
],
"userName": "test",
"userPassword": "xxx",
"dbName": "test",
"collectionName": "student_from_hdfs",
"column": [
{ "name": "_id", "type": "string" },
{ "name": "name", "type": "string" },
{ "name": "age", "type": "int" },
{ "name": "birthday", "type": "date" }
],
"splitter": ",",
"upsertInfo": {
"isUpsert": "true",
"upsertKey": "_id"
}
}
}
}]
}
}