1. Introduction to DataX

DataX

DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It implements efficient data synchronization between all kinds of heterogeneous data sources, including MySQL, Oracle, SqlServer, PostgreSQL, HDFS, Hive, ADS, HBase, TableStore (OTS), MaxCompute (ODPS), and DRDS.

Features

As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target; in theory the framework can therefore synchronize between any types of data sources. At the same time the DataX plugin system forms an ecosystem: every newly added data source immediately becomes interoperable with all the existing ones.
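
Concretely, a job is described by a single JSON file that wires one reader to one writer under job.content, with concurrency controlled by job.setting.speed. Below is a minimal sketch of that structure, using the streamreader/streamwriter pair that DataX ships for smoke tests (shape per the official examples; treat the parameter values as illustrative):

{
  "job": {
    "setting": {
      "speed": { "channel": 1 }
    },
    "content": [{
      "reader": {
        "name": "streamreader",
        "parameter": {
          "sliceRecordCount": 10,
          "column": [
            { "type": "string", "value": "hello,DataX" }
          ]
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {
          "print": true
        }
      }
    }]
  }
}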

Installation

Download: DataX download page

After extracting the archive it is ready to use; run a job with a command like the following:

python27 datax.py ..\job\test.json
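
If you are unsure which parameters a plugin takes, datax.py can also print an empty job template for a given reader/writer pair (supported by current DataX releases; the plugin names must match those bundled with your installation):

python27 datax.py -r streamreader -w streamwriter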

2. Data Synchronization with DataX

2.1 MySQL to MySQL

Table creation DDL:

DROP TABLE IF EXISTS `tb_dmp_requser`;
-- NOTE: the varchar lengths did not survive in the original post; 50 is a placeholder
CREATE TABLE `tb_dmp_requser` (
  `reqid` varchar(50) NOT NULL COMMENT 'activity ID',
  `exetype` varchar(50) DEFAULT NULL COMMENT 'execution type',
  `allnum` varchar(50) DEFAULT NULL COMMENT 'total number of target users',
  `exenum` varchar(50) DEFAULT NULL COMMENT 'number of target users executed',
  `resv` varchar(50) DEFAULT NULL,
  `createtime` datetime DEFAULT NULL
)

Copy the tb_dmp_requser table in the dmp database to the tb_dmp_requser table in dota2_databank.

job_mysql_to_mysql.json is as follows. (The speed.channel value, which sets the number of concurrent transfer channels, was blank in the original post; 1 is assumed here and in the later jobs.)

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "column": [
            "allnum", "reqid"
          ],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/dmp"],
            "table": ["tb_dmp_requser"]
          }],
          "password": "",
          "username": "root"
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": [
            "allnum", "reqid"
          ],
          "preSql": [
            "delete from tb_dmp_requser"
          ],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/dota2_databank",
            "table": ["tb_dmp_requser"]
          }],
          "password": "",
          "username": "root",
          "writeMode": "replace"
        }
      }
    }],
    "setting": {
      "speed": {
        "channel": 1
      }
    }
  }
}
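
With the config saved as ..\job\job_mysql_to_mysql.json, run it the same way as the test job above:

python27 datax.py ..\job\job_mysql_to_mysql.json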

2.2 Oracle to Oracle

Copy the test table under the scott user to the test table under the test user.

Table creation DDL:

drop table TEST;

-- NOTE: the NUMBER precision and VARCHAR2 length did not survive in the original post; 50 is a placeholder
CREATE TABLE TEST (
  ID NUMBER NULL,
  NAME VARCHAR2(50 BYTE) NULL
)
LOGGING
NOCOMPRESS
NOCACHE;

job_oracle_oracle.json

{
  "job": {
    "content": [{
      "reader": {
        "name": "oraclereader",
        "parameter": {
          "column": ["id", "name"],
          "connection": [{
            "jdbcUrl": ["jdbc:oracle:thin:@localhost:1521:ORCL"],
            "table": ["test"]
          }],
          "password": "tiger",
          "username": "scott",
          "where": "rownum < 1000"
        }
      },
      "writer": {
        "name": "oraclewriter",
        "parameter": {
          "column": ["id", "name"],
          "connection": [{
            "jdbcUrl": "jdbc:oracle:thin:@localhost:1521:ORCL",
            "table": ["test"]
          }],
          "password": "test",
          "username": "test"
        }
      }
    }],
    "setting": {
      "speed": {
        "channel": 1
      }
    }
  }
}
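
Note the reader's where clause, which restricts the copy to the first 1000 rows. Run the job as before (path assumed):

python27 datax.py ..\job\job_oracle_oracle.json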

2.3 HBase to Local File

Copy the HBase table "LXW" to the local path ../job/datax_hbase.

Create the table and insert two rows of data:

hbase(main):001:0> create 'LXW','CF'
0 row(s) in 1.2120 seconds
=> Hbase::Table - LXW
hbase(main):002:0> put 'LXW','row1','CF:NAME','lxw'
0 row(s) in 0.0120 seconds
hbase(main):003:0> put 'LXW','row1','CF:AGE',''
0 row(s) in 0.0080 seconds
hbase(main):004:0> put 'LXW','row1','CF:ADDRESS','BeijingYiZhuang'
0 row(s) in 0.0070 seconds
hbase(main):005:0> put 'LXW','row2','CF:ADDRESS','BeijingYiZhuang2'
0 row(s) in 0.0060 seconds
hbase(main):006:0> put 'LXW','row2','CF:AGE',''
0 row(s) in 0.0050 seconds
hbase(main):007:0> put 'LXW','row2','CF:NAME','lxw2'
0 row(s) in 0.0040 seconds
hbase(main):008:0> exit
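
The inserted rows can be double-checked with a standard HBase shell scan before exporting:

hbase(main):009:0> scan 'LXW'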

For the HBase high-availability cluster configuration, see https://www.cnblogs.com/Java-Starter/p/10756647.html

job_hbase_to_local.json is as follows:

{
  "job": {
    "content": [{
      "reader": {
        "name": "hbase11xreader",
        "parameter": {
          "hbaseConfig": {
            "hbase.zookeeper.quorum": "CentOS7Five:2181,CentOS7Six:2181,CentOS7Seven:2181"
          },
          "table": "LXW",
          "encoding": "utf-8",
          "mode": "normal",
          "column": [
            { "name": "rowkey", "type": "string" },
            { "name": "CF:NAME", "type": "string" },
            { "name": "CF:AGE", "type": "string" },
            { "name": "CF:ADDRESS", "type": "string" }
          ],
          "range": {
            "startRowkey": "",
            "endRowkey": "",
            "isBinaryRowkey": false
          }
        }
      },
      "writer": {
        "name": "txtfilewriter",
        "parameter": {
          "dateFormat": "yyyy-MM-dd",
          "fieldDelimiter": "\t",
          "fileName": "LXW",
          "path": "../job/datax_hbase",
          "writeMode": "truncate"
        }
      }
    }],
    "setting": {
      "speed": {
        "channel": 1
      }
    }
  }
}
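
Run the job (path assumed as before):

python27 datax.py ../job/job_hbase_to_local.json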

A file named LXW__e647d969_d2c6_47ad_9534_15c90d696099 is generated under the ../job/datax_hbase path.

The file contents are as follows:

row1    lxw     BeijingYiZhuang
row2    lxw2    BeijingYiZhuang2

2.4 Local File to HBase

Import a local file into the HBase table LXW.

Source data source.txt:

row3,jjj1,,BeijingYiZhuang3
row4,jjj2,,BeijingYiZhuang4

job_local_to_hbase.json is as follows. (The column index values here and in the later jobs did not survive in the original post; they are restored from the field order of the source file — index 0 is the rowkey, 1 to 3 map to CF:NAME, CF:AGE, CF:ADDRESS. The reader's charset key is written as encoding, matching the txtfilereader jobs later in this post.)

{
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [{
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "path": ["../job/datax_hbase/source.txt"],
          "encoding": "UTF-8",
          "column": [
            { "index": 0, "type": "string" },
            { "index": 1, "type": "string" },
            { "index": 2, "type": "string" },
            { "index": 3, "type": "string" }
          ],
          "fieldDelimiter": ","
        }
      },
      "writer": {
        "name": "hbase11xwriter",
        "parameter": {
          "hbaseConfig": {
            "hbase.zookeeper.quorum": "CentOS7Five:2181,CentOS7Six:2181,CentOS7Seven:2181"
          },
          "table": "LXW",
          "mode": "normal",
          "rowkeyColumn": [
            { "index": 0, "type": "string" }
          ],
          "column": [
            { "index": 1, "name": "CF:NAME", "type": "string" },
            { "index": 2, "name": "CF:AGE", "type": "string" },
            { "index": 3, "name": "CF:ADDRESS", "type": "string" }
          ],
          "versionColumn": {
            "index": -1,
            "value": ""
          },
          "encoding": "utf-8"
        }
      }
    }]
  }
}
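
Run the job (path assumed as before):

python27 datax.py ../job/job_local_to_hbase.json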

After the import, the newly added rows are visible:

hbase(main):001:0> get 'LXW','row3'
COLUMN                CELL
 CF:ADDRESS           timestamp=..., value=BeijingYiZhuang3
 CF:AGE               timestamp=..., value=
 CF:NAME              timestamp=..., value=jjj1

2.5 Local File to HDFS/Hive

Importing from HDFS to local does not support high availability, so that experiment is skipped here.

For the Hive high-availability configuration, see https://www.cnblogs.com/Java-Starter/p/10756528.html

Import a local data file into HDFS/Hive; the table must be created in Hive before the import can run.

Because of path issues, this can only be run on the Linux side.

Source data source.txt:

,,,
,,,

Table creation DDL:

-- NOTE: the varchar lengths did not survive in the original post; 50 is a placeholder
create table datax_test(
  col1 varchar(50),
  col2 varchar(50),
  col3 varchar(50),
  col4 varchar(50)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC;

fileType should be orc; the text type must be compressed and may come out garbled.

job_local_to_hdfs.json

{
  "setting": {},
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [{
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "path": ["../job/datax_hbase/source.txt"],
          "encoding": "UTF-8",
          "column": [
            { "index": 0, "type": "string" },
            { "index": 1, "type": "string" },
            { "index": 2, "type": "string" },
            { "index": 3, "type": "string" }
          ],
          "fieldDelimiter": ","
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://ns1/",
          "hadoopConfig": {
            "dfs.nameservices": "ns1",
            "dfs.ha.namenodes.ns1": "nn1,nn2",
            "dfs.namenode.rpc-address.ns1.nn1": "CentOS7One:9000",
            "dfs.namenode.rpc-address.ns1.nn2": "CentOS7Two:9000",
            "dfs.client.failover.proxy.provider.ns1": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
          },
          "fileType": "orc",
          "path": "/user/hive/warehouse/datax_test",
          "fileName": "datax_test",
          "column": [
            { "name": "col1", "type": "VARCHAR" },
            { "name": "col2", "type": "VARCHAR" },
            { "name": "col3", "type": "VARCHAR" },
            { "name": "col4", "type": "VARCHAR" }
          ],
          "writeMode": "append",
          "fieldDelimiter": ",",
          "compress": "NONE"
        }
      }
    }]
  }
}
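
Run the job on the Linux side (assuming python 2.7 is on the PATH as python):

python datax.py ../job/job_local_to_hdfs.json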

Once the import finishes, check in Hive:

hive> select * from datax_test;
OK
Time taken: 0.085 seconds, Fetched: 2 row(s)

2.6 txt to Oracle

txt, dat, csv and similar formats all work. The dat file used here is 16 GB, with 180 million records.

Table creation DDL:

-- NOTE: the VARCHAR2 lengths did not survive in the original post; 50 is a placeholder
CREATE TABLE T_CJYX_HOMECOUNT (
  "ACYC_ID" VARCHAR2(50 BYTE) NULL,
  "ADDRESS_ID" VARCHAR2(50 BYTE) NULL,
  "ADDRESS_NAME" VARCHAR2(50 BYTE) NULL,
  "ADDRESS_LEVEL" VARCHAR2(50 BYTE) NULL,
  "CHECK_TARGET_NUM" VARCHAR2(50 BYTE) NULL,
  "CHECK_VALUE" VARCHAR2(50 BYTE) NULL,
  "TARGET_PHONE" VARCHAR2(50 BYTE) NULL,
  "NOTARGET_PHONE" VARCHAR2(50 BYTE) NULL,
  "PARENT_ID" VARCHAR2(50 BYTE) NULL,
  "BCYC_ID" VARCHAR2(50 BYTE) NULL
)

The job_txt_to_oracle.json file is as follows:

{
  "setting": {},
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [{
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "path": ["E:/opt/srcbigdata2/di_00121_20190427.dat"],
          "encoding": "UTF-8",
          "nullFormat": "",
          "column": [
            { "index": 0, "type": "string" },
            { "index": 1, "type": "string" },
            { "index": 2, "type": "string" },
            { "index": 3, "type": "string" },
            { "index": 4, "type": "string" },
            { "index": 5, "type": "string" },
            { "index": 6, "type": "string" },
            { "index": 7, "type": "string" },
            { "index": 8, "type": "string" },
            { "index": 9, "type": "string" }
          ],
          "fieldDelimiter": "$"
        }
      },
      "writer": {
        "name": "oraclewriter",
        "parameter": {
          "column": ["acyc_id","address_id","address_name","address_level","check_target_num","check_value","target_phone","notarget_phone","parent_id","bcyc_id"],
          "connection": [{
            "jdbcUrl": "jdbc:oracle:thin:@localhost:1521:ORCL",
            "table": ["T_CJYX_HOMECOUNT"]
          }],
          "password": "test",
          "username": "test"
        }
      }
    }]
  }
}

The run script:

python27 datax.py ../job/job_txt_to_oracle.json

This is far more efficient than Oracle's own sqlldr: the 180 million rows imported in only 117 minutes, while sqlldr needed 41 hours.

2.7 txt to txt

job_txt_to_txt.json is as follows (the writer's date format key is written as dateFormat, matching the txtfilewriter job in section 2.3):

{
  "setting": {},
  "job": {
    "setting": {
      "speed": {
        "channel": 1
      }
    },
    "content": [{
      "reader": {
        "name": "txtfilereader",
        "parameter": {
          "path": ["../job/data_txt/a.txt"],
          "encoding": "UTF-8",
          "column": [
            { "index": 0, "type": "string" },
            { "index": 1, "type": "string" }
          ],
          "fieldDelimiter": "$"
        }
      },
      "writer": {
        "name": "txtfilewriter",
        "parameter": {
          "path": "../job/data_txt/",
          "fileName": "luohw",
          "writeMode": "truncate",
          "dateFormat": "yyyy-MM-dd"
        }
      }
    }]
  }
}
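
Run the job (path assumed as before):

python27 datax.py ../job/job_txt_to_txt.json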

When the import completes, the output file is generated under ../job/data_txt/, named with the luohw prefix plus a random suffix (as with the HBase export in section 2.3).
