DataX - [02] Installation and Deployment
OS: Alibaba Cloud Linux release 3 (Soaring Falcon)
Java: 1.8.0_372
Python: 3.6.8 => 2.7.1 (switched to Python 2; see the Q & A below)
1. Installation and Deployment
(1) Download DataX: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

(2) Extract it to a suitable directory: tar -zxvf datax.tar.gz -C /home/ecs-user/module/

(3) Enter the bin directory and run python datax.py etl_job.json to start a sync job.
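The launch in step (3) can also be scripted. Below is a minimal Python wrapper — a sketch, not part of DataX itself: DATAX_HOME and etl_job.json are the paths used in this post, so adjust them for your machine.

```python
import subprocess
import sys

DATAX_HOME = "/home/ecs-user/module/datax"  # extract location from step (2)

def datax_cmd(job_json, datax_home=DATAX_HOME):
    """Build the same command line as step (3)."""
    return [sys.executable, "{}/bin/datax.py".format(datax_home), job_json]

def run_job(job_json):
    """Run a DataX job and return its exit code."""
    return subprocess.run(datax_cmd(job_json)).returncode

print(datax_cmd("etl_job.json"))
```

(Python 3 syntax; under the Python 2 setup from the Q & A below, replace subprocess.run with subprocess.call.)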

2. Configuration Example
(1) Print the job configuration template: python datax.py -r streamreader -w streamwriter

(2) The template output looks like this:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
(3) Create the example job from the template:
python /home/ecs-user/module/datax/bin/datax.py -r streamreader -w streamwriter >> ./stream2stream.json
Note that the redirected output also contains the banner lines printed above the JSON; delete everything before the opening { and fill in the empty fields, otherwise the file is not valid JSON.
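Instead of hand-editing the redirected template, you can also write the filled-in job file directly. The sketch below uses the same field values as the run shown later in this post:

```python
import json

# stream2stream job, filled in with the values used in this post
job = {
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [
                            {"type": "long", "value": "10"},
                            {"type": "string", "value": "hello, 你好,世界 - DataX"},
                        ],
                        "sliceRecordCount": 10,
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "UTF-8", "print": True},
                },
            }
        ],
        "setting": {"speed": {"channel": "5"}},
    }
}

with open("stream2stream.json", "w", encoding="utf-8") as f:
    json.dump(job, f, ensure_ascii=False, indent=4)
```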

(4) Start DataX with the job file:
python /home/ecs-user/module/datax/bin/datax.py /home/ecs-user/datax_job/stream2stream.json
(My first attempts failed; the errors and fixes are covered in the Q & A below.)


3. Q & A
3.1 First error: the print statements in datax.py use Python 2 syntax without parentheses (print "Hello World"), but the installed Python was 3.6.8, where print is a function and requires parentheses, so the script fails with a SyntaxError.
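The failure is easy to reproduce: Python 3 rejects a parenthesis-free print at compile time, before any code runs:

```python
# Compile a Python 2 style print statement under Python 3:
src = 'print "Hello World"'
try:
    compile(src, "datax.py", "exec")
except SyntaxError as exc:
    print("SyntaxError:", exc.msg)
```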

Fix:
(1) Inspect the python-related entries under /usr/bin and /etc/alternatives.

(2) Repoint the python symlink in /etc/alternatives at /usr/bin/python2.

(3) There is also an unversioned-python symlink that must be repointed at /usr/bin/python2.

(4) Run python and confirm the version has changed.
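A read-only helper to verify the result of steps (1)-(4): follow the symlink chain behind /usr/bin/python. This is a sketch assuming the RHEL-style alternatives layout described above; adjust the path if your layout differs.

```python
import os

def chase(link, limit=10):
    """Follow a chain of symlinks, returning every hop."""
    hops = [link]
    while os.path.islink(link) and limit > 0:
        target = os.readlink(link)
        link = os.path.normpath(os.path.join(os.path.dirname(link), target))
        hops.append(link)
        limit -= 1
    return hops

# expected shape after the fix (assumption, depends on your layout):
# /usr/bin/python -> /etc/alternatives/python -> /usr/bin/python2
for hop in chase("/usr/bin/python"):
    print(hop)
```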

3.2 Error: the plugin's ._xxxxx/plugin.json configuration file does not exist

Fix:
(1) My first idea was to create a symlink named ._txtfilereder pointing at the directory that contains plugin.json.

(2) That clearly did not solve the problem. What did work was deleting every file beginning with ._ under the reader and writer subdirectories of the plugin directory.
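Step (2) as a small script. The ._* entries are apparently macOS metadata shipped in the tarball; DATAX_HOME is the install path from this post, so adjust it as needed:

```python
import glob
import os
import shutil

DATAX_HOME = "/home/ecs-user/module/datax"  # adjust to your install dir

def clean_appledouble(datax_home):
    """Delete ._* entries that DataX mistakes for plugin directories."""
    removed = []
    for sub in ("reader", "writer"):
        pattern = os.path.join(datax_home, "plugin", sub, "._*")
        for path in glob.glob(pattern):
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
            removed.append(path)
    return removed

print(clean_appledouble(DATAX_HOME))
```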

(3) Then start DataX again and run my job.
Result:
[ecs-user@harley63 plugin]$ python /home/ecs-user/module/datax/bin/datax.py /home/ecs-user/datax_job/stream2stream.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2024-08-01 09:47:20.233 [main] WARN ConfigParser - 插件[streamreader,streamwriter]加载失败,1s后重试... Exception:Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[/home/ecs-user/module/datax/plugin/writer/._mongodbwriter/plugin.json]不存在. 请检查您的配置文件.
2024-08-01 09:47:21.247 [main] ERROR Engine -
经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Common-00], Describe:[您提供的配置文件存在错误信息,请检查您的作业配置 .] - 配置信息错误,您提供的配置文件[/home/ecs-user/module/datax/plugin/writer/._mongodbwriter/plugin.json]不存在. 请检查您的配置文件.
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.common.util.Configuration.from(Configuration.java:95)
at com.alibaba.datax.core.util.ConfigParser.parseOnePluginConfig(ConfigParser.java:153)
at com.alibaba.datax.core.util.ConfigParser.parsePluginConfig(ConfigParser.java:134)
at com.alibaba.datax.core.util.ConfigParser.parse(ConfigParser.java:63)
at com.alibaba.datax.core.Engine.entry(Engine.java:137)
at com.alibaba.datax.core.Engine.main(Engine.java:204)
[ecs-user@harley63 plugin]$
[ecs-user@harley63 plugin]$ rm -f /home/ecs-user/module/datax/plugin/writer/._*
[ecs-user@harley63 plugin]$ python /home/ecs-user/module/datax/bin/datax.py /home/ecs-user/datax_job/stream2stream.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2024-08-01 09:47:46.596 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2024-08-01 09:47:46.604 [main] INFO Engine - the machine info =>
osInfo: Alibaba 1.8 25.372-b03
jvmInfo: Linux amd64 5.10.134-16.3.al8.x86_64
cpu num: 2
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [ParNew, ConcurrentMarkSweep]
MEMORY_NAME | allocation_size | init_size
Par Survivor Space | 16.63MB | 16.63MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
Metaspace | -0.00MB | 0.00MB
Par Eden Space | 133.13MB | 133.13MB
CMS Old Gen | 857.63MB | 857.63MB
2024-08-01 09:47:46.621 [main] INFO Engine -
{
    "content":[
        {
            "reader":{
                "name":"streamreader",
                "parameter":{
                    "column":[
                        {
                            "type":"long",
                            "value":"10"
                        },
                        {
                            "type":"string",
                            "value":"hello, 你好,世界 - DataX"
                        }
                    ],
                    "sliceRecordCount":10
                }
            },
            "writer":{
                "name":"streamwriter",
                "parameter":{
                    "encoding":"UTF-8",
                    "print":true
                }
            }
        }
    ],
    "setting":{
        "speed":{
            "channel":"5"
        }
    }
}
2024-08-01 09:47:46.642 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2024-08-01 09:47:46.645 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2024-08-01 09:47:46.645 [main] INFO JobContainer - DataX jobContainer starts job.
2024-08-01 09:47:46.646 [main] INFO JobContainer - Set jobId = 0
2024-08-01 09:47:46.670 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2024-08-01 09:47:46.670 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2024-08-01 09:47:46.670 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2024-08-01 09:47:46.670 [job-0] INFO JobContainer - jobContainer starts to do split ...
2024-08-01 09:47:46.671 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels.
2024-08-01 09:47:46.671 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2024-08-01 09:47:46.672 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2024-08-01 09:47:46.690 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2024-08-01 09:47:46.699 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2024-08-01 09:47:46.701 [job-0] INFO JobContainer - Running by standalone Mode.
2024-08-01 09:47:46.714 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2024-08-01 09:47:46.721 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2024-08-01 09:47:46.722 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2024-08-01 09:47:46.737 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2024-08-01 09:47:46.742 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2024-08-01 09:47:46.750 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
2024-08-01 09:47:46.753 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
2024-08-01 09:47:46.764 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
10 hello, 你好,世界 - DataX
2024-08-01 09:47:46.865 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[126]ms
2024-08-01 09:47:46.865 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[135]ms
2024-08-01 09:47:46.865 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[110]ms
2024-08-01 09:47:46.866 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[114]ms
2024-08-01 09:47:46.866 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[122]ms
2024-08-01 09:47:46.866 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2024-08-01 09:47:56.721 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 1100 bytes | Speed 110B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-08-01 09:47:56.722 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2024-08-01 09:47:56.722 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2024-08-01 09:47:56.722 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2024-08-01 09:47:56.722 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2024-08-01 09:47:56.723 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /home/ecs-user/module/datax/hook
2024-08-01 09:47:56.724 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
ParNew | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
ConcurrentMarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2024-08-01 09:47:56.724 [job-0] INFO JobContainer - PerfTrace not enable!
2024-08-01 09:47:56.724 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 1100 bytes | Speed 110B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-08-01 09:47:56.725 [job-0] INFO JobContainer -
任务启动时刻 : 2024-08-01 09:47:46
任务结束时刻 : 2024-08-01 09:47:56
任务总计耗时 : 10s
任务平均流量 : 110B/s
记录写入速度 : 5rec/s
读出记录总数 : 50
读写失败总数 : 0
[ecs-user@harley63 plugin]$