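1. What is DistCp

　　DistCp (distributed copy) is a tool for large-scale copying within a cluster and between clusters. It uses Map/Reduce for file distribution, error handling and recovery, and report generation. It takes a list of files and directories as the input to the map tasks, and each task copies a portion of the files in the source list. Because it is built on Map/Reduce, the tool has some particular semantics and execution behavior.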

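1.1 Notes on using DistCp

　　1. DistCp tries to divide the content to be copied evenly, so that each map copies roughly the same amount of data. However, because a file is the smallest unit of copying, configuring more simultaneous copiers (that is, maps) does not necessarily increase the number of actual simultaneous copies or the total throughput.

　　2. If the -m option is not used, DistCp tries to set the number of maps at scheduling time to min(total_bytes / bytes.per.map, 20 * num_task_trackers), where bytes.per.map defaults to 256 MB.

　　3. For long-running or regularly scheduled jobs, tune the number of maps according to the sizes of the source and target clusters, the amount of data to copy, and the available bandwidth.

　　4. For copies between different Hadoop versions, use HftpFileSystem. It is a read-only file system, so DistCp must run on the target cluster (more precisely, on TaskTrackers that can write to the target cluster). The source takes the form hftp://<dfs.http.address>/<path> (by default, dfs.http.address is <namenode>:50070).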
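
The default map count from note 2 can be estimated ahead of time to help with the tuning suggested in note 3. The following is a minimal sketch, assuming POSIX sh; estimate_maps is a hypothetical helper that only evaluates the quoted formula and is not part of Hadoop, and the floor of one map is an assumption of this sketch.

```shell
#!/bin/sh
# estimate_maps is a hypothetical helper, not part of Hadoop. It only
# evaluates the formula quoted in note 2:
#   min(total_bytes / bytes.per.map, 20 * num_task_trackers)
# The body runs in a subshell "( ... )" so its variables stay local.
estimate_maps() (
  total_bytes=$1
  num_task_trackers=$2
  bytes_per_map=${3:-268435456}   # 256 MB, the default quoted in note 2
  by_size=$(( total_bytes / bytes_per_map ))
  by_nodes=$(( 20 * num_task_trackers ))
  # Assumption of this sketch: never report fewer than one map.
  if [ "$by_size" -lt 1 ]; then
    by_size=1
  fi
  if [ "$by_size" -lt "$by_nodes" ]; then
    echo "$by_size"
  else
    echo "$by_nodes"
  fi
)

# 1 TiB on a cluster with 10 TaskTrackers: the size term is 4096 and the
# node term is 200, so the node term is the smaller one.
estimate_maps 1099511627776 10
```

If the estimate is far from what the cluster can sustain, an explicit -m value is probably the better choice.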

2. Hadoop DistCp command usage

[root@node105 ~]# hadoop distcp
usage: distcp OPTIONS [source_path...] <target_path>
              OPTIONS
 -append                       Reuse existing data in target files and
                               append new data to them if possible
 -async                        Should distcp execution be blocking
 -atomic                       Commit all changes or none
 -bandwidth <arg>              Specify bandwidth per map in MB
 -blocksperchunk <arg>         If set to a positive value, files with more
                               blocks than this value will be split into
                               chunks of <blocksperchunk> blocks to be
                               transferred in parallel, and reassembled on
                               the destination. By default,
                               <blocksperchunk> is  and the files will be
                               transmitted in their entirety without
                               splitting. This switch is only applicable
                               when the source file system implements
                               getBlockLocations method and the target
                               file system implements concat method
 -copybuffersize <arg>         Size of the copy buffer to use. By default
                               <copybuffersize> is 8192B.
 -delete                       Delete from target, files missing in source
 -diff <arg>                   Use snapshot diff report to identify the
                               difference between source and target
 -f <arg>                      List of files that need to be copied
 -filelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n
 -filters <arg>                The path to a file containing a list of
                               strings for paths to be excluded from the
                               copy.
 -i                            Ignore failures during copy
 -log <arg>                    Folder on DFS where distcp execution logs
                               are saved
 -m <arg>                      Max number of concurrent maps to use for
                               copy
 -mapredSslConf <arg>          Configuration for ssl config file, to use
                               with hftps://. Must be in the classpath.
 -numListstatusThreads <arg>   Number of threads to use for building file
                               listing (max ).
 -overwrite                    Choose to overwrite target files
                               unconditionally, even if they exist.
 -p <arg>                      preserve status (rbugpcaxt)(replication,
                               block-size, user, group, permission,
                               checksum-type, ACL, XATTR, timestamps). If
                               -p is specified with no <arg>, then
                               preserves replication, block size, user,
                               group, permission, checksum type and
                               timestamps. raw.* xattrs are preserved when
                               both the source and destination paths are
                               in the /.reserved/raw hierarchy (HDFS
                               only). raw.* xattr preservation is
                               independent of the -p flag. Refer to the
                               DistCp documentation for more details.
 -rdiff <arg>                  Use target snapshot diff report to identify
                               changes made on target
 -sizelimit <arg>              (Deprecated!) Limit number of files copied
                               to <= n bytes
 -skipcrccheck                 Whether to skip CRC checks between source
                               and target paths.
 -strategy <arg>               Copy strategy to use. Default is dividing
                               work based on file sizes
 -tmp <arg>                    Intermediate work path to be used for
                               atomic commit
 -update                       Update target, copying only missing files or
                               directories
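
Several of the options listed above are commonly combined for repeated or incremental copies. The sketch below, assuming POSIX sh, only prints a command line so it can be reviewed before it is run; build_distcp_cmd is a hypothetical helper, and the map count, bandwidth and log directory values are illustrative, not recommendations.

```shell
#!/bin/sh
# build_distcp_cmd is a hypothetical helper: it prints a distcp command
# line and does not execute anything. It uses only options that appear in
# the usage text above (-update, -m, -bandwidth, -log).
# The body runs in a subshell "( ... )" so its variables stay local.
build_distcp_cmd() (
  src=$1            # source URI
  dst=$2            # target URI
  maps=$3           # upper bound on concurrent maps (-m)
  bandwidth_mb=$4   # per-map bandwidth in MB (-bandwidth)
  log_dir=$5        # folder on DFS for execution logs (-log)
  echo "hadoop distcp -update -m $maps -bandwidth $bandwidth_mb -log $log_dir $src $dst"
)

# Illustrative values only; the URIs are the ones used in the test case.
build_distcp_cmd \
  hdfs://calculation101:8020/test/2018/10/23 \
  hdfs://node105:8020/yangjianqiu/data \
  20 50 /yangjianqiu/distcp-logs
```

Printing rather than executing keeps the sketch safe to try on a machine without a cluster. With -update, DistCp copies the contents of the source directory into the target rather than the source directory itself, so the resulting layout is worth checking on a small path first.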

3. Test case

　　1. Inspect the files to be migrated:

[root@calculation101 ~]# hdfs dfs -du -h /test///

　　2. Create a test directory on the new cluster:

[hdfs@node105 root]$ hdfs dfs -mkdir -p /yangjianqiu/data/
[hdfs@node105 root]$ hdfs dfs -chown -R root:root /yangjianqiu/data/
[hdfs@node105 root]$ exit
exit
[root@node105 ~]# hdfs dfs -ls /yangjianqiu
Found  items
drwxr-xr-x   - root root           -- : /yangjianqiu/data

　　3. Start the migration, writing the output and the elapsed time to a log file:

[root@node105 ~]# mkdir /yangjianqiu
[root@node105 ~]# nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
[]
[root@node105 ~]# jobs
[]+ Running nohup time hadoop distcp hdfs://calculation101:8020/test/2018/10/23 hdfs://node105:8020/yangjianqiu/data >> /yangjianqiu/distcp.log 2>&1 &
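
Once the background job has finished, one quick sanity check is to compare the file count and total size on both sides. The sketch below, assuming POSIX sh, works on lines in the format printed by hdfs dfs -count (DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME); count_summary and same_counts are hypothetical helpers, and paths containing spaces are not handled.

```shell
#!/bin/sh
# count_summary and same_counts are hypothetical helpers, not Hadoop tools.
# Each argument is one line of `hdfs dfs -count <path>` output:
#   DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME

# Print "FILE_COUNT CONTENT_SIZE" for one -count line.
# The body runs in a subshell "( ... )", so set -f and set -- stay local.
count_summary() (
  set -f        # disable globbing while the line is word-split
  set -- $1     # intentional word splitting into fields
  echo "$2 $3"
)

# Succeed when both lines report the same file count and byte total.
# The directory count is deliberately ignored, because the target may hold
# one extra directory level depending on how the copy was invoked.
same_counts() {
  [ "$(count_summary "$1")" = "$(count_summary "$2")" ]
}

# On a real cluster the two lines would come from, for example:
#   src=$(hdfs dfs -count hdfs://calculation101:8020/test/2018/10/23)
#   dst=$(hdfs dfs -count hdfs://node105:8020/yangjianqiu/data)
# Made-up sample lines keep this sketch runnable without a cluster:
src="           3           12               4096 /src/path"
dst="           4           12               4096 /dst/path"
if same_counts "$src" "$dst"; then
  echo "file count and size match"
else
  echo "mismatch: inspect /yangjianqiu/distcp.log"
fi
```

Matching counts do not prove the contents are identical; they are only a cheap first check before a closer comparison.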

4. Calling the DistCp interface from an application

Summary


