Reference: http://rsync.samba.org/how-rsync-works.html

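What we care about here is the algorithm rsync uses to send, receive, and verify files. Below is the original text along with my wife's (^_^) translation: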

The Sender

The sender process reads the file index numbers and associated block checksum sets one at a time from the generator.

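The sender process reads from the generator one file index number at a time, together with its associated set of block checksums.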

For each file id the generator sends it will store the block checksums and build a hash index of them for rapid lookup.

For every file ID the generator sends, the sender stores the block checksums and builds a hash index over them so that lookups are fast.
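To make the "hash index for rapid lookup" idea concrete, here is a minimal sketch. It is not rsync's actual code: `zlib.adler32` and MD5 are stand-ins for rsync's weak and strong block checksums, and `BLOCK_SIZE`, `block_signatures`, `build_hash_index`, and `lookup` are hypothetical names chosen for illustration. `block_signatures` plays the role of the generator; the other two functions play the role of the sender.

```python
import hashlib
import zlib
from collections import defaultdict

BLOCK_SIZE = 4  # hypothetical; tiny so the example is easy to follow


def block_signatures(basis: bytes, block_size: int = BLOCK_SIZE):
    """Yield (block_index, weak, strong) for each block of the basis file."""
    for index, start in enumerate(range(0, len(basis), block_size)):
        block = basis[start:start + block_size]
        weak = zlib.adler32(block)            # stand-in for the weak checksum
        strong = hashlib.md5(block).digest()  # stand-in for the strong checksum
        yield index, weak, strong


def build_hash_index(signatures):
    """Map weak checksum -> list of (strong checksum, block index)."""
    table = defaultdict(list)
    for index, weak, strong in signatures:
        table[weak].append((strong, index))
    return table


def lookup(table, block: bytes):
    """Return the index of a matching basis block, or None if there is none."""
    # The cheap weak checksum narrows the candidates; the strong one confirms.
    for strong, index in table.get(zlib.adler32(block), ()):
        if hashlib.md5(block).digest() == strong:
            return index
    return None
```

Keying the table on the cheap checksum means most non-matching windows are rejected with a single dictionary probe, and the more expensive strong checksum is only computed for candidates.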

Then the local file is read and a checksum is generated for the block beginning with the first byte of the local file. This block checksum is looked for in the set that was sent by the generator, and if no match is found, the non-matching byte will be appended to the non-matching data and the block starting at the next byte will be compared. This is what is referred to as the “rolling checksum”

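The sender then reads the local file and computes a checksum for the block that starts at the file's first byte. It looks this checksum up in the set sent by the generator. If there is no match, the unmatched byte is appended to the non-matching data and the block starting at the next byte is compared. This is the "rolling checksum".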
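What makes sliding the window one byte at a time affordable is that the checksum can be updated from the previous value instead of being recomputed. The sketch below is modeled on the Adler-32-style weak checksum described in the rsync technical report (two 16-bit sums `a` and `b`); the function names are hypothetical and the real implementation may differ in detail.

```python
MOD = 1 << 16


def weak_checksum(block: bytes):
    """Compute the (a, b) pair from scratch for one block."""
    n = len(block)
    a = sum(block) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, block_len: int):
    """Slide the window one byte: drop out_byte on the left, add in_byte on the right."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b


def rolling_checksums(data: bytes, block_len: int):
    """Yield (offset, a, b) for every block_len-byte window of data."""
    if len(data) < block_len:
        return
    a, b = weak_checksum(data[:block_len])
    yield 0, a, b
    for k in range(1, len(data) - block_len + 1):
        # Constant work per step, regardless of block_len.
        a, b = roll(a, b, data[k - 1], data[k + block_len - 1], block_len)
        yield k, a, b
```

Each value produced by `roll` equals what `weak_checksum` would return for the same window, but at a cost that does not depend on the block length.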

If a block checksum match is found it is considered a matching block and any accumulated non-matching data will be sent to the receiver followed by the offset and length in the receiver's file of the matching block and the block checksum generator will be advanced to the next byte after the matching block.

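When a block checksum matches, the block is treated as a matching block: any accumulated non-matching data is sent to the receiver first, followed by the matching block's offset and length in the receiver's file, and the block checksum generator then advances to the first byte after the matching block.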

Matching blocks can be identified in this way even if the blocks are reordered or at different offsets. This process is the very heart of the rsync algorithm.

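Matching blocks can be identified this way even when the blocks have been reordered or sit at different offsets. This process is the core of the rsync algorithm.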
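The sender loop described above can be sketched end to end as follows. This is an illustration under stated assumptions, not rsync's implementation: `zlib.adler32` and MD5 stand in for the real checksums, the window checksum is recomputed at each position rather than rolled (to keep the code short), and the `("data", ...)` / `("match", offset, length)` record format is invented for this example.

```python
import hashlib
import zlib


def signatures(basis: bytes, block_size: int):
    """Index the basis blocks: weak checksum -> [(strong, offset, length)]."""
    table = {}
    for start in range(0, len(basis), block_size):
        block = basis[start:start + block_size]
        entry = (hashlib.md5(block).digest(), start, len(block))
        table.setdefault(zlib.adler32(block), []).append(entry)
    return table


def make_delta(source: bytes, table, block_size: int):
    """Return a list of ("data", bytes) and ("match", offset, length) records."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(source):
        window = source[pos:pos + block_size]
        hit = None
        for strong, offset, length in table.get(zlib.adler32(window), ()):
            if length == len(window) and hashlib.md5(window).digest() == strong:
                hit = (offset, length)
                break
        if hit is None:
            literal.append(source[pos])  # non-matching byte joins the literal run
            pos += 1                     # and the window slides by one byte
            continue
        if literal:                      # flush accumulated non-matching data first
            delta.append(("data", bytes(literal)))
            literal.clear()
        delta.append(("match", hit[0], hit[1]))
        pos += hit[1]                    # continue right after the matched block
    if literal:
        delta.append(("data", bytes(literal)))
    return delta
```

For a basis of `b"AAAABBBBCCCC"` and a source of `b"CCCCxxAAAABBBB"` with a block size of 4, every basis block is found again even though the blocks were reordered and shifted; only the two `x` bytes travel as literal data.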

In this way, the sender will give the receiver instructions for how to reconstruct the source file into a new destination file. These instructions detail all the matching data that can be copied from the basis file (if one exists for the transfer), and include any raw data that was not available locally. At the end of each file's processing a whole-file checksum is sent and the sender proceeds with the next file.

Generating the rolling checksums and searching for matches in the checksum set sent by the generator require a good deal of CPU power. Of all the rsync processes it is the sender that is the most CPU intensive.

The Receiver

The receiver will read from the sender data for each file identified by the file index number. It will open the local file (called the basis) and will create a temporary file.

The receiver will expect to read non-matched data and/or to match records all in sequence for the final file contents. When non-matched data is read it will be written to the temp-file. When a block match record is received the receiver will seek to the block offset in the basis file and copy the block to the temp-file. In this way the temp-file is built from beginning to end.
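The receiver side is simple by comparison, as this sketch shows. It assumes a hypothetical instruction format of `("data", bytes)` records for non-matched data and `("match", offset, length)` records for blocks to copy from the basis file; the real wire format is different.

```python
import io


def apply_delta(basis, delta, out):
    """Rebuild the new file in `out` from `basis` plus a sequence of records.

    `basis` must be a seekable binary file object; `out` a writable one.
    """
    for record in delta:
        if record[0] == "data":
            out.write(record[1])           # raw data goes straight to the temp-file
        elif record[0] == "match":
            _, offset, length = record
            basis.seek(offset)             # possibly random access into the basis
            out.write(basis.read(length))
        else:
            raise ValueError(f"unknown record type: {record[0]!r}")


# Usage: in-memory files stand in for the basis file and the temp-file.
basis = io.BytesIO(b"AAAABBBBCCCC")
temp = io.BytesIO()
apply_delta(basis, [("match", 8, 4), ("data", b"xx"), ("match", 0, 4)], temp)
```

Note that the output is written strictly front to back, while reads from the basis file can jump around.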

The file's checksum is generated as the temp-file is built. At the end of the file, this checksum is compared with the file checksum from the sender. If the file checksums do not match the temp-file is deleted. If the file fails once it will be reprocessed in a second phase, and if it fails twice an error is reported.

After the temp-file has been completed, its ownership and permissions and modification time are set. It is then renamed to replace the basis file.
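The verify-then-rename step can be sketched like this. It is a simplification under assumptions: MD5 stands in for the whole-file checksum, the checksum is computed by re-reading the finished temp-file rather than incrementally while it is built, ownership changes are left out because they need privileges, and `commit_temp_file` is a hypothetical name.

```python
import hashlib
import os


def commit_temp_file(temp_path, dest_path, expected_digest, mode, mtime):
    """Verify the temp-file, apply metadata, then rename it over the destination.

    Returns True on success, False if the checksum did not match.
    """
    digest = hashlib.md5()
    with open(temp_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    if digest.digest() != expected_digest:
        os.remove(temp_path)              # mismatch: discard the temp-file
        return False
    os.chmod(temp_path, mode)             # permissions
    os.utime(temp_path, (mtime, mtime))   # (access time, modification time)
    os.replace(temp_path, dest_path)      # rename over the basis file
    return True
```

Setting the metadata before the rename means the destination never appears with the right contents but the wrong permissions or timestamp. A caller that gets `False` back would requeue the file for a second attempt.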

Copying data from the basis file to the temp-file makes the receiver the most disk intensive of all the rsync processes. Small files may still be in disk cache, mitigating this, but for large files the cache may thrash as the generator has moved on to other files and there is further latency caused by the sender. As data is read possibly at random from one file and written to another, if the working set is larger than the disk cache, then what is called a seek storm can occur, further hurting performance.

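Copying data from the basis file to the temp-file makes the receiver the most disk-hungry of all the rsync processes. Small files may still be in the disk cache, which softens the effect, but for large files the cache may "thrash", because the generator has already moved on to other files and the sender adds further latency. Data may be read at random from one file and written to another, so if the working set is larger than the disk cache, a "seek storm" can occur and hurt performance further.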

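If this still leaves you confused, fine: I just ran a quick Bing search, and someone has already done the hard work, with diagrams as proof: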

http://coolshell.cn/articles/7425.html#more-7425

Appendix: rsync server configuration:

1. Edit /etc/rsyncd.conf

uid=nobody
gid=nobody
use chroot = yes
max connections = 4
log file = /var/log/rsyncd.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsync.lock
[MYSERVER]
path = /
read only = no
write only = no
list = yes
uid = root
gid = root
auth users = root
secrets file = /root/rsyncd.secrets

2. Create /root/rsyncd.secrets
root:123456

Set the file's permissions to 400; otherwise rsync reports an error
chmod 400 /root/rsyncd.secrets

3. Edit /etc/xinetd.d/rsync

# default: off
# description: The rsync server is a good addition to an ftp server, as it \
# allows crc checksumming etc.
service rsync
{
disable = no
flags = IPv6
socket_type = stream
wait = no
user = root
server = /usr/bin/rsync
server_args = --daemon --config=/etc/rsyncd.conf
log_on_failure += USERID
}

4. Install xinetd (if it is not already installed) and enable the rsync service
yum install xinetd
chkconfig rsync on

5. Restart xinetd
service xinetd restart

Client

1. Create /tmp/rsync.pass
123456

Set the file's permissions to 400; otherwise rsync reports an error
chmod 400 /tmp/rsync.pass

2. Run rsync (for manual testing)
rsync -az --password-file=/tmp/rsync.pass --progress /local/file1 root@192.168.20.221::MYSERVER/remote/
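If the manual test is later wrapped in a script, a small pre-flight check can catch the permission problem before rsync complains. The sketch below is only an illustration: `password_file_is_private` and `build_rsync_command` are hypothetical helpers, the check assumes a POSIX system, and the function only assembles the argument list for the command shown above without running it.

```python
import os
import stat


def password_file_is_private(path):
    """True if neither group nor others have any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def build_rsync_command(password_file, source, user, host, module, remote_dir):
    """Assemble the argv list for the manual test (does not execute it)."""
    if not password_file_is_private(password_file):
        raise PermissionError(
            f"{password_file} must not be accessible to group or others")
    target = f"{user}@{host}::{module}/{remote_dir}"
    return ["rsync", "-az", f"--password-file={password_file}",
            "--progress", source, target]
```

The resulting list can be handed to `subprocess.run` as is; passing a list rather than a shell string avoids quoting problems with unusual file names.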

Here are the results of a recent POC (proof of concept):

Scenario 1 (transfer when the file content has changed):
command: rsync --password-file=/tmp/rsync.pass --progress scptest2.vmdk root@9.112.224.244::MYSERVER/root/rsync/
result: only the changed data is transferred.

Scenario 2 (resume the transfer after a network interruption):
command: rsync --password-file=/tmp/rsync.pass --progress scptest2.vmdk root@9.112.224.244::MYSERVER/root/rsync/
result: the rsync client waits until the network recovers and then resumes the transfer.

Scenario 3 (partial transfer):
command: rsync --password-file=/tmp/rsync.pass --progress --partial scptest2.vmdk root@9.112.224.244::MYSERVER/root/rsync/
result: if --partial was in use when the transfer was interrupted, rsync continues by transferring only the remaining data.

If scenarios 1, 2, and 3 occur together, the rsync client can still resume and finish the transfer, because rsync has its own algorithm for detecting which parts of a file have changed.
