This article attempts to analyze the implementation of the rsync utility on Linux/Unix, and describes the following:
an overview of the rsync algorithm (non-mathematical);
how the algorithm is implemented in the rsync utility;
the protocol used by the rsync utility;
the identifiable roles the rsync processes play.
The main goal of this article is to give readers a foundation from which they can better understand:
how rsync works;
rsync's limitations;
why a requested feature is unsuited to the code-base.
This article may also help programmers read the rsync code more easily.
When discussing rsync, we use certain terms to refer to the roles and processes the rsync utility takes on during the different phases of its job. The following terms are used later in this article:
client
role
The client initiates the synchronization.
server
role
The server is the remote rsync process, or the system, to which the client connects via a remote shell or a socket.
"Server" is a generic term; take care not to confuse it with daemon.
Once the connection from the client to the server is established, the distinction between the client and server roles is superseded by the sender and receiver roles.
daemon
role and process
A daemon is an rsync process that waits for connections from clients. On some platforms the daemon is also called a service.
remote shell
role and a set of processes
One or more processes that provide connectivity between the client and a remote server.
sender
role and process
The rsync process that has access to the source files being synchronized.
receiver
role and process
As a role, the receiver is the destination system in the synchronization; as a process, it is the process on the destination system that receives data and writes it to disk.
generator
process
The generator process identifies changed files and manages the file-level logic.
When the rsync client starts, it first establishes a connection with a server process; this connection may go over pipes or over the network.
How the client proceeds depends on how the remote connection is established.
When the remote server is a non-daemon server reached over a remote shell, the client forks the remote shell, which in turn starts a server on the remote system. The client and server then communicate over the pipes of the remote shell; as far as the rsync processes are concerned, no networking is involved. In this mode the rsync options for the server process are passed on the command line used to start the remote shell.
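For example, with ssh as the remote shell, the client ends up running something along these lines (the exact internal flags vary with the rsync version and the options in effect; this is only indicative):

```
ssh user@host rsync --server --sender . /src/dir/
```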
When rsync communicates with a daemon, it communicates directly over the network. In this mode the rsync options must be sent over the socket, as follows:
At the very start of the communication, the client and server each send their protocol version number to the other, and the lower of the two is used for the transfer. If the server is running in daemon mode, the rsync options are sent from the client to the server. The client then sends the exclude list, i.e. the list of files to be excluded.
Local Rsync jobs (when the source and destination are both on locally mounted filesystems) are done exactly like a push. The client, which becomes the sender, forks a server process to fulfill the receiver role. The client/sender and server/receiver communicate with each other over pipes.
The first thing that happens once the startup has completed is that the sender will create the file list. While it is being built, each entry is transmitted to the receiving side in a network-optimised way.
The file list includes not only the pathnames but also ownership, mode, permissions, size and modtime. If the --checksum option has been specified it also includes the file checksums.
When this is done, each side sorts the file list lexicographically by path relative to the base directory of the transfer. (The exact sorting algorithm varies depending on what protocol version is in effect for the transfer.) Once that has happened all references to files will be done by their index in the file list.
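A tiny sketch of that bookkeeping, using an illustrative in-memory list and plain lexicographic order rather than rsync's actual data structures and version-dependent sort:

```python
# Both sides sort the shared file list the same way, so from here on a
# plain integer index unambiguously names a file on either end.
file_list = ["b/two.txt", "a", "a/one.txt", "b"]
file_list.sort()  # lexicographic, by path relative to the transfer base

index_of = {path: i for i, path in enumerate(file_list)}
assert file_list[index_of["a/one.txt"]] == "a/one.txt"
```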
If necessary the sender follows the file list with id→name tables for users and groups, which the receiver will use to do an id→name→id translation for every file in the file list.
After the file list has been received by the receiver, it will fork to become the generator and receiver pair completing the pipeline.
Rsync is heavily pipelined. This means that it is a set of processes that communicate in a (largely) unidirectional way. Once the file list has been shared the pipeline behaves like this:
generator → sender → receiver
The output of the generator is input for the sender and the output of the sender is input for the receiver. Each process runs independently and is delayed only when the pipelines stall or when waiting for disk I/O or CPU resources.
The generator process compares the file list with its local directory tree. Prior to beginning its primary function, if --delete has been specified, it will first identify local files not on the sender and delete them on the receiver.
The generator will then start walking the file list. Each file will be checked to see if it can be skipped. In the most common mode of operation files are not skipped if the modification time or size differs. If --checksum was specified a file-level checksum will be created and compared. Directories, device nodes and symlinks are not skipped. Missing directories will be created.
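A sketch of the generator's quick check in this most common mode of operation, i.e. without --checksum; the function name and signature are invented for illustration:

```python
import os

def can_skip(dst_path, src_size, src_mtime):
    """Skip a regular file only when size and modification time both
    match the sender's file-list entry."""
    try:
        st = os.stat(dst_path)
    except FileNotFoundError:
        return False  # nothing on the receiving side: must transfer
    return st.st_size == src_size and int(st.st_mtime) == int(src_mtime)
```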
If a file is not to be skipped, any existing version on the receiving side becomes the "basis file" for the transfer, and is used as a data source that will help to eliminate matching data from having to be sent by the sender. To effect this remote matching of data, block checksums are created for the basis file and sent to the sender immediately following the file's index number. An empty block checksum set is sent for new files and if --whole-file was specified.
The block size and, in later versions, the size of the block checksum are calculated on a per file basis according to the size of that file.
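The exact formula has varied across rsync versions, but the idea is square-root scaling clamped to sensible bounds. A sketch with illustrative constants, together with the per-block checksum pass it feeds (MD5 stands in here for the MD4 rsync used historically):

```python
import hashlib
import math

def pick_block_size(file_len, min_block=700, max_block=1 << 17):
    """Choose a per-file block size that grows roughly as sqrt(file_len)."""
    blen = int(math.sqrt(file_len)) & ~7  # round down to a multiple of 8
    return max(min_block, min(blen or min_block, max_block))

def block_checksums(basis, blen):
    """(weak, strong) checksum pair for each fixed-size block of the basis file."""
    sums = []
    for off in range(0, len(basis), blen):
        block = basis[off:off + blen]
        a = sum(block) & 0xFFFF  # weak part 1: plain byte sum
        b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
        sums.append(((a, b), hashlib.md5(block).digest()))
    return sums
```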
The sender process reads the file index numbers and associated block checksum sets one at a time from the generator.
For each file ID the generator sends, the sender stores the block checksums and builds a hash index of them for rapid lookup.
Then the local file is read and a checksum is generated for the block beginning with the first byte of the local file. This block checksum is looked up in the set that was sent by the generator; if no match is found, the non-matching byte is appended to the non-matching data and the block starting at the next byte is compared. This is what is referred to as the "rolling checksum".
If a block checksum match is found it is considered a matching block. Any accumulated non-matching data is sent to the receiver, followed by the offset and length of the matching block in the receiver's file, and the block checksum generator is advanced to the next byte after the matching block.
Matching blocks can be identified in this way even if the blocks are reordered or at different offsets. This process is the very heart of the rsync algorithm.
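A minimal Python sketch of this matching loop. It follows the two-level scheme described above (a cheap weak checksum that can be rolled forward one byte at a time, confirmed by a strong checksum on candidate hits), though the constants and hash choices are simplifications rather than rsync's exact algorithm:

```python
import hashlib

def weak_checksum(block):
    """Two-part weak checksum in the rsync style (simplified)."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return a, b

def delta(new_data, checksum_index, blen):
    """Yield ('data', bytes) literals and ('match', block_index) records.

    checksum_index maps a weak checksum pair to a list of
    (block_index, strong_digest) tuples received from the generator.
    """
    literal = bytearray()
    i = 0
    if i + blen <= len(new_data):
        a, b = weak_checksum(new_data[:blen])
    while i + blen <= len(new_data):
        match = None
        for idx, strong in checksum_index.get((a, b), []):
            if hashlib.md5(new_data[i:i + blen]).digest() == strong:
                match = idx                    # weak hit confirmed by strong
                break
        if match is not None:
            if literal:
                yield ("data", bytes(literal))
                literal = bytearray()
            yield ("match", match)
            i += blen
            if i + blen <= len(new_data):      # re-seed after the jump
                a, b = weak_checksum(new_data[i:i + blen])
        else:
            literal.append(new_data[i])
            if i + blen < len(new_data):       # roll the window in O(1)
                out, incoming = new_data[i], new_data[i + blen]
                a = (a - out + incoming) & 0xFFFF
                b = (b - blen * out + a) & 0xFFFF
            i += 1
    literal.extend(new_data[i:])               # trailing partial block
    if literal:
        yield ("data", bytes(literal))
```

A real sender also treats the final partial block specially and buckets the checksum index for speed; the essential point is the O(1) roll in the else branch, which is what makes a per-byte search affordable.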
In this way, the sender will give the receiver instructions for how to reconstruct the source file into a new destination file. These instructions detail all the matching data that can be copied from the basis file (if one exists for the transfer), and include any raw data that was not available locally. At the end of each file's processing a whole-file checksum is sent and the sender proceeds with the next file.
Generating the rolling checksums and searching for matches in the checksum set sent by the generator require a good deal of CPU power. Of all the rsync processes it is the sender that is the most CPU intensive.
The receiver will read from the sender data for each file identified by the file index number. It will open the local file (called the basis) and will create a temporary file.
The receiver will expect to read non-matched data and/or to match records all in sequence for the final file contents. When non-matched data is read it will be written to the temp-file. When a block match record is received the receiver will seek to the block offset in the basis file and copy the block to the temp-file. In this way the temp-file is built from beginning to end.
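A sketch of that reconstruction, consuming the kind of ('data', ...) and ('match', ...) records produced by the sender sketch above; the framing is illustrative, not rsync's wire format:

```python
def rebuild(instructions, basis, blen, temp):
    """Build the new file in `temp` from literal data and block-copy
    records; `basis` and `temp` are open binary file objects."""
    for kind, value in instructions:
        if kind == "data":       # non-matched data: write it verbatim
            temp.write(value)
        else:                    # matched block: seek and copy from the basis
            basis.seek(value * blen)
            temp.write(basis.read(blen))
```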
The file's checksum is generated as the temp-file is built. At the end of the file, this checksum is compared with the file checksum from the sender. If the file checksums do not match the temp-file is deleted. If the file fails once it will be reprocessed in a second phase, and if it fails twice an error is reported.
After the temp-file has been completed, its ownership and permissions and modification time are set. It is then renamed to replace the basis file.
Copying data from the basis file to the temp-file makes the receiver the most disk-intensive of all the rsync processes. Small files may still be in disk cache, mitigating this, but for large files the cache may thrash as the generator has moved on to other files, and there is further latency caused by the sender. As data is read, possibly at random, from one file and written to another, a so-called seek storm can occur if the working set is larger than the disk cache, further hurting performance.
The daemon process, like many daemons, forks for every connection. On startup, it parses the rsyncd.conf file to determine what modules exist and to set the global options.
When a connection is received for a defined module the daemon forks a new child process to handle the connection. That child process then reads the rsyncd.conf file to set the options for the requested module, which may chroot to the module path and may drop setuid and setgid for the process. After that it will behave just like any other rsync server process adopting either a sender or receiver role.
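For reference, a minimal rsyncd.conf in this spirit; the module name and path are invented for illustration:

```
# global options
uid = nobody
gid = nobody

# one module; clients address it as rsync://host/pub
[pub]
    path = /srv/pub
    read only = yes
    use chroot = yes
```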
A well-designed communications protocol has a number of characteristics.
Everything is sent in well defined packets with a header and an optional body or data payload.
In each packet's header a type and/or command is specified.
Each packet has a definite length.
In addition to these characteristics, protocols have varying degrees of statefulness, inter-packet independence, human readability, and the ability to reestablish a disconnected session.
Rsync's protocol has none of these good characteristics. The data is transferred as an unbroken stream of bytes. With the exception of the unmatched file-data, there are no length specifiers nor counts. Instead the meaning of each byte is dependent on its context as defined by the protocol level.
As an example, when the sender is sending the file list it simply sends each file list entry and terminates the list with a null byte. Within the file list entries, a bitfield indicates which fields of the structure to expect and those that are variable length strings are simply null terminated. The generator sending file numbers and block checksum sets works the same way.
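To make that concrete, here is a toy encoder in the same spirit: a flag bitfield announcing which fields follow, null-terminated strings, and a null byte closing the list. It mirrors the style described above, not rsync's actual wire format:

```python
import struct

HAS_SIZE, HAS_MTIME = 0x01, 0x02

def encode_entry(name, size=None, mtime=None):
    """One file-list entry: a flag byte, then only the promised fields."""
    flags = (HAS_SIZE if size is not None else 0) | \
            (HAS_MTIME if mtime is not None else 0)
    assert flags, "a zero flag byte is reserved to terminate the list"
    out = bytes([flags]) + name.encode() + b"\x00"   # null-terminated name
    if size is not None:
        out += struct.pack("<I", size)
    if mtime is not None:
        out += struct.pack("<I", mtime)
    return out

def encode_list(entries):
    # No packet headers and no lengths: a reader must track protocol state
    # to know what each byte means. A lone null byte ends the list.
    return b"".join(encode_entry(*e) for e in entries) + b"\x00"
```

A decoder has to mirror this state machine exactly; misreading a single flag byte derails everything that follows, which is precisely the fragility discussed next.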
This method of communication works quite well on reliable connections and it certainly has less data overhead than the formal protocols. It unfortunately makes the protocol extremely difficult to document, debug or extend. Each version of the protocol will have subtle differences on the wire that can only be anticipated by knowing the exact protocol version.
This document is a work in progress. The author expects that it has some glaring oversights and some portions that may be more confusing than enlightening for some readers. It is hoped that this could evolve into a useful reference.
Specific suggestions for improvement are welcome, as would be a complete rewrite.
Data synchronization (sync) is a problem many network applications need to solve, file mirroring for example. Take file synchronization as the problem model: two hosts on a network, Host-A and Host-B, both hold a copy of the same file, File-Old. The copy on Host-A has since been changed into File-New, and through synchronization Host-B should obtain File-New as well.
Consider how to handle this. The simplest method is to transfer all the data. That is simple but wasteful, because File-New differs from File-Old by only a few small changes, and copying everything is too expensive. It would be far better to transfer only the parts that changed, i.e. the added, deleted and modified portions of the file. The problem we need to solve then becomes: how do we obtain the difference between File-Old and File-New?
If Host-A kept a copy of File-Old, an ordinary diff algorithm against File-New would suffice, but in practice Host-A usually does not keep File-Old. Alternatively, if the file format itself has strong version-control features, Host-B can tell Host-A which version it holds and Host-A can compute the difference. In most cases, however, a file is just a string of bytes, with no version-control information and no historical copies; Rsync and RDC exist to synchronize in exactly this situation.
Here is a rough outline of the Rsync algorithm:
1) Host-B divides File-Old into non-overlapping blocks of K bytes each, padding the tail if it is shorter than K bytes, and then computes a weak hash and a strong hash for each block. The weak hash may well give two different blocks the same value, but it is fast to compute, and it must support rolling: knowing the hash of bytes 1 through K, the hash of bytes 2 through K+1 can be computed very quickly, rolling the window forward one byte at a time. The strong hash may be treated as giving different blocks different values; Rsync uses MD4. Below, WH denotes the weak hash and SH the strong hash.
2) Host-B sends the WH and SH of every block to Host-A.
3) Now Host-A takes the stage, and its share of the computation is the larger one. Host-A computes the WH of every K-byte block of File-New (that is, the K-byte block starting at each byte) and matches it against the WHs Host-B sent over. If one matches, it computes the SH of that block and compares again; if that also matches, the block exists in File-Old as well. If File-New is N bytes long, Host-A has to process roughly (N-K) blocks. Here the value of using two hash algorithms becomes clear: the WH serves as the first-pass comparison, and because it can roll it quickly screens out most mismatches, while anything that slips through cannot escape the SH.
4) From this computation Host-A knows which blocks of File-New also occur in File-Old, and therefore which parts differ. Host-A encodes these differences and sends them to Host-B.
5) Host-B receives the data from Host-A and decodes it, obtaining the changes of File-New relative to File-Old, and thereby File-New itself.
The whole process takes only one round trip, yields a difference that is exact to the byte, and places the relatively larger computational load on Host-A.
The RDC algorithm requires Host-A and Host-B to divide File-New and File-Old into blocks using identical rules, and then compute an SH for each block. Host-B sends the SH of each of its blocks to Host-A; Host-A diffs the two sets of SHs and learns which blocks differ, which were deleted, and which were added. The key to RDC lies in the chunking rule, which also uses a WH: when the same rule is applied to File-Old and File-New, the resulting blocks should expose the differences as well as possible.
For example, File-Old contains "I Love Playing Basketball",
and File-New is "I Like Playing Football".
With the Rsync algorithm, Host-A can compute the exact difference, precise to each character: in "I Like Playing Football", "ike" is a modification and "Foot" is an addition. Host-A essentially tells Host-B: "replace characters 4-6 with 'ike', remove characters 16-21, and insert 'Foot'".
With the RDC algorithm, the result might look like this:
File-Old is divided into 3 blocks:
"I Love" | " Playing " | "Basketball"
File-New is divided into 3 blocks:
"I Like" | " Playing " | "Football"
After matching, Host-A finds that only block 2 of File-Old matches block 2 of File-New, so it tells Host-B: "replace your block 1 with 'I Like' and replace your block 3 with 'Football'".
As the example shows, RDC is comparatively wasteful: it transfers more data than Rsync, but the computation is spread more evenly between Host-A and Host-B. For RDC to perform well, a good chunking scheme is essential, so that blocks containing a diff carry as little unchanged data as possible. How is that achieved? Again with the WH: a rolling checksum is used to quickly mine structural properties out of the data.
Note that Rsync's chunking policy uses blocks of fixed length, whereas RDC's blocks may differ in length.
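A minimal sketch of the content-defined chunking this relies on: a hash over a small sliding window is rolled byte by byte, and a chunk boundary is declared wherever its low bits hit a fixed pattern. The window size, modulus and expected chunk length here are illustrative, not those of any real RDC implementation:

```python
def chunks(data, window=16, mask=0x1F, base=257, modulus=1 << 31):
    """Split data at content-defined boundaries: identical content yields
    identical cuts even when its offset in the file has shifted."""
    top = pow(base, window - 1, modulus)  # weight of the byte leaving the window
    h, start, out = 0, 0, []
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % modulus  # drop the oldest byte
        h = (h * base + b) % modulus                    # take in the newest byte
        if i + 1 - start >= window and (h & mask) == mask:
            out.append(data[start:i + 1])               # boundary after byte i
            start = i + 1
    if start < len(data):
        out.append(data[start:])                        # trailing chunk
    return out
```

Because the cut points depend only on nearby bytes, an edit inside one chunk disturbs at most the cuts in its vicinity, so most chunks of File-Old and File-New still line up and their strong hashes match.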
Although RDC is somewhat wasteful, most of what it transfers is still delta data, and its computation is lighter and more evenly distributed. DFS on Windows Server 2003 R2 uses the RDC algorithm, and another application is the Shared Folders feature of Live Messenger; use either for a while and you will see the efficiency is not bad. :)
Note:
This article is reposted from the 51CTO blog of holy2009; original link: http://blog.51cto.com/holy2010/549704