DetectMoves
From RdiffBackupWiki
(text deleted)..ow (0.13.3) rdiff-backup bases its records of a file on the filename.
If the file's contents change between backups, only the changes are stored, and not a whole new copy of the file. However, if the name of the file changes, rdiff-backup thinks the old file has gone away and a brand new file has been created; it copies over all the data again, even though it is all already present in the backup archive.
A way to change this would be to have rdiff-backup calculate an MD5 hash fingerprint for the file, and try to match the disappearance of a file with the appearance of a new file that has the same fingerprint.
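A minimal sketch of that matching step, assuming each backup run can produce a {filename: MD5 digest} mapping. The function names and the mapping format are made up for illustration; rdiff-backup 0.13.3 stores nothing of the kind.

```python
import hashlib


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def match_renames(old_fingerprints, new_fingerprints):
    """Pair names that vanished with names that appeared and share a fingerprint.

    Both arguments are hypothetical {filename: md5 hex digest} mappings, one per
    backup run. Returns a list of (old_name, new_name) pairs.
    """
    vanished = {n: h for n, h in old_fingerprints.items() if n not in new_fingerprints}
    appeared = {n: h for n, h in new_fingerprints.items() if n not in old_fingerprints}

    # Index the vanished files by digest so each appeared file is a dict lookup.
    by_hash = {}
    for name in sorted(vanished):
        by_hash.setdefault(vanished[name], []).append(name)

    pairs = []
    for name in sorted(appeared):
        candidates = by_hash.get(appeared[name])
        if candidates:
            pairs.append((candidates.pop(0), name))
    return pairs
```

This only pairs names that exist in exactly one of the two runs. A name that is reused with different content is neither "vanished" nor "appeared", so it would need a different test.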
Ben Escoto writes:
The md5 sum thing for detecting moved/duplicate files is complicated, and may make things slower. Before someone can even start implementing it, someone should:
1. Determine if it would be worth it. Is the point to save space? Are people really short of space nowadays?
2. Figure out how to store this information in a way compatible with the current format (which pairs increment and source files by filename).
Andrew K Bressen responds:
Linux logrotate and many other programs change the names of old files with every run, so a large weblog kept online on the original host for four weeks for user access and rotated daily would take up 28 times as much space on a backup server. Similarly it would eat 28 times the network bandwidth, and would slow down the backup runs every single day by the amount of time it took to transfer the entire file. There are other cases a bit less pathological but still significant; I think that in the long term it is worth the trouble. I don't have a quick answer for how to handle it, though. My initial thoughts run to nasty sets of indirection layers that might indeed have noticeable speed penalties.
ps: I didn't see anything on the wiki intro pages that told me how to add line breaks that are not paragraph boundaries. Text does seem to reflow when the browser is horizontally resized, so it may simply not be possible. Apologies for resulting ugliness. --akb
Kevin Walker:
I agree that the logrotate issue is important. When I run rdiff-backup, typically the majority of the time and bandwidth is taken up by copying, e.g., maillog.4 (which is identical to the old maillog.3).
Perhaps the general problem of detecting renamed or moved files is too difficult, but I think it would be easy to handle the important logrotate case:
- If rdiff-backup encounters a file whose name ends with ".n" (n = 1, 2, 3, ...), it is compared with filename.(n-1) as well as filename.n
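A rough sketch of that rule. The function name is hypothetical, and treating the unnumbered file as the predecessor of ".1" is my assumption, since the rule above does not say what filename.(n-1) means for n = 1. Compressed rotations such as messages.2.gz are not handled.

```python
import re

# A name ending in ".n" with n = 1, 2, 3, ... (no leading zero).
_ROTATED = re.compile(r"^(?P<base>.+)\.(?P<n>[1-9]\d*)$")


def comparison_candidates(filename):
    """Return the names in the previous backup to compare `filename` against."""
    match = _ROTATED.match(filename)
    if match is None:
        return [filename]
    base, n = match.group("base"), int(match.group("n"))
    # Assumption: the predecessor of "name.1" is the unrotated "name".
    previous = base if n == 1 else f"{base}.{n - 1}"
    return [previous, filename]
```

For maillog.4 this yields maillog.3 first and maillog.4 second, so the identical predecessor is tried before the same-named file.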
Notes
1. The filename is new to the backup (rename)
- if the file's content hasn't changed, the rename can be found by comparing the list of new files with the list of removed files (a list with file size and md5sum)
- if the content has changed, I find it quite impossible to detect in any easy way.
2. The file is completely different from the one in the backup (log files)
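The unchanged-content case in note 1 could look roughly like this. It assumes the backup keeps a (size, md5sum) record for every removed file, and that new files can be hashed on demand through a caller-supplied function. All names here are hypothetical.

```python
def find_unchanged_renames(removed, added_sizes, hash_file):
    """Match new files against removed files by size first, then by md5sum.

    removed:     {old_name: (size, md5)} recorded by the previous backup
    added_sizes: {new_name: size} for files that are new in this run
    hash_file:   callable taking a new name and returning its md5 digest
    Returns {new_name: old_name} for files that look like plain renames.
    """
    removed_by_size = {}
    for name, (size, digest) in sorted(removed.items()):
        removed_by_size.setdefault(size, []).append((name, digest))

    renames = {}
    used = set()
    for new_name, size in sorted(added_sizes.items()):
        candidates = [c for c in removed_by_size.get(size, []) if c[0] not in used]
        if not candidates:
            continue  # no removed file of this size, so hashing is skipped
        digest = hash_file(new_name)
        for old_name, old_digest in candidates:
            if old_digest == digest:
                renames[new_name] = old_name
                used.add(old_name)
                break
    return renames
```

Checking the size first means a new file is only hashed when some removed file could actually match it.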
Statistics
I ran fdupes on a directory with 6500 files of size 850KB-1MB and it took 2m45s. And it's not common to have 6500 files in the same directory.
logrotate hint
Logrotate by default uses numbered extensions, and each day/week/month all the logs are renamed. Therefore, after each log rotation, rdiff-backup will back up all the logs again, because each numbered name now has completely different contents.
- messages -> messages.1 -> messages.2 -> messages.3 -> messages.4 -> (Deleted)
Instead use the dateext option in the logrotate.conf file. Logs will then receive a permanent date extension like -YYYYMMDD, instead of a changing numbered one.
- messages -> messages-20080210 -> (Deleted after 4 log rotations)
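To see the difference in terms of what a filename-based backup has to transfer, here is a toy model of both schemes. It is only an illustration: the dictionaries stand in for a log directory, the contents are placeholder strings, and none of it calls logrotate or rdiff-backup.

```python
def rotate_numbered(logs, new_content, keep=4, base="messages"):
    """Return the {name: content} state after one numbered rotation."""
    rotated = {base: new_content}
    if base in logs:
        rotated[f"{base}.1"] = logs[base]
    for n in range(1, keep):
        old_name = f"{base}.{n}"
        if old_name in logs:
            rotated[f"{base}.{n + 1}"] = logs[old_name]  # the oldest one falls off
    return rotated


def rotate_dateext(logs, new_content, stamp, keep=4, base="messages"):
    """Return the {name: content} state after one rotation with date extensions."""
    rotated = {name: content for name, content in logs.items() if name != base}
    if base in logs:
        rotated[f"{base}-{stamp}"] = logs[base]
    # YYYYMMDD stamps sort chronologically, so the oldest names come first.
    for name in sorted(rotated)[: max(0, len(rotated) - keep)]:
        del rotated[name]
    rotated[base] = new_content
    return rotated


def names_needing_transfer(before, after):
    """Names whose content is new or changed when files are tracked by name."""
    return sorted(name for name, content in after.items() if before.get(name) != content)
```

With four rotated logs kept, the numbered scheme reports all five names as changed after a rotation, while the dated scheme reports only the live log and the one newly dated file.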
Advantages:
- Each rotated log has a permanent unique name and its contents never change. Therefore rdiff-backup will only ever store a single copy in the backup repository.
- As each log has a permanent unique name it is trivial to restore a particular log using its increment name, rather than having to calculate the date when the required log had a specific backup number.
rdiff-backup /backup/rdiff-backup-data/increments/var/log/messages-20080210.*.snapshot* /tmp/messages-20080210
-- Mike Fleetwood
Link-Backup solution
Link-Backup solves this issue. It gracefully handles file and directory renames, moved files, and duplicate files without additional storage or transfer. It's a small, simple Python script that does hard linking between dated backups, and comes with a cgi viewer. It also understands how to run incrementally, supporting both local and remote backup. Remote operation occurs over ssh. It maintains a flat catalog of all files at the dest named by md5sum+stat, then hard-links backup trees against that catalog. It uses the previous tree to accelerate the linking process. If the file is present in the old tree with the same stats, it hardlinks to it. If it isn't, an md5sum+stat is generated at the source, and the file is looked up in the catalog at the dest. If present, it is hardlinked to. If not, it is requested from the source, added to the catalog, then hardlinked to from the backup tree. I've been using it since 2004.
http://www.scottlu.com/Content/Link-Backup.html
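The catalog idea can be sketched in a few lines. This is not Link-Backup's code: the key format (which stat fields are used), the function names and the local-only copy are my assumptions, and the "previous tree" shortcut and the ssh transport are left out.

```python
import hashlib
import os
import shutil


def catalog_key(path):
    """Name a catalog entry by content hash plus a few stat fields (assumed set)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    info = os.stat(path)
    return f"{digest.hexdigest()}-{info.st_size}-{info.st_mode:o}-{int(info.st_mtime)}"


def backup_tree(source_root, catalog_dir, tree_dir):
    """Build tree_dir out of hard links into a flat catalog of unique files.

    Returns how many files had to be copied into the catalog. catalog_dir and
    tree_dir must be on the same filesystem for the hard links to work.
    """
    os.makedirs(catalog_dir, exist_ok=True)
    copied = 0
    for dirpath, _dirnames, filenames in os.walk(source_root):
        relative = os.path.relpath(dirpath, source_root)
        target_dir = os.path.normpath(os.path.join(tree_dir, relative))
        os.makedirs(target_dir, exist_ok=True)
        for filename in filenames:
            source = os.path.join(dirpath, filename)
            entry = os.path.join(catalog_dir, catalog_key(source))
            if not os.path.exists(entry):
                shutil.copy2(source, entry)  # only unseen content is transferred
                copied += 1
            os.link(entry, os.path.join(target_dir, filename))
    return copied
```

Because the key depends on content and stat data but not on the path, a renamed or moved file finds its existing catalog entry and costs one hard link in the new tree.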
storeBackup solution
Gregor Gorjanc:
The storeBackup utility solves this issue. The text below is copied from an article in linuxfocus (http://linuxfocus.org/English/January2004/article321.shtml)
... In addition, files or directory structures are re-named by users, in incremental backups they are again (unnecessarily) secured. The solution to this is to check the backup for files with the same content (possibly compressed) and to refer to those. The hard link is this reference. (Explanation: data blocks in Unix systems are administered through inodes. Many different file names in as many directories may refer to an inode. The actual file is being deleted with its last hard link (= directory name). (Hard links may point to a specific file only within one file system.) With this trick of the hard links, which were already created in existing backup files, each file is present in each backup although it exists physically on the harddrive only once. Copying and renaming of files or directories takes only the storage space of the hard links - nearly nothing. ...
This is the same as using md5sum on all the files, and it doesn't cover the possibility of renaming a file and then changing its contents.
--- ErikJohansson
wrapper to mv
Perhaps for cases like these a wrapper to "mv" could be packaged with rdiff-backup. It could maintain a hint list of moves for rdiff-backup to check (by running MD5 on them, I guess). Since the functionality is going to be most useful for scripts that move stuff, or users doing one-off moves of really large files, it wouldn't be that hard to get users to change their habits. Anyone with the logfile problem could fix it with a minor change to the logrotate script, and an extra command line flag. Ryan
You have to wrap the glibc syscalls for renaming files, not just the mv command. This is nontrivial and very ugly. -- ErikJohansson
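For the cooperative case only (scripts and users that agree to call the wrapper), the hint list might look like the sketch below. The log format, file location and function names are invented for illustration, and it does nothing about renames made through any other route, which is exactly the objection above.

```python
import json
import os
import time


def move_with_hint(source, destination, hint_log):
    """Rename a file and append the move to a hint log, one JSON object per line."""
    source = os.path.abspath(source)
    destination = os.path.abspath(destination)
    os.rename(source, destination)  # fails across filesystems, unlike mv
    record = {"from": source, "to": destination, "time": time.time()}
    with open(hint_log, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


def read_hints(hint_log):
    """Return the recorded moves as (old_path, new_path) pairs, oldest first."""
    if not os.path.exists(hint_log):
        return []
    with open(hint_log, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return [(entry["from"], entry["to"]) for entry in entries]
```

A backup run could read the pairs, verify each one (for example with a checksum) and then truncate the log.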
Tracking inodes (file serial numbers)
What about using the inode number as the unique reference? Having done some tests, it doesn't seem to change with either editing, renaming or moving. Maybe throw in an MD5 check for good measure. To answer Ben's point #1, yeah sure, disc space is cheap but bandwidth isn't. I've found that if 'someuser' moves a directory it can require a LOT of bandwidth to back up the next incremental. I quite like the idea of BackupPC's (http://backuppc.sourceforge.net./info.html) pooling (ie just keeping the unique files).
Cammy @ The Penguin Factory
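That kind of test is easy to repeat. The sketch below appends to a file and renames it within one filesystem, then compares the device and inode numbers before and after. Two caveats it does not cover: many editors save by writing a new file and renaming it over the old one, which produces a new inode, and a move across filesystems is a copy plus delete.

```python
import os
import tempfile


def inode_survives_rename_and_edit():
    """Compare (st_dev, st_ino) before and after an in-place append and a rename."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "report.txt")
        with open(original, "w") as handle:
            handle.write("first line\n")
        before = os.stat(original)

        with open(original, "a") as handle:  # in-place edit
            handle.write("second line\n")
        os.mkdir(os.path.join(workdir, "archive"))
        moved = os.path.join(workdir, "archive", "report-old.txt")
        os.rename(original, moved)  # move within the same filesystem

        after = os.stat(moved)
        return (before.st_dev, before.st_ino) == (after.st_dev, after.st_ino)
```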
Efficient use of inodes as unique references?
I don't see why md5 is needed. rdiff is already pretty efficient at comparing files, and inodes alone provide a good hint about where similar (including approximately similar) data from previous backups will be. All that's needed is a way to record when a file's diff is based on a differently-named file from the previous backup.
With the next backup an auxiliary map of devices+inodes -> filenames could be stored. You wouldn't need all files, only the ones receiving an incremental update, and of those possibly only the ones which are starting 'fresh', depending on where rdiff-backup starts looking for differences. A future backup would then be able to attempt to look up preexisting data by inode - if it found somewhat similar data under a differing filename, it could store a special entry which recorded that link between filenames and the diffs. It seems that this might allow a format upgrade without breaking older archives.
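A minimal sketch of such a map and the lookup, assuming it is simply rebuilt by walking the source tree at each run. The function names are hypothetical, every file is recorded rather than only the 'fresh' ones, and with hard-linked files only one of the names is kept per inode.

```python
import os


def build_inode_map(root):
    """Map (st_dev, st_ino) to a root-relative path for every non-directory entry."""
    inode_map = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            info = os.lstat(full_path)
            inode_map[(info.st_dev, info.st_ino)] = os.path.relpath(full_path, root)
    return inode_map


def delta_base_hints(previous_map, current_map):
    """Return {current_name: previous_name} where an inode reappears under a new name.

    The result is only a hint about where similar data may be found; inodes
    can be reused, so the contents still have to be compared.
    """
    hints = {}
    for key, current_name in current_map.items():
        previous_name = previous_map.get(key)
        if previous_name is not None and previous_name != current_name:
            hints[current_name] = previous_name
    return hints
```

Each hint names the old file that the new file's diff could be based on, which is the "special entry" linking the two filenames.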
Restoring would be slightly trickier, since you would want the restore umbrella to encompass files which started off outside the restore point but migrated in by the time you request. If the incremental backup is compressed and/or inside a tar file, maybe the link log ought to be made easily accessible so that a quick reverse-order reconstruction of the necessary initial set of files is possible. At the same time the list of file deletions could use the same tweak. This would have the effect of eliminating the creation and deletion of unnecessary files during a restore, if that's not done already.
Finally, if it matters, I think it's probably a good idea to expect the eventual reuse of inodes. Still, it doesn't matter much if they're only being used as a hint for where similar data from the most recent backups might be obtained. If the file contents look dissimilar (like maybe the first block differs too much) then the files are probably bad candidates for comparison and regular old storage could be used if it's a big deal. But even if they are "deceptively" similar, then so long as the data and metadata such as hard links work out the same after restore, there is no problem with the occasional mismatch from inode reuse. It's semantic.
-- CxSeven
I guess I'm so interested in this because I'm haunted by inefficiency on my crappy old system, which rdiff is the best at relieving for backups, so far. (Few other programs actually look for similarities WITHIN files such as my huge mailboxes.) But when I reorganize my media files, that can result in gigabytes of data requiring a second transfer.
-- CxSeven
inodes when combined with lvm snapshots
Inodes stay the same when you create an LVM snapshot and mount the snapshot elsewhere, don't they? That holds as long as you don't encode the device in with the inode, but then you end up with potential inode collisions if you are backing up several partitions at once. How often will these collisions happen? Maybe check the file contents as well, once you discover identical inodes. If viable, then that would suit me. Remember that this would be useful despite the fact that disk is cheap -- disk may be cheap, but the more backups you take, the faster you run out. Make it more efficient on systems where lots of moves are done, and you can run more backups.
Incidentally, I suspect an option similar to --no-compare-inode, but only ignoring the device rather than the inodes will be necessary for such lvm snapshots. So in the end, you want a few different options:
--no-compare-inode as it currently stands, --no-compare-device for lvm snapshots and nfs and the like where device numbers can change but inodes stay the same, and --detects-moves which may end up deciding to ignore device numbers, and checks whether files get renamed by checking files with identical inodes and identical contents.
-- tconnors
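A sketch of the combined check, assuming the previous backup kept a small record (device, inode, MD5) per file. The record layout and function names are invented; the compare_device switch stands in for the proposed --no-compare-device behaviour and is not an existing rdiff-backup interface.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_file(previous, path, compare_device=False):
    """Decide whether `path` is the file described by a previous backup record.

    previous: hypothetical record {"dev": int, "ino": int, "md5": str}.
    With compare_device=False the device number is ignored, as would be needed
    when the source is a snapshot mounted elsewhere.
    """
    info = os.stat(path)
    if previous["ino"] != info.st_ino:
        return False
    if compare_device and previous["dev"] != info.st_dev:
        return False
    # Identical inode numbers can collide across partitions, so confirm by content.
    return previous["md5"] == file_md5(path)
```

The content check is what makes ignoring the device tolerable: an inode collision between two partitions is rejected unless the data matches as well.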
