I've used unison for a long while for keeping things like my music in sync between machines. But it's never felt entirely safe, or right. (Or fast!) Using a VCS would be better, but would consume a lot more space.
Well, space still matters on laptops, with their smallish SSDs, but I have terabytes of disk on my file servers, so VCS space overhead there is no longer of much concern for files smaller than videos. So, here's a way I've been experimenting with to get rid of unison in this situation.
- Set up some sort of networked filesystem connection to the file server. I hate to admit I'm still using NFS.
- Log into the file server, init a git repo, and check all your music (or whatever) into it. (The full sequence is sketched after this list.)
- When checking out on each client, use `git clone --shared`. This avoids including any objects in the client's local `.git` directory.

        git clone --shared /mnt/fileserver/stuff.git stuff

- Now you can just use git as usual, to add/remove stuff, commit, update, etc.
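For concreteness, here's a minimal sketch of the whole setup, assuming the collection lives at /export/stuff on the server and the export is mounted at /mnt/fileserver on each client; those paths (and the NFS mount command) are just placeholders for whatever your setup uses.

    # on the file server: create the repo and check everything in
    cd /export/stuff
    git init
    git add .
    git commit -m 'initial import'

    # on each client: mount the export, then make a shared clone
    sudo mount -t nfs fileserver:/export /mnt/fileserver
    git clone --shared /mnt/fileserver/stuff stuff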
Caveats:
- `git add` is not very fast. Reading, checksumming, and writing out gig after gig of data can be slow. Think hours. Maybe days. (OTOH, I ran that on a Thecus.)
- Overall, I'm happy with the speed after the initial setup. Git pushes data around faster than unison, despite not really being intended to be used this way.
- Note the use of `git clone --shared`, and read the caveats about this mode in git-clone(1). `git repack` is not recommended on clients because it would read and write the whole git repo over NFS.
- Make sure your NFS server has large file support. (The userspace one doesn't; the kernel one does.) You don't just need it for enormous pack files. The failure mode I saw was git failing in amusing ways that involved creating empty files.
- Git doesn't deal very well with a bit flipping somewhere in the middle of a 32 gigabyte pack file. And since this method avoids duplicating the data in `.git`, the clones are not available as backups if something goes wrong. So if regenerating your entire repo doesn't appeal, keep a backup of it. (One way to catch corruption early is sketched after this list.)
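On the corruption point: a periodic `git fsck` on the server repository is one way to notice a flipped bit before it propagates. This isn't part of the original setup, just a suggestion; the path and schedule are illustrative.

    # run on the file server, e.g. from a weekly cron job
    cd /export/stuff
    git fsck --full || echo "repo corrupt -- restore from backup"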
(Thanks to Ted Ts'o for the hint about using --shared, which makes this work significantly better and simpler.)
When you do the clone, use git clone -s. This sets up the file $GIT_DIR/objects/info/alternates to point at the objects directory of the base repository. It means that git will look for objects in the base repository if they can't be found in the clone directory. That way you can do git gc without worrying about breaking the hard links.
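To illustrate what that looks like on disk (my example, using the clone command from the post above): the alternates file is a one-line pointer into the base repository's object store.

    $ cat stuff/.git/objects/info/alternates
    /mnt/fileserver/stuff.git/objects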
For example, I normally keep /usr/projects/linux/base as a clone of Linus's linux repository. I do my local hacking in various repositories that are cloned using "git clone -s base ext4". This causes /usr/projects/linux/ext4/.git/objects/info/alternates to contain the single line "/usr/projects/linux/base/objects", and the objects directory is otherwise completely empty. When I make commits into the ext4 repository, those objects are created in /usr/projects/linux/ext4/.git/objects; then, when I push them to the base directory, a copy is made there. Afterwards, if I do a "git gc" in /usr/projects/linux/ext4, those objects will eventually disappear. (There is an expiry time for safety reasons, so they won't disappear right away, unless you explicitly prune the reflog via "git reflog expire --expire=0 --expire-unreachable=0 --all; git gc; git prune" --- why this is so is beyond the scope of this comment, though. :-)
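Pulled out of the prose for readability, the aggressive cleanup described above (run inside the shared clone, here the ext4 directory from the example) is:

    cd /usr/projects/linux/ext4
    git reflog expire --expire=0 --expire-unreachable=0 --all
    git gc
    git prune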
In any case, the advantage of using "git clone -s" is that git gc is safe; you don't have to worry about breaking hard links and causing the disk usage to explode. The downside is that there is only one copy of the objects, so if you do have local hard disk corruption, the disk space savings also make your system slightly less robust against random disk-induced data loss.
I do like the idea of experimenting with using git as a replacement for Unison. One potential problem which does come to mind is that git doesn't preserve file permissions, which might be an issue in some cases...
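To make the permissions point concrete (my illustration, not from the comment): git records only whether a file is executable, so any other mode bits are reset to your umask default on a fresh checkout.

    # inside any test repository
    echo xyzzy > secret.txt
    chmod 600 secret.txt               # owner-only, as you might want for private files
    git add secret.txt && git commit -m 'add secret.txt'
    git clone . /tmp/copy              # fresh checkout elsewhere
    ls -l /tmp/copy/secret.txt         # comes back 644 (per umask), not 600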
There is a script in git's contrib directory, called something like git-new-workdir. It will create a valid .git dir that symlinks everything back to the source repository, so you don't get duplicate objects even if you commit in the new workdir. However, you have to be careful to check out a different branch in each such workdir, since git has no way of knowing what other workdirs exist and cannot update their working tree contents if you commit to a branch they have checked out. This is also the reason why it's a contrib script and not an official command.
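Invocation looks roughly like this; the script's location varies by distribution (the path below is a guess, on some systems you have to copy it out of git's source tree), and the repository and branch names are made up:

    sh /usr/share/doc/git/contrib/workdir/git-new-workdir ~/src/stuff ~/src/stuff-wip some-branch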
I've also started adding a .gitattributes file that contains something like: * -delta
That should speed up commits and/or packs by keeping git from attempting delta compression. I haven't used it long enough (and it's not documented) to know for sure where/how it helps.
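A minimal sketch of setting that up at the top of a repo (the blanket * pattern disables delta attempts for every path, which only makes sense for a repository full of already-compressed media):

    echo '* -delta' > .gitattributes
    git add .gitattributes
    git commit -m 'skip delta compression; these files do not delta well'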
Note that when committing from a shared clone, you do end up with locally written objects. I clear these out by running something like this:
    git push && rm -vf $MR_REPO/.git/objects/??/*
It's "safe" to delete the local objects once they're pushed to the server.