I've used unison for a long while for keeping things like my music in sync between machines. But it's never felt entirely safe, or right. (Or fast!) Using a VCS would be better, but would consume a lot more space.

Well, space still matters on laptops, with their smallish SSDs, but I have terabytes of disk on my file servers, so VCS space overhead there is no longer of much concern for files smaller than videos. So, here's a way I've been experimenting with to get rid of unison in this situation.

  • Set up some sort of networked filesystem connection to the file server. I hate to admit I'm still using NFS.

  • Log into the file server, init a git repo, and check all your music (or whatever) into it.

  • When checking out on each client, use git clone --shared. This avoids including any objects in the client's local .git directory.

    git clone --shared /mnt/fileserver/stuff.git stuff
  • Now you can just use git as usual, to add/remove stuff, commit, update, etc.
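
Putting it all together, the workflow looks roughly like this. (The paths and commit messages are only examples; adjust for wherever your files actually live.)

    # one-time setup on the file server
    cd /export/stuff        # wherever the music lives
    git init
    git add .
    git commit -m 'initial checkin'

    # then on a client, after the git clone --shared shown above
    cd stuff
    git add some/new/album
    git commit -m 'added an album'
    git pull        # pick up changes committed from other machines
    git push        # send local commits back to the file server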

Caveats:

  • git add is not very fast. Reading, checksumming, and writing out gig after gig of data can be slow. Think hours. Maybe days. (OTOH, I ran that on a Thecus.)
  • Overall, I'm happy with the speed, after the initial setup. Git pushes data around faster than unison, despite not really being intended to be used this way.
  • Note the use of git clone --shared, and read the caveats about this mode in git-clone(1).
  • git repack is not recommended on clients because it would read and write the whole git repo over NFS.
  • Make sure your NFS server has large file support. (The userspace one doesn't; the kernel one does.) You don't just need it for the enormous pack files: the failure mode I saw was git failing in amusing ways that involved creating empty files.
  • Git doesn't deal very well with a bit flipping somewhere in the middle of a 32 gigabyte pack file. And since this method avoids duplicating the data in .git, the clones are not available as backups if something goes wrong. So if regenerating your entire repo doesn't appeal, keep a backup of it.

(Thanks to Ted Ts'o for the hint about using --shared, which makes this work significantly better, and simpler.)

You can avoid the links.
Check out the --git-dir and --work-tree options to git. You should be able to use those to avoid your links...
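
For instance, something along these lines should work, keeping the repository itself on the NFS mount and only the work tree local (the paths are made up):

    git --git-dir=/mnt/fileserver/stuff.git --work-tree=$HOME/stuff status
    git --git-dir=/mnt/fileserver/stuff.git --work-tree=$HOME/stuff add newalbum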
Comment by ejr [claimid.com]
Another way of avoiding the hardlinks...

When you do the clone, use git clone -s. This sets up the file $GIT_DIR/objects/info/alternates to point at the objects directory of the base repository. It means that git will look for objects in the base repository if they can't be found in the clone directory. That way you can do git gc without worrying about breaking the hard links.
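
With the repository from the post above, that would look something like this (paths assumed):

    git clone -s /mnt/fileserver/stuff.git stuff
    cat stuff/.git/objects/info/alternates
    # prints: /mnt/fileserver/stuff.git/objects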

For example, I normally keep /usr/projects/linux/base as a clone of Linus's linux repository. I do my local hacking in various repositories that are cloned using "git clone base ext4". This causes /usr/projects/linux/ext4/.git/objects/info/alternates to contain the single line "/usr/projects/linux/base/objects".... and the objects directory is otherwise completely empty. When I make commits into the ext4 repository, those objects are created in /usr/projects/linux/ext4/.git/objects; and then when I push them to the base directory, a copy is made there. Afterwards, if I do a "git gc" in /usr/projects/linux/ext4, those objects will eventually disappear. (There is an expiry time for safety reasons, so they won't disappear right away, unless you explicitly prune the reflog via "git reflog expire --expire=0 --expire-unreachable=0 --all; git gc; git prune" --- why this is so is beyond the scope of this comment, though. :-)

In any case, the advantage of using "git clone -s" is that git gc is safe; you don't have to worry about breaking hard links and causing the disk usage to explode. The downside is there is only one copy of the objects, so if you do have local hard disk corruption, the savings in disk space also make your system slightly less robust against random disk-induced data loss.

I do like the idea of experimenting with using git as a replacement for Unison. One potential problem which does come to mind is that git doesn't preserve file permissions, which might be an issue in some cases...

Comment by tytso [livejournal.com]
Avoiding clones altogether

There is a script in git's contrib directory, called something like git-new-workdir. It will create a valid .git dir that symlinks everything to the source. That way you don't get duplicate objects even if you commit in this new workdir. However, you have to be careful to check out a different branch in each such new workdir, since git has no way to know what other workdirs exist and can't update their working tree content if you commit to a branch they have checked out. This is also the reason why it's a contrib script and not an official command.
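
Roughly, it's used like this (the script ships in contrib/workdir of the git source; the paths and branch name here are just examples):

    # create a second working directory that shares ~/stuff/.git's objects and refs
    contrib/workdir/git-new-workdir ~/stuff ~/stuff-tests testing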

Comment by drak.ucw.cz/~bulb//
pack.packSizeLimit
You can use the git configuration variable "pack.packSizeLimit" to keep the packfiles small enough to handle.
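
For example, to cap packs at around a gigabyte (the exact limit is a matter of taste):

    git config pack.packSizeLimit 1g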
Comment by dmarti [myopenid.com]
comment 5

I've also started making a .gitattributes file that contains something like: * -delta

That should speed up commits and/or packs by keeping git from attempting delta compression. I have not used it long enough (and it's not documented) to know for sure where/how it helps.
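
For anyone who wants to try the same thing, it amounts to this (assuming the attributes file goes at the top of the repository):

    echo '* -delta' > .gitattributes
    git add .gitattributes && git commit -m 'turn off delta compression'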

Comment by joey [kitenet.net]
comment 6

Note that when committing from a shared clone, you do end up with locally written objects. I clear these out by running something like this:

git push && rm -vf $MR_REPO/.git/objects/??/*

It's "safe" to delete the local objects once they're pushed to the server.

Comment by joey [kitenet.net]