git-annex v6

Version 6 of git-annex, released last week, adds a major new feature; support for unlocked large files that can be edited as usual and committed using regular git commands.

For example:

git init
git annex init --version=6
mv ~/foo.iso .
git add foo.iso
git commit -m "added hundreds of megabytes to git annex (not git)"
git remote add origin ssh://sever/dir
git annex sync origin --content # uploads foo.iso

Compare that with how git-annex has worked from the beginning, where git annex add is used to add a file, and then the file is locked, preventing further modifications of it. That is still a very useful way to use git-annex for many kinds of files, and is still supported of course. Indeed, you can easily switch files back and forth between being locked and unlocked.

This new unlocked file mode uses git's smudge/clean filters, and I was busy developing it all through December. It started out playing catch-up with git-lfs somewhat, but has significantly surpassed it now in several ways.

So, if you had tried git-annex before, but found it didn't meet your needs, you may want to give it another look now.

Now a few thoughts on git-annex vs git-lfs, and different tradeoffs made by them.

After trying it out, my feeling is that git-lfs brings an admirable simplicity to using git with large files. File contents are automatically uploaded to the server when a git branch is pushed, and downloaded when a branch is merged, and after setting it up, the user may not need to change their git workflow at all to use git-lfs.

But there are some serious costs to that simplicity. git-lfs is a centralized system. This is especially problimatic when dealing with large files. Being a decentralized system, git-annex has a lot more flexability, like transferring large file contents peer-to-peer over a LAN, and being able to choose where large quantities of data are stored (maybe in S3, maybe on a local archive disk, etc).

The price git-annex pays for this flexability is you have to configure it, and run some additional commands. And, it has to keep track of what content is located where, since it can't assume the answer is "in the central server".

The simplicity of git-lfs also means that the user doesn't have much control over what files are present in their checkout of a repository. git-lfs downloads all the files in the work tree. It doesn't have facilities for dropping the content of some files to free up space, or for configuring a repository to only want to get a subset of files in the first place. On the other hand, git-annex has excellent support for all those things, and this comes largely for free from its decentralized design.

If git has showed us anything, it's perhaps that a little added complexity to support a fully distributed system won't prevent people using it. Even if many of them end up using it in a mostly centralized way. And that being decentralized can have benefits beyond the obvious ones.

Oh yeah, one other advantage of git-annex over git-lfs. It can use half as much disk space!

A clone of a git-lfs repository contains one copy of each file in the work tree. Since the user can edit that file at any time, or checking out a different branch can delete the file, it also stashes a copy inside .git/lfs/objects/.

One of the main reasons git-annex used locked files, from the very beginning, was to avoid that second copy. A second local copy of a large file can be too expensive to put up with. When I added unlocked files in git-annex v6, I found it needed a second copy of them, same as git-lfs does. That's the default behavior. But, I decided to complicate git-annex with a config setting:

git config annex.thin true
git annex fix

Run those two commands, and now only one copy is needed for unlocked files! How's it work? Well, it comes down to hard links. But there is a tradeoff here, which is why this is not the default: When you edit a file, no local backup is preserved of its old content. So you have to make sure to let git-annex upload files to another repository before editing them or the old version could get lost. So it's a tradeoff, and maybe it could be improved. (Only thin out a file after a copy has been uploaded?)

This adds a small amount of complexity to git-annex, but I feel it's well worth it to let unlocked files use half the disk space. If the git-lfs developers are reading this, that would probably be my first suggestion for a feature to consider adding to git-lfs. I hope for more opportunities to catch-up to git-lfs in turn.