Version 6 of git-annex, released last week, adds a major new feature: support for unlocked large files that can be edited as usual and committed using regular git commands.
For example:
git init
git annex init --version=6
mv ~/foo.iso .
git add foo.iso
git commit -m "added hundreds of megabytes to git annex (not git)"
git remote add origin ssh://server/dir
git annex sync origin --content # uploads foo.iso
Compare that with how git-annex has worked from the beginning, where "git annex add" is used to add a file, and then the file is locked, preventing further modifications of it. That is still a very useful way to use git-annex for many kinds of files, and is still supported of course. Indeed, you can easily switch files back and forth between being locked and unlocked.
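As a minimal sketch of that switching, assuming git-annex is installed and working in a throwaway repository, "git annex unlock" and "git annex lock" move a file between the two modes. The file name and its contents are placeholders, and the symlink check assumes a regular (non-crippled) filesystem, where locked files are symlinks:

```shell
#!/bin/sh
set -eu

# Skip quietly where git-annex is unavailable.
command -v git-annex >/dev/null 2>&1 || { echo "git-annex not installed; skipping"; exit 0; }

# Throwaway repository so nothing real is touched.
cd "$(mktemp -d)"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"
git annex init "demo" >/dev/null

# Placeholder content standing in for a large file.
printf 'pretend this is a large file\n' > foo.iso
git annex add foo.iso >/dev/null
git commit -q -m "add foo.iso (locked)"

# Unlock: foo.iso becomes a regular file that can be edited in place.
git annex unlock foo.iso >/dev/null
if [ -L foo.iso ]; then echo "still locked"; else echo "unlocked"; fi

# Lock again: the file goes back to being protected from modification.
git annex lock foo.iso >/dev/null
if [ -L foo.iso ]; then echo "locked"; else echo "still unlocked"; fi
```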
This new unlocked file mode uses git's smudge/clean filters, and I was busy developing it all through December. It started out playing catch-up with git-lfs somewhat, but has significantly surpassed it now in several ways.
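To give a feel for what a smudge/clean filter does, here is a toy sketch using plain git. It is not git-annex's actual filter: the filter name "toy", the two helper scripts, the "pointer:" format, and the store directory are all made up for illustration. The clean filter runs when a file is added and hands git a small pointer in place of the content; the smudge filter runs on checkout and turns the pointer back into the content.

```shell
#!/bin/sh
set -eu

# Everything lives in a throwaway directory.
work=$(mktemp -d)
store="$work/store"   # hypothetical content store, kept outside the repository
mkdir -p "$store" "$work/repo"

cat > "$work/clean.sh" <<'EOF'
#!/bin/sh
# clean: stdin is the work tree content; stash it in the store ($1)
# and print a short pointer for git to record instead.
tmp=$(mktemp)
cat > "$tmp"
key=$(git hash-object --no-filters "$tmp")
mv "$tmp" "$1/$key"
echo "pointer:$key"
EOF

cat > "$work/smudge.sh" <<'EOF'
#!/bin/sh
# smudge: stdin is the pointer recorded in git; print the stashed content.
read -r pointer
cat "$1/${pointer#pointer:}"
EOF

cd "$work/repo"
git init -q .
git config user.email "you@example.com"
git config user.name "Example"

# Register the toy filter and apply it to *.iso files.
git config filter.toy.clean "sh '$work/clean.sh' '$store'"
git config filter.toy.smudge "sh '$work/smudge.sh' '$store'"
echo '*.iso filter=toy' > .gitattributes

printf 'pretend this is hundreds of megabytes\n' > foo.iso
git add .gitattributes foo.iso
git commit -q -m "add foo.iso through the toy filter"

# What git stored is the pointer, not the content.
git cat-file -p HEAD:foo.iso

# Delete the file and check it out again: smudge restores the real content.
rm foo.iso
git checkout -q -- foo.iso
cat foo.iso
```

The regular "git add" and "git commit" commands never see the large content, only the pointer, which is why no special commands are needed to add or edit such files.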
So, if you had tried git-annex before, but found it didn't meet your needs, you may want to give it another look now.
Now a few thoughts on git-annex vs git-lfs, and the different tradeoffs they make.
After trying it out, my feeling is that git-lfs brings an admirable simplicity to using git with large files. File contents are automatically uploaded to the server when a git branch is pushed, and downloaded when a branch is merged, and after setting it up, the user may not need to change their git workflow at all to use git-lfs.
But there are some serious costs to that simplicity. git-lfs is a centralized system. This is especially problematic when dealing with large files. Being a decentralized system, git-annex has a lot more flexibility, like transferring large file contents peer-to-peer over a LAN, and being able to choose where large quantities of data are stored (maybe in S3, maybe on a local archive disk, etc).
The price git-annex pays for this flexibility is that you have to configure it, and run some additional commands. And, it has to keep track of what content is located where, since it can't assume the answer is "in the central server".
The simplicity of git-lfs also means that the user doesn't have much control over what files are present in their checkout of a repository. git-lfs downloads all the files in the work tree. It doesn't have facilities for dropping the content of some files to free up space, or for configuring a repository to only want to get a subset of files in the first place. On the other hand, git-annex has excellent support for all those things, and this comes largely for free from its decentralized design.
If git has shown us anything, it's perhaps that a little added complexity to support a fully distributed system won't prevent people from using it, even if many of them end up using it in a mostly centralized way. And being decentralized can have benefits beyond the obvious ones.
Oh yeah, one other advantage of git-annex over git-lfs. It can use half as much disk space!
A clone of a git-lfs repository contains one copy of each file in the work tree. Since the user can edit that file at any time, or checking out a different branch can delete the file, it also stashes a copy inside .git/lfs/objects/.
One of the main reasons git-annex used locked files, from the very beginning, was to avoid that second copy. A second local copy of a large file can be too expensive to put up with. When I added unlocked files in git-annex v6, I found it needed a second copy of them, same as git-lfs does. That's the default behavior. But, I decided to complicate git-annex with a config setting:
git config annex.thin true
git annex fix
Run those two commands, and now only one copy is needed for unlocked files! How's it work? Well, it comes down to hard links. But there is a tradeoff here, which is why this is not the default: When you edit a file, no local backup is preserved of its old content. So you have to make sure to let git-annex upload files to another repository before editing them or the old version could get lost. So it's a tradeoff, and maybe it could be improved. (Only thin out a file after a copy has been uploaded?)
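The hard link tradeoff can be seen with plain shell commands, no git-annex involved. This is only an illustration of how hard links behave, not of git-annex's internals; the objects/KEY path is a made-up stand-in for the stored copy of a file's content.

```shell
#!/bin/sh
set -eu

cd "$(mktemp -d)"
mkdir objects

# Hypothetical layout: objects/KEY stands in for the stored copy of the content.
printf 'version 1\n' > objects/KEY

# Instead of copying, give the same data a second name in the work tree.
ln objects/KEY foo.iso

# Both names show the same inode number, so the data is on disk only once.
ls -i objects/KEY foo.iso

# Editing the work tree file in place changes the stored copy too:
# the old content is no longer available locally.
printf 'version 2\n' > foo.iso
cat objects/KEY   # prints "version 2"
```

With a copy in place of the hard link, objects/KEY would still hold "version 1" after the edit, at the cost of storing the data twice.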
This adds a small amount of complexity to git-annex, but I feel it's well worth it to let unlocked files use half the disk space. If the git-lfs developers are reading this, that would probably be my first suggestion for a feature to consider adding to git-lfs. I hope for more opportunities to catch up to git-lfs in turn.
git annex v6 is a great idea. while i still have to use and test it, i find it admirable that you went through the trouble of evaluating a competing approach to a similar problem space, then adopted and even extended it in your own project. this is the true hacker spirit at work, and while it's unfortunate that the implementation isn't directly compatible with LFS (it couldn't be, since it's decentralised) or that github didn't consider using git-annex for its implementation (it's a commercial silo after all), it is great that the ideas are percolating around like this. it will hopefully make both projects better. :)
thanks again for git-annex!
Wow, I've been following git-annex from the beginning. At first I thought it was way too complicated, so I tried to find a different way to meet my needs. After thinking about it really hard, I came to the conclusion that a system that solves all my needs has to be as complicated as git annex. Now I just discovered version 6 and largefiles with mimetype support (https://git-annex.branchable.com/tips/largefiles/). This solution still fulfills all my needs and now it is easy to use, too. Just great! Thank you very much! You really superseded git lfs. Best regards Oliver