I've just released git-annex version 3, which stops cluttering
the filesystem with .git-annex
directories. Instead it stores its
data in a git-annex branch, which it manages entirely transparently
to the user. It is essentially now using git as a distributed NOSQL database.
Let's call it a databranch.
This is not an unheard-of thing to do with git. The git notes feature built into recent git does something similar, using a dynamically balanced tree in a hidden branch to store notes. My own pristine-tar injects data into a git branch. (Thanks to Alexander Wirt for showing me how to do that when I was a git newbie.) Some distributed bug trackers store their data in git in various ways.
What I think takes git-annex beyond these is that it not only injects data into git, but does so in a way that's efficient for large quantities of changing data, and it automates merging remote changes into its databranch. This is novel enough to write up how I did it, especially the latter, which tends to be a weak spot in things that use git this way.
Indeed, it's important to approach your design for using git as a database from the perspective of automated merging. Get the merging right and the rest will follow. I've chosen to use the simplest possible merge, the union merge: when merging parent trees A and B, the result will have all files that are in either A or B, and files present in both will have their lines merged (and possibly reordered or uniqued).
The main thing git-annex stores in its databranch is a bunch of presence logs. Each log file corresponds to one item, and has lines with this form:
timestamp [0|1] id
This records whether the item was present at the specified id at a given time. It can be easily union merged, since only the newest timestamp for an id is relevant. Older lines can be compacted away whenever the log is updated. Generalizing this technique for other kinds of data is probably an interesting problem. :)
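To make that concrete, here's a minimal sketch of the merge-and-compact step, in Python for illustration (git-annex itself is written in Haskell):

    # Union merge two presence logs, then compact: for each id, only the
    # line with the newest timestamp survives.
    def compact(lines):
        newest = {}
        for line in lines:
            timestamp, status, uuid = line.split(None, 2)
            if uuid not in newest or float(timestamp) > float(newest[uuid][0]):
                newest[uuid] = (timestamp, status)
        return sorted("%s %s %s" % (t, s, u) for u, (t, s) in newest.items())

    def union_merge(log_a, log_b):
        # the union merge itself is just a set union of lines
        return compact(set(log_a) | set(log_b))

Since compaction only ever drops lines that a newer line supersedes, it's safe to do it opportunistically, whenever a log happens to be updated.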
While git can union merge changes into the currently checked out branch,
when using git as a database, you want to merge into your internal-use
databranch instead, and maintaining a checkout of that branch is inefficient.
So git-annex includes a general-purpose git-union-merge command that can union merge changes into a git branch, efficiently, without needing the branch to be checked out. Another problem is how to trigger the merge when git pulls changes from remotes. There is no suitable git hook (post-merge won't do, because the checked-out branch may not change at all).
git-annex works around this problem by automatically merging */git-annex
into git-annex
each time it is run. I hope that git might eventually get
such capabilities built into it to better support this type of thing.
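To illustrate what git-union-merge has to do, here's a simplified sketch of a union merge done purely with git plumbing, with no checkout involved. It's Python rather than the real Haskell, assumes mode 100644 text files with plain names, and ignores deletions:

    import os, subprocess, tempfile

    def git(*args, **kw):
        return subprocess.check_output(("git",) + args, **kw)

    def tree_files(ref):
        # map of path -> blob sha for every file in the tree
        files = {}
        for line in git("ls-tree", "-r", ref).splitlines():
            meta, path = line.split(b"\t", 1)
            files[path] = meta.split()[2]
        return files

    def blob_lines(sha):
        return set(git("cat-file", "blob", sha).splitlines())

    def union_merge(branch, other):
        ours, theirs = tree_files(branch), tree_files(other)
        # a scratch index, so the real index and work tree stay untouched
        env = dict(os.environ, GIT_INDEX_FILE=tempfile.mktemp())
        git("read-tree", branch, env=env)
        for path, sha in theirs.items():
            if path in ours and ours[path] != sha:
                # both sides have the file: union the lines, store the result
                merged = b"\n".join(sorted(blob_lines(ours[path]) | blob_lines(sha)))
                p = subprocess.Popen(["git", "hash-object", "-w", "--stdin"],
                                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
                sha = p.communicate(merged + b"\n")[0].strip()
            git("update-index", "--add", "--cacheinfo", "100644", sha, path,
                env=env)
        tree = git("write-tree", env=env).strip()
        commit = git("commit-tree", tree, "-p", branch, "-p", other,
                     "-m", "union merge").strip()
        git("update-ref", "refs/heads/" + branch, commit)

Files present on only one side fall out naturally: read-tree seeds the index with our files, and update-index --add brings in theirs.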
So that's the data. Now, how to efficiently inject it into your databranch? And how to efficiently retrieve it?
The second question is easier to answer, although it took me a while to find the right way, which turns out to be two orders of magnitude faster than the wrong way, and fairly close in speed to reading data files directly from the filesystem.
The right choice is to use git-cat-file --batch, starting it up the first time data is requested, and leaving it running for further queries. The responses take some careful parsing, but are straightforward. The one difficulty is when a file is requested that does not exist; to detect that, you'll have to examine its stderr for error messages too. Perhaps git-cat-file --batch could be improved to print something machine-parseable to stdout when it cannot find a file.
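For concreteness, here's a sketch of that long-running reader, again in Python, assuming the behavior where git reports missing objects on stdout (older versions only complained on stderr, hence the grumbling above):

    import subprocess

    # started on first use, then left running for all further queries
    catfile = subprocess.Popen(["git", "cat-file", "--batch"],
                               stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def read_file(branch, path):
        """Return the contents of path on branch, or None if absent."""
        catfile.stdin.write(("%s:%s\n" % (branch, path)).encode())
        catfile.stdin.flush()
        header = catfile.stdout.readline().split()
        if header[-1] == b"missing":
            return None
        size = int(header[2])            # header is "<sha> <type> <size>"
        data = catfile.stdout.read(size)
        catfile.stdout.read(1)           # eat the newline after the content
        return data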
Efficiently injecting changes into the databranch was another place where
my first attempt was an order of magnitude slower than my final code.
The key trick is to maintain a separate index file for the branch.
(Set GIT_INDEX_FILE to make git use it.) Then changes can be fed into git by using git hash-object, and those hashes recorded into the branch's index file with git update-index --index-info. Finally, just commit the separate index file and update the branch's ref.
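Put together, injecting one change looks roughly like this (a sketch in Python; the index file path is made up, and it assumes the index was seeded once with git read-tree from the branch):

    import os, subprocess

    env = dict(os.environ, GIT_INDEX_FILE=".git/databranch.index")

    def inject(branch, path, content):
        # store the content as a blob, without touching any checkout
        p = subprocess.Popen(["git", "hash-object", "-w", "--stdin"],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        sha = p.communicate(content)[0].strip().decode()
        # record the blob in the branch's private index file
        info = "100644 %s\t%s\n" % (sha, path)
        subprocess.run(["git", "update-index", "--index-info"],
                       input=info.encode(), env=env, check=True)
        # commit the index and advance the branch's ref
        tree = subprocess.check_output(["git", "write-tree"], env=env).strip()
        parent = subprocess.check_output(["git", "rev-parse", branch]).strip()
        commit = subprocess.check_output(
            ["git", "commit-tree", tree, "-p", parent, "-m", "update"]).strip()
        subprocess.run(["git", "update-ref", "refs/heads/" + branch, commit],
                       check=True)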
That works ok, but the sad truth is that git's index files don't scale well as the number of files in the tree grows. Once you have a hundred thousand or so files, updating an index file becomes slow, since for every update, git has to rewrite the entire file. I hope that git will be improved to scale better, perhaps by some git wizard who understands index files (does anyone except Junio and Linus?) arranging for them to be modified in-place.
In the meantime, I use a workaround: Each change that will be committed to
the databranch is first recorded into a journal file, and when git-annex
shuts down, it runs git hash-object just once, passing it all the journal files, and feeds the resulting hashes into a single call to git update-index. Of course, my database code has to make sure to check the
journal when retrieving data. And of course, it has to deal with possibly
being interrupted in the middle of updating the journal, or before it can
commit it, and so forth. If gory details interest you, the complete code
for using a git branch as a database, with journaling, is
here.
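Here's the shape of that journal flush, as a hedged sketch (the journal directory and the mapping from journal file name to path in the branch are made up; the real code also locks and handles interruptions):

    import os, subprocess

    JOURNAL = ".git/annex/journal"   # hypothetical location
    env = dict(os.environ, GIT_INDEX_FILE=".git/databranch.index")

    def flush_journal():
        files = sorted(os.listdir(JOURNAL))
        if not files:
            return
        # one hash-object call for all the journal files at once ...
        paths = "\n".join(os.path.join(JOURNAL, f) for f in files)
        shas = subprocess.check_output(
            ["git", "hash-object", "-w", "--stdin-paths"],
            input=paths.encode()).decode().split()
        # ... and one update-index call to record every resulting hash
        info = "".join("100644 %s\t%s\n" % (sha, f)
                       for sha, f in zip(shas, files))
        subprocess.run(["git", "update-index", "--index-info"],
                       input=info.encode(), env=env, check=True)
        for f in files:
            os.unlink(os.path.join(JOURNAL, f))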
After all that, git-annex turned out to be nearly as fast as before, when it was simply reading files from the filesystem, and actually faster in some cases. And without the clutter of the .git-annex/ directory,
git use is overall faster, commits are uncluttered, and there's no difficulty
with branching. Using a git branch as a database is not always the right
choice, and git's plumbing could be improved to better support it, but it
is an interesting technique.
Save the database on the worldwide torrent network; this duo, git-torrent, could be hitting the streets soon. It would be safe, even where/when internet use is forbidden by law.
@Jonathan thanks for the correction
@Mark I should take another look at git-mktree ... as long as the files in the database are hashed into small trees, it should be fairly efficient to use, but due to the hashing it'd need to be run to generate the subdirectory trees and on up to the parent, so it's fairly complex.
Hi, Joey.
I think that it is painful to keep ikiwiki configuration files outside of its repository, since with the plain repository alone I can't exactly reproduce a site compiled with ikiwiki. In other words, I would love to see a self-hosted repository.
What about using a branch (say, configuration or config) to store the configuration in the sources of an ikiwiki site? If ikiwiki knew about this, and it were not passed an explicit configuration file, then it could look for a configuration branch and the setup of the site there (if the user so desired). What about this idea? It would make it easier to make self-hosted ikiwiki repositories, wouldn't it?
Thanks.