databranches: using git as a database

I've just released git-annex version 3, which stops cluttering the filesystem with .git-annex directories. Instead it stores its data in a git-annex branch, which it manages entirely transparently to the user. It is essentially now using git as a distributed NOSQL database. Let's call it a databranch.

This is not an unheard of thing to do with git. The git notes built into recent git does something similar, using a dynamically balanced tree in a hidden branch to store notes. My own pristine-tar injects data into a git branch. (Thanks to Alexander Wirt for showing me how to do that when I was a git newbie.) Some distributed bug trackers store their data in git in various ways.

What I think takes git-annex beyond these is that it not only injects data into git, but it does it in a way that's efficient for large quantities of changing data, and it automates merging remote changes into its databranch. This is novel enough to write up how I did it, especially the latter which tends to be a weak spot in things that use git this way.

Indeed, it's important to approach your design for using git as a database from the perspective of automated merging. Get the merging right and the rest will follow. I've chosen to use the simplest possible merge, the union merge: When merging parent trees A and B, the result will have all files that are in either A or B, and files present in both will have their lines merged (and possibly reordered or uniqed).

The main thing git-annex stores in its databranch is a bunch of presence logs. Each log file corresponds to one item, and has lines with this form:

timestamp [0|1] id

This records whether the item was present at the specified id at a given time. It can be easily union merged, since only the newest timestamp for an id is relevant. Older lines can be compacted away whenever the log is updated. Generalizing this technique for other kinds of data is probably an interesting problem. :)

While git can union merge changes into the currently checked out branch, when using git as a database, you want to merge into your internal-use databranch instead, and maintaining a checkout of that branch is inefficient. So git-annex includes a general purpose git-union-merge command that can union merge changes into a git branch, efficiently, without needing the branch to be checked out. Another problem is how to trigger the merge when git pulls changes from remotes. There is no suitible git hook (post-merge won't do because the checked out branch may not change at all). git-annex works around this problem by automatically merging */git-annex into git-annex each time it is run. I hope that git might eventually get such capabilities built into it to better support this type of thing.

So that's the data. Now, how to efficiently inject it into your databranch? And how to efficiently retrieve it?

The second question is easier to answer, although it took me a while to find the right way ... Which is two orders of magnitude faster than the wrong way, and fairly close in speed to reading data files directly from the filesystem. The right choice is to use git-cat-file --batch; starting it up the first time data is requested, and leaving it running for further queries. This would be straightforward, except git-cat-file --batch is a little difficult when a file is requested that does not exist. To detect that, you'll have to examine its stderr for error messages too. Perhaps git-cat-file --batch could be improved to print something machine parseable to stdout when it cannot find a file. Takes some careful parsing, but straightforward.

Efficiently injecting changes into the databranch was another place where my first attempt was an order of magnitude slower than my final code. The key trick is to maintain a separate index file for the branch. (Set GIT_INDEX_FILE to make git use it.) Then changes can be fed into git by using git hash-object, and those hashes recorded into the branch's index file with git update-index --index-info. Finally, just commit the separate index file and update the branch's ref.

That works ok, but the sad truth is that git's index files don't scale well as the number of files in the tree grows. Once you have a hundred thousand or so files, updating an index file becomes slow, since for every update, git has to rewrite the entire file. I hope that git will be improved to scale better, perhaps by some git wizard who understands index files (does anyone except Junio and Linus?) arranging for them to be modified in-place.

In the meantime, I use a workaround: Each change that will be committed to the databranch is first recorded into a journal file, and when git-annex shuts down, it runs git hash-object just once, passing it all the journal files, and feeds the resulting hashes into a single call to git update-index. Of course, my database code has to make sure to check the journal when retrieving data. And of course, it has to deal with possibly being interrupted in the middle of updating the journal, or before it can commit it, and so forth. If gory details interest you, the complete code for using a git branch as a database, with journaling, is here.

After all that, git-annex turned out to be nearly as fast as before when it was simply reading files from the filesystem, and actually faster in some cases. And without the clutter of the .git-annex/ directory, git use is overall faster, commits are uncluttered, and there's no difficulty with branching. Using a git branch as a database is not always the right choice, and git's plumbing could be improved to better support it, but it is an interesting technique.

weird git tricks

I wrote this code today to verify setup branch pushes on Branchable. When I was writing it I was just down in the trenches coding until it worked, but it's rather surprising that what it does to git does work.

The following code runs in git's update hook. The fast path of the hook (written in C) notices that the user is committing a change to the setup branch, and hands the incoming git ref off to the setup verifier.

                # doing a shared clone makes the setupref, which has
                # not landed on any branch, be available for checkout
                shell("git", "clone", "--quiet", "--shared",
                        "--no-checkout", repository($hostname), $tmpcheckout);
                chdir($tmpcheckout) || error "chdir $tmpcheckout: $!";
                shell("git", "checkout", "--quiet", $setupref, "-b", "setup");

I got lucky here, since I initially passed --shared only to avoid the overhead of a clone of the site's entire git repository (which can be quite large, since Branchable doesn't have any real limits on site size). Without the --shared, the clone wouldn't see the incoming ref at all.

In the setup branch is an ikiwiki.setup file, and we only want to allow safe changes to be committed to it. Ikiwiki has metadata about which configurations are safe. Checking that and various other amusing scenarios (what if someone makes ikiwiki.setup a symlink etc) takes a hundred lines of fairly hairy code, but that doesn't matter here. Eventually it decides the setup file is ok as-is, or it's already died with an error message.

                # Check out setup file in toplevel. This is slightly tricky
                # as the commit has not landed in the bare git repo yet --
                # but it is available in the tmpcheckout.
                shell("git", "pull", "-q", $tmpcheckout, "setup");

                # Refresh or rebuild site to reflect setup changes.
                print STDERR "Updating site to reflect setup changes...\n";
                shell("ikiwiki", "-setup", "ikiwiki.setup", "-v",
                        ($rebuild_needed ? ("-rebuild") : ("-refresh", "-wrappers"))
                );

When this code runs there are three repositories, each with a different view of the setup branch. The main bare repository is waiting for the hook to succeed before it updates the ref to point to what was pushed. The temporary clone has what was pushed already checked out. And the site's home directory still has the old version of the setup branch checked out. Possibly even a version that has diverged from what's in the bare repository.

It's rather odd that the update hook goes and causes that latter repository to be updated, before the change has finished landing in the bare repository. But it does work; it ensures that if there is some kind of bizzare merge problem the user doing the push sees it, and I probably won't regret it.

The result certianly is nice -- edit ikiwiki.setup file locally, commit and push it, and ikiwiki automatically reconfigures itself and even rebuilds your whole site if you've changed something significant.

Posted
couchdb

Couchdb came onto my radar since distributed stuff is interesting to me these days. But most of what was being written about it put me off, since it seemed to be very web-oriented, with javascript and html and stuff stored in the database, served right out of it to web browsers in an AJAXy mess.

Also, it's a database. I decided a long, long time ago not to mess with traditional databases. (They're great, they're just not great for me. Said the guy leaving after 5 years in the coal mines.)

Then I saw Damien Katz's talk about how he gave up everything to go off and create couchdb. Was very inspirational. Seemed it must be worth another look, with that story behind it.

Now I'm reading the draft O'Rielly book, like some things, as expected don't like others[1], and am not sure what to think overall (plus still have half the book to get through yet), but it has spurred some early thoughts:

... vs DVCS

Couchdb is very unlike a distributed VCS, and yet it's moved from traditional database country much closer to VCS land. It's document oriented, not normalized; the data stored in it has significant structure, but is also in a sense freeform. It doesn't necessarily preserve all history, but it does support multiple branches, merging, and conflict resolution.

Oddly, the thing I dislike most about it is possibly its biggest strength compared to a VCS, and that is that code is stored in the database alongside the data. That means that changes to the data can trigger processing, so it is mapped, reduced, views are updated, etc, on demand. This is done using code that is included in the database, and so is always available, and runs in an environment couchdb provides -- so replicating the database automatically deploys it.

Compare with a VCS, where anything that is triggered by changes to the data is tacked onto the side in hooks, has to be manually set up, and so is poorly integrated overall.

Basically, what I've been doing with ikiwiki is adding some smarts about handling a particular kind of data, on top of the VCS. But this is done via a few narrow hooks; cloning the VCS repository does not get you a wiki set up and ready to go.

There are good reasons why cloning a VCS repository does not clone the hooks associated with it. The idea of doing so seems insane; how could you trust those hooks? How could they work when cloned to another environment? And so that's Never Been Done[2]. But with couchdb's example, this is looking to me like a blind spot, that has probably stunted the range of things VCSs are used for.

If you feel, like I do, that it's great we have these amazing distributed VCSs, with so many advanced capabilities, but a shame that they're only used by software developers, then that is an exciting thought.


[1] Javascript? Mixed all in a database with data it runs on? Imperative code that's supposed to be side-effect free? (I assume the Haskell guys have already been all over that.) Code stored without real version control? Still having a hard time with this. :)

[2] I hope someone will give a counterexample of a VCS that does so in the comments?

Posted
git as an alternative to unison

I've used unison for a long while for keeping things like my music in sync between machines. But it's never felt entirely safe, or right. (Or fast!) Using a VCS would be better, but would consume a lot more space.

Well, space still matters on laptops, with their smallish SSDs, but I have terabytes of disk on my file servers, so VCS space overhead there is no longer of much concern for files smaller than videos. So, here's a way I've been experimenting with to get rid of unison in this situation.

  • Set up some sort of networked filesystem connection to the file server. I hate to admit I'm still using NFS.

  • Log into the file server, init a git repo, and check all your music (or whatever) into it.

  • When checking out on each client, use git clone --shared. This avoids including any objects in the client's local .git directory.

    git clone --shared /mnt/fileserver/stuff.git stuff
  • Now you can just use git as usual, to add/remove stuff, commit, update, etc.

Caveats:

  • git add is not very fast. Reading, checksumming, and writing out gig after gig of data can be slow. Think hours. Maybe days. (OTOH, I ran that on an Thecus.)
  • Overall, I'm happy with the speed, after the initial setup. Git pushes data around faster than unison, despite not really being intended to be used this way.
  • Note that use of git clone --shared, and read the caveats about this mode in git-clone(1).
  • git repack is not recommended on clients because it would read and write the whole git repo over NFS.
  • Make sure your NFS server has large file support. (The userspace one doesn't; kernel one does.) You don't just need it for enormous pack files. The failure mode I saw was git failing in amusing ways that involved creating empty files.
  • Git doesn't deal very well with a bit flipping somewhere in the middle of a 32 gigabyte pack file. And since this method avoids duplicating the data in .git, the clones are not available as backups if something goes wrong. So if regenerating your entire repo doesn't appeal, keep a backup of it.

(Thanks to Ted T'so for the hint about using --shared, which makes this work significantly better, and simpler.)

proposing rel-vcs

I'm working on designing a microformat that can be used to indicate the location of VCS (git, svn, etc) repositories related to a web page.

I'd appreciate some web standards-savvy eyes on my rel-vcs microformat rfc.

If it looks good, next steps will be making things like gitweb, viewvc, ikiwiki, etc, support it. I've already written a preliminary webcheckout tool that will download an url, parse the microformat, and run the appropriate VCS program(s).

(Followed by, with any luck, github, ohloh, etc using the microformat in both the pages they publish, and perhaps, in their data importers.)

Why? Well,

  1. A similar approach worked great for Debian source packages with the XS-VCS-* fields.
  2. Pasting git urls from download pages of software projects gets old.
  3. I'm tired of having to do serious digging to find where to clone the source to websites like Keith Packard's blog, or cariographics.org, or St Hugh of Lincoln Primary School. Sites that I know live in a git repo, somewhere.
  4. With the downturn, hosting sites are going down left and right, and users who trusted their data to these sites are losing it. Examples include AOL Hometown and Ficlets, Google lively, Journalspace, podango, etc etc. Even livejournal's future is looking shakey. Various people are trying to archive some of this data before it vanishes for good. I'm more interested in establishing best practices that make it easy and attractive to let all the data on your website be cloned/forked/preserved. Things that people bitten by these closures just might demand in the future. This will be one small step in that direction.a
Posted
merge conflict

I'm writing a piece of autobiography/alternate world fiction, using git. Whether it will get finished or be any good, or be too personal to share I don't know. The idea though is sorta interesting -- a series of descriptions of inflection points in a life, each committed into git at the time it describes. As the life paths diverge, branches form, but never quite merge.

Reading this would not be quite like reading one of those choose your own adventure books. Rather you'd start at the end of a path and read back through the choices and events that led there. Or browse around for interesting nuggets in gitk. Or perhaps the point isn't that it be read at all, but is instead in the writing, and the committing.

discussion

Posted
anonymous git push to ikiwiki

So, ikiwiki keeps wikis in git. But until today, that's only meant that the wiki's owners can edit it via git. Everyone else was stuck using the web interface.

Wouldn't it be nice then if anyone could check out the wiki source, modify it, and push it back? Now you can!

git clone git://git.ikiwiki.info/
cd git.ikiwiki.info
vim doc/sandbox.mdwn
git commit -a -m "I'm in your git, editing your wiki."
git push

The secret sauce, that makes this not a recipe for disaster but just a nice feature, is that ikiwiki checks each change as it's pushed in, and rejects any changes that couldn't be made to the wiki with a web browser.

So if you use ikiwiki for a wiki, you might want to turn on untrusted git push.

Posted
graphical annotate

I'm envisioning a graphical app that displays a file. Like a pager, the up and down arrows move through the file. But the left and right arrows move through time. As each successive change to the file is displayed, the committer's name appears in a column to the left of the lines changed in that commit. Hover the mouse over it to see the commit message. Names of old committers will fade out as time advances, but still be visible for a while. (A menu option will disable the fade out entirely.)

A nice bonus feature would be to allow opening multiple windows, with multiple files from the same repo. Moving back and forward in time would affect them all at once.

A nice, but getting harder feature would be to have a horizontal timeline at the bottom, including branches, so you could click on a specific branch to visit it. (Without this, when passing a fork or merge point, it would have to choose a branch heuristically?)

A tricky subtle feature would be to attempt to keep the current code block centered in the display as lines are added/removed from the file, adjusting scroll bar position to compensate.

There seems to be a gannotate for bzr, that may do something like this. Offline so I can't try it.

Google-and-caffine-fed update: bzr gannotate is closest to what I envisoned, though without a few of the bonuses (fade-out, smart scrolling, multiple files). qgit's "tree view" includes the same functionality, but the interface isn't as nice.

discussion

Posted
lazyweb: git cia hooks

Dear LazyWeb,

I use the standard ciabot.pl script in a git post-receive hook. This works ok, except in the case where changes are made in a published branch, and then that branch is merged into a second published branch.

In that case, ciabot.pl reports all the changes twice, once when they're committed to the published branch, and again when the branch is merged. This is worst when I sync up two branches; if there were a lot of changes made on either branch, they all flood into the irc channel again.

Am I using the ciabot.pl script wrong, or is there a better script I should use? Or maybe there's a CIA alternative that is smarter about git commits, so it will filter out duplicates?

Here, FWIW, is how I currently use it in my post-receive hook.

while read oldrev newrev refname; do
    refname=${refname#refs/heads/}
    [ "$refname" = "master" ] && refname=
    for merged in $(git rev-list --reverse $newrev ^$oldrev); do
        ciabot_git.pl $merged $refname
    done
done

Update: After hints and discussion from Buxy, I arrived at the following:

while read oldrev newrev refname; do
    branchname=${refname#refs/heads/}
    [ "$branchname" = "master" ] && branchname=
    for merged in $(git rev-parse --symbolic-full-name --not --branches | egrep -v "^\^$refname$" | git rev-list --reverse --stdin $oldrev..$newrev); do
        ciabot_git.pl $merged $branchname
    done
done

With this, changes available in another published branch are not sent to CIA.

There might still be some bugs with this.

distributed wikis

Done some interesting stuff in ikiwiki this evening..

Maybe you want to set up a mirror of a wiki. It's easy enough to do with an ikiwiki that's backed by git since you can just clone its repository and set up the mirror. But how to know when there's an update of the origin wiki, to update your mirror? I've added a plugin that allows you to edit a page on the origin wiki, and ask it to ping your wiki. And another plugin that your wiki can use to listen for pings and update itself, pulling down the changes from version control.

Nice thing about this is that any ikiwiki wiki that publishes its revision control, and enables the pinger plugin, can then be mirrored by anyone, with no coordination needed with its admin. Even multilevel mirror networks are possible to set up. (The astute may notice that loops are also possible.. but they will will be broken after 1 cycle.)

But this doesn't only allow mirroring. If you're using distributed version control, it also allows branching of a wiki. Just mirror as usual, but then make changes to the mirror, and don't send them back to the origin. Instant branch, that will be kept up-to-date with changes made to the origin. (Unless there's a conflict, that would need to be manually resolved, obviously.)

Wouldn't it be nice if you could git clone git://wikipedia.org/ or git://wiki.debian.org/ and go off and make it into something you're really happy with? Only thing standing in the way is that neither site uses ikiwiki. For now, you'll have to settle with cloning and branching git://git.ikiwiki.info/ :-)

Technical details here.

Posted