difficulties in backing up live git repositories

But you can’t just tar.gz up the bare repositories on the server and hope for the best. Maybe a given repository will be in a valid state; maybe it won’t.

-- Jeff Mitchell in a followup to the recent KDE near git disaster

This was a surprising statement to me. I seem to remember that one of the (many) selling points for git talked about back in the day was that it avoided the problem where making a simple cp (or backup) of a repository could lead to an inconsistent result -- a problem that subversion repositories had, and that required annoying commands to work around. (svnadmin $something -- iirc the backend FSFS fixed or avoided most of this issue.)

This prompted me to check how I handle it in ikiwiki-hosting. I must have anticipated a problem at some point, since ikisite backup takes care to lock the git repository in a way that prevents, eg, incoming pushes while a backup is running. Probably, like the KDE developers, I was simply exercising reasonable caution.

The following analysis has probably been written up before (train; limited network availability; can't check), but here are some scenarios to consider:

  • A non-bare repository has two parts that can clearly get out of sync during a backup: The work tree and the .git directory.

    • The .git directory will likely be backed up first, since getdirent will typically return it first (it gets created first). If a change is made to the work tree during that backup, and committed while the work tree is being backed up, the backup won't include that commit -- which is no particular problem and would not be surprising upon restore. Make the commit again and get on with life.

    • However, if (part of) the work tree is backed up before .git, then any changes that are committed to git during the backup would not be reflected in the restored work tree, and git diff would show a reversion of those changes. After restore, care would need to be taken to reset the work tree (without losing any legitimate uncommitted changes).

  • A non-bare repository can also become broken in other ways if just the wrong state is snapshotted. For example, if a commit is in progress during a backup, .git/index.lock may exist, and it will prevent future commits from happening until it's deleted. These problems can also occur if the machine dies at just the wrong time during a commit. Git tells you how to recover. (git could go further than it does to avoid these problems; for example, it could check whether .git/index.lock is actually locked using fcntl, which is something I do in git-annex to make the .git/annex/index.lock file crash safe. The first sketch after this list shows the technique.)

  • A bare repository could be receiving a push (or a non-bare repository a pull) while the backup occurs. These are fairly similar cases, with the main difference being that a non-bare repository has the reflog, which can be used to recover from some inconsistent states that could be backed up. Let's concentrate on pushes to bare repositories.

    • A pack could be in the process of being uploaded during a backup. The KDE developers apparently worried that this could result in a corrupt or inconsistent repository, but TTBOMK it cannot; git transfers the pack to a temp file and atomically renames it into place once the transfer is complete (the second sketch after this list illustrates the pattern). A backup may include an excess temp file, but that can also happen if the system goes down while a push is in progress, and git cleans these things up.

    • A push first transfers the .git/objects, and then updates .git/refs. A backup might first back up the refs, and then the objects. In this case, it would lose the record that the refs were pushed. After a restore, any push from another repository would update the refs, reusing the objects that did get backed up. So git recovers from this, and it's not really a concern.

    • Perhaps a backup chooses to first back up the objects, and then the refs. In this case, it could back up a newly changed ref, without having backed up the referenced objects (because they arrived after the backup had finished with the objects). When this happens, your bare repository is inconsistent; you have to somehow hunt down the correct ref for the objects you do have.

      This is a bad failure mode. git could improve this, perhaps, by maintaining a reflog for bare repositories. (Update: core.logAllRefUpdates can be set to true for bare repositories, but is disabled by default there; the config snippet after this list shows how to turn it on.)

  • A "backup" of a git repository can consist of other clones of it, which do not include .git/hooks/ scripts, .git/config settings, or other potentially valuable information that, strangely, we do not check into revision control despite having this nice revision control system available. This is the most likely failure mode with "git backups". :P
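
Since the fcntl point above is easy to misread, here's a minimal sketch of the technique, in Python rather than the Haskell that git-annex actually uses. The idea is that whatever process creates the lock file also holds an fcntl lock on it for as long as it's working; anything that later finds the file lying around can probe whether it's really held, or just debris left by a crash. Git itself doesn't do this for .git/index.lock, so the function names and use of flock below are purely illustrative, not a description of git's (or git-annex's) actual code.

```python
import errno
import fcntl

def take_lock(path):
    """Create the lock file and hold an fcntl lock on it while working.

    The returned file object must stay open for the lock to be held."""
    f = open(path, "x")                # fails if the lock file already exists
    fcntl.flock(f, fcntl.LOCK_EX)      # held until f is closed or unlocked
    return f

def lock_is_stale(path):
    """Return True if the lock file exists but no live process holds it."""
    try:
        f = open(path, "r")
    except FileNotFoundError:
        return False                   # no lock file at all
    try:
        # Non-blocking probe: only succeeds if nobody else holds the lock.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(f, fcntl.LOCK_UN)
        return True                    # the file is just leftover debris
    except OSError as e:
        if e.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return False               # genuinely locked by a live process
        raise
    finally:
        f.close()
```

A recovery tool (or git itself) could then delete the lock file automatically when lock_is_stale says it's just debris, instead of printing instructions for the user to go delete it by hand.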
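
The reason a half-transferred pack can't corrupt the repository is the standard write-to-a-temp-file-then-rename pattern. Here's the general pattern in Python; this is not git's code, and the file names are made up. A backup (or a crash) can observe the temp file or nothing, but never a partially written file under the final name.

```python
import os
import tempfile

def atomic_write(final_path, data):
    """Write data so that final_path only ever appears fully written."""
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(prefix="incoming-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())         # make sure the bytes are on disk
        os.rename(tmp_path, final_path)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp_path)              # clean up; final_path was never created
        raise
```

The worst a backup can capture is a stray incoming-* temp file sitting next to the real data, which is exactly the excess-temp-file case mentioned above.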
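
For completeness, turning the reflog on in a bare repository is a one-line change: running git config core.logAllRefUpdates true inside the bare repository adds the stanza below to its config file, after which ref updates arriving via pushes get logged much as they would in a non-bare repository.

```
[core]
	logAllRefUpdates = true
```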

I think it's important that git support naive backups of git repositories as well as possible, because that's probably how most backups of git repositories are made. We don't all have time to carefully tune our backup systems to do something special around our git repositories to ensure we get them in a consistent state, as the KDE project did. And as their experience shows, even when we do, we can easily introduce other, unanticipated problems.

Can anyone else think of any other failure modes like these, or find holes in my slightly rushed analysis?


PS: git-annex is itself entirely crash-safe, to the best of my abilities, and also safe for naive backups. But it inherits any problems with naive backups of git repositories.

Goodreads vs LibraryThing vs Free software

Four years ago I started using Goodreads to maintain the list of books I've read (which had lived in a flat text file for a decade+ before that).

Now it's been acquired by Amazon. I doubt it will survive in its current form for more than 2 years. Anyway, while Goodreads has been a quite good way to find out what my friends are reading, I've been increasingly annoyed by the quality of its recommendations, and by its paucity of other features I need. It really doesn't seem to help me keep up with new and interesting fiction at all, unless my friends happen to read it.

So I looked at LibraryThing. Actually, I seem to have looked at it several times before, since it had accounts named "joey", "joeyh", and "joeyhess" that were all mine -- which is what happens to me on sites that lack OpenID or BrowserID.

Digging a little deeper this time, I am finding its recommendations much better than Goodreads' -- although it sometimes seems to recommend books I've already read. And it has some nice features, like tracking series, so you can easily tell whether you've read all the books in a series. The analytics overall seem quite impressive. The UI is cluttered, though, and it seems to take 5 clicks to add and rate a single book. It does support half stars.

Overall I get the feeling this was designed for a set of needs that doesn't quite match mine. For example, it doesn't seem to have a single database entry per book; instead, each time I add a book, it pulls in data from primary sources (Library of Congress, Amazon, cough) and treats this as a separate (but related) entry somehow. Weird. Perhaps this makes sense to, say, librarians. I'm willing to adjust how I think about things if there's an underlying reason that can be grasped.

There's a quite interesting thread on LibraryThing where the founder says:

Don't say we should open-source the code. That would be a nightmare! And I have limited confidence in APIs. LibraryThing has the book geeks, but not so much the computers geeks.

I assume that the nightmare is that there would be dozens of clones of the site, all balkanized, with no data transfer, no federation between them.

Except, that's the current situation, as every Goodreads user who is now trying to use LibraryThing is discovering.

Before I ever started using Goodreads, I made sure it met my minimum criteria for putting my data into a proprietary silo: that I could get the data back out. I can, and have. LibraryThing can import it. But the import process loses data! And it's majorly clunky. If I want to continue using Goodreads for its better UI, and also get the data into LibraryThing for its better analytics, I have to do periodic dumps and loads of CSV files, with manual fixups.

This is why we have standards. This is why we're building federated social networks like status.net and the upcoming pump.io that can pass structured data between nodes transparently. It doesn't have to be a nightmare. It doesn't have to rely on proprietary APIs. We have the computer geeks.

Thing is, sites like Goodreads and LibraryThing need domain-specific knowledge, communities to curate data, and stuff like that -- things that work well in a smallish company. (LibraryThing even has a business model that makes sense: yearly payments to store more books in it.)

With free software, it's much more appealing to sink the time we have into the most general-purpose solution we can. Why build a LibraryThing when we could build something that tracks not only books but movies and music? Why build that when we could build a generic federated network for structured social data? And that's great, as infrastructure, but if that infrastructure is only used to build a succession of proprietary data silos, what was the point?

So, could some computer & book geeks please build a free software alternative to these things, focused on books, that federates using any of the fine APIs we have available? Bear in mind that there is already a nice start at a comprehensive collection of book data in the Open Library. I'd happily contribute to a crowdfunded project doing this.
