But you can’t just tar.gz up the bare repositories on the server and hope for the best. Maybe a given repository will be in a valid state; maybe it won’t.

-- Jeff Mitchell in a followup to the recent KDE near git disaster

This was a surprising statement to me. I seem to remember that one of (many) selling points for git talked about back in the day was that it avoided the problem that making a simple cp (or backup) of a repository could lead to an inconsistent result. A problem that subversion repositories had, and required annoying commands to work around. (svnadmin $something -- iirc the backend FSFS fixed or avoided most of this issue.)

This prompted me to check how I handle it in ikiwiki-hosting. I must have anticipated a problem at some point, since ikisite backup takes care to lock the git repository in a way that prevents eg, incoming pushes while a backup is running. Probably, like the KDE developers, I was simply exercising reasonable caution.

The following analysis has probably been written up before (train; limited network availability; can't check), but here are some scenarios to consider:

  • A non-bare repository has two parts that can clearly get out of sync during a backup: The work tree and the .git directory.

    • The .git directory will likely be backed up first, since getdirent will typically return it first, since it gets created first . If a change is made to the work tree during that backup, and committed while the work tree is being backed up, the backup won't include that commit -- which is no particular problem and would not be surprising upon restore. Make commit again and get on with life.

    • However, if (part of) the work tree is backed up before .git, then any changes that are committed to git during the backup would not be reflected in the restored work tree, and git diff would show a reversion of those changes. After restore, care would need to be taken to reset the work tree (without losing any legitimate uncommitted changes).

  • A non-bare repository can also become broken in other ways if just the wrong state is snapshotted. For example, if a commit is in progress during a backup, .git/index.lock may exist, and prevent future commits from happening, until it's deleted. These problems can also occur if the machine dies at just the right time during a commit. Git tells you how to recover. (git could go further to avoid these problems than it does; for example it could check if .git/index.lock is actually locked using fcntl. Something I do in git-annex to make the .git/annex/index.lock file crash safe.)

  • A bare repository could be receiving a push (or a non-bare repository a pull) while the backup occurs. These are fairly similar cases, with the main difference being that a non-bare repository has the reflog, which can be used to recover from some inconsist states that could be backed up. Let's concentrate on pushes to bare repositories.

    • A pack could be in the process of being uploaded during a backup. The KDE developers apparently worried that this could result in a corrupt or inconsistent repository, but TTBOMK it cannot; git transfers the pack to a temp file and atomically renames it into place once the transfer is complete. A backup may include an excess temp file, but this can also happen if the system goes down while a push is in progress. Git cleans these things up.

    • A push first transfers the .git/objects, and then updates .git/refs. A backup might first back up the refs, and then the objects. In this case, it would lose the record that refs were pushed. After being restored, any push from another repository would update the refs, even using the objects that did get backed up. So git recovers from this, and it's not really a concern.

    • Perhaps a backup chooses to first back up the objects, and then the refs. In this case, it could back up a newly changed ref, without having backed up the referenced objects (because they arrived after the backup had finished with the objects). When this happens, your bare repository is inconsistent; you have to somehow hunt down the correct ref for the objects you do have.

      This is a bad failure mode. git could improve this, perhaps, by maintaining a reflog for bare repositories. (Update: core.logAllRefUpdates can be set to true for bare repositories, but is disabled by default.)

  • A "backup" of a git repository can consist of other clones of it. Which do not include .git/hooks/ scripts, .git/config settings, and potentially other valuable information, that strangely, we do not check into revision control despite having this nice revision control system available. This is the most likely failure mode with "git backups". :P

I think that it's important git support naive backups of git repositories as well as possible, because that's probably how most backups of git repositories are made. We don't all have time to carefully tune our backup systems to do something special around our git repositories to ensure we get them in a consistent state like the KDE project did, and as their experience shows, even if we do it, we can easily introduce other, unanticipated problems.

Can anyone else think of any other failure modes like these, or find holes in my slightly rushed analysis?


PS: git-annex is itself entirely crash-safe, to the best of my abilities, and also safe for naive backups. But inherits any problems with naive backups of git repositories.

comment 1

Hi Joey,

I should make it clear that that statement is based on our thinking, and one experience of mine, and if the Git guys say it should be robust about that then I could well be wrong.

We were thinking of bare repositories not just in the process of being pushed to, but also in the process of being garbage collected or repacked. It's possible that all of those cases have been thought of and worked around already, and my statement is not true. But...

At some point I had a git repository on a local machine that suffered a power outage while in the process of fetching. When I booted back up, the repository was corrupt. Yes, it's entirely possible that it could be due to filesystem problems (FWIW, it was a journaling filesystem), but when thinking about the server, it seemed better to not rely on rsync or tar to interact properly with git and for both of those to do the right thing in all possible cases. Especially when Git itself already provides consistency checking. So backing up by making multiple clones (which were not true offline backups as they then were read-only to clients) seemed a straightforward way to go, except, of course, for our implementation flaw.

Comment by Jeff
Update from the Git guys

Got an update on the git list -- your analysis seems to be pretty much correct:

http://marc.info/?l=git&m=136422341014631&w=2

Comment by Jeff