The great thing about git and other distributed version control systems is that once you clone (or fork) a repository, you have all the data. You don't have to trust that Github will preserve it; everyone who develops the project is a backup.

Github carries this principle quite far among the features they provide, but not all the way. Today I surveyed their features and where the data for each is stored.

  • source code -- in git, of course!
  • user and project pages and wiki -- in git
  • gists -- in git
  • issues -- in a database accessible by an API
  • notes on commits -- in a database accessible by an API
  • relationships between repos (who forked what, pull requests) -- in a database accessible by an API
  • your account details and activity -- in a database, accessible by you via an API
  • list of all projects and users -- in a closed database (AFAIK)

The two that really stand out are the issues and notes not being stored in git. This means that, if a project uses github, it gets locked into github to a degree. The records of bugs and features, and all the planning and communication, are locked away in a database where they cannot be cloned, where every developer is not a backup.

Github's intent here is not to control this data to lock you in (to the extent they want to lock you in, they do that by providing a proprietary UI that people rave about); it was probably only expedient to use some sort of database, rather than git, when implementing these features.

They should automatically produce git repository branches containing a project's issues and notes, based on the contents of their database. (For notes, git notes is the obviously right storage location.) Along with ensuring every developer checkout is a backup, this would allow accessing that data while offline, which is one of the reasons we use distributed version control.
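As a rough sketch of what such an export could look like: assume the issues and commit notes have already been pulled out of the database as simple records. The field names (`number`, `title`, `state`, `body`), the one-file-per-issue layout, and the `github-comments` notes namespace are all assumptions made up for illustration, not anything Github provides. The first two functions write issues as plain-text files that could be committed to a branch; the last builds a `git notes` command for one commit note.

```python
from pathlib import Path

# Hypothetical notes namespace; git stores it as refs/notes/github-comments.
NOTES_REF = "github-comments"


def issue_to_text(issue):
    """Render one issue record as plain text that diffs and merges well."""
    lines = [
        f"Title: {issue['title']}",
        f"State: {issue['state']}",
        "",
        issue.get("body") or "",
    ]
    # Drop trailing blank lines (empty body), then end with one newline.
    return "\n".join(lines).rstrip() + "\n"


def export_issues(issues, directory):
    """Write one file per issue into `directory`; return the paths written."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for issue in sorted(issues, key=lambda record: record["number"]):
        path = directory / f"{issue['number']}.txt"
        path.write_text(issue_to_text(issue), encoding="utf-8")
        written.append(path)
    return written


def notes_command(commit, comment):
    """Build the git command that appends a comment to a commit's note."""
    return ["git", "notes", f"--ref={NOTES_REF}", "append", "-m", comment, commit]
```

Running `export_issues(records, "issues")` inside a checkout and committing the result would give every clone a copy of the bug records. The list returned by `notes_command` can be passed to `subprocess.run` from within the repository. Plain files and `git notes append` are used here only because they are the simplest thing that works offline; a real export would also need to carry issue comments, labels, and authorship.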

The lack of a global list of projects is problematic in a more global sense. It means that we can't make a backup of all the (public) repositories in Github (assuming that we had the bandwidth and storage to do it). I recently backed up all the repositories on Berlios.de, when it looked to be shutting down; this was only possible because they allowed enumerating them all.
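To show why enumeration is the missing piece: given a list of clone URLs, the backup itself is nearly mechanical. The sketch below is not the script used for Berlios. It assumes the URLs are already in hand, that the backup lives under a POSIX-style directory, and that the destination layout should follow the URL path. The function name and layout are made up for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def mirror_command(clone_url, backup_root):
    """Return the `git clone --mirror` command for one repository URL."""
    # Keep the owner/project part of the URL path so names cannot collide.
    repo_path = urlparse(clone_url).path.strip("/")
    if not repo_path.endswith(".git"):
        repo_path += ".git"
    destination = PurePosixPath(backup_root) / repo_path
    return ["git", "clone", "--mirror", clone_url, str(destination)]
```

Each returned list can be handed to `subprocess.run(..., check=True)`. A mirror clone copies every branch, tag, and other ref (including any notes), which is what a backup wants. Without a way to list the projects, there is nothing to feed this loop.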

People at The Internet Archive say that their archival coverage of free software is actually quite bad. We trust our version control systems to save our free software data, but while this works individually, it will result in data loss globally over time. I'd encourage Github to help the Internet Archive improve their collections by donating periodic snapshots of their public git repositories to the Archive. You're located in the same city, 5 miles apart; they have lots of hard drives (though fewer right now during the shortage than usual); this should be pretty easy to do.


Full disclosure: Github has bought me dinner and seemed like stand-up guys to me.

Right here.
Jason Scott, Internet Archive, nominally in charge of software preservation. Up and ready to talk about this now or when I'm in town on the 16th of January. Want to connect us?
Comment by Jason