snapshot.debian.org launched
with all Debian packages produced in the past 5 years
6.5 terabytes

first Wikipedia database dump in 4.5 years
5.6 terabytes (32 gigabytes compressed)

My reflex on seeing both of these was to think about putting them into git repositories.

For snapshot.debian.org, injecting source packages into git repositories is easy with git-import-dsc (and it can also use pristine-tar to make the original tarballs accessible with a minimal overhead). I hope the snapshot.debian.org admins find time & space to do that, because being able to easily access git annotate data spanning 5 years for any package in the archive would be very useful.
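
A rough sketch of what that import loop might look like for one package, assuming the .dsc files are already downloaded and listed oldest first, and that the target directory is already a git repository. The function name import_history and its run parameter are made up for illustration; the only thing taken from git-import-dsc itself is its --pristine-tar option.

```python
import subprocess
from pathlib import Path


def import_history(dsc_paths, repo_dir, run=subprocess.run):
    """Import each .dsc into repo_dir, in the order given (oldest first).

    Hypothetical helper: `run` is injectable so the loop can be dry-run
    without git-import-dsc installed.
    """
    commands = []
    for dsc in dsc_paths:
        # Use an absolute path, because the command runs with repo_dir as cwd.
        argv = ["git-import-dsc", "--pristine-tar", str(Path(dsc).resolve())]
        run(argv, cwd=str(repo_dir), check=True)
        commands.append(argv)
    return commands
```

Sorting the versions correctly (Debian version ordering is not plain string ordering) is left out here and would need to be handled before calling this.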

Producing a usable git repository from the Wikipedia dump would probably involve writing a custom git-fast-import frontend that processes the huge XML dump and chunks it up into individual files, plus commits changing those files. Frankly, I do prefer wikis that store data in git in that format natively and don't have multi-month dump procedures. ;)
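
To make that a bit more concrete, here is a minimal sketch of such a frontend. It assumes a simplified MediaWiki-style export (page, title, revision, timestamp, contributor, comment, text elements) and emits one commit per revision in whatever order the dump lists them. The function names, the one-file-per-page layout with a .wiki suffix, and the placeholder committer address are all assumptions for illustration, not anything the real dump tooling prescribes.

```python
import calendar
import time
import xml.etree.ElementTree as ET


def local(tag):
    """Drop an XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def find_text(elem, name, default=""):
    """Text of the first descendant called `name`, ignoring namespaces."""
    for node in elem.iter():
        if local(node.tag) == name:
            return node.text or default
    return default


def data_block(text):
    """A fast-import 'data' command; the count is in bytes, not characters."""
    raw = text.encode("utf-8")
    return b"data %d\n" % len(raw) + raw + b"\n"


def dump_to_fast_import(xml_file, out, ref="refs/heads/master"):
    """Write one commit per <revision>, in the order the dump lists them."""
    title = "untitled"
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        kind = local(elem.tag)
        if kind == "title":
            title = elem.text or "untitled"
        elif kind == "revision":
            stamp = find_text(elem, "timestamp")
            when = calendar.timegm(time.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ"))
            who = find_text(elem, "username") or find_text(elem, "ip") or "unknown"
            # Committer names must not contain '<', '>' or newlines.
            who = who.translate({ord(c): None for c in "<>\n"})
            # Assumed layout: one file per page, named after the page title.
            path = (title.replace(" ", "_").replace("/", "%2F")
                    .replace('"', "%22") + ".wiki")
            out.write(b"commit " + ref.encode("ascii") + b"\n")
            # Placeholder email address; the dump carries no real one.
            out.write(("committer %s <unknown@example.invalid> %d +0000\n"
                       % (who, when)).encode("utf-8"))
            out.write(data_block(find_text(elem, "comment")))
            out.write(b"M 100644 inline " + path.encode("utf-8") + b"\n")
            out.write(data_block(find_text(elem, "text")))
            out.write(b"\n")
            elem.clear()  # free the revision text once it has been written
        elif kind == "page":
            elem.clear()
```

The resulting stream would be piped into git fast-import inside an empty repository. A real importer would also have to decide how to order revisions across pages, since this sketch just follows the dump order.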

How big would the git repos be? My SWAG is well under 1 terabyte for snapshot.debian.org, and between 30 and 300 gigabytes for the Wikipedia data.

Prior art

Hey,

I know it's not going to satisfy you, but I can't help but point out

https://code.launchpad.net/debian/

Which has Bazaar branches of every Debian package with some historical information, including pristine-tar data.

Unfortunately they were created before snapshot.debian.org and while snapshot.debian.net was far from reliable, so we have very granular history at this point, but it will only get better. Plus we may rebase them all to pull in the newly available history at some point.

Thanks,

James

Comment by Y8Fb8C7 [login.launchpad.net/+id]
comment 3

I had actually tried to find that before posting, but Launchpad's UI defeated me. Thanks.

However, I don't see pristine-tar data (and if pristine-tar has been modified to support checking it into bzr, I've not been told about it ...). Does Launchpad actually provide a way to get the original tarballs out of the bzr repo?

Comment by joey [kitenet.net]
pristine-tar and bzr

> However, I don't see pristine-tar data (and if pristine-tar has been modified to support checking it into bzr, I've not been told about it ...).

It's there. pristine-tar doesn't know how to check in to bzr, as the way it does it for git doesn't work for bzr, and without Python there's currently no way to do it the way I chose. Therefore bzr-builddeb (like *-buildpackage), which is the code that generated the branches, knows how to do it instead, and just calls pristine-tar gendelta itself.

> Does Launchpad actually provide a way to get the original tarballs out of the bzr repo?

Launchpad doesn't, but bzr-builddeb does (though only as a side effect of building right now). If you install the plugin, grab any of those branches and run "bzr builddeb" it will materialise the tarball for you from the pristine-tar data. I plan to add a command to just produce the tarball, but it was less pressing.

We're currently working on extending Launchpad to understand some of this, so that it will go from bzr branch -> source package -> binary packages. I don't foresee a button to say "give me that tarball from pristine-tar" at any point though.

Thanks,

James

Comment by Y8Fb8C7 [login.launchpad.net/+id]