snapshot.debian.org launched, with all Debian packages produced in the past 5 years: 6.5 terabytes.
First Wikipedia database dump in 4.5 years: 5.6 terabytes (32 gigabytes compressed).
My reflex on seeing both of these was to think about putting them into git repositories.
For snapshot.debian.org, injecting source packages into git repositories is
easy with git-import-dsc
(and it can also use pristine-tar to
make the original tarballs accessible with minimal overhead). I hope the
snapshot.debian.org admins find time & space to do that, because being able
to easily access git annotate
data spanning 5 years for any package in
the archive would be very useful.
Producing a usable git repository from the Wikipedia dump would
probably involve writing a custom git-fast-import frontend
that processes the huge
XML dump, chunks it up into individual files, and emits commits changing
those files. Frankly, I do prefer wikis that store data in git in that
format natively and don't have multi-month dump procedures. ;)
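To make the idea concrete, here is a minimal sketch of what such a frontend could look like. It is not a tested importer for the real dump. It assumes a simplified page/revision layout (`<page>` containing a `<title>` followed by `<revision>` elements with `<timestamp>` and `<text>` children), and the function names, the committer identity, and the title-to-filename scheme are all made up for illustration. It streams the XML so the whole dump never has to fit in memory, and writes one commit per revision in the stream format that `git fast-import` reads.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the text of the first direct child called `name`, or ''."""
    for child in elem:
        if local(child.tag) == name:
            return child.text or ""
    return ""


def revisions(xml_source):
    """Yield (title, timestamp, text) for each revision, streaming the XML."""
    title = None
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = local(elem.tag)
        if tag == "title":
            title = elem.text
        elif tag == "revision":
            yield title, child_text(elem, "timestamp"), child_text(elem, "text")
            elem.clear()  # free the revision text once it has been emitted
        elif tag == "page":
            elem.clear()


def data(payload):
    """Format a fast-import 'data' command with an exact byte count."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def fast_import_stream(xml_source, out):
    """Write one fast-import commit per revision to the binary stream `out`."""
    for title, stamp, text in revisions(xml_source):
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        epoch = int(when.replace(tzinfo=timezone.utc).timestamp())
        # Hypothetical filename scheme; real titles would need more careful
        # escaping (quotes, newlines, very long names).
        path = title.replace(" ", "_").replace("/", "%2F")
        out.write(b"commit refs/heads/master\n")
        # Placeholder committer identity; a real importer would probably
        # use the contributor recorded in the dump.
        out.write(b"committer Wiki Import <import@example.invalid> %d +0000\n" % epoch)
        out.write(data(("Edit %s" % title).encode("utf-8")))
        out.write(b"M 100644 inline " + path.encode("utf-8") + b"\n")
        out.write(data(text.encode("utf-8")))
```

Calling `fast_import_stream(sys.stdin.buffer, sys.stdout.buffer)` and piping the result into `git fast-import` inside a fresh repository would be one way to try it. Note that a dump grouped by page yields history ordered page by page rather than chronologically across the whole wiki; interleaving revisions by timestamp would need a sorting pass that this sketch leaves out.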
How big would the git repos be? My SWAG is well under 1 terabyte for snapshot.debian.org, and between 30 and 300 gigabytes for the Wikipedia data.
Hey,
I know it's not going to satisfy you, but I can't help but point out
https://code.launchpad.net/debian/
Which has Bazaar branches of every Debian package with some historical information, including pristine-tar data.
Unfortunately they were created before snapshot.debian.org and while snapshot.debian.net was far from reliable, so we have very granular history at this point, but it will only get better. Plus we may rebase them all to pull in the newly available history at some point.
Thanks,
James
I had actually tried to find that before posting, but Launchpad's UI defeated me. Thanks.
However, I don't see pristine-tar data (and if pristine-tar has been modified to support checking it into bzr, I've not been told about it ...). Does launchpad actually provide a way to get the original tarballs out of the bzr repo?
It's there. pristine-tar doesn't know how to check in to bzr, as the way it does it for git doesn't work for bzr, and without python there's currently no way to do it the way I chose. Therefore bzr-builddeb (like *-buildpackage), which is the code that generated the branches, knows how to do it instead, and just calls pristine-tar gendelta itself.
Launchpad doesn't, but bzr-builddeb does (though only as a side effect of building right now). If you install the plugin, grab any of those branches and run "bzr builddeb" it will materialise the tarball for you from the pristine-tar data. I plan to add a command to just produce the tarball, but it was less pressing.
We're currently working on extending Launchpad to understand some of this, so that it will go from bzr branch -> source package -> binary packages. I don't foresee a button to say "give me that tarball from pristine-tar" at any point though.
Thanks,
James