snapshot.debian.org is now available, with all Debian packages produced in the past 5 years.
Meanwhile, the first Wikipedia database dump in 4.5 years has been released:
5.6 terabytes of data (32 gigabytes compressed).
My reflex on seeing both of these was to think about putting them into git repositories.
For snapshot.debian.org, injecting source packages into git repositories is
a solved problem: git-import-dsc handles it (and can also use pristine-tar to
make the original tarballs accessible with minimal overhead). I hope the
snapshot.debian.org admins find time & space to do that, because being able
to easily access
git annotate data spanning 5 years for any package in
the archive would be very useful.
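The import loop for one package is straightforward if its snapshots are available as a pile of .dsc files. A minimal sketch, assuming hypothetical filenames, that sorting by filename approximates chronological order (proper ordering needs Debian version comparison rules), and the modern spelling of git-import-dsc, `gbp import-dsc` from git-buildpackage:

```python
# Sketch: build the import commands for one package's snapshot history.
# Filenames are hypothetical; sorting by filename is a simplification --
# correct ordering really needs dpkg-style Debian version comparison.

def import_commands(dsc_files):
    """Return one `gbp import-dsc` invocation per snapshot, oldest first.

    --pristine-tar stores deltas that let the exact original upstream
    tarballs be regenerated later from the git repository.
    """
    return [["gbp", "import-dsc", "--pristine-tar", dsc]
            for dsc in sorted(dsc_files)]

if __name__ == "__main__":
    for cmd in import_commands(["hello_2.4-1.dsc", "hello_2.2-2.dsc"]):
        print(" ".join(cmd))
```

Running the commands in sequence inside one repository stacks each snapshot as a commit on top of the previous one.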
To produce a usable git repository from the Wikipedia dump would
probably involve writing a custom
git-fast-import frontend that processed the huge
xml dump, chunked it up into individual files, and emitted commits changing
those files. Frankly, I do prefer wikis that store data in git in that
format natively and don't have multi-month dump procedures. ;)
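Such a frontend mostly amounts to translating each page revision in the XML into a blob plus a commit in git's fast-import stream format. A rough sketch, with a made-up miniature dump, made-up committer identity and file naming, and a fixed commit date where a real frontend would parse the revision timestamp:

```python
# Sketch of a git fast-import frontend for a MediaWiki XML dump.
# The sample dump, committer identity, and file naming are invented for
# illustration; a real frontend would stream the multi-terabyte dump
# with a SAX parser instead of loading it all into memory.
import xml.etree.ElementTree as ET

SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Example page</title>
    <revision><timestamp>2010-01-01T00:00:00Z</timestamp>
      <text>first draft</text></revision>
    <revision><timestamp>2010-01-02T00:00:00Z</timestamp>
      <text>second draft</text></revision>
  </page>
</mediawiki>"""

def fast_import_stream(dump_xml):
    """Build a git fast-import stream: one commit per page revision."""
    out = []
    mark = 0
    for page in ET.fromstring(dump_xml).iter("page"):
        title = page.findtext("title")
        fname = title.replace(" ", "_") + ".mediawiki"
        for rev in page.iter("revision"):
            text = (rev.findtext("text") or "").encode()
            ts = rev.findtext("timestamp")
            mark += 1
            # Blob holding this revision's page text.
            out.append(f"blob\nmark :{mark}\ndata {len(text)}\n")
            out.append(text.decode() + "\n")
            # Commit updating the page's file to that blob.
            msg = f"update {title} ({ts})".encode()
            out.append("commit refs/heads/master\n")
            # Fixed date for simplicity; a real importer would convert ts.
            out.append("committer Importer <importer@example.org> 0 +0000\n")
            out.append(f"data {len(msg)}\n{msg.decode()}\n")
            out.append(f"M 100644 :{mark} {fname}\n\n")
    return "".join(out)

if __name__ == "__main__":
    print(fast_import_stream(SAMPLE_DUMP))
```

Piping the resulting stream into `git fast-import` inside an empty repository yields one commit per revision, after which `git annotate` works on the page files as on any other repo.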
How big would the git repos be? My SWAG is well under 1 terabyte for snapshot.debian.org, and between 30 and 300 gigabytes for the Wikipedia data.