Thanks to Faidon Liambotis, pristine-tar has been extended to support recreating pristine gz files, as well as tar files. It's been tested against the entire corpus of .tar.gz files in the Debian archive, and succeeds on 98.7% of them.

We're using the whole archive as a test suite for pristine-tar, which is such a nice change -- excellent test coverage, and no need to write tests. :-) I've put up all the deltas it generated during its most recent run at http://hydra.kitenet.net/~joey/tmp/tar/. The deltas for all the tarballs in the archive use only 175 MB of space.

The next step, once it gets out of incoming, will be for tools that inject sources into revision control to get support for generating pristine-tar deltas and checking the deltas in too, and for tools that build packages from revision control to get support for using the deltas to reproduce the original tarball. If you maintain such a tool, I'm happy to help you do that.

how pristine-gz works

Recreating gz files is tricky, since you generally can't binary diff them as pristine-tar does with tar files. Instead, gzip has to be fed exactly the same conditions that applied when the original gz file was created. This includes time stamps, filenames, file content, and compression level. Apart from the content itself, these are figured out by looking at the header of the gz file.
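To make the header part concrete, here is a minimal Python sketch of reading those fields, following the gzip file format as laid out in RFC 1952. It is not pristine-gz's own code; the function name read_gz_header and the dictionary it returns are made up for illustration.

```python
import gzip
import io
import struct

# Flag bits from the gzip file format (RFC 1952).
FEXTRA, FNAME = 0x04, 0x08


def read_gz_header(data: bytes) -> dict:
    """Return the header fields that describe how a .gz file was built.

    Hypothetical helper for illustration only.
    """
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    # Fixed header after the 2-byte magic:
    # method (1), flags (1), mtime (4, little-endian), xfl (1), os (1).
    method, flags, mtime, xfl, os_code = struct.unpack("<BBIBB", data[2:10])
    pos = 10
    if flags & FEXTRA:
        # Optional extra field: 2-byte little-endian length, then payload.
        (xlen,) = struct.unpack("<H", data[pos:pos + 2])
        pos += 2 + xlen
    name = None
    if flags & FNAME:
        # Original filename, NUL-terminated, Latin-1 encoded.
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("latin-1")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_code, "name": name}


if __name__ == "__main__":
    # Build a small sample in memory and show what its header says.
    buf = io.BytesIO()
    with gzip.GzipFile(filename="hello.txt", mode="wb",
                       fileobj=buf, mtime=12345) as f:
        f.write(b"hello world\n")
    print(read_gz_header(buf.getvalue()))
```

In that format, xfl is only a hint at the compression level (2 for maximum compression, 4 for fastest), and os identifies the system the file claims to come from (3 is Unix, 0 is FAT).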

But that's only the easy part, because there's a lot of variation in gz creation programs. The Debian archive contains gz files produced on BSD systems by a libz-based compressor, others built on MS-DOS, Windows NT, and many other strange and often buggy things. All of these differences result in different gz files.

To deal with this, pristine-gz uses a gz creator that can reproduce any of the known variants on demand, and just tries different variants until it finds a match.
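The try-until-it-matches loop can be sketched in a few lines of Python. This is only a rough illustration: the sole "variants" here are the compression levels that Python's own gzip module can produce, whereas the real gz creator has to cover the BSD, MS-DOS and other oddities too. The names rebuild and find_level are hypothetical, and the filename and time stamp are assumed to have been read from the original's header already.

```python
import gzip
import io
from typing import Optional


def rebuild(payload: bytes, name: str, mtime: int, level: int) -> bytes:
    """Compress payload with the given header fields and compression level."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf,
                       compresslevel=level, mtime=mtime) as f:
        f.write(payload)
    return buf.getvalue()


def find_level(original: bytes, name: str, mtime: int) -> Optional[int]:
    """Return a level that reproduces original byte for byte, or None."""
    payload = gzip.decompress(original)
    for level in range(1, 10):
        # Each level stands in for one "variant" of the gz creator.
        if rebuild(payload, name, mtime, level) == original:
            return level
    return None
```

Several levels can produce identical output for small inputs, so the sketch returns the first one that matches. A None result corresponds to the case where none of the known variants reproduces the file.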

There are still 132 files that it fails to reproduce, so if this sounds interesting to you, you can try to figure out how they were created and add support for them.

It would be possible to make pristine-gz succeed even for gz files it can't reproduce. It could just generate xdeltas to the files it tries, and store the smallest one. I am undecided if that would be a good idea, since the delta wouldn't be very small.