
Madduck wondered why a package using revision control needed pristine tarballs at all.

Firstly, the example he gave of tagging an upstream release and creating a tarball of it to build the package against is exactly what pristine-tar does. Except it preserves permissions and timestamps and other tar cruft.

Timestamp or permissions problems sometimes do break package builds, so being able to build against the original files from the pristine tarball is potentially useful to catch and fix them.

As to other uses of pristine tarballs...

It's hard to list the advantages to using pristine tarballs, since Debian (rpm too) has had them forever, and lots of people may depend on them in ways that we don't anticipate. For example, I've seen FreeBSD point at ftp.debian.org to get tarballs for their packages, since they don't archive tarballs on their own, and sometimes the tarball isn't available upstream (or ftp.debian.org is the upstream). That's a wacky example, but I've seen it more often than I've seen pristine tarballs used for the often-cited md5sum comparison case.

Rather than try to enumerate all the possible reasons for pristine tarballs, and then try to whittle the list down to reasons that actually make sense, and cost-benefit analyse those, it seemed better to just try to write a solution to the problem. (Or at least, to most of it -- a hopefully small subset of the set of reasons for pristine tarballs is not satisfied by pristine-tar.)

I've also been theoretically interested in this problem of generating pristine tarballs from version controlled source ever since Scott Remnant mentioned at a meeting (140 mb video of meeting here) that he had a tool to do it -- and AFAIK never shared the tool! So the challenge factor was another reason to write it.

However, pristine-tar has already saved me hundreds of megabytes of disk space, sped up my release process (avoiding Debian bug #225483) and simplified the way I work, so it's already a net win for me, even if it turns out we don't need pristine tarballs for anything at all!

Black bean soup and garlic shrimp home cookin'.

Thanks to Faidon Liambotis, pristine-tar has been extended to support recreating pristine gz files, as well as tar files. It's been tested against the entire corpus of .tar.gz files in the debian archive, and succeeds on 98.7% of them.

We're using the whole archive as a test suite for pristine-tar, which is such a nice change -- excellent test coverage, and no need to write tests. :-) I've put up all the deltas it generated during its most recent run at http://hydra.kitenet.net/~joey/tmp/tar/. The deltas for all the tarballs in the archive use only 175 mb of space.

The next step, once it gets out of incoming, will be for tools that inject sources into revision control to get support for generating pristine-tar deltas and checking the deltas in too. And for tools that build packages from revision control to get support for using the deltas to reproduce the original tarball. If you maintain such a tool, I'm happy to help you do that.

how pristine-gz works

Recreating gz files is tricky, since you generally can't binary diff them as pristine-tar does with tar files. Instead, gzip has to be fed exactly the same conditions that applied when the original gz file was created. This includes time stamps, filenames, file content, and compression level. These are figured out by looking at the header of the gz file.
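
To make the "looking at the header" part concrete, here is a minimal sketch in Python of reading those fields out of a gz file. It is not pristine-gz's own code; the function name read_gz_header is made up for illustration, and it only handles the fixed header, the optional extra field and the optional original filename as laid out in RFC 1952.

```python
import gzip
import io
import struct


def read_gz_header(data: bytes) -> dict:
    """Parse the fixed 10-byte gzip header and the optional stored filename."""
    # Layout (RFC 1952): magic(2) method(1) flags(1) mtime(4) xfl(1) os(1),
    # all little-endian.
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<HBBIBB", data[:10])
    if magic != 0x8B1F:
        raise ValueError("not a gzip file")
    pos = 10
    if flags & 0x04:  # FEXTRA: skip the extra field if present
        (xlen,) = struct.unpack("<H", data[pos:pos + 2])
        pos += 2 + xlen
    name = None
    if flags & 0x08:  # FNAME: NUL-terminated original filename
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("latin-1")
    return {
        "method": method,  # 8 means deflate
        "flags": flags,
        "mtime": mtime,    # timestamp stored by the compressor
        "xfl": xfl,        # hint: 2 = best compression, 4 = fastest
        "os": os_id,       # which kind of system wrote the file
        "name": name,
    }
```

The xfl and os bytes are only hints about how the file was made, which is why the header alone is not always enough to reproduce it.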

But that's only the easy part, because there's a lot of variation in gz creation programs. The Debian archive contains gz files produced on BSD systems by a libz based compressor, others built on MS-DOS, Windows NT, and many other strange and often buggy things. All of these differences result in different gz files.

To deal with this, pristine-gz uses a gz creator that can reproduce any of the known variants on demand, and just tries different variants until it finds a match.
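
The try-until-it-matches idea can be sketched in a few lines of Python. This is only an illustration under a big assumption: it uses Python's gzip module as the "gz creator" and varies nothing but the compression level, whereas the real tool has to cover many more variants. The names gz_bytes and find_variant are hypothetical.

```python
import gzip
import io


def gz_bytes(payload: bytes, name: str, mtime: int, level: int) -> bytes:
    """Compress payload with a fixed filename, timestamp and compression level."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf,
                       mtime=mtime, compresslevel=level) as f:
        f.write(payload)
    return buf.getvalue()


def find_variant(original_gz: bytes, payload: bytes, name: str, mtime: int):
    """Try each candidate setting; return the first that reproduces the file."""
    for level in range(1, 10):
        if gz_bytes(payload, name, mtime, level) == original_gz:
            return {"level": level}
    return None  # no known variant matches
```

Only the parameters that produced the match need to be stored; the gz file itself can then be regenerated from the uncompressed content.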

There are still 132 files that it fails to reproduce, so if this sounds interesting to you, you can try to figure out how they were created and add support for them.

It would be possible to make pristine-gz succeed even for gz files it can't reproduce. It could just generate xdeltas to the files it tries, and store the smallest one. I am undecided if that would be a good idea, since the delta wouldn't be very small.

Back around 1999, I was really interested in getting all of Debian imported into CVS (ugh!) so we could have all the benefits of pervasive version control. I've always been sad it didn't happen, especially since Ubuntu did it. Although they seem to get fewer benefits from it than I would have thought at the time, go figure.

I actually feel though that the model Debian has developed with alioth and now with VCS- fields, in which packages use version control, and Debian integrates support for it without mandating it, or mandating which version control system is used, has better legs than Ubuntu's model. Ubuntu's model potentially leaves you where FreeBSD is now, nursing a CVS equivalent along as the world has moved on.

Except, well, Debian actually picked a different version control system, and has been stuck with it for years. I refer to the diff, which is great for passing patches around by email, but not so great as a mechanism for recording the history of changes needed to debianise a package. So we have this whole set of things piled on top of the diff, like the loathsome dbs. We also have this whole set of problems in the source package format that cannot be expressed by diff and have to be messily worked around, like not being able to add/modify binary files, and not being able to (re)move files.

(Wig and Pen addresses some of the worst limitations of the current format, but AFAIK no one is working on implementing support for generating Wig and Pen format packages, since doing so is unavoidably complicated.)

Once I started looking at the diff in .diff.gz as a version control system, the natural thing was to think about adding support to other revision control systems, following down our path of debian supporting maintainers who choose to use a different one. So I arrived at the idea of a .git.tar.gz.

Let a Debian source package consist of just a .dsc and package_version.git.tar.gz, which contains only package/.git/*. Making changes to the source becomes very pleasant, since you can commit any change you like and not have to worry about how it will be represented in the .diff.gz. And of course there's all kinds of benefits of having the history and branches and tagged upstream source available in there. Far too many benefits to list here, and only a few downsides.
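
As a rough picture of that layout, here is a small Python sketch that packs only a checkout's .git directory into a tarball. It is not how dpkg-source does it; the function name pack_git_dir and the paths in the usage line are assumptions made for illustration.

```python
import os
import tarfile


def pack_git_dir(package_dir: str, out_path: str) -> str:
    """Write a .tar.gz that holds only <package>/.git and its contents."""
    package = os.path.basename(os.path.normpath(package_dir))
    with tarfile.open(out_path, "w:gz") as tar:
        # The working tree files are left out on purpose; only .git is stored.
        tar.add(os.path.join(package_dir, ".git"), arcname=f"{package}/.git")
    return out_path
```

Something like pack_git_dir("dpkg", "dpkg_1.0.git.tar.gz") would then give a tarball from which the working tree can be checked out again (the version number here is a placeholder).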

I look at this as very much of an evolutionary change, not a revolutionary change, to the debian source format. Most packages will continue to use .diff.gz for a long while, but packages whose maintainers chafe under that format will have another one to choose from.

So, well, I implemented it. A dpkg-source that understands this format is available in the sourcev3 branch at git://kitenet.net/dpkg. I'll be posting some technical details and the patches to debian-dpkg, and I've put a FAQ about it in the wiki.

A sample dpkg source package built using this is temporarily here. This demo package includes only the last 200 commits to the dpkg git repo, so it's more than 1 mb smaller than dpkg's normal .tar.gz!


Mr. Who? No..

mr is a Multiple Repository management tool. With lots of people using svn with no real reason to switch, and some people using git because it's the cool thing of the day, I need a way to be able to checkout and update from multiple repositories, in multiple revision control systems. There are probably many scripts already written to do this; mr(1) is my attempt, and since it's very configurable and has a very short name, perhaps it will be useful to others in this situation.

An example is probably worth 1347 words (according to wc -w mr). One thing not shown here is that it can look for .mrconfig files not just in the home directory, but inside the repositories it checks out.

joey@kodama:~/src> mr update
mr update: in /home/joey/src/dpkg
Already up-to-date.
Already up-to-date.

mr update: in /home/joey/src/linux-2.6
Already up-to-date.

mr update: in /home/joey/src/mr
Already up-to-date.

mr update: finished (3 sucessful; 1 skipped)

Here's my current ~/.mrconfig file.

[src/mr]
checkout = git clone ssh://git.joeyh.name/srv/git.joeyh.name/mr

[src/linux-2.6]
checkout = git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
# I only check this out on kodama, otherwise skip it.
skip = test $(hostname) != kodama

[src/dpkg]
# A merge of the upstream dpkg git repo and my own personal branch.
checkout = git clone git://git.debian.org/git/dpkg/dpkg.git && \
        cd dpkg && \
        git remote add kite ssh://kitenet.net/srv/git.joeyh.name/dpkg && \
        git fetch kite && \
        git checkout -b sourcev3 kite/sourcev3
update = git pull origin master && git pull kite sourcev3
commit = git push kite

# My home directory, which I keep in svn.
[]
checkout = svn co svn+ssh://svn.kitenet.net/srv/svn/joey/trunk/home-$(hostname) joey
# run svnfix after each update
update = svn update && svnfix

PS, what should "mr clean" do?

I've hacked a lot of handy features into mr. Some more examples follow. I've omitted noisy output from the various revision control systems.

Registering existing repos:

joey@kodama:~/tmp/scratch> git clone git://git.joeyh.name/mr
joey@kodama:~/tmp/scratch> svn co svn://svn.kitenet.net/joey/trunk/src/packages/alien
joey@kodama:~/tmp/scratch> mr register mr
Registering git url: git://git.joeyh.name/mr
joey@kodama:~/tmp/scratch> mr register alien
Registering svn url: svn://svn.kitenet.net/joey/trunk/src/packages/alien
joey@kodama:~/tmp/scratch> mr up
mr update: /home/joey/tmp/scratch/alien
At revision 13492.

mr update: /home/joey/tmp/scratch/mr
Already up-to-date.

mr update: finished (2 successful)

Command-line modification of the ~/.mrconfig file, and the ability to skip a repo based on an arbitrary shell test:

joey@kodama:~/tmp/scratch> mr config tmp/scratch/alien skip='[ $(hostname) != foo ]'
joey@kodama:~/tmp/scratch> mr up
mr update: /home/joey/tmp/scratch/mr
Already up-to-date.

mr update: finished (1 successful; 1 skipped)
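
The skip test is just a shell command whose exit status decides whether the repo is skipped. A minimal Python sketch of that check, assuming (as the examples above suggest) that a true test means "skip"; should_skip is a made-up name and this is not mr's own code:

```python
import subprocess


def should_skip(skip_test: str) -> bool:
    """Run the shell test; exit status 0 (true) means the repo is skipped."""
    return subprocess.run(["sh", "-c", skip_test]).returncode == 0
```

Note that the closing ] of a shell test must be its own word, so '[ $(hostname) != foo ]' needs the space before the bracket.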

Tracking and warning about deleted repositories. Useful if, like me, you have dozens of repos checked out on dozens of machines:

joey@kodama:~/tmp/scratch> mr config tmp/scratch/mr deleted=true 
joey@kodama:~/tmp/scratch> mr up
mr error: /home/joey/tmp/scratch/mr/ should be deleted yet still exists

mr update: finished (1 failed; 1 skipped)
joey@kodama:~/tmp/scratch> rm -rf mr
joey@kodama:~/tmp/scratch> mr up
mr update: finished (1 skipped)

(Tracking of repo moving is still TODO.)

Running mr from inside a repo operates on only that repo. Useful when you can't be bothered to remember what revision control system a particular repo uses. ;-)

joey@kodama:~/src/mr/debian> mr commit -m "bugfixes developed while writing blog post"
mr commit: /home/joey/src/mr (in subdir /home/joey/src/mr/debian)
Created commit 2f861a4: bugfixes developed while writing blog post

It also supports cvs and bzr now.

The good:

I had fun building an enclosure for my Atari <-> PC serial adapter board out of legos, and I love the result. Been too long since I played with legos. I need to dig up the rest of my legos so I can make some enclosures for my ARM boards.

The bad:

I picked up a 1 gigabyte USB drive in the shape of an off-brand lego block today. The price was right (< $10), but the implementation is depressing -- the fake legos don't stick well and so the cap can't be mounted on the drive body when it's in use. And the size isn't lego standard, so it can't attach to other blocks.

The ugly:

In a better world, Lego would be an open standard. The patent expired in 1988 (according to Wikipedia). But the company still has lawyers, who seem busy doing what (corporate) lawyers do.

Meh. Legos were more fun to play with when playing with them didn't involve worrying about patents and copyrights and the market's propensity to deter standardisation.


Sometimes I'm told that a program has a bug. I trust the bug submitter, they are seeing a bug. But I can't reproduce the bug. I can't figure out what the bug could be by inspecting the code. I'm stuck. Much time is wasted.

Eventually, the bug submitter does a lot of work to help me reproduce the bug. I see the bug happen. Very shortly afterwards, I notice some piece of information that I hadn't paid attention to before, and I suddenly understand the bug, and can now quickly fix it.

It's not that reproducing the bug has helped me figure it out, it's just that some corner of my brain refused to work on the bug until I saw it happen.

"To make a thief, make an owner; to create crime, create laws."
(Le Guin, The Dispossessed)

It's sad that the amazing author of that quote has gotten involved in a kerfuffle over copyright issues. Copyright seems to be where even the best SF authors stop being able to make credible guesses about the future, or even acknowledge the reality of the present.

Disappointing that Le Guin has been cheapened by this nonsensical, transitory concept of owning one's words.

Aj's mr update time is 50 seconds. Mine is 18 minutes. I must have too many repositories (78). What's your mr update time?

Update: Mine is down to 43 seconds, with the new mr -j running at -j 10. Fantabulous!
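
Most of an update run is spent waiting on the network, which is why running several updates at once helps so much. Here is a rough Python sketch of the idea behind -j, not mr's actual implementation; update_all and the command dictionary are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def update_all(commands: dict, jobs: int = 10) -> dict:
    """Run each repo's update command, with at most `jobs` running at once.

    `commands` maps a repo name to the shell command that updates it.
    Returns a dict of repo name -> exit status.
    """
    def run(item):
        repo, cmd = item
        result = subprocess.run(["sh", "-c", cmd], capture_output=True)
        return repo, result.returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(pool.map(run, commands.items()))
```

With 78 repositories and 10 jobs, the total time is bounded by the slowest chains of updates rather than by the sum of all of them.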

I've been moving some things to git. My blog is the first thing that really benefitted from distributed revision control though, since I can use ikiwiki with git to have a mirror/branch of the blog on my laptop. That worked out quite nicely. I've described the setup here.

With a full moon behind the blue clouds. Gorgeous tonight! And windy, and brr..

Weird popcorn experience. I screwed up the first pan, burnt and mispopped everything. So I cleaned it out and put it back on the stove, still quite wet, tossed in a small amount of (olive) oil and popcorn (no test kernels, just all at once), and tried again. This time the corn took quite a while to start to pop, I could hear the water/oil mixture sizzling and splattering the inside of the pan. When it started to pop, it was all over in under 5 seconds, no trail off of slow to pop kernels like I typically get. And only 2 kernels didn't pop. Perfect.

Something went on with that water/oil mixture. My guess is that the water regulated the temperature, as long as the water was in there everything was kept right around the boiling point. When the last water evaporated, the temperature must have shot up to popping point very uniformly and quickly.

I like to record NASA TV and watch it in fast forward. Generally the best bits are long exterior views, watching solar panels turn to catch the light as the earth rotates below, but I also sometimes see amusing things.

This is Clay Anderson on the ISS. He spent quite a while floating around in this makeshift cape today. Of course even vampire astronauts have the ever-present clipboard which just makes him look more scary.

Posts in Oct 2007:
1: pristine-tar followup
3: mmmmmmmmm
4: pristine-tar now supporting .tar.gz
5: an evolutionary change to the Debian source package format
11: introducing mr
12: more mr fun
13: lego
14: I hate it when this happens; SF authors: sheesh
19: mr update time
21: git transitions
25: mackerel sky
29: perfect popcorn
31: from NASA TV