05

Debian wheezy includes a bunch of excellent new Haskell libraries. I'm going to highlight one that should be interesting to non-Haskell developers, who may have struggled with writing non-buggy threaded programs in other languages: libghc-stm-dev

I had given up on most threaded programs before learning about Software Transactional Memory. Writing a correct threaded program, when multiple threads needed to modify the same state, needed careful uses of locking. In my experience, locking is almost never gotten right the first time.

A real life example I encountered is an app that displays a queue of files to be downloaded, and a list of files currently downloading. Starting a new download would go something like this:

startDownload = do
    file <- getQueuedFile
    startDownloadThread file
    push file currentDownLoads

But there's a point in time in which another thread, that refreshes the display, could then see an inconsistent state, where the file is in neither place. To fix this, you'd need to add lock checking around all accesses to the download queue and current downloads list, and lock them both here. (And be sure to always take the locks in the same order!)

But, it's worse than that, because how is getQueuedFile implemented? If the queue is empty, it needs to wait on a file being added. But how can a file be added the queue if we've locked it in order to perform this larger startDownload operation? What should be really simple code has become really complex juggling of locks.

STM deals with this in a much nicer way:

startDownload = atomically $ do
    file <- getQueuedFile
    startDownloadThread file
    push file currentDownLoads

Now the two operations are performed as one atomic transaction. It's not possible for any other thread to see an inconsistent state. No explicit locking is needed.

And, getQueuedFile can do whatever waiting it needs to, also using STM. This becomes part of the same larger transaction, in a way that cannot deadlock. It might be implemented like this:

getQueuedFile = atomically $
    size <- getSize downLoadQueue
    if size == 0
        then retry
        else pop downloadQueue

When the queue is empty and this calls "retry", STM automatically waits for the queue to change before restarting the transaction. So this blocks until a file becomes available. It does it without any locking, and without you needing to explicitly tell STM what you're waiting on.

I find this beautiful, and am happier with it the more I use it in my code. Functions like getQueuedFile that run entirely in STM are building blocks that can be snapped together without worries to build more and more complex things.

For non-Haskell developers, STM is also available in Clojure, and work is underway to add it to gcc. There is also Hardware Transactional Memory coming, to speed it up. Although in my experience it's quite acceptably fast already.

However, as far as I know, all these other implementations of STM leave developers with a problem nearly as thorny as the original problem with locking. STM inherently works by detecting when a change is made that conflicts with another transaction, throwing away the change, and retrying. This means that code inside a STM transaction may run more than once.

Wait a second.. Doesn't that mean this code has a problem?

startDownload = atomically $ do
    file <- getQueuedFile
    startDownloadThread file
    push file currentDownLoads

Yes, this code is buggy! If the download thread is started, but then STM restarts the transaction, the same file will be downloaded repeatedly.

The C, Clojure, etc, STM implementations all let you write this buggy code.

Haskell, however, does not. The buggy code I showed won't even compile. The way it prevents this involves, well, monads. But essentially, it is able to use type checking to automatically determine that startDownloadThread is not safe to put in the middle of a STM transaction. You're left with no choice but to change things so the thread is only spawned once the transaction succeeds:

startDownload = do
    file <- atomically $ do
        f <- getQueuedFile
        push f currentDownLoads
        return f
    startDownloadThread file

If you appreciate that, you may want to check out some other #newinwheezy stuff like libghc-yesod-dev, a web framework that uses type checking to avoid broken urls and XSS attacks, and also makes heavy use of threading, so is a great fit for using with STM. And libghc-quickcheck2-dev, which leverages the type system to automatically test properties about your program.

With wheezy released, the floodgates are opened on a lot of debhelper changes that have been piling up. Most of these should be pretty minor, but I released one yesterday that will affect all users of dh. Hopefully in a good way.

I made dh smarter about selecting which debhelper commands it runs. It can tell when a package does not use the stuff done by a particular command, and skips running the command entirely.

So the debian/rules binary of a package using dh will now often look like this:

dh binary
   dh_testroot
   dh_prep
   dh_auto_install
   dh_installdocs
   dh_installchangelogs
   dh_perl
   dh_link
   dh_compress
   dh_fixperms
   dh_installdeb
   dh_gencontrol
   dh_md5sums
   dh_builddeb

Which is pretty close to the optimal hand-crafted debian/rules file (and just about as fast, too). But with the benefit that if you later add, say, cron job files, dh_installcron will automatically start being run too.

Hopefully this will not result in any behavior changes, other than packages building faster and with less noise. If there is a bug it'll probably be something missing in the specification of when a command needs to be run.

Beyond speed, I hope that this will help to lower the bar to adding new commands to debhelper, and to the default dh sequences. Before, every such new command slowed things down and was annoying. Now more special-purpose commands won't get in the way of packages that don't need them.

The way this works is that debhelper commands can include a "PROMISE" directive. An example from dh_installexamples

# PROMISE: DH NOOP WITHOUT examples

Mostly this specifies the files in debian/ that are used by the command, and whose presence triggers the command to run. There is also a syntax to specify items that can be present in the package build directory to trigger the command to run.

(Unfortunatly, dh_perl can't use this. There's no good way to specify when dh_perl needs to run, short of doing nearly as much work as dh_perl would do when run. Oh well.)

Note that third-party dh_ commands can include these directives too, if that makes sense.

I'm happy how this turned out, but I could be happier about the implementation. The PROMISE directives need to be maintained along with the code of the command. If another config file is added, they obviously must be updated. Other changes to a command can invalidate the PROMISE directive, and cause unexpected bugs.

What would be ideal is to not repeat the inputs of the command in these directives, but instead write the command such that its inputs can be automatically extracted. I played around with some code like this:

$behavior = main_behavior("docs tmp(usr/share/doc/)", sub {
       my $package=shift;
       my $docs=shift;
       my $docdir=shift;

       install($docs, $docdir);
});
$behavior->($package);

But refactoring all debhelper commands to be written in this style would be a big job. And I was not happy enough with the flexability and expressiveness of this to continue with it.

I can however, dream about what this would look like if debhelper were written in Haskell. Then I would have a Debhelper a monad, within which each command executes.

main = runDebhelperIO installDocs

installDocs :: Monad a => Debhelper a
installDocs = do
    docs <- configFile "docs"
    docdir <- tmpDir "usr/share/doc"
    lift $ install docs docdir

To run the command, runDebhelperIO would loop over all the packages and run the action, in the Debhelper IO monad.

But, this also allows making an examineDebhelper that takes an action like installDocs, and runs it in a Debhelper Writer monad. That would accumulate a list of all the inputs used by the action, and return it, without performing any side effecting IO actions.

It's been 15 years since I last changed the language debhelper was written in. I did that for less gains than this, really. (The issue back then was that shell getopt sucked.) IIRC it was not very hard, and only took a few days. Still, I don't really anticipate reimplementing debhelper in Haskell any time soon.

For one thing, individual Haskell binaries are quite large, statically linking all Haskell libraries they use, and so the installed size of debhelper would go up quite a bit. I hope that forthcoming changes will move things toward dynamically linked haskell libraries, and make it more appealing for projects that involve a lot of small commands.

So, just a thought experiment for now..

←	May 2013					→
S	M	T	W	T	F	S
			1	2 the #newinwheezy game: STM	3	4
5	6	7	8 faster dh	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31