Debian wheezy includes a bunch of excellent new Haskell libraries. I'm going to highlight one that should be interesting to non-Haskell developers, who may have struggled with writing non-buggy threaded programs in other languages: libghc-stm-dev
I had given up on most threaded programs before learning about Software Transactional Memory. Writing a correct threaded program, when multiple threads needed to modify the same state, needed careful uses of locking. In my experience, locking is almost never gotten right the first time.
A real life example I encountered is an app that displays a queue of files to be downloaded, and a list of files currently downloading. Starting a new download would go something like this:
startDownload = do
file <- getQueuedFile
startDownloadThread file
push file currentDownLoads
But there's a point in time in which another thread, that refreshes the display, could then see an inconsistent state, where the file is in neither place. To fix this, you'd need to add lock checking around all accesses to the download queue and current downloads list, and lock them both here. (And be sure to always take the locks in the same order!)
But, it's worse than that, because how is getQueuedFile
implemented?
If the queue is empty, it needs to wait on a file being added. But how
can a file be added the queue if we've locked it in order to perform this
larger startDownload
operation? What should be really simple code
has become really complex juggling of locks.
STM deals with this in a much nicer way:
startDownload = atomically $ do
file <- getQueuedFile
startDownloadThread file
push file currentDownLoads
Now the two operations are performed as one atomic transaction. It's not possible for any other thread to see an inconsistent state. No explicit locking is needed.
And, getQueuedFile
can do whatever waiting it needs to, also using STM.
This becomes part of the same larger transaction, in a way that cannot
deadlock. It might be implemented like this:
getQueuedFile = atomically $
size <- getSize downLoadQueue
if size == 0
then retry
else pop downloadQueue
When the queue is empty and this calls "retry", STM automatically waits for the queue to change before restarting the transaction. So this blocks until a file becomes available. It does it without any locking, and without you needing to explicitly tell STM what you're waiting on.
I find this beautiful, and am happier with it the more I use it in my code.
Functions like getQueuedFile
that run entirely in STM are building blocks
that can be snapped together without worries to build more and more complex
things.
For non-Haskell developers, STM is also available in Clojure, and work is underway to add it to gcc. There is also Hardware Transactional Memory coming, to speed it up. Although in my experience it's quite acceptably fast already.
However, as far as I know, all these other implementations of STM leave developers with a problem nearly as thorny as the original problem with locking. STM inherently works by detecting when a change is made that conflicts with another transaction, throwing away the change, and retrying. This means that code inside a STM transaction may run more than once.
Wait a second.. Doesn't that mean this code has a problem?
startDownload = atomically $ do
file <- getQueuedFile
startDownloadThread file
push file currentDownLoads
Yes, this code is buggy! If the download thread is started, but then STM restarts the transaction, the same file will be downloaded repeatedly.
The C, Clojure, etc, STM implementations all let you write this buggy code.
Haskell, however, does not. The buggy code I showed won't even compile. The
way it prevents this involves, well, monads. But essentially, it is able to
use type checking to automatically determine that startDownloadThread
is
not safe to put in the middle of a STM transaction. You're left with no
choice but to change things so the thread is only
spawned once the transaction succeeds:
startDownload = do
file <- atomically $ do
f <- getQueuedFile
push f currentDownLoads
return f
startDownloadThread file
If you appreciate that, you may want to check out some other #newinwheezy stuff like libghc-yesod-dev, a web framework that uses type checking to avoid broken urls and XSS attacks, and also makes heavy use of threading, so is a great fit for using with STM. And libghc-quickcheck2-dev, which leverages the type system to automatically test properties about your program.
With wheezy released, the floodgates are opened on a lot of debhelper
changes that have been piling up. Most of these should be pretty minor, but
I released one yesterday that will affect all users of dh
. Hopefully in a
good way.
I made dh
smarter about selecting which debhelper commands it runs.
It can tell when a package does not use the stuff done by a particular
command, and skips running the command entirely.
So the debian/rules binary
of a package using dh
will now often look like this:
dh binary dh_testroot dh_prep dh_auto_install dh_installdocs dh_installchangelogs dh_perl dh_link dh_compress dh_fixperms dh_installdeb dh_gencontrol dh_md5sums dh_builddeb
Which is pretty close to the optimal hand-crafted debian/rules
file (and just
about as fast, too). But with the benefit that if you later add, say, cron job
files, dh_installcron
will automatically start being run too.
Hopefully this will not result in any behavior changes, other than packages building faster and with less noise. If there is a bug it'll probably be something missing in the specification of when a command needs to be run.
Beyond speed, I hope that this will help to lower the bar to adding new commands to debhelper, and to the default dh sequences. Before, every such new command slowed things down and was annoying. Now more special-purpose commands won't get in the way of packages that don't need them.
The way this works is that debhelper commands can include a "PROMISE"
directive. An example from dh_installexamples
# PROMISE: DH NOOP WITHOUT examples
Mostly this specifies the files in debian/
that are used by the command, and
whose presence triggers the command to run. There is also a syntax to specify
items that can be present in the package build directory to trigger the command
to run.
(Unfortunatly, dh_perl
can't use this. There's no good way to specify
when dh_perl
needs to run, short of doing nearly as much work as dh_perl
would do when run. Oh well.)
Note that third-party dh_
commands can include these directives too, if that
makes sense.
I'm happy how this turned out, but I could be happier about the implementation. The PROMISE directives need to be maintained along with the code of the command. If another config file is added, they obviously must be updated. Other changes to a command can invalidate the PROMISE directive, and cause unexpected bugs.
What would be ideal is to not repeat the inputs of the command in these directives, but instead write the command such that its inputs can be automatically extracted. I played around with some code like this:
$behavior = main_behavior("docs tmp(usr/share/doc/)", sub {
my $package=shift;
my $docs=shift;
my $docdir=shift;
install($docs, $docdir);
});
$behavior->($package);
But refactoring all debhelper commands to be written in this style would be a big job. And I was not happy enough with the flexability and expressiveness of this to continue with it.
I can however, dream about what this would look like if debhelper were written
in Haskell. Then I would have a Debhelper a
monad, within which each command
executes.
main = runDebhelperIO installDocs
installDocs :: Monad a => Debhelper a
installDocs = do
docs <- configFile "docs"
docdir <- tmpDir "usr/share/doc"
lift $ install docs docdir
To run the command, runDebhelperIO
would loop over all the packages
and run the action, in the Debhelper IO
monad.
But, this also allows making an examineDebhelper
that takes an action
like installDocs
, and runs it in a Debhelper Writer
monad. That would
accumulate a list of all the inputs used by the action, and return it,
without performing any side effecting IO actions.
It's been 15 years since I last changed the language debhelper was written
in. I did that for less gains than this, really. (The issue back then was
that shell getopt
sucked.) IIRC it was not very hard, and only took a few
days. Still, I don't really anticipate reimplementing debhelper in Haskell
any time soon.
For one thing, individual Haskell binaries are quite large, statically linking all Haskell libraries they use, and so the installed size of debhelper would go up quite a bit. I hope that forthcoming changes will move things toward dynamically linked haskell libraries, and make it more appealing for projects that involve a lot of small commands.
So, just a thought experiment for now..