unicode ate my homework

I've just spent several days trying to adapt git-annex to changes in ghc 7.4's handling of unicode in filenames. And by spent, I mean, time withdrawn from the bank, and frittered away.

In kindergarten, the top of the classrom wall was encircled by the aA bB cC of the alphabet. I'll bet they still put that up on the walls. And all the kids who grow up to become involved with computers learn that was a lie. The alphabet doesn't stop at zZ. It wouldn't all fit on a wall anymore.

So we're in a transition period, where we've all learnt deeply the alphabet, but the reality is much more complicated. And the collision between that intuitive sense of the world and the real world makes things more complicated still. And so, until we get much farther along in this transition period, you have to be very lucky indeed to not have wasted time dealing with that complexity, or at least having encountered Mojibake.

Most of the pain centers around programming languages, and libraries, which are all at different stages of the transition from ascii and other legacy encodings to unicode.

  • If you're using C, you likely deal with all characters as raw bytes, and rely on the backwards compatability built into UTF-8, or you go to long lengths to manually deal with wide characters, so you can intelligently manipulate strings. The transition has barely begin, and will, apparently, never end.
  • If you're using perl (at least like I do in ikiwiki), everything is (probably) unicode internally, but every time you call a library or do IO you have to manually deal with conversions, that are generally not even documented. You constantly find new encoding bugs. (If you're lucky, you don't find outright language bugs... I have.) You're at a very uncomfortable midpoint of the transition.
  • If you're using haskell, or probably lots of other languages like python and ruby, everything is unicode all the time.. except for when it's not.
  • If you're using javascript, the transition is basically complete.

My most recent pain is because the haskell GHC compiler is moving along in the transition, getting closer to the end. Or at least finishing the second 80% and moving into the third 80%. (This is not a quick transition..)

The change involves filename encodings, a situation that, at least on unix systems, is a vast mess of its own. Any filename, anywhere, can be in any encoding, and there's no way to know what's the right one, if you dislike guessing.

Haskell folk like strongly typed stuff, so this ambiguity about what type of data is contained in a FilePath type was surely anathama. So GHC is changing to always use UTF-8 for operations on FilePath. (Or whatever the system encoding is set to, but let's just assume it's UTF-8.)

Which is great and all, unless you need to write a Haskell program that can deal with arbitrary files. Let's say you want to delete a file. Just a simple rm. Now there are two problems:

  1. The input filename is assumed to be in the system encoding aka unicode. What if it cannot be validly interpreted in that encoding? Probably your rm throws an exception.
  2. Once the FilePath is loaded, it's been decoded to unicode characters. In order to call unlink, these have to be re-encoded to get a filename. Will that be the same bytes as the input filename and the filename on disk? Possibly not, and then the rm will delete the wrong thing, or fail.

But haskell people are smart, so they thought of this problem, and provided a separate type that can deal with it. RawFilePath hearks back to kindergarten; the filename is simply a series of bytes with no encoding. Which means it cannot be converted to a FilePath without encountering the above problems. But does let you write a safe rm in ghc 7.4.

So I set out to make something more complicated than a rm, that still needs to deal with arbitrary filename encodings. And I soon saw it would be problimatic. Because the things ghc can do with RawFilePaths are limited. It can't even split the directory from the filename. We often do need to manipulate filenames in such ways, even if we don't know their encoding, when we're doing something more complicated than rm.

If you use a library that does anything useful with FilePath, it's not available for RawFilePath. If you used standard haskell stuff like readFile and writeFile, it's not available for RawFilePath either. Enjoy your low-level POSIX interface!

So, I went lowlevel, and wrote my own RawFilePath versions of pretty much all of System.FilePath, and System.Directory, and parts of MissingH and other libraries. (And noticed that I can understand all this Haskell code.. yay!) And I got it close enough to working that, I'm sure, if I wanted to chase type errors for a week, I could get git-annex, with ghc 7.4, to fully work on any encoding of filenames.

But, now I'm left wondering what to do, because all this work is regressive; it's swimming against the tide of the transition. GHC's change is certainly the right change to make for most programs, that are not like rm. And so most programs and libraries won't use RawFilePath. This risks leaving a program that does a fish out of water.

At this point, I'm inclined to make git-annex support only unicode (or the system encoding). That's easy. And maybe have a branch that uses RawFilePath, in a hackish and type-unsafe way, with no guarantees of correctness, for those who really need it.


Previously: unicode eye chart wanted on a bumper sticker abc boxes unpacking boxes

Posted
more on ghc filename encodings

My last post missed an important thing about GHC 7.4's handling of encodings for FileName. It can in fact be safe to use FilePath to write a command like rm. This is because GHC internally uses a special encoding for FilePath data, that is documented to allow "arbitrary undecodable bytes to be round-tripped through it". (It seems to do this by encoding the undecodable bytes as very high unicode code points.) So, when presented with a filename that cannot be decoded using utf-8 (or whatever the system encoding is), it still handles it, and using the resulting FilePath will in fact operate on the right file. Whew!

Moral of the story is that if you're going to be using GHC 7.4 to read or write filenames from a pipe, or a file, you need to arrange for the Handle you're reading or writing to use this special encoding too. I use this to set up my Handles:

import System.IO
import GHC.IO.Encoding
import GHC.IO.Handle

fileEncoding :: Handle -> IO ()
fileEncoding h = hSetEncoding h =<< getFileSystemEncoding

Even if you're only going to write a FilePath to stdout, you need to do this. Otherwise, your program will crash on some filenames! This doesn't seem quite right to me, but I hesitate to file a bug report. (And this is not a new problem in GHC anyway.) If I did, it would have this testcase:

# touch "me¡"
# LANG=C ghc
Prelude> :m System.Directory
Prelude System.Directory> mapM_ putStrLn =<< getDirectoryContents "."
me*** Exception: <stdout>: hPutChar: invalid argument (invalid character)

Since git-annex reads lots of filenames from git commands and other places, I had to deal with this extensively. Unfortunatly I have not found a way to read Text from a Handle using the fileSystemEncoding. So I'm stuck with slow Strings. But, it does seem to work now.


PS: I found a bug in GHC 7.4 today where one of those famous Haskell immutable values seems to get well, mutated. Specifically a [FilePath] that is non-empty at the top of a function ends up empty at the bottom. Unless IO is done involving it at the top. Really. Hope to develop a test case soon. Happily, the code that triggered it did so while working around a bug in GHC that is fixed in 7.4. Language bugs.. gotta love em.

Posted
addicted to $

One of the weird historical accidents of programming languages is that so many of them use $ for important things. The reason is just that out of the available punctuation, nearly all of it has a mathmatical or other predefined use that makes sense to retain in a programming language context, while $ (and also @ and #) do not. Still, $ annoys me, it's so asymetric that we use it all over our code, and never a £ or ฿ to be seen.

The one language that manages to use $ nicely, IMHO, is Haskell. Recently I noticed that it has an actual visual mnemonic in its use of $. And it's used for something I've not seen in other languages.

The visual mnemonic of $ is that it looks like an opening parenthesis, with the related closing parenthesis on a line below it.

(something (that
    (lisp folks
        (are (very (familiar with)))
    )
))

And this is also the problem that $ solves:

something $ that $
    haskell folks $
        are $ very $ familiar with

This is a trivial feature.. but oh so useful. The implementation in Haskell of $ is simply:

f $ x = f x
infixr 0 $

Just function application, but at a different precedence than usual.

I am now very addicted to my $. Out of 15 thousand lines of code, only 87 contain )), while 10% use $.

Posted
leap day

This leap day saw me driving along the river on a rainy, with 4 chickens in the car's trunk, and 3 terabytes of disk (and a half a bale of straw) in the back seat. I may have not been blogging much lately about life, because these situations can be hard to explain. (Or because "joined the Debian haskell team and spent two days working on rebuilds for the ghc 7.4 transition" is not thrilling reading.)

hens in a car

The Light Sussex chickens are my sister's spare flock, which are "too tame". They're now cozily installed into a coop we built last weekend. In return I gave her a 6 foot long APC power strip, which had been mounted on the wall of my office. I'm preparing my house in town to be rented, and have little need for two dozen power outlets here in solar power land.

Google <3s Your Work

Indeed, today is a gift economy day all around -- when I arrived at the cabin, there on the porch was an unexpected package from Google. Particularly surprising since I never get deliveries here, since the driveway is a mile long and often seems like it could dead-end into the woods at any moment.

The combination of technological wackiness (I also debugged a laptop whose USB hub hangs when a particular trackball is plugged in) and in your face country texture (including coal trains, being stuck behind a tractor, and miles of amazing tree-height mist) made this a memorable day.

Posted