unicode ate my homework

I've just spent several days trying to adapt git-annex to changes in ghc 7.4's handling of unicode in filenames. And by spent, I mean, time withdrawn from the bank, and frittered away.

In kindergarten, the top of the classroom wall was encircled by the aA bB cC of the alphabet. I'll bet they still put that up on the walls. And all the kids who grow up to become involved with computers learn that was a lie. The alphabet doesn't stop at zZ. It wouldn't all fit on a wall anymore.

So we're in a transition period, where we've all learnt deeply the alphabet, but the reality is much more complicated. And the collision between that intuitive sense of the world and the real world makes things more complicated still. And so, until we get much farther along in this transition period, you have to be very lucky indeed to not have wasted time dealing with that complexity, or at least having encountered Mojibake.

Most of the pain centers around programming languages, and libraries, which are all at different stages of the transition from ascii and other legacy encodings to unicode.

  • If you're using C, you likely deal with all characters as raw bytes, and rely on the backwards compatibility built into UTF-8, or you go to long lengths to manually deal with wide characters, so you can intelligently manipulate strings. The transition has barely begun, and will, apparently, never end.
  • If you're using perl (at least like I do in ikiwiki), everything is (probably) unicode internally, but every time you call a library or do IO you have to manually deal with conversions, that are generally not even documented. You constantly find new encoding bugs. (If you're lucky, you don't find outright language bugs... I have.) You're at a very uncomfortable midpoint of the transition.
  • If you're using haskell, or probably lots of other languages like python and ruby, everything is unicode all the time.. except for when it's not.
  • If you're using javascript, the transition is basically complete.
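
To make the C bullet a bit more concrete, here is a minimal sketch (in Python, purely for illustration) of the backwards compatibility that byte-oriented code leans on. ASCII text encodes to exactly the same bytes in UTF-8, while anything else becomes several bytes, so counting bytes stops being the same as counting characters. The helper name `byte_and_char_lengths` is made up for this example.

```python
def byte_and_char_lengths(text: str) -> tuple:
    """Return (number of UTF-8 bytes, number of characters) for text."""
    return (len(text.encode("utf-8")), len(text))


# Pure ASCII: the UTF-8 bytes are exactly the ASCII bytes.
ascii_is_unchanged = "rm".encode("utf-8") == "rm".encode("ascii")

# One accented letter: one character, but two bytes in UTF-8.
cafe_lengths = byte_and_char_lengths("caf\u00e9")
```

Code that only copies or compares the bytes keeps working; code that indexes or truncates them can cut a character in half.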

My most recent pain is because the haskell GHC compiler is moving along in the transition, getting closer to the end. Or at least finishing the second 80% and moving into the third 80%. (This is not a quick transition..)

The change involves filename encodings, a situation that, at least on unix systems, is a vast mess of its own. Any filename, anywhere, can be in any encoding, and there's no way to know what's the right one, if you dislike guessing.

Haskell folk like strongly typed stuff, so this ambiguity about what type of data is contained in a FilePath type was surely anathema. So GHC is changing to always use UTF-8 for operations on FilePath. (Or whatever the system encoding is set to, but let's just assume it's UTF-8.)

Which is great and all, unless you need to write a Haskell program that can deal with arbitrary files. Let's say you want to delete a file. Just a simple rm. Now there are two problems:

  1. The input filename is assumed to be in the system encoding aka unicode. What if it cannot be validly interpreted in that encoding? Probably your rm throws an exception.
  2. Once the FilePath is loaded, it's been decoded to unicode characters. In order to call unlink, these have to be re-encoded to get a filename. Will that be the same bytes as the input filename and the filename on disk? Possibly not, and then the rm will delete the wrong thing, or fail.
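
Both problems can be reproduced in a few lines. The sketch below uses Python rather than Haskell, and the file name and helper names (`raw_name`, `strict_decode`, `lossy_round_trip`) are invented for the example; it only illustrates the general decode and re-encode hazard, not what GHC does internally.

```python
# A filename as it might sit on disk: Latin-1 bytes, which are not valid UTF-8.
raw_name = b"caf\xe9.txt"


def strict_decode(name: bytes) -> str:
    """Problem 1: a strict decoder refuses the name outright."""
    return name.decode("utf-8")  # raises UnicodeDecodeError for raw_name


def lossy_round_trip(name: bytes) -> bytes:
    """Problem 2: a forgiving decoder accepts the name but changes it."""
    text = name.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
    return text.encode("utf-8")


# The re-encoded name no longer matches the bytes on disk.
round_trip_differs = lossy_round_trip(raw_name) != raw_name
```

An rm built on the second function would go looking for a file that does not exist, or worse, one that does.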

But haskell people are smart, so they thought of this problem, and provided a separate type that can deal with it. RawFilePath harks back to kindergarten; the filename is simply a series of bytes with no encoding. Which means it cannot be converted to a FilePath without encountering the above problems. But it does let you write a safe rm in ghc 7.4.
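
The same idea, sketched in Python instead of Haskell: keep the name as bytes from start to finish and hand it straight to unlink. The names `raw_rm` and `demo` are hypothetical, and the demo assumes a filesystem that accepts arbitrary bytes in filenames, as typical Linux filesystems do.

```python
import os
import tempfile


def raw_rm(path: bytes) -> None:
    """Delete a file named by raw bytes, never decoding the name."""
    os.unlink(path)


def demo() -> list:
    """Create a file with a non-UTF-8 name, then remove it by its bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_raw = os.fsencode(tmp)
        name = os.path.join(tmp_raw, b"caf\xe9.txt")
        with open(name, "wb"):
            pass  # an empty file is enough
        before = os.listdir(tmp_raw)  # bytes in, bytes out
        raw_rm(name)
        after = os.listdir(tmp_raw)
        return [before, after]
```

Since nothing is ever decoded, there is nothing to throw an exception and nothing to re-encode wrongly.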

So I set out to make something more complicated than a rm, that still needs to deal with arbitrary filename encodings. And I soon saw it would be problematic. Because the things ghc can do with RawFilePaths are limited. It can't even split the directory from the filename. We often do need to manipulate filenames in such ways, even if we don't know their encoding, when we're doing something more complicated than rm.

If you use a library that does anything useful with FilePath, it's not available for RawFilePath. If you used standard haskell stuff like readFile and writeFile, it's not available for RawFilePath either. Enjoy your low-level POSIX interface!

So, I went lowlevel, and wrote my own RawFilePath versions of pretty much all of System.FilePath, and System.Directory, and parts of MissingH and other libraries. (And noticed that I can understand all this Haskell code.. yay!) And I got it close enough to working that, I'm sure, if I wanted to chase type errors for a week, I could get git-annex, with ghc 7.4, to fully work on any encoding of filenames.
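
For a flavour of what a byte-level path function looks like, here is a rough sketch in Python. It is not the git-annex code, and `split_file_name` is a hypothetical name that only loosely resembles what System.FilePath offers. It works without knowing the encoding because on unix the byte 0x2F is always the directory separator, whatever the rest of the bytes mean.

```python
def split_file_name(path: bytes) -> tuple:
    """Split a raw path into (directory including trailing slash, file name)."""
    directory, separator, name = path.rpartition(b"/")
    return (directory + separator, name)


# The undecodable byte stays untouched in the file name part.
parts = split_file_name(b"/tmp/caf\xe9.txt")
```

One function is easy. The tedious part is redoing every other function in the same style.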

But, now I'm left wondering what to do, because all this work is regressive; it's swimming against the tide of the transition. GHC's change is certainly the right change to make for most programs, that are not like rm. And so most programs and libraries won't use RawFilePath. This risks leaving a program that does use it a fish out of water.

At this point, I'm inclined to make git-annex support only unicode (or the system encoding). That's easy. And maybe have a branch that uses RawFilePath, in a hackish and type-unsafe way, with no guarantees of correctness, for those who really need it.


Previously: unicode eye chart wanted on a bumper sticker abc boxes unpacking boxes

unicode eye chart

E
￿ ڿ
ᛯ ℇ ✈
🍺 ಅ ΐ ʐ 𝍇
Щ অ ℻ ⌬ ⌨ ⌣
₰ ⠝ ‱ ‽ ח ֆ ∜ ⨀ Ĳ
Ⴊ ⇠ ਐ ῼ இ ╁ ଠ ୭ ⅙ ㈣
⧒ ₔ ⅷ ﭗ ゛ 〃 ・ ↂ ﻩ ✞ ℼ ⌧





Reference key for the Optometrist:

LATIN CAPITAL LETTER E

THAT DAMN UNICODE BOX WITH F'S IN, ARABIC LETTER TCHEH WITH DOT ABOVE

RUNIC TVIMADUR SYMBOL, EULER CONSTANT, AIRPLANE

BEER MUG, KANNADA LETTER A, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS, LATIN SMALL LETTER Z WITH RETROFLEX HOOK, TETRAGRAM FOR DEPARTURE

CYRILLIC CAPITAL LETTER SHCHA, BENGALI LETTER A, FACSIMILE SIGN, BENZENE RING, KEYBOARD, SMILE

GERMAN PENNY SIGN, BRAILLE PATTERN DOTS-1345 (or GLIDER), PER TEN THOUSAND SIGN, INTERROBANG, HEBREW LETTER HET, ARMENIAN SMALL LETTER FEH, FOURTH ROOT, N-ARY CIRCLED DOT OPERATOR, LATIN CAPITAL LIGATURE IJ

GEORGIAN CAPITAL LETTER LAS, LEFTWARDS DASHED ARROW, GURMUKHI LETTER AI, GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI, TAMIL LETTER I, BOX DRAWINGS DOWN HEAVY AND UP HORIZONTAL LIGHT, ORIYA LETTER TTHA, ORIYA DIGIT SEVEN (or INVERTED DEBIAN LOGO), VULGAR FRACTION ONE SIXTH, PARENTHESIZED IDEOGRAPH FOUR

BOWTIE WITH RIGHT HALF BLACK, LATIN SUBSCRIPT SMALL LETTER SCHWA, SMALL ROMAN NUMERAL EIGHT, ARABIC LETTER PEH FINAL FORM, KATAKANA-HIRAGANA VOICED SOUND MARK, DITTO MARK, KATAKANA MIDDLE DOT, ROMAN NUMERAL TEN THOUSAND, ARABIC LETTER HEH ISOLATED FORM, SHADOWED WHITE LATIN CROSS, DOUBLE-STRUCK SMALL PI, X IN A RECTANGLE BOX


Now updated for Unicode 6.0!
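
For optometrists who would rather not type the key by hand, most of it can be looked up from the character database that ships with Python. This is only a sketch; `describe` and `reference_key` are names invented here, and characters without an official name (like the box with F's in) get a placeholder instead.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the Unicode name of ch, or a placeholder if it has none."""
    return unicodedata.name(ch, "<no name: U+%04X>" % ord(ch))


def reference_key(row: str) -> str:
    """Build one line of the key from a row of space-separated characters."""
    return ", ".join(describe(ch) for ch in row.split())


# The third row of the chart: a rune, Euler's constant and an airplane.
third_row = reference_key("\u16ef \u2107 \u2708")
```

The jokes in parentheses still have to be added by hand.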

unpacking boxes

A few people have asked what I meant in my "boxes" comic. Others just look at me strangely. Some hackers have enjoyed the comic, so I thought I'd try to unpack the meaning of it a bit for the layperson. Of course this will spoil the joke, but what the hey.

A hacker who looks at the comic probably thinks about things like recursion, fractals, unicode, fonts, and bugs.

fonts

A font is how the computer knows how to draw the letters on its screen. Before computers, a font was a bunch of metal bits that could be arranged and fed into a printing press. As with most things involving drawing on a screen, font technology is much more complicated these days than you'd ever imagine. Fonts and the programs that draw them (font engines) are something of a mystery even to most hackers.

unicode

Unicode is an absurdly complicated version of the alphabet. Or rather, of every alphabet and similar thing used by anyone on earth or in Star Trek. It's not a font, it's a way for the computer to store text internally, by basically assigning a unique number to each letter.

Since most hackers grew up in a world without unicode, and suffered through the (still continuing) transition to it, they're familiar with lots of quirks and weird and broken behaviors you get from it.

One of these quirks is what to do when the computer wants to display a particular letter from unicode, but that letter is not available in the font. Since unicode contains effectively an infinite number of symbols, and each symbol in a font is carefully crafted by hand, this occurs pretty often.

Handling of this situation varies, but often the computer will just display a little empty square box for the letter that it doesn't know how to draw. Some font engines go a step further and write the number of the unicode character in little tiny numbers inside the box. Presumably so that a hacker can look at it and tell what character is missing. The numbers are typically written one in each corner of the box, and you'd have to squint to see them.

fractals

A fractal is a special kind of picture, where parts of the picture repeat over and over at different scales. You've probably seen pictures of the most famous fractal, the Mandelbrot set. That one is generated by a kind of formula, but there are much simpler ones, like the Koch snowflake.

Hackers love fractals, because they're pretty, and complex, while based on very simple math.
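
How simple is the math? For the Mandelbrot set the whole formula is "square the number, add c, repeat". A point c belongs to the set if the result never runs off to infinity. A minimal sketch in Python, where the function name and the cutoff of 100 rounds are arbitrary choices for the example:

```python
def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
    """Guess whether c is in the Mandelbrot set by iterating z = z*z + c."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False  # once past 2 the value can only keep growing
    return True  # still bounded after max_iter rounds, so probably inside
```

Colour each pixel by the answer for its point and the famous picture appears.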

recursion

"Recursion: See recursion."

Ha! That's the best hacker joke ever, a true classic. Recursion is when the computer sees an instruction like that and mindlessly repeats it over and over. This actually turns out to be useful, and hackers see a kind of beauty in it.

bugs

Bugs are what happens when a computer program, which is supposed to be internally clear and simple, and have well-defined behavior, instead turns out to exhibit unintentional complexity. Kinda like a fractal. This begs the question of whether hackers like bugs, or not. Most of them spend most of their time creating more, so you be the judge.

The bug I'm alluding to in the comic is this: What if the computer, when drawing some word, stumbled over a unicode character that was not in the font. And then say it went to draw the little box instead, and fill in the numbers to indicate what character it was. Except it turned out that its font didn't tell it how to draw the numbers either. And thanks to a really unlikely bug, it then proceeded to recurse, drawing smaller and smaller boxes for these characters that it didn't know how to draw. Until it drew a fractal instead.
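
For the hackers still reading, a toy model of that bug in Python. Everything here (`render`, the `font` set, `max_depth`) is invented for the illustration and looks nothing like a real font engine. The depth limit stands in for the point where the boxes get too small to draw.

```python
def render(ch: str, font: set, depth: int = 0, max_depth: int = 3) -> str:
    """Render ch, falling back to a box of hex digits when the glyph is missing."""
    if ch in font:
        return ch
    if depth == max_depth:
        return "[]"  # too small to draw anything inside: an empty box
    digits = format(ord(ch), "04X")
    # The bug: the digits go through the same font, which may lack them too.
    inner = "".join(render(d, font, depth + 1, max_depth) for d in digits)
    return "[" + inner + "]"


# A font with hex digits but no snowman gives the normal little box.
normal_box = render("\u2603", set("0123456789ABCDEF"))

# A font with nothing at all gives boxes inside boxes.
nested_boxes = render("\u2603", set(), max_depth=1)
```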

See: It's just hilarious once I've explained it.

(And that explanation of the word "cat", by the way, is why I don't try to write most of the technical posts in my blog in a form that a layman could understand.)

boxes

boxes.png

Why they don't let me implement font renderers.

(svg source)

uniencode

I just had too much fun with unicode, see filters 2.34.
