three thousand lines of Haskell

Three thousand lines of code is not a huge program, but it is enough to get a pretty good feel for a language. Now that I've completed my first real Haskell program I feel that I've gotten over several of the humps in the learning curve and am starting to get a good feel for it.

Actually, I've written closer to five thousand lines, since there were several big refactorings. One was when I stopped manually threading my program state around and added a StateT monad. I did know from the beginning I would need one, but it seemed easier and a better learning exercise to let the program start out with a vestigial tail and gills before growing up into a modern Haskell program. (I suppose it's still written in baby-Haskell, really..)
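A minimal sketch of that refactoring, not the actual git-annex code: the AnnexState record, its fields, and the noteFile functions are made up for illustration, and the StateT used is the one from the transformers package.

```haskell
import Control.Monad (when)
import Control.Monad.IO.Class (liftIO)
import Control.Monad.Trans.State.Strict (StateT, evalStateT, gets, modify)

-- Hypothetical program state; the real program's state is not shown in the post.
data AnnexState = AnnexState
  { verbose :: Bool
  , seen    :: [String]
  }

-- Before: the state is passed in and handed back by every function.
noteFileManual :: AnnexState -> String -> IO AnnexState
noteFileManual st f = do
  when (verbose st) $ putStrLn ("noting " ++ f)
  return st { seen = f : seen st }

-- After: StateT carries the state, so signatures only mention the real arguments.
type Annex = StateT AnnexState IO

noteFile :: String -> Annex ()
noteFile f = do
  v <- gets verbose
  when v $ liftIO (putStrLn ("noting " ++ f))
  modify (\st -> st { seen = f : seen st })

-- Run an action from a quiet, empty state and return the files noted, oldest first.
runNotes :: Annex () -> IO [String]
runNotes act = evalStateT (act >> gets (reverse . seen)) (AnnexState False [])
```

With the second form, a chain like noteFile "a" >> noteFile "b" needs no intermediate state variables at all.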

Another refactoring came when I realized I needed to use a custom data type, not String, to represent keys. That was a great experience in type-based refactoring. Being able to keep typing ':make' and landing on the next bit of code that needed fixing was great, and simply adding that one type exposed several non-obvious bugs.
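Roughly what such a change looks like, as a sketch only: the Key type here is a bare newtype wrapper, and fileKey, keyPath and the "annex/" prefix are invented names and rules, not what git-annex does.

```haskell
-- Hypothetical key type; the real one's contents are not shown in the post.
newtype Key = Key String
  deriving (Eq, Ord, Show)

-- The one place a file name becomes a Key.
-- (Dropping the directory part is an invented rule, just for illustration.)
fileKey :: FilePath -> Key
fileKey = Key . reverse . takeWhile (/= '/') . reverse

-- Functions that want a key now say so in their type.
keyPath :: Key -> FilePath
keyPath (Key k) = "annex/" ++ k
```

Once keyPath takes a Key, a call like keyPath "some/file" no longer compiles, and each such compile error marks a spot where a file name was being passed off as a key.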

I found myself writing code that is much more solid and reusable than normally comes easily. And yet it's also very malleable. Actually, pulling out better data types and abstractions can get a bit addictive.

When I realized that I had a similar three-stage control flow being used for each of git-annex's subcommands, and factored that control flow out into a function that used the 3 data types below, I felt I'd gone down that rabbit hole perhaps far enough for now.

type SubCmdStart = String -> Annex (Maybe SubCmdPerform)
type SubCmdPerform = Annex (Maybe SubCmdCleanup)
type SubCmdCleanup = Annex Bool

(That will allow for some nice parallelism later though, and removed dozens of lines of code, so was worth it.)
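A sketch of what a driver over those three types could look like. Several things here are assumptions: Annex is replaced by plain IO to keep the example self-contained, doSubCmd and the demo subcommand are illustrative names, and so is the reading that Nothing from the start stage means "nothing to do" while Nothing from the perform stage means failure.

```haskell
-- Stand-in for the real Annex monad; plain IO keeps this sketch small.
type Annex = IO

type SubCmdStart = String -> Annex (Maybe SubCmdPerform)
type SubCmdPerform = Annex (Maybe SubCmdCleanup)
type SubCmdCleanup = Annex Bool

-- Run the three stages in order for one parameter.
-- Assumed meaning: start returning Nothing is a skip (counted as success),
-- perform returning Nothing is a failure.
doSubCmd :: SubCmdStart -> String -> Annex Bool
doSubCmd start param = do
  started <- start param
  case started of
    Nothing -> return True
    Just perform -> do
      performed <- perform
      case performed of
        Nothing -> return False
        Just cleanup -> cleanup

-- Hypothetical subcommand: skips empty parameters, otherwise reports success.
demoStart :: SubCmdStart
demoStart "" = return Nothing
demoStart file = return (Just (demoPerform file))

demoPerform :: String -> SubCmdPerform
demoPerform file = do
  putStrLn ("processing " ++ file)
  return (Just (return True))
```

Each subcommand then only supplies its own start function, and running it over many parameters is just mapM (doSubCmd demoStart) params.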

Since git-annex is a very Real World Haskell type program, there is a lot of impure code in it. I could probably do better at factoring out more pure code. I count 117 impure functions, and only 37 pure.

Anyhow, from my perspective of a long-time perl programmer, some other random impressions..

ghc --make is handy, but every time it spits out a new 13 mb executable I can feel my laptop's SSD groan!
It was surprisingly easy to get into nasty situations with recursive dependencies between the 19 haskell modules I wrote. Sometimes solving them was really messy. I lost hours to this. More time than I've lost to the problem in all other languages combined over 15 years. It's not clear to me if it was due to the overall design of my program, or if Haskell's types tend to encourage this problem. Or if there's some simple "please let me have recursive dependencies" switch to ghc that I missed..
I'm used to being able to use man to get at multiple books worth of detailed documentation for perl, and work easily offline or with limited bandwidth. With Haskell, I spend much more time searching online for documentation than I am comfortable with (although Hoogle is pretty neat). And the haddock-produced documentation is often pretty sketchy. The saving grace is that the source to any library function is a click away, and tends to be very readable.
I'm used to being able to use pretty much any Unix syscall by name from perl: mkdir, chmod, rename, etc. In Haskell, there is a Windows smell to the names, like createDirectoryIfMissing and setPermissions. And there are pointless distinctions like renameFile vs renameDirectory. These long names are not memorable and I have to look them up every time. Most of POSIX is available, but it's scattered among many disparate libraries, and I can't find an interface for sysconf(3) at all. There is a certain temptation, that I am so far resisting, to make a library for C/perl refugees that exports the sane Unix names for everything.
Anything involving the IO monad, or probably most monads, has a certain level of syntactic clumsiness about it. Compare:
```
  if ($flag{foo} && length($l = <>)) {
```
vs
```
  foo <- getFlag "foo"
  l <- getLine
  if (foo && not (null l))
      then do
```
When writing lots of impure code, that got old, and while I could use ifM, or make up some other similar thing, its syntax would also be somewhat clumsy.
The fixity levels for a lot of stuff seem a bit off. I too often found myself writing error $ "foo: " ++ (show bar) or return $ Just $ ... (Still a lot better than Scheme thanks to $!)
I've leveled up a couple times now, but this particular video game seems to have more levels going up and up, forever. Can't even see the top from here!
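For reference, the ifM mentioned above is not part of base, so here is one common way to define it, along with a short-circuiting monadic "and". The operator name <&&> and the flaggedNonEmpty example are made up for this sketch; the flag and line actions are passed in as arguments so it does not depend on any real flag handling.

```haskell
-- One common definition of ifM: choose a branch based on a monadic condition.
ifM :: Monad m => m Bool -> m a -> m a -> m a
ifM cond yes no = do
  c <- cond
  if c then yes else no

-- Monadic "and" that short-circuits: the second action only runs if the first is True.
(<&&>) :: Monad m => m Bool -> m Bool -> m Bool
a <&&> b = ifM a b (return False)
infixr 3 <&&>

-- Hypothetical example in the shape of the Perl one-liner: test a flag,
-- then test that a line is non-empty, all inside the condition.
flaggedNonEmpty :: Monad m => m Bool -> m String -> m String
flaggedNonEmpty flag line =
  ifM (flag <&&> (not . null <$> line))
      (return "yes")
      (return "no")
```

This gets the test down to one expression, but unlike the Perl version the line that was read is gone once the condition has been evaluated, which is part of the clumsiness.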

Haskell suggestions

Reading your post and browsing your Haskell code, a few things jumped out at me that might help.

It takes a lot of getting used to for a C or scripting language hacker, but functions really do bind more tightly than operators, so you can write things like this:

error $ show errnum ++ " failed"

Or:

uriUserInfo a ++ uriRegName a ++ uriPort a

You don't need to put parentheses around a function application.

Also, language constructs like if have higher syntactic "precedence" than anything else, so you can write:

if foo && not (null l) then ...

You might also find the when function useful; it works like if in a monad, but it has no else, and assumes you don't want a result; so, for instance, in tryRun':

when (errnum > 0) $ error $ show errnum ++ " failed"

(Note the lack of an else return ().)

And since you write Perl, you'll probably appreciate the corresponding unless.

In the GitRepo module, you have two different constructors for Repo, but they share three out of four fields, and they only differ depending on whether you have a FilePath or URI. You might consider changing that to use a single constructor and have one of the fields use a data type that itself can contain either a FilePath or a URI. That would simplify workTree, repoFromPath, and repoFromUrl, among many other functions.

You really want to use pattern matching rather than conditionals. In general, you rarely want if; think of pattern matching as your primary control structure, and if as a special case. You can write things like:

workTree (UrlRepo { url = repoUrl }) = repoUrl
workTree (Repo { top = path }) = path

Or (assuming you unify the two Repo constructors):

repoDescribe (Repo { remoteName = Just r }) = r
repoDescribe (...)

Or:

bare repo = case Map.lookup "core.bare" $ config repo of
    Just value -> value == "true"
    Nothing -> error ...

(Pattern matching has become such a natural control structure for me that I had to resist the impulse to write the one-line-longer-but-more-pattern-matchy Just "true" -> .... Also, I think Git has a default for deciding about bare/non-bare repositories, based on the path ending in .git or not; you might consider using that default to avoid error.)
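A self-contained sketch of the more-pattern-matchy variant. The Config synonym is a made-up stand-in for however the repo stores its configuration, and returning Maybe Bool instead of calling error is one option among several, chosen here so the caller decides what to do when the setting is missing.

```haskell
import qualified Data.Map as Map

-- Hypothetical, cut-down stand-in for the configuration held by a Repo.
type Config = Map.Map String String

-- Nothing means "not configured", leaving the caller to pick a fallback
-- instead of this function calling error.
bare :: Config -> Maybe Bool
bare config = case Map.lookup "core.bare" config of
  Just "true" -> Just True
  Just _      -> Just False
  Nothing     -> Nothing
```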

You should probably also have far fewer calls to error, but that's a different and somewhat more difficult problem. At a minimum, you might consider using something like the ErrorT monad rather than thrown and caught exceptions. You also almost never want pure code generating exceptions of any kind.
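A minimal sketch of that style, using ExceptT, which has since replaced ErrorT in the transformers package. The CmdError type and the checkExit and runAll functions are invented for illustration.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Except (ExceptT, runExceptT, throwE)

-- Hypothetical failure type for illustration.
data CmdError = CommandFailed Int | MissingConfig String
  deriving (Eq, Show)

-- Failure is part of the type, so callers cannot forget about it.
checkExit :: Monad m => Int -> ExceptT CmdError m ()
checkExit 0 = return ()
checkExit n = throwE (CommandFailed n)

-- The first failing step stops the rest, much as a thrown exception would.
runAll :: [Int] -> ExceptT CmdError IO Int
runAll codes = do
  mapM_ checkExit codes
  lift (putStrLn "all commands succeeded")
  return (length codes)
```

Running it with runExceptT (runAll codes) gives back an Either CmdError Int, which the caller can pattern match on like any other value.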

Comment by Josh — in the wee hours of Wednesday night, October 28th, 2010

Fixity

By the way, functions binding tighter than operators explains why you have to write error $ foo ++ bar with the $; that same fixity also explains why you can write show foo ++ "str" with no parentheses.

Comment by Josh — in the wee hours of Wednesday night, October 28th, 2010

comment 3

Josh, thanks for the comments. I know I'm writing baby haskell -- had not thought about using pattern matching to dig inside record types, and will take that on board. I use ifs when I'm thinking procedurally. :) I suspect I should also use guards more (well, at all), but I've not internalized that syntax yet either.
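For comparison, a tiny example of the guard syntax. The checkErrnum helper is hypothetical; it just reuses the errnum check from the earlier comment and returns an Either rather than calling error.

```haskell
-- Hypothetical helper: each guard is tried in order, top to bottom,
-- and otherwise is simply True.
checkErrnum :: Int -> Either String ()
checkErrnum errnum
  | errnum > 0 = Left (show errnum ++ " failed")
  | otherwise  = Right ()
```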

I had thought about unifying the two Repo types, and could very well not have made the best decision there, but it seemed that adding a new type like RepoLocation = Url URI | Dir FilePath would involve lots more tedious pulling apart the nested data types in the places that need to get at those values. With @ pattern matching inside record types, it's not too bad.

workTree r@(Repo { location = Url _ }) = urlPath r
workTree (Repo { location = Dir d }) = d
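Spelled out with data declarations, as a sketch under some assumptions: the URI newtype is a stand-in for the real URI type so that no extra package is needed, the Repo record is cut down to two fields, and this variant binds the URI directly in the nested pattern rather than going through urlPath.

```haskell
-- Stand-in for a real URI type, keeping only the path.
newtype URI = URI { uriPath :: String }

data RepoLocation = Url URI | Dir FilePath

-- Cut-down, hypothetical Repo: one constructor, with the location as a field.
data Repo = Repo
  { location   :: RepoLocation
  , remoteName :: Maybe String
  }

-- Nested patterns reach through the record into the location in one step.
workTree :: Repo -> FilePath
workTree (Repo { location = Url u }) = uriPath u
workTree (Repo { location = Dir d }) = d
```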

Managing exceptions seems like one of the bigger cans of worms in Haskell.

(when is nice to know -- pity about all that punctuation needed though..)

BTW, Git's use of dir.git for bare repos is mostly a heuristic or UI abbreviation and not to be trusted.

Comment by joey — at lunch time on Thursday, October 28th, 2010

comment 4

Regarding the bare repo detection, even if git's own detection only works as a heuristic, making git-annex match git's behavior seems preferable to simply erroring out if you don't have the configuration option.

And yes, exceptions do indeed represent quite a can of worms. I personally fall in the camp that thinks Haskell ought to require declaring possible thrown exceptions as part of types, or not have them at all. They feel like a dynamic sore thumb sticking out of an otherwise static language, and they break my usual heuristic of "it compiles, it must be correct". :)

That said, I won't necessarily argue that you need to get rid of your use of exceptions entirely. And obviously you have to deal with the exceptions generated by other code regardless. But you might consider adding your own explicit exception type, and then only catching the exceptions you know how to deal with. And I'd highly recommend not catching the exception "error" generates, and not using that for expected failures. Most Haskell code I've seen reserves "error" for "program logic error that I couldn't detect until runtime" (such as head [] or fromJust Nothing).
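A minimal sketch of that approach using Control.Exception from base. The AnnexError type, its constructors, and the fetchKey function are invented for illustration.

```haskell
import Control.Exception (Exception, throwIO, try)

-- Hypothetical exception type for failures the program expects and can handle.
data AnnexError = KeyNotFound String | RemoteUnavailable String
  deriving (Eq, Show)

instance Exception AnnexError

-- An expected failure is thrown as AnnexError rather than via error.
fetchKey :: String -> IO String
fetchKey "" = throwIO (KeyNotFound "empty key")
fetchKey k  = return ("content of " ++ k)

-- try is specialised to AnnexError by the type signature, so only that type
-- is caught; an ErrorCall from error, or any other exception, still propagates.
tryFetch :: String -> IO (Either AnnexError String)
tryFetch = try . fetchKey
```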

Comment by Josh — at lunch time on Sunday, October 31st, 2010

open development

One of the great (and perhaps under-appreciated) advantages of open development: I have a problem that I suspect the State monad will solve for me. I can read this commit as a really useful, illustrative example of how it works. -- Jon

Comment by jmtd [livejournal.com] — at lunch time on Monday, November 1st, 2010
