Haskell for Readers (just the types excerpt)

Preface

This document is uniquely tailored to those who need to read, rather than write Haskell code: auditors, scientists, managers, testers etc.

This excerpt of Haskell for Readers focuses on reading Haskell's types, without getting bogged down in the details and complexities of the rest of the Haskell code. By understanding the types, we can understand much of the architecture of a Haskell program.

Form

This document is not (necessarily) a self-contained tutorial; it is rather the base for an interactive lecture, given by a real instructor. In such a lecture, some holes will be filled as we go, and the questions from the audience form a crucial part of the learning experience.

This document is also meant to be more on the concise side, assuming the audience is shorter on time than on wits, and in a small, live workshop, the lecturer can add details, come up with more examples and slow down as needed.

This makes these notes less ideal for independent study, but that said, it should be possible to work attentively through them and still learn a lot.

Audience

I expect the audience to be familiar with programming and computer science in general, but do not assume prior knowledge of functional programming (or, in case you are worried about this, category theory).

Acknowledgments and license

The creation of this material was sponsored by the DFINITY Foundation, and is shared with the public under the terms of the Creative Commons Attribution 4.0 International License. You can view the source on GitHub of this document, and submit improvements there.

Reading Haskell modules

In Haskell, every file is a Haskell module, and modules are used to organize and namespace the code.

Normally, a Haskell module named Foo.Bar.Baz lives in a file Foo/Bar/Baz.hs, and begins with

module Foo.Bar.Baz where

Haskell module names are always capitalized.

The rest of the module consists of declarations: Values, functions, types, type synonyms etc. The order of declarations in a module is completely irrelevant; things do not need to be declared before they are used. This allows the author to sort functions by topic, or by relevance, rather than by dependency, and it is not uncommon to first show the main entry-point of a module, and put all the helper functions it uses below.

A good starting point when reading a module is to look at what types and functions are exported from the module. Many modules list their exports in the module definition.

module Foo.Bar.Baz (Baz, parseBaz, formatBaz, promptBaz, amazingBaz) where

When there's no explicit export list, a module exports all of its top-level declarations.

So, what is a top-level declaration in Haskell? Well, Haskell is a language with significant indentation (similar to Python), and top-level declarations are the ones that are not indented at all.

Try to scan over the example of a Haskell module below and see if you can find its top-level declarations. Don't get hung up on trying to understand the syntax, or what the code does, just look at the shape of the code and pick out the names of the things this module exports.

module Foo.Bar.Baz where

import Data.Maybe
import Control.Applicative

data Baz = Baz Integer
    deriving (Eq, Ord)

promptBaz :: IO Baz
promptBaz = do
    putStr "Input desired Baz value, from 1-100: "
    response <- getLine
    case parseBaz response of
        Nothing -> invalidinput "Sorry, but you need to enter a number."
        Just b
            | b < Baz 1 -> invalidinput "Baz value can't be smaller than 1"
            | b > Baz 100 -> invalieinput "Baz value can't be larger than 100"
            | b == amazingBaz -> do
                putStrLn "That's an amazing Baz!"
                return b
            | otherwise -> return b
  where
    invalidinput reason = do
        putStrLn reason
        promptBaz -- loop

parseBaz :: String -> Maybe Baz
parseBaz s = Baz <$> readMaybe s

formatBaz :: Baz -> String
formatBaz (Baz n) = show n

amazingBaz :: Baz
amazingBaz = Baz 42

If you guessed that the module exports Baz, promptBaz, parseBaz, formatBaz, and amazingBaz, you were right! Those are all of the top-level declarations in the module. The important lines for our purposes are these:

data Baz = Baz Integer

promptBaz :: IO Baz

parseBaz :: String -> Maybe Baz

formatBaz :: Baz -> String

amazingBaz :: Baz

If you can read a Haskell module and pick out top-level declarations like those from the rest of the code, you're well on your way to being able to think about the interface provided by that module. To fully understand it, we need to talk about types.

Types

Haskell has a strong static type system, which is essentially a way for you to communicate with the compiler. You can ask the compiler “what do you know about this function? what can it take, what kind of things does it return?”. And you can tell the compiler “this function ought to take this and return that (and please tell me if you disagree)”.

In fact, many Haskellers prefer to do type-driven development: First think about and write down the types of the functions they need to create, and then think about implementing them.

Besides communicating with the compiler, types are also crucial in communicating with your fellow developers and/or users of your API. For many functions, the type alone, or the type and the name, is sufficient to tell you what it does.

Basic types

Each top-level value and function in a Haskell module has a type, and these types are almost always written down before the code.

amazingBaz :: Baz
amazingBaz = Baz 42

The first line is a type annotation, i.e. a term (amazingBaz) followed by two colons, followed by its type (Baz). So amazingBaz is a constant value of type Baz.

More common than constant values are functions:

formatBaz :: Baz -> String
formatBaz (Baz n) = show n

The arrow in the type of formatBaz tells us that it is a function. It takes a Baz as an argument and returns a String as a result.

A function that takes two values will have a type with two arrows in it, for example:

increaseBaz :: Baz -> Integer -> Baz
increaseBaz (Baz a) b = Baz (a + b)

The types of functions can get much more complicated than that, with many arguments passed to it. And a function can take another function as an argument, making it a higher-order function.

Can you guess how many parameters this function takes from its type annotation?

genericPromptBaz :: (String -> Maybe Baz) -> Baz -> (Baz -> String) -> IO Baz

If you guessed three, you're right; it takes a function of type String -> Maybe Baz, a value of type Baz and another function of type Baz -> String.

Such type annotations often get too long to fit on a single line, so bear in mind that this is another way to format the same type annotation:

genericPromptBaz
    :: (String -> Maybe Baz)
    -> Baz
    -> (Baz -> String)
    -> IO Baz

Polymorphism

So far we've seen functions that operate on fixed data types like String and Baz. The name of a data type always starts with a capital letter. But sometimes you'll see a function whose type annotation contains lower-case types.

genericPrompt
    :: (String -> Maybe t)
    -> t
    -> (t -> String)
    -> IO t

But what is this type t? There is not, actually, a type called t. Instead, this is a type variable, meaning that the function genericPrompt can be used with any type. Any lower-case identifier in a type is a type variable (not just t), and concrete types are always upper-case.

To use genericPrompt with any particular type, t gets filled in with the type. It could be used with parameters based on Baz or instead with numbers, Booleans, or even with functions.

Bear in mind that type variables like t can become any type, but it has the be the same type everywhere in the type annotation.

Constrained types (a first glimpse)

Type annotations for polymorphic functions often include one more bit of syntax.

genericPrompt :: Eq t => (String -> Maybe t) -> t -> IO t

The part after the => is what we expect: The arguments of the function and its return type. The part before the => is new: It is a constraint, and it limits which types t can be instantiated with to those that can be compared for equality.

This Eq thing is not some built-in magic, but rather a type class, another very powerful and important feature of Haskell, which we will dive into separately later.

Algebraic data types

The function type is very expressive, and one can model many data structures purely with functions. But of course it is more convenient to use dedicated data structures. There are a number of data structure types that come with the standard library, in particular tuples, lists, the Maybe type. But it is more instructive to first look at how we can define our own.

We can declare new data types using the data keyword, the name of the type, the = sign, and then a list of constructors, separated by pipes (|). These declarations are a bit odd, since they use = although it is not really an equality, but let us look at it step by step.

Enumerations

In the simplest case, we can use this to declare an enumeration type:

data Suit = Diamonds | Clubs | Hearts | Spades

From now on, we can use the constructors, e.g. Diamonds as values of type Suit. This is how we create values of type Suit -- and it is the only way, so we know that every value of type Suit is, indeed, one of these four constructors.

Note that Suit is a type, i.e. something you can use in type signatures, while Diamonds and the other constructors are values, i.e. something you can use in your function definitions.

Product types

So far, the constructors were just plain values. But we can also turn them into “containers” of sort, where we can store other values. As an basic example, maybe we want to introduce a type for complex numbers:

data Complex = C Integer Integer

(Mathematically educated readers please excuse the use of Integers here.)

This creates a new type Complex, with a constructor C. But C itself is not a value of type Complex, but rather it is a function that creates values of type Complex and, crucially, it is the only way of creating values of type Complex. In the ghci repl, we can ask for the type of C and see that it is indeed just a function:

Prelude> :type C
C :: Integer -> Integer -> Complex

Sum types

A type like Complex, with exactly one constructor, is called a product type. But we can of course have types with more than one constructor and constructor arguments:

data Riemann = Complex Complex | Infinity

This declares a new type Riemann that can be built using one of these two constructors:

The constructor Complex, which takes one argument, of type Complex. Types and terms (including constructors) have different namespaces, so we can have a type called Complex, and a constructor called Complex, and they can be completely unrelated. This can be confusing, but is rather idiomatic.
The constructor Infinity takes no arguments, and simply is a value of type Riemann itself.

A data type that has more than one constructor is commonly called a sum type. Because data allows you to build types from sums and products, these types are called algebraic data types (ADTs).

Recursive data types

It it worth pointing out that it is completely fine to have a constructor argument of the type that we are currently defining. This way, we obtain a recursive data type, and this is the foundation for many important data structures, in particular lists and trees of various sorts. Here is a simple example, a binary tree with numbers on all internal nodes:

data Tree = Leaf | Node Integer Tree Tree

This can be read as “a value of type Tree is either a Leaf, or it is a Node that contains a value of type Integer and references to two subtrees.”

Polymorphic data types

The tree data type declared in the previous section ought to be useful not just for integers, but maybe for any type. But it would be seriously annoying to have to create a new tree data type for each type we want to store in the tree. Therefore, we have polymorphic data types. In the example of the tree, we can write:

data Tree a = Leaf | Node a (Tree a) (Tree a)

The a here is a type variable, because it occurs in the place of a type, but is lower-case. When we want to use this tree data type with a concrete type, say Integer, we simply write Tree Integer:

insert :: Integer -> Tree Integer -> Tree Integer

We can also write functions that work on polymorphic trees, i.e. which we can use on any Tree, no matter what type the values in the nodes are. A good example is:

size :: Tree a -> Integer
size Leaf = 0
size (Node _ t1 t2) = 1 + size t1 + size t2

Functions in data types

Maybe this is obvious to you, after the emphasis on functions earlier, but it is still worth pointing out that data type can also store functions. This blurs the distinction between data and code some more, as this nice example shows:

data Stream a b
    = NeedInput (a -> Stream a b)
    | HasOutput b (Stream a b)
    | Done

The type Stream a b models a state machine that consumes values of type a, produces values of type b, and maybe eventually stops. Such a machine is in one of three states:

Waiting for input. This uses the NeedInput constructor, which carries a function that consumes a type of value a and returns the new state of the machine.
Producing output. This uses the HasOutput constructor, which stores the output value of type b, and the subsequent state of the machine.
Done.

Predefined data types

We intentionally discussed the mechanisms of algebraic data types first, so that we can explain the most common data types in the standard library easily.

Booleans

The values True and False are simply the constructors of a data type defined as

data Bool = False | True

`Maybe`

A very common use case for algebraic data types is to capture the idea of a type whose values “maybe contain nothing, or just a value of type a”. Because this is so common, such a data type is predefined:

data Maybe a = Nothing | Just a

You might see Maybe, for example, in the return type of a function that deserializes a binary or textual representation of a type, for example:

parseBar :: String -> Maybe Bar

Such an operation can fail, and if the input is invalid, it would return Nothing. As a user of such a function, the only way to get to the value inside the Maybe is to pattern-match on the result, which forces you to think about and handle the case where the result is Nothing.

This is much more robust than the common idiom in C, where you have to remember to check for particular error values (-1, or NULL), or Go, where you get both a result and a separate error code, but you can still be lazy and use the result without checking the error code.

A big part of Haskell’s reputation as a language that makes it easier to write correct code relies on the use of data types to precisely describe the values you are dealing with.

`Either`

With maybe we can express “one or none”. Sometimes we want “one or another” type. For this, the standard library provides

data Either a b = Left a | Right b

Commonly, this type is used for computations that can fail, but that provide some useful error messages when they fail:

parseBar :: String -> Either ParseError Bar

This gives us the same robustness benefits of Maybe, but also a more helpful error messages. If used in this way, then the Left value is always used for the error or failure case, and the Right value for when everything went all right.

Tuples

Imagine you are writing a function that wants to return two numbers -- say, the last digit and the rest of the number. The way to do that that you know so far would require defining a data type:

data TwoIntegers = TwoIntegers Integer Integer
splitLastDigit :: Integer -> TwoIntegers
splitLastDigit n = TwoIntegers (n `div` 10) (n `mod` 10)

Clearly, the concept of “passing around two values together“ is not particularly tied to Integer, and we can use polymorphism to generalize this definition:

data Two a b = Two a b
splitLastDigit :: Integer -> Two Integer Integer
splitLastDigit n = Two (n `div` 10) (n `mod` 10)

And because this is so useful, Haskell comes with built-in support for such pairs, including a nice and slim syntax:

data (a,b) = (a,b)
splitLastDigit :: Integer -> (Integer, Integer)
splitLastDigit n = (n `div` 10, n `mod` 10)

Besides tuples, which store two values, there are triples, quadruples and, in general, n-tuples of any size you might encounter. But really, if these tuples get larger than two or three, the code starts to smell.

Their size is always fixed and statically known at compile time, and you can have values of different types as components of the tuple. This distinguishes them from the lists we will see shortly.

The unit type

There is also a zero-tuple, so to say: The unit type written () with only the value ():

data () = ()

While this does not look very useful yet, we will see that it plays a crucial role later when we encounter types such as IO ()

Lists

In the previous section we defined trees using a recursive data type. It should be obvious that we can define lists in a very analogous way:

data List a = Empty | Link a (List a)

This data structure is so ubiquitous in functional programming that it not only comes with the standard library, it also has very special, magic syntax:

data [a] = [] | a : [a]

In words: The type [a] is the type of lists with values of type a. Such a list is either the empty list, written as [], or it is a non-empty list containing of a head x of type a, and a tail xs, and is written as x:xs. Note that the constructor (:), called “cons”, is using operator syntax.

Lists are very useful for many applications, but they are not a particularly high-performance data structure -- random access and concatenation is expensive, and they use quite a bit of memory. Depending on the application, other types like arrays/vectors, finger trees, difference lists, maps or sets might be more suitable.

Characters and strings

Haskell has built-in support for characters and text. A single character has type Char, and is written in single quotes, e.g. 'a', '☃', '\'', '\0', '\xcafe'. These characters are Unicode code points, and not just 7 or 8 bit characters.

The built-in type String is just an alias for [Char], i.e. a list of characters. Haskell supports special built-in syntax for strings, using double quotes, but this is just syntactic sugar to the list syntax.

String also has the same performance issues as lists: While it is fine to use them in non-critical parts of the code (diagnostic and error messages, command line and configuration file parsing, filenames), String is usually the wrong choice if large amounts of strings need to be processed, e.g. in a templating library. Additionally libraries provide more suitable data structures, in particular ByteString for binary data and Text for human-readable text.

Records

Assume you want to create a type that represents an employee in a HR database. There are a fair number of fields to store -- name, date of birth, employee number, room, login handle, public key etc. You could use a tuple with many fields, or create your own data type with a constructor with many fields, but either way you will have to address the various fields by their position, which is verbose, easy to get wrong, and hard to extend.

In such a case, you can use records. These allow you to give names to the fields of a constructor.

data Employee = Employee
    { name :: String
    , room :: Integer
    , pubkey :: ByteString
    }

Newtypes

Sometimes you will see a type declaration that uses newtype instead of data:

newtype Baz = Baz Integer

For all purposes relevant to us so far you can mentally replace newtype with data. There are difference in memory representation (a newtype is “free” in some sense), but that is, at this level, irrelevant for us.

Type synonyms

Haskell allows you to introduce new names for existing types. One example is the type String, which is defined as

type String = [Char]

With this declaration, you can use String instead of [Char] in your type signatures. They are completely interchangeable, and a value of type String is still just a list of characters.

So type synonyms do not introduce any kind of type safety, they merely make types more readable.

Code structure small and large

The next big topic we need to learn is how programmers structure their code. This happens on multiple levels

within a file: functions, type signatures, and documentation fragments are arranged.
within a project (library, package): code is spread out in different files, and imported from other files.
between projects: packages are versioned, equipped with meta-data, and depend on each other.

Comments

Of course, Haskell supports comments. There are line comments and multi-line comments:

answer = 42 -- but what is the question?

{-
In the following code, we write a function that correctly tells
us whether a Turing machine halts:
-}
halts :: TuringMachine -> Bool
halts turing_machine = halts turing_machine

Importing other modules

Obviously, the point of having multiple files of Haskell code is to use the code from one in the other. This is achieved using import statements, which must come right after the module header, and before any declarations.

So if we have a file Target.hs with content

module Target where

who :: String
who = "world"

and another file Tropes.hs with content

module Tropes where

import Target

greeting :: String
greeting = "Hello " ++ who ++ "!"

then the use of who in greeting refers to the definition in the file Target.hs.

We could also write

greeting :: String
greeting = "Hello " ++ Target.who ++ "!"

and use the fully qualified name of who. This can be useful for disambiguation, or simply for clarity.

If we only ever intend to refer the things we import from a module by their qualified names, then we can use a qualified import:

import qualified Target

This does not bring any unqualified names into scope.

And if the module has a long name, we can shorten it:

import qualified Target as T

and write T.who. This is a common idiom for modules like Data.Text that export many names that would otherwise clash with names from the prelude.

The standard library, called base, comes with many modules you can import in addition to the Prelude module, which is always imported implicitly.

Import lists

If we do not want to import all names of another module, we can import just a specific selection, e.g.:

import Data.Maybe (mapMaybe)

This makes it easier for someone reading the code to locate where a certain function or data type is from, and it makes the code more robust against breakage when a new version of the other module starts exporting additional names. These would not silently override other names, but cause compiler errors about ambiguous names.

You can import types just as well, just include them in the list. To import constructors (which look like types), you have to list them after the type they belong to. So if we put our definitions of Complex and Riemann into a file Riemann.hs, namely

module Riemann where
data Complex = C Integer Integer
data Riemann = Complex Complex | Infinity

then you can import everything using

import Riemann (Complex(C), Riemann(Complex, Infinity))

or, shorter,

import Riemann (Complex(..), Riemann(..))

Export lists and abstract types

You can not only restrict what you import, but also what you export. To do so, you list the names of functions, types, etc. that you want to export after the module name:

module Riemann (Complex, Riemann(..)) where

A short export list is a great help when trying to understand the role and purpose of a module: If it only exports one or a small number of functions, it is clear that these are the (only) entry points to the code, and that all other declarations are purely internal, and may be refactored without affecting anything else.

By excluding the constructors of a data type from the export list, as we did in this example with the Complex type, we can make this type abstract: Users of our module now have no knowledge of the internal structure of Complex, and they are unable to create or arbitrarily inspect values of type Complex. Instead, they are only able to do so using the other functions that we export along with Complex. This way we can ensure certain invariant in our types – think of a search tree with the invariant that it is sorted – or reserve the ability to change the shape of the type without breaking depending code.

Proper user of abstract types greatly helps to make code more readable, more maintainable and more robust.

Haskell packages

Zooming out some more, we come across packages: A package is a collection of modules that are bundled under a single package name. A package contains meta-data (name, version number, author, license...). Packages declare which other packages they depend upon, together with version ranges. All this meta-data can be found in the Cabal file called foo.cabal in the root directory of the project.

Almost all publicly available Haskell packages are hosted centrally on Hackage, including the haddock-generated documentation and cross-linked source code. They can be easily installed using the cabal tool, or alternative systems like stack or nix. The packages on Hackage cover many common needs and it is expected that a serious Haskell project depends on dozen of Haskell packages from Hackage.

Imperative (looking) Haskell

Beware: This chapter is full of half-truths and glossing over technical details. Imagine plenty of “it looks as if” and “one can think of this as” sprinkled throughout it. Nevertheless, it is useful to get you started.

`IO`-functions

Haskell functions are pure functions in the mathematical sense: Given some input, they calculate some output, but nothing else can happen, and nothing besides the arguments can influence the result. This is great, but how can Haskell programs then write to files, or respond to network requests, or come up with random numbers?

The solution are IO-functions. These functions can be executed, and when such a function is executed, it can do all these nasty things, before returning a value. Here is a selection of IO-functions available by default:

getLine :: IO String
putStrLn :: String -> IO ()

readFile :: FilePath -> IO String
writeFile :: FilePath -> String -> IO ()

You can see that these functions may have arguments, just as normal functions. The important bit is the return type, which is IO Something. This indicates that these functions can be executed, and that they have to be executed before we get our hands on the result.

Not all functions have an interesting result (e.g. putStrLn does not); this is where the unit type comes in handy.

The M word

Sometimes the types of functions will look similar to the IO-functions, but using some other word than "IO".

sendMessage :: String -> Net ()
receiveMessage :: Net String

Much as readFile has a return type of IO String, receiveMessage has Net String, and this is is a good hint that it also can be executed and when executed it probably does some impure IO under the hood before returning a value.

IO and Net are what is known as monads, but you only need to worry about that when digging into their implementation.

`do` notation

Often code using IO or other monads uses a special syntax, called do-notation, that allows writing code in an imperative style. Here is an example:

main :: IO ()
main = do
    putStrLn "Which file do you want to copy?"
    from <- getLine
    putStrLn "Where do you want to copy it to?"
    to <- getLine
    content <- readFile from
    putStrLn ("Read " ++ show (length content) ++ " bytes.")
    writeFile to content
    putStrLn "Done copying."

We can compile and run this program, and it indeed copies a file:

$ ghc --make copy.hs
[1 of 1] Compiling Main             ( copy.hs, copy.o )
Linking copy ...
$ ./copy
Which file do you want to copy?
copy.hs
Where do you want to copy it to?
copy2.hs
Read 287 bytes.
Done copying.
$ diff copy.hs copy2.hs

Looking at the code, it doesn’t look much different than the equivalent in a language like C or Python. Things to notice:

The main function of the module is special. Just like in C, it is the entry point for a compiled Haskell program. When we run the program, then the main function is executed. This is the only way to start executing IO-functions -- we cannot do that just nilly-willy within other code.
The body of the main function is written as a do block, which clearly signposts the imperative nature of this code: It is a sequence of things to do.
Every line below the do block is one execution of an IO-function. The first one, for example, prints a question on the terminal.
The main function has type IO (). So it is one of these IO-functions as well.

Type classes

The language features we have seen so far can be found, with slight variations, in most functional programming languages. In this chapter, we will look at a feature that Haskell is particularly renowned for: Type classes.

Let me start by pointing out what type classes are not: They are not classes as we know them from object oriented programming, so please do not try to attempt to understand them by analogy to that.

We've already seen the Eq type class used in a constraint on a polymorphic function type:

genericPrompt :: Eq t => (String -> Maybe t) -> t -> IO t

The other parts of syntax that you will see are declaring type classes, and adding instances to type classes.

The Eq type class is declared in the Prelude with this code:

class Eq a where
    (==) :: a -> a -> Bool

That says that == is an operator name that can be overloaded, by declaring a class with it as a method. With that defined, it's allowed to use == with every type that is an instance of Eq.

Here is code that declares instances of Eq for Complex and Riemann:

instance Eq Complex where
    C x1 y1 == C x2 y2 = x1 == x2 && y1 == y2

instance Eq Riemann where
    Complex c1 == Complex c2 = c1 == c2
    Infinity == Infinity = True
    _ == _ = False

Don't worry about the implementation details, but notice that this defines different == functions for the different types.

Often instances of common type classes are derived as part of the definition of a data type.

data Baz = Baz Integer
    deriving (Eq, Ord)

The deriving causes the Haskell compiler to automatically generate the necessary code to make Eq and Ord instances for the Baz data type.

Modules will often declare type classes of their own, as well as making data types that are instances of many other type classes.