This feed contains some of my blog entries that link to software code that I've developed.

shell monad day 3

I have been hard at work on the shell-monad ever since it was born on Christmas Eve. It's now up to 820 lines of code, and has nearly comprehensive coverage of all shell features.

Time to use it for something interesting! Let's make a shell script and a haskell program that both speak a simple protocol. This kind of thing could be used by propellor when it's deploying itself to a new host. The haskell program can ssh to a remote host and run the shell program, and talk back and forth over stdio with it, using the protocol they both speak.

abstract beginnings

First, we'll write a data type for the commands in the protocol.

data Proto
    = Foo String
    | Bar
    | Baz Integer
    deriving (Show)

Now, let's go type class crazy!

class Monad t => OutputsProto t where
    output :: Proto -> t ()

instance OutputsProto IO where
    output = putStrLn . fromProto

So far, nothing interesting; this makes the IO monad an instance of the OutputsProto type class, and gives a simple implementation to output a line of the protocol.

instance OutputsProto Script where
    output = cmd "echo" . fromProto

Now it gets interesting. The Script monad is now also a member of the OutputsProto type class. To output a line of the protocol, it just uses echo. Yeah -- shell code is a member of a haskell type class. Awesome -- most abstract shell code evar!
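
To make the idea concrete without needing shell-monad installed, here is a minimal sketch using only base. ToyScript is a hypothetical stand-in for the real Script monad (it merely collects lines of shell text), and shellQuote, greet and generated are names invented for this example; none of it is how shell-monad is actually implemented.

```haskell
-- A toy model: ToyScript is NOT shell-monad's Script, only a stand-in
-- that collects the lines of shell code a computation would emit.
data Proto = Foo String | Bar | Baz Integer

fromProto :: Proto -> String
fromProto (Foo s) = "FOO " ++ s
fromProto Bar     = "BAR "
fromProto (Baz i) = "BAZ " ++ show i

newtype ToyScript a = ToyScript ([String], a)

instance Functor ToyScript where
    fmap f (ToyScript (ls, a)) = ToyScript (ls, f a)

instance Applicative ToyScript where
    pure a = ToyScript ([], a)
    ToyScript (l1, f) <*> ToyScript (l2, a) = ToyScript (l1 ++ l2, f a)

instance Monad ToyScript where
    ToyScript (l1, a) >>= k =
        let ToyScript (l2, b) = k a
        in ToyScript (l1 ++ l2, b)

-- Hypothetical helper: single-quote a string for the shell.
shellQuote :: String -> String
shellQuote s = "'" ++ concatMap esc s ++ "'"
  where
    esc '\'' = "'\"'\"'"
    esc c    = [c]

class Monad t => OutputsProto t where
    output :: Proto -> t ()

-- In IO, outputting means printing the line right now.
instance OutputsProto IO where
    output = putStrLn . fromProto

-- In ToyScript, outputting means emitting shell code that prints it later.
instance OutputsProto ToyScript where
    output p = ToyScript (["echo " ++ shellQuote (fromProto p)], ())

-- One definition, usable in either monad.
greet :: OutputsProto t => t ()
greet = output (Foo "starting up") >> output Bar

-- The shell lines a ToyScript computation would generate.
generated :: ToyScript a -> [String]
generated (ToyScript (ls, _)) = ls
```

The point of the sketch is greet: it is written once against the type class, and the choice of monad decides whether it talks the protocol directly or produces shell code that will.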

Similarly, we can add another type class for monads that can input the protocol:

class Monad t => InputsProto t p where
    input :: t p

instance InputsProto IO Proto where
    input = toProto <$> getLine

instance InputsProto Script Var where
    input = do
        v <- newVar ()
        readVar v
        return v

While the IO version reads and deserializes a line back to a Proto, the shell script version of this returns a Var, which has the newly read line in it, not yet deserialized. Why the difference? Well, Haskell has data types, and shell does not ...

speaking the protocol

Now we have enough groundwork to write haskell code in the IO monad that speaks the protocol in arbitrary ways. For example:

protoExchangeIO :: Proto -> IO Proto
protoExchangeIO p = do
    output p
    input

fooIO :: IO ()
fooIO = do
    resp <- protoExchangeIO (Foo "starting up")
    -- etc

But that's trivial and uninteresting. Anyone who has read to here certainly knows how to write haskell code in the IO monad. The interesting part is making the shell program speak the protocol, including doing various things when it receives the commands.

foo :: Script ()
foo = do
    stopOnFailure True
    handler <- func (NamedLike "handler") $
        handleProto =<< input
    output (Foo "starting up")
    handler
    output Bar
    handler

handleFoo :: Var -> Script ()
handleFoo v = toStderr $ cmd "echo" "yay, I got a Foo" v

handleBar :: Script ()
handleBar = toStderr $ cmd "echo" "yay, I got a Bar"

handleBaz :: Var -> Script ()
handleBaz num = forCmd (cmd "seq" (Val (1 :: Int)) num) $
    toStderr . cmd "echo" "yay, I got a Baz"


I've left out a few serialization functions. fromProto is used in both instances of OutputsProto. The haskell program and the script will both use this to serialize Proto.

fromProto :: Proto -> String
fromProto (Foo s) = pFOO ++ " " ++ s
fromProto Bar = pBAR ++ " "
fromProto (Baz i) = pBAZ ++ " " ++ show i

pFOO, pBAR, pBAZ :: String
(pFOO, pBAR, pBAZ) = ("FOO", "BAR", "BAZ")

And here's the haskell function to convert the other direction, which was also used earlier.

toProto :: String -> Proto
toProto s = case break (== ' ') s of
    (w, ' ':rest)
        | w == pFOO -> Foo rest
        | w == pBAR && null rest -> Bar
        | w == pBAZ -> Baz (read rest)
        | otherwise -> error $ "unknown protocol command: " ++ w
    (_, _) -> error "protocol splitting error"
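
A quick way to convince yourself that the two directions agree is a round-trip check. The sketch below is a self-contained copy of the functions above, with Eq added to Proto so values can be compared; roundTrips and samples are hypothetical names used only for this example.

```haskell
-- Self-contained copy of the serialization functions, with Eq added to
-- Proto (an assumption for this example) so values can be compared.
data Proto
    = Foo String
    | Bar
    | Baz Integer
    deriving (Show, Eq)

fromProto :: Proto -> String
fromProto (Foo s) = "FOO " ++ s
fromProto Bar     = "BAR "
fromProto (Baz i) = "BAZ " ++ show i

toProto :: String -> Proto
toProto s = case break (== ' ') s of
    (w, ' ':rest)
        | w == "FOO" -> Foo rest
        | w == "BAR" && null rest -> Bar
        | w == "BAZ" -> Baz (read rest)
        | otherwise -> error $ "unknown protocol command: " ++ w
    (_, _) -> error "protocol splitting error"

-- Hypothetical helper: does a value survive serialize-then-parse?
roundTrips :: Proto -> Bool
roundTrips p = toProto (fromProto p) == p

-- A few values to try, including the empty Foo and a negative Baz.
samples :: [Proto]
samples = [Foo "starting up", Foo "", Bar, Baz 42, Baz (-7)]
```

Evaluating all roundTrips samples in ghci exercises every constructor. It also shows why fromProto emits a trailing space for Bar: toProto only accepts lines that contain a space after the command word, so a bare "BAR" would land in the "protocol splitting error" case.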

We also need a version of that written in the Script monad. Here it is. Compare and contrast the function below with the one above. They're really quite similar. (Sadly, not similar enough to allow refactoring out a common function..)

handleProto :: Var -> Script ()
handleProto v = do
    w <- getProtoCommand v
    let rest = getProtoRest v
    caseOf w
        [ (quote (T.pack pFOO), handleFoo =<< rest)
        , (quote (T.pack pBAR), handleBar)
        , (quote (T.pack pBAZ), handleBaz =<< rest)
        , (glob "*", do
            toStderr $ cmd "echo" "unknown protocol command" w
            cmd "false"
          )
        ]

Both toProto and handleProto split the incoming line apart into the first word and the rest of the line, then match the first word against the commands in the protocol, and dispatch to appropriate actions. So, how do we split a variable apart like that in the Script monad? Like this...

getProtoCommand :: Var -> Script Var
getProtoCommand v = trimVar LongestMatch FromEnd v (glob " *")

getProtoRest :: Var -> Script Var
getProtoRest v = trimVar ShortestMatch FromBeginning v (glob "[! ]*[ ]")

(This could probably be improved by using a DSL to generate the globs too..)
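
For readers who don't have shell parameter expansion memorized, here is what those two trims compute, written as plain Haskell on strings. This is only a sketch of the semantics of "${v%% *}" and "${v#[! ]*[ ]}" as they appear in the generated script; protoCommand and protoRest are hypothetical names, not part of shell-monad.

```haskell
-- What "${v%% *}" computes: drop the longest suffix matching " *",
-- i.e. everything from the first space onward.
protoCommand :: String -> String
protoCommand = takeWhile (/= ' ')

-- What "${v#[! ]*[ ]}" computes: drop the shortest prefix made of one
-- non-space character, then anything, then a space. If the pattern does
-- not match, the shell leaves the value unchanged, and so does this.
protoRest :: String -> String
protoRest s = case s of
    (c:cs)
        | c /= ' '
        , (_, ' ':rest) <- break (== ' ') cs -> rest
    _ -> s
```

So for the line "BAZ 3", protoCommand gives the command word and protoRest gives the argument, which is the same split that break (== ' ') performs in toProto.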


And finally, here's a main to generate the shell script!

main :: IO ()
main = T.writeFile "" $ script foo

The pretty-printed shell script that produces is not very interesting, but I'll include it at the end for completeness. More interestingly for the purposes of sshing to a host and running the command there, we can use linearScript to generate a version of the script that's all contained on a single line. Also included below.

I could easily have written the pretty-printed version of the shell script in twice the time that it took to write the haskell program that generates it and also speaks the protocol itself.

I would certainly have had to test the hand-written shell script repeatedly. Code like for _x in $(seq 1 "${_v#[!\ ]*[\ ]}") doesn't just write and debug itself. (Until now!)

But, the generated script worked 100% on the first try! Well, it worked as soon as I got the Haskell program to compile...

But the best part is that the Haskell program and the shell script don't just speak the same protocol. They both rely on the same definition of Proto. So this is fairly close to the kind of type-safe protocol serialization that Fay provides, when compiling Haskell to javascript.

I'm getting the feeling that I won't be writing too many nontrivial shell scripts by hand anymore! :)

the complete haskell program

Is here, all 99 lines of it.

the pretty-printed shell program

set -x
_handler () { :
    _v=
    read _v
    case "${_v%%\ *}" in FOO) :
        echo 'yay, I got a Foo' "${_v#[!\ ]*[\ ]}" >&2
    : ;; BAR) :
        echo 'yay, I got a Bar' >&2
    : ;; BAZ) :
        for _x in $(seq 1 "${_v#[!\ ]*[\ ]}")
        do :
            echo 'yay, I got a Baz' "$_x" >&2
        done
    : ;; *) :
        echo 'unknown protocol command' "${_v%%\ *}" >&2
        false
    : ;; esac
}
echo 'FOO starting up'
_handler
echo 'BAR '
_handler

the one-liner shell program

set -p; _handler () { :;    _v=;    read _v;    case "${_v%%\ *}" in FOO) :;        echo 'yay, I got a Foo' "${_v#[!\ ]*[\ ]}" >&2;     : ;; BAR) :;        echo 'yay, I got a Bar' >&2;    : ;; BAZ) :;        for _x in $(seq 1 "${_v#[!\ ]*[\ ]}");      do :;           echo 'yay, I got a Baz' "$_x" >&2;      done;   : ;; *) :;      echo 'unknown protocol command' "${_v%%\ *}" >&2;       false;  : ;; esac; }; echo 'FOO starting up'; _handler; echo 'BAR '; _handler

generating shell scripts from haskell using a shell monad

Shell script is the lingua franca of Unix; it's available everywhere and is often the only reasonable choice to Get Stuff Done. But it's also clumsy, and it's easy to write unsafe shell scripts that forget to quote variables, typo names of functions, etc.

Wouldn't it be nice if we could write code in some better language, that generated nicely formed shell scripts and avoided such gotchas? Today, I've built a Haskell monad that can generate shell code.

Here's a fairly involved example. This demonstrates several features, including the variadic cmd, the ability to define shell functions, to bind and use shell variables, to build pipes (with the -|- operator), and to factor out generally useful haskell functions like pipeLess and promptFor ...

santa = script $ do
    hohoho <- func $
        cmd "echo" "Ho, ho, ho!" "Merry xmas!"

    promptFor "What's your name?" $ \name -> pipeLess $ do
        cmd "echo" "Let's see what's in" (val name <> quote "'s") "stocking!"
        forCmd (cmd "ls" "-1" (quote "/home/" <> val name)) $ \f -> do
            cmd "echo" "a shiny new" f

    cmd "rm" "/table/cookies" "/table/milk"

pipeLess :: Script () -> Script ()
pipeLess c = c -|- cmd "less"

promptFor :: T.Text -> (Var -> Script ()) -> Script ()
promptFor prompt cont = do
    cmd "printf" (prompt <> " ")
    var <- newVar "prompt"
    readVar var
    cont var

When run, that haskell program generates this shell code. Which, while machine-generated, has nice indentation, and is generally pretty readable.

f1 () { :
    echo 'Ho, ho, ho!' 'Merry xmas!'
}
printf 'What'"'"'s your name?  '
read '_prompt1'
(
    echo 'Let'"'"'s see what'"'"'s in' "$_prompt1"''"'"'s' 'stocking!'
    for _x1 in $(ls '-1' '/home/'"$_prompt1")
    do :
        echo 'a shiny new' "$_x1"
    done
) | (
    less
)
rm '/table/cookies' '/table/milk'

Santa has already uploaded shell-monad to hackage and git.

There's a lot of things that could be added to this library (if, while, redirection, etc), but I can already see using it in various parts of propellor and git-annex that need to generate shell code.

propellor is d-i 2.0

I think I've been writing the second system to replace d-i with in my spare time for a couple months, and never noticed.

I'm as surprised as you are, but consider this design:

  • Installation system consists of debian live + haskell + propellor + web browser.

  • Entire installation UI consists of a web-based (and entirely pictographic and prompt based, so does not need to be translated) selection of the installation target.

  • Installation target can be local disk, remote system via ssh (wiping out crufty hacked-up pre-installed debian), local VM, live ISO, etc.

  • Really, no other questions. Not even user name/password! The installed system will only allow login via the same method that was used to install it. So a locally installed system will accept console/X login with no password and then a forced password change. Or a system installed via ssh will only allow login using the same ssh key that was used to install it.

  • The entire installation process consists of a disk format, followed by debootstrap, followed by running propellor in the target system. This also means that the installed system includes a propellor config file which now describes the properties of the system as installed (so can be edited to tweak the installation, or reused as starting point for next installation).

  • Users who want to configure installation in any way write down properties of system using a simple propellor config file. I suppose some people still use more than one partition or gnome or some such customization, so they'd use:

main :: IO ()
main = Installer.main
    & Installer.partition First "/boot" Ext3 (MiB 256)
    & Installer.partition Next "/" Ext4 (GiB 5)
    & Installer.partition Next "/home" Ext4 FreeSpace
    & Installer.grubBoots "hd0"
    & os (System (Debian Stable) "amd64")
    & Apt.stdSourcesList
    & Apt.installed ["task-gnome-desktop"]

  • The installation system is itself built using propellor. A free feature given the above design, so basically all it will take to build an installation iso is this code:

main :: IO ()
main = Installer.main
    & CdImage "installer.iso"
    & os (System (Debian Stable) "amd64")
    & Apt.stdSourcesList
    & Apt.installed ["task-xfce-desktop", "ghc", "propellor"]
    & User.autoLogin "root"
    & User.loginStarts "propellor --installer"

  • Propellor has a nice display of what it's doing so there is no freaking progress bar.

Well, now I know where propellor might end up if I felt like spending a month and adding a few thousand lines of code to it.

using a debian package as the remote for a local config repo

Today I did something interesting with the Debian packaging for propellor, which seems like it could be a useful technique for other Debian packages as well.

Propellor is configured by a directory, which is maintained as a local git repository. In propellor's case, it's ~/.propellor/. This contains a lot of haskell files, in fact the entire source code of propellor! That's really unusual, but I think this can be generalized to any package whose configuration is maintained in its own git repository on the user's system. From now on, I'll refer to this as the config repo.

The config repo is set up the first time a user runs propellor. But, until now, I didn't provide an easy way to update the config repo when the propellor package was updated. Nothing would break, but the old version would be used until the user updated it themselves somehow (probably by pulling from a git repository over the network, bypassing apt's signature validation).

So, what I wanted was a way to update the config repo, merging in any changes from the new version of the Debian package, while preserving the user's local modifications. Ideally, the user could just run git merge upstream/master, where the upstream repo was included in the Debian package.

But, that can't work! The Debian package can't reasonably include the full git repository of propellor with all its history. So, any git repository included in the Debian binary package would need to be a synthetic one, that only contains probably one commit that is not connected to anything else. Which means that if the config repo was cloned from that repo in version 1.0, then when version 1.1 came around, git would see no common parent when merging 1.1 into the config repo, and the merge would fail horribly.

To solve this, let's assume that the config repo's master branch has a parent commit that can be identified, somehow, as coming from a past version of the Debian package. It doesn't matter which version, although the last one merged with will be best. (The easy way to do this is to set refs/heads/upstream/master to point to it when creating the config repo.)

Once we have that parent commit, we have three things:

  1. The current content of the config repo.
  2. The content from some old version of the Debian package.
  3. The new content of the Debian package.

Now git can be used to merge #3 onto #2, with -Xtheirs, so the result is a git commit with parents of #3 and #2, and content of #3. (This can be done using a temporary clone of the config repo to avoid touching its contents.)

Such a git commit can be merged into the config repo, without any conflicts other than those the user might have caused with their own edits.

So, propellor will tell the user when updates are available, and they can simply run git merge upstream/master to get them. The resulting history looks like this:

* Merge remote-tracking branch 'upstream/master'
| * merging upstream version
| |\  
| | * upstream version
* | user change
* upstream version
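
Why the extra merge commit matters can be seen in a toy model of the commit graph. The sketch below is not git and not propellor code; it just represents history as commits with parent lists (all names are made up for illustration) and asks whether two commits share an ancestor, which is what git needs for a three-way merge.

```haskell
import Data.List (intersect, nub)

type Commit = String

-- A history is a list of (commit, parents) pairs.
type History = [(Commit, [Commit])]

parentsOf :: History -> Commit -> [Commit]
parentsOf h c = maybe [] id (lookup c h)

-- A commit together with everything reachable from it.
ancestors :: History -> Commit -> [Commit]
ancestors h c = nub (c : concatMap (ancestors h) (parentsOf h c))

-- Commits reachable from both sides of a proposed merge.
commonAncestors :: History -> Commit -> Commit -> [Commit]
commonAncestors h a b = ancestors h a `intersect` ancestors h b

-- Hypothetical history mirroring the three pieces of content above.
configRepo :: History
configRepo =
    [ ("pkg-1.0", [])                      -- old package version (#2)
    , ("user-change", ["pkg-1.0"])         -- config repo master (#1)
    , ("pkg-1.1", [])                      -- new package version (#3)
    , ("bridge", ["pkg-1.0", "pkg-1.1"])   -- merge of #3 onto #2
    ]
```

In this model, commonAncestors configRepo "user-change" "pkg-1.1" is empty, which is the "no common parent" failure described earlier, while merging "bridge" instead finds "pkg-1.0" as the shared ancestor.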

So, generalizing this, if a package has a lot of config files, and creates a git repository containing them when the user uses it (or automatically when it's installed), this method can be used to provide an easily mergeable branch that tracks the files as distributed with the package.

It would perhaps not be hard to get from here to a full git-backed version of ucf. Note that the Debian binary package doesn't have to ship a git repository, it can just as easily ship the current version of the config files somewhere in /usr, and check them into a new empty repository as part of the generation of the upstream/master branch.

propellor-driven DNS and backups

Took a while to get here, but Propellor 0.4.0 can deploy DNS servers and I just had it deploy mine. Including generating DNS zone files.

Configuration is dead simple, as far as DNS goes:

        & alias ""
        & Dns.secondary hosts ""
        & Dns.primary hosts ""
            (Dns.mkSOA "" 100)
            [ (RootDomain, NS $ AbsDomain "")
            , (RootDomain, NS $ AbsDomain "")
            ]

The awesome thing is that propellor fills in all the other information in the zone file by looking at the properties of the hosts it knows about.

 , host ""
        & ipv4 ""
        & ipv6 "fe80::26fd:52ff:feea:2294"

        & alias ""
        & alias ""
        & alias ""
        & Docker.docked hosts "webserver"
            `requires` backedup "/var/www"
        & alias ""
        & Dns.secondary hosts ""

When it sees this host, Propellor adds its IP addresses to the DNS zone file, for both its main hostname (""), and also its relevant aliases. (The .museum alias would go into a different zone file.)

Multiple hosts can define the same alias, and then you automatically get round-robin DNS.

The web server part of the config can be cut and pasted to another host in order to move its web server to the other host, including updating the DNS. That's really all there is to it, just cut, paste, and commit!

I'm quite happy with how that worked out. And curious if Puppet etc have anything similar.

One tricky part of this was how to ensure that the serial number automatically updates when changes are made. The way this is handled is Propellor starts with a base serial number (100 in the example above), and then it adds to it the number of commits in its git repository. The zone file is only updated when something in it besides the serial number needs to change.

The result is nice small serial numbers that don't risk overflowing the (so 90's) 32 bit limit, and will be consistent even if the configuration had Propellor setting up multiple independent master DNS servers for the same domain.
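
The arithmetic is simple enough to sketch. The names below (zoneSerial, Zone, updateZone) are hypothetical and this is not Propellor's code; it only models the two rules just described, with the commit count passed in as a plain number.

```haskell
import Data.Word (Word32)

-- Serial = base serial from the config + number of commits in the repo.
-- Word32 mirrors the 32 bit limit of DNS serial numbers.
zoneSerial :: Word32 -> Int -> Word32
zoneSerial base commits = base + fromIntegral commits

-- A toy zone: a serial plus the records, kept as opaque strings here.
data Zone = Zone
    { zSerial  :: Word32
    , zRecords :: [String]
    } deriving (Show, Eq)

-- Keep the existing zone (and its serial) unless the records changed.
updateZone :: Zone -> Zone -> Zone
updateZone old new
    | zRecords old == zRecords new = old
    | otherwise = new
```

Because the serial depends only on the base number and the commit count, two servers that run from the same configuration at the same commit compute the same value without coordinating.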

Another recent feature in Propellor is that it can use Obnam to back up a directory. With the awesome feature that if the backed up directory is empty/missing, Propellor will automatically restore it from the backup.

Here's how the backedup property used in the example above might be implemented:

backedup :: FilePath -> Property
backedup dir = Obnam.backup dir daily
    [ "--repository=s"
    ] Obnam.OnlyClient
    `requires` Ssh.keyImported SshRsa "root"
    `requires` Ssh.knownHost hosts "" "root"
    `requires` Gpg.keyImported "1B169BE1" "root"

Notice that the Ssh.knownHost makes root trust the ssh host key belonging to the backup host. So Propellor needs to be told what that host key is, like so:

 , host ""
        & ipv4 ""
        & sshPubKey "ssh-rsa blahblahblah"

Which of course ties back into the DNS and gets this hostname set in it. But also, the ssh public key is available for this host and visible to the DNS zone file generator, and that could also be set in the DNS, in a SSHFP record. I haven't gotten around to implementing that, but hope at some point to make Propellor support DNSSEC, and then this will all combine even more nicely.

By the way, Propellor is now up to 3 thousand lines of code (not including Utility library). In 20 days, as a 10% time side project.

propellor introspection for DNS

In the just released Propellor 0.3.0, I've improved Propellor's config file DSL significantly. Now properties can set attributes of a host, that can be looked up by its other properties, using a Reader monad.

This saves needing to repeat yourself:

hosts = [ host ""
        & stdSourcesList Unstable
        & Hostname.sane -- uses hostname from above

And it simplifies docker setup, with no longer a need to differentiate between properties that configure docker vs properties of the container:

 -- A generic webserver in a Docker container.
    , Docker.container "webserver" "joeyh/debian-unstable"
        & Docker.publish "80:80"
        & Docker.volume "/var/www:/var/www"
        & Apt.serviceInstalledRunning "apache2"

But the really useful thing is, it allows automating DNS zone file creation, using attributes of hosts that are set and used alongside their other properties:

hosts =
    [ host ""
        & ipv4 ""

        & cname ""
        & Docker.docked hosts "openid-provider"

        & cname ""
        & Docker.docked hosts "ancient-kitenet"
    , host ""
        & Dns.primary "" hosts

Notice that hosts is passed into Dns.primary, inside the definition of hosts! Tying the knot like this is a fun haskell laziness trick. :)
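
If the self-reference looks like it should loop forever, this stripped-down sketch may help. It uses toy types rather than Propellor's (Host, plainHost and dnsPrimary here are all hypothetical), but the shape is the same: one element of the list is built from the whole list.

```haskell
data Host = Host
    { hostName :: String
    , hostInfo :: [String]   -- computed from the whole host list
    }

plainHost :: String -> Host
plainHost name = Host name []

-- Hypothetical: a host whose info mentions every host in the list it
-- is given, including hosts defined after it.
dnsPrimary :: String -> [Host] -> Host
dnsPrimary name allHosts = Host name (map hostName allHosts)

-- hosts refers to itself. Laziness makes this fine, because building
-- the list never forces hostInfo; it is only computed on demand.
hosts :: [Host]
hosts =
    [ plainHost "web"
    , dnsPrimary "ns" hosts
    ]
```

Asking for hostInfo (hosts !! 1) walks the finished list and collects the names of both hosts. It would only diverge if computing a host's name required looking at the info derived from the list.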

Now I just need to write a little function to look over the hosts and generate a zone file from their hostname, cname, and address attributes:

extractZoneFile :: Domain -> [Host] -> ZoneFile
extractZoneFile = gen . map hostAttr
  where gen = -- TODO

The eventual plan is that the cname property won't be defined as a property of the host, but of the container running inside it. Then I'll be able to cut-n-paste move docker containers between hosts, or duplicate the same container onto several hosts to deal with load, and propellor will provision them, and update the zone file appropriately.

Also, Chris Webber had suggested that Propellor be able to separate values from properties, so that eg, a web wizard could configure the values easily. I think this gets it much of the way there. All that's left to do is two easy functions:

overrideAttrsFromJSON :: Host -> JSON -> Host

exportJSONAttrs :: Host -> JSON

With these, propellor's configuration could be adjusted at run time using JSON from a file or other source. For example, here's a containerized webserver that publishes a directory from the external host, as configured by JSON that it exports:

demo :: Host
demo = Docker.container "webserver" "joeyh/debian-unstable"
    & Docker.publish "80:80"
    & dir_to_publish "/home/mywebsite" -- dummy default
    & Docker.volume (getAttr dir_to_publish ++":/var/www")
    & Apt.serviceInstalledRunning "apache2"

main = do
    json <- readJSON "my.json"
    let demo' = overrideAttrsFromJSON demo json
    writeJSON "my.json" (exportJSONAttrs demo')
    defaultMain [demo']

propellor type-safe reversions

Propellor ensures that a list of properties about a system are satisfied. But requirements change, and so you might want to revert a property that had been set up before.

For example, I had a system with a webserver container:

Docker.docked container hostname "webserver"

I don't want a web server there any more. Rather than having a separate property to stop it, wouldn't it be nice to be able to say:

revert (Docker.docked container hostname "webserver")

I've now gotten this working. The really fun part is, some properties support reversion, but other properties certainly do not. Maybe the code to revert them is not worth writing, or maybe the property does something that cannot be reverted.

For example, Docker.garbageCollected is a property that makes sure there are no unused docker images wasting disk space. It can't be reverted. Nor can my personal standardSystem Unstable property, which among other things upgrades the system to unstable and sets up my home directory..

I found a way to make Propellor statically check if a property can be reverted at compile time. So revert Docker.garbageCollected will fail to type check!

The tricky part about implementing this is that the user configures Propellor with a list of properties. But now there are two distinct types of properties, revertable ones and non-revertable ones. And Haskell does not support heterogeneous lists..

My solution to this is a typeclass and some syntactic sugar operators. To build a list of properties, with individual elements that might be revertable, and others not:

        & standardSystem Unstable
        & revert (Docker.docked container hostname "webserver")
        & Docker.docked container hostname "amd64-git-annex-builder"
        & Docker.garbageCollected
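
Here is a minimal sketch of how a type class plus an operator can give that effect. It is a toy reconstruction under my own assumptions, not Propellor's actual definitions: properties are just descriptions, and docked and garbageCollected are stand-ins with the same flavour as the real ones.

```haskell
-- Toy types: a Property is just a description here.
newtype Property = Property String

-- A revertable property carries both how to set up and how to undo.
data RevertableProperty = RevertableProperty Property Property

-- Only revertable properties can be reverted. Swapping the two halves
-- means reverting twice gives back the original.
revert :: RevertableProperty -> RevertableProperty
revert (RevertableProperty setup undo) = RevertableProperty undo setup

-- Anything that can be turned into a plain Property may go in the list.
class IsProp p where
    toProp :: p -> Property

instance IsProp Property where
    toProp = id

instance IsProp RevertableProperty where
    toProp (RevertableProperty setup _) = setup

-- Sugar for building a homogeneous [Property] from mixed inputs.
(&) :: IsProp p => [Property] -> p -> [Property]
ps & p = ps ++ [toProp p]
infixl 1 &

garbageCollected :: Property
garbageCollected = Property "garbage collected"

docked :: String -> RevertableProperty
docked n = RevertableProperty (Property ("docked " ++ n)) (Property ("undocked " ++ n))

props :: [Property]
props = []
    & garbageCollected
    & revert (docked "webserver")
    & docked "builder"
    -- & revert garbageCollected   -- would be rejected by the type checker

describe :: [Property] -> [String]
describe ps = [d | Property d <- ps]
```

The list itself stays an ordinary [Property]; the two kinds of property only differ up to the point where & converts them, which is early enough for revert to be checked at compile time.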

adding docker support to propellor

Propellor development is churning away! (And leaving no few puns in its wake..)

Now it supports secure handling of private data like passwords (only the host that owns it can see it), and fully end-to-end secured deployment via gpg signed and verified commits.

And, I've just gotten support for Docker to build. It probably doesn't quite work yet, but it should only be a few bugs away at this point.

Here's how to deploy a dockerized webserver with propellor:

host hostname@"" = Just
    [ Docker.configured
    , File.dirExists "/var/www"
    , Docker.hasContainer hostname "webserver" container
    ]

container _ "webserver" = Just $ Docker.containerFromImage "joeyh/debian-unstable"
        [ Docker.publish "80:80"
        , Docker.volume "/var/www:/var/www"
        , Docker.inside
            [ serviceRunning "apache2"
                `requires` Apt.installed ["apache2"]
            ]
        ]

Docker containers are set up using Properties too, just like regular hosts, but their Properties are run inside the container.

That means that, if I change the web server port above, Propellor will notice the container config is out of date, and stop the container, commit an image based on it, and quickly use that to bring up a new container with the new configuration.

If I change the web server to say, lighttpd, Propellor will run inside the container, and notice that it needs to install lighttpd to satisfy the new property, and so will update the container without needing to take it down.

Adding all this behavior took only 253 lines of code, and none of it impacts the core of Propellor at all; it's all in Propellor.Property.Docker. (Well, I did need another hundred lines to write a daemon that runs inside the container and reads commands to run over a named pipe... Docker makes running ad-hoc commands inside a container a PITA.)

So, I think that this vindicates the approach of making the configuration of Propellor be a list of Properties, which can be constructed by arbitrarily interesting Haskell code. I didn't design Propellor to support containers, but it was easy to find a way to express them as shown above.

Compare that with how Puppet supports Docker:

docker::run { 'helloworld':
  image        => 'ubuntu',
  command      => '/bin/sh -c "while true; do echo hello world; sleep 1; done"',
  ports        => ['4444', '4555'],
}

All puppet manages is running the image and a simple static command inside it. All the complexities that puppet provides for configuring servers cannot easily be brought to bear inside the container, and a large reason for that is, I think, that its configuration file is just not expressive enough.

upstream git repositories

Daniel Pocock posted The multiple repository conundrum in Linux packaging. While a generally good and useful post, which upstream developers will find helpful to understand how Debian packages their software, it contains this statement:

If it is the first download, the maintainer creates a new git repository. If it has been packaged before, he clones the repository. The important point here is that this is not the upstream repository, it is an independent repository for Debian packaging.

The only thing important about that point is that it highlights an unnecessary disconnect between the Debian developer and upstream development. One which upstream will surely find annoying and should certainly not be bothered with.

There is absolutely no technical reason to not use the upstream git repository as the basis for the git repository used in Debian packaging. I would never package software maintained in a git repository upstream and not do so.

The details are as follows:

  • For historical reasons that are continually vanishing in importance, Debian fetishises the tarballs produced by upstream. While upstreams increasingly consider them an unimportant distraction, Debian insists on hoarding, and rolling around on, its nest of gleaming pristine tarballs.

    I wrote pristine-tar to facilitate this behavior, while also poking fun at it, and perhaps introducing a weak spot with which to eventually slay this particular dragon. It is widely used within Debian.

    .. Anyway, the point is that it's no problem to import upstream's tarball into a clone of their git repository. It's fine if that tarball includes files not present in their git repository. Indeed, upstream can do this at release time if they like. Or Debian developers can do it and push a small quantity of data back to upstream in a branch.

  • Sometimes tagged releases in upstream git repositories differ from the files in their released tarballs. This is actually, in my experience, less due to autotools generated files, and more due to manual and imperfect release processes, human error, etc. (Arguably, autotools are a form of human error.)

    When this happens, and the Debian developer is tracking upstream git, they can quite easily modify their branch to reflect the contents of the tarball as closely as they desire. Or modify the source package uploaded to Debian to include anything left out of the tarball.

    My favorite example of this is an upstream who forgot to include their README in their released tarball. Not a made up example; as mentioned tarballs are increasingly an irrelevant side-show to upstreams. If I had been treating the tarball as canonical I would have released a package with no documentation.

  • Whenever Debian developers interact with upstream, whether it's by filing bug reports or sending patches, they're going to be referring to refs in the upstream git repository. They need to have that repository available. The closer and better the relationship with upstream, the more the DD will use that repository. Anything that pulls them away from using that repository is going to add friction to dealing with upstream.

    There have, historically, been quite a lot of sources of friction. From upstreams who chose one VCS while the DD preferred using another, to DDs low on disk space who decided to only version control the debian directory, and not the upstream source code. With disk space increasingly absurdly cheap, and the preponderance of development converging on git, there's no reason for this friction to be allowed to continue.

So using the upstream git repository is valuable. And there is absolutely no technical value, and plenty of potential friction in maintaining a history-disconnected git repository for Debian packaging.

difficulties in backing up live git repositories

But you can’t just tar.gz up the bare repositories on the server and hope for the best. Maybe a given repository will be in a valid state; maybe it won’t.

-- Jeff Mitchell in a followup to the recent KDE near git disaster

This was a surprising statement to me. I seem to remember that one of (many) selling points for git talked about back in the day was that it avoided the problem that making a simple cp (or backup) of a repository could lead to an inconsistent result. A problem that subversion repositories had, and required annoying commands to work around. (svnadmin $something -- iirc the backend FSFS fixed or avoided most of this issue.)

This prompted me to check how I handle it in ikiwiki-hosting. I must have anticipated a problem at some point, since ikisite backup takes care to lock the git repository in a way that prevents eg, incoming pushes while a backup is running. Probably, like the KDE developers, I was simply exercising reasonable caution.

The following analysis has probably been written up before (train; limited network availability; can't check), but here are some scenarios to consider:

  • A non-bare repository has two parts that can clearly get out of sync during a backup: The work tree and the .git directory.

    • The .git directory will likely be backed up first, since getdirent will typically return it first, since it gets created first. If a change is made to the work tree during that backup, and committed while the work tree is being backed up, the backup won't include that commit -- which is no particular problem and would not be surprising upon restore. Make the commit again and get on with life.

    • However, if (part of) the work tree is backed up before .git, then any changes that are committed to git during the backup would not be reflected in the restored work tree, and git diff would show a reversion of those changes. After restore, care would need to be taken to reset the work tree (without losing any legitimate uncommitted changes).

  • A non-bare repository can also become broken in other ways if just the wrong state is snapshotted. For example, if a commit is in progress during a backup, .git/index.lock may exist, and prevent future commits from happening, until it's deleted. These problems can also occur if the machine dies at just the right time during a commit. Git tells you how to recover. (git could go further to avoid these problems than it does; for example it could check if .git/index.lock is actually locked using fcntl. Something I do in git-annex to make the .git/annex/index.lock file crash safe.)

  • A bare repository could be receiving a push (or a non-bare repository a pull) while the backup occurs. These are fairly similar cases, with the main difference being that a non-bare repository has the reflog, which can be used to recover from some inconsistent states that could be backed up. Let's concentrate on pushes to bare repositories.

    • A pack could be in the process of being uploaded during a backup. The KDE developers apparently worried that this could result in a corrupt or inconsistent repository, but TTBOMK it cannot; git transfers the pack to a temp file and atomically renames it into place once the transfer is complete. A backup may include an excess temp file, but this can also happen if the system goes down while a push is in progress. Git cleans these things up.

    • A push first transfers the .git/objects, and then updates .git/refs. A backup might first back up the refs, and then the objects. In this case, it would lose the record that refs were pushed. After being restored, any push from another repository would update the refs, even using the objects that did get backed up. So git recovers from this, and it's not really a concern.

    • Perhaps a backup chooses to first back up the objects, and then the refs. In this case, it could back up a newly changed ref, without having backed up the referenced objects (because they arrived after the backup had finished with the objects). When this happens, your bare repository is inconsistent; you have to somehow hunt down the correct ref for the objects you do have.

      This is a bad failure mode. git could improve this, perhaps, by maintaining a reflog for bare repositories. (Update: core.logAllRefUpdates can be set to true for bare repositories, but is disabled by default.)

  • A "backup" of a git repository can consist of other clones of it. Which do not include .git/hooks/ scripts, .git/config settings, and potentially other valuable information, that strangely, we do not check into revision control despite having this nice revision control system available. This is the most likely failure mode with "git backups". :P
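
The two push-versus-backup orderings discussed above can be played out in a tiny model. This is only a sketch under simplifying assumptions (one ref, objects as names, a push as two discrete steps), and none of the names correspond to anything in git itself.

```haskell
-- Toy model of a bare repository: object ids, and one ref naming an object.
data Repo = Repo
    { objects :: [String]
    , ref     :: String
    } deriving (Show, Eq)

-- A push arrives in two steps: first the objects, then the ref update.
beforePush, objectsArrived, pushDone :: Repo
beforePush     = Repo ["c1"] "c1"
objectsArrived = Repo ["c1", "c2"] "c1"
pushDone       = Repo ["c1", "c2"] "c2"

-- A naive backup copies each part at a different moment in time: the
-- first argument is the repo when objects were copied, the second is
-- the repo when refs were copied.
backup :: Repo -> Repo -> Repo
backup atObjectsCopy atRefCopy =
    Repo (objects atObjectsCopy) (ref atRefCopy)

-- A backup is usable if its ref names an object it actually contains.
consistent :: Repo -> Bool
consistent r = ref r `elem` objects r

-- Refs copied first, objects later: the push is lost, nothing dangles.
refsFirst :: Repo
refsFirst = backup pushDone beforePush

-- Objects copied first, refs later: the ref names a missing object.
objectsFirst :: Repo
objectsFirst = backup beforePush pushDone
```

In the model, refsFirst happens to equal the intermediate objectsArrived state, which a real repository passes through during every push anyway, while objectsFirst is a state the live repository never had.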

I think that it's important git support naive backups of git repositories as well as possible, because that's probably how most backups of git repositories are made. We don't all have time to carefully tune our backup systems to do something special around our git repositories to ensure we get them in a consistent state like the KDE project did, and as their experience shows, even if we do it, we can easily introduce other, unanticipated problems.

Can anyone else think of any other failure modes like these, or find holes in my slightly rushed analysis?

PS: git-annex is itself entirely crash-safe, to the best of my abilities, and also safe for naive backups. But inherits any problems with naive backups of git repositories.