As a Sunday diversion, I wrote 150 lines of code and turned git-annex into a podcatcher!
I've been using hpodder, a podcatcher written in Haskell. But John Goerzen hasn't had time to maintain it, and it fell out of Debian a while ago. John suggested I maintain it, but I have not found the time, and it'd be another mass of code for me to learn and worry about.
Also, hpodder has some misfeatures common to the "podcatcher" genre:
- It has some kind of database of feeds and of what files have been downloaded from them, which in turn requires an interface for adding feeds, removing feeds, changing urls, and so on.
- Because it uses a database, there's no particularly good way to run it against the same feeds on multiple computers and sync the results.
- It doesn't use `git annex addurl` to register the url a file came from, so when I check files into git-annex after the fact they're missing that useful metadata, and I can't just `git annex get` them to re-download them from the podcast. (See the example just after this list.)
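To make that last point concrete, here is roughly what the addurl/get mechanic looks like on its own; the url and filename are placeholders:

    # register where a file's content can be found on the web
    git annex addurl --file=episode1.mp3 http://example.com/show/episode1.mp3
    # the web now shows up as just another location holding the content
    git annex whereis episode1.mp3
    # so the local copy can be dropped and re-fetched later
    git annex drop episode1.mp3
    git annex get episode1.mp3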
So, here's a rethink of the podcatcher genre:
`cd annex; git annex importfeed http://url/to/podcast http://another/podcast`
There is no database of feeds at all, although of course you can check a list of them right into the same git repository, next to the files it adds. git-annex already keeps track of the urls associated with content, so it reuses that to know which urls it has already downloaded. So when you're done with a podcast file and delete it, it won't be downloaded again.
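For example (this is just a convention, not something importfeed requires), the feed list can be an ordinary file in the repository:

    # feeds.txt lives in the repo, one feed url per line
    echo http://url/to/podcast >> feeds.txt
    echo http://another/podcast >> feeds.txt
    git add feeds.txt
    git commit -m 'add podcast feeds'

    # on any clone, fetch new episodes from every listed feed
    xargs git annex importfeed < feeds.txt

Since the feed list and the record of what's been downloaded both live in git, any clone of the repository can catch up on the same feeds.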
This is a podcatcher that doesn't need to actually download podcast files! Run with `--fast`, it only records the existence of files in git, and `git annex get` will download them later from the web (or perhaps from a nearer location that git-annex knows about).
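Concretely, that workflow looks something like this; the episode path below is made up, since the real filenames come from the feed:

    # record which episodes exist, without downloading anything
    git annex importfeed --fast http://url/to/podcast

    # later, fetch only the episodes you actually want
    git annex get 'Some Podcast/some_episode.mp3'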
Took just 3 hours to write, and that includes full control over the filenames it uses, and automatic resuming of interrupted downloads. Most of what I needed was already available in git-annex's utility libraries or on Hackage.
Technically, the only part of this that was hard at all was efficiently querying the git repository for a list of all known urls. I found a pretty fast way to do it, but might add a local cache file later on.
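I won't claim this is the exact query it ends up running, but one way to pull the full url list out of a repository, assuming the urls are recorded in *.log.web files on the git-annex branch, is plain git plumbing:

    # list the url log files on the git-annex branch and dump their contents
    git ls-tree -r --name-only git-annex |
    grep '\.log\.web$' |
    while read -r f; do
        git show "git-annex:$f"
    done

Each log entry includes a recorded url, so a little parsing of that output yields the set of urls already known.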