As a Sunday diversion, I wrote 150 lines of code and turned git-annex into a podcatcher!

I've been using hpodder, a podcatcher written in Haskell. But John Goerzen hasn't had time to maintain it, and it fell out of Debian a while ago. John suggested I maintain it, but I have not found the time, and it'd be another mass of code for me to learn and worry about.

Also, hpodder has some misfeatures common to the "podcatcher" genre:

  • It has some kind of database of feeds and what files have been downloaded from them. And this requires an interface around adding feeds, removing feeds, changing urls, etc.
  • Because it uses a database, there's no particularly good way to run it on the same feeds on multiple computers and sync the results in some way.
  • It doesn't use git annex addurl to register the url where a file came from, so when I check files in with git-annex after the fact they're missing that useful metadata and I can't just git annex get them to re-download them from the podcast.

So, here's a rethink of the podcatcher genre:

cd annex; git annex importfeed http://url/to/podcast http://another/podcast

There is no database of feeds at all, although of course you can check a list of them right into the same git repository, next to the files it adds. git-annex already keeps track of urls associated with content, so importfeed reuses that to know which urls it's already downloaded. So when you're done with a podcast file and delete it, it won't download it again.
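
To make the "no database" idea concrete, here is a minimal sketch in Python of the skip logic: compare the urls found in a feed against the set of urls already recorded, and keep only the new ones. This is an illustration, not the actual git-annex code (which is Haskell); the function name new_item_urls and the example episode urls are made up for this sketch.

```python
def new_item_urls(feed_item_urls, known_urls):
    """Return the feed's urls that have not been recorded before, in feed order."""
    seen = set(known_urls)  # urls already associated with content
    fresh = []
    for url in feed_item_urls:
        if url not in seen:
            fresh.append(url)
            seen.add(url)  # also skip duplicates within the same feed
    return fresh


# Hypothetical data: one episode was downloaded earlier (and maybe deleted since).
known = {"http://url/to/podcast/ep1.mp3"}
feed = ["http://url/to/podcast/ep1.mp3", "http://url/to/podcast/ep2.mp3"]
print(new_item_urls(feed, known))
```

Since the record is of urls rather than of files on disk, deleting ep1.mp3 from the working tree doesn't make it look new again.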

This is a podcatcher that doesn't need to actually download podcast files! With --fast, it only records the existence of files in git, so git annex get will download them from the web (or perhaps from a nearer location that git-annex knows about).
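
If you do keep a list of feeds checked into the repository, a small wrapper can turn it into one importfeed invocation, optionally with --fast. The following Python sketch assumes a made-up file format (one url per line, blank lines and lines starting with # ignored) and a hypothetical helper name, importfeed_command; only the git annex importfeed command and its --fast option come from git-annex itself.

```python
def importfeed_command(feed_list_text, fast=False):
    """Build the argv for `git annex importfeed` from the text of a feed list."""
    urls = []
    for line in feed_list_text.splitlines():
        line = line.strip()
        # Assumed format: skip blank lines and "#" comment lines.
        if line and not line.startswith("#"):
            urls.append(line)
    cmd = ["git", "annex", "importfeed"]
    if fast:
        cmd.append("--fast")  # only record the files in git, don't download
    return cmd + urls


feeds = """# my podcasts
http://url/to/podcast

http://another/podcast
"""
print(importfeed_command(feeds, fast=True))
```

The resulting list could be handed to subprocess.run(cmd, check=True) from inside the annex directory, for example from a cron job.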

Took just 3 hours to write, and that's including full control over the filenames it uses (--template='${feedtitle}/${itemtitle}${extension}'), and automatic resuming of interrupted downloads. Most of what I needed was already available in git-annex's utility libraries or Hackage.
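
As a rough illustration of how such a filename template expands, here is a Python sketch using string.Template, which happens to understand the same ${name} syntax. It is not how git-annex implements --template; the helper name item_filename and the rule of replacing "/" with "_" in titles are assumptions made for this example.

```python
from string import Template


def item_filename(template, feedtitle, itemtitle, extension):
    """Expand a ${feedtitle}/${itemtitle}${extension} style template."""
    def clean(text):
        # Assumed sanitization: keep titles from introducing extra directories.
        return text.replace("/", "_")

    return Template(template).substitute(
        feedtitle=clean(feedtitle),
        itemtitle=clean(itemtitle),
        extension=extension,
    )


print(item_filename("${feedtitle}/${itemtitle}${extension}",
                    "My Podcast", "Episode 1/2", ".mp3"))
# prints: My Podcast/Episode 1_2.mp3
```

Each feed ends up in its own directory, with one file per item.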

Technically, the only part of this that was hard at all was efficiently querying the git repository for a list of all known urls. I found a pretty fast way to do it, but might add a local cache file later on.

This is great!
This is my biggest peeve with other podcatchers as well! It only really worked well when I treated my phone as "the podcast machine" even though I really wanted to just use my personal and work laptops. Thanks much!
Comment by greg
This is awesome!
Elegantly illustrates that git-annex and its implementation are powerful abstractions. Neat stuff.
Comment by Amitai
cool idea!
Is the feed URL connected to something in the annex, too? So that you can run an update on it, without doing a full importfeed again? Like for example having an RSS feed connected to a folder and an updatefeed command to just add all new URLs from that feed? Sure, a shell script that just runs importfeed on a list of predefined feeds would work, too, but if the feed url is connected in the metadata of a folder, that would allow a server to just run updatefeed on a defined folder via cron and you to manage the actual folder-url-connections. For me that would be great to build a photocatcher that collects photos from various sources where I push from my mobile and pull them all together into a bigger library on the server, with me being able to selectively get those photos that I actually need.
Comment by hugo
comment 3

Know any good Android podcast players (tuned for long-running audio, with remembered positions and "read" indicators) that could cooperate nicely with this?

I wonder if it would make sense to have an Android intent for "done with this file", and then options in git-annex to "discard files when done" in certain directories?

Comment by Josh