So, I'm considering a backup system whose goal is to let you easily back up all your data from third-party websites.

The idea I have for its UI is that you enter the URLs for all the sites you use, or feed it a bookmark file. By examining the URLs, it determines whether it knows how to export data from each site, and prompts for any necessary login info. The result would be a list of sites and their backup status. I think it's important that you be able to enter websites that the system doesn't know how to handle yet; backup of such sites will show as failed, which is accurate. A web interface seems almost appropriate, although command-line setup and cron should also be supported.
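To make the URL-examination step concrete, here is a minimal sketch in Python. The `KNOWN_EXPORTERS` table, the exporter names, and the status strings are all made up for illustration; the only point is that an unknown site stays in the list and shows as failed.

```python
from urllib.parse import urlparse

# Hypothetical table mapping known domains to exporter names.
KNOWN_EXPORTERS = {
    "flickr.com": "flickr",
    "livejournal.com": "livejournal",
}

def find_exporter(url):
    """Return the exporter name for a URL, or None if the site is unknown."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    # Try the host itself, then each parent domain (user.livejournal.com
    # should match livejournal.com). The bare TLD is never tried.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in KNOWN_EXPORTERS:
            return KNOWN_EXPORTERS[candidate]
    return None

def backup_status(urls):
    """Build the site list; unknown sites are listed as failed, not dropped."""
    return [(url, "ready" if find_exporter(url) else "failed: no exporter")
            for url in urls]
```

A bookmark file would just be another source of URLs to pass to `backup_status`.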

For the actual data collection, there would be a plugin system. Some plugins could handle fairly generic things, like any site with an RSS feed. Other plugins would have site-specific logic, or would run third-party programs that already exist for specific sites. There are some subtle issues: Just because a plugin manages to suck an RSS feed or FOAF file off a site does not mean all the relevant info for that site is being backed up. More than one plugin might be able to back up parts of a site using different methods. Some methods will be much too expensive to run often.
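One way the plugin side might look, as a rough sketch only. The `Plugin` fields, `plan`, and `is_fully_backed_up` are hypothetical names, not an existing API. The sketch tries to capture the subtle issues above: several plugins can match one site, a plugin can declare that it only covers part of the data, and expensive plugins carry a minimum interval between runs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Plugin:
    name: str
    can_handle: Callable[[str], bool]   # does this plugin apply to the URL?
    fetch: Callable[[str], bytes]       # pull the data (not called in this sketch)
    complete: bool = False              # True only if it covers all of the site's data
    min_interval_hours: float = 24      # expensive methods get a larger value

def plan(url, plugins, hours_since_last):
    """Return every plugin that applies to url and is due to run again.

    hours_since_last maps plugin name -> hours since its last run;
    a plugin that has never run is always due.
    """
    return [p for p in plugins
            if p.can_handle(url)
            and hours_since_last.get(p.name, float("inf")) >= p.min_interval_hours]

def is_fully_backed_up(url, plugins):
    """Getting an RSS feed is not enough: some applicable plugin must be complete."""
    return any(p.complete and p.can_handle(url) for p in plugins)
```

With this shape, a generic RSS plugin can run hourly on every site while a site-specific full export runs weekly, and the status list can tell "partially backed up" apart from "fully backed up".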

As far as storing the backup, dumping it to a directory as files in whatever formats are provided might seem cheap, but then you can check it into git, or let your regular backup system (which you have, right?) handle it.
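A sketch of what "dump it to a directory" could mean in practice, assuming one subdirectory per site hostname (that layout and the `store` helper are my invention here, nothing more). Files are written to a temporary name and then renamed, so git or the regular backup system never sees a half-written file.

```python
import os
import tempfile
from urllib.parse import urlparse

def store(backup_root, url, filename, data):
    """Write one exported file under backup_root/<hostname>/ and return its path."""
    host = urlparse(url).hostname or "unknown-site"
    site_dir = os.path.join(backup_root, host)
    os.makedirs(site_dir, exist_ok=True)
    # Keep only the base name so a plugin cannot write outside site_dir.
    path = os.path.join(site_dir, os.path.basename(filename))
    # Write to a temporary file in the same directory, then rename over
    # the final name, so an interrupted run leaves the old copy intact.
    fd, tmp = tempfile.mkstemp(dir=site_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
    return path
```

The data is stored in whatever format the site provided; nothing is converted.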


Thoughts, comments, prior art, cute program idea names (my idea is "unsilo"), or am I wasting my time thinking about this?

Libre.fm would support this
We'd be willing to work with you to make this happen, if you would be willing to release this under an appropriate free software license, such as the GNU GPL v3 or later.
Comment by mattl [myopenid.com]
comment 3
One idea; rather than just ask for urls, perhaps you can scan a browser profile for identity cookies and prompt for likely suspects?
Comment by jldugger [launchpad.net]
Quite hard for the general case.

The biggest problem is ensuring that you tease out every available piece of data.

As an example: I have a blog. It has (several) feeds, but those feeds only contain the most current entries. For a reliable backup you'd need to ensure that there was a public (ACL/firewall/etc) feed to pull all data.

Of course for some applications those feeds aren't available and will need to be added - and if you're in a position where you need to make code changes you kinda think "Hmmm maybe I should just use rsync after all".

Still, it's an attractive use case, and even if you only supported WordPress, etc., you would probably be providing a useful thing to the world!

Comment by skx [livejournal.com]
rel="backup"?

Seems like it might help to have some standards for sites to specify how to back up their data. For instance, link rel="export" or link rel="backup".
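To illustrate, here is a sketch of how a client could discover such links, using only Python's standard library. Note that rel="backup" and rel="export" are just the values proposed above, not an existing standard, and `find_backup_links` is a made-up helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class BackupLinkFinder(HTMLParser):
    """Collect href values of <link> tags whose rel includes backup or export."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values.
        rels = set((attributes.get("rel") or "").lower().split())
        href = attributes.get("href")
        if href and rels & {"backup", "export"}:
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, href))

def find_backup_links(html, base_url):
    """Return absolute URLs of all proposed backup/export links in a page."""
    finder = BackupLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```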

Of course, even with a standard for backing up your data, you'll probably need site-specific code to log in so you can access that data.

Comment by joshtriplett.myopenid.com
Flickr

The only site I really care about my data being backed up is Flickr. :-) For that I use a specific photobackup tool.

I really can't imagine you doing a generic "unsilo" on Flickr, but it would be great if you could.

Oh btw, isn't this what the Data portability project is all about? I'm not sure.

Great idea Joey.

Comment by hendry [iki.fi]