on data hoarding

This post (via Ian M) meshed with some things I've been mulling over for a while about how we treat different types of data today. The relevant quote from that post is this:

I'll bet that we will look back of this era of quasi-networking and wince, "How did we ever live that way?" And the idea of wanting to carry all of your content with you will seem both old-fashioned and rather ridiculous.

I was recently very happy to acquire a large quantity of data files (about 15 thousand), each of which would take me hours to days to do anything with. And I have a few hundred CDs and DVDs with a terabyte or so of archived data on them. There is probably more stuff here than I would ever be able to look at in a regular lifetime if that were all I ever did. I'm adding stuff to the collection regularly.

All of this is data that is not freely available; some of it is private stuff like email archives, some is music, and the rest is every other sort of digital media known to man. Except, oddly, software, which is almost nonexistent in the hoard except for a few bits that slipped in by accident.

The chance that I'll ever look at any single one of these files is probably on the order of one in a thousand, although it's certainly possible that a computer might look at them all and search out some that I will be more likely to look at.

Even though I know that I will probably never use most of this stuff, I hold onto it as archives mostly because it's easier to do that than to decide if I really need to keep it, or to get it back if I mistakenly discard it. That basic tradeoff and the fact that it's very easy these days to acquire large quantities of data and hold onto it cheaply are the main factors behind this hoarding.

What I've described might seem extreme, but it's not very uncommon and most of the people reading this probably do it too, to some extent. But the really interesting thing to me is that if this data were the only data I was able to use, I'd be very unhappy and bored and unable to accomplish much. The interesting data is all the other data that I don't bother hoarding, because it seems so likely that it will be easy to get if I need it.

Some examples of this data -- and there are untold terabytes of it out there -- include: the complete source and binaries for a dozen operating systems; the complete revision control history of a thousand free software projects; tons of archived web sites, concerts, and movies in the Internet Archive; every Usenet post ever made; every mailing list post I've ever read (or ignored); dozens of bug tracking systems with millions of bug reports; probably not very good scans of any artwork of any significance; more ephemeral blog posts than even Boing Boing can link to, etc.

I don't hoard that data. The tradeoffs don't make hoarding it worthwhile; it's easier to just hope that most of it will always be available, and use my limited resources to keep the small parts of it that I'm responsible for backed up and available to others. Which tends to feel like a better use of time and resources than hoarding data, by the way.

I suspect that there is more of this sort of data out there that's important to me than there is for the average person right now. After all, most people don't find old bugs in bug tracking systems and the history of software's modification in revision control systems interesting, and I understand there might even be a few to whom Usenet archives from the 80's aren't fascinating. And I know that the amount of data out there that's important to me has gone up over time.

So the big question to me is, will I begin to find my hoarded data is useless, will it ever stop being important, and will I stop adding to the hoard?

Some technological advances, like more bandwidth, could tilt the balance that makes some things worth hoarding now. But at least for me, data still seems to be worth hoarding if the cost of accessing it is more than the cost of keeping it on media, or if its future availability can't be guaranteed. There are a lot of services that don't meet these criteria, things like iTunes, BitTorrent, and Netflix. These things won't stop hoarding; they'll just make hoarding easier, despite themselves.

If it were still the 80's I'd be hoarding nearly every freely available thing I listed 5 paragraphs above. Some of it no longer needs to be hoarded thanks to technological change, some of it due to new services, but it seems to me that the really important change that tilts the balance from hoarding data to not is a growth in trust, and a change in what kinds of data are valuable.

I trust the Internet Archive. I (sorta kinda) trust Google. I trust them to keep their content available, and so I don't try to hoard it. More and more I find freely licensed data, whether it's Free software, CC-licensed media, or whatever, to be more valuable than proprietary data. And because it's freely licensed it can be archived online for everyone, by everyone, with network effects, and no need for hoarding.