In order to better integrate my blog with my website, better manage comment spam, and reduce my dependence on Google, this blog has moved. To avoid broken links I won't be deleting content from here, but no new content will be added, so please update your bookmarks and feeds.

Tuesday, 9 July 2013

Institutional repositories for data?

Via my Twitter feed:

(And discussion ensuing.)

I'm not an expert in data management. A year ago it was top of my list of Things That Are Clearly Very Important But Also Extremely Scary, Can Someone Else Please Handle It? But then I got a cool job which includes (among other things) investigating what this data management stuff is all about, so I set about investigating.

Sometime in the last half year I dropped the assumption that we needed to be working towards an institutional data repository. In fact, I now believe we need to be working away from that idea. Instead, I think we should be encouraging researchers to deposit their datasets in the discipline-specific (or generalist) data repositories that already exist.

I have a number of reasons for this:
  • My colleague and I, with a certain amount of outsourcing, already have to run a catalogue, the whole rickety edifice of databases and federated searching and link resolving and proxy authentication, library website and social media account, institutional repository, community archive, open journal system, etc etc. Do we look like we need another system to maintain?
  • An institutional archive is kind of serviceable for PDFs. But datasets come in xls, csv, txt, doc, html, xml, mp3, mp4, and a thousand more formats (no, I'm not exaggerating). They can be maps, interviews, 3D models, spectral images, anything. They can be a few kilobytes or a few petabytes. Yeah, you can throw this stuff into DSpace, but that doesn't mean you should. That's like throwing your textbooks, volumes of abstracts, Kindles, Betamax, newspapers, murals, jigsaw puzzles, mustard seeds, and Broadway musicals (not a recording, the actual theatre performance) onto a single shelf in a locked glass display cabinet and making people browse by the spine labels.
  • If you want a system that can do justice to the variety of datasets out there, you'd better have the resources of UC3 or DCC or Australia or PRISM. Because you're either going to have to build it or you're going to have to pay someone to build it, and then you're going to have to maintain it. And you're going to have to pay for storage and you're going to have to run checksums for data integrity and you're going to have to think about migrating the datasets as time marches on and people forget what the current shiny formats are. And you're going to have to wonder if and how Google Scholar indexes it (and hope Google Scholar lasts longer than Google Reader did) or no-one will ever find it. And a whole lot more besides.
  • If anything's in it. Do you know how hard it is to get researchers to put their conference papers into institutional repositories? My own brother flatly refuses. He points out that his papers are already available via his discipline's open access repository. That's where people in his discipline will look for them. They're indexed by Google. Why put them anywhere else? I conceded the point for the sake of our family dinner, and I haven't brought it up again because on reflection he's right. (He's ten years younger than me; he has no business being right, dammit.) And because it's hard enough to get researchers to put their conference papers into institutional repositories even when their copy is the only one in existence.
  • Do you know how hard it is to convince most researchers that they should put their datasets anywhere online other than a private Dropbox account? (Shameless plug: Last week another colleague and I did a talk responding to 8 'myths' or reasons why many researchers hesitate - slides and semi-transcript here. That's summarised from a list we made of 23 reasons, and other people have come up with more objections and necessary counters.) The lack of an institutional repository for data doesn't even rate.
No, forget creating institutional data repositories. What we need to be doing is getting familiar with the discipline data repositories and data practices that already exist, so when we talk to a researcher we can say "Look at what other researchers in your discipline are doing!"

This makes it way easier to prove that this data publishing thing isn't just for Those Other Disciplines, and that there are ways for them to deal with [confidentiality|IP issues|credibility|credit]. And it makes sure the dataset is where other researchers in that discipline are searching for it. And it makes sure the datasets are deposited according to that discipline's standards and that discipline's needs, not according to the standards and needs of whoever was foremost in the mind of the developer who created the generic institutional data repository, so the search interface will be more likely to work reasonably well for that discipline. And it means the types of data will be at least a little more homogeneous (in some cases a lot more) so there's more potential for someone to do cool stuff with linked open data.

And it means we can focus on what we do best, which is helping people find and search and understand and use and cite and publish these resources. Trust me, there is plenty more to do in data management than just setting up an institutional data repository.