In order to better integrate my blog with my website, better manage comment spam, and reduce my dependence on Google, this blog has moved to In order to avoid broken links I won't be deleting content from here, but no new content will be added, so please update your bookmarks and feeds.

Tuesday, 20 September 2011

The death of organised data

I've been hearing rumours that the big IT companies may be giving up on organised data. Which is kind of a big thing for the same reason that it makes perfect sense: there are terabytes upon terabytes of data pouring onto computers and servers all the time, and organising all of that into a useful format takes a heck of a lot of time.

Especially because data organised to suit one need isn't necessarily going to suit most actual needs. If you're a reference librarian (either academic or, I suspect, public) you'll have had the student coming to your desk who can't quite understand why typing their assignment topic into a database doesn't return the single perfect article that explicitly answers all their questions.

So I think there are two ways of organising data:
  • "pre-organising" it - e.g. a dictionary, which is organised alphabetically, assuming you want to find out about a given word. It has information about which words are nouns and what dates they derive from (to a best guess, obviously) but there's no way to search for nouns that were used in the 16th century because the dictionary creator never imagined someone might want to know such a thing.
  • organising it at point of need - e.g. a database which had all this same information but allowed you to tell it you want only nouns deriving from the 16th century or earlier; or only pronunciations that end in a certain phonetic pattern; or only words that include a certain other word in the definition.
Organising data at point of need solves one problem (it's much more flexible) but it doesn't actually save time on the organising end. In fact, it's likely to take quite a lot more time.
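To make the "point of need" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Entry` fields, the `query` function, and especially the sample words and dates, which are made-up placeholders rather than real etymological data. Note that someone still had to type each entry into tidy fields first, which is exactly the organising cost described above.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    word: str
    part_of_speech: str
    first_used: int  # year; placeholder values, not real etymology
    definition: str


# Hypothetical, hand-organised records.
ENTRIES = [
    Entry("lantern", "noun", 1300, "a portable case for a light"),
    Entry("scribble", "verb", 1450, "to write hastily"),
    Entry("gadget", "noun", 1880, "a small mechanical device"),
    Entry("candle", "noun", 900, "a stick of wax that gives light"),
]


def query(entries, part_of_speech=None, no_later_than=None,
          definition_contains=None):
    """Return the words matching every criterion that was supplied."""
    results = []
    for entry in entries:
        if part_of_speech is not None and entry.part_of_speech != part_of_speech:
            continue
        if no_later_than is not None and entry.first_used > no_later_than:
            continue
        if (definition_contains is not None
                and definition_contains not in entry.definition):
            continue
        results.append(entry.word)
    return results


# Nouns from the 16th century or earlier.
print(query(ENTRIES, part_of_speech="noun", no_later_than=1600))
# Words whose definition mentions another word.
print(query(ENTRIES, definition_contains="light"))
```

The questions are decided when you ask them, not when the data was entered, but only because every record was already broken into fields.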

So is humanity doomed to be swimming in yottabytes of undifferentiated, unorganised, and thus useless data? I frowned over this for a while, and after some time I remembered the alternative to organising data: parsing it. (This is just what humans do when we skim a text looking for the information we want.) So, for example, a computer could take an existing dictionary as input and look for the pattern of a line which includes "n." (or s.b. or however the dictionary indicates a noun), and a date matching certain criteria, and return to the user all the lines that match what was asked for.
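The dictionary example might be sketched like this in Python. The line layout (headword, comma, part-of-speech marker, a year in brackets) is purely my assumption, as are the sample lines and the `nouns_up_to` name; a real dictionary would need its own pattern.

```python
import re

# Assumed layout: "headword, n. (c1300) definition..."
# The noun marker may be "n." or "s.b."; the year may be prefixed with "c".
LINE_PATTERN = re.compile(
    r"^(?P<word>[A-Za-z-]+),\s+(?:n\.|s\.b\.)\s+\(c?(?P<year>\d{3,4})\)"
)


def nouns_up_to(lines, latest_year):
    """Return the raw lines that look like nouns dated latest_year or earlier."""
    matches = []
    for line in lines:
        found = LINE_PATTERN.match(line)
        if found and int(found.group("year")) <= latest_year:
            matches.append(line)
    return matches


# Made-up sample lines, not quotations from any real dictionary.
SAMPLE = [
    "lantern, n. (c1300) a portable case for a light",
    "scribble, v. (1450) to write hastily",
    "gadget, n. (1880) a small mechanical device",
    "candle, s.b. (900) a stick of wax that gives light",
]

for line in nouns_up_to(SAMPLE, 1600):
    print(line)
```

Nothing was organised in advance here: the text stays as plain text, and the pattern does the work at the moment of asking.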

Parsing is hard, and computers have historically been bad at it. (Bear in mind though that for a long time humans beat computers at chess.) This is not because computers aren't good at pattern-matching; it's because humans are so good at making typos, or rephrasing things in ways that don't fit the criteria. (One dictionary says "noun", one says "n.", one says "s.b.", one uses "n." but it refers to something else entirely...) A computer parsing data has to account for all the myriad ways something might be said, and all the myriad things a given text might mean.
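One small, hedged illustration of why this is fiddly: the same marker can mean different things depending on the source, so a parser can't rely on a single global rule. In this Python sketch the dictionary names and their marker sets are hypothetical.

```python
# Which abbreviations mean "noun" depends on which dictionary we are reading.
# All four sources below are invented for illustration.
NOUN_MARKERS = {
    "dictionary_a": {"noun"},
    "dictionary_b": {"n."},
    "dictionary_c": {"s.b."},
    "dictionary_d": set(),  # here "n." refers to something else entirely
}


def is_noun(source, marker):
    """Decide whether a marker means 'noun' in the given source."""
    return marker.strip().lower() in NOUN_MARKERS.get(source, set())


print(is_noun("dictionary_b", "n."))  # "n." is a noun marker in this source
print(is_noun("dictionary_d", "n."))  # same text, different meaning
```

Even this toy version needs a hand-written table per source, and it does nothing at all about typos.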

But if you look around, you'll see parsing is already emerging. One of the things the LibX plugin does is look for the pattern of an ISBN and provide a link to your library's catalogue search. You may have an email program that, when your friend writes "Want to meet at 12:30 tomorrow at the Honeypot Cafe?", gives you a one-click option to put this appointment into your calendar. Machine transcription from videos, recognition of subjects in images, machine translation - none of it's anywhere near perfect, but it's all improving, and all these are important steps in the emergence of parsing as a major player in the field of managing data.
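As a rough idea of what "looking for the pattern of an ISBN" can involve, here is a Python sketch. It is not LibX's actual code (I haven't looked at how LibX does it); the regular expression and function names are my own assumptions. The check-digit arithmetic follows the standard ISBN-10 and ISBN-13 rules.

```python
import re

# Loose candidate pattern: a digit, then digits/spaces/hyphens, then a digit or X.
CANDIDATE = re.compile(r"\b\d[\d -]{8,15}[\dXx]\b")


def is_valid_isbn(digits):
    """Check the ISBN-10 or ISBN-13 check digit of a separator-free string."""
    if len(digits) == 10:
        total = 0
        for position, char in enumerate(digits):
            if char in "Xx":
                if position != 9:  # X is only allowed as the final check digit
                    return False
                value = 10
            elif char.isdigit():
                value = int(char)
            else:
                return False
            total += (10 - position) * value
        return total % 11 == 0
    if len(digits) == 13 and digits.isdigit():
        total = sum(
            int(char) * (1 if position % 2 == 0 else 3)
            for position, char in enumerate(digits)
        )
        return total % 10 == 0
    return False


def find_isbns(text):
    """Return valid ISBNs found in free text, with separators stripped."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if is_valid_isbn(digits):
            found.append(digits)
    return found


print(find_isbns("See ISBN 0-306-40615-2 or 978-0-306-40615-7, not 1234567890."))
```

The check digit is what keeps a random ten-digit number from being mistaken for a book; a plugin could then turn each hit into a catalogue search link.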

So yes, if I were a big IT company I might want to get out of the dead end that is organising data, too - and get into the potentially much more productive field of parsing it.