Unlocking the Secrets of 3 Billion Pages: Introducing the HathiTrust Research Center
Keynote from J. Stephen Downie, Associate Dean for Research and Professor at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
HathiTrust is a membership organisation - mostly top-tier US unis, plus three non-US members.
"Wow" numbers:
* 10 million volumes, including 3.4 million volumes in the US public domain
* 3.7 billion pages
* 482 TB of data
* 127 miles of books
Of the 3.4 million public-domain volumes, about a third are in the public domain only in the US; the rest are public domain worldwide. (4% are US government documents, so public domain from the point of publication.)
48% English, 9% German (probably scientific publications from pre-WWII).
Services to member unis:
* long term preservation
* full text search
* print on demand
* datasets for research
Data:
Each volume bundle contains, for each page, a JPEG image, the OCR text, and an XML file giving the coordinates of words on the page.
A METS file holds the book together - it points to each image/text/XML file, and built into it is structural information, e.g. table of contents, chapter starts, bibliography, title page, etc.
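As a rough illustration of that bundle structure, here's a minimal Python sketch that walks a METS file and lists, for each page, its structural label and the image/text/XML files it points to. The METS namespace is the standard one, but the TYPE/LABEL attribute details and the example filename are my assumptions, not HathiTrust's documented profile.

```python
# Minimal sketch: walk a HathiTrust-style METS file and list, for each page,
# the image/text/coordinate-XML files it points to plus any structural label
# (title page, chapter start, etc.). Element/attribute details are assumptions
# based on the METS standard, not on HathiTrust's exact profile.
import xml.etree.ElementTree as ET

METS = "{http://www.loc.gov/METS/}"          # standard METS namespace
XLINK = "{http://www.w3.org/1999/xlink}"     # used by FLocat href attributes

def pages_from_mets(path):
    root = ET.parse(path).getroot()

    # Map file IDs to the filenames they reference.
    files = {}
    for f in root.iter(METS + "file"):
        flocat = f.find(METS + "FLocat")
        if flocat is not None:
            files[f.get("ID")] = flocat.get(XLINK + "href")

    # Walk the structMap: each page <div> carries an ORDER and (sometimes)
    # a LABEL such as TITLE or CHAPTER_START, plus <fptr> links to its files.
    pages = []
    for div in root.iter(METS + "div"):
        if div.get("TYPE") != "page":
            continue
        pages.append({
            "order": div.get("ORDER"),
            "label": div.get("LABEL"),     # structural tag, may be absent
            "files": [files.get(fp.get("FILEID"))
                      for fp in div.findall(METS + "fptr")],
        })
    return pages

if __name__ == "__main__":
    for page in pages_from_mets("39002051125035.mets.xml"):  # hypothetical filename
        print(page["order"], page["label"], page["files"])
```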
Public-domain data is available through web interfaces, APIs, and data feeds.
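As one concrete example, here's a hedged sketch of querying the Bibliographic API for a single volume by OCLC number. The URL pattern and the response field names ('htid', 'rightsCode') are assumptions based on how I believe the API was documented around this time, and may have changed.

```python
# Hedged sketch: query the HathiTrust Bibliographic API for a volume by OCLC
# number. The URL pattern and response field names are assumptions and may
# not match the current API.
import json
import urllib.request

def hathi_brief_record(oclc_number):
    url = f"https://catalog.hathitrust.org/api/volumes/brief/oclc/{oclc_number}.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = hathi_brief_record("424023")          # example OCLC number
    for item in data.get("items", []):
        # 'htid' and 'rightsCode' are assumed field names in the response.
        print(item.get("htid"), item.get("rightsCode"))
```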
"Public-domain" datasets still require a signed researcher statement. Stuff digitised by Google has copyright asserted over it by Google. And anything from 1872-1923 is still considered potentially under copyright outside of the US. Working on manual rights determination - have a whole taxonomy for what the status is and how they assessed it that way.
Non-consumptive research paradigm - no single action by one user, or set of actions by a group of users, can be used to reconstruct and republish works. So users submit requests, HathiTrust does the compute, and sends the results back to them. [This reminds me of old Dialog sessions where you had to pay per search, so researchers would have to get the librarian to perform the search to find bibliographic data. Kind of clunky, but better than nothing I guess...]
Meandre lets researchers set up the processing flow they want to get their results. It includes all the common text-processing tasks, e.g. Dunning log-likelihood (which can be further improved by removing proper nouns). It doesn't replace close reading - it answers new questions. There's also a correlation n-gram viewer, so you can track use of words across time.
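For the curious, here's a minimal sketch of the Dunning log-likelihood (G²) score in the form commonly used in corpus linguistics - comparing a word's frequency in a target corpus against a reference corpus. This is the general statistic, not HTRC's or Meandre's exact implementation, and the counts are toy numbers.

```python
# Sketch of the Dunning log-likelihood (G2) score: compare a word's count in
# a target corpus against a reference corpus. Toy numbers only.
import math

def dunning_g2(count_a, total_a, count_b, total_b):
    """G2 for one word: count_a of total_a tokens in corpus A,
    count_b of total_b tokens in corpus B."""
    expected_a = total_a * (count_a + count_b) / (total_a + total_b)
    expected_b = total_b * (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    return 2 * g2

# Toy example: "whale" appears 120 times in a 50,000-token corpus but only
# 15 times in a 60,000-token reference corpus.
print(round(dunning_g2(120, 50_000, 15, 60_000), 2))
```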
OCR noise is a major limitation.
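One rough way to see what "OCR noise" means in practice: score a page by the fraction of tokens that look garbled. The heuristics below are purely illustrative - not anything HTRC actually does.

```python
# Rough sketch of quantifying OCR noise: flag tokens with stray non-alphabetic
# characters or improbable lengths, and report the suspect fraction per page.
import re

TOKEN = re.compile(r"\S+")

def noise_ratio(page_text):
    tokens = TOKEN.findall(page_text)
    if not tokens:
        return 0.0
    suspect = sum(
        1 for t in tokens
        if not t.strip(".,;:!?'\"()-").isalpha() or len(t) > 25
    )
    return suspect / len(tokens)

print(noise_ratio("Tlie qnick hrown fox iumped ov3r t-he l@zy dog"))
```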
Downie wants to engage in more collaborative projects and international partnerships, and to move beyond text and beyond the humanities. HTRC has just been awarded a grant for "Work-set Creation for Scholarly Analysis: Prototyping Project". It's non-trivial to find a 10,000-work subset of 10 million works to do research on - the project aims to solve this problem. They're also going to do some user-needs assessments, and in 2014 will award grants for four sub-projects to create tools. E.g. it would be great if there were a tool to find which pages have music on them.
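To make the work-set problem concrete, here's an illustrative sketch that carves a subset out of a large tab-separated metadata dump by filtering on language, date and a title keyword. The column names ('htid', 'lang', 'pub_date', 'title') and the file name are assumptions for illustration, not HathiTrust's actual schema.

```python
# Illustrative sketch of work-set creation: filter a multi-million-row
# metadata dump down to a research subset. Input format and column names
# are assumed for illustration.
import csv

def build_workset(metadata_path, lang="eng", year_range=(1850, 1900),
                  keyword="music", limit=10_000):
    workset = []
    with open(metadata_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            try:
                year = int(row["pub_date"])
            except (KeyError, ValueError):
                continue
            if (row.get("lang") == lang
                    and year_range[0] <= year <= year_range[1]
                    and keyword in row.get("title", "").lower()):
                workset.append(row.get("htid"))
            if len(workset) >= limit:
                break
    return workset

if __name__ == "__main__":
    ids = build_workset("volume_metadata.tsv")   # hypothetical file
    print(len(ids), "volumes selected")
```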
Ongoing challenges:
How do we unlock the potential of this data?
* Need to improve quality of data and improve metadata. Even just to know what's in what language! (See the language-detection sketch after this list.)
* Need to reconcile various data structure schemes
* May need to accrete metadata (there's no perfect metadata scheme)
* Overcoming copyright barriers
* Moving beyond text
* Building community
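On the "what language is it in?" point above, here's a small sketch using the third-party langdetect package as one off-the-shelf option - not something HathiTrust specifically uses, and OCR noise would degrade its guesses too.

```python
# Sketch of per-volume language identification using the langdetect package
# (pip install langdetect). Purely illustrative sample texts.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0   # make langdetect's guesses repeatable

samples = {
    "vol1": "Call me Ishmael. Some years ago, never mind how long precisely...",
    "vol2": "Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt.",
}

for vol_id, text in samples.items():
    print(vol_id, detect(text))   # e.g. 'en', 'de'
```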