In order to better integrate my blog with my website, better manage comment spam, and reduce my dependence on Google, this blog has moved. To avoid broken links I won't be deleting content from here, but no new content will be added, so please update your bookmarks and feeds.

Wednesday, 21 November 2012

PDF for digital preservation and delivery #ndf2012

PDF for digital preservation and delivery
John Laurie, University of Auckland Library
PDF is ubiquitous on the web and many organisations in New Zealand are using it as a document storage format. It has been an open standard since 2008, and has been endorsed by key organisations around the world. It is a complex format with many different versions. This paper will look at differences between PDF/A archival formats and other PDF formats, methods for handling born-digital PDFs and PDFs created by scanning, problems with dirty OCR (optical character recognition) and text extraction for indexing, and issues around file sizes for preservation and online display. It will also look at usage of Adobe's RDF and Dublin Core-based XMP metadata and compare PDF with METS-Alto as a format for different types of digitisation.

Doubts about PDF as a format - have sometimes used it and then changed to TEI - but with all its faults it's here to stay.

  • Is PDF good enough?
  • What's a maximum file size?
  • PDF/A or simple PDF?
  • Searchable text or ClearScan?
  • OCR?
  • etc
Various local pdf collections at UofAuckland - past exam papers, Journal of the Polynesian Society, New Zealand Journal of History, early NZ statutes, theses, working papers, course materials.

B-engine platform displays as pdf and extracts text and makes it available for cross-site search.

PDF is continually improving - read-aloud versions; now working with citations[1]. But it's hard to edit.

Focusing on digitising pdfs. Choice: use Adobe's own scanning/OCR or other specialised OCR engines? Need to look at the outputs you want - many variables to consider. Do you want to save the pdf as preservation master copy or keep the FineReader TIFFs? Has only scanned at 300-400dpi for text and hasn't seen advantages to higher resolutions for his purposes. Need greyscale for OCR. FineReader is better than Adobe but doesn't offer ClearScan. It is trainable - useful for fractions. Spellchecking options.

Tables are a particular problem. OCR confuses vertical lines with text. Can't extract tables from PDF to Excel. Could do some training for OCR to recognise the two dots of "blank field" and vertical lines. Thinking of using dirty OCR and making it available as a link from the pdf page.

Compromise between quality and file size. Born-digital files (usually Word -> PDF) are usually very small because they use fonts rather than page images. PDFs from scanning balloon out a lot because they are stored as images. If text is clear can do black and white. Working with 5-10MB TIFF files as preservation masters (FineReader creates these automatically).
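The scanned-image sizes mentioned here can be sanity-checked with some arithmetic. A minimal sketch, assuming an A4 page and uncompressed pixel data; the page size and the function name are illustrative assumptions, not something from the talk:

```python
MM_PER_INCH = 25.4


def uncompressed_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Approximate raw image size for one scanned page.

    Ignores TIFF headers, row padding and compression, so real files differ.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8


# A4 page (210 x 297 mm, an assumed page size) at 300dpi
greyscale = uncompressed_bytes(210, 297, 300, 8)  # 8-bit greyscale
bitonal = uncompressed_bytes(210, 297, 300, 1)    # 1-bit black and white

print(f"greyscale: {greyscale / 1e6:.1f} MB, bitonal: {bitonal / 1e6:.1f} MB")
```

Under these assumptions a greyscale page is about 8.7MB and a black-and-white page about 1.1MB, which is why dropping to black and white for clear text makes such a difference.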

PDF/A is the archival version - ISO-standardised, supposed to be self-contained including embedded fonts. But often if you use "reduce file size" you can't save as PDF/A because it substitutes non-embedded fonts. Many files from big publishers aren't PDF/A. But will the smarter computers of the future really need embedded fonts? "As we all get smarter and technology improves the acute concerns about format obsolescence may diminish" - Butch Lazorchak, The Signal

PDF/A-1a, A-1b, A-2... Can get quite complicated!
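A PDF/A file declares its flavour in its XMP metadata, using the `pdfaid:part` and `pdfaid:conformance` properties. A rough sketch of reading that declaration from an XMP packet that has already been extracted from the PDF; the function name and sample packet are made up for illustration, and this only reads what the file claims, it does not validate conformance:

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def claimed_pdfa_flavour(xmp_packet):
    """Return e.g. 'PDF/A-1b' if the packet declares one, else None."""
    root = ET.fromstring(xmp_packet)
    values = {}
    for desc in root.iter("{%s}Description" % RDF_NS):
        for key in ("part", "conformance"):
            qname = "{%s}%s" % (PDFAID_NS, key)
            # XMP allows a simple property as an attribute or a child element
            if desc.get(qname) is not None:
                values[key] = desc.get(qname)
            else:
                child = desc.find(qname)
                if child is not None and child.text:
                    values[key] = child.text.strip()
    if "part" not in values:
        return None
    return "PDF/A-%s%s" % (values["part"], values.get("conformance", "").lower())


# Hypothetical packet, trimmed to the relevant properties
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        pdfaid:part="1" pdfaid:conformance="B"/>
  </rdf:RDF>
</x:xmpmeta>"""

print(claimed_pdfa_flavour(SAMPLE))
```

Getting the XMP packet out of the PDF in the first place needs a PDF tool or library, which is outside this sketch.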

ClearScan vs searchable image - clearscan files are just over half the size. Substitutes a new font - matches shape not OCR'd text. Much clearer, less blurry than searchable image version.

Problems with text extraction using pdftotext applet. Applet preindexes results. But with particular fonts/books you get extra spaces between characters. (Finds examples using search for "t h e".) Problems with macrons won't ruin display but will ruin search.
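Both problems can be checked for in extracted text before it goes into an index. A small sketch, not the speaker's method: the first function automates the "t h e" search by looking for runs of single letters separated by spaces, and the second folds diacritics so that a search for "Maori" also finds "Māori". That macron mismatches come from differing Unicode forms is an assumption here; the talk did not say what the cause was. Function names are illustrative.

```python
import re
import unicodedata

# Three or more single letters separated by single spaces, e.g. "t h e".
# [^\W\d_] matches any Unicode letter, so macronised vowels count too.
LETTER_SPACED = re.compile(r"(?<!\w)(?:[^\W\d_] ){2,}[^\W\d_](?!\w)")


def find_letter_spaced(text):
    """Return runs that look like words broken into spaced-out letters."""
    return LETTER_SPACED.findall(text)


def strip_diacritics(text):
    """Decompose characters and drop combining marks (macrons included).

    Note this removes all diacritics, not just macrons.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(find_letter_spaced("in t h e house"))
print(strip_diacritics("Māori"))
```

Short genuine sequences such as "I a m" would also be flagged, so this is a way to find suspect books or fonts for review rather than an automatic fix.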

PDF XMP metadata - has made attempts at adding dublin core metadata. Automatically extracts a lot of its own. Can add elements from any metadata scheme. File > Properties > Additional metadata. Set up a custom file info panel - can populate a whole group of documents. Advanced shows it with Dublin Core elements.
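For reference, an XMP packet is RDF/XML, and Dublin Core elements sit inside an `rdf:Description`. A minimal sketch of building such a packet with Python's standard library; `build_xmp` and the sample values are hypothetical, and this only produces the XML, it does not embed it in a PDF:

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, local):
    """Build an ElementTree qualified name such as {uri}title."""
    return "{%s}%s" % (NS[prefix], local)


def build_xmp(title, creators):
    """Return a small XMP packet with dc:title and dc:creator."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative: rdf:Alt with an x-default entry
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    li = ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"})
    li.text = title

    # dc:creator is an ordered list: rdf:Seq with one rdf:li per name
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    for name in creators:
        ET.SubElement(seq, q("rdf", "li")).text = name

    return ET.tostring(xmpmeta, encoding="unicode")


print(build_xmp("Example working paper", ["Example Author"]))
```

Generating packets this way is one option for populating a whole group of documents, though the talk described doing it through Acrobat's custom file info panel.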

METS-ALTO looks a lot like pdf - has image in front of text / dirty ocr hidden behind it which you can search on and get either text or image. METS (Metadata Encoding and Transmission Standard) is structural metadata linking things together; ALTO (Analyzed Layout and Text Object) stores layout info, OCR text. Can be used to create derivatives eg pdf, tei, xml, epub.
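To make the ALTO side concrete: each recognised word is a `String` element carrying its text in `CONTENT` and its position on the page in `HPOS`/`VPOS`/`WIDTH`/`HEIGHT`, grouped into `TextLine` elements. A short sketch that pulls plain text lines out of an ALTO file for indexing; the sample fragment is invented and heavily trimmed, and matching on local element names is a simplification so that it works across ALTO schema versions:

```python
import xml.etree.ElementTree as ET

# Invented, trimmed-down ALTO fragment for illustration only
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Journal" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
      <SP/>
      <String CONTENT="of" HPOS="300" VPOS="200" WIDTH="50" HEIGHT="40"/>
    </TextLine>
    <TextLine>
      <String CONTENT="History" HPOS="100" VPOS="260" WIDTH="190" HEIGHT="40"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(element):
    """Strip the {namespace} prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the OCR text of an ALTO document, one string per TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child) == "String"]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

Because the coordinates are kept alongside the text, the same file can drive both search (text) and highlighting on the page image, which is what makes it a workable source for derivatives.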

[1] Allusion relates to an article I came across last night, Refurbishing the Camelot of Scholarship: How to Improve the Digital Contribution of the PDF Research Article. -Deborah

Comment: Budget of 0 so upload pdf to Google Docs and let the settings there OCR it. Little success with older material though.

Comment: Someone at Access conference (Art Rhyno from UofWindsor) has had good luck with open source Tesseract.

Comment: Experimented with Tesseract, ABBYY - problems with the latter.
A: Tried writing to ABBYY re problems but no luck.

Comment: Option of using multiple search engines to increase chance of getting a hit. Can render marvellously different results. So training package very valuable because it's in context of your collection.
A: Then can use trained package on new documents.

Q: How does file size impact decision on format?
A: Often split it up to keep each file to 10MB - per chapter or per 50 pages. Otherwise risk compromising quality. Best to do this within FineReader to target dpi/quality. Because this is just the delivery file - we keep preservation masters.
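The per-50-pages rule is easy to plan ahead of time. A tiny sketch that works out the page ranges for each delivery file; the function name is illustrative, and the actual splitting would still be done in FineReader as described:

```python
def chunk_page_ranges(total_pages, chunk_size=50):
    """Return 1-based inclusive (first, last) page ranges for each chunk."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [(start, min(start + chunk_size - 1, total_pages))
            for start in range(1, total_pages + 1, chunk_size)]


# A hypothetical 120-page volume
print(chunk_page_ranges(120))
```

Splitting per chapter instead would need the chapter start pages as input, which this sketch does not attempt.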

Q: When do you decide the OCR's not good enough and better to transcribe?
A: Outsourced transcription on one project to India and excellent job but expensive, dense text, not in English, hard to proofread. Now use OCR only and provide warnings if quality not good.

Comment: Anyone transcribing? Crowdsourcing transcribing?
Comment: Would need automated software
Comment: Like Trove / National Library of Australia
Comment: This proves there are keen people out there
Comment: Also Project Gutenberg Distributed Proofreaders - volunteers proofread a page at a time and each page is proofread multiple times
Comment: Can add layers of rigour

Q: Anyone collecting pdf as born digital?
A: Yes, Journal of Polynesian Society comes born digital, his job is just to split it as appropriate. Once with New Zealand Journal of History an author wanted him to add a section that the journal had missed out. He did it but marked very carefully that he'd done it!

A: Has anyone used XMP metadata?
Comment: We did for Flickr - works but it's not fun. Software around worldview is incomplete.