Impact Analysis of OCR Quality on Research Tasks in Digital Archives

Traub, Myriam; van Ossenbruggen, Jacco; Hardman, Lynda

M.C. Traub (Myriam), J.R. van Ossenbruggen (Jacco) and L. Hardman (Lynda)

2015-09-01

Presented at the International Conference on Theory and Practice of Digital Libraries, Poznan, Poland

Humanities scholars increasingly rely on digital archives for their research in place of time-consuming visits to physical archives. This shift in research methodology has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. Based on the interviews and a literature study, we provide a classification scheme relating scholarly research tasks to their specific OCR-induced uncertainty and the data required for more reliable uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge situation on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs further research to be improved.

Additional Metadata
Keywords	OCR Quality, Digital Libraries, Digital Humanities
ACM	General Literature (acm A)
THEME	Information (theme 2)
Project	COMMIT: Socially Enriched Access to Linked Cultural Media (P06)
Conference	International Conference on Theory and Practice of Digital Libraries
Note	Published paper DOI: 10.1007/978-3-319-24592-8_19
Organisation	Human-Centered Data Analytics
Citation	Traub, M., van Ossenbruggen, J., & Hardman, L. (2015, September). Impact Analysis of OCR Quality on Research Tasks in Digital Archives.

Free Full Text (Author Manuscript, 251kb)