Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Traub, Myriam; van Ossenbruggen, Jacco; Samar, Thaer; Hardman, Lynda

doi:10.1145/3197026.3197046

M.C. Traub (Myriam), J.R. van Ossenbruggen (Jacco), T. Samar (Thaer) and L. Hardman (Lynda)

2018-06-03

Presented at the IEEE/ACM Joint Conference on Digital Libraries (June 2018), Fort Worth, TX, USA

Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually corrected versions of the same documents, and report on differences in their total sum, the overall retrievability bias, and the distribution of these changes over the documents, queries and query terms. For large collections, often only a fraction of the corpus is manually corrected. Using a mixed corpus, we assess how this mix affects the retrievability of the corrected and uncorrected documents. The correction of OCR errors increased the number of documents retrieved in all conditions. The increase contributed to a less biased retrieval, even when taking the potential lower ranking of uncorrected documents into account.
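The abstract does not spell out how the retrievability scores or the bias were computed. As a rough illustration only, the sketch below assumes the commonly used cumulative retrievability measure (a document scores one point each time it appears in the top `cutoff` results of a query) and the Gini coefficient as the bias measure. The function names, the cutoff value, and the toy rankings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def retrievability_scores(rankings, doc_ids, cutoff=10):
    """Cumulative retrievability (assumed measure): for each document,
    count the queries that return it within the top `cutoff` ranks."""
    counts = Counter()
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            counts[doc_id] += 1
    # Documents that were never retrieved get an explicit score of 0.
    return {doc_id: counts.get(doc_id, 0) for doc_id in doc_ids}


def gini(values):
    """Gini coefficient of a list of non-negative scores:
    0.0 means perfectly equal scores, values near 1.0 mean strong bias."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical toy data: query -> ranked list of document ids.
docs = ["d1", "d2", "d3", "d4"]
rankings_ocr = {"q1": ["d1", "d2"], "q2": ["d1"], "q3": ["d1", "d3"]}
rankings_corrected = {"q1": ["d1", "d2"], "q2": ["d4", "d1"], "q3": ["d3", "d1"]}

scores_ocr = retrievability_scores(rankings_ocr, docs)
scores_corrected = retrievability_scores(rankings_corrected, docs)

print("total r(d), OCR text:      ", sum(scores_ocr.values()))
print("total r(d), corrected text:", sum(scores_corrected.values()))
print("Gini, OCR text:      ", gini(list(scores_ocr.values())))
print("Gini, corrected text:", gini(list(scores_corrected.values())))
```

In this toy setup the corrected rankings retrieve a document that the OCR rankings never returned, so the total score rises and the Gini coefficient falls, which mirrors the kind of comparison the abstract describes.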

Additional Metadata
Keywords	data quality, digital library, ocr
Persistent URL	doi.org/10.1145/3197026.3197046
Project	Commit: Time Trails (P019); Web Archives Retrieval Tools; A Europe-wide Interoperable Virtual Research Environment to Empower Multidisciplinary Research Communities and Accelerate Innovation and Collaboration
Conference	IEEE/ACM Joint Conference on Digital Libraries
Grant	This work was funded by the Netherlands Organisation for Scientific Research (NWO); grant id nwo/640-005-001 - Web Archives Retrieval Tools. This work was funded by the European Commission 7th Framework Programme; grant id h2020/676247 - A Europe-wide Interoperable Virtual Research Environment to Empower Multidisciplinary Research Communities and Accelerate Innovation and Collaboration (VRE4EIC)
Organisation	Human-Centered Data Analytics
Citation	Traub, M., van Ossenbruggen, J., Samar, T., & Hardman, L. (2018). Impact of Crowdsourcing OCR Improvements on Retrievability Bias. ACM International Conference Proceeding Series, 29–36. https://doi.org/10.1145/3197026.3197046
