Querylog-based assessment of retrievability bias in a large newspaper corpus

Traub, Myriam; Samar, Thaer; van Ossenbruggen, Jacco; He, Jiyin; de Vries, Arjen; Hardman, Lynda


2016-06-01


Presented at the IEEE/ACM Joint Conference on Digital Libraries, Newark, NJ, USA

Bias in the retrieval of documents can directly influence information access in a digital library. In the worst case, systematic favoritism for a certain type of document can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the retrievability of all documents in a collection. Previous evaluations have been performed on TREC collections using simulated query sets. The question remains, however, how representative this approach is of more realistic settings. To address this question, we investigate the effectiveness of the retrievability measure using a large digitized newspaper corpus, featuring two characteristics that distinguish our experiments from previous studies: (1) compared to TREC collections, our collection contains noise originating from OCR processing, historical spelling and use of language; and (2) instead of simulated queries, the collection comes with real user query logs including click data. First, we assess the retrievability bias imposed on the newspaper collection by different IR models. We assess the retrievability measure and confirm its ability to capture the retrievability bias in our setup. Second, we show how simulated queries differ from real user queries regarding term frequency and prevalence of named entities, and how this affects the retrievability results.
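As an illustration of what measuring retrievability can involve, the following minimal sketch counts, for each document, how many queries rank it within a cutoff, and then summarizes the inequality of those counts with a Gini coefficient. The abstract does not state the exact formulation used in the paper; the cumulative, cutoff-based score and the Gini summary shown here are assumptions based on the commonly used definition of retrievability, and the function names, toy rankings and cutoff value are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def retrievability(rankings: Dict[str, List[str]],
                   doc_ids: Iterable[str],
                   cutoff: int) -> Dict[str, int]:
    """Count, per document, the queries that rank it within the top `cutoff`.

    Assumption: every query carries equal weight and a document scores 1
    for a query if it appears at rank <= cutoff, otherwise 0.
    """
    # Start every document at 0 so never-retrieved documents stay visible.
    scores = Counter({doc_id: 0 for doc_id in doc_ids})
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return dict(scores)


def gini(values: Iterable[float]) -> float:
    """Gini coefficient: 0 means all scores are equal, values near 1 mean
    a few documents receive almost all of the retrievability."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Toy data (hypothetical): three queries, four documents, cutoff of 2.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d1", "d3"],
    "q3": ["d1", "d2"],
}
scores = retrievability(rankings, ["d1", "d2", "d3", "d4"], cutoff=2)
bias = gini(scores.values())
```

In this toy setting d1 is retrieved by all three queries while d4 is never retrieved, so the Gini coefficient is well above zero. The same computation can be run once with simulated queries and once with logged user queries to compare the resulting bias.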

Additional Metadata
Keywords	Retrievability Bias, User Query Logs, Digital Library, Digital Humanities
ACM	General Literature (acm A)
THEME	Information (theme 2)
Editor	N.R. Adam, B. Cassel, Y. Yesha
Project	COMMIT: Socially Enriched Access to Linked Cultural Media (P06), Behavior-aware Search Evaluation for Information Retrieval
Conference	IEEE/ACM Joint Conference on Digital Libraries
Grant	This work was funded by The Netherlands Organisation for Scientific Research (NWO); grant id nwo/13675 - Behavior-aware Search Evaluation for Information Retrieval
Organisation	Human-Centered Data Analytics
Citation	Traub, M., Samar, T., van Ossenbruggen, J., He, J., de Vries, A., & Hardman, L. (2016). Querylog-based assessment of retrievability bias in a large newspaper corpus. In N. R. Adam, B. Cassel, & Y. Yesha (Eds.), .

