1 million captioned Dutch newspaper images

Elliott, Desmond; Kleppe, Martijn

Images naturally appear alongside text in a wide variety of media, such as books, magazines, newspapers, and online articles. This type of multi-modal data offers an interesting basis for vision and language research, but most existing datasets use crowdsourced text, which removes the images from their original context. In this paper, we introduce the KBK-1M dataset of 1.6 million images in their original context, with co-occurring texts found in Dutch newspapers from 1922-1994. The images are digitally scanned photographs, cartoons, sketches, and weather forecasts; the text is generated from OCR-scanned blocks. The dataset is suitable for experiments in automatic image captioning, image-article matching, object recognition, and data-to-text generation for weather forecasting. Humanities scholars can also use it to analyse changes in photographic style and the representation of people and societal issues, and to develop new tools for exploring photograph reuse via image-similarity-based search.

Additional Metadata
Keywords	Digital humanities, Digitised newspapers, Language and vision
Conference	International Conference on Language Resources and Evaluation
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Elliott, D., & Kleppe, M. (2016). 1 million captioned Dutch newspaper images. In Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 (pp. 3054–3058).

