A corpus of images and text in online news

Hollink, Laura; Bedjeti, Adriatik; van Harmelen, M.; Elliott, Desmond

2016-05-23

Presented at the 10th International Conference on Language Resources and Evaluation, LREC 2016 (May 2016), Portorož, Slovenia

In recent years, several datasets have been released that include images and text, giving impetus to new methods that combine natural language processing and computer vision. However, there is a need for datasets of images in their natural textual context. The ION corpus contains 300K news articles published between August 2014 and August 2015 in five online newspapers from two countries. The one-year coverage over multiple publishers ensures a broad scope in terms of topics, image quality and editorial viewpoints. The corpus consists of JSON-LD files with the following data about each article: the original URL of the article on the news publisher’s website, the date of publication, the headline of the article, the URL of the image displayed with the article (if any), and the caption of that image. Neither the article text nor the images themselves are included in the corpus. Instead, the images are distributed as high-dimensional feature vectors extracted from a Convolutional Neural Network, anticipating their use in computer vision tasks. The article text is represented as a list of automatically generated entity and topic annotations in the form of Wikipedia/DBpedia pages. This facilitates the selection of subsets of the corpus for separate analysis or evaluation.
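To make the subset selection mentioned above concrete, the following is a minimal Python sketch. It assumes each article record carries the fields listed in the abstract plus a list of DBpedia resource URIs. All key names (`url`, `datePublished`, `headline`, `image`, `caption`, `annotations`) and the sample records are hypothetical placeholders chosen for illustration; they are not taken from the published ION files, so the actual JSON-LD vocabulary should be checked against the dataset itself.

```python
import json

# Hypothetical record layout: the key names and values below are
# illustrative assumptions, not the published ION schema.
SAMPLE = """
[
  {"url": "http://example.org/news/1",
   "datePublished": "2014-09-01",
   "headline": "Example headline one",
   "image": {"url": "http://example.org/img/1.jpg", "caption": "Example caption"},
   "annotations": ["http://dbpedia.org/resource/Football"]},
  {"url": "http://example.org/news/2",
   "datePublished": "2015-01-15",
   "headline": "Example headline two",
   "image": null,
   "annotations": ["http://dbpedia.org/resource/Economy"]}
]
"""


def select_by_annotation(articles, resource):
    """Return the articles whose annotation list contains the given resource URI."""
    return [a for a in articles if resource in a.get("annotations", [])]


def with_image(articles):
    """Return the articles that have an image entry (the image is optional)."""
    return [a for a in articles if a.get("image")]


articles = json.loads(SAMPLE)
football = select_by_annotation(articles, "http://dbpedia.org/resource/Football")
print([a["headline"] for a in football])
print(len(with_image(articles)))
```

Filtering on annotation URIs rather than on headline keywords mirrors the abstract's point that the entity and topic annotations stand in for the article text, which is not distributed with the corpus.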

Additional Metadata
Keywords	Image features, Online news, Topic extraction
THEME	Information (theme 2)
Conference	10th International Conference on Language Resources and Evaluation, LREC 2016
Organisation	Human-Centered Data Analytics
Citation	Hollink, L., Bedjeti, A., van Harmelen, M., & Elliott, D. (2016). A corpus of images and text in online news. In Proceedings of International Conference on Language Resources and Evaluation 2016 (LREC 10) (pp. 1377–1382).

Full Text (Final Version, 478kb)

Additional Files
24397A.pdf (Author Manuscript, 464kb)

See Also
dataset The ION corpus L. Hollink (Laura), A. Bedjeti (Adriatik), M. van Harmelen and D. Elliott (Desmond)
