Visual Genome: Connecting language and vision using crowdsourced dense image annotations

Krishna, Ranjay; Zhu, Yuke; Groth, Oliver; Johnson, Justin; Hata, Kenji; Kravitz, Joshua; Chen, Stephanie; Kalantidis, Yannis; Li, Li-Jia; Shamma, Ayman; Bernstein, Michael; Fei-Fei, Li

doi:10.1007/s11263-016-0981-7

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked “What vehicle is the person riding?”, computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that “the person is riding a horse-drawn carriage.” In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of (Formula presented.) objects, (Formula presented.) attributes, and (Formula presented.) pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
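The carriage example in the abstract can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the class names `Relationship` and `SceneGraph` and the methods `add_relationship`, `objects_of`, and `subjects_of` are hypothetical and are not the Visual Genome data format or API. It stores the two relationships named in the abstract as predicate(subject, object) triples and looks up what the man is riding and what pulls it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """One predicate(subject, object) triple, e.g. riding(man, carriage)."""
    predicate: str
    subject: str
    obj: str


@dataclass
class SceneGraph:
    """Hypothetical minimal scene graph: a set of objects plus triples."""
    objects: set = field(default_factory=set)
    relationships: list = field(default_factory=list)

    def add_relationship(self, predicate, subject, obj):
        # Register both endpoints as objects, then record the triple.
        self.objects.update((subject, obj))
        self.relationships.append(Relationship(predicate, subject, obj))

    def objects_of(self, predicate, subject):
        # All objects X such that predicate(subject, X) is in the graph.
        return [r.obj for r in self.relationships
                if r.predicate == predicate and r.subject == subject]

    def subjects_of(self, predicate, obj):
        # All subjects X such that predicate(X, obj) is in the graph.
        return [r.subject for r in self.relationships
                if r.predicate == predicate and r.obj == obj]


# The two relationships named in the abstract.
graph = SceneGraph()
graph.add_relationship("riding", "man", "carriage")
graph.add_relationship("pulling", "horse", "carriage")

# "What vehicle is the person riding?" -> follow riding(man, ?),
# then check what pulls that vehicle via pulling(?, carriage).
vehicles = graph.objects_of("riding", "man")
pullers = graph.subjects_of("pulling", vehicles[0])
print(f"The person is riding a {pullers[0]}-drawn {vehicles[0]}.")
```

Combining both triples is what allows the answer "horse-drawn carriage" rather than just "carriage", which is the kind of relational reasoning the abstract argues perceptual datasets do not support.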

Additional Metadata
Keywords	Attributes, Computer vision, Crowdsourcing, Dataset, Image, Knowledge, Language, Objects, Question answering, Relationships, Scene graph
Persistent URL	doi.org/10.1007/s11263-016-0981-7
Journal	International Journal of Computer Vision
Organisation	Centrum Wiskunde & Informatica, Amsterdam (CWI), The Netherlands
Citation	Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, A., Bernstein, M., & Fei-Fei, L. (2017). Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123, 32–73. https://doi.org/10.1007/s11263-016-0981-7
