Description: CLEVR is a dataset for visual question answering that can be formalized as a spatial-graph dataset. Each image contains $10$ objects at different 3D locations. Each object is identified by its shape, such as sphere, cylinder, or cube. The directed relationship between two objects is categorized into four types: right, behind, front, and left. Each image can therefore be formalized as a labeled directed graph with typed nodes and typed edges, in which the spatial information of each node is closely correlated with the types of the edges between each pair of nodes. There are 70,000 training samples and 15,000 testing samples.
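The graph formalization described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `SceneObject` class, the `build_scene_graph` function, the choice of the x axis for left/right and the y axis for front/behind, and the edge convention are hypothetical and are not taken from the dataset's own tooling, which may encode relations differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    """Hypothetical container for one object in an image."""
    shape: str                            # node type, e.g. "sphere", "cylinder", "cube"
    position: Tuple[float, float, float]  # 3D location (x, y, z)


def build_scene_graph(
    objects: List[SceneObject],
) -> Tuple[List[str], List[Tuple[int, int, str]]]:
    """Return (node_types, edges); each edge is a (src, dst, relation) triple.

    Assumed convention: an edge (i, j, r) means "object j is r of object i".
    Assumed axes: larger x is further right, larger y is further behind.
    """
    node_types = [obj.shape for obj in objects]
    edges = []
    for i, a in enumerate(objects):
        for j, b in enumerate(objects):
            if i == j:
                continue
            # Left/right is decided by the x coordinate.
            if b.position[0] > a.position[0]:
                edges.append((i, j, "right"))
            elif b.position[0] < a.position[0]:
                edges.append((i, j, "left"))
            # Front/behind is decided by the y coordinate.
            if b.position[1] > a.position[1]:
                edges.append((i, j, "behind"))
            elif b.position[1] < a.position[1]:
                edges.append((i, j, "front"))
    return node_types, edges


if __name__ == "__main__":
    scene = [
        SceneObject("cube", (0.0, 0.0, 0.0)),
        SceneObject("sphere", (1.0, 2.0, 0.0)),
    ]
    types, rels = build_scene_graph(scene)
    print(types)
    for edge in rels:
        print(edge)
```

Because every edge type here is derived from node coordinates, the sketch makes the correlation between node positions and edge types explicit: moving one object changes the labels of all edges that touch it.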
Statistics:
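| Name | Type | #Graphs | #Nodes | #Edges | Attributed | Directed | Weighted | Signed | Homogeneous | Spatial | Temporal | Labels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLEVR | Scene Graphs | 85,000 | 6 | ~40 | YES | YES | YES | NO | YES | 3D | NO | NO |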
Acknowledgement: Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.