Graphical contrastive losses for scene graph parsing

Ji Zhang, Kevin J. Shih, Ahmed Elgammal, Andrew Tao, Bryan Catanzaro

Research output: Chapter in Book/Report/Conference proceedingConference contribution

5 Scopus citations

Abstract

Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g. multiple cups). The second, Proximal Relationship Ambiguity, arises when multiple subject-predicate-object triplets appear in close proximity with the same predicate, and the model struggles to infer the correct subject-object pairings (e.g. mis-pairing musicians and their instruments). We propose a set of contrastive loss formulations that specifically target these types of errors within the scene graph parsing problem, collectively termed the Graphical Contrastive Losses. These losses explicitly force the model to disambiguate related and unrelated instances through margin constraints specific to each type of confusion. We further construct a relationship detector, called RelDN, using the aforementioned pipeline to demonstrate the efficacy of our proposed losses. Our model outperforms the winning method of the OpenImages Relationship Detection Challenge by 4.7% (16.5% relatively) on the test set. We also show improved results over the best previous methods on the Visual Genome and Visual Relationship Detection datasets.

Original languageEnglish (US)
Title of host publicationProceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
PublisherIEEE Computer Society
Pages11527-11535
Number of pages9
ISBN (Electronic)9781728132938
DOIs
StatePublished - Jun 2019
Event32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 - Long Beach, United States
Duration: Jun 16 2019Jun 20 2019

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume2019-June
ISSN (Print)1063-6919

Conference

Conference32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019
CountryUnited States
CityLong Beach
Period6/16/196/20/19

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition

Keywords

  • Categorization
  • Recognition: Detection
  • Retrieval
  • Scene Analysis and Understanding

Fingerprint Dive into the research topics of 'Graphical contrastive losses for scene graph parsing'. Together they form a unique fingerprint.

  • Cite this

    Zhang, J., Shih, K. J., Elgammal, A., Tao, A., & Catanzaro, B. (2019). Graphical contrastive losses for scene graph parsing. In Proceedings - 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019 (pp. 11527-11535). [8954006] (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2019-June). IEEE Computer Society. https://doi.org/10.1109/CVPR.2019.01180