TY - GEN
T1 - Cross-Modal Coherence for Text-to-Image Retrieval
AU - Alikhani, Malihe
AU - Han, Fangda
AU - Ravi, Hareesh
AU - Kapadia, Mubbasir
AU - Pavlovic, Vladimir
AU - Stone, Matthew
N1 - Funding Information:
The research presented in this paper has been supported by NSF awards IIS-1703883, IIS-1955404, IIS-1955365, RETTL-2119265, IIS-1526723, CCF-1934924, and EAGER-2122119, and through generous donations from Adobe.
Publisher Copyright:
Copyright © 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2022/6/30
Y1 - 2022/6/30
AB - Common image-text joint understanding techniques presume that images and the associated text can universally be characterized by a single implicit model. However, co-occurring images and text can be related in qualitatively different ways, and explicitly modeling these relations could improve the performance of current joint understanding models. In this paper, we train a Cross-Modal Coherence Model for the text-to-image retrieval task. Our analysis shows that models trained with image-text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models. We also show via human evaluation that images retrieved by the proposed coherence-aware model are preferred over those retrieved by a coherence-agnostic baseline by a substantial margin. Our findings provide insights into the ways that different modalities communicate and the role of coherence relations in capturing commonsense inferences in text and imagery.
UR - http://www.scopus.com/inward/record.url?scp=85147542541&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147542541&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85147542541
T3 - Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022
SP - 10427
EP - 10435
BT - AAAI-22 Technical Tracks 10
PB - Association for the Advancement of Artificial Intelligence
T2 - 36th AAAI Conference on Artificial Intelligence, AAAI 2022
Y2 - 22 February 2022 through 1 March 2022
ER -