TY - GEN
T1 - Learning Disentangled Factors from Paired Data in Cross-Modal Retrieval
T2 - 29th ACM International Conference on Multimedia, MM 2021
AU - Kim, Minyoung
AU - Guerrero, Ricardo
AU - Pavlovic, Vladimir
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/10/17
Y1 - 2021/10/17
N2 - We tackle the problem of learning the underlying disentangled latent factors shared between paired bi-modal data in cross-modal retrieval. Typically, the data in both modalities are complex, structured, and high-dimensional (e.g., image and text), and conventional deep auto-encoding latent variable models such as the Variational Autoencoder (VAE) often struggle with accurate decoder training or realistic synthesis. In this paper, we propose the implicit decoder, which completely removes the ambient data decoding module from a latent variable model via implicit encoder inversion, achieved by Jacobian regularization of the low-dimensional embedding function. Motivated by the recent Identifiable-VAE (IVAE) model, we modify it to incorporate the query modality data as a conditioning auxiliary input, which allows us to prove that the true parameters of the model are identifiable under some regularity conditions. Tested on various datasets where the true factors are fully or partially available, our model identifies the factors accurately, significantly outperforming conventional latent variable models.
KW - cross-modal retrieval
KW - factor analysis
KW - latent variable model
KW - multi-modal data analysis
UR - http://www.scopus.com/inward/record.url?scp=85119365546&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119365546&partnerID=8YFLogxK
U2 - 10.1145/3474085.3475448
DO - 10.1145/3474085.3475448
M3 - Conference contribution
AN - SCOPUS:85119365546
T3 - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
SP - 2862
EP - 2870
BT - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 20 October 2021 through 24 October 2021
ER -