TY - JOUR
T1 - Neural multisensory scene inference
AU - Lim, Jae Hyun
AU - Pinheiro, Pedro O.
AU - Rostamzadeh, Negar
AU - Pal, Christopher
AU - Ahn, Sungjin
N1 - Funding Information:
JL would like to thank Chin-Wei Huang, Shawn Tan, Tatjana Chavdarova, Arantxa Casanova, Ankesh Anand, and Evan Racah for helpful comments and advice. SA thanks Kakao Brain, the Center for Super Intelligence (CSI), and Element AI for their support. CP also thanks NSERC and PROMPT.
Publisher Copyright:
© 2019 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2019
Y1 - 2019
AB - For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention compared to the unimodal setting. In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes which are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve the computational efficiency and the robustness to unseen combinations of modalities at test time. Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To perform this exploration, we also develop the Multisensory Embodied 3D-Scene Environment (MESE).
UR - http://www.scopus.com/inward/record.url?scp=85090175133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090175133&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85090175133
SN - 1049-5258
VL - 32
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 33rd Annual Conference on Neural Information Processing Systems, NeurIPS 2019
Y2 - 8 December 2019 through 14 December 2019
ER -