TY - GEN
T1 - Human conversation analysis using attentive multimodal networks with hierarchical encoder-decoder
AU - Gu, Yue
AU - Fu, Shiyu
AU - Li, Xinyu
AU - Yang, Kangning
AU - Huang, Kaixiang
AU - Chen, Shuhong
AU - Zhou, Moliang
AU - Marsic, Ivan
N1 - Funding Information:
We would like to thank the four reviewers for their valuable feedback. This research was supported in part by the National Institutes of Health under Award Number R01LM011834.
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/10/15
Y1 - 2018/10/15
N2 - Human conversation analysis is challenging because meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoder-decoder structure with an attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data that are then formulated into conversation-level features. The corresponding hierarchical decoder is able to predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluated our system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperformed previous state-of-the-art approaches in both classification and regression tasks on three datasets. We also outperformed previous approaches in generalization tests on two commonly used datasets. We achieved comparable performance in predicting co-existing labels using the proposed model instead of multiple individual models. In addition, the easily visualized modality and temporal attention demonstrated that the proposed attention mechanism helps feature selection and improves model interpretability.
KW - Attention Mechanism
KW - Hierarchical Encoder-Decoder Structure
KW - Human Conversation Analysis
KW - Sensor Fusion
UR - http://www.scopus.com/inward/record.url?scp=85058231083&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058231083&partnerID=8YFLogxK
U2 - 10.1145/3240508.3240714
DO - 10.1145/3240508.3240714
M3 - Conference contribution
AN - SCOPUS:85058231083
T3 - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
SP - 537
EP - 545
BT - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
PB - Association for Computing Machinery, Inc
T2 - 26th ACM Multimedia Conference, MM 2018
Y2 - 22 October 2018 through 26 October 2018
ER -