TY - GEN
T1 - Mutual correlation attentive factors in dyadic fusion networks for speech emotion recognition
AU - Gu, Yue
AU - Li, Weitian
AU - Lyu, Xinyu
AU - Chen, Shuhong
AU - Marsic, Ivan
AU - Sun, Weijia
AU - Li, Xinyu
PY - 2019/10/15
Y1 - 2019/10/15
N2 - Emotion recognition in dyadic communication is challenging because: 1. Extracting informative modality-specific representations requires disparate feature extractor designs due to the heterogeneous input data formats. 2. Effectively and efficiently fusing unimodal features and learning associations between dyadic utterances is critical to model generalization in real-world scenarios. 3. Disagreeing annotations prevent previous approaches from precisely predicting emotions in context. To address these issues, we propose an efficient dyadic fusion network that relies only on an attention mechanism to select representative vectors, fuse modality-specific features, and learn sequence information. Our approach has three distinct characteristics: 1. Instead of using a recurrent neural network to extract temporal associations as in most previous research, we introduce multiple sub-view attention layers to compute the relevant dependencies among sequential utterances; this significantly improves model efficiency. 2. To improve fusion performance, we design a learnable mutual correlation factor inside each attention layer to compute associations across different modalities. 3. To overcome the label disagreement issue, we embed the labels from all annotators into a k-dimensional vector and transform the categorical problem into a regression problem; this method provides more accurate annotation information and makes full use of the entire dataset. We evaluate the proposed model on two public multimodal emotion recognition datasets: IEMOCAP and MELD. Our model significantly outperforms previous state-of-the-art research by 3.8%-7.5% in accuracy, while using a more efficient model.
KW - Attention Mechanism
KW - Dyadic Communication
KW - Multimodal Fusion Network
KW - Mutual Correlation Attentive Factor
KW - Speech Emotion Recognition
UR - http://www.scopus.com/inward/record.url?scp=85074858940&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85074858940&partnerID=8YFLogxK
U2 - 10.1145/3343031.3351039
DO - 10.1145/3343031.3351039
M3 - Conference contribution
T3 - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
SP - 157
EP - 165
BT - MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 27th ACM International Conference on Multimedia, MM 2019
Y2 - 21 October 2019 through 25 October 2019
ER -