TY - GEN
T1 - DETECTING HIGHLIGHTED VIDEO CLIPS THROUGH EMOTION-ENHANCED AUDIO-VISUAL CUES
AU - Hu, Linkang
AU - He, Weidong
AU - Zhang, Le
AU - Xu, Tong
AU - Xiong, Hui
AU - Chen, Enhong
N1 - Funding Information:
This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2018YFB1402600) and the National Natural Science Foundation of China (Grant Nos. 61727809, 62072423, 91746301).
Publisher Copyright:
© 2021 IEEE
PY - 2021
Y1 - 2021
N2 - Recent years have witnessed growing research interest in video highlight detection. Existing studies mainly focus on detecting highlights in user-generated videos with simple topics based on visual content. However, relying solely on visual features limits the ability of conventional methods to capture highlights in videos with more complicated semantics, such as movies. Therefore, we propose to mine the emotional information in video sounds to enhance highlight detection. Specifically, we design a novel emotion-enhanced framework with multi-stage fusion to detect highlights in complex videos. Along this line, we first extract multi-grained features from the audio waves. Then, a tailor-designed intra-modal fusion is applied to the audio features to obtain an emotional representation. Furthermore, cross-modal fusion is developed to generate a comprehensive representation of each clip by merging the audio emotional representations and visual features. This representation can be leveraged to predict highlight probability. Finally, extensive experiments on real-world datasets demonstrate the effectiveness of our method.
AB - Recent years have witnessed growing research interest in video highlight detection. Existing studies mainly focus on detecting highlights in user-generated videos with simple topics based on visual content. However, relying solely on visual features limits the ability of conventional methods to capture highlights in videos with more complicated semantics, such as movies. Therefore, we propose to mine the emotional information in video sounds to enhance highlight detection. Specifically, we design a novel emotion-enhanced framework with multi-stage fusion to detect highlights in complex videos. Along this line, we first extract multi-grained features from the audio waves. Then, a tailor-designed intra-modal fusion is applied to the audio features to obtain an emotional representation. Furthermore, cross-modal fusion is developed to generate a comprehensive representation of each clip by merging the audio emotional representations and visual features. This representation can be leveraged to predict highlight probability. Finally, extensive experiments on real-world datasets demonstrate the effectiveness of our method.
KW - multimodal fusion
KW - multimodal video analysis
KW - video highlight detection
UR - http://www.scopus.com/inward/record.url?scp=85126440674&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126440674&partnerID=8YFLogxK
U2 - 10.1109/ICME51207.2021.9428252
DO - 10.1109/ICME51207.2021.9428252
M3 - Conference contribution
AN - SCOPUS:85126440674
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
PB - IEEE Computer Society
T2 - 2021 IEEE International Conference on Multimedia and Expo, ICME 2021
Y2 - 5 July 2021 through 9 July 2021
ER -