VLP2MSA: Expanding vision-language pre-training to multimodal sentiment analysis
Yi, Guofeng (1); Fan, Cunhang (1); Zhu, Kang (1); Lv, Zhao (1); Liang, Shan (5); Wen, Zhengqi (4); Pei, Guanxiong (3); Li, Taihao (3); Tao, Jianhua (2)
Journal: KNOWLEDGE-BASED SYSTEMS
Publication date: 2024-01-11
Volume: 283; Pages: 9
Keywords: Multimodal sentiment analysis; Vision-language; Multimodal fusion
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2023.111136
Corresponding author: Fan, Cunhang (cunhang.fan@ahu.edu.cn)
Abstract: Large-scale vision-and-language representation learning has improved performance on various joint vision-language downstream tasks. In this work, our objective is to extend it effectively to multimodal sentiment analysis and to address two pressing challenges in this field: (1) the low contribution of the visual modality and (2) the design of an effective multimodal fusion architecture. To overcome the imbalance between the visual and textual modalities, we propose an inter-frame hybrid transformer that extends the recent CLIP and TimeSformer architectures. This module extracts spatiotemporal features from sparsely sampled video frames, not only focusing on facial expressions but also capturing body-movement information, providing a more comprehensive visual representation than the traditional direct use of pre-extracted facial features. Additionally, we tackle the challenge of modality heterogeneity in the fusion architecture by introducing a new scheme that prompts and aligns the video and text information before fusing them. Specifically, we generate discriminative text prompts based on the video content to enhance the text representation and align the unimodal video-text features using a video-text contrastive loss. Our proposed end-to-end trainable model achieves state-of-the-art performance on three widely used datasets, MOSI, MOSEI, and CH-SIMS, using only two modalities. These experimental results validate the effectiveness of our approach for multimodal sentiment analysis.
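The video-text alignment step mentioned in the abstract can be illustrated with a symmetric contrastive (InfoNCE-style) objective over pooled unimodal embeddings. The following is a minimal sketch only, assuming a batch of matched video-text pairs and a fixed temperature; the function name, dimensions, and hyperparameters are illustrative and are not taken from the paper.

# Minimal sketch of a symmetric video-text contrastive loss, assuming
# pooled unimodal embeddings where matched pairs share a batch index.
# Names and the temperature value are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positive pairs.
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)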
Funding projects: STI 2030-Major Projects [2021ZD0201500]; National Natural Science Foundation of China (NSFC) [62201002]; National Natural Science Foundation of China (NSFC) [61972437]; Excellent Youth Foundation of Anhui Scientific Committee [2208085J05]; Special Fund for Key Program of Science and Technology of Anhui Province [202203a07020008]; Open Research Projects of Zhejiang Lab [2021KH0AB06]; Open Projects Program of National Laboratory of Pattern Recognition [202200014]
WOS research area: Computer Science
Language: English
Publisher: ELSEVIER
WOS accession number: WOS:001108284900001
Funding organizations: STI 2030-Major Projects; National Natural Science Foundation of China (NSFC); Excellent Youth Foundation of Anhui Scientific Committee; Special Fund for Key Program of Science and Technology of Anhui Province; Open Research Projects of Zhejiang Lab; Open Projects Program of National Laboratory of Pattern Recognition
Content type: Journal article
Source URL: http://ir.ia.ac.cn/handle/173211/55122
Collection: National Laboratory of Pattern Recognition_Intelligent Interaction
Author affiliations:
1. Anhui Univ, Sch Comp Sci & Technol, Anhui Prov Key Lab Multimodal Cognit Comp, Hefei, Peoples R China
2. Tsinghua Univ, Dept Automat, Beijing, Peoples R China
3. Zhejiang Lab, Inst Artificial Intelligence, Hangzhou, Peoples R China
4. Qiyuan Lab, Beijing, Peoples R China
5. Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Recommended citation:
GB/T 7714: Yi, Guofeng, Fan, Cunhang, Zhu, Kang, et al. VLP2MSA: Expanding vision-language pre-training to multimodal sentiment analysis[J]. KNOWLEDGE-BASED SYSTEMS, 2024, 283: 9.
APA: Yi, Guofeng, Fan, Cunhang, Zhu, Kang, Lv, Zhao, Liang, Shan, ... & Tao, Jianhua. (2024). VLP2MSA: Expanding vision-language pre-training to multimodal sentiment analysis. KNOWLEDGE-BASED SYSTEMS, 283, 9.
MLA: Yi, Guofeng, et al. "VLP2MSA: Expanding vision-language pre-training to multimodal sentiment analysis". KNOWLEDGE-BASED SYSTEMS 283 (2024): 9.