Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering
Song, Yaguang [1,2,3]; Yang, Xiaoshan [1,2,3]; Wang, Yaowei [1,2,3]; Xu, Changsheng [1]
Journal: IEEE Transactions on Multimedia
Date: 2023-05-05
Pages: 1-15
Keywords: Multi-modal Foundation Model; Out-of-Distribution Generalization; Visual Question Answering; Knowledge Distillation
DOI: 10.1109/TMM.2023.3272224
Document Subtype: Journal Article
Abstract

With the emergence of large-scale multi-modal foundation models, significant improvements have been made towards Visual Question Answering (VQA) in recent years via the “Pre-training and Fine-tuning” paradigm. However, the fine-tuned VQA model, which is more specialized for the downstream training data, may fail to generalize well when there is a distribution shift between the training and test data, which is defined as the Out-of-Distribution (OOD) problem. An intuitive way to solve this problem is to transfer the common knowledge from the foundation model to the fine-tuned VQA model via knowledge distillation for better generalization. However, the generality of distilled knowledge based on the task-specific training data is questionable due to the bias between the training and test data. An ideal way is to adopt the pre-training data to distill the common knowledge shared by the training and OOD test samples, which however is impracticable due to the huge size of pre-training data. Based on the above considerations, in this paper, we propose a method, named Pre-training-like Knowledge Distillation (PKD), to imitate the pre-training feature distribution and leverage it to distill the common knowledge, which can improve the generalization performance of the fine-tuned model for OOD VQA. Specifically, we first leverage the in-domain VQA data as guidance and adopt two cross-modal feature prediction networks, which are learned under the supervision of image-text matching loss and feature divergence loss, to estimate pre-training-like vision and text features. Next, we conduct feature-level distillation by explicitly integrating the downstream VQA input features with the predicted pre-training-like features through a memory mechanism. In the meantime, we also conduct model-level distillation by constraining the image-text matching output of the downstream VQA model and the output of the foundation model for the pre-training-like image and text features. Extensive experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of our method.
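The abstract describes two distillation signals: a feature-level term that integrates the downstream VQA input features with predicted pre-training-like features through a memory mechanism, and a model-level term that keeps the fine-tuned model's image-text matching output close to that of the foundation model. Below is a minimal, hypothetical PyTorch sketch of how such terms could be instantiated; the module names, the attention-style memory read, the blending coefficient, and the KL temperature are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch (not the authors' code) of the two distillation mechanisms
# outlined in the abstract; all names and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PretrainLikePredictor(nn.Module):
    """Cross-modal feature prediction network: estimates a 'pre-training-like'
    feature of one modality from the downstream feature of the other."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def integrate_with_memory(vqa_feat, pretrain_like_feat, memory_bank, alpha=0.5):
    """Feature-level distillation (assumed form): read pre-training-like features
    from a memory bank via attention and blend them into the downstream feature."""
    attn = F.softmax(pretrain_like_feat @ memory_bank.t(), dim=-1)  # (B, slots)
    retrieved = attn @ memory_bank                                   # (B, dim)
    return alpha * vqa_feat + (1.0 - alpha) * retrieved


def model_level_distillation(student_itm_logits, teacher_itm_logits, tau=2.0):
    """Model-level distillation (assumed KL form): align the fine-tuned model's
    image-text matching output with the foundation model's output."""
    p_teacher = F.softmax(teacher_itm_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_itm_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2


if __name__ == "__main__":
    B, D, M = 4, 768, 32
    text_feat, vision_feat = torch.randn(B, D), torch.randn(B, D)
    memory = torch.randn(M, D)                    # stored pre-training-like features
    predictor = PretrainLikePredictor(D)
    pretrain_like_vision = predictor(text_feat)   # text -> pre-training-like vision
    fused = integrate_with_memory(vision_feat, pretrain_like_vision, memory)
    kd_loss = model_level_distillation(torch.randn(B, 2), torch.randn(B, 2))
    print(fused.shape, kd_loss.item())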

Language: English
Content Type: Journal Article
Source URL: http://ir.ia.ac.cn/handle/173211/51954
Collection: Institute of Automation, National Laboratory of Pattern Recognition, Multimedia Computing and Graphics Team
Corresponding Author: Xu, Changsheng
Author Affiliations:
1. Peng Cheng Laboratory
2. School of Artificial Intelligence, University of Chinese Academy of Sciences
3. State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
Recommended Citation
GB/T 7714
Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, et al. Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering[J]. IEEE Transactions on Multimedia, 2023: 1-15.
APA: Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, & Xu, Changsheng. (2023). Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering. IEEE Transactions on Multimedia, 1-15.
MLA: Song, Yaguang, et al. "Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering." IEEE Transactions on Multimedia (2023): 1-15.