Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering
Song, Yaguang [1,2,3]; Yang, Xiaoshan [1,2,3]; Wang, Yaowei [1,2,3]; Xu, Changsheng [1]
Journal: IEEE Transactions on Multimedia
Date: 2023-05-05
Pages: 1-15
Keywords: Multi-modal Foundation Model; Out-of-Distribution Generalization; Visual Question Answering; Knowledge Distillation
DOI: 10.1109/TMM.2023.3272224
Document Subtype: Journal Article
Abstract

With the emergence of large-scale multi-modal foundation models, significant improvements have been made towards Visual Question Answering (VQA) in recent years via the “Pre-training and Fine-tuning” paradigm. However, the fine-tuned VQA model, which is more specialized for the downstream training data, may fail to generalize well when there is a distribution shift between the training and test data, which is defined as the Out-of-Distribution (OOD) problem. An intuitive way to solve this problem is to transfer the common knowledge from the foundation model to the fine-tuned VQA model via knowledge distillation for better generalization. However, the generality of distilled knowledge based on the task-specific training data is questionable due to the bias between the training and test data. An ideal way is to adopt the pre-training data to distill the common knowledge shared by the training and OOD test samples, which however is impracticable due to the huge size of pre-training data. Based on the above considerations, in this paper, we propose a method, named Pre-training-like Knowledge Distillation (PKD), to imitate the pre-training feature distribution and leverage it to distill the common knowledge, which can improve the generalization performance of the fine-tuned model for OOD VQA. Specifically, we first leverage the in-domain VQA data as guidance and adopt two cross-modal feature prediction networks, which are learned under the supervision of image-text matching loss and feature divergence loss, to estimate pre-training-like vision and text features. Next, we conduct feature-level distillation by explicitly integrating the downstream VQA input features with the predicted pre-training-like features through a memory mechanism. In the meantime, we also conduct model-level distillation by constraining the image-text matching output of the downstream VQA model and the output of the foundation model for the pre-training-like image and text features. Extensive experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of our method.
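The abstract describes two distillation signals: a feature-level term that integrates the downstream VQA input features with predicted pre-training-like features through a memory mechanism, and a model-level term that keeps the fine-tuned model's image-text matching output close to that of the foundation model. Below is a minimal, hypothetical PyTorch sketch of how such terms could be instantiated; the module names, the attention-style memory read, the blending coefficient, and the KL temperature are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch (not the authors' code) of the two distillation mechanisms
# outlined in the abstract; all names and loss forms are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PretrainLikePredictor(nn.Module):
    """Cross-modal feature prediction network: estimates a 'pre-training-like'
    feature of one modality from the downstream feature of the other."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def integrate_with_memory(vqa_feat, pretrain_like_feat, memory_bank, alpha=0.5):
    """Feature-level distillation (assumed form): read pre-training-like features
    from a memory bank via attention and blend them into the downstream feature."""
    attn = F.softmax(pretrain_like_feat @ memory_bank.t(), dim=-1)  # (B, slots)
    retrieved = attn @ memory_bank                                   # (B, dim)
    return alpha * vqa_feat + (1.0 - alpha) * retrieved


def model_level_distillation(student_itm_logits, teacher_itm_logits, tau=2.0):
    """Model-level distillation (assumed KL form): align the fine-tuned model's
    image-text matching output with the foundation model's output."""
    p_teacher = F.softmax(teacher_itm_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_itm_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2


if __name__ == "__main__":
    B, D, M = 4, 768, 32
    text_feat, vision_feat = torch.randn(B, D), torch.randn(B, D)
    memory = torch.randn(M, D)                    # stored pre-training-like features
    predictor = PretrainLikePredictor(D)
    pretrain_like_vision = predictor(text_feat)   # text -> pre-training-like vision
    fused = integrate_with_memory(vision_feat, pretrain_like_vision, memory)
    kd_loss = model_level_distillation(torch.randn(B, 2), torch.randn(B, 2))
    print(fused.shape, kd_loss.item())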

Language: English
Content Type: Journal Article
Source URL: http://ir.ia.ac.cn/handle/173211/51954
Collection: Institute of Automation, National Laboratory of Pattern Recognition, Multimedia Computing and Graphics Team
Corresponding Author: Xu, Changsheng
Author Affiliations:
1. Peng Cheng Laboratory
2. School of Artificial Intelligence, University of Chinese Academy of Sciences
3. State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
Recommended Citation
GB/T 7714
Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, et al. Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering[J]. IEEE Transactions on Multimedia, 2023: 1-15.
APA: Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, & Xu, Changsheng. (2023). Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering. IEEE Transactions on Multimedia, 1-15.
MLA: Song, Yaguang, et al. "Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering." IEEE Transactions on Multimedia (2023): 1-15.