Video captioning based on vision transformer and reinforcement learning

doi:10.7717/peerj-cs.916


	Video captioning based on vision transformer and reinforcement learning
	Zhao, Hong 1; Chen, Zhiwen 1; Guo, Lan 1; Han, Zeyu 2
Journal	PEERJ COMPUTER SCIENCE
	2022-03-16
Volume	8
Keywords	Video captioning; Vision transformer; Reinforcement learning; Long short-term memory network; Computer vision; Natural language processing; Attention mechanism; Encode-decode; Deep learning
DOI	10.7717/peerj-cs.916
Abstract	Global encoding of visual features in video captioning is important for improving the description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoding block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a video content description. Finally, the accuracy of the video content description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves by 2.9%, 1.4%, 0.9% and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively.
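The abstract does not say which reinforcement learning algorithm is used for fine-tuning. As an illustration only, the sketch below shows one common choice, a policy-gradient (REINFORCE) loss with a baseline reward, assuming the reward is a caption metric such as CIDEr-D. All function names, the toy logits, and the reward values are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one decoding step's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def caption_log_prob(step_logits, token_ids):
    """Sum of log-probabilities the decoder assigns to the chosen tokens."""
    return sum(
        math.log(softmax(logits)[token])
        for logits, token in zip(step_logits, token_ids)
    )


def policy_gradient_loss(step_logits, sampled_ids, sampled_reward, baseline_reward):
    """REINFORCE-with-baseline loss for one sampled caption (assumed setup).

    sampled_reward:  metric score of the sampled caption (e.g. CIDEr-D).
    baseline_reward: score of a reference caption from the same model,
                     used here only to reduce variance.
    Minimizing this loss raises the probability of captions that score
    above the baseline and lowers it for captions that score below.
    """
    advantage = sampled_reward - baseline_reward
    return -advantage * caption_log_prob(step_logits, sampled_ids)


# Toy example: two decoding steps over a two-word vocabulary, uniform logits.
toy_logits = [[0.0, 0.0], [0.0, 0.0]]
toy_caption = [0, 1]
loss = policy_gradient_loss(toy_logits, toy_caption, sampled_reward=0.8, baseline_reward=0.5)
```

In a real training loop the logits would come from the LSTM decoder and the gradient of this loss would be backpropagated through it; the pure-Python version here only shows how the reward difference scales the caption's log-probability.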
WOS Research Area	Computer Science
Language	English
Publisher	PEERJ INC
WOS Accession Number	WOS:000773302200003
Content Type	Journal article
Source URL	[http://ir.lut.edu.cn/handle/2XXMBERH/158092]
Collection	Lanzhou University of Technology
Author Affiliations	1.Lanzhou Univ Technol, Sch Comp & Commun, Lanzhou, Gansu, Peoples R China; 2.Lanzhou Univ Technol, Network & Informat Ctr, Lanzhou, Gansu, Peoples R China
Recommended Citation GB/T 7714	Zhao, Hong,Chen, Zhiwen,Guo, Lan,et al. Video captioning based on vision transformer and reinforcement learning[J]. PEERJ COMPUTER SCIENCE,2022,8.
APA	Zhao, Hong,Chen, Zhiwen,Guo, Lan,&Han, Zeyu.(2022).Video captioning based on vision transformer and reinforcement learning.PEERJ COMPUTER SCIENCE,8.
MLA	Zhao, Hong,et al."Video captioning based on vision transformer and reinforcement learning".PEERJ COMPUTER SCIENCE 8(2022).