Visually Guided Sound Source Separation With Audio-Visual Predictive Coding
Authors: Song, Zengjie (1); Zhang, Zhaoxiang (2, 3)
Journal: IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
Date: 2023-07-12
Pages: 15
Keywords: Feature fusion; multimodal learning; predictive coding (PC); self-supervised learning; sound source separation
ISSN: 2162-237X
DOI: 10.1109/TNNLS.2023.3288022
Corresponding author: Zhang, Zhaoxiang (zhaoxiang.zhang@ia.ac.cn)
Abstract: Frameworks for visually guided sound source separation generally consist of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor an elaborate visual feature extractor for informative visual guidance and to devise a separate module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, because jointly optimizing and harmonizing the various model components is challenging. By contrast, this article presents a novel approach, dubbed audio-visual predictive coding (AVPC), to tackle this task in a parameter-efficient and more effective manner. The AVPC network comprises a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding (PC)-based sound separation network that can extract audio features, fuse multimodal information, and predict sound separation masks within the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while reducing the model size significantly. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
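The iterative error minimization mentioned in the abstract can be illustrated with a generic predictive coding update. The following is a minimal sketch, not the AVPC network: it assumes a linear generative model in which a latent representation `r` predicts an input `x` through a weight matrix `W`. The names `pc_inference`, `W`, `r`, and `x` are hypothetical and chosen for illustration only.

```python
import numpy as np


def pc_inference(x, W, steps=50):
    """Refine a latent vector r so that the prediction W @ r approaches x.

    Generic predictive-coding-style inference for a linear model
    (an illustrative assumption, not the architecture from the paper).
    """
    # Step size of 1 / (largest singular value)^2 keeps the squared
    # error from increasing between iterations.
    lr = 1.0 / np.linalg.norm(W, 2) ** 2
    r = np.zeros(W.shape[1])
    errors = []
    for _ in range(steps):
        e = x - W @ r                # prediction error at this iteration
        errors.append(float(e @ e))  # squared error, recorded before the update
        r = r + lr * (W.T @ e)       # gradient step on 0.5 * ||x - W r||^2
    return r, errors


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
x = W @ rng.normal(size=4)           # target that the model can represent
r, errors = pc_inference(x, W)
print(f"squared error: first={errors[0]:.4f}, last={errors[-1]:.4f}")
```

Each pass feeds the current prediction error back into the representation, so the estimate improves step by step. This is the sense in which repeated error correction can give progressively better results.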
Funding projects: Major Project for New Generation of AI [2018AAA0100400]; National Natural Science Foundation of China [61836014]; National Natural Science Foundation of China [U21B2042]; National Natural Science Foundation of China [62072457]; National Natural Science Foundation of China [62006231]; National Natural Science Foundation of China [61976174]; China Postdoctoral Science Foundation [2021M703489]
WOS keywords: INTEGRATION
WOS research areas: Computer Science; Engineering
Language: English
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
WOS accession number: WOS:001030674000001
Funding organizations: Major Project for New Generation of AI; National Natural Science Foundation of China; China Postdoctoral Science Foundation
Content type: Journal article
Source URL: http://ir.ia.ac.cn/handle/173211/53783
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems
Author affiliations:
1. Xi An Jiao Tong Univ, Sch Math & Stat, Xian 710049, Peoples R China
2. Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
3. Chinese Acad Sci, Hong Kong Inst Sci & Innovat, Ctr Artificial Intelligence & Robot, Hong Kong, Peoples R China
Recommended citation:
GB/T 7714: Song, Zengjie, Zhang, Zhaoxiang. Visually Guided Sound Source Separation With Audio-Visual Predictive Coding[J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023: 15.
APA: Song, Zengjie, & Zhang, Zhaoxiang. (2023). Visually Guided Sound Source Separation With Audio-Visual Predictive Coding. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 15.
MLA: Song, Zengjie, et al. "Visually Guided Sound Source Separation With Audio-Visual Predictive Coding". IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2023): 15.