A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence
Xiongkuo Min [1]; Guangtao Zhai [1]; Jiantao Zhou [2]; Xiao-Ping Zhang [3]; Xiaokang Yang [1]; Xinping Guan [4]
Source Publication: IEEE Transactions on Image Processing

Most current visual attention prediction studies ignore audio information. However, sound can influence visual attention, and this influence has been widely investigated and confirmed by many psychological studies. In this paper, we propose a novel multi-modal saliency (MMS) model for videos containing scenes with high audio-visual correspondence. In such scenes, humans tend to be attracted by the sound sources, and the sound sources can also be localized via cross-modal analysis. Specifically, we first detect the spatial and temporal saliency maps from the visual modality using a novel free energy principle. Then we propose to detect the audio saliency map from both the audio and visual modalities by localizing the moving-sounding objects using cross-modal kernel canonical correlation analysis, which is the first of its kind in the literature. Finally, we propose a new two-stage adaptive audiovisual saliency fusion method to integrate the spatial, temporal, and audio saliency maps into our audio-visual saliency map. The proposed MMS model captures the influence of audio, which is not considered in the latest deep learning based saliency models. To take advantage of both deep saliency modeling and audio-visual saliency modeling, we propose to combine deep saliency models and the MMS model via late fusion, and we find that an average performance gain of 5% is obtained. Experimental results on audio-visual attention databases show that the introduced models incorporating audio cues are significantly superior to state-of-the-art image and video saliency models that utilize a single visual modality.
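
The abstract does not spell out how the deep saliency maps and the MMS map are combined at the final stage. As a rough illustration only, the sketch below assumes the simplest form of such a fusion: each map is min-max normalized and the two are blended with a fixed weight. The names `normalize`, `late_fusion`, and `weight`, the equal default weighting, and the toy 2x2 maps are all hypothetical and are not taken from the paper.

```python
from typing import List

# A saliency map as rows of non-negative floats (assumed representation).
Map = List[List[float]]


def normalize(sal: Map) -> Map:
    """Min-max rescale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in sal for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in sal]
    return [[(v - lo) / (hi - lo) for v in row] for row in sal]


def late_fusion(deep_map: Map, mms_map: Map, weight: float = 0.5) -> Map:
    """Hypothetical fusion: convex combination of two normalized maps.

    `weight` is the share given to the deep model's map; this fixed
    weighting is an assumption, not the adaptive scheme of the paper.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    a, b = normalize(deep_map), normalize(mms_map)
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("maps must have the same shape")
    return [
        [weight * x + (1.0 - weight) * y for x, y in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


# Toy 2x2 maps standing in for per-frame model outputs.
deep = [[0.0, 2.0], [4.0, 8.0]]
mms = [[10.0, 0.0], [5.0, 10.0]]
fused = late_fusion(deep, mms, weight=0.5)
```

Normalizing before blending keeps either model from dominating purely because of its output scale; in practice the weight would be tuned on held-out eye-tracking data rather than fixed.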

Keyword: Attention Fusion; Audio-visual Attention; Multimodal Saliency; Visual Attention
Indexed By: SCIE
WOS Research Area: Computer Science; Engineering
WOS Subject: Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
WOS ID: WOS:000510750900069
Scopus ID: 2-s2.0-85079659182
Citation statistics
Cited Times [WOS]: 45
Document Type: Journal article
Collection: Faculty of Science and Technology
Corresponding Author: Guangtao Zhai
Affiliation:
1. Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2. Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau 999078, China
3. Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON M5B 2K3, Canada
4. Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China
Recommended Citation
GB/T 7714: Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, et al. A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence[J]. IEEE Transactions on Image Processing, 2020, 29: 3805-3819.
APA: Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, & Xinping Guan. (2020). A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence. IEEE Transactions on Image Processing, 29, 3805-3819.
MLA: Xiongkuo Min, et al. "A Multimodal Saliency Model for Videos with High Audio-Visual Correspondence". IEEE Transactions on Image Processing 29 (2020): 3805-3819.
Files in This Item:
File Name/Size: A_Multimodal_Salienc (5107KB)
Publications: Journal article
Version: Author accepted manuscript
Access: Open access
License: CC BY-NC-SA
File name: A_Multimodal_Saliency_Model_for_Videos_With_High_Audio-Visual_Correspondence.pdf
Format: Adobe PDF

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.