Blind Audio-Visual Localization Dataset



The Blind Audio-Visual Localization (BAVL) Dataset consists of 20 audio-visual recordings of sound sources, which are either talking faces or musical instruments. Most of the recordings (19) are videos from YouTube; the exception is V8, which comes from [1]. In addition, video V7 was also used in [2] and [3], and V16 in [3]. We annotated all 20 videos ourselves in a uniform manner. Details of the video sequences are listed in Table 1.

The videos in the dataset have an average duration of 10 seconds, and each was recorded with a single camera and a single microphone. The audio files (.wav) were sampled at 16 kHz for V7, V8 and V16, and at 44.1 kHz for the rest. The video frames contain the sound-making object (the sound source) as well as distracting objects (e.g. pedestrians on the street), while the audio signals consist of the sound produced by the sound source (human speech or instrumental music), environmental noise, and sometimes other sounds. Not every video contains distracting objects or irrelevant noise/sounds. The primary use of the dataset is to evaluate the performance of sound source localization methods in the presence of distracting motion and noise.
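As a rough sketch of how one might inspect the audio tracks, the snippet below writes a short synthetic 44.1 kHz WAV file (standing in for a real BAVL track, since the file names and directory layout here are assumptions) and then reads its sample rate and duration back from the header using Python's standard-library wave module:

```python
import math
import struct
import wave

# Create a 1-second synthetic 44.1 kHz, 16-bit mono WAV file as a stand-in
# for a real BAVL audio track (a 440 Hz sine tone).
sample_rate = 44100
n_frames = sample_rate  # 1 second of audio
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)              # 16-bit PCM
    w.setframerate(sample_rate)
    samples = (int(10000 * math.sin(2 * math.pi * 440 * t / sample_rate))
               for t in range(n_frames))
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Read the header back, e.g. to distinguish the 16 kHz videos (V7, V8, V16)
# from the 44.1 kHz ones.
with wave.open("demo.wav", "rb") as w:
    rate = w.getframerate()
    duration = w.getnframes() / rate

print(rate, duration)  # 44100 1.0
```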


Table 1. Main Specifications and Contents of the Video Sequences.


We provide visual annotations as illustrated in Figure 1. The location of the sound-making object is annotated as the white region. The annotation images are binary, where a value of 1 indicates the presence of the sound source.
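One simple way to use these binary masks is to check whether a predicted sound-source location falls inside the annotated region. The sketch below builds a small synthetic mask in place of a real annotation image; the hit/miss scoring rule is an assumption for illustration, not the dataset's official metric:

```python
def localization_hit(mask, pred_xy):
    """Return True if the predicted (x, y) pixel lies inside the annotated
    sound-source region (mask value 1). Hypothetical scoring rule, not the
    dataset's prescribed evaluation protocol."""
    x, y = pred_xy
    return mask[y][x] == 1

# Synthetic 8x8 binary annotation standing in for a real BAVL mask:
# a white (1) rectangle marks the sound source, the background is 0.
mask = [[0] * 8 for _ in range(8)]
for row in range(2, 5):        # rows 2-4
    for col in range(3, 6):    # columns 3-5
        mask[row][col] = 1

print(localization_hit(mask, (4, 3)))  # inside the white region -> True
print(localization_hit(mask, (0, 0)))  # background -> False
```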

Figure 1. An Example of Annotation for V8.


Data Download

Content: 20 videos of the dataset, in the form of image frames and wav audio files, plus annotations.
Format: zip archive containing jpg and wav files.
Size: 395 MB
Download Link:


If you use this dataset, please cite:

@inproceedings{pu2017audio,
  title={Audio-visual object localization and separation using low-rank and sparsity},
  author={Pu, Jie and Panagakis, Yannis and Petridis, Stavros and Pantic, Maja},
  booktitle={Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on},
  year={2017},
  organization={IEEE}
}



This work has been funded by the European Community Horizon 2020 under grant agreement no. 645094 (SEWA) and no. 688835 (DE-ENIGMA).



[1] Kidron, Einat, Yoav Y. Schechner, and Michael Elad. "Pixels that sound." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005.

[2] Izadinia, Hamid, Imran Saleemi, and Mubarak Shah. "Multimodal analysis for identification and segmentation of moving-sounding objects." IEEE Transactions on Multimedia 15.2 (2013): 378-390.

[3] Li, Kai, Jun Ye, and Kien A. Hua. "What's making that sound?" Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014.


For any questions, please contact Jie Pu (