Noise-robust voice activity detection (rVAD) - source code, reference VAD for Aurora 2 (speech endpoint detection source code)



A two-pass, segment-based, unsupervised method for voice activity detection (VAD), also called speech activity detection (SAD), is presented here. In the first pass, high-energy segments are detected using an a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is treated as a high-energy noise segment. In the second pass, noise reduction is applied to the speech signal, and the a posteriori SNR weighted energy difference is then applied to the denoised speech for voice activity detection. Denoised speech is generated as a byproduct. This is an updated version of the VAD described in paper [1].
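To make the core feature more concrete, here is a minimal Python sketch of an energy difference weighted by the a posteriori SNR. It is an illustration under stated assumptions, not the released Matlab implementation: the exact formula (square root of the absolute frame-to-frame energy difference multiplied by the non-negative a posteriori SNR in dB), the fixed `noise_energy` estimate, and the function names are all assumptions made for this example. Pitch detection and noise reduction are not shown.

```python
import math


def frame_energies(samples, frame_len, hop):
    """Sum-of-squares energy of each frame (no windowing, for simplicity)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
    return energies


def snr_weighted_energy_difference(energies, noise_energy, floor=1e-10):
    """Frame-to-frame energy difference weighted by the a posteriori SNR.

    Assumed form (hypothetical, for illustration only):
        d[t] = sqrt(|E[t] - E[t-1]| * max(10 * log10(E[t] / N), 0))
    where N is a fixed noise energy estimate supplied by the caller.
    """
    noise = max(noise_energy, floor)
    diffs = [0.0]  # the first frame has no predecessor
    for t in range(1, len(energies)):
        snr_db = 10.0 * math.log10(max(energies[t], floor) / noise)
        weight = max(snr_db, 0.0)  # frames at or below the noise level get zero weight
        diffs.append(math.sqrt(abs(energies[t] - energies[t - 1]) * weight))
    return diffs
```

With this weighting, an energy jump counts only when the frame is also above the assumed noise level, so fluctuations inside the noise floor contribute nothing to the feature.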

The VAD method has been applied as a preprocessor for speech recognition, speaker identification [2], language identification, age and gender identification [4], human-robot interaction (for social robots) [5], audio archive segmentation, and so on. The method performs well in the NIST OpenSAD Challenge [7].


Source code:

Source code in Matlab is available as a zip archive. It is straightforward to use: simply call the function vad.m.

Some Matlab functions and their modified versions from the publicly available VoiceBox are included with kind permission of Mike Brookes. 

Reference VAD for Aurora 2 database:
The frame-by-frame reference VAD was generated from forced-alignment speech recognition experiments and has been used as a 'ground truth' for evaluating VAD algorithms. Whole-word models were trained on clean speech data for all digits and used to perform forced alignment on the 4004 clean-speech utterances from which all utterances in Test Sets A, B, and C are derived by adding noise. The forced-alignment results, in which '0' and '1' stand for non-speech and speech frames, respectively, are used to set the time boundaries of speech segments and so create a frame-based reference VAD. For more details, refer to paper [1]. The generated reference VAD for the test set is available as a zip archive. The forced-alignment generated reference VAD for the training set of 8440 clean utterances is also available as a zip archive.
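As an illustration of how frame labels map to segment boundaries, the following Python sketch groups runs of '1' frames into speech segments. The file layout of the archives is not described here, so the sketch takes an already-parsed list of 0/1 integers; the function names and the frame shift passed to `to_seconds` are hypothetical and should be replaced by the values that apply to the data.

```python
def labels_to_segments(labels):
    """Group consecutive speech frames (label 1) into (start, end) frame pairs.

    `end` is exclusive, so a segment covers frames start .. end - 1.
    """
    segments = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # a speech segment begins here
        elif label != 1 and start is not None:
            segments.append((start, i))  # the segment ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(labels)))  # speech runs to the last frame
    return segments


def to_seconds(segments, frame_shift_s):
    """Convert frame-index segments to seconds, given an assumed frame shift."""
    return [(s * frame_shift_s, e * frame_shift_s) for s, e in segments]
```

For example, the label sequence 0 0 1 1 1 0 0 1 1 yields two segments covering frames 2 to 4 and frames 7 to 8.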

Other archives: the frame-by-frame results (i.e., VAD outputs) of the advanced front-end VAD for test sets A, B, and C are available as a bz archive, and the results of the variable-frame-rate VAD (shown as 'Proposed' in Table VI of paper [1]) for test sets A, B, and C are available as a bz archive. Forced-alignment labels with timestamps are available as a text archive for the training set and as a text archive for test set A [1].
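A frame-by-frame VAD output can be scored against the reference VAD by counting disagreements. The Python sketch below is one simple way to do this, assuming both sequences are already parsed into equal-length lists of 0/1 integers; the function name is hypothetical, and normalizing all three rates by the total number of frames is a choice made for this example (other conventions normalize misses by the number of speech frames and false alarms by the number of non-speech frames).

```python
def frame_error_rates(reference, hypothesis):
    """Compare a VAD output with a reference VAD, frame by frame.

    Returns miss, false-alarm, and total frame error rates, each expressed
    as a fraction of the total number of frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        raise ValueError("at least one frame is required")
    pairs = list(zip(reference, hypothesis))
    misses = sum(1 for r, h in pairs if r == 1 and h == 0)  # speech labeled as non-speech
    false_alarms = sum(1 for r, h in pairs if r == 0 and h == 1)  # non-speech labeled as speech
    total = len(pairs)
    return {
        "miss_rate": misses / total,
        "false_alarm_rate": false_alarms / total,
        "frame_error_rate": (misses + false_alarms) / total,
    }
```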

We have done a systematic comparison of forced-alignment speech recognition and humans for generating reference VAD in [6].


[1] Z.-H. Tan and B. Lindberg, "Low-complexity variable frame rate analysis for speech recognition and voice activity detection." IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 798-807, 2010. (Google Scholar)

Our related work:

[2] O. Plchot, S. Matsoukas, P. Matejka, N. Dehak, J. Ma, S. Cumani, O. Glembek, H. Hermansky, S.H. Mallidi, N. Mesgarani, R. Schwartz, M. Soufifar, Z.-H. Tan, S. Thomas, B. Zhang and X. Zhou, “Developing a Speaker Identification System for the DARPA RATS project,” ICASSP 2013, Vancouver, Canada, May 26 - 31, 2013. (Google Scholar)

[3] T. Petsatodis, C. Boukis, F. Talantzis, Z.-H. Tan and R. Prasad, “Convex Combination of Multiple Statistical Models with Application to VAD,” IEEE Transactions on Audio, Speech and Language Processing,  vol. 19, no. 8, pp. 2314 - 2327, November 2011. (Google Scholar)

[4] S.E. Shepstone, Z.-H. Tan and S.H. Jensen, "Audio-based age and gender identification to enhance the recommendation of TV content," IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 721-729, 2013.

[5] N.B. Thomsen, Z.-H. Tan, B. Lindberg and S.H. Jensen, “Improving Robustness against Environmental Sounds for Directing Attention of Social Robots,” The 2nd Workshop on Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, September 14, 2014, Singapore.

[6] I. Kraljevski, Z.-H. Tan and M. P. Bissiri, “Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD,” Interspeech 2015, Dresden, Germany, September 6-10, 2015.

[7] T. Kinnunen, A. Sholokhov, E. Khoury, D. Thomsen, Md Sahidullah and Z.-H. Tan, "HAPPY Team Entry to NIST OpenSAD Challenge: A Fusion of Short-Term Unsupervised and Segment i-Vector Based Speech Activity Detectors," Interspeech 2016, San Francisco, USA, 8 - 12 September 2016. PDF

[8] Zheng-Hua Tan, Nicolai Bæk Thomsen, Xiaodong Duan, Evgenios Vlachos, Sven Ewan Shepstone, Morten H. Rasmussen and Jesper Lisby Højvang, "iSocioBot - A Multimodal Interactive Social Robot," accepted by International Journal of Social Robotics. (Springer). PDF from Springer Nature Sharing.


Zheng-Hua Tan

Department of Electronic Systems, Aalborg University, Denmark