Noise-robust voice activity detection (rVAD) - source code, reference VAD for Aurora 2 (speech endpoint detection, source code)
A two-pass, segment-based, unsupervised method for voice activity detection (VAD), or speech activity detection (SAD), is presented here. In the first pass, high-energy segments are detected using the a posteriori signal-to-noise ratio (SNR) weighted energy difference, and if no pitch is detected within a segment, the segment is considered a high-energy noise segment. In the second pass, noise reduction is applied to the speech, and the a posteriori SNR weighted energy difference is then applied to the denoised speech for voice activity detection. Denoised speech is generated as a byproduct. This is an updated version of the VAD in the paper by Tan and Lindberg (2010) listed below.
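To make the first-pass idea concrete, the following minimal Matlab sketch computes one possible a posteriori SNR weighted energy difference; the frame size, the noise-floor estimate and the threshold are assumptions for illustration only and are not the exact formulation implemented in vad.m.

    [x, fs] = audioread('speech.wav');            % any speech file; name is a placeholder
    x = x(:, 1);                                  % use the first channel
    frmLen = round(0.025*fs);  hop = round(0.010*fs);    % 25 ms frames, 10 ms shift (assumed)
    frames = buffer(x, frmLen, frmLen - hop, 'nodelay'); % Signal Processing Toolbox
    e = sum(frames.^2, 1) + eps;                  % short-time frame energies
    eSorted = sort(e);
    eNoise = mean(eSorted(1:max(1, round(0.1*numel(e)))));  % crude noise floor: lowest 10% of frames
    snrPost = max(e ./ eNoise, 1);                % a posteriori SNR, floored at 1
    d = [0, abs(diff(log(e)))] .* log(snrPost);   % SNR-weighted energy difference between frames
    highEnergy = d > mean(d);                     % candidate high-energy frames (assumed threshold)

In the method itself, high-energy segments without detected pitch are then discarded as noise, and the detection is repeated on the denoised speech in the second pass.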
The VAD method has been applied as a preprocessor for speech recognition, speaker identification, language identification, age and gender identification, human-robot interaction (for social robots), audio archive segmentation, and so on. The method performs well in the NIST OpenSAD Challenge (see the Interspeech 2016 paper below).
Source code in Matlab is available as a zip archive. It is straightforward to use: Simply call the function vad.m.
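As a quick orientation, a call could look like the following sketch; the argument list and the returned per-frame labels are assumptions made here for illustration, so please check the header comments of vad.m for the actual interface.

    % Hypothetical usage; the actual signature of vad.m may differ.
    [x, fs] = audioread('speech.wav');   % any single-channel speech file
    x = x(:, 1);                         % keep the first channel
    vadLabels = vad(x, fs);              % assumed to return one 0/1 decision per frame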
Some Matlab functions, and modified versions of them, from the publicly available VoiceBox toolbox are included with the kind permission of Mike Brookes.
Reference VAD for Aurora 2 database:
The frame-by-frame reference VAD was generated from forced-alignment speech recognition experiments and has been used as a 'ground truth' for evaluating VAD algorithms. Whole-word models were trained on clean speech data for all digits and used to perform forced alignment for the 4004 clean utterances from which all utterances in Test Sets A, B and C are derived by adding noise. The forced-alignment results, in which '0' and '1' stand for non-speech and speech frames, respectively, are used to set the time boundaries of speech segments and thereby create a frame-based reference VAD. For more details, refer to the papers listed below. The generated reference VAD for the test sets is available as a zip archive. The forced-alignment generated reference VAD for the training set of 8440 clean utterances is also available as a zip archive.
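For scoring a VAD against these frame-level labels, a simple frame-wise comparison such as the Matlab sketch below can be used; the file names and the one-label-per-line format are assumptions, so adapt them to the files in the archive.

    % Illustrative frame-level scoring against the reference VAD (assumed format:
    % one 0/1 label per line, one file per utterance; file names are hypothetical).
    ref = load('reference_vad/utt0001.txt');      % reference labels: 0 = non-speech, 1 = speech
    hyp = load('my_vad_output/utt0001.txt');      % labels produced by the VAD under test
    n   = min(numel(ref), numel(hyp));            % guard against differing frame counts
    ref = ref(1:n);  hyp = hyp(1:n);
    missRate = sum(ref == 1 & hyp == 0) / max(1, sum(ref == 1));  % speech missed
    faRate   = sum(ref == 0 & hyp == 1) / max(1, sum(ref == 0));  % non-speech accepted
    fprintf('Miss rate: %.2f%%, false alarm rate: %.2f%%\n', 100*missRate, 100*faRate);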
Also available are the following archives: the frame-by-frame results (i.e., VAD outputs) of the advanced front-end VAD for Test Sets A, B and C as a bz archive, and the results of the variable-frame-rate VAD (shown as 'Proposed' in Table VI of the Tan and Lindberg (2010) paper below) for Test Sets A, B and C as a bz archive. Forced-alignment labels with timestamps are available for the training set as a text archive and for Test Set A as a text archive.
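To use the timestamped forced-alignment labels as frame-level references, they can be expanded at the frame rate of the front end; the sketch below assumes a 10 ms frame shift and a 'start end label' line format with times in seconds, both of which may differ from the actual files.

    % Expand timestamped labels to frame-level 0/1 labels (format and file name assumed).
    segs = dlmread('train_labels/utt0001.lab');   % columns assumed: start time, end time, 0/1 label
    hop  = 0.010;                                 % assumed 10 ms frame shift
    nFrm = ceil(max(segs(:, 2)) / hop);
    frmLab = zeros(1, nFrm);
    for k = 1:size(segs, 1)
        if segs(k, 3) == 1                        % speech segment
            i1 = max(1, floor(segs(k, 1) / hop) + 1);
            i2 = min(nFrm, ceil(segs(k, 2) / hop));
            frmLab(i1:i2) = 1;                    % mark the covered frames as speech
        end
    end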
We have done a systematic comparison of forced-alignment speech recognition and humans for generating reference VAD; see the Interspeech 2015 paper by Kraljevski et al. below.
Z.-H. Tan and B. Lindberg, "Low-complexity variable frame rate analysis for speech recognition and voice activity detection," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 798-807, 2010. (Google Scholar)
Our related work:
O. Plchot, S. Matsoukas, P. Matejka, N. Dehak, J. Ma, S. Cumani, O.
Glembek, H. Hermansky, S.H. Mallidi, N. Mesgarani, R. Schwartz, M.
Soufifar, Z.-H. Tan, S. Thomas, B. Zhang and X. Zhou, “Developing a
Speaker Identification System for the DARPA RATS project,” ICASSP 2013,
Vancouver, Canada, May 26 - 31, 2013. (Google Scholar)
 T. Petsatodis, C. Boukis, F. Talantzis, Z.-H. Tan and R. Prasad, “Convex Combination of Multiple Statistical Models with Application to VAD,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 8, pp. 2314 - 2327, November 2011. (Google Scholar)
S.E. Shepstone, Z.-H. Tan and S.H. Jensen, "Audio-based age and gender identification to enhance the recommendation of TV content," IEEE Transactions on Consumer Electronics, vol. 59, no. 3, pp. 721-729, 2013.
N.B. Thomsen, Z.-H. Tan, B. Lindberg and S.H. Jensen, “Improving
Robustness against Environmental Sounds for Directing Attention of
Social Robots,” The 2nd Workshop on Multimodal Analyses Enabling
Artificial Agents in Human-Machine Interaction, September 14, 2014.
 I. Kraljevski, Z.-H. Tan and M. P. Bissiri, “Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD,” Interspeech 2015, Dresden, Germany, September 6-10, 2015.
T. Kinnunen, A. Sholokhov, E.
Khoury, D. Thomsen, Md Sahidullah and Z.-H. Tan, "HAPPY Team Entry to
NIST OpenSAD Challenge: A Fusion of Short-Term Unsupervised and Segment
i-Vector Based Speech Activity Detectors," Interspeech 2016, San
Francisco, USA, 8 - 12 September 2016. PDF
Zheng-Hua Tan, Nicolai Bæk Thomsen, Xiaodong Duan,
Evgenios Vlachos, Sven Ewan Shepstone, Morten H. Rasmussen and Jesper
Lisby Højvang, "iSocioBot - A Multimodal Interactive Social Robot,"
accepted by International Journal of Social Robotics (Springer). PDF
from Springer Nature Sharing.
Department of Electronic Systems, Aalborg University, Denmark