Abstract
In recent years, the established link between the different modalities of human speech production has been increasingly exploited in speech processing. In this work, we build on our previous work and present a novel two-stage audiovisual speech enhancement system that combines audio-only beamforming, automatic lip tracking, and pre-processing with visually derived Wiener speech filtering. Initial results demonstrate that this two-stage multimodal approach can improve noisy speech mixtures that conventional audio-only beamforming struggles to cope with, such as very noisy environments with a very low signal-to-noise ratio, or noise types that are difficult for audio-only beamforming to process.
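To give a flavour of the second stage, the sketch below shows a frequency-domain Wiener filter applied to a single frame of noisy audio. This is a minimal illustration, not the authors' implementation: in the paper the speech power spectrum is estimated from visual lip features, whereas here we assume oracle speech and noise power spectra purely to demonstrate the filtering step. All function names are our own.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=1e-3):
    """Per-bin Wiener gain G = S / (S + N), with a small spectral
    floor to limit musical-noise artifacts."""
    gain = speech_psd / (speech_psd + noise_psd)
    return np.maximum(gain, floor)

def enhance_frame(noisy_frame, speech_psd, noise_psd):
    """Apply the Wiener gain to one frame of noisy audio."""
    spectrum = np.fft.rfft(noisy_frame)
    filtered = spectrum * wiener_gain(speech_psd, noise_psd)
    return np.fft.irfft(filtered, n=len(noisy_frame))

# Toy example: a sinusoid buried in white noise, with oracle PSDs.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
clean = np.sin(2 * np.pi * 20 * t / n)
noise = 0.5 * rng.standard_normal(n)
noisy = clean + noise

speech_psd = np.abs(np.fft.rfft(clean)) ** 2
noise_psd = np.abs(np.fft.rfft(noise)) ** 2
enhanced = enhance_frame(noisy, speech_psd, noise_psd)

# With oracle PSDs the filtered frame is closer to the clean signal.
err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((enhanced - clean) ** 2)
assert err_after < err_before
```

In the actual system, the `speech_psd` term would be replaced by an estimate derived from tracked lip features, and the filtering would follow (or precede) the beamforming stage; the gain rule itself is unchanged.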
Abel, A., Hussain, A. Novel Two-Stage Audiovisual Speech Filtering in Noisy Environments. Cogn Comput 6, 200–217 (2014). https://doi.org/10.1007/s12559-013-9231-2