Abstract
In recent years, the established link between the different modalities of human speech production has been increasingly exploited in speech processing. In this work, we build on our previous work and present a novel two-stage audiovisual speech enhancement system that combines audio-only beamforming, automatic lip tracking, and pre-processing with visually derived Wiener speech filtering. Initial results demonstrate that this two-stage multimodal approach can improve noisy speech mixtures that conventional audio-only beamforming struggles to cope with, such as very noisy environments with a very low signal-to-noise ratio, or noise types that are difficult for audio-only beamforming to process.
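To give a flavour of the second stage, the sketch below shows a frequency-domain Wiener filter applied to a single frame of noisy audio. This is a minimal illustration, not the authors' implementation: in the paper the speech power spectrum is estimated from visual lip features, whereas here we assume oracle speech and noise power spectra purely to demonstrate the filtering step. All function names are our own.

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=1e-3):
    """Per-bin Wiener gain G = S / (S + N), with a small spectral
    floor to limit musical-noise artifacts."""
    gain = speech_psd / (speech_psd + noise_psd)
    return np.maximum(gain, floor)

def enhance_frame(noisy_frame, speech_psd, noise_psd):
    """Apply the Wiener gain to one frame of noisy audio."""
    spectrum = np.fft.rfft(noisy_frame)
    filtered = spectrum * wiener_gain(speech_psd, noise_psd)
    return np.fft.irfft(filtered, n=len(noisy_frame))

# Toy example: a sinusoid buried in white noise, with oracle PSDs.
rng = np.random.default_rng(0)
n = 512
t = np.arange(n)
clean = np.sin(2 * np.pi * 20 * t / n)
noise = 0.5 * rng.standard_normal(n)
noisy = clean + noise

speech_psd = np.abs(np.fft.rfft(clean)) ** 2
noise_psd = np.abs(np.fft.rfft(noise)) ** 2
enhanced = enhance_frame(noisy, speech_psd, noise_psd)

# With oracle PSDs the filtered frame is closer to the clean signal.
err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((enhanced - clean) ** 2)
assert err_after < err_before
```

In the actual system, the `speech_psd` term would be replaced by an estimate derived from tracked lip features, and the filtering would follow (or precede) the beamforming stage; the gain rule itself is unchanged.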
Abel, A., Hussain, A. Novel Two-Stage Audiovisual Speech Filtering in Noisy Environments. Cogn Comput 6, 200–217 (2014). https://doi.org/10.1007/s12559-013-9231-2