
Novel Two-Stage Audiovisual Speech Filtering in Noisy Environments

Published in Cognitive Computation.

Abstract

In recent years, the established link between the various human communication production domains has become more widely exploited in speech processing. In this work, we build on our previous work and present a novel two-stage audiovisual speech enhancement system that combines audio-only beamforming, automatic lip tracking, and pre-processing with visually derived Wiener speech filtering. Initial results demonstrate that this two-stage multimodal approach can improve noisy speech mixtures that conventional audio-only beamforming struggles to cope with, such as very noisy environments with a very low signal-to-noise ratio, or noise types that are difficult for audio-only beamforming to process.
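The abstract describes the Wiener-filtering stage only at a high level. As a rough, generic illustration of the per-frequency-bin Wiener gain such a stage relies on (the function name, the noise-power estimate, and the gain floor are illustrative assumptions, not the authors' implementation, in which the clean-speech statistics are estimated from visual lip features rather than from the audio alone):

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, floor=1e-3):
    """Generic Wiener gain per frequency bin: G = SNR / (1 + SNR).

    noisy_power and noise_power are power-spectrum estimates for one frame;
    the a priori SNR is approximated by power subtraction, and the gain is
    floored to avoid zeroing bins entirely (musical-noise mitigation).
    """
    snr = np.maximum(noisy_power - noise_power, 0.0) / (noise_power + 1e-12)
    return np.maximum(snr / (1.0 + snr), floor)

# Toy example: one three-bin spectral frame.
noisy = np.array([4.0, 1.0, 0.25])   # observed noisy power
noise = np.array([1.0, 1.0, 1.0])    # assumed noise power estimate
g = wiener_gain(noisy, noise)        # multiply this into the noisy spectrum
```

In a full enhancement pipeline the gain would be computed frame by frame in the short-time Fourier domain and applied to the noisy spectrum before resynthesis; the distinctive step in the paper is that the speech statistics feeding this gain are derived from tracked lip movements.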




Corresponding author

Correspondence to Andrew Abel.


Cite this article

Abel, A., Hussain, A. Novel Two-Stage Audiovisual Speech Filtering in Noisy Environments. Cogn Comput 6, 200–217 (2014). https://doi.org/10.1007/s12559-013-9231-2
