
EE Dept., IIT Bombay. NCC 2013, Delhi, 15-17 Feb. 2013, Paper 3.2_2_1569696063 (Sat. 16th, 1135-1320, Session 3.2_2): Speech Enhancement.




1

2 Speech Enhancement Using Spectral Subtraction and Cascaded-Median Based Noise Estimation for Hearing Impaired Listeners
Santosh K. Waddi, Prem C. Pandey, Nitya Tiwari
{wsantosh, pcpandey, nitya} @ ee.iitb.ac.in
IIT Bombay
http://www.ee.iitb.ac.in/~spilab/material/santosh/ncc2013

3 Overview
1. Introduction
2. Signal Processing for Spectral Subtraction
3. Implementation for Real-time Processing
4. Test Results
5. Summary & Conclusion

4 1. Introduction
Sensorineural hearing loss
– Increased hearing thresholds and high-frequency loss
– Decreased dynamic range & abnormal loudness growth
– Reduced speech perception due to increased spectral & temporal masking → decreased speech intelligibility in noisy environments
Signal processing in hearing aids
– Frequency-selective amplification
– Automatic volume control
– Multichannel dynamic range compression (settable attack time, release time, and compression ratios)
Processing for reducing the effect of increased spectral masking in sensorineural loss
– Binaural dichotic presentation (Lunner et al. 1993, Kulkarni et al. 2012)
– Spectral contrast enhancement (Yang et al. 2003)
– Multiband frequency compression (Arai et al. 2004, Kulkarni et al. 2012)

5 Techniques for reducing the background noise
– Directional microphone
– Adaptive filtering (a second microphone needed for noise reference)
– Single-channel noise suppression using spectral subtraction (Boll 1979, Berouti et al. 1979, Martin 1994, Loizou 2007, Lu & Loizou 2008, Paliwal et al. 2010)
Processing steps
1. Dynamic estimation of the non-stationary noise spectrum
   – during non-speech segments using voice activity detection, or
   – continuously using statistical techniques
2. Estimation of the noise-free speech spectrum
   – spectral noise subtraction, or
   – multiplication by a noise suppression function
3. Speech resynthesis (using enhanced magnitude and noisy phase)

6 Research objective
Real-time single-input speech enhancement for use in hearing aids and other sensory aids (cochlear prostheses, etc.) for hearing impaired listeners
Main challenges
– Noise estimation without voice activity detection, to avoid errors under low SNR and during long speech segments
– Low signal delay (algorithmic + computational) for real-time application
– Low computational complexity & memory requirement for implementation on a low-power processor
Proposed technique: spectral subtraction using cascaded-median based continuous updating of the noise spectrum (without voice activity detection)
Real-time implementation: 16-bit fixed-point DSP with on-chip FFT hardware
Evaluation: informal listening, PESQ-MOS

7 2. Signal Processing for Spectral Subtraction
– Dynamic estimation of the non-stationary noise spectrum
– Estimation of the noise-free speech spectrum
– Speech resynthesis

8 Power subtraction
Windowed noisy speech spectrum: X_n(k); estimated noise magnitude spectrum: D_n(k)
Estimated speech spectrum: Y_n(k) = [ |X_n(k)|^2 − (D_n(k))^2 ]^0.5 e^(j∠X_n(k))
Problems: residual noise due to under-subtraction; distortion in the form of musical noise & clipping due to over-subtraction
Generalized spectral subtraction (Berouti et al. 1979)
|Y_n(k)| = β^(1/γ) D_n(k),                       if |X_n(k)| < (α + β)^(1/γ) D_n(k)
           [ |X_n(k)|^γ − α (D_n(k))^γ ]^(1/γ),  otherwise
γ = exponent factor (2: power subtraction, 1: magnitude subtraction)
α = over-subtraction factor (for limiting the effect of short-term variations in the noise spectrum)
β = floor factor to mask the musical noise due to over-subtraction
Resynthesis with noisy phase, without explicit phase calculation:
Y_n(k) = |Y_n(k)| X_n(k) / |X_n(k)|
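The generalized-subtraction rule above can be sketched per frame as follows (a minimal NumPy illustration, not the fixed-point DSP code; the function name and default parameter values are illustrative):

```python
import numpy as np

def generalized_spectral_subtraction(X, D, alpha=2.0, beta=0.001, gamma=1.0):
    """Generalized spectral subtraction (Berouti et al. 1979) on one frame.
    X: complex spectrum of windowed noisy speech, X_n(k)
    D: estimated noise magnitude spectrum, D_n(k)
    gamma = 2: power subtraction, gamma = 1: magnitude subtraction."""
    mag = np.abs(X)
    # Subtract the over-weighted noise estimate in the gamma-power domain
    sub = np.maximum(mag**gamma - alpha * D**gamma, 0.0) ** (1.0 / gamma)
    # Spectral floor masks the musical noise caused by over-subtraction
    floor = (beta ** (1.0 / gamma)) * D
    Y_mag = np.where(mag < ((alpha + beta) ** (1.0 / gamma)) * D, floor, sub)
    # Re-synthesis with noisy phase, without explicit phase calculation:
    # Y_n(k) = |Y_n(k)| X_n(k) / |X_n(k)|
    return Y_mag * X / np.maximum(mag, 1e-12)
```

Bins whose magnitude falls below the subtraction threshold are floored at β^(1/γ) D_n(k) rather than set to zero, which is what suppresses the musical-noise artifacts.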

9 Dynamic estimation of the noise magnitude spectrum
– Minimum-statistics based estimation (Martin 1994): SNR-dependent over-subtraction factor
– Median based estimation (Stahl et al. 2000): large computation & memory
– Pseudo-median based estimation (Basha & Pandey 2012): moving median approximated by a p-point q-stage cascaded median, with a saving in memory & computation for real-time implementation

Median                       Storage per freq. bin   Sortings per frame per freq. bin
M-point                      2M                      (M−1)/2
p-point q-stage (M = p^q)    pq                      p(p−1)/2

Condition for reducing sorting operations and storage: low p, q ≈ ln(M)
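The p-point q-stage cascaded median can be sketched as below (an illustrative Python model of the scheme credited to Basha & Pandey 2012; the class and method names are my own). Each stage buffers p values; when full, it forwards their median to the next stage and clears, so only p·q values are stored per frequency bin instead of the 2M needed by a true M-point moving median:

```python
from collections import deque
import statistics

class CascadedMedian:
    """p-point q-stage cascaded median: approximates an M = p**q point
    moving median using q stages of p-point medians over non-overlapping
    blocks, storing only p*q values per frequency bin."""
    def __init__(self, p=3, q=4):
        self.p = p
        self.buffers = [deque(maxlen=p) for _ in range(q)]
        self.estimate = 0.0  # last completed cascaded-median output

    def update(self, x):
        """Feed one new magnitude sample; return the current noise estimate."""
        v = x
        for buf in self.buffers:
            buf.append(v)
            if len(buf) < self.p:
                return self.estimate  # this stage not yet full
            v = statistics.median(buf)  # p-point median feeds the next stage
            buf.clear()                 # non-overlapping p-sample blocks
        self.estimate = v
        return self.estimate
```

With p = 3 and q = 4 (M = 81), each stage does a single 3-point median at most once per update, matching the slide's reduction from 40 sortings per frame per bin to 3.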

10 Resynthesis of enhanced signal
– Spectral subtraction → enhanced magnitude spectrum
– Enhanced magnitude spectrum & original phase spectrum → complex spectrum
– Resynthesis using IFFT and overlap-add
Investigations using offline implementation (f_s = 12 kHz, frame = 30 ms)
– Overlap of 50% & 75%: indistinguishable outputs
– FFT length N = 512 & higher: indistinguishable outputs
– γ = 1 (magnitude subtraction): higher tolerance to variation in α, β values
– Duration needed for dynamic noise estimation ≈ 1 s
– p = 3 for simplifying programming and reducing the sorting operations
– 3-frame 4-stage cascaded median (M = 81, p = 3, q = 4), 50% overlap: moving median over 1.215 s
– Outputs using true median & cascaded median: indistinguishable
– Reduction in storage requirement per freq. bin: from 162 to 12 samples
– Reduction in number of sorting operations per frame per freq. bin: from 40 to 3
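The resynthesis path (enhanced magnitude + noisy phase → complex spectrum → IFFT → overlap-add) can be sketched as follows (a NumPy illustration, not the fixed-point implementation; it assumes an analysis window whose 50%-overlapped copies sum to a constant, e.g. a periodic Hann):

```python
import numpy as np

def resynthesize(mag_frames, noisy_frames, frame_len, hop):
    """Combine enhanced magnitude spectra with the noisy phase, then
    IFFT each frame and overlap-add.
    mag_frames: enhanced magnitude spectra (one per frame)
    noisy_frames: matching complex STFT frames (phase source)."""
    n_out = hop * (len(mag_frames) - 1) + frame_len
    out = np.zeros(n_out)
    for i, (mag, X) in enumerate(zip(mag_frames, noisy_frames)):
        # Unit-magnitude noisy phase: X_n(k) / |X_n(k)|
        phase = X / np.maximum(np.abs(X), 1e-12)
        y = np.fft.irfft(mag * phase, n=frame_len)  # back to time domain
        out[i * hop : i * hop + frame_len] += y     # overlap-add
    return out
```

If the magnitudes are left unmodified, the overlap-added output reproduces the input signal (up to window normalization), which is a convenient sanity check for the analysis-synthesis chain.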

11 3. Implementation for Real-time Processing
16-bit fixed-point DSP: TI TMS320C5515
– 16 MB memory space: 320 KB on-chip RAM with 64 KB dual-access RAM, 128 KB on-chip ROM
– Three 32-bit programmable timers, 4 DMA controllers each with 4 channels
– FFT hardware accelerator (8- to 1024-point FFT)
– Max. clock speed: 120 MHz
DSP board: eZdsp, with 4 MB on-board NOR flash for the user program
Codec TLV320AIC3204: stereo ADC & DAC, 16/20/24/32-bit quantization, 8-192 kHz sampling
Development environment for C: TI CCStudio, ver. 4.0

12 Implementation
– One codec channel (ADC and DAC) with 16-bit quantization
– Sampling frequency: 12 kHz
– Window length of 30 ms (L = 360) with 50% overlap, FFT length N = 512
– Storage of input samples, spectral values, processed samples: 16-bit real & 16-bit imaginary parts

13 Data transfers and buffering operations (S = L/2)
DMA cyclic buffers
– 3-block input buffer and 2-block output buffer (each block with S samples)
Pointers (incremented cyclically on DMA interrupt)
– current input block
– just-filled input block
– current output block
– write-to output block
Signal delay
– Algorithmic: 1 frame (30 ms)
– Computational: ≤ frame shift (15 ms)
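The buffering scheme above can be modeled roughly as follows (an illustrative plain-Python simulation, not the C5515 DMA code; the class and its pointer handling are my own simplification). Each simulated DMA interrupt stores the just-filled S-sample input block, advances the cyclic pointers, and processes an L-sample window formed from the two most recently filled input blocks (50% overlap):

```python
class DmaCyclicBuffers:
    """Illustrative model of the cyclic buffering: a 3-block input buffer
    and a 2-block output buffer, each block holding S = L/2 samples, with
    pointers incremented cyclically on every (simulated) DMA interrupt."""
    def __init__(self, S):
        self.S = S
        self.in_blocks = [None, None, None]        # 3-block input buffer
        self.out_blocks = [[0.0] * S, [0.0] * S]   # 2-block output buffer
        self.fill = 0   # input block currently being filled by DMA
        self.play = 0   # output block currently being played by DMA

    def dma_interrupt(self, new_block, process):
        """new_block: S fresh input samples; process: maps an L-sample
        window to >= S output samples. Returns the block sent to the DAC."""
        assert len(new_block) == self.S
        self.in_blocks[self.fill] = list(new_block)
        just_filled = self.fill
        prev = (self.fill - 1) % 3                 # previously filled block
        self.fill = (self.fill + 1) % 3
        self.play = (self.play + 1) % 2
        write_to = (self.play + 1) % 2             # block the DAC is not using
        if self.in_blocks[prev] is not None:
            window = self.in_blocks[prev] + self.in_blocks[just_filled]
            self.out_blocks[write_to] = list(process(window)[: self.S])
        return self.out_blocks[self.play]
```

Because processing of one window must finish before the next interrupt claims the write-to block, the computational delay is bounded by one frame shift, consistent with the delay figures on the slide.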

14 4. Test Results
Test material
– Speech: a recording of three isolated vowels, a Hindi sentence, and an English sentence (/a/-/i/-/u/ – “aayiye aap kaa naam kyaa hai?” – “Where were you a year ago?”) from a male speaker
– Noise: white, pink, babble, car, and train noises (AURORA)
– SNR: ∞, 15, 12, 9, 6, 3, 0, −3, −6 dB
Evaluation methods
– Informal listening
– Objective evaluation using the PESQ measure (0 – 4.5)
Results: offline processing
– Processing parameters: β = 0.001, α: 1.5 – 2.5
– Informal listening: no audible roughness in the enhanced speech; speech clipping at larger α

15 PESQ score vs SNR: noisy & enhanced speech
SNR advantage: 13 dB for white noise, 4 dB for babble

16 Processing examples & PESQ scores

Noise     PESQ (unprocessed)   PESQ (processed)   Optimal α
White     1.69                 2.55               2.5
Pink      1.79                 2.60               2.5
Babble    2.13                 2.50               2.0
Car       1.77                 2.41               2.5
Train     2.08                 2.65               2.5
(PESQ scores at SNR = 0 dB)

Speech material: speech-speech-silence-speech; speech: /-a-i-u-/ – “aayiye” – “aap kaa naam kyaa hai?” – “where were you a year ago?”
Processing parameters: frame length = 30 ms, overlap = 50%, β = 0.001, noise estimation by 3-point 4-stage cascaded median (estimation duration = 1.215 s)

17 Spectrograms: (a) clean speech, (b) noisy speech, (c) offline processed, (d) real-time processed
Example: “Where were you a year ago”, white noise, input SNR = 3 dB
More examples: http://www.ee.iitb.ac.in/~spilab/material/santosh/ncc2013

18 Results: Real-time processing
– Real-time processing tested using white, babble, car, pink, and train noises: real-time processed output perceptually similar to the offline processed output
– Signal delay = 48 ms
– Lowest clock for satisfactory operation = 16.4 MHz → processing capacity used ≈ 1/7 of the capacity at the highest clock (120 MHz)

19 5. Summary & Conclusions
– Proposed technique for suppression of additive noise: cascaded-median based dynamic noise estimation, reducing the computation and memory requirements for real-time operation
– Enhancement of speech with different types of additive stationary and non-stationary noise: SNR advantage (at PESQ score = 2.5) of 4-13 dB; increase in PESQ score (at SNR = 0 dB) of 0.37-0.86
– Implementation for real-time operation on the 16-bit fixed-point processor TI TMS320C5515: processing capacity used ≈ 1/7, signal delay = 48 ms
Further work
– Frequency & a posteriori SNR-dependent subtraction & spectral floor factors
– Combination of the speech enhancement technique with other processing techniques in sensory aids
– Implementation using other processors

20

21 Abstract
A spectral subtraction technique is presented for real-time speech enhancement in the aids used by hearing impaired listeners. To reduce computational complexity and memory requirement, it uses a cascaded-median based estimation of the noise spectrum, without voice activity detection. The technique is implemented and tested for satisfactory real-time operation, with a sampling frequency of 12 kHz, processing using a window length of 30 ms with 50% overlap, and noise estimation by a 3-frame 4-stage cascaded median, on a 16-bit fixed-point DSP processor with on-chip FFT hardware. Enhancement of speech with different types of additive stationary and non-stationary noise resulted in an SNR advantage of 4-13 dB.

22 References
[1] H. Levitt, J. M. Pickett, and R. A. Houde (eds.), Sensory Aids for the Hearing Impaired. New York: IEEE Press, 1980.
[2] J. M. Pickett, The Acoustics of Speech Communication: Fundamentals, Speech Perception Theory, and Technology. Boston, Mass.: Allyn Bacon, 1999, pp. 289-323.
[3] H. Dillon, Hearing Aids. New York: Thieme Medical, 2001.
[4] T. Lunner, S. Arlinger, and J. Hellgren, “8-channel digital filter bank for hearing aid use: preliminary results in monaural, diotic, and dichotic modes,” Scand. Audiol. Suppl., vol. 38, pp. 75-81, 1993.
[5] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Binaural dichotic presentation to reduce the effects of spectral masking in moderate bilateral sensorineural hearing loss,” Int. J. Audiol., vol. 51, no. 4, pp. 334-344, 2012.
[6] J. Yang, F. Luo, and A. Nehorai, “Spectral contrast enhancement: algorithms and comparisons,” Speech Commun., vol. 39, no. 1-2, pp. 33-46, 2003.
[7] T. Arai, K. Yasu, and N. Hodoshima, “Effective speech processing for various impaired listeners,” in Proc. 18th Int. Cong. Acoust., 2004, Kyoto, Japan, pp. 1389-1392.
[8] P. N. Kulkarni, P. C. Pandey, and D. S. Jangamashetti, “Multi-band frequency compression for improving speech perception by listeners with moderate sensorineural hearing loss,” Speech Commun., vol. 54, no. 3, pp. 341-350, 2012.
[9] P. C. Loizou, “Speech processing in vocoder-centric cochlear implants,” in A. R. Moller (ed.), Cochlear and Brainstem Implants, Adv. Otorhinolaryngol., vol. 64, Basel: Karger, 2006, pp. 109-143.
[10] P. C. Loizou, Speech Enhancement: Theory and Practice. New York: CRC, 2007.
[11] R. Martin, “Spectral subtraction based on minimum statistics,” in Proc. Eur. Signal Process. Conf., 1994, pp. 1182-1185.
[12] I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466-475, 2003.
[13] H. Hirsch and C. Ehrlicher, “Noise estimation techniques for robust speech recognition,” in Proc. IEEE ICASSP, 1995, pp. 153-156.

23 [14] V. Stahl, A. Fisher, and R. Bipus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” in Proc. IEEE ICASSP, 2000, pp. 1875-1878.
[15] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in Proc. IEEE ICASSP, 1979, pp. 208-211.
[16] S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113-120, 1979.
[17] Y. Lu and P. C. Loizou, “A geometric approach to spectral subtraction,” Speech Commun., vol. 50, no. 6, pp. 453-466, 2008.
[18] S. K. Basha and P. C. Pandey, “Real-time enhancement of electrolaryngeal speech by spectral subtraction,” in Proc. Nat. Conf. Commun. 2012, Kharagpur, India, pp. 516-520.
[19] K. Paliwal, K. Wójcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Commun., vol. 52, no. 5, pp. 450-475, 2010.
[20] Texas Instruments, Inc., “TMS320C5515 Fixed-Point Digital Signal Processor,” 2011, [online] Available: focus.ti.com/lit/ds/symlink/tms320c5515.pdf.
[21] Spectrum Digital, Inc., “TMS320C5515 eZdsp USB Stick Technical Reference,” 2010, [online] Available: support.spectrumdigital.com/boards/usbstk5515/reva/files/usbstk5515_TechRef_RevA.pdf.
[22] Texas Instruments, Inc., “TLV320AIC3204 Ultra Low Power Stereo Audio Codec,” 2008, [online] Available: focus.ti.com/lit/ds/symlink/tlv320aic3204.pdf.
[23] ITU, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” ITU-T Rec. P.862, 2001.
[24] S. K. Waddi, “Speech enhancement results,” 2013, [online] Available: www.ee.iitb.ac.in/~spilab/material/santosh/ncc2013.

