IIT Bombay, ICSCN 2008 - International Conference on Signal Processing, Communications and Networking


2 IIT Bombay, arjayani@ee.iitb.ac.in
ICSCN 2008 - International Conference on Signal Processing, Communications and Networking
Automated Detection of Transition Segments for Intensity and Time-Scale Modification for Speech Intelligibility Enhancement
by A. R. Jayan, P. C. Pandey, P. K. Lehana
EE Dept, IIT Bombay
5th January, 2008

3 PAPER OUTLINE
1. Introduction
2. Acoustic Properties of Clear Speech
3. Automated Detection of Transition Segments
4. Intensity and Time-Scale Modification
5. Experimental Results
6. Summary and Conclusion

4 INTRODUCTION
Speech landmarks
▪ Regions in speech containing important information for speech perception
▪ Associated with spectral transitions
▪ Most landmarks coincide with phoneme boundaries
Landmark types
1. Abrupt-consonantal (AC): tight constrictions of primary articulators
2. Abrupt (A): fast glottal or velum activity
3. Non-abrupt (N): semi-vowel landmarks, less vocal tract constriction
4. Vocalic (V): vowel landmarks, oral cavity maximally open, maximum energy, F1
▪ Abrupt (~68%)
▪ Vocalic (~29%)
▪ Non-abrupt (~3%)

5 Objective
▪ To improve speech intelligibility in quiet and noisy environments
▪ Automated detection of landmarks
▪ Speech modification using acoustic properties of clear speech

6 ACOUSTIC PROPERTIES OF CLEAR SPEECH
Clear speech: speech produced with clear articulation when talking to a hearing-impaired listener or in noisy environments
Examples (conversational and clear versions): http://www.acoustics.org/press/145th/clr-spch-tab.htm
▪ ‘the book tells a story’
▪ ‘the boy forgot his book’
Intelligibility of clear speech
▪ More intelligible for different classes of listeners and listening conditions
▪ Picheny et al. (1985): ~17% more intelligible than conversational speech

7 Acoustic properties of clear speech (Picheny et al., 1986)
▪ Sentence level
  ▪ Reduced speaking rate (conversational: 200 wpm, clear: 100 wpm)
  ▪ Larger variation in fundamental frequency
  ▪ More pauses, longer pause durations
▪ Word level
  ▪ Fewer sound deletions
  ▪ More sound insertions
▪ Phonetic level
  ▪ Context-dependent, non-linear increase in segment durations
  ▪ More targeted vowel formants
  ▪ Increase in consonant intensity

8 Acoustic cues in clear speech are more robust and discriminable
Speech intelligibility of conversational speech can be improved by incorporating properties of clear speech
▪ Consonant-vowel intensity ratio (CVR) enhancement: increasing the ratio of rms energy of the consonant segment to that of the nearby vowel
▪ Consonant duration enhancement: increasing VOT, burst duration, formant transition duration
Difficulties
▪ Detection of regions for modification
▪ Performing modification with low signal processing artifacts
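
The CVR definition above (ratio of rms energy of a consonant segment to that of the nearby vowel) can be illustrated with a minimal sketch. The function names and the toy segments are hypothetical and not taken from the presentation. The sketch assumes the consonant and vowel segments have already been located, and it scales the consonant samples by a fixed gain in dB.

```python
import math

def rms(samples):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def cvr_db(consonant, vowel):
    """Consonant-vowel intensity ratio in dB (rms consonant over rms vowel)."""
    return 20.0 * math.log10(rms(consonant) / rms(vowel))

def enhance_cvr(consonant, gain_db):
    """Scale the consonant segment so that the CVR rises by gain_db."""
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * s for s in consonant]

# Toy segments (hypothetical values, not data from the study).
vowel = [0.5 * math.sin(2 * math.pi * 0.05 * n) for n in range(200)]
consonant = [0.05 * math.sin(2 * math.pi * 0.3 * n) for n in range(100)]

before = cvr_db(consonant, vowel)
after = cvr_db(enhance_cvr(consonant, 6.0), vowel)
print(f"CVR before: {before:.2f} dB, after: {after:.2f} dB")
```

A +6 dB enhancement corresponds to multiplying the consonant amplitude by roughly 2, which is why abrupt gain changes at segment boundaries are a likely source of the processing artifacts listed under Difficulties.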

9 Earlier studies on CVR enhancement
▪ House et al. (1965): MRT, high scores for high consonant level
▪ Gordon-Salant (1986): CVR +10 dB, 19 CV, elderly SNHI, +16%
▪ Guelke (1987): Burst intensity +17 dB, stop CV, NH, +40%
▪ Montgomery et al. (1987): CVR -20 dB to +9 dB, CVC, NH, SNHI, no significant loudness increase
▪ Freyman & Nerbonne (1989): Equated consonant levels across talkers, CV syllables, NH, +12%
▪ Thomas & Pandey (1996): CVR +3 to +12 dB, CV & VC, NH, +16%
▪ Kennedy et al. (1997): CE 0-24 dB, VC, SNHI, max CE: 8.3 dB (voiced), 10.7 dB (unvoiced)
▪ Hazan & Simpson (1998): Burst +12 dB, fric. +6 dB, nas. +6 dB filtering, VCV, SUS, NH, +12%

10 Earlier studies on duration enhancement
▪ Gordon-Salant (1986): DUR +100%, marginal improvement
▪ Thomas & Pandey (1996): BD +100%, FTD +50%, VOT +100%; BD, FTD → improved scores, VOT → degraded
▪ Vaughan et al. (2002): Unvoiced consonants expanded by 1.2, 1.4; 1.4 effective in noisy condition
▪ Nejime & Moore (1998): Voiced segments expanded by 1.2, 1.5; degraded performance
▪ Liu & Zeng (2006): Temporal envelope (2-50 Hz) contributes at positive SNRs; fine structure (> 500 Hz) contributes at lower SNRs
▪ Hodoshima et al. (2007): Slowed down, steady-state suppressed speech more intelligible in reverberant environments

11 AUTOMATED DETECTION OF TRANSITION SEGMENTS
Identifying regions for enhancement: segmentation / landmark detection
Manual segmentation
▪ accurate
▪ high detection rate
▪ time consuming
▪ subjective
▪ useful only for research, not for actual application
Automated detection of segments
▪ low detection rate
▪ less accurate
▪ consistent
Segmentation based on spectral transition measures
▪ maximum spectral transitions coincide with segment boundaries

12 Earlier studies on automated segmentation
▪ Mermelstein (1975): based on loudness variation, low detection rate, slow carefully uttered speech
▪ Glass & Zue (1988): based on auditory critical bands, detection rate 90%, ± 20 ms
▪ Sarkar & Sreenivas (2005): based on level crossing rate, adaptive level allocation, detection rate 78.6%, ± 20 ms
▪ Alani & Deriche (1999): wavelet transform based, energy in different bands, detection rate 90.9%, ± 20 ms
▪ Liu (1996): landmark detection algorithm, energy variation in spectral bands, detection rate 83%, ± 20 ms

13 Earlier studies on automated intelligibility enhancement
▪ Colotte & Laprie (2000)
  ▪ Segmentation by spectral variation function (82%)
  ▪ Stops and unvoiced fricatives amplified by +4 dB
  ▪ Time-scaled by 1.8, 2.0 (TD-PSOLA)
  ▪ Missing word identification, TIMIT sentences
  ▪ Improved performance
▪ Skowronski & Harris (2006)
  ▪ Spectral transition measure based voiced/unvoiced classification
  ▪ Energy redistribution in voiced / unvoiced segments (ERVU)
  ▪ Amplifying low energy temporal regions critical to intelligibility
  ▪ Confusable words, TI-46 corpus, 16 talkers, 25 subjects
  ▪ Improved performance for 9 talkers, no degradation for others
  ▪ Enhancement useful for native & non-native listeners

14 PROPOSED METHOD FOR INTELLIGIBILITY ENHANCEMENT
▪ VC and CV transition segments expanded, steady-state segments compressed, overall speech duration kept unaltered
▪ Intensity scaling of transition segments (CVR enhancement)
▪ Objective: reducing the masking of consonantal segments by vowel segments
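
Keeping the overall duration unaltered ties the steady-state compression to the transition expansion. As a minimal sketch, assume a single expansion factor beta is applied to all transition segments and a single compression factor alpha to all steady-state segments (the presentation does not spell out this bookkeeping). Requiring beta * T_tr + alpha * T_ss = T_tr + T_ss then gives alpha = 1 - (beta - 1) * T_tr / T_ss. The function name and the example durations are hypothetical.

```python
def steady_state_factor(transition_dur, steady_dur, beta):
    """Compression factor for steady-state segments that keeps total duration fixed.

    transition_dur, steady_dur: total durations (seconds) of the two segment classes.
    beta: expansion factor applied to the transition segments (beta >= 1).
    """
    alpha = 1.0 - (beta - 1.0) * transition_dur / steady_dur
    if alpha <= 0.0:
        raise ValueError("steady-state segments are too short for this expansion")
    return alpha

# Hypothetical VCV syllable: 60 ms of transitions, 240 ms of steady state.
alpha = steady_state_factor(0.06, 0.24, 1.5)
new_total = 1.5 * 0.06 + alpha * 0.24
print(f"alpha = {alpha:.3f}, total duration = {new_total:.3f} s")
```

The guard on alpha makes the limit explicit: large expansion factors are only feasible when enough steady-state material is available to absorb them.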

15 Liu’s landmark detection algorithm
▪ Based on energy variation in 6 spectral bands
▪ Segment duration, articulatory, and phonetic class constraints
▪ Glottal, sonorant closures, releases, stop closures, releases
▪ Peak picking based on convex-hull algorithm
▪ Matching of peaks across bands for locating boundaries
▪ Detection rate 83%, accuracy ± 20 ms
Observations
▪ Assumptions in the method
  ▪ Spectral prominence represented by peak energy in the band
  ▪ One spectral prominence per band
▪ Information regarding frequency location of peak energy not used

16 Landmark detection using spectral peaks and centroids
▪ Spectrum divided into five non-overlapping bands: 0–0.4, 0.4–1.2, 1.2–2.0, 2.0–3.5, 3.5–5.0 kHz
Spectral peak and centroid estimated in each band and used for calculating transition index
[Defining equations not preserved in the transcript]
▪ Peak energy
▪ Centroid frequency
▪ Rate-of-rise functions
▪ Transition index
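
Since the slide's equations did not survive, the following is only a sketch of how a per-band peak energy and centroid frequency could be computed from one frame's power spectrum. The band edges come from the slide. Everything else is an assumption: the function name, the use of dB for peak energy, and the power-weighted mean as the centroid.

```python
import math

# Band edges in Hz, taken from the slide.
BANDS_HZ = [(0, 400), (400, 1200), (1200, 2000), (2000, 3500), (3500, 5000)]

def band_features(freqs_hz, power):
    """Return one (peak_energy_db, centroid_hz) pair per band for a single frame.

    freqs_hz and power are parallel lists describing the frame's power spectrum.
    Peak energy in dB and a power-weighted centroid are assumed definitions.
    """
    features = []
    for lo, hi in BANDS_HZ:
        idx = [i for i, f in enumerate(freqs_hz) if lo <= f < hi]
        if not idx:
            raise ValueError(f"no spectrum bins in band {lo}-{hi} Hz")
        band_power = [power[i] for i in idx]
        peak_db = 10.0 * math.log10(max(band_power) + 1e-12)
        total = sum(band_power)
        if total > 0.0:
            centroid = sum(freqs_hz[i] * power[i] for i in idx) / total
        else:
            centroid = 0.5 * (lo + hi)  # silent band: fall back to band centre
        features.append((peak_db, centroid))
    return features

# Toy spectrum (hypothetical): flat noise floor with one prominence at 1000 Hz.
freqs = [50.0 * k for k in range(100)]
power = [1e-6] * len(freqs)
power[20] = 1.0
features = band_features(freqs, power)
peak_db, centroid = features[1]
print(f"band 0.4-1.2 kHz: peak {peak_db:.1f} dB, centroid {centroid:.0f} Hz")
```

Tracking the centroid alongside the peak energy is what lets the detector use the frequency location of a spectral prominence, the information noted as unused in Liu's method.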

17 Spectral peak & centroid variation in bands
Example: /aka/
▪ Centroid variation not necessarily in phase with energy variation
▪ Transitions: some of the energy peaks and centroids undergo change
[Figure panels: 0-0.4 kHz, 0.4-1.2 kHz, 1.2-2.0 kHz, 2.0-3.5 kHz, 3.5-5.0 kHz]

18 Peak & centroid ROR contours
Example: /aba/
Observation: the product of the two RORs is near zero during steady states and peaks during transition segments
[Figure panels: 0-0.4 kHz, 0.4-1.2 kHz, 1.2-2.0 kHz, 2.0-3.5 kHz, 3.5-5.0 kHz]
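
The observation about the product of the two rate-of-rise (ROR) contours suggests one possible form for a transition index, sketched below. This is an assumed formulation, not the authors' equation. ROR is taken as a symmetric difference over a few frames, and the index as the sum over bands of the absolute product of the peak-energy ROR and the centroid ROR. All names and the toy tracks are hypothetical.

```python
def rate_of_rise(track, step):
    """Symmetric difference x[n + step] - x[n - step]; zero near the edges."""
    n_frames = len(track)
    ror = [0.0] * n_frames
    for n in range(step, n_frames - step):
        ror[n] = track[n + step] - track[n - step]
    return ror

def transition_index(peak_tracks, centroid_tracks, step=2):
    """Sum over bands of |ROR(peak energy) * ROR(centroid)| for every frame.

    peak_tracks, centroid_tracks: one list of per-frame values for each band.
    """
    n_frames = len(peak_tracks[0])
    index = [0.0] * n_frames
    for peaks, cents in zip(peak_tracks, centroid_tracks):
        ror_p = rate_of_rise(peaks, step)
        ror_c = rate_of_rise(cents, step)
        for n in range(n_frames):
            index[n] += abs(ror_p[n] * ror_c[n])
    return index

# Toy single-band tracks (hypothetical): a vowel-to-closure change at frame 10.
peaks_db = [-10.0] * 10 + [-40.0] * 10
centroids_hz = [800.0] * 10 + [1100.0] * 10
index = transition_index([peaks_db], [centroids_hz], step=2)
active = [n for n, v in enumerate(index) if v > 0.0]
print("frames with non-zero transition index:", active)
```

In this toy case the index is zero wherever both tracks are flat and non-zero only in the few frames straddling the change, which mirrors the behaviour described in the observation.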

19 Detection of transition segments
[Figure, labelled /aba/: (a) signal waveform for VCV syllable /aka/, (b) spectrogram, (c) transition index, (d) transition boundaries detected]

20 Evaluation using sentences
[Figure: (a) sentence ‘put the butcher block table’, (b) TIMIT landmarks, and (c) detected landmarks]
Manual annotation: “bcl” - /b/ closure onset, “b” - /b/ release burst, etc.
Automatic detection: landmarks numbered as 5, 6, etc.

21 Evaluation using sentences
▪ 50 manually annotated sentences from TIMIT database
▪ 5 speakers: 3 female, 2 male
Detection rates [table not preserved in the transcript]
ST: stop, FR: fricative, NAS: nasal, V: vowel, SV: semivowel
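
A detection rate of this kind can be computed by matching detected landmarks to manual annotations within a time tolerance. The sketch below assumes a ±20 ms tolerance, borrowed from the earlier studies cited in this presentation. The tolerance used for this evaluation is not stated, and the greedy one-to-one matching and the example times are hypothetical.

```python
def detection_rate(reference_s, detected_s, tolerance_s=0.020):
    """Fraction of reference landmarks matched by a distinct detected landmark.

    A detected landmark matches when it lies within tolerance_s of the reference
    time; each detected landmark can be used only once (greedy matching).
    """
    unused = sorted(detected_s)
    hits = 0
    for ref in sorted(reference_s):
        match = next((d for d in unused if abs(d - ref) <= tolerance_s), None)
        if match is not None:
            hits += 1
            unused.remove(match)
    return hits / len(reference_s)

# Hypothetical landmark times in seconds (not data from the study).
reference = [0.100, 0.250, 0.400, 0.620]
detected = [0.108, 0.290, 0.395, 0.700]
rate = detection_rate(reference, detected)
print(f"detection rate: {100 * rate:.0f}%")
```

Unmatched detections left in `unused` would count as insertions, which a fuller evaluation would report alongside the detection rate.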

22 Harmonic plus noise model (HNM) (Stylianou 1996)
▪ Harmonic (deterministic) part: quasi-periodic components of speech, modeled by harmonics of the fundamental frequency
▪ Noise (stochastic) part: non-periodic components, modeled by LPC coefficients and an energy envelope
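
The two-part decomposition can be made concrete with a heavily simplified synthesis sketch for one frame. It is not the analysis/synthesis system used in the presentation. The harmonic part is a sum of cosines at multiples of F0 up to the maximum voiced frequency. The noise part is white Gaussian noise shaped by an all-pole LPC filter, and a constant gain stands in for the energy envelope. All names and parameter values are hypothetical.

```python
import math
import random

def harmonic_part(f0, fm, amps, phases, fs, n_samples):
    """Sum of harmonics k*f0 (k = 1, 2, ...) that lie at or below fm."""
    n_harm = min(len(amps), int(fm // f0))
    return [
        sum(amps[k] * math.cos(2 * math.pi * (k + 1) * f0 * n / fs + phases[k])
            for k in range(n_harm))
        for n in range(n_samples)
    ]

def noise_part(lpc, gain, n_samples, rng):
    """White noise through the all-pole filter 1/A(z), A(z) = 1 + sum a_i z^-i."""
    out = []
    for n in range(n_samples):
        y = gain * rng.gauss(0.0, 1.0)
        for i, a in enumerate(lpc, start=1):
            if n - i >= 0:
                y -= a * out[n - i]
        out.append(y)
    return out

def synthesize_frame(voiced, f0, fm, amps, phases, lpc, gain, fs, n_samples, rng):
    """Voiced frame: harmonic + noise part. Unvoiced frame: noise part only."""
    noise = noise_part(lpc, gain, n_samples, rng)
    if not voiced:
        return noise
    harm = harmonic_part(f0, fm, amps, phases, fs, n_samples)
    return [h + v for h, v in zip(harm, noise)]

# Hypothetical parameters for one 10 ms voiced frame at 10 kHz sampling.
rng = random.Random(0)
amps = [1.0 / (k + 1) for k in range(10)]
phases = [0.0] * 10
harm = harmonic_part(125.0, 1000.0, amps, phases, 10000, 100)
frame = synthesize_frame(True, 125.0, 1000.0, amps, phases, [-0.9], 0.01, 10000, 100, rng)
print(len(frame), "samples synthesized")
```

Because each frame is fully described by a handful of parameters, time-scale and intensity changes can be applied to the parameters before resynthesis rather than to the waveform itself.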

23 HNM parameters (Lehana and Pandey)
Voiced / unvoiced classification (V/UV)
▪ Harmonic part
  ▪ Pitch F0
  ▪ Maximum voiced frequency Fm
  ▪ Amplitudes and phases of harmonics Ak
▪ Noise part
  ▪ LPC coefficients
  ▪ Energy envelope
Voiced frame → parameters (harmonic part + noise part)
Unvoiced frame → parameters (noise part)

24 HNM based analysis stage
▪ Modification using a small parameter set
▪ Low perceptual distortions, preserves naturalness and intelligibility
[Figure: HNM analysis stage]

25 HNM based time-scale modification stage
Scaling factors [not preserved in the transcript]

26 Time scaling of consonant duration with steady-state compression
Example: VCV syllable /aba/
[Audio demonstration table. Columns (SNR): orig., +6 dB, +3 dB, 0 dB, -2 dB, -4 dB, -6 dB. Rows: aba, Syn., Tsm. β = 1.5, Tsm. β = 2, Tsm. β = 3]

27 Spectrograms: time-scaled VCV syllable /ama/
[Figure panels: Orig., Synth., β = 1.5, β = 2, β = 2.5; annotations: steady-state compression, transition segment expansion]

28 Time and intensity scaling: VCV syllable /aba/
[Figure panels: original, time-scaled, intensity enhanced +6 dB]

29 EXPERIMENTAL RESULTS
▪ Test material: VCV syllables /aba/, /ada/, /aga/, /apa/, /ata/, /aka/
▪ Time scaling factors: 1.0, 1.2, 1.5, 1.8, 2.0
▪ CVR enhancement: +6 dB
12 processing conditions
▪ Unprocessed: UP
▪ Enhanced CVR without time-scaling: E
▪ Time scaled: TS-1.0, TS-1.2, TS-1.5, TS-1.8, TS-2.0
▪ Enhanced CVR, time scaled: ETS-1.0, ETS-1.2, ETS-1.5, ETS-1.8, ETS-2.0
Simulated hearing impairment (adding broadband noise)
▪ 6 different SNR levels (inf, 0, -3, -6, -9, and -12 dB)
▪ 72 test conditions
▪ 60 presentations, 5 tests for each condition, 1 subject
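
The condition counts follow from crossing the 12 processing conditions with the 6 SNR levels. The sketch below enumerates them and shows one way to add noise at a target SNR. White Gaussian noise and an SNR measured over the whole token are assumptions, since the slide only says "broadband noise". The function names are hypothetical.

```python
import math
import random

TIME_SCALES = [1.0, 1.2, 1.5, 1.8, 2.0]
SNRS_DB = [math.inf, 0, -3, -6, -9, -12]

def processing_conditions():
    """Labels for the processing conditions, following the slide's naming."""
    conds = ["UP", "E"]
    conds += [f"TS-{b:.1f}" for b in TIME_SCALES]
    conds += [f"ETS-{b:.1f}" for b in TIME_SCALES]
    return conds

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to the requested SNR (inf: no noise)."""
    if math.isinf(snr_db):
        return list(signal)
    sig_power = sum(s * s for s in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(v * v for v in noise) / len(noise)
    scale = math.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * v for s, v in zip(signal, noise)]

conds = processing_conditions()
test_conditions = [(c, snr) for c in conds for snr in SNRS_DB]
print(len(conds), "processing conditions,", len(test_conditions), "test conditions")

# Toy token (hypothetical): a sine wave mixed with noise at -6 dB SNR.
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(1000)]
noisy = add_noise(clean, -6, random.Random(0))
```

Scaling the noise by its measured power, rather than its nominal variance, makes the realized SNR of each token match the target value.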

30 Results
▪ Time-scaling factors of 1.2-1.5 appear to be optimum
▪ Time-scaling improves performance at lower SNR levels
▪ Consonant intensity enhancement more effective than time-scaling

31 SUMMARY & CONCLUSION
Processing improved recognition scores for stop consonants
▪ Without increasing overall speech duration
▪ Method found more effective at lower SNR levels
▪ Place feature identification improved significantly by processing
▪ Intensity enhancement found more effective than duration enhancement
To be investigated
▪ Optimum scaling factors for different speech material
▪ Testing using different speech material
▪ Testing on more subjects and on subjects with sensorineural impairment
▪ Analysis in terms of vowel context, consonant category
▪ Quantitative analysis of intelligibility enhancement: MRT

