A glimpsing model of speech perception

1 A glimpsing model of speech perception
Martin Cooke & Sarah Simpson
Speech and Hearing Research, Department of Computer Science, University of Sheffield

2 Motivation: The nonstationarity ‘paradox’
speech technology performance falls with the nonstationarity of the noise background … Aurora eval Simpson & Cooke (2003)

3 Motivation: The nonstationarity ‘paradox’
speech technology performance falls with the nonstationarity of the noise background … Miller (1947) … while listeners appear to prefer a nonstationary background (8-12 dB SRT gain) Simpson & Cooke (2003)

4 Possible factors
In a 1-speaker background, listeners can …
… employ organisational cues from the background source to help segregate foreground
… employ schemas for both foreground and background
… benefit from better glimpses of the speech target
but: multi-speaker backgrounds have certain advantages …
… less chance of informational masking
… easier enhancement algorithm

5 Glimpsing opportunities
Spectro-temporal glimpse densities % of time-frequency regions with a locally-positive SNR
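A minimal sketch of the glimpse-density measure described above, assuming the target and masker are available separately as per-frame, per-channel energies. The function names and the toy values are illustrative and do not come from the slides.

```python
import math

def local_snr_db(target_energy, masker_energy, eps=1e-12):
    # Local SNR in dB for a single time-frequency cell.
    return 10.0 * math.log10((target_energy + eps) / (masker_energy + eps))

def glimpse_percentage(target, masker, threshold_db=0.0):
    """Percentage of time-frequency cells whose local SNR exceeds threshold_db.

    target, masker: lists of frames, each frame a list of channel energies.
    """
    total = 0
    glimpsed = 0
    for t_frame, m_frame in zip(target, masker):
        for t, m in zip(t_frame, m_frame):
            total += 1
            if local_snr_db(t, m) > threshold_db:
                glimpsed += 1
    return 100.0 * glimpsed / total if total else 0.0

# Toy example: 2 frames x 3 channels (made-up energies).
target = [[4.0, 1.0, 0.5], [2.0, 0.1, 3.0]]
masker = [[1.0, 2.0, 0.5], [1.0, 1.0, 1.0]]
print(glimpse_percentage(target, masker))
```

A cell at exactly 0 dB is not counted here, since the slide specifies a locally positive SNR.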

6 Glimpsing
Informal definition
a glimpse is some time-frequency region which contains a reasonably undistorted ‘view’ of local signal properties
Precursors
Term used by Miller & Licklider (1950) to explain intelligibility of interrupted speech
Related to ‘multiple looks’ model of Viemeister & Wakefield (1991), which demonstrated ‘intelligent’ temporal integration of tone bursts
Assmann & Summerfield (in press) suggest ‘glimpsing & tracking’ as a way of understanding how listeners cope with adverse conditions
Culling & Darwin (1994) developed a glimpsing model to explain double vowel identification for small ΔF0s
de Cheveigné & Kawahara (1999) can be considered a glimpsing model of vowel identification
Close relation to missing data processing (Cooke et al, 1994)

7 Types of glimpses
Comodulated: e.g. Miller & Licklider (1950)
Spectral: e.g. Warren et al (1995)
General uncomodulated: e.g. Howard-Jones & Rosen (1993), Buss et al (2003)

8 Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into 24 ¼-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target level with a constant value. Found intelligibility of 60% when 98% of signal was missing

9 Glimpsing in natural conditions: the dominance effect
Although audio signals combine additively, the occlusion metaphor is more appropriate because of the log-like compression in the auditory system.
Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
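The dominance effect can be illustrated numerically. The sketch below assumes that the energies of two uncorrelated sources add within a time-frequency cell, and measures how far the log-compressed mixture lies from the stronger source alone. The function names are illustrative.

```python
import math

def db(x):
    # Energy expressed in decibels.
    return 10.0 * math.log10(x)

def max_approx_error_db(a, b):
    """Error in dB when the mixture energy a + b is approximated by max(a, b)."""
    return db(a + b) - db(max(a, b))

# Print the error as the weaker source drops further below the stronger one.
for diff_db in (0, 3, 6, 10, 20):
    a = 1.0
    b = 10 ** (-diff_db / 10.0)
    print(diff_db, round(max_approx_error_db(a, b), 2))
```

The error is largest (about 3 dB) when the two sources are equal in a cell and shrinks quickly as one source dominates, which is why the log-domain mixture behaves much like occlusion.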

10 Issues for a glimpsing model
What constitutes a useful glimpse?
Is sufficient information contained in glimpses?
How do listeners detect glimpses?
How can they be integrated?
Glimpse detection
Glimpse integration

11 Glimpsing study
Aims
Determine if glimpses contain sufficient information
Explore definition of useful glimpse
Comparison between listeners and model using natural VCV stimuli
Subset of Shannon et al (1999) corpus
V = /a/
C = { b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch }
Background source
reversed multispeaker babbler for N = 1, 8
Allows variation in glimpsing opportunities
3 SNRs (TMRs): 0, -6 and -12 dB
12 listeners heard 160 tokens in each condition
2 repeats × 16 VCVs × 5 male speakers

12 Identification results
[Figure panels: 1-speaker, 8-speaker]

13 Glimpsing model
CDHMM employing missing data techniques
16 whole-word HMMs
8 states
4 component Gaussian mixture per state
Input representation
10 ms frames of modelled auditory excitation pattern (40 gammatone filters, Hilbert envelope, 8 ms smoothing)
NB: only simultaneous masking is modelled
Training
8 repetitions of each VCV by 5 male speakers per model
Testing
As for listeners, viz. 2 repetitions of each VCV by 5 male speakers
Performance in clean: > 99%
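The slide does not say which missing data technique the recogniser uses. One common option is marginalisation, in which a state's Gaussian mixture likelihood is evaluated over the glimpsed channels only. The sketch below assumes diagonal covariances and plain marginalisation (no bounds on the masked channels). All names are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    # Log density of a univariate Gaussian.
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def missing_data_loglik(frame, mask, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM,
    using only the channels where mask is True (assumed marginalisation)."""
    comp_logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, present, mu_d, var_d in zip(frame, mask, mu, var):
            if present:  # masked channels are integrated out, contributing 0
                ll += log_gauss(x, mu_d, var_d)
        comp_logs.append(ll)
    top = max(comp_logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(c - top) for c in comp_logs))

# Toy 2-channel frame with only the first channel glimpsed.
score = missing_data_loglik(
    frame=[0.3, 5.0], mask=[True, False],
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(score)
```

With every channel masked, the score reduces to the log of the summed mixture weights, so a frame with no glimpses neither favours nor penalises any state.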

14 Model performance I: ideal glimpses
All time-frequency regions whose local SNR exceeds a threshold
Optimum threshold = 0 dB
For this task, there is more than sufficient information in the glimpsed regions
Listeners perform suboptimally with respect to this glimpse definition
[Figure panels: 1-speaker, 8-speaker]

15 Model performance: variation in detection threshold
Q: Can varying the local SNR threshold for glimpse detection produce a better match?
No choice of local SNR threshold provides good fit to listeners
Closest fit shown (-6 dB)
[Figure panels: 1-speaker, 8-speaker]

16 Analysis
Unreasonable to expect listeners to detect individual glimpses in a sea of noise unless glimpse region is large enough

18 Model performance: useable glimpses
Definition: glimpsed region must occupy at least N ERBs and T ms
Search over 1-15 ERBs, ms, at various detection thresholds
Best match at: 6.3 ERBs (9 channels), 40 ms, 0 dB local SNR threshold
[Figure panels: 1-speaker, 8-speaker]
Howard-Jones & Rosen (1993) suggested 2-4 bands limit for uncomodulated glimpsing
Buss et al (2003) found evidence for uncomodulated glimpsing in up to 9 bands
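One way to realise the size criterion is sketched below. It assumes that a glimpsed region is a 4-connected set of time-frequency cells above the detection threshold, and that its size is measured as its span in channels and frames. Neither assumption is stated on the slide. With the 10 ms frames of the input representation, the best-match setting would correspond to min_channels=9 and min_frames=4.

```python
def usable_glimpses(mask, min_channels, min_frames):
    """Keep only connected glimpse regions spanning enough channels and frames.

    mask[t][f] is True where the local SNR exceeds the detection threshold.
    Returns a new mask of the same shape.
    """
    n_t = len(mask)
    n_f = len(mask[0]) if n_t else 0
    seen = [[False] * n_f for _ in range(n_t)]
    out = [[False] * n_f for _ in range(n_t)]
    for t0 in range(n_t):
        for f0 in range(n_f):
            if not mask[t0][f0] or seen[t0][f0]:
                continue
            # Flood-fill one 4-connected region starting at (t0, f0).
            stack = [(t0, f0)]
            seen[t0][f0] = True
            region = []
            while stack:
                t, f = stack.pop()
                region.append((t, f))
                for dt, df in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    u, g = t + dt, f + df
                    if 0 <= u < n_t and 0 <= g < n_f and mask[u][g] and not seen[u][g]:
                        seen[u][g] = True
                        stack.append((u, g))
            t_span = max(t for t, _ in region) - min(t for t, _ in region) + 1
            f_span = max(f for _, f in region) - min(f for _, f in region) + 1
            if t_span >= min_frames and f_span >= min_channels:
                for t, f in region:
                    out[t][f] = True
    return out

# Toy mask: a 2x2 block plus one isolated cell, which is discarded.
mask = [
    [True, True, False, True],
    [True, True, False, False],
    [False, False, False, False],
]
kept = usable_glimpses(mask, min_channels=2, min_frames=2)
print(kept)
```

Measuring region area rather than span would be an equally plausible reading of the definition and would only change the final test inside the loop.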

19 Consonant identification
Reasonable matches overall apart from b, s & z However, little token-by-token agreement between common listener errors and model errors. Why?

20 Factors
‘Confusability’
Audibility of target
Informational masking
Energetic masking
Existence of schemas for target
Successful identification
Organisational cues in target
Existence of schemas for background
Organisational cues in background

21 Measuring energetic masking
Approach: resynthesise glimpses alone
Filter, time-reverse, refilter to remove phase distortion
Select regions based on local SNR mask
Results
Little difference for 1-speaker background, suggesting relatively low contribution of informational masking in this case (due to reversed masker?)
Larger difference for 8-speaker case, possibly due to ‘unrealistic’ glimpses
[Figure panels: 1-speaker, 8-speaker; conditions: glimpses alone, speech+noise]
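The filter, time-reverse, refilter step is a standard way of cancelling a filter's phase response. The sketch below shows the idea with a one-pole smoother standing in for the auditory filters, which are presumably what the resynthesis actually uses. The function names and the alpha parameter are illustrative.

```python
def one_pole(x, alpha=0.5):
    # Simple causal one-pole low-pass filter (stand-in for an auditory filter).
    y = []
    state = 0.0
    for sample in x:
        state = alpha * sample + (1.0 - alpha) * state
        y.append(state)
    return y

def zero_phase(x, alpha=0.5):
    """Filter, time-reverse, refilter, then reverse again.

    The second pass undoes the delay introduced by the first, so the
    overall response is symmetric in time (zero phase).
    """
    forward = one_pole(x, alpha)
    backward = one_pole(forward[::-1], alpha)
    return backward[::-1]

# An impulse in the middle of a short signal.
impulse = [0.0] * 41
impulse[20] = 1.0
response = zero_phase(impulse)
print(response[18:23])
```

A single causal pass smears the impulse only forwards in time. After the second, reversed pass the response is spread symmetrically around the original impulse position, apart from small edge effects near the ends of the signal.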

22 Comparison with ideal model
Results
Ideal model performs well in excess of listeners when supplied with precisely the same information
Possible reasons:
Distortions
Glimpses do not occur in isolation: possibility that a noise background will help
Lack of nonsimultaneous masking model will inflate model performance
[Figure legend: Ideal (model), Ideal? (listeners)]

23 The glimpse decoder
Attempt at a unifying statistical theory for primitive and model-driven processes in CASA
Basic idea: decoder not only determines the most likely speech hypothesis but also decides which glimpses to use
Key advantage: no longer need to rely on clean acoustics!
Can interpret (some) informational masking effects as the incorrect assignment of glimpses during signal interpretation
Barker, J., Cooke, M.P. & Ellis, D.P.W. “Decoding speech in the presence of other sources”, accepted for Speech Communication

24 Summary & outlook
Proposed a glimpsing model of speech identification in noise
Demonstrated sufficiency of information in target glimpses, at least for VCV task
Preliminary definition of useful glimpse gives good overall model-listener match
Introduced 2 procedures for measuring the amount of energetic masking: (i) via ASR, (ii) via glimpse resynthesis
Need nonsimultaneous masking model
Need to isolate effects due to schemas
Repeat using non-reversed speech to introduce more informational masking
Need to quantify effect of distortion in glimpse resynthesis

25 Masking noise can be beneficial
Warren et al (1995) demonstrated spectral induction effect with 2 narrow bands of speech with intervening noise
fullband
Cooke & Cunningham (in prep): spectral induction with single speech-bands.

26 Speech modulated noise
As in Brungart (2001)
Model results and glimpse distributions indicate increase in energetic masking for this type of masker
[Figure labels: Natural speech; Speech modulated noise. Legend: natural, 1 spkr; natural, 8 spkr; SMN, 1 spkr; SMN, 8 spkr]

27 Speech modulated noise
Listeners perform better with SMN than predicted on the basis of reduced glimpses (cf SMN model), but not quite as well as they do with natural speech masker
Suggests energetic masking is not the whole story (cf Brungart, 2001), but further work needed to quantify relative contribution of
Release from IM
Absence of background models/cues
[Figure panels: 1-speaker, 8-speaker; legend: SMN (model), NAT (model), SMN (listeners), NAT (listeners)]

