Human Speech Communication

Presentation on theme: "Human Speech Communication" - presentation transcript:

1 Human Speech Communication
[Diagram: the speech chain between speaker and listener. Message → linguistic code (< 50 b/s) → motor control → speech production → SPEECH SIGNAL (> 50 kb/s) → auditory processing → speech perception processes → listener, with dialogue interaction and self-control/adaptation as feedback paths.]

2 [Diagram: speaker and listener with shared knowledge. On the speaker side, the message (very low bit rate) is coded into the sound string "u h e l o w r d" (low bit rate) and then into the speech signal (high bit rate); on the listener side, the signal is decoded back into the sound string (low bit rate) and into the message (very low bit rate).]

3 Machine recognition of speech
[Diagram: machine recognition of speech maps the high-bit-rate signal to the low-bit-rate sound string "u h e l o w r d", then to words ("word", "another word"), and finally to the message.]

4 COARTICULATION [Figure: the sounds "u" and "o" illustrating coarticulation.]

5 "hello world" → sound string "u h e l o w r d": coarticulation + talker idiosyncrasies + environmental variability = a big mess

6 Two dominant sources of variability in speech
FEATURE VARIABILITY: different people sound different, communication environments differ, coarticulation effects, …
TEMPORAL VARIABILITY: people can say the same thing at different speeds.
A "doubly stochastic" process (Hidden Markov Model): speech is modelled as a sequence of hidden states (phonemes) and the task is to recover that sequence; we never know for sure which data a given state will generate, and we never know for sure which state we are in.

7 Already the old Greeks … (Plato's cave: a wall, a fire, and only the shadows and echoes of the activity behind the observers).

8 [Figure: a parade of boys and girls, each with a voice of fundamental frequency f0 = … Hz.] We want to know where the boys (girls) are. Do we know the typical ranges of boys' and girls' voices? How likely is it that a boy walks first? How many boys and girls typically walk together? How many more boys are there typically?

9 The model. [Diagram: a two-state model with states m and f; self-transition probabilities pm and pf, cross transitions 1-pm and 1-pf; emission distributions P(sound|gender) over f0; priors P(gender).] Given this knowledge, generate all possible sequences of boys and girls and find which of them could most likely have generated the observed sequence.
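A minimal sketch of that search for this two-state model, using the Viterbi algorithm instead of literally enumerating all sequences; the transition probabilities and Gaussian pitch distributions below are illustrative values, not numbers from the slide:

```python
# Hypothetical two-state "boys/girls" model: states m/f, self-transition
# probabilities pm/pf, Gaussian pitch emissions and gender priors.
# All numbers are made-up illustration values.
import numpy as np

states = ["m", "f"]
prior = np.array([0.5, 0.5])                      # P(gender)
trans = np.array([[0.8, 0.2],                     # pm, 1-pm
                  [0.3, 0.7]])                    # 1-pf, pf
means, stds = np.array([120.0, 220.0]), np.array([25.0, 30.0])

def log_emission(f0):
    """log P(f0 | gender) under one Gaussian per state."""
    return -0.5 * ((f0 - means) / stds) ** 2 - np.log(stds * np.sqrt(2 * np.pi))

def viterbi(f0_sequence):
    """Most likely gender sequence for an observed pitch sequence."""
    delta = np.log(prior) + log_emission(f0_sequence[0])
    back = []
    for f0 in f0_sequence[1:]:
        scores = delta[:, None] + np.log(trans)   # score of every transition
        back.append(scores.argmax(axis=0))        # best predecessor per state
        delta = scores.max(axis=0) + log_emission(f0)
    path = [int(delta.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(np.array([115.0, 130.0, 210.0, 225.0, 118.0])))  # ['m', 'm', 'f', 'f', 'm']
```

Dynamic programming makes this search over all possible boy/girl sequences linear in the length of the parade.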

10 Getting the parameters (training of the model)
Given the observed parade (f0 = … Hz), iterate: compute the distributions of the parameters for each state (boys, girls), then find the best alignment of states given those parameters, and repeat. This is a "forced alignment" of the model with the data.
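A compact sketch of this alternation (a hard, Viterbi-style variant of EM rather than full Baum-Welch), with a frame-wise nearest-state assignment standing in for a proper forced alignment:

```python
# Minimal hard-EM style sketch of the training loop: alternate between
# aligning each frame to a state and re-estimating that state's parameters.
import numpy as np

def train(f0, n_iter=10):
    means = np.array([100.0, 250.0])          # rough initial guesses
    stds = np.array([50.0, 50.0])
    for _ in range(n_iter):
        # "find the best alignment of states given the parameters"
        # (here: frame-wise nearest state; a real system would use Viterbi)
        align = np.argmin(np.abs(f0[:, None] - means), axis=1)
        # "compute distributions of parameters for each state"
        for s in range(2):
            if np.any(align == s):
                means[s] = f0[align == s].mean()
                stds[s] = f0[align == s].std() + 1e-3
    return means, stds

rng = np.random.default_rng(0)
f0 = np.concatenate([rng.normal(120, 20, 50), rng.normal(220, 25, 50)])
print(train(f0))   # means converge near the two underlying pitch ranges
```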

11 Machine recognition of speech
The analogy:
- finding boys and girls ↔ speech recognition
- people's parade ↔ speech utterance
- gender groups ↔ speech sounds
- voice pitch ↔ vector of features derived from the signal
- prior probabilities of gender occurrence ↔ language model
- (plus a more complex model architecture)

12 How to find w (efficiently) ?
Form of the model M(wi)? What is the data x?
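For reference, the usual formulation behind this question (stated here as the standard textbook form, not copied from the slide): the recognizer searches for

ŵ = argmax_w P(x | M(w)) · P(w)

i.e. it combines the acoustic match of the data x against the model M(w) of a word sequence w with the prior (language-model) probability of that sequence.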

13 Data x? The speech signal? It describes changes in acoustic pressure; its original purpose is reconstruction of speech; it has a rather high bit rate; additional processing is necessary to alleviate the irrelevant information.

14 Machine for recognition of speech
[Diagram: speech signal → pre-processing → acoustic processing → decoding (search) → best-matching utterance, with acoustic training data and prior knowledge entering the system.]

15 [Figure: a signal in time and its time-frequency representation.]

16 [Figure: time-frequency representation.]

17 Short-term spectrum: take the signal in 10-20 ms windows along time and get the spectral components of each window. [Figure: time-frequency representation with the sound labels /j/ /u/ /ar/ /j/ /o/ /j/ /o/ marked along the time axis.]
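A minimal sketch of this short-term analysis; the 20 ms window and 10 ms hop are illustrative choices consistent with the 10-20 ms range mentioned above:

```python
# Short-term spectrum: slide the signal into 10-20 ms windows and take
# the magnitude spectrum of each one. Window/hop sizes are illustrative.
import numpy as np

def short_term_spectrum(signal, fs, win_ms=20.0, hop_ms=10.0):
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    # rows: frames (time), columns: spectral components (frequency)
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000
t = np.arange(fs) / fs
spectrogram = short_term_spectrum(np.sin(2 * np.pi * 440 * t), fs)
print(spectrogram.shape)   # (number of frames, number of frequency bins)
```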

18 Spectrogram – 2D representation of sound

19 Short-term Fourier analysis
[Figure: short-term log-magnitude spectrum; log gain versus frequency [rad/s], from 0 to π.]

20 Spectral resolution of hearing
The spectral resolution of hearing decreases with frequency (critical bands of hearing, perception of pitch, …). [Figure: critical bandwidth [Hz] as a function of frequency [Hz], over roughly 100 Hz to 10 kHz.]
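One commonly quoted analytic approximation of this curve is the Zwicker-Terhardt formula; the constants below come from that published approximation, not from the slide:

```python
# Zwicker & Terhardt style approximation of the critical bandwidth:
# roughly constant (~100 Hz) at low frequencies, then growing with frequency.
def critical_bandwidth_hz(f_hz):
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69

for f in (100, 500, 1000, 2000, 5000, 10000):
    print(f, round(critical_bandwidth_hz(f)))
```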

21

22 Energies in "critical bands". [Figure: the spectrum integrated into critical-band energies along the frequency axis.]

23 Sensitivity of hearing depends on frequency

24

25 Intensity ≈ signal² [W/m²]; loudness [sones] ≈ intensity^0.33. [Diagram: intensity (power spectrum) → |·|^0.33 → loudness.]

26 Not all spectral details are important
a) Compute the Fourier transform of the logarithmic auditory spectrum and truncate it (mel cepstrum). b) Approximate the auditory spectrum by an autoregressive model (Perceptual Linear Prediction, PLP). [Figure: auditory spectrum (frequency/tonality versus power/loudness) approximated by a 6th-order and a 14th-order AR model.]
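A sketch of the core of option (b): fit an all-pole (autoregressive) model to a power spectrum so that only the smooth envelope survives. This is just the AR-fitting step; a full PLP front end would also include critical-band integration, equal-loudness weighting and cube-root compression.

```python
# Fit an all-pole (AR) model to a power spectrum: inverse-FFT the spectrum
# to get its autocorrelation, solve the normal equations, and evaluate the
# smooth model spectrum gain / |A(e^jw)|^2.
import numpy as np
from scipy.linalg import solve_toeplitz

def ar_smooth(power_spectrum, order):
    # power_spectrum: one-sided power spectrum, e.g. np.abs(np.fft.rfft(x))**2
    r = np.fft.irfft(power_spectrum)[: order + 1]        # autocorrelation
    a = solve_toeplitz(r[:order], -r[1 : order + 1])     # AR coefficients
    a = np.concatenate(([1.0], a))
    gain = r[0] + a[1:] @ r[1 : order + 1]               # prediction error power
    h = np.fft.rfft(a, n=2 * (len(power_spectrum) - 1))
    return gain / np.abs(h) ** 2                          # smoothed spectrum

# toy example: smooth the ragged spectrum of a noise burst
x = np.random.randn(512)
spec = np.abs(np.fft.rfft(x)) ** 2
print(ar_smooth(spec, order=14)[:5])
```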

27

28

29

30 Current state-of-the-art speech recognizers typically use high model order PLP

31 It’s about time (to talk about TIME)

32

33 Masking in time. [Figure: a masker followed by a target signal; the increase in the detection threshold of the target decays with the time gap Δt and disappears beyond about 200 ms; a stronger masker produces a larger increase.] This suggests a ~200 ms buffer (critical interval) in the auditory system.

34 Filter the time trajectories of the spectrum in the critical bands of hearing with a filter whose time constant is greater than 200 ms (a temporal buffer longer than 200 ms).
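A sketch of such filtering for one band trajectory. The band-pass coefficients below (numerator [0.2, 0.1, 0, -0.1, -0.2], a single pole near 0.94) are the ones found in common RASTA implementations and are quoted here as an assumption, not read off the slide:

```python
# RASTA-style band-pass filtering of the time trajectory of the log energy
# in one critical band: suppresses the constant (channel) component and
# very fast changes, keeping modulations of roughly syllabic rates.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_band_trajectory, pole=0.94):
    numer = np.array([0.2, 0.1, 0.0, -0.1, -0.2])   # band-pass numerator
    denom = np.array([1.0, -pole])                   # slow integrator pole
    return lfilter(numer, denom, log_band_trajectory)

# toy trajectory: slow "speech" modulation plus a constant channel offset
t = np.arange(500) / 100.0                           # 100 frames per second
trajectory = 5.0 + np.sin(2 * np.pi * 4 * t)         # offset + 4 Hz modulation
filtered = rasta_filter(trajectory)
print(trajectory.mean(), filtered[100:].mean())      # constant offset removed
```

Because the numerator coefficients sum to zero, any constant (channel-dependent) offset of the log spectrum is removed by the filtering.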

35 [Figure: three analyses of the same utterance over time [s]: the spectrogram (short-term Fourier spectrum), Perceptual Linear Prediction (PLP, 12th-order model), and RASTA-PLP.]

36 [Diagram: spectrogram → filtering of the time trajectory in each band → spectrum from RASTA-PLP.]

37 Data-guided feature extraction
Data preprocessing followed by an artificial neural network trained on large amounts of labeled data. [Figure: a spectrogram (frequency versus time) converted into a posteriogram: posterior probabilities of speech sounds such as /f/ /ay/ /v/ over time.]
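A minimal sketch of the posteriogram computation: a small feed-forward network maps each frame's feature vector to posterior probabilities of phoneme classes. The layer sizes and random weights are placeholders; in a real system the weights come from training on large amounts of labeled speech.

```python
# Posteriogram sketch: map each frame of features to phoneme posteriors
# with a tiny MLP and a softmax output. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_phones = 39, 64, 40
W1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_phones)), np.zeros(n_phones)

def posteriogram(features):
    """features: (n_frames, n_features) -> (n_frames, n_phones) posteriors."""
    hidden = np.tanh(features @ W1 + b1)
    logits = hidden @ W2 + b2                      # "pre-softmax outputs"
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)    # each row sums to one

frames = rng.normal(size=(100, n_features))        # 100 frames of features
post = posteriogram(frames)
print(post.shape, post[0].sum())                   # (100, 40) and ~1.0
```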

38 Signal components inside the critical time-frequency window interact.
Masking in frequency: the increase in the threshold of perception of the target grows with the noise bandwidth only up to the critical bandwidth; what happens outside the critical band does not affect decoding of the sound in the critical band.
Masking in time: a stronger masker increases the threshold of perception of the target, but only within about 200 ms; what happens outside the critical interval does not affect detection of the signal within the critical interval.

39 Emulation of cortical processing (MRASTA)
[Diagram: peripheral processing (critical-band spectral analysis) yields the time trajectory of each of the 14 bands (data1 … dataN); each is passed through 32 2-D projections with variable resolutions, i.e. 32 × 14 = 448 projections in total.]

40 Multi-resolution RASTA (MRASTA)
(Interspeech 05) A bank of 2-D (time-frequency) filters: band-pass in time, high-pass in frequency; RASTA-like (alleviates stationary components); multi-resolution in time. [Figure: the spectro-temporal basis is formed by outer products of temporal filters spanning -500 to 500 ms around t0 with a frequency derivative across 3 critical bands centred on the current band; example filters shown over time [ms] and frequency.]
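A sketch of such a multi-resolution temporal filter bank, built from first and second derivatives of Gaussian windows of several widths; the particular widths are illustrative assumptions, and the cross-band frequency derivative is omitted here:

```python
# MRASTA-style multi-resolution temporal filters: first and second
# derivatives of Gaussian windows of several widths, applied to the time
# trajectory of one critical band. Sigma values are illustrative.
import numpy as np

def gaussian_derivative_filters(sigmas_frames, half_len=50):
    t = np.arange(-half_len, half_len + 1, dtype=float)
    bank = []
    for s in sigmas_frames:
        g = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s**2 * g                       # first derivative of Gaussian
        d2 = (t**2 / s**4 - 1.0 / s**2) * g      # second derivative of Gaussian
        bank += [d1 / np.abs(d1).sum(), d2 / np.abs(d2).sum()]
    return np.array(bank)                         # (2 * len(sigmas), filter length)

filters = gaussian_derivative_filters([1.0, 2.0, 4.0, 8.0])
trajectory = np.random.randn(300)                 # one band, 300 frames
features = np.array([np.convolve(trajectory, f, mode="same") for f in filters])
print(filters.shape, features.shape)              # (8, 101) (8, 300)
```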

41 Spectral dynamics (much) more interesting than spectral shape
[Figure: the old way of getting spectral dynamics: short-term spectral components around a point (t0, f0) followed along time; the older way: the Spectrograph™.]

42 [Figure: two time-frequency plots: the critical-band spectrum from all-pole models of the auditory-like spectrum (PLP) versus the critical-band spectrum from all-pole models of the temporal envelopes of the auditory-like spectrum (FDPLP).]
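A bare-bones sketch of the FDLP idea behind FDPLP: linear prediction applied to the cosine transform of the signal yields an all-pole model of its temporal (Hilbert) envelope, the time-frequency dual of ordinary LPC. A real front end would do this per critical band with additional compression; this is a full-band illustration only.

```python
# Frequency-domain linear prediction (FDLP) sketch: AR modelling of the
# DCT of the signal gives an all-pole approximation of its temporal
# (Hilbert) envelope, the dual of ordinary LPC.
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(signal, order=40):
    y = dct(signal, type=2, norm="ortho")          # "spectrum" to be predicted
    r = np.fft.irfft(np.abs(np.fft.rfft(y)) ** 2)[: order + 1]
    a = solve_toeplitz(r[:order], -r[1 : order + 1])
    a = np.concatenate(([1.0], a))
    gain = r[0] + a[1:] @ r[1 : order + 1]
    h = np.fft.rfft(a, n=2 * len(signal))
    return (gain / np.abs(h) ** 2)[: len(signal)]  # smooth temporal envelope

# toy signal: a tone with a rising-then-falling envelope
t = np.linspace(0, 1, 4000)
x = np.sin(2 * np.pi * 200 * t) * np.exp(-((t - 0.5) ** 2) / 0.02)
env = fdlp_envelope(x)
print(int(np.argmax(env)), len(env) // 2)          # peak index vs. middle of the signal
```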

43 [Tables: phoneme recognition accuracy [%] on TIMIT (clean) and HTIMIT (telephone speech) for PLP-MRASTA versus FDPLP, and digit recognition accuracy [%] on the ICSI Meeting Room Digit Corpus (clean versus reverberated, gain included versus gain excluded) for PLP versus FDPLP. Improvements on real reverberations are similar (IEEE Signal Proc. Letters 08).]

44 FDPLP with static and dynamic compression
[Table: recognition accuracy [%] on the TIMIT, HTIMIT, CTS and NIST RT05 meeting tasks, PLP versus FDPLP.] [Figure: Hilbert envelope; logarithmically compressed FDLP fit; FDLP fit to the Hilbert envelope; FDLP fit compressed by the PEMO model.]

45 TANDEM (Hermansky et al., ICASSP 2000)
Features for a conventional speech recognizer should be normally distributed and uncorrelated, so the pre-softmax outputs of the network (rather than the posteriors of speech sounds) are decorrelated by a principal component projection and fed to the HMM (Gaussian-mixture based) classifier. [Figure: correlation matrix of the features and histogram of one element.]
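A sketch of that post-processing step: decorrelate the pre-softmax network outputs with a principal component projection before handing them to the Gaussian-mixture-based HMM. The random matrix below just stands in for real network outputs.

```python
# TANDEM-style post-processing sketch: principal component projection of
# the network's pre-softmax outputs to obtain decorrelated HMM features.
import numpy as np

def pca_decorrelate(pre_softmax, n_components=None):
    """pre_softmax: (n_frames, n_phones) pre-softmax network outputs."""
    centered = pre_softmax - pre_softmax.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]              # strongest components first
    proj = eigvecs[:, order[:n_components]]
    return centered @ proj                          # decorrelated TANDEM features

rng = np.random.default_rng(1)
pre_softmax = rng.normal(size=(1000, 40)) @ rng.normal(size=(40, 40))
features = pca_decorrelate(pre_softmax, n_components=25)
offdiag = np.cov(features, rowvar=False) - np.diag(np.var(features, axis=0, ddof=1))
print(features.shape, np.max(np.abs(offdiag)) < 1e-6)   # (1000, 25) True
```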

46 Summary: alternatives to short-term-spectrum-based attributes could be beneficial:
- data-driven, phoneme-posterior-based features: extract speech-specific knowledge from large out-of-domain corpora
- larger temporal spans: exploit the coarticulation patterns of individual speech sounds
- models of temporal trajectories: improved modeling of fine temporal details allows for partial alleviation of channel distortions and reverberation effects

47 Coarticulation. [Diagram: human speech production spreads the sound string "u h e l o w r d" out in time through coarticulation; human auditory perception recovers the string.]

48 Hierarchical bottom-up event-based recognition ?
[Diagram, bottom to top: the high-bit-rate signal → pre-processing to emulate known properties of peripheral and cortical auditory processes → equally distributed posterior probabilities of speech sounds → unequally distributed identities of individual speech sounds (phonemes) → low-bit-rate string "… w r d".]

49 One way of going from phoneme posteriors to phonemes
Matched filtering. [Figure: posterior probability trajectories of /n/ and /ay/ processed by matched filters.]
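A minimal sketch of matched filtering on one posterior trajectory; the Gaussian-bump template is an illustrative stand-in for an average trajectory estimated from data:

```python
# Matched-filtering sketch: correlate a phoneme's posterior trajectory with
# a template of its typical shape and report where the response peaks.
import numpy as np

def matched_filter_peaks(posteriors, template, threshold=0.5):
    """posteriors: (n_frames,) posterior trajectory of one phoneme."""
    template = template / np.linalg.norm(template)
    response = np.correlate(posteriors, template, mode="same")
    peaks = [i for i in range(1, len(response) - 1)
             if response[i] > threshold
             and response[i] >= response[i - 1]
             and response[i] >= response[i + 1]]
    return peaks, response

frames = np.arange(200)
posterior_n = np.exp(-0.5 * ((frames - 60) / 5.0) ** 2)      # /n/ active near frame 60
template = np.exp(-0.5 * (np.arange(-15, 16) / 5.0) ** 2)    # average-shaped bump
peaks, _ = matched_filter_peaks(posterior_n, template)
print(peaks)   # a peak detected near frame 60
```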

50 (Some of) the issues. [Diagram: SPEECH SIGNAL (high bit rate) → auditory perception → acoustic "events" ??? → cognitive processes ??? → linguistic code (low bit rate) → message (even lower bit rate).]
Perceptual processes involved in decoding the message in speech: where and how? The higher (cortical) levels are probably most relevant. What are the relevant acoustic "events" for speech?
Cognitive issues: what to "listen for"? What are the roles of the "bottom-up" and "top-down" channels? What is the coding alphabet (phonemes)? What are the category-forming invariants? When to make a decision? Etc.

51 Speculation: improvements in acoustic processing could make domain-independent ASR feasible.

