1
Human Speech Communication
message → linguistic code (~50 b/s) → motor control → speech production → SPEECH SIGNAL (~50 kb/s) → speech perception → cognitive processes → linguistic code (~50 b/s) → message
2
PCM (Pulse Code Modulation)
Transmit the value of each speech sample
–dynamic range of speech is about 50-60 dB → 11 bits/sample
–maximum frequency in telephone speech is 3.4 kHz → sampling frequency 8 kHz
8000 x 11 = 88 kb/s
Simple and universal but not very efficient
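The arithmetic on this slide can be checked with a few lines. This is a minimal sketch, assuming the usual rule of thumb of about 6 dB of dynamic range per bit of a uniform quantizer; the function names are illustrative. That rule gives roughly 10 bits for 60 dB, so the 11 bits quoted on the slide presumably include some margin.

```python
import math

# Rule of thumb (assumption): each bit of a uniform quantizer adds
# 20*log10(2), about 6.02 dB, of dynamic range.
DB_PER_BIT = 20 * math.log10(2)

def bits_for_dynamic_range(dynamic_range_db):
    """Smallest whole number of bits covering the given dynamic range."""
    return math.ceil(dynamic_range_db / DB_PER_BIT)

def pcm_bit_rate(sampling_hz, bits_per_sample):
    """Bit rate of plain PCM in bits per second."""
    return sampling_hz * bits_per_sample

# Telephone speech: 3.4 kHz maximum frequency needs more than 6.8 kHz sampling.
minimum_sampling_hz = 2 * 3400
telephone_rate = pcm_bit_rate(8000, 11)  # values taken from the slide
```

With the slide's figures, `telephone_rate` is 88000 b/s, and 8 kHz sits comfortably above the 6.8 kHz minimum.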
3
Better quantization?
Less quantization noise for weaker signals
[Figure: quantizer input-output characteristic (IN vs. OUT)]
4
Logarithmic PCM (μ-law, A-law)
Finer quantization for each individual small-amplitude sample
–how about small signal samples surrounded by large ones?
–it is the instantaneous signal energy which should determine the step
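To make the idea of logarithmic PCM concrete, here is a minimal sketch of μ-law companding around a uniform quantizer. The value μ = 255 and the 8-bit word length are assumptions (common telephony choices, not stated on the slide), and the helper names are illustrative.

```python
import math

MU = 255.0  # assumed companding constant

def mu_compress(x, mu=MU):
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

def quantize_uniform(v, bits):
    """Uniform quantizer on [-1, 1] with 2**bits levels (approximately)."""
    levels = 2 ** (bits - 1)
    return max(-1.0, min(1.0, round(v * levels) / levels))

def mu_law_codec(x, bits=8):
    """Compress, quantize uniformly, expand: finer steps for small samples."""
    return mu_expand(quantize_uniform(mu_compress(x), bits))

small = 0.01
error_uniform = abs(quantize_uniform(small, 8) - small)
error_mu_law = abs(mu_law_codec(small, 8) - small)
```

For a small sample such as 0.01, the companded quantizer leaves a noticeably smaller error than the uniform one at the same word length, which is the point of the slide. It still treats every sample in isolation, which is what the two questions on the slide are getting at.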
5
Adaptive PCM
Larger quantization step for a large-energy signal
Forward adaptive
–look into the future to see what the signal amplitude will be
–adjust the quantization step
–inherent (algorithmic) delay in processing
–decoder needs to know the quantization step used, so the quantization step needs to be transmitted
Backward adaptive
–adjust the quantization step from the amplitude of past samples
–hope that the future will not be too different from the past
–decoder can look into the past too, so there is no need to transmit the quantization step
–possibly unstable (errors may accumulate), needs periodic resets
Efficiency-robustness compromise
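The backward-adaptive scheme can be sketched as follows. The adaptation rule used here (grow the step when the outermost level is hit, shrink it otherwise) is an assumed, Jayant-style rule chosen for illustration; the slide does not specify one, and the multipliers and limits are hypothetical.

```python
def adapt_step(step, code, levels, grow=1.5, shrink=0.8, lo=1e-4, hi=1.0):
    """Update the step from the PREVIOUS code only (assumed rule)."""
    step *= grow if abs(code) >= levels - 1 else shrink
    return min(hi, max(lo, step))

def encode(samples, bits=3, step0=0.01):
    levels = 2 ** (bits - 1)
    step, codes = step0, []
    for x in samples:
        code = max(-levels, min(levels - 1, round(x / step)))
        codes.append(code)
        step = adapt_step(step, code, levels)  # uses past information only
    return codes

def decode(codes, bits=3, step0=0.01):
    levels = 2 ** (bits - 1)
    step, out = step0, []
    for code in codes:
        out.append(code * step)
        step = adapt_step(step, code, levels)  # same rule, same step track
    return out

codes = encode([1.0] * 10)
reconstructed = decode(codes)
```

The decoder never receives the step size: it re-derives it from the codes it has already seen. That is also why a single corrupted code makes the two step tracks diverge, hence the periodic resets mentioned on the slide.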
6
?
7
Differential coding
For many natural signals, the difference between successive samples quantizes better than the samples themselves
Even better, predict the current sample from the past ones and transmit the error of the prediction
[Figure: waveform over time with the current sample marked]
8
Differential predictive coding
DPCM
–a single predictor reflecting global predictability of speech
–predictor order up to 4-5
–delta modulation: gross quantization of the prediction error into 1 bit (typically requires up-sampling well over the Nyquist rate)
Adaptive DPCM
–new predictor for every new speech block
–predictor needs to be transmitted together with the prediction error
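A minimal DPCM sketch, assuming a fixed first-order predictor with coefficient 0.9 and a uniform quantizer with an unbounded code range (both are illustrative choices; the slide allows predictor orders up to 4-5). The predictor runs on the reconstructed signal so that coder and decoder stay in step.

```python
import math

def dpcm_encode(samples, a=0.9, step=0.05):
    """Quantize the prediction error; predict from the reconstructed past."""
    codes, recon_prev = [], 0.0
    for x in samples:
        prediction = a * recon_prev
        code = round((x - prediction) / step)
        codes.append(code)
        recon_prev = prediction + code * step  # what the decoder will see
    return codes

def dpcm_decode(codes, a=0.9, step=0.05):
    out, recon_prev = [], 0.0
    for code in codes:
        recon_prev = a * recon_prev + code * step
        out.append(recon_prev)
    return out

# Slowly varying test signal: one sine period every 50 samples.
signal = [math.sin(2 * math.pi * k / 50) for k in range(200)]
codes = dpcm_encode(signal)
decoded = dpcm_decode(codes)
direct_codes = [round(x / 0.05) for x in signal]  # plain PCM, same step
```

For this signal the prediction-error codes span a much smaller range than the direct codes at the same step size, so they need fewer bits, while the reconstruction error stays within half a step.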
9
Speech Coders
10
Linear model of speech production
11
A.G. Bell got it almost right
12
linear model of speech source filter speech changes slowly
13
Short-term and long-term prediction
–short-term prediction: resonance of the vocal tract
–long-term prediction: periodicity of voiced speech (vocal cord vibration)
[Figure: waveform over time with the current sample marked]
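Short-term prediction is commonly estimated with the autocorrelation method and the Levinson-Durbin recursion. The sketch below is one such implementation under that assumption (the slides do not name the estimation method); the helper that generates a test signal and the resonance parameters are hypothetical. Long-term (pitch) prediction is not covered here.

```python
import math

def autocorrelation(x, max_lag):
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor a[1..order] with x_hat[n] = sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k  # remaining prediction-error energy
    return a[1:], err

def all_pole_impulse_response(coefs, n):
    """Hypothetical test signal: response of 1 / (1 - sum_k coefs[k] z^-k)."""
    h = []
    for t in range(n):
        value = 1.0 if t == 0 else 0.0
        for k, c in enumerate(coefs, start=1):
            if t - k >= 0:
                value += c * h[t - k]
        h.append(value)
    return h

# One damped resonance (pole radius 0.9, angle 0.3*pi), purely illustrative.
true_coefs = [2 * 0.9 * math.cos(0.3 * math.pi), -0.81]
h = all_pole_impulse_response(true_coefs, 400)
estimated, residual_energy = levinson_durbin(autocorrelation(h, 2), 2)
```

On this synthetic resonance the recursion recovers the two predictor coefficients, which is the sense in which the short-term predictor captures a vocal-tract resonance.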
14
LPC vocoder
The same principle as in H. Dudley’s Vocoder
Used by the US Government (LPC-10): 2.4 kb/s
15
Residual Excited LPC (RELP)
Transmitter
–simplify the prediction error (low-pass filter and down-sample)
Receiver
–re-introduce high frequencies in the simplified residual (nonlinear distortion)
16
Analysis-by-synthesis Identical synthesizer in coder and in decoder –change parameters in coder –use for synthesizing speech –compare synthesized speech with real speech –when “close enough”, send parameters to the receiver
17
Future in speech coding? No need to transmit what we do not hear –study human hearing, especially masking No need to transmit what is predictable –speech production mechanism –speaker characteristics –linguistic code (recognition-synthesis) –thought-to-speech
18
Automatic recognition of speech
Reduce information = decrease entropy
–electric signal (more than 50 kb/s) → linguistic message, phoneme string (below 50 b/s)
–knowledge: prior knowledge (textbook), acquired knowledge (data)
Automatic speech recognition (ASR)
–derive proper response from speech stimulus
Auditory perception
–how do biological systems respond to acoustic stimuli
Knowledge of auditory perception?
19
Principle of stochastic ASR
Using a model of the speech production process, generate all possible acoustic sequences w_i for all legal linguistic messages
Compare all generated sequences with the unknown acoustic input x to find which one is the most similar
1. What is the model M(w_i)?
2. What is the form of the data x?
20
One (simple) model
[Figure: "hello world" and a variant of its sound sequence, labeled "u helowrdlo"]
Two dominant sources of variability in speech
1. people say the same thing at different speeds (temporal variability)
2. different people sound different, and the communication environment differs (feature variability)
“Doubly stochastic” process (Hidden Markov Model)
–speech as a sequence of hidden states: recover the state sequence
1. never know for sure in which state we are
2. never know for sure which data can be generated from a given state
21
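Hidden Markov Model
Given the observed f0 sequence 160, 170, 160, 170, 200, 110, 140, 240, 170, 190 Hz, what is the sequence of male (m) and female (f) groups?
The model: P(sound|gender)
[Figure: two-state model with states m and f, self-transition probabilities p_m and p_f, cross-transition probabilities p_m-f and p_f-m, initial probability p_1m, and emission distributions over f0]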
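The male/female example can be decoded with the Viterbi algorithm. The sketch below uses the f0 values from the slide, but every model parameter (Gaussian f0 means and spreads, initial and transition probabilities) is an assumed, illustrative value; the slide gives only the symbols, not the numbers.

```python
import math

STATES = ("m", "f")
# Assumed parameters, for illustration only.
F0_MEAN = {"m": 120.0, "f": 210.0}
F0_STD = {"m": 30.0, "f": 30.0}
LOG_INIT = {s: math.log(0.5) for s in STATES}
STAY = 0.8
LOG_TRANS = {(a, b): math.log(STAY if a == b else 1.0 - STAY)
             for a in STATES for b in STATES}

def log_emission(f0, state):
    """Log of a Gaussian P(f0 | gender)."""
    z = (f0 - F0_MEAN[state]) / F0_STD[state]
    return -0.5 * z * z - math.log(F0_STD[state] * math.sqrt(2.0 * math.pi))

def viterbi(f0_sequence):
    """Most likely hidden state sequence for the observed f0 values."""
    score = {s: LOG_INIT[s] + log_emission(f0_sequence[0], s) for s in STATES}
    backpointers = []
    for f0 in f0_sequence[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + LOG_TRANS[(p, s)])
            pointer[s] = best
            new_score[s] = score[best] + LOG_TRANS[(best, s)] + log_emission(f0, s)
        backpointers.append(pointer)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]

observed_f0 = [160, 170, 160, 170, 200, 110, 140, 240, 170, 190]
decoded = viterbi(observed_f0)
```

Both sources of uncertainty from the "doubly stochastic" view appear here: the transition table models not knowing which state comes next, and the emission density models not knowing which f0 a given state will produce.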
22
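[Figure: the same scheme with units of speech (phonemes) as the hidden states and data x as the observations, in place of m/f and the f0 values 160, 170, 160, 170, 200, 110, 140, 240, 170, 190]
What should the x be?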
23
Speech signal?
Reflects changes in acoustic pressure
–its original purpose is reconstruction of speech
–does carry relevant information
–always also carries some irrelevant information
–additional processing is necessary to alleviate it
24
histogram speech signal correlations
25
Where Is The Message?
It is in the spectrum!!
[Figure: Isaac Newton, beer, and the vowel sequence /u/ /o/ /a/ / / /iy/]
[Figure: averaged FFT spectra of some vowels (/uw/, /ao/, /ah/, /eh/, /ih/, /iy/) from 3 hours of fluent speech]
26
Radio Rex (1917) “limited commercial success” -John Pierce 1969 Sweet smell of money
27
Linear model of speech production
source → filter → speech
[Figure: log spectrum]
Re-synthesis of speech from its spectral envelope is possible, ergo the spectral envelope carries phonetic information
28
time frequency spectrum time frequency
29
Steam Engine (1769) Internal Combustion Engine (2003) Inertia in engineering
30
Short-term Spectrum
Take 10-20 ms segments of the signal and get their spectral components
[Figure: time-frequency plot with phoneme labels /j/ /u/ /a r/ /j/ /o/ /j/ /o/]
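A short-term spectrum can be sketched by cutting the signal into 10-20 ms frames and taking the power spectrum of each. The Hamming window, the 10 ms hop, and the direct (slow) DFT below are assumptions made to keep the example dependency-free; a real system would use an FFT.

```python
import cmath
import math

def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame):
    """Squared magnitude of the DFT of one windowed frame (bins 0..n/2)."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def short_term_spectrum(signal, fs, frame_ms=20, hop_ms=10):
    """List of per-frame power spectra (a spectrogram, time x frequency)."""
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(round(fs * hop_ms / 1000))
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [power_spectrum(signal[s:s + frame_len]) for s in starts]

fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(800)]  # 0.1 s, 1 kHz
spectrogram = short_term_spectrum(tone, fs)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spectrogram]
```

With 20 ms frames at 8 kHz the bins are 50 Hz apart, so the 1 kHz test tone shows up in bin 20 of every frame.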
31
histogram short-term speech spectral envelope correlations
32
histogram logarithmic short-term speech spectral envelope correlations
33
histogram cosine transform of logarithmic short-term speech spectral envelope (cepstrum) correlations
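The cepstrum named on this slide is the cosine transform of the logarithmic spectrum. A minimal sketch, assuming an unnormalized DCT-II and a small floor to avoid taking the log of zero (both are implementation choices, not taken from the slides):

```python
import math

def dct_ii(values):
    """Unnormalized DCT-II."""
    n = len(values)
    return [sum(values[j] * math.cos(math.pi * k * (j + 0.5) / n) for j in range(n))
            for k in range(n)]

def cepstrum_from_spectrum(power_spectrum, floor=1e-12):
    """Cosine transform of the log spectrum."""
    return dct_ii([math.log(max(p, floor)) for p in power_spectrum])

flat = cepstrum_from_spectrum([2.0] * 8)
envelope = [1.0, 4.0, 9.0, 3.0, 1.5, 0.8, 0.4, 0.2]  # made-up spectral envelope
quiet = cepstrum_from_spectrum(envelope)
loud = cepstrum_from_spectrum([3.0 * p for p in envelope])
```

A flat spectrum puts everything into the zeroth coefficient, and changing the overall gain of a spectrum changes only that coefficient, so the remaining coefficients describe the shape of the envelope.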
34
What Is Wrong With the Short-term Spectrum?
1) Inconsistent (same message, different representation)
[Figure: short-term spectrum → auditory-like modifications → “auditory-like” spectrum, both plotted over frequency]
35
Pitch of the tone (Mel scale) Frequency resolution of human ear decreases with frequency
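One widely used analytic approximation of the Mel scale is m = 2595 log10(1 + f/700). The slide shows only the curve, so treating this particular formula as the one intended is an assumption; the band-edge helper is illustrative.

```python
import math

def hz_to_mel(f_hz):
    """Common analytic approximation of the Mel scale (assumed formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_spaced_edges(f_lo, f_hi, n_bands):
    """Band edges in Hz that are equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_lo), hz_to_mel(f_hi)
    return [mel_to_hz(lo + (hi - lo) * i / n_bands) for i in range(n_bands + 1)]

edges = mel_spaced_edges(0.0, 4000.0, 10)
widths = [b - a for a, b in zip(edges, edges[1:])]
```

Bands that are equally wide in Mel get progressively wider in Hz, which is one way of expressing that frequency resolution decreases with frequency.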
36
FFT “critical-band energy” f t Emulating frequency resolution of human ear with FFT
37
Equal Loudness Curves
39
Perceptual Linear Prediction (PLP) [Hermansky 1990]
Auditory-like modifications of the short-term speech spectrum prior to its approximation by an all-pole autoregressive model
–critical-band spectral resolution
–equal-loudness sensitivity
–intensity-loudness nonlinearity
Today applied in virtually all state-of-the-art experimental ASR systems
42
Task oriented signal processing Optimize the processing on the task ! –in effect, training of signal processing module on the data TASK: get the linguistic information (a string of phonemes)
43
Linear Discriminant Analysis (LDA) Needs labeled data
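As a small illustration of why LDA needs labeled data, here is a two-class, two-dimensional Fisher discriminant. This is a deliberately reduced sketch: the slides apply LDA to spectral vectors with phoneme labels, whereas the points and class labels below are made up.

```python
import math

def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, mu):
    """Within-class scatter entries (sxx, sxy, syy)."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_direction(class0, class1):
    """Unit vector w proportional to Sw^-1 (m1 - m0)."""
    m0, m1 = mean(class0), mean(class1)
    a0, b0, d0 = scatter(class0, m0)
    a1, b1, d1 = scatter(class1, m1)
    a, b, d = a0 + a1, b0 + b1, d0 + d1
    det = a * d - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    wx = (d * dx - b * dy) / det
    wy = (-b * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return [wx / norm, wy / norm]

def project(point, w):
    return point[0] * w[0] + point[1] * w[1]

class0 = [(0, 0), (0, 1), (1, 0), (1, 1)]  # labeled examples (made up)
class1 = [(4, 0), (4, 1), (5, 0), (5, 1)]
w = fisher_direction(class0, class1)
```

The direction is computed from the class means and the within-class scatter, neither of which exists without labels. Here the classes differ only along the first axis, and the discriminant picks out exactly that axis.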
44
/j/ /u/ /a r / /j/ /o/ /j/ /o/ LDA gives basis for projection of spectral space time frequency Spectral Basis from LDA
45
LDA vectors from Fourier Spectrum
Spectral resolution of the LDA-derived spectral basis is higher at low frequencies
Critical bands of human hearing are narrower at lower frequencies
[Figure: LDA basis vectors, labeled 63 %, 16 %, 12 %, 2 %]
46
Sensitivity to Spectral Change (Malayath 1999) Cosine basis LDA-derived bases Critical-band filterbank
47
If the signal could be controlled (e.g. in communication)
–put more signal where there is less noise
–sensory signal optimized for a given communication channel
If the receiver could be controlled
–put more resources (introduce less noise) where there is more signal
–biological system optimized for information extraction from sensory signals
Combination of channel and signal spectrum should be as flat (as random-like) as possible. (Shannon, Communication in the Presence of Noise, 1949)
[Figure: energy of the signal and level of noise in the channel plotted over the resource space, for both cases]
48
What Is Wrong With the Short-term Spectral Envelope?
2) Fragile (easily corrupted by minor disturbances)
–additive band-limited noise: ignore the noisy parts of the spectrum
–linear (high-pass) filtering: remove means from parts of the spectrum
[Figure: spectrum over frequency f for each kind of disturbance]
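Removing means helps against linear filtering because a fixed filter multiplies the spectrum, which adds a constant to each band of the logarithmic spectrum. A minimal sketch, assuming the means are taken per frequency band over time; the numbers are made up.

```python
def remove_band_means(log_spectra):
    """Subtract from each band its mean over time (frames x bands)."""
    n_frames = len(log_spectra)
    n_bands = len(log_spectra[0])
    means = [sum(frame[b] for frame in log_spectra) / n_frames
             for b in range(n_bands)]
    return [[frame[b] - means[b] for b in range(n_bands)]
            for frame in log_spectra]

# Three frames, three bands of log energy (made-up values).
clean = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0], [0.0, 4.0, 2.0]]
# A fixed linear filter adds the same offset to every frame of a band.
channel = [0.5, -1.0, 2.0]
filtered = [[v + c for v, c in zip(frame, channel)] for frame in clean]

normalized_clean = remove_band_means(clean)
normalized_filtered = remove_band_means(filtered)
```

After mean removal the clean and the filtered versions coincide, so the fixed filter no longer affects the features.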
49
Simultaneous Masking
Nonlinear frequency resolution of hearing
–critical bands: constant bandwidth up to ~600 Hz, constant Q above 1 kHz
[Figure: threshold of perception of a tone at f as a function of the bandwidth of band-pass filtered noise centered at f; the threshold stops rising at the critical bandwidth]
50
More Important Outcome of Masking Experiments
What happens outside the critical band does not affect detection of events within the band!!!
Independent processing of parts of the spectrum? (Hermansky, Sharma and Pavel 1996; Bourlard and Dupont 1996)
Replace the spectral vector S(frequency) by a matrix {p(f)} of posterior probabilities of acoustic events
[Figure: per-band posteriors p_f1 ... p_f6 along the frequency axis]
51
What Is Wrong With the Short-term Spectral Envelope?
3) Coarticulation (inertia of the organs of speech production)
[Figure: the sound sequence labeled "u helowrdlo" passing through coarticulation to human auditory perception]
52
Masking in Time
Suggests a ~200 ms buffer in the auditory system
–also seen in perception of loudness, detection of short stimuli, gaps in tones, auditory afterimages, binaural release from masking, ...
–what happens outside this buffer does not affect detection of the signal within the buffer
[Figure: increase in threshold for a signal following a masker, over time from 0 to 200 ms; a stronger masker gives a larger increase]
53
Short-term Features? time ~ 10 ms data x processing time longer time span ? (~250 ms?)
54
Cortical Receptive Fields Average of the first two principal components ( 83% of variance ) along temporal axis from about 180 cortical receptive fields ( from D. Klein 2004, unpublished ) time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron
55
Data for Deriving Posterior Probabilities of Speech Events 1-3 critical bands 250-1000 ms TIME [s] FREQUENCY
56
How to Get Estimates of Temporal Evolution of Spectral Energy?
with M. Athineos, D. Ellis (Columbia Univ), and P. Fousek (CTU Prague)
[Figure: data x over time; 10-20 ms short-term frames contrasted with an all-pole model of a part of the time-frequency plane spanning 200-1000 ms and 1-3 Bark]
57
All-pole Model of Temporal Trajectory of Spectral Energy
Conventional LP: the signal → signal power spectrum → all-pole model of the power spectrum
Spectral domain LP: DCT of the signal → Hilbert envelope of the signal → all-pole model of the Hilbert envelope
58
All-pole Models of Sub-band Energy Contours
signal → discrete cosine transform → prediction on the low-frequency and high-frequency parts
–all-pole model of the low-frequency Hilbert envelope
–all-pole model of the high-frequency Hilbert envelope
59
Critical-band Spectrum From FFT time tonality Critical-band Spectrum From All-pole Models Of Hilbert Envelopes in Critical Bands time tonality
60
Putting It All Together: TRAP-TANDEM
–data-guided features based on frequency-independent processing of relatively long spans of signal
with S. Sharma, P. Jain, S. Sivadas, ICSI Berkeley and TU Brno
[Figure: time-frequency data → processing (trained NN) in each band → class posteriors → processing (trained NN) → some function of phoneme posteriors]