Human Speech Communication
message → linguistic code (~50 b/s) → motor control → speech production → SPEECH SIGNAL (~50 kb/s) → speech perception → cognitive processes → linguistic code (~50 b/s) → message

PCM (Pulse Code Modulation)
Transmit the value of each speech sample
–dynamic range of speech is about 50-60 dB → 11 bits/sample
–maximum frequency in telephone speech is 3.4 kHz → sampling frequency 8 kHz
–8000 x 11 = 88 kb/s
Simple and universal, but not very efficient
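The PCM arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names and the mid-rise quantizer layout are illustrative assumptions, not taken from the slides.

```python
import math


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of plain PCM in bits per second."""
    return sample_rate_hz * bits_per_sample


def quantize_uniform(x: float, bits: int, x_max: float = 1.0) -> float:
    """Map x to the nearest level of a uniform mid-rise quantizer on [-x_max, x_max)."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    index = math.floor(x / step)
    # Clip to the available code range (overload region).
    index = max(-(levels // 2), min(levels // 2 - 1, index))
    return (index + 0.5) * step


print(pcm_bit_rate(8000, 11))  # 88000, i.e. 88 kb/s
```

The step is the same for loud and weak samples, which is what the following slides set out to improve.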

Better quantization?
Less quantization noise for weaker signals
[Figure: quantizer input-output (IN/OUT) characteristic]

Logarithmic PCM (μ-law, A-law)
[Figure: μ-law and A-law compression curves]
Finer quantization for each individual small-amplitude sample
–how about small signal samples surrounded by large ones?
–it is the instantaneous signal energy which should determine the step
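A minimal sketch of logarithmic PCM, assuming the standard continuous μ-law curve with μ = 255 and an 8-bit uniform quantizer in the compressed domain. Deployed telephony codecs use a piecewise-linear approximation of this curve, so this is an illustration of the principle only.

```python
import math

MU = 255.0  # assumed value; the one commonly used for mu-law telephony


def mu_compress(x: float) -> float:
    """Compress x in [-1, 1] with the mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_expand(y: float) -> float:
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_uniform(x: float, bits: int) -> float:
    """Uniform mid-rise quantizer on [-1, 1)."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = max(-(levels // 2), min(levels // 2 - 1, math.floor(x / step)))
    return (index + 0.5) * step


def log_pcm(x: float, bits: int = 8) -> float:
    """Compress, quantize uniformly, expand."""
    return mu_expand(quantize_uniform(mu_compress(x), bits))


small = 0.01
print(abs(quantize_uniform(small, 8) - small))  # error of uniform 8-bit PCM
print(abs(log_pcm(small, 8) - small))           # error of 8-bit log PCM
```

For a small-amplitude sample the logarithmic quantizer gives a much smaller error at the same number of bits, at the cost of a coarser step for large samples.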

Adaptive PCM
larger quantization step for large-energy signal
Forward adaptive
–look into the future to see what the signal amplitude will be
–adjust the quantization step
–inherent (algorithmic) delay in processing
–decoder needs to know the quantization step used, so the quantization step needs to be transmitted
Backward adaptive
–adjust the quantization step from the amplitude of past samples
–hope that the future will not be too different from the past
–decoder can look into the past too, so no need to transmit the quantization step
–possibly unstable (errors may accumulate), needs periodic resets
efficiency - robustness compromise
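A sketch of the backward-adaptive idea, using a Jayant-style multiplier rule as a stand-in (the slides do not name a specific rule). The multipliers `grow` and `shrink`, the step limits, and the 3-bit code size are hypothetical. The point is that the decoder recomputes the step from the codes alone.

```python
import math


def next_step(step, code, half, grow=1.6, shrink=0.8, lo=1e-3, hi=1.0):
    """Update the step from the code just transmitted (past information only)."""
    if code in (half - 1, -half):   # outermost levels: signal too large for the step
        step *= grow
    elif code in (0, -1):           # innermost levels: step larger than needed
        step *= shrink
    return max(lo, min(hi, step))


def encode(samples, bits=3, step0=0.05):
    half = 2 ** (bits - 1)
    step, codes = step0, []
    for x in samples:
        code = max(-half, min(half - 1, math.floor(x / step)))
        codes.append(code)
        step = next_step(step, code, half)
    return codes


def decode(codes, bits=3, step0=0.05):
    half = 2 ** (bits - 1)
    step, out = step0, []
    for code in codes:
        out.append((code + 0.5) * step)
        step = next_step(step, code, half)  # same update as in the encoder
    return out


codes = encode([0.8] * 20)
print(codes)
print(decode(codes)[-1])
```

Because both sides run the same update, a single corrupted code would desynchronize the two step sequences, which is the instability the slide warns about.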

Differential coding
For many natural signals, the difference between successive samples quantizes better than the samples themselves
Even better, predict the current sample from the past ones and transmit the error of the prediction
[Figure: waveform over time with the current sample marked]

Differential predictive coding
DPCM
–a single predictor reflecting global predictability of speech
–predictor order up to 4-5
–delta modulation: coarse quantization of the prediction error into 1 bit (typically requires up-sampling well over the Nyquist rate)
adaptive DPCM
–new predictor for every new speech block
–predictor needs to be transmitted together with the prediction error
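A minimal DPCM sketch with a fixed first-order predictor. The predictor coefficient `a = 0.9` and the step of 0.02 are assumptions chosen for illustration. The quantizer sits inside the prediction loop, so the encoder predicts from the same reconstructed samples the decoder will have.

```python
import math


def dpcm_encode(samples, step=0.02, a=0.9):
    """Quantize the prediction error e[n] = x[n] - a * y[n-1], y = reconstruction."""
    recon, codes = 0.0, []
    for x in samples:
        code = round((x - a * recon) / step)
        codes.append(code)
        recon = a * recon + code * step  # what the decoder will reconstruct
    return codes


def dpcm_decode(codes, step=0.02, a=0.9):
    recon, out = 0.0, []
    for code in codes:
        recon = a * recon + code * step
        out.append(recon)
    return out


# 200 Hz sine sampled at 8 kHz
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(200)]
codes = dpcm_encode(signal)
print(max(abs(c) for c in codes))                # range needed for the prediction error
print(max(abs(round(x / 0.02)) for x in signal)) # range needed for the samples themselves
```

The prediction error spans far fewer quantizer levels than the raw samples at the same step, which is where the bit-rate saving comes from.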

Speech Coders

Linear model of speech production

A.G. Bell got it almost right

Linear model of speech
[Figure: source → filter → speech; annotation: changes slowly]

[Figure: waveform over time with the current sample and the spans used for short-term and long-term prediction]
–short-term prediction: resonance of vocal tract
–long-term prediction: periodicity of voiced speech (vocal cord vibration)
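Both kinds of prediction can be sketched from the autocorrelation of a signal segment. The sketch below assumes the autocorrelation method with the Levinson-Durbin recursion for the short-term predictor, and a plain autocorrelation peak search for the long-term (pitch) lag. Neither choice is stated on the slide, and the lag range is hypothetical.

```python
def autocorr(x, max_lag):
    """r[k] = sum over n of x[n] * x[n-k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return predictor coefficients a[1..order] for x_hat[n] = sum a[k] x[n-k],
    and the final prediction error energy."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def pitch_lag(x, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda k: sum(x[n] * x[n - k] for n in range(k, len(x))))


# Short-term: a one-pole "vocal tract" response 0.9**n is predicted by a[1] = 0.9
resonance = [0.9 ** n for n in range(200)]
coeffs, _ = levinson_durbin(autocorr(resonance, 2), 2)
print(coeffs)

# Long-term: a pulse train with period 80 samples (100 Hz at 8 kHz)
pulses = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
print(pitch_lag(pulses, 20, 160))
```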

LPC vocoder
The same principle as in H. Dudley’s Vocoder
Used by US Government (LPC-10) - 2.4 kb/s

Residual Excited LPC (RELP)
Transmitter:
–simplify the prediction error (low-pass filter and down-sample)
Receiver:
–re-introduce high frequencies in the simplified residual (nonlinear distortion)

Analysis-by-synthesis
Identical synthesizer in coder and in decoder
–change parameters in coder
–use them for synthesizing speech
–compare synthesized speech with real speech
–when “close enough”, send parameters to the receiver
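The loop above can be sketched with a toy synthesizer. Everything here is hypothetical: a one-pole filter driven by a known excitation stands in for the speech synthesizer, squared error stands in for the comparison, and the parameter grid and tolerance are arbitrary.

```python
def synthesize(excitation, a, gain):
    """Toy synthesizer: y[n] = gain * e[n] + a * y[n-1]."""
    out, prev = [], 0.0
    for e in excitation:
        prev = gain * e + a * prev
        out.append(prev)
    return out


def analysis_by_synthesis(target, excitation, candidates, tolerance=1e-9):
    """Try candidate parameters in the coder; stop when the synthetic
    signal is close enough to the target, else return the best found."""
    best, best_err = None, float("inf")
    for a, gain in candidates:
        synth = synthesize(excitation, a, gain)
        err = sum((t - s) ** 2 for t, s in zip(target, synth))
        if err < best_err:
            best, best_err = (a, gain), err
        if err <= tolerance:
            break  # "close enough": these parameters are sent to the receiver
    return best, best_err


excitation = [1.0] + [0.0] * 31
target = synthesize(excitation, 0.5, 2.0)   # pretend this is the real speech
grid = [(a, g) for a in (0.0, 0.25, 0.5, 0.75) for g in (1.0, 2.0)]
print(analysis_by_synthesis(target, excitation, grid))
```

Only the selected parameters are transmitted; the receiver runs the identical `synthesize` to rebuild the signal.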

Future in speech coding?
No need to transmit what we do not hear
–study human hearing, especially masking
No need to transmit what is predictable
–speech production mechanism
–speaker characteristics
–linguistic code (recognition-synthesis)
–thought-to-speech

Automatic recognition of speech
reduce information = decrease entropy
[Figure: electric signal (more than 50 kb/s) → linguistic message, a phoneme string (below 50 b/s), using knowledge: prior knowledge (textbook) and acquired knowledge (data)]
Automatic speech recognition (ASR)
–derive proper response from speech stimulus
Auditory perception
–how do biological systems respond to acoustic stimuli
Knowledge of auditory perception?

Principle of stochastic ASR
Using a model of the speech production process, generate all possible acoustic sequences w_i for all legal linguistic messages
Compare all generated sequences with the unknown acoustic input x to find which one is the most similar
1. What is the model M(w_i)?
2. Form of the data x?

One (simple) model
[Figure: “hello world” drawn as a sequence of speech sound units]
Two dominant sources of variability in speech
1. people say the same thing at different speeds (temporal variability)
2. different people sound different, the communication environment differs (feature variability)
“Doubly stochastic” process (Hidden Markov Model)
–speech as a sequence of hidden states: recover the state sequence
1. never know for sure in which state we are
2. never know for sure which data can be generated from a given state

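Hidden Markov Model
Example: sequence of male and female groups?
f0 = 160 Hz, 170 Hz, 160 Hz, 170 Hz, 200 Hz, 110 Hz, 140 Hz, 240 Hz, 170 Hz, 190 Hz
[Figure: candidate m/f label sequences for these values; two-state model with states m and f, self-transition probabilities p_m and p_f, cross-transition probabilities p_m-f and p_f-m, initial probability p_1m]
The model: P(sound|gender), here as a function of f0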
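The male/female example can be decoded with the Viterbi algorithm. In the sketch below every number in the model is a hypothetical placeholder (Gaussian f0 distributions per gender, symmetric transition probabilities); only the f0 sequence comes from the slide.

```python
import math

STATES = ("m", "f")
MEAN = {"m": 120.0, "f": 210.0}   # assumed mean f0 in Hz
STD = {"m": 30.0, "f": 35.0}      # assumed standard deviations
START = {"m": 0.5, "f": 0.5}
TRANS = {"m": {"m": 0.7, "f": 0.3}, "f": {"m": 0.3, "f": 0.7}}


def log_gauss(x, mean, std):
    """Log of the Gaussian density, playing the role of log P(sound|gender)."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def viterbi(f0_values):
    """Most likely hidden state sequence for the observed f0 values."""
    score = {s: math.log(START[s]) + log_gauss(f0_values[0], MEAN[s], STD[s])
             for s in STATES}
    back = []
    for x in f0_values[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + log_gauss(x, MEAN[s], STD[s]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi([160, 170, 160, 170, 200, 110, 140, 240, 170, 190]))
```

The result depends entirely on the assumed parameters. Values such as 160-170 Hz are plausible for either state, which is the "never know for sure" point of the doubly stochastic model.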

[Figure: state sequence m f m f m over the data 160 170 160 170 200 110 140 240 170 190; in speech the states are units of speech (phonemes) and the data are x]
What should x be?

Speech signal?
Always also carries some irrelevant information
–additional processing is necessary to alleviate it
Reflects changes in acoustic pressure
–its original purpose is reconstruction of speech
–does carry relevant information

[Figure: speech signal: histogram and correlations]

Where Is The Message?
Isaac Newton: it is in the spectrum!!
[Figure: vowel labels /u/, /o/, /a/, /iy/ and the word “beer”]
[Figure: averaged FFT spectra of some vowels (/uw/, /ao/, /ah/, /eh/, /ih/, /iy/) from 3 hours of fluent speech]

Radio Rex (1917) “limited commercial success” -John Pierce 1969 Sweet smell of money

Linear model of speech production
[Figure: source → filter → speech; log spectrum]
Re-synthesis of speech from its spectral envelope is possible, ergo the spectral envelope carries phonetic information

[Figure: spectrum shown over time and frequency]

Steam Engine (1769) Internal Combustion Engine (2003) Inertia in engineering

Short-term Spectrum
[Figure: time-frequency display of the sequence /j/ /u/ /a r/ /j/ /o/ /j/ /o/; spectral components are obtained from 10-20 ms segments]

[Figure: short-term speech spectral envelope: histogram and correlations]

[Figure: logarithmic short-term speech spectral envelope: histogram and correlations]

[Figure: cosine transform of logarithmic short-term speech spectral envelope (cepstrum): histogram and correlations]
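The last of the three representations, the cepstrum, is the cosine transform of the log spectral envelope. A minimal sketch, assuming an unnormalized DCT-II and a spectral envelope already given as a list of positive power values (both assumptions, since the slides do not specify the transform variant):

```python
import math


def dct2(x):
    """Unnormalized DCT-II: c[k] = sum_n x[n] * cos(pi * k * (n + 0.5) / N)."""
    n = len(x)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(x))
            for k in range(n)]


def cepstrum(power_envelope, n_coeffs):
    """Cosine transform of the logarithmic spectral envelope, truncated."""
    log_env = [math.log(p) for p in power_envelope]
    return dct2(log_env)[:n_coeffs]


# A flat envelope has all its (log) energy in the zeroth coefficient.
print(cepstrum([math.e] * 8, 4))
```

Truncating to the first few coefficients keeps the smooth shape of the envelope and discards fine detail.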

What Is Wrong With the Short-term Spectrum?
1) Inconsistent (same message, different representation)
[Figure: short-term spectrum over frequency → auditory-like modifications → “auditory-like” spectrum]

Pitch of the tone (Mel scale) Frequency resolution of human ear decreases with frequency
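One widely used analytic approximation of the mel scale is mel = 2595 log10(1 + f/700). The slide shows the scale as a curve and does not give a formula, so this particular expression is an assumption.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Common analytic approximation of the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# The same 100 Hz step covers fewer mels at high frequencies than at low ones.
print(hz_to_mel(200) - hz_to_mel(100))
print(hz_to_mel(4100) - hz_to_mel(4000))
```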

Emulating frequency resolution of human ear with FFT
[Figure: FFT → “critical-band energy”, shown over frequency f and time t]
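A sketch of turning an FFT power spectrum into critical-band energies. It assumes the Bark warping 6·asinh(f/600) and simple rectangular, non-overlapping 1-Bark bands; practical front ends typically use overlapping, shaped filters, so this is a simplification.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Bark warping 6 * asinh(f / 600)."""
    return 6.0 * math.asinh(f_hz / 600.0)


def critical_band_energies(power_spectrum, sample_rate):
    """Sum FFT power bins (0..Nyquist) into 1-Bark-wide bands."""
    nyquist = sample_rate / 2.0
    n_bins = len(power_spectrum)
    n_bands = int(hz_to_bark(nyquist)) + 1
    energies = [0.0] * n_bands
    for i, power in enumerate(power_spectrum):
        f = i * nyquist / (n_bins - 1)
        band = min(int(hz_to_bark(f)), n_bands - 1)
        energies[band] += power
    return energies


# A flat spectrum of 129 bins (256-point FFT at 8 kHz): higher bands collect more bins.
print(critical_band_energies([1.0] * 129, 8000))
```

With a flat input, the growing band energies show directly that the bands widen with frequency.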

Equal Loudness Curves

Perceptual Linear Prediction (PLP) [Hermansky 1990]
Auditory-like modifications of the short-term speech spectrum prior to its approximation by an all-pole autoregressive model
–critical-band spectral resolution
–equal-loudness sensitivity
–intensity-loudness nonlinearity
Today applied in virtually all state-of-the-art experimental ASR systems

Task-oriented signal processing
Optimize the processing on the task!
–in effect, training of the signal processing module on the data
TASK: get the linguistic information (a string of phonemes)

Linear Discriminant Analysis (LDA) Needs labeled data
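A sketch of LDA in its simplest form: the two-class Fisher direction in two dimensions, w proportional to Sw^-1 (m_a - m_b). The toy data are invented, and the LDA applied to spectral vectors is multi-class and high-dimensional; only the principle is the same. The class labels are what makes the data "labeled".

```python
import math


def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fisher_direction(class_a, class_b):
    """Unit vector w proportional to Sw^-1 (mean_a - mean_b) for 2-D labeled data."""
    ma, mb = mean2(class_a), mean2(class_b)
    sxx = sxy = syy = 0.0
    for points, m in ((class_a, ma), (class_b, mb)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy          # within-class scatter must be invertible
    dmx, dmy = ma[0] - mb[0], ma[1] - mb[1]
    wx = (syy * dmx - sxy * dmy) / det
    wy = (-sxy * dmx + sxx * dmy) / det
    norm = math.hypot(wx, wy)
    return wx / norm, wy / norm


# Classes differ along x only; y has large variance but carries no class information.
class_a = [(0, -5), (0, 5), (1, -5), (1, 5)]
class_b = [(3, -5), (3, 5), (4, -5), (4, 5)]
print(fisher_direction(class_a, class_b))
```

The direction ignores the high-variance but uninformative axis, which is the sense in which the projection is trained on the task rather than on signal energy.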

Spectral Basis from LDA
[Figure: time-frequency display labeled with phonemes /j/ /u/ /a r/ /j/ /o/ /j/ /o/]
LDA gives a basis for projection of the spectral space

LDA vectors from Fourier Spectrum
[Figure: first four LDA vectors, labeled 63 %, 16 %, 12 %, 2 %]
Spectral resolution of the LDA-derived spectral basis is higher at low frequencies
Critical bands of human hearing are narrower at lower frequencies

Sensitivity to Spectral Change (Malayath 1999)
[Figure: curves for cosine basis, LDA-derived bases, critical-band filterbank]

If the signal could be controlled (e.g. in communication)
–put more signal where there is less noise
–sensory signal optimized for a given communication channel
If the receiver could be controlled
–put more resources (introduce less noise) where there is more signal
–biological system optimized for information extraction from sensory signals
Combination of channel and signal spectrum should be as flat (as random-like) as possible.
–Shannon, Communication in the Presence of Noise (1949)
[Figure: energy of the signal and level of noise in the channel, plotted over the resource space, for both cases]
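The "put more signal where there is less noise" rule is usually called water-filling: signal power is poured over the noise profile until signal plus noise is flat on every channel that gets any power. A sketch with invented noise levels and power budget:

```python
def water_fill(noise, total_power):
    """Allocate total_power so that noise + signal is flat on all used channels."""
    if not noise:
        return []
    order = sorted(range(len(noise)), key=lambda i: noise[i])
    for n_used in range(len(noise), 0, -1):
        used = order[:n_used]
        level = (total_power + sum(noise[i] for i in used)) / n_used
        if level >= noise[used[-1]]:
            break  # the water level covers every channel in use
    allocation = [0.0] * len(noise)
    for i in used:
        allocation[i] = level - noise[i]
    return allocation


noise = [1.0, 2.0, 5.0]
signal = water_fill(noise, 3.0)
print(signal)                                  # [2.0, 1.0, 0.0]
print([n + s for n, s in zip(noise, signal)])  # [3.0, 3.0, 5.0]
```

The noisiest channel is above the water level and receives no signal at all.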

What Is Wrong With the Short-term Spectral Envelope?
2) Fragile (easily corrupted by minor disturbances)
–additive band-limited noise: ignore the noisy parts of the spectrum
–linear (high-pass) filtering: remove means from parts of the spectrum
[Figure: spectrum over frequency f for each of the two disturbances]
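Why removing means helps against linear filtering: a fixed filter multiplies each band's energy by a constant, which becomes an additive constant in the log spectrum, and subtracting the temporal mean of each band cancels it. A sketch with invented numbers (frames are lists of log band energies):

```python
def remove_band_means(log_spectra):
    """Subtract from each band its mean over time."""
    n_frames = len(log_spectra)
    n_bands = len(log_spectra[0])
    means = [sum(frame[b] for frame in log_spectra) / n_frames for b in range(n_bands)]
    return [[frame[b] - means[b] for b in range(n_bands)] for frame in log_spectra]


clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]   # 3 frames x 2 bands
filter_log_gain = [0.5, -1.0]                  # fixed per-band gain of a linear filter
filtered = [[v + g for v, g in zip(frame, filter_log_gain)] for frame in clean]

print(remove_band_means(clean))
print(remove_band_means(filtered))  # same values up to rounding: the filter is gone
```

This only holds for a filter that stays fixed over the span used to estimate the mean.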

Simultaneous Masking
[Figure: tone at f masked by band-pass filtered noise centered at f; threshold of perception of the tone plotted against noise bandwidth, with the critical bandwidth marked]
Nonlinear frequency resolution of hearing
–critical bands: constant bandwidth up to ~600 Hz, constant Q above 1 kHz

More Important Outcome of Masking Experiments
What happens outside the critical band does not affect detection of events within the band!!!
Independent processing of parts of the spectrum?
(Hermansky, Sharma and Pavel 1996; Bourlard and Dupont 1996)
[Figure: spectrum S(frequency) split into bands, each yielding its own posterior p_f1 ... p_f6, collected as {p(f)}]
Replace the spectral vector by a matrix of posterior probabilities of acoustic events

What Is Wrong With the Short-term Spectral Envelope?
3) Coarticulation (inertia of organs of speech production)
[Figure: “hello world” sound units overlapping through coarticulation, then passed to human auditory perception]

Masking in Time
suggests ~200 ms buffer in the auditory system
–also seen in perception of loudness, detection of short stimuli, gaps in tones, auditory afterimages, binaural release from masking, ...
–what happens outside this buffer does not affect detection of signal within the buffer
[Figure: signal following a masker in time; increase in threshold plotted from 0 to 200 ms, larger for a stronger masker]

Short-term Features?
[Figure: data x taken from ~10 ms of signal, then processing; alternative: a longer time span? (~250 ms?)]

Cortical Receptive Fields
time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron
[Figure: average of the first two principal components (83% of variance) along the temporal axis from about 180 cortical receptive fields (from D. Klein 2004, unpublished)]

Data for Deriving Posterior Probabilities of Speech Events
[Figure: time-frequency plane (axes: TIME [s], FREQUENCY) with a patch spanning 1-3 critical bands and 250-1000 ms]

How to Get Estimates of Temporal Evolution of Spectral Energy?
- with M. Athineos, D. Ellis (Columbia Univ), and P. Fousek (CTU Prague)
[Figure: data x from 10-20 ms frames over time, compared with a patch of 200-1000 ms by 1-3 Bark]
all-pole model of part of the time-frequency plane

All-pole Model of Temporal Trajectory of Spectral Energy
–conventional LP: the signal → signal power spectrum → all-pole model of the power spectrum
–spectral domain LP: DCT of the signal → Hilbert envelope of the signal → all-pole model of the Hilbert envelope

All-pole Models of Sub-band Energy Contours
signal → discrete cosine transform → prediction on the low-frequency and high-frequency parts separately
–low frequency: all-pole model of low-frequency Hilbert envelope
–high frequency: all-pole model of high-frequency Hilbert envelope

Critical-band Spectrum From FFT
[Figure: axes time, tonality]
Critical-band Spectrum From All-pole Models Of Hilbert Envelopes in Critical Bands
[Figure: axes time, tonality]

Putting It All Together
TRAP-TANDEM
–data-guided features based on frequency-independent processing of relatively long spans of signal
with S. Sharma, P. Jain, S. Sivadas, ICSI Berkeley and TU Brno
[Figure: time-frequency plane; data from each frequency band → processing (trained NN) → class posteriors; all class posteriors → processing (trained NN) → some function of phoneme posteriors]
