Speech & Audio Coding TSBK01 Image Coding and Data Compression Lecture 11, 2003 Jörgen Ahlberg
Outline Part I - Speech –Speech –History of speech synthesis & coding –Speech coding methods Part II – Audio –Psychoacoustic models –MPEG-4 Audio
Speech Production The human’s vocal apparatus consists of: – lungs – trachea (wind pipe) – larynx contains 2 folds of skin called vocal cords which blow apart and flap together as air is forced through – oral tract – nasal tract
Elements of the speech signal: spectral resonance (formants, moving) periodic excitation (voicing, pitched) + pitch contour noise excitation (fricatives, unvoiced, no pitch) transients (stop-release bursts) amplitude modulation (nasals, approximants) timing The Speech Signal
Vowels - characterised by formants; generally voiced; Tongue & lips - effect of rounding. Examples of vowels: a, e, i, o, u, a, ah, oh. Vibration of vocal cords: male 50 - 250Hz, female up to 500Hz. Vowels have in average much longer duration than consonants. Most of the acoustic energy of a speech signal is carried by vowels. F1-F2 chartFormant positions The Speech Signal
1939 - Channel vocoder - first analysis-by-synthesis system developed by Homer Dudley of AT&T labs - VODER 1926 - PCM - first conceived by Paul M. Rainey and independently by Alex Reeves (AT&T Paris) in 1937. Deployed in US PSTN in 1962 VODER – the architecture History of Speech Coding
1939 - Channel vocoder - first analysis - by - synthesis system developed by Homer Dudley of AT&T labs - VODER 1926 - PCM - first conceived by Paul M. Rainey and independently by Alex Reeves (AT&T Paris) in 1937. Deployed in US PSTN in 1962 History of Speech Coding
History of Speech - Coding 1939 - Channel vocoder - first analysis - by - synthesis system Homer Dudley of AT&T labs - VODER 1926 - PCM - first conceived by Paul M. Rainey and independently by Alex Reeves (AT&T Paris) in 1937. Deployed in US PSTN in 1962 1957 - -law encoding proposed (standardised for telephone network in 1972 (G.711)) 1952 - delta modulation proposed, differential PCM invented 1974 - ADPCM developed 1984 - CELP vocoder proposed (majority of coding standards for speech signal today use a variation on CELP)
Signal from a source is filtered by a time-varying filter with resonant properties similar to that of the vocal tract. The gain controls A v and A N determine the intensity of voiced and unvoiced excitation. The frequency of higher formant are attenuated by -12 dB/octave (due to the nature of our speech organs). This is an over simplified model for speech production. However, it is very often adequate for understanding the basic principles. Source-filter Model of Speech Production
Speech Coding Strategies 1. PCM Invented 1926, deployed 1962. The speech signal is sampled at 8 kHz. Uniform quantization requires >10 bits/sample. Non-uniform quantization (G.711, 1972) Quantizing y to 8 bits -> 64 kbit/s.
Speech Coding Strategies 2. Adaptive DPCM Example: G.726 (1974) Adaptive predictor based on six previous differences. Gain-adaptive quantizer with 15 levels ) 32 kbit/s.
Speech Coding Strategies 3. Model-based Speech Coding Advanced speech coders are based on models of how speech is produced: Excitation source Vocal tract
An Excitation Source Noise generator Pulse generator Pitch
Vocal Tract Filter 1: A Fixed Filter Bank BP g1g1 g2g2 gngn
Linear Predictive Coding (LPC) The controllable filter is modelled as y n = a i y n-i + G n where n is the input signal and y n is the output. We need to estimate the vocal tract parameters (a i and G) and the exciatation parameters (pitch, v/uv). Typically the source signal is divided in short segments and the parameters are estimated for each segment. Example: The speech signal is sampled at 8 kHz and divided in segments of 180 samples (22.5 ms/segment).
Typical Scheme of an LPC Coder Noise generator Pulse generator Pitch Vocal tract filter v/uvGain Filter coeffs
Estimating the Parameters v/uv estimation –Based on energy and frequency spectrum. Pitch-period estimation –Look for periodicity, either via the a.c.f our some other measure, for example that gives you a minimum value when p equals the pitch period. –Typical pitch-periods: 20 - 160 samples.
Estimating the Parameters Vocal tract filter estimation –Find the filter coefficients that minimize the error 2 = ( y n - a i y n-i + G n ) 2 –Compare to the computation of optimal predictors (Lecture 7).
Estimating the Parameters Assuming a stationary signal: where R and p contain acf values. This is called the autocorrelation method.
Estimating the Parameters Alternatively, in case of a non-stationary signal: where This is called the autocovariance method.
Example Coding of parameters using LPC10 (1984): v/uv1 bit Pitch6 bits Voiced filter46 bits Unvoiced filter46 bits Synchronization1 bit Sum: 54 bits ) 2.4 kbit/s
The Vocal Tract Filter Different representations: –LPC parameters –PARCOR (Partial Correlation Coefficients) –LSF (Line Spectrum Frequencies)
LPC analysis ) V(z) Define perceptual weighting filter. This permits more noise at formant frequencies where it will be masked by the speech Synthesise speech using each codebook entry in turn as the input to V(z) Calculate optimum gain to minimise perceptually weighted error energy in speech frame Select codebook entry that gives lowest error Decoding: Receive LPC parameters and codebook index Re-synthesise speech using V(z) and codebook entry Encoding: Transmit LPC parameters and codebook index Performance: 16kbit/s: MOS=4.2, Delay=1.5 ms, 19 MIPS 8 kbit/s: MOS=4.1, Delay=35 ms, 25 MIPS 2.4kbit/s: MOS=3.3, Delay=45 ms, 20 MIPS Code Excited Linear Prediction Coding (CELP)
Examples G.728 –V(z) is chosen as a large FIR-filter (M ¼ 50). –The gain and FIR-parametrers are estimated recursively from previously received samples. –The code book contains 127 sequences. GSM –The code book contains regular pulse trains with variabel frequency and amplitudes. MELP –Mixed excitation linear prediction –The code book is combined with a noise generator.
Other Variations SELP – Self Excited Linear Prediction MPLP – Multi-Pulse Excited Linear Prediction MBE – Multi-Band Excitation Coding
MOS (Mean Opinion Score): result of averaging opinions scores for a set of between 20 – 60 untrained subjects. They rate the quality 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent). MOS of 4 or higher defines good or tool quality (network quality) - reconstructed signal generally indistinguishable from the original. MOS between 3.5 – 4.0 defines communication quality – telephone communications MOS between 2.5 – 3.5 implies synthetic quality In digital communications speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic. Broadcast wideband speech – high quality ”commentary” speech – generally achieved at rates above 64 kbits/s. Subjective Assessment
DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs (e.g. meatl/heat) DAM (Diagnostic Acceptability Measure) - trained listeners judge various factors e.g. muffledness, buzziness, intelligibility Quality versus data rate (8kHz sampling rate) Subjective Assessment