1
EE2F1 Speech & Audio Technology, Sept. 26, 2002, SLIDE 1
THE UNIVERSITY OF BIRMINGHAM, ELECTRONIC, ELECTRICAL & COMPUTER ENGINEERING
Digital Systems & Vision Processing
EE2F1 Multimedia (1): Speech & Audio Technology
Lecture 9: Speech Coding
Martin Russell
Electronic, Electrical & Computer Engineering, School of Engineering, The University of Birmingham
2
What is speech coding?
Digitisation of speech for transmission or storage
Aim to minimise bits per second (bps) while preserving speech quality:
– intelligibility and naturalness
Main kinds of speech coding scheme:
– waveform coder
– vocoder
3
Approaches
Waveform coding
– Works for all audio signals
– Generic methods for bit reduction
– Exploits properties of human hearing
Vocoders
– Optimised for speech coding
– Assume that the signal to be encoded is speech
4
Waveform coders
PCM (Pulse Code Modulation)
DPCM (Differential PCM)
ADPCM (Adaptive Differential PCM)
Delta modulation (1-bit DPCM; 1-bit ADPCM when the step size adapts)
5
Pulse Code Modulation (PCM)
How many quantization points?
How many samples per second (sample rate)?
[Figure: sampled waveform, labelled with sample points, quantization points and quantization error]
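The two questions on this slide (how many quantization points, how many samples per second) can be made concrete with a minimal sketch of uniform PCM. The function names, the [-1.0, 1.0) input range and the mid-point reconstruction are assumptions made for illustration, not details taken from the lecture.

```python
def pcm_quantize(sample, bits):
    """Map a sample in [-1.0, 1.0) to an integer code with 2**bits levels (assumed range)."""
    levels = 2 ** bits
    step = 2.0 / levels                      # width of one quantisation interval
    code = int((sample + 1.0) / step)        # index of the interval containing the sample
    return max(0, min(levels - 1, code))     # clamp inputs that fall outside the range


def pcm_reconstruct(code, bits):
    """Return the mid-point of the interval identified by `code`."""
    step = 2.0 / (2 ** bits)
    return -1.0 + (code + 0.5) * step


# Quantisation error is the gap between a sample and its reconstruction.
sample = 0.3
code = pcm_quantize(sample, bits=8)
error = sample - pcm_reconstruct(code, bits=8)
```

With mid-point reconstruction the error for an in-range sample is at most half a step, so each extra bit halves the worst-case error.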
6
Differential PCM
Encode the differences between successive sample values, rather than the values themselves
[Figure: sampled waveform, labelled with sample points, quantization points and quantization error]
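A minimal sketch of the differential idea follows. It assumes a fixed step size and the simplest possible predictor (the previous reconstructed value); the function names are hypothetical. The encoder tracks the decoder's running estimate so that quantisation errors do not accumulate.

```python
def dpcm_encode(samples, step):
    """Quantise the difference between each sample and the decoder's running estimate."""
    codes, estimate = [], 0.0
    for s in samples:
        code = round((s - estimate) / step)   # quantised difference, in units of `step`
        codes.append(code)
        estimate += code * step               # mirror what the decoder will reconstruct
    return codes


def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the quantised differences."""
    out, estimate = [], 0.0
    for code in codes:
        estimate += code * step
        out.append(estimate)
    return out


samples = [0.0, 0.1, 0.25, 0.3, 0.2]
codes = dpcm_encode(samples, step=0.05)
decoded = dpcm_decode(codes, step=0.05)
```

Because successive speech samples are usually close together, the codes stay small and can be stored in fewer bits than the samples themselves.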
7
Adaptive DPCM
Use a small number of bits to encode the differences, as in DPCM
Adjust the quantisation step size to accommodate large changes in the signal
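One way to picture step-size adaptation is sketched below. The adaptation rule (double the step when the code saturates, halve it when the code is small) and all names and default values are assumptions chosen for illustration; this is not the rule used by any particular ADPCM standard. The key property is that the decoder can repeat the adaptation from the codes alone, so no side information is sent.

```python
def _adapt(step, code, max_code):
    """Illustrative rule: grow the step on a saturated code, shrink it on a small one."""
    if abs(code) == max_code:
        return step * 2.0
    if abs(code) <= 1:
        return max(step * 0.5, 1e-4)          # assumed lower bound on the step size
    return step


def adpcm_encode(samples, bits=3, step=0.01):
    max_code = 2 ** (bits - 1) - 1            # symmetric code range, e.g. -3..3 for 3 bits
    codes, estimate = [], 0.0
    for s in samples:
        code = round((s - estimate) / step)
        code = max(-max_code, min(max_code, code))
        codes.append(code)
        estimate += code * step
        step = _adapt(step, code, max_code)
    return codes


def adpcm_decode(codes, bits=3, step=0.01):
    max_code = 2 ** (bits - 1) - 1
    out, estimate = [], 0.0
    for code in codes:
        estimate += code * step
        out.append(estimate)
        step = _adapt(step, code, max_code)   # same rule as the encoder, driven by the codes
    return out


# A sudden jump from 0.0 to 1.0: the step grows until the estimate catches up.
samples = [0.0] * 5 + [1.0] * 30
codes = adpcm_encode(samples)
decoded = adpcm_decode(codes)
```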
8
Delta Modulation
1-bit ADPCM
A run of all 1s or all 0s indicates the need to change the step size
‘Slope overload’ is indicated by excessive use of 1s or 0s
[Figure: delta-modulated waveform with example output bits 1 0 11 0000]
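The sketch below shows delta modulation with a simple adaptive step. The rule used here (three identical bits in a row doubles the step, otherwise it decays back towards a base value) and the function names are assumptions for illustration; the slide only states that runs of identical bits signal the need to change the step size.

```python
def _next_step(step, history, base):
    """Illustrative rule: three identical bits in a row suggest slope overload."""
    if len(history) >= 3 and history[-1] == history[-2] == history[-3]:
        return step * 2.0
    return max(base, step / 2.0)


def dm_encode(samples, base=0.05):
    """Emit 1 if the sample is at or above the running estimate, else 0."""
    bits, estimate, step = [], 0.0, base
    for s in samples:
        bit = 1 if s >= estimate else 0
        bits.append(bit)
        estimate += step if bit else -step
        step = _next_step(step, bits, base)
    return bits


def dm_decode(bits, base=0.05):
    """Rebuild the staircase estimate, repeating the encoder's step adaptation."""
    out, estimate, step = [], 0.0, base
    for i, bit in enumerate(bits):
        estimate += step if bit else -step
        out.append(estimate)
        step = _next_step(step, bits[: i + 1], base)
    return out


flat_bits = dm_encode([0.0] * 6)                        # a constant signal gives alternating bits
ramp_bits = dm_encode([0.1 * n for n in range(20)])     # a steep ramp starts with a run of 1s
ramp_decoded = dm_decode(ramp_bits)
```

On the flat input the output alternates between 1 and 0, while on the ramp the initial run of 1s is exactly the slope-overload pattern that triggers a larger step.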
9
Waveform coding summarised
PCM, with 8 bits per sample (amplitude compression) and 8 kHz sampling rate, gives a bit rate of 64 kbps
DPCM (a.k.a. Delta PCM): the difference between samples needs fewer bits for the same accuracy
Adaptive DPCM: the scaling of the bits is varied, depending on the dynamic range
Delta modulation = 1-bit DPCM
– can adapt step size to avoid slope overload
– gives reasonable intelligibility at just 16 kbps
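The bit rates quoted above follow from bits per sample times samples per second. The sketch below checks that arithmetic and also illustrates one common form of amplitude compression, mu-law companding with mu = 255. Treating "amplitude compression" as mu-law is an assumption, as the slide does not name the law, and the function names are hypothetical. The 16 kbps delta-modulation figure is read here as 1 bit per sample at a 16 kHz sample rate.

```python
import math


def bit_rate(bits_per_sample, sample_rate_hz):
    """Bits per second for a fixed-rate waveform coder."""
    return bits_per_sample * sample_rate_hz


def mu_law_compress(x, mu=255.0):
    """Compress x in [-1, 1] so that small amplitudes get finer resolution (assumed mu-law)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


pcm_rate = bit_rate(8, 8000)      # 64000 bps, the 64 kbps PCM figure
dm_rate = bit_rate(1, 16000)      # 16000 bps, the 16 kbps delta-modulation figure
```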
10
Vocoders
Coders designed specifically for speech
Sometimes called analysis-synthesis coders
Exploit source-filter model of speech
11
Vocoders
Encoding
– Estimate and encode source
– Estimate and encode vocal tract filter
– Store as feature vector
Transmission
– Transmit at low data rate (~50-100 vectors per second)
– Can do this because of relatively slow movement of vocal tract
Decoding
– Recover source information
– Recover vocal tract filter information
– Convert into synthesiser control parameters
– Synthesise speech
12
Example: Channel Vocoder
19 band-pass filters, spanning 0-4 kHz
Centre frequencies arranged non-linearly on the frequency axis
Bandwidths increase with frequency, like the ear’s critical bands
Energies from filter outputs averaged over 20 ms
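As a rough sketch of the analysis step, the code below measures the energy in a few frequency bands of one 20 ms frame. It approximates the band-pass filters by summing DFT power within each band, which is a substitute technique rather than the filter bank described on the slide, and the three band edges are hypothetical values chosen only to keep the example short (the slide specifies 19 bands but not their edges).

```python
import cmath
import math


def band_energies(frame, sample_rate, edges_hz):
    """Sum DFT power in each band [edges_hz[i], edges_hz[i+1])."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):               # bins from 0 Hz up to sample_rate / 2
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    energies = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        energies.append(sum(p for k, p in enumerate(power) if lo <= k * sample_rate / n < hi))
    return energies


# One 20 ms frame (160 samples at 8 kHz) of a 1 kHz tone.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(160)]
edges = [0, 500, 1500, 4000]                  # hypothetical, widening band edges in Hz
energies = band_energies(frame, sample_rate, edges)
```

For the 1 kHz tone almost all of the energy lands in the middle band, which is the kind of coarse spectrum shape the vocoder transmits once per frame.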
13
Example: Channel Vocoder
Spectrum shape (filter-bank energies) coded by DPCM
Combined with a binary ‘voiced/unvoiced’ flag, plus an estimate of the fundamental frequency f0 if ‘voiced’
1 ‘frame’ of data (48 bits) transmitted 50 times per second: 2,400 bps
14
Example: Channel Vocoder
Spectrum shape decoded and used to configure filterbank
Voiced/unvoiced flag plus f0 used to select source
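Source selection at the decoder can be sketched as follows: a pulse train at the fundamental frequency for voiced frames, and noise for unvoiced frames. The function name, the 8 kHz sample rate, the unit pulse amplitude and the uniform noise are all assumptions made for illustration.

```python
import random


def excitation(voiced, f0_hz, n_samples, sample_rate=8000, seed=0):
    """Return a pulse train at f0 for voiced frames, or white noise for unvoiced frames."""
    if voiced:
        period = round(sample_rate / f0_hz)   # pitch period in samples
        return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]
    rng = random.Random(seed)                 # seeded so the example is repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


voiced_frame = excitation(True, 100, 160)     # 20 ms at 8 kHz, pulses every 80 samples
unvoiced_frame = excitation(False, 0, 160)    # f0 is ignored for unvoiced frames
```

The chosen source would then be passed through the filterbank configured from the decoded spectrum shape.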
15
Example: Channel Vocoder
[Figure: block diagrams of the analyser and the synthesiser]
16
Linear Predictive Coding (LPC)
Basic idea
– Assume that the value of the speech signal at time t can be written as a weighted sum of its values at times t-1, t-2,…, t-N
– Nth order Linear Predictive Coding (LPC)
– The coefficients a1,…, aN can be thought of as the parameters of a digital filter (lecture 3)
– They define the vocal tract filter at time t
– Used in LPC vocoder
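Written out, the basic idea is s(t) ≈ a1·s(t-1) + a2·s(t-2) + … + aN·s(t-N). The slide does not say how the weights are estimated; the sketch below assumes the widely used autocorrelation method solved with the Levinson-Durbin recursion, and the function names are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """r[lag] = sum over t of signal[t] * signal[t - lag]."""
    n = len(signal)
    return [sum(signal[i] * signal[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Return [a1, ..., aN] so that s(t) is approximated by sum_i a_i * s(t - i)."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)                   # a[0] is unused; a[i] holds a_i
    error = r[0]                              # prediction error energy so far
    for i in range(1, order + 1):
        if error <= 0.0:                      # silent or perfectly predicted signal
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]


# A signal that obeys s(t) = 0.9 * s(t-1) exactly should give a1 close to 0.9.
signal = [0.9 ** t for t in range(200)]
coeffs = lpc_coefficients(signal, order=2)
```

In a vocoder this analysis would be repeated on each short frame, giving one coefficient set per frame.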
17
Finite Impulse Response (FIR) digital filter
[Figure: FIR filter block diagram with input x(n), unit delays z^-1, tap weights a1, a2, …, aN, and output y(n)]
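A minimal sketch of an FIR filter used as a linear predictor is given below. The exact tap layout of the figure is not recoverable from the transcript, so the form y(n) = a1·x(n-1) + … + aN·x(n-N), with samples before the start taken as zero, is an assumption chosen to match the LPC description; the function names are hypothetical.

```python
def fir_predict(x, coeffs):
    """y(n) = sum over k = 1..N of coeffs[k-1] * x(n - k), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        y.append(sum(a * x[n - k] for k, a in enumerate(coeffs, start=1) if n - k >= 0))
    return y


def residual(x, coeffs):
    """Prediction error e(n) = x(n) - y(n): what the predictor fails to explain."""
    return [xn - yn for xn, yn in zip(x, fir_predict(x, coeffs))]


x = [1.0, 0.5, 0.25, 0.125]       # each sample is half the previous one
predicted = fir_predict(x, [0.5])
error = residual(x, [0.5])
```

Here the single-tap predictor explains everything except the first sample, so the residual is a single pulse. In LPC terms the residual plays the role of the excitation signal.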
18
LPC Vocoders
Quality of LPC vocoded speech depends critically on the quality of the excitation signal
Two particular forms of LPC used for speech coding in GSM mobile phones
– RELP: Residual Excited LPC
– CELP: Codebook Excited LPC
19
Example: CELP Vocoders
Vocal tract filter:
– LPC analysis conducted over a short (~20 ms) section of speech to give LPC coefficients
Source:
– Excitation source estimated over window
– Compared with a finite set of ‘reference’ excitation signals e1,…, eC
– Code for most similar reference transmitted
– The set of references is called a codebook
– Hence Codebook Excited LPC
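The codebook lookup described above can be sketched as a nearest-neighbour search. Squared error is assumed as the similarity measure and the three-entry codebook is a toy example; practical CELP coders typically search by synthesising speech from each candidate and comparing with a perceptually weighted error, which this sketch does not attempt.

```python
import math


def best_codebook_index(excitation, codebook):
    """Return the index of the reference excitation closest to `excitation` (squared error)."""
    def squared_error(ref):
        return sum((e - r) ** 2 for e, r in zip(excitation, ref))
    return min(range(len(codebook)), key=lambda c: squared_error(codebook[c]))


def index_bits(codebook):
    """Bits needed to transmit one codebook index."""
    return math.ceil(math.log2(len(codebook)))


codebook = [                       # toy references e1..e3, for illustration only
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, -1.0, 1.0, -1.0],
]
estimated = [0.9, -0.8, 1.1, -1.0]
index = best_codebook_index(estimated, codebook)
```

Only the index is transmitted, so the cost per frame depends on the codebook size rather than on the length of the excitation window.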
20
Formant Vocoder
A formant vocoder exploits the importance of F1, F2 and F3 for speech perception
Formant frequencies, amplitudes and bandwidths estimated and used to model vocal tract filter
Transmitted, with V/UV and f0 information, at 50-100 frames per second
Speech decoded using a formant synthesiser
Using 5-6 bits for each of the 10 control parameters results in a 2.5-6 kbps bit rate
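The quoted range follows from parameters per frame times bits per parameter times frames per second. A short check, with a hypothetical helper name:

```python
def vocoder_bit_rate(n_params, bits_per_param, frames_per_second):
    """Bits per second for a frame-based vocoder with fixed-size frames."""
    return n_params * bits_per_param * frames_per_second


lowest = vocoder_bit_rate(10, 5, 50)      # 50 bits per frame at 50 frames per second
highest = vocoder_bit_rate(10, 6, 100)    # 60 bits per frame at 100 frames per second
```

The two extremes come out at 2,500 bps and 6,000 bps, matching the 2.5-6 kbps range on the slide.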
21
Recognition-Synthesis Coder
[Figure: Input Speech “recce report…” -> Speech Recognizer -> Phone-level transcription (r E k i r @ p O t..) -> Transmitter -> Receiver -> Phone-level transcription (r E k i r @ p O t..) -> Speech Synthesiser -> Output Speech “recce report…”]
50 bps
22
Recognition-synthesis coders
New technology, still in research labs
Very low data rates:
– Each of the sounds of English (~46 phonemes) can be specified using 6 bits
– Talking at 8 phonemes per second, the linguistic content can be encoded in roughly 50 bps (8 × 6 = 48)
Computationally complex
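The 6-bit figure is the smallest whole number of bits that can distinguish about 46 symbols, and the bit rate follows from the speaking rate. A short check, with hypothetical helper names and a fixed-length code assumed:

```python
import math


def bits_per_symbol(n_symbols):
    """Smallest whole number of bits that can label n_symbols distinct symbols."""
    return math.ceil(math.log2(n_symbols))


def symbol_bit_rate(n_symbols, symbols_per_second):
    """Bit rate for a fixed-length code at a given symbol rate."""
    return bits_per_symbol(n_symbols) * symbols_per_second


phoneme_bits = bits_per_symbol(46)        # 6, since 2**5 = 32 < 46 <= 64 = 2**6
phoneme_rate = symbol_bit_rate(46, 8)     # 48 bps, i.e. roughly 50 bps
```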
23
Use of ‘knowledge’
Bit rates reduced by exploiting properties of the speech signal:
– waveform coders: limited bandwidth
– vocoders: signal contains resonances
– recognition-synthesis: signal is speech
Highest-level models give lowest bit rates
Paralinguistic properties of the speech are sacrificed:
– speaker’s identity
– state of health
– emotional/psychological state
24
Summary of coding
Waveform coders
– PCM, DPCM, ADPCM
– Delta modulation
Vocoders
– Channel vocoder, RELP, CELP
– Segment vocoder
Recognition-synthesis coders
Trade-offs