Speech Coding Techniques

Chapter 3

Voice Quality (MOS)
The ITU recommends a 5-point scale: Excellent = 5, Good = 4, Fair = 3, Poor = 2, Bad = 1. This metric is known as the Mean Opinion Score (MOS).

About Speech
Data networks were not designed to carry speech. They were built for text: pages of text were sent at a data rate of 300 baud, which was fast enough for a printer at the far end to print them out as they arrived. The Teletype was an example of this.

Voice Sampling

Sampling Rate
Nyquist’s theorem states that a signal can be reconstructed if it is sampled at a rate of at least twice the maximum frequency of the signal. The speech frequency range is 300 to 3400 cycles/second, so for conversational speech the maximum is taken to be 4000 cycles/second, which leaves some margin. The sampling rate is then 8000 samples per second.
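
A minimal sketch of why the sampling rate has to be at least twice the highest frequency, using only the Python standard library. The tone frequencies (3000 Hz and 5000 Hz) and the helper name sample_tone are illustrative assumptions, not taken from the slides. At 8000 samples per second a 5000 Hz tone produces the same samples as a 3000 Hz tone, so the two cannot be told apart after sampling.

```python
import math

SAMPLE_RATE_HZ = 8000  # twice the 4000 Hz maximum assumed for speech


def sample_tone(freq_hz, rate_hz, count):
    """Return `count` samples of a unit-amplitude cosine at freq_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]


# 3000 Hz lies below half the sampling rate; 5000 Hz lies above it.
in_band = sample_tone(3000, SAMPLE_RATE_HZ, 16)
too_high = sample_tone(5000, SAMPLE_RATE_HZ, 16)

# Largest difference between the two sequences (floating-point noise only).
max_diff = max(abs(a - b) for a, b in zip(in_band, too_high))
print(f"largest sample difference: {max_diff:.2e}")
```

This is why frequencies above 4000 cycles/second have to be filtered out before sampling at this rate.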

Quantization

Quantization Noise
Each quantization level is represented by a binary code. The number of bits used determines the number of levels: n bits give 2^n levels. The number of levels determines the accuracy of our representation of the original signal. The difference between the actual signal and the digital reproduction is known as quantization noise.

Linear Quantization
Applicable when the signal is in a finite range (fmin, fmax). The entire data range is divided into L equal intervals of length Q (known as the quantization interval or quantization step size): Q = (fmax - fmin)/L. Interval i is mapped to the middle value of that interval. We store or send only the index of the quantized value.
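
The mapping described above can be sketched in a few lines of Python. The function names quantize_index and reconstruct, and the example range of -1 to 1 with L = 8, are assumptions chosen for illustration.

```python
import math


def quantize_index(x, fmin, fmax, levels):
    """Map x to the index of its quantization interval (0 .. levels-1)."""
    q = (fmax - fmin) / levels             # step size Q = (fmax - fmin) / L
    index = math.floor((x - fmin) / q)
    return min(max(index, 0), levels - 1)  # clamp values at or beyond the range edges


def reconstruct(index, fmin, fmax, levels):
    """Return the middle value of interval `index`."""
    q = (fmax - fmin) / levels
    return fmin + (index + 0.5) * q


# Example: range (-1, 1) split into 8 intervals, so Q = 0.25.
index = quantize_index(0.3, -1.0, 1.0, 8)
value = reconstruct(index, -1.0, 1.0, 8)
print(index, value)
```

Only the index needs to be stored or sent; the receiver recovers the mid-point value from it, and the reconstruction error never exceeds Q/2 for inputs inside the range.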

Signal Range is Symmetric

Errors
Errors occur on every sample except where the sample value exactly coincides with the mid-point of a decision level. If smaller steps are taken, the quantization error will be less. However, increasing the number of steps complicates the coding operation and increases bandwidth requirements. Quantizing noise depends on step size and not on signal amplitude.
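
The dependence on step size can be made concrete with the usual textbook model, which is an assumption added here and not stated on the slide: the error e is treated as uniformly distributed over one step of size Q.

```latex
|e| \le \frac{Q}{2}, \qquad
\sigma_e^2 = \frac{1}{Q}\int_{-Q/2}^{Q/2} e^2 \, de = \frac{Q^2}{12}
```

Under this model the noise power contains only Q, so a quiet signal sees the same noise as a loud one and therefore has a worse signal-to-noise ratio.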

Non-Linear Quantization
The quantizing intervals are not of equal size. Small quantizing intervals are allocated to small signal values (samples) and large quantizing intervals to large samples, so that the signal-to-quantization-distortion ratio is nearly independent of the signal level. S/N ratios for weak signals are much better, but are slightly worse for the stronger signals. “Companding” is used to quantize signals in this way.

Function representation

Companding
Formed from the words compressing and expanding. A PCM compression technique in which analogue signal values are rounded on a non-linear scale. The data is compressed before being sent and then expanded at the receiving end using the same non-linear scale. Companding reduces the noise and crosstalk levels at the receiver.

u-LAW and A-LAW definitions
A-law and u-law are companding schemes used in telephone networks to get more dynamic range from 8-bit samples than is available with linear coding. Typically, linear-scale samples of 13 bits (A-law) or 14 bits (u-law), sampled at 8 kHz, are companded to 8 bits (logarithmic scale) for transmission over a 64 kbit/s data channel. At the receiving end the data is converted back to the linear scale and played back.
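
A rough sketch of u-law companding, assuming the continuous logarithmic curve with mu = 255 and samples normalised to the range -1 to 1. Real G.711 codecs use a segmented, table-friendly approximation of this curve, so the code below illustrates the idea and is not the standard's bit-exact algorithm. All function names are illustrative.

```python
import math

MU = 255  # assumed u-law parameter


def compress(x):
    """u-law compression of a sample in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize_8bit(v):
    """Uniform 8-bit quantizer on [-1, 1]: 256 intervals, mid-point output."""
    step = 2 / 256
    index = min(int((v + 1) / step), 255)
    return -1 + (index + 0.5) * step


def companded(x):
    """Compress, quantize to 8 bits, then expand at the receiver."""
    return expand(quantize_8bit(compress(x)))


quiet = 0.01  # a weak sample, 1% of full scale
linear_error = abs(quiet - quantize_8bit(quiet))
companded_error = abs(quiet - companded(quiet))
print(f"linear 8-bit error:    {linear_error:.6f}")
print(f"companded 8-bit error: {companded_error:.6f}")
```

For the weak sample the companded path gives a much smaller error than plain 8-bit linear quantization, which is the gain in dynamic range the slide describes.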

Speech Codecs
Waveform codecs
Source codecs (vocoders)
Hybrid codecs

Waveform Codec
Waveform codecs attempt, without using any knowledge of how the signal to be coded was generated, to produce a reconstructed signal whose waveform is as close as possible to the original. This means that in theory they should be signal independent and work well with non-speech signals. Generally they are low-complexity codecs which produce high-quality speech at rates above about 16 kbits/s. When the data rate is lowered below this level, the reconstructed speech quality degrades rapidly.

Source Codec Source coders operate using a model of how the source was generated, and attempt to extract, from the signal being coded, the parameters of the model. It is these model parameters which are transmitted to the decoder. Source coders for speech are called vocoders, and work as follows. The vocal tract is represented as a time-varying filter and is excited with either a white noise source, for unvoiced speech segments, or a train of pulses separated by the pitch period for voiced speech. Therefore the information which must be sent to the decoder is the filter specification, a voiced/unvoiced flag, the necessary variance of the excitation signal, and the pitch period for voiced speech.
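
The decoder side of this model can be sketched as follows. This is a toy illustration under stated assumptions, not any standardised vocoder: the frame dictionary, the single filter coefficient, and the function names are hypothetical, and the gain is used as the standard deviation of the noise source.

```python
import random


def excitation(voiced, pitch_period, gain, count, rng):
    """Pulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0 for n in range(count)]
    return [rng.gauss(0.0, gain) for _ in range(count)]


def all_pole_filter(signal, coeffs):
    """y[n] = x[n] - sum(coeffs[k] * y[n-1-k]); a simple vocal-tract model."""
    out = []
    for n, x in enumerate(signal):
        y = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y -= a * out[n - 1 - k]
        out.append(y)
    return out


# Hypothetical frame: the four items the decoder receives for one frame.
frame = {"coeffs": [-0.9], "voiced": True, "gain": 1.0, "pitch_period": 80}

rng = random.Random(0)
source = excitation(frame["voiced"], frame["pitch_period"], frame["gain"], 160, rng)
speech = all_pole_filter(source, frame["coeffs"])
print(len(speech), speech[:3])
```

With 8000 samples per second, a pitch period of 80 samples corresponds to a 100 Hz voice pitch. Only the handful of frame parameters is transmitted, never the waveform itself, which is why such coders reach very low bit rates.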

Hybrid Codec Hybrid codecs attempt to fill the gap between waveform and source codecs. Waveform coders are capable of providing good quality speech at bit rates down to about 16 kbits/s, but are of limited use at rates below this. Source coders on the other hand can provide intelligible speech at 2.4 kbits/s and below, but cannot provide natural sounding speech at any bit rate. Although other forms of hybrid codecs exist, the most successful and commonly used are time domain Analysis-by-Synthesis (AbS) codecs.

G.711
Pulse Code Modulation (PCM) codecs are the simplest form of waveform codecs. Narrowband speech is typically sampled 8000 times per second, and then each speech sample must be quantized. If linear quantization is used, about 12 bits per sample are needed, giving a bit rate of about 96 kbits/s. However, this can easily be reduced by using non-linear quantization. For coding speech it was found that with non-linear quantization 8 bits per sample was sufficient for speech quality which is almost indistinguishable from the original. This gives a bit rate of 64 kbits/s. Two such non-linear PCM codecs, u-law and A-law, date from the 1960s and were standardised by the CCITT as G.711 in 1972.
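
The two bit rates quoted above follow directly from sample rate times bits per sample. A small sketch, with the helper name bit_rate_kbits chosen for illustration:

```python
SAMPLE_RATE_HZ = 8000  # narrowband speech, samples per second


def bit_rate_kbits(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bit rate in kbits/s for a PCM stream."""
    return bits_per_sample * sample_rate_hz / 1000


linear_rate = bit_rate_kbits(12)     # linear quantization
companded_rate = bit_rate_kbits(8)   # non-linear (companded) quantization
print(linear_rate, companded_rate)
```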

Adaptive Differential Pulse Code Modulation (ADPCM) codecs are waveform codecs which, instead of quantizing the speech signal directly, quantize the difference between the speech signal and a prediction that has been made of the speech signal. If the prediction is accurate, the difference between the real and predicted speech samples will have a lower variance than the real speech samples, and will be accurately quantized with fewer bits than would be needed to quantize the original speech samples.
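
A simplified sketch of the differential idea. It assumes the simplest possible predictor (the previous reconstructed sample) and a fixed step size, so it is plain DPCM and not the adaptive scheme of any standard; in ADPCM both the predictor and the step size adapt to the signal. The 200 Hz test tone and all names are illustrative.

```python
import math


def dpcm_encode(samples, step):
    """Quantize the difference between each sample and its prediction."""
    codes = []
    predicted = 0.0  # prediction = previous reconstructed sample
    for sample in samples:
        code = round((sample - predicted) / step)
        codes.append(code)
        predicted += code * step  # track what the decoder will reconstruct
    return codes


def dpcm_decode(codes, step):
    """Rebuild the signal by accumulating the dequantized differences."""
    out = []
    predicted = 0.0
    for code in codes:
        predicted += code * step
        out.append(predicted)
    return out


STEP = 0.01
speech_like = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]

codes = dpcm_encode(speech_like, STEP)
decoded = dpcm_decode(codes, STEP)

largest_code = max(abs(c) for c in codes)  # range needed for the differences
largest_direct = max(round(abs(s) / STEP) for s in speech_like)  # range without prediction
worst_error = max(abs(a - b) for a, b in zip(speech_like, decoded))
print(largest_code, largest_direct, worst_error)
```

The difference codes span a much smaller range than the directly quantized samples, so they fit in fewer bits at the same step size, while the reconstruction error stays within half a step.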

G.721, G.726 & G.727
In the mid-1980s the CCITT standardised a 32 kbits/s ADPCM codec, known as G.721, which gave reconstructed speech almost as good as the 64 kbits/s PCM codecs. Later, in recommendations G.726 and G.727, codecs operating at 40, 32, 24 and 16 kbits/s were standardised.

Code-Excited Linear Predictive (CELP)
At bit rates of around 16 kbits/s and lower, the quality of waveform codecs falls rapidly, as can be seen in the figure shown earlier. Thus at these rates hybrid codecs, especially CELP codecs and their derivatives, tend to be used. However, because of the forward-adaptive determination of the short-term filter coefficients used in most of these codecs, they tend to have high delays.

G.728 (Low-Delay) CELP Codecs
A backward-adaptive CELP codec was developed at AT&T Bell Labs and standardised in 1992 as G.728. This codec uses backward adaptation to calculate the short-term filter coefficients, which means that rather than buffering 20 ms or so of the input speech to calculate the filter coefficients, they are found from the past reconstructed speech. This means that the codec can use a much shorter frame length than traditional CELP codecs: G.728 uses a frame length of only 5 samples, giving it a total delay of less than 2 ms.
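
As a worked check of the frame length, assuming the 8000 samples per second narrowband rate used elsewhere in these slides:

```latex
T_{\text{frame}} = \frac{5\ \text{samples}}{8000\ \text{samples/s}} = 0.625\ \text{ms},
\qquad
\frac{20\ \text{ms}}{0.625\ \text{ms}} = 32
```

A 5-sample frame is therefore 32 times shorter than a 20 ms buffer, which is consistent with a total delay of under 2 ms.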

G.723 (Algebraic Code-Excited Linear Prediction, ACELP)
Normal conversation involves significant periods of silence. G.723 specifies a mechanism for silence suppression in which Silence Insertion Descriptor (SID) frames can be used. These are only 32 bits long, which means that silence occupies only about 1 kbit/s, compared with 64 kbit/s for G.711. G.723 has a MOS of 3.8 but a delay of 37.5 ms at the encoder.
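
The silence figure can be checked with a short calculation. It assumes one 32-bit SID frame per 30 ms frame interval; the 30 ms frame duration is not stated on the slide and is taken here as an assumption.

```latex
R_{\text{silence}} = \frac{32\ \text{bits}}{30\ \text{ms}} \approx 1.07\ \text{kbit/s}
```

Under that assumption silence costs roughly one sixtieth of the 64 kbit/s used by G.711.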

G.729
G.729 is an umbrella of vocoder standards. The G.729 codecs perform voice compression at bit rates that vary between 6.4 and 11.8 kbit/s, with 8 kbit/s for the base codec. The figure below shows an example of the G.729 vocoder connected to a digital communication channel. The input speech is fed into the G.729 encoder as a stream of 16-bit PCM samples, sampled at a rate of 8000 samples/second. The G.729 encoder compresses the data into the encoded stream.

G.729 also uses samples of the actual human speech to set the vocoder settings properly. It also compares the actual voice with the synthetic voice to come up with a "code". The code, along with the vocoder settings, is what is sent to the remote end. The remote end takes the code and vocoder settings and plays the sound.
