A Comparison Of Speech Coding With Linear Predictive Coding (LPC) And Code-Excited Linear Predictor Coding (CELP) By: Kendall Khodra Instructor: Dr. Kepuska.


Introduction This project implements Linear Predictive Coding (LPC) to process a speech signal. The objective is to mitigate the limited quality of the simple LPC model by using a more complex description of the excitation, Code-Excited Linear Prediction (CELP), to process the output of the simple LPC analysis.

Background Linear Predictive Coding (LPC) methods are the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification, and speech storage. LPC has been considered one of the most powerful techniques for speech analysis. In fact, this technique is the basis of more recent and sophisticated algorithms for estimating speech parameters, e.g., pitch, formants, spectra, vocal-tract shape, and low-bit-rate representations of speech.

The basic principle of linear prediction states that speech can be modeled as the output of a linear, time-varying system excited by either periodic pulses or random noise. These two kinds of acoustic sources are called voiced and unvoiced, respectively. In this sense, voiced sounds are those generated by the vibration of the vocal cords in the presence of airflow, and unvoiced sounds are those generated when the vocal cords are relaxed.

A. Physical Model: When you speak, air is pushed from your lungs through your vocal tract and out of your mouth as speech.

For voiced sounds, your vocal cords (folds) vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration) while adult males tend to have low pitch (slow vibration). For fricative and plosive (unvoiced) sounds, your vocal cords do not vibrate but remain open. The shape of your vocal tract, which changes as you speak, determines the sound that you make. The amount of air coming from your lungs determines the loudness of your voice.

B. Mathematical Model Block diagram of simplified mathematical model for speech production. The model says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses or a white-noise sequence.

The relationship between the physical and the mathematical models:

- Vocal tract -> H(z) (LPC filter)
- Air -> u(n) (innovation)
- Vocal cord vibration -> V (voiced)
- Vocal cord vibration period -> T (pitch period)
- Fricatives and plosives -> UV (unvoiced)

The vocal tract is modeled by the all-pole system function H(z) = G / A(z) = G / (1 - sum_{k=1..p} a_k z^-k).
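The source-filter model above can be sketched in a few lines of Python. This is an illustrative sketch only: the filter coefficients, sample rate, and pitch below are assumed values, not taken from the slides.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                       # assumed sample rate (Hz)
a = [1.0, -0.9]                 # illustrative A(z) coefficients (hypothetical)
G = 1.0                         # gain

# Voiced excitation: impulse train with an assumed 100 Hz pitch (period fs/100 samples)
T = fs // 100
u_voiced = np.zeros(fs // 10)
u_voiced[::T] = 1.0

# Unvoiced excitation: white-noise sequence
rng = np.random.default_rng(0)
u_unvoiced = rng.standard_normal(fs // 10)

# Speech = excitation passed through the all-pole LPC filter H(z) = G / A(z)
s_voiced = lfilter([G], a, u_voiced)
s_unvoiced = lfilter([G], a, u_unvoiced)
```

Switching between the two excitations while updating a per frame is exactly the voiced/unvoiced switch in the block diagram.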

The LPC Model The LPC method considers a speech sample s(n) at time n and approximates it as a linear combination of the past p samples:

s(n) ~ a_1 s(n-1) + a_2 s(n-2) + ... + a_p s(n-p) + G u(n)   (1)

where G is the gain and u(n) the normalized excitation. The predictor coefficients (the a_k's) are computed by minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones (as we will see later).

Block diagram of an LPC coder. In the LPC model the residual (excitation) is approximated during voicing by a quasi-periodic impulse train and during unvoicing by a white-noise sequence. This approximation is denoted by u^(n). We then pass u^(n) through the synthesis filter 1/A(z).

LPC consists of the following steps:
- Pre-emphasis filtering
- Data windowing
- Autocorrelation parameter estimation
- Pitch period and gain estimation
- Quantization
- Decoding and frame interpolation

Pre-emphasis Filtering When we speak, the speech signal experiences some spectral roll-off due to the radiation of the sound from the mouth. As a result, the majority of the spectral energy is concentrated in the lower frequencies. To have our model give equal weight to both low and high frequencies, we apply a high-pass filter to the original signal. This is done with a one-zero filter, called the pre-emphasis filter, of the form: y[n] = x[n] - a x[n-1]. Most standards use a = 15/16 = 0.9375 (our default). When we decode the speech, the last thing we do to each frame is pass it through a de-emphasis filter to undo this effect. Matlab: speech = filter([1 -preemp], 1, data)'; % Preemphasize speech
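The pre-emphasis/de-emphasis pair can be sketched directly from the filter definitions above; this is a minimal Python version of the Matlab one-liner (function names are our own):

```python
import numpy as np

def preemphasize(x, a=0.9375):
    """Pre-emphasis y[n] = x[n] - a*x[n-1], with a = 15/16 as on the slide."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])

def deemphasize(y, a=0.9375):
    """Inverse filter 1/(1 - a z^-1), applied after decoding to undo pre-emphasis."""
    y = np.asarray(y, dtype=float)
    x = np.empty_like(y)
    acc = 0.0
    for n, v in enumerate(y):
        acc = v + a * acc   # x[n] = y[n] + a*x[n-1]
        x[n] = acc
    return x
```

Applying `deemphasize(preemphasize(x))` returns the original signal, which is the round trip the encoder and decoder perform.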

Data Windowing Because speech signals vary with time, this processing is done on short chunks of the signal, which we call frames. Usually 30 to 50 ms frames give intelligible speech with good compression. In this project we use overlapping data frames to avoid discontinuities in the model, with a frame width of 30 ms and an overlap of 10 ms. A Hamming window was used to extract the frames, as shown below.
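The framing step above can be sketched as follows; the 30 ms width and 10 ms overlap are from the slide, while the 8 kHz sample rate is an assumption:

```python
import numpy as np

def frames(x, fs=8000, frame_ms=30, overlap_ms=10):
    """Split x into overlapping Hamming-windowed frames."""
    width = int(fs * frame_ms / 1000)          # 240 samples at 8 kHz
    hop = width - int(fs * overlap_ms / 1000)  # advance = width - overlap
    win = np.hamming(width)
    n = max(0, 1 + (len(x) - width) // hop)
    return np.stack([win * x[i * hop : i * hop + width] for i in range(n)])
```

Each row is one analysis frame, ready for autocorrelation and LPC fitting.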

Determining Pitch Period For each frame, we must determine whether the speech is voiced or unvoiced. We do this by searching for periodicities in the residual (prediction error) signal. To decide, we apply a threshold to the autocorrelation; typically this threshold is set at R_x(0) * 0.3. If no value of the autocorrelation sequence exceeds the threshold, we declare the frame unvoiced. If the data is periodic, there will be spikes that exceed the threshold, and we declare the frame voiced. The distance between spikes in the autocorrelation function equals the pitch period of the original signal.
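A minimal sketch of this decision in Python, using the 0.3*R(0) threshold from the slide; the pitch search range (50-400 Hz) and sample rate are assumptions of ours:

```python
import numpy as np

def pitch_decision(frame, fs=8000, fmin=50, fmax=400, thresh=0.3):
    """Voiced/unvoiced decision and pitch estimate from the autocorrelation.
    Voiced if the autocorrelation peak in the pitch-lag range exceeds thresh*R(0)."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # r[0..N-1]
    lo, hi = int(fs / fmax), int(fs / fmin)     # plausible pitch lags
    lag = lo + int(np.argmax(r[lo:hi]))
    if r[lag] > thresh * r[0]:
        return True, fs / lag                   # voiced, pitch in Hz
    return False, 0.0                           # unvoiced
```

In the full coder this test is applied to the residual rather than the raw frame, but the mechanics are the same.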

LPC analyzes the speech signal by:
- Estimating the formants
- Removing their effects from the speech signal
- Estimating the intensity and frequency of the remaining signal
The process of removing the formants is called inverse filtering, and the remaining signal is called the residue. LPC synthesizes the speech signal by reversing the process:
- Use the residue to create a source signal
- Use the formants to create a filter (which represents the tube/tract)
- Run the source through the filter, resulting in speech

Estimating the Formants The coefficients of the difference equation (the prediction coefficients) characterize the formants. The LPC system estimates these coefficients by minimizing the mean-square error between the predicted signal and the actual signal.

CELP (Code Excited Linear Predictor) A CELP coder does the same LPC modeling but then computes the error between the original speech and the synthetic model, and transmits both the model parameters and a very compressed representation of the error. The compressed representation is an index into a 'code book' shared between coder and decoder -- this is why it is called "Code Excited". A CELP coder does much more work than an LPC coder (usually about an order of magnitude more), but the result is much higher quality speech:

Block diagram of the CELP

The perceptual weighting filter is defined as W(z) = A(z) / A(z/r), with 0 < r < 1. This filter de-emphasizes the frequency regions that correspond to the formants, as determined by the LPC analysis, so that the noise located in the formant regions, which is the most perceptibly disturbing, can be reduced. The amount of de-emphasis is controlled by the factor r.
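Computing A(z/r) from A(z) only requires scaling the k-th coefficient by r^k, so the weighting filter can be sketched as below. This is the common CELP formulation; the exact weighting used in the project is not given on the slide, and the coefficient values in the example are hypothetical.

```python
import numpy as np

def bandwidth_expand(a, r):
    """Coefficients of A(z/r) given A(z) stored as [1, c_1, ..., c_p]:
    the coefficient of z^-k is scaled by r**k."""
    a = np.asarray(a, dtype=float)
    return a * r ** np.arange(len(a))
```

The weighted error can then be obtained by filtering the error signal with numerator A(z) and denominator A(z/r), e.g. `scipy.signal.lfilter(a, bandwidth_expand(a, 0.8), e)`.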

After determining the formant synthesis filter 1/A(z), the pitch synthesis filter 1/P(z), and the encoding data rate, we can do an excitation codebook search. The codebook search is performed in subframes of an LPC frame. The subframe length is usually equal to or shorter than the pitch subframe length.

Autocorrelation Parameter Estimation The autocorrelation method assumes that the signal is identically zero outside the analysis interval (0 <= m <= N-1). It then tries to minimize the prediction error wherever it is nonzero, that is, in the interval 0 <= m <= N-1+p, where p is the order of the model used. The error is likely to be large at the beginning and at the end of this interval, which is why the speech segment analyzed is usually tapered by the application of a Hamming window.

Finding the Parameters Given a frame of speech s(m), our goal is to find the predictor coefficients a_k that minimize the square of the prediction error over a short segment of speech. The mean short-time prediction error per frame is defined as:

E = sum_m e^2(m) = sum_m [ s(m) - sum_{k=1..p} a_k s(m-k) ]^2

To minimize E we take its derivative with respect to each a_i and set it to zero. This results in the normal equations:

sum_{k=1..p} a_k sum_m s(m-k) s(m-i) = sum_m s(m) s(m-i),  1 <= i <= p

Letting R(i) = sum_{m=i..N-1} s(m) s(m-i), we have:

sum_{k=1..p} a_k R(|i-k|) = R(i),  1 <= i <= p

This system of equations is solved using the Levinson-Durbin algorithm, which finds the filter coefficients a_i from the system Ra = r. The Levinson-Durbin algorithm reduces the cost of the solution from O(n^3) to O(n^2) by exploiting the fact that the matrix R is Toeplitz and Hermitian (symmetric, for real signals).

Matlab:

% Levinson's method
err(1) = autoCorVec(1);
k(1) = 0;
A = [];
for index=1:L
    numerator = [1 A.']*autoCorVec(index+1:-1:2);
    denominator = -1*err(index);
    k(index) = numerator/denominator;   % PARCOR coeffs
    A = [A+k(index)*flipud(A); k(index)];
    err(index+1) = (1-k(index)^2)*err(index);
end
aCoeff(:,nframe) = [1; A];
parcor(:,nframe) = k';
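A Python transcription of the same recursion may help to follow the Matlab loop; like the Matlab code, it returns the prediction-error filter [1, c_1, ..., c_p] (so the synthesis filter is 1/A(z)) together with the PARCOR coefficients:

```python
import numpy as np

def levinson(r, order):
    """Levinson-Durbin recursion for the Toeplitz normal equations.
    r: autocorrelation vector r[0..order].
    Returns (a, k, err): prediction-error filter a = [1, c_1, ..., c_p],
    PARCOR (reflection) coefficients k, and the final prediction error."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        # Reflection coefficient from the current error and past coefficients
        acc = r[i] + a[1:i] @ r[i-1:0:-1]
        ki = -acc / err
        k[i-1] = ki
        # Coefficient update: a_new = a + ki * reversed(a)
        a[1:i+1] = a[1:i+1] + ki * a[i-1::-1][:i]
        err *= (1.0 - ki * ki)
    return a, k, err
```

For an AR(1)-like autocorrelation r = [1, 0.5, 0.25] this yields A = [1, -0.5, 0], i.e. the predictor s(n) ~ 0.5 s(n-1), as expected.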

Helpful Matlab tools used:

synFrame = filter(1, A', residFrame);
    Filters the data in vector residFrame with the filter described by vector A.

resid2 = dct(resid);
    Returns the discrete cosine transform of resid. Only the first 50 coefficients are kept, since most of the energy is stored there.

resid3 = uencode(resid2, 4);
    Uniformly quantizes and encodes the data in vector resid2 into N bits (here N = 4).

newsignal = udecode(resid3, 4);
    The inverse of uencode.
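The uencode/udecode pair is essentially a uniform quantizer over a symmetric range. A loose Python equivalent is sketched below; it assumes a known peak value and is not bit-exact with Matlab's functions:

```python
import numpy as np

def quantize(x, n_bits, peak):
    """Map values in [-peak, peak] to integer codes 0 .. 2**n_bits - 1."""
    levels = 2 ** n_bits
    step = 2 * peak / levels
    return np.clip(np.floor((np.asarray(x) + peak) / step), 0, levels - 1).astype(int)

def dequantize(codes, n_bits, peak):
    """Reconstruct each code at the center of its quantization cell."""
    levels = 2 ** n_bits
    step = 2 * peak / levels
    return (np.asarray(codes) + 0.5) * step - peak
```

With 4 bits over [-1, 1] the step is 0.125, so the round-trip error is at most half a step (0.0625), which is the quantization noise the Drawbacks slide refers to.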

Results

It can be seen from the waveforms that the CELP output looks much more like the original signal, and hence CELP is the better method for speech coding. This is emphasized by the log-magnitude spectrum. The synthesized voice of the linear-prediction waveform is peaky and sounds buzzy, since it is based on the autocorrelation method, which loses the absolute phase structure because of its minimum-phase characteristics.

Results Audio samples (male and female voices): original signal, LPC signal, and CELP signal, each at 4-bit and 8-bit encoding.

Drawbacks The LPC method has inherent (quantization) errors and in most cases does not give an accurate solution. The tapering effect of the (Hamming) window also introduces error, since the waveform may not follow the assumed all-pole model. However, the tapering of the window has the advantage that the least-squares error in finding the solution is reduced.

Conclusion Comparing the original speech against the LPC and the CELP speech: in both cases, the reconstructed speech has lower quality than the input speech. Both reconstructions sound noisy, with the LPC model being nearly unintelligible; its output seems whispered, with an extensive amount of noise. The CELP reconstruction sounds more spoken and less whispered. In all, the CELP speech sounded closer to the original, though still with a muffled sound.

Further investigation MELP The MELP (Mixed-Excitation Linear Predictive) Vocoder is the new 2400 bps Federal Standard speech coder. It is robust in difficult background noise environments such as those frequently encountered in commercial and military communication systems. It is very efficient in its computational requirements. The MELP Vocoder is based on the traditional LPC parametric model, but also includes four additional features. These are mixed-excitation, aperiodic pulses, pulse dispersion, and adaptive spectral enhancement.

The mixed excitation is implemented using a multi-band mixing model. Its primary effect is to reduce the buzz usually associated with LPC vocoders, especially in broadband acoustic noise. It does, however, require an explicit multi-band decision and source characterization.

References:
[1] J. L. Flanagan and L. R. Rabiner, Speech Synthesis, Dowden, Hutchinson & Ross, Inc., Stroudsburg, Pennsylvania.
[2] Z. Li and M. S. Drew, Fundamentals of Multimedia, Prentice Hall, 2003.
[3] Atlanta Signal Processors, Inc., The New 2400 bps Federal Standard Speech Coder.