Cepstrum and MFCC (Speech processing)


Topics: the cepstrum; Mel-frequency cepstral coefficients (MFCC).

Cepstrum. A word coined by reversing the first four letters of "spectrum": spectrum -> cepstrum. It is, loosely, the spectrum of the spectrum of a signal.

Glottis and cepstrum. Speech wave (X) = excitation (E) . filter (H): the glottal excitation E, produced at the vocal cords (glottis), drives the vocal-tract filter H to give the output speech. Voiced sound therefore has strong glottal-excitation frequency content, and in the cepstrum we can easily identify and remove the glottal excitation E from the vocal-tract filter H. (Figure: http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif)

Cepstral analysis. The signal s is the convolution (*) of the glottal excitation e and the vocal-tract filter h: s(n) = e(n) * h(n), where n is the time index. After the Fourier transform, FT{s(n)} = FT{e(n) * h(n)}, convolution (*) becomes multiplication (.): going from n (time) to w (frequency), S(w) = E(w) . H(w). Taking the magnitude of the spectrum gives |S(w)| = |E(w)| . |H(w)|, and taking logs turns the product into a sum: log10|S(w)| = log10|E(w)| + log10|H(w)|. Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
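
The identity above can be checked numerically. The sketch below uses an illustrative excitation/filter pair (an impulse train and a decaying impulse response, not taken from the slides) and verifies that the log magnitude spectrum of the convolution equals the sum of the individual log magnitude spectra.

```python
import numpy as np

# Toy glottal excitation e(n): an impulse train (period 64 samples),
# and a toy vocal-tract filter h(n): a decaying impulse response.
e = (np.arange(256) % 64 == 0).astype(float)
h = 0.9 ** np.arange(256)

s = np.convolve(e, h)                # s(n) = e(n) * h(n), length 511

nfft = 512                           # >= len(s), so the DFT identity holds exactly
S = np.abs(np.fft.rfft(s, nfft))
E = np.abs(np.fft.rfft(e, nfft))
H = np.abs(np.fft.rfft(h, nfft))

# Convolution -> multiplication: |S(w)| = |E(w)| . |H(w)|.
# Logs turn the product into a sum; skip bins where |S| ~ 0 to avoid log(0).
mask = S > 1e-6
err = np.max(np.abs(np.log10(S[mask]) - (np.log10(E[mask]) + np.log10(H[mask]))))
```

The maximum deviation `err` is at floating-point level, confirming log10|S| = log10|E| + log10|H|.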

Cepstrum. c(n) = IDFT[ log10|S(w)| ] = IDFT[ log10|E(w)| + log10|H(w)| ]. In c(n) the excitation part (from e) and the filter part (from h) appear at two different positions. Application: useful for (i) glottal-excitation and (ii) vocal-tract-filter analysis. Block diagram: s(n) -> windowing -> x(n) -> DFT -> X(w) -> log|X(w)| -> IDFT -> c(n). (n = time index, w = frequency, IDFT = inverse discrete Fourier transform.)
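
The block diagram maps directly onto a few lines of NumPy. This is a sketch, with the window choice (Hamming) and FFT size picked for illustration:

```python
import numpy as np

def cepstrum(s, nfft=512):
    """c(n) = IDFT[ log10 |DFT(windowed s)| ], per the block diagram."""
    x = s * np.hamming(len(s))                     # windowing
    X = np.abs(np.fft.fft(x, nfft))                # DFT, magnitude |X(w)|
    c = np.real(np.fft.ifft(np.log10(X + 1e-10)))  # log, then IDFT (floor avoids log 0)
    return c

# e.g. one 512-sample frame of a synthetic voiced-like signal
frame = np.sin(2 * np.pi * 0.02 * np.arange(512)) ** 3
c = cepstrum(frame)
```

Because the input is real, log|X(w)| is an even function, so the resulting cepstrum is real and symmetric about its midpoint.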

Cepstral pitch detection (Dr. K.H. Wong, Introduction to Speech Processing, v.74d). The theory behind the cepstral detector is that the Fourier transform of a pitched signal usually has a number of regularly spaced peaks representing the harmonic spectrum. When the log magnitude of the spectrum is taken, these peaks are reduced (their amplitudes are brought into a usable scale). The result is a periodic waveform in the frequency domain, whose period is related to the fundamental frequency of the original signal. This means that a Fourier transform of this waveform has a peak representing the fundamental frequency.
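
A minimal sketch of this detector on a synthetic pitched signal (the sample rate, pitch, and search range below are illustrative choices, not values from the slides):

```python
import numpy as np

fs = 8000                        # sample rate (Hz), chosen for the example
f0 = 200.0                       # true pitch; period = fs/f0 = 40 samples
t = np.arange(1024) / fs
# a pitched signal: harmonics of f0 filling the band (amplitudes 1/k)
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 20))
x *= np.hamming(len(x))

log_mag = np.log(np.abs(np.fft.fft(x, 2048)) + 1e-12)
ceps = np.real(np.fft.ifft(log_mag))

# the cepstral peak inside a plausible pitch range (50..400 Hz) gives f0
q_min, q_max = int(fs / 400), int(fs / 50)
q_peak = q_min + np.argmax(ceps[q_min:q_max])
pitch_hz = fs / q_peak
```

The harmonic ripple in the log spectrum repeats every fs/f0 frequency bins, so the cepstral peak lands at quefrency fs/f0 = 40 samples, i.e. a pitch estimate near 200 Hz.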

MFCC. MFCC is an efficient speech feature based on human hearing perception, i.e. it exploits the known variation of the human ear's critical bandwidth with frequency. Pipeline: speech signal -> pre-emphasis -> framing -> windowing (Hamming) -> FFT -> Mel filter bank -> DCT -> MFCC.
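
The pipeline above can be sketched end to end in NumPy. This is a hedged, minimal implementation: the frame length, hop, filter count, coefficient count, and pre-emphasis coefficient (0.97) are common illustrative choices, not values given in the slides.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, nfft, fs):
    # triangular filters spaced uniformly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):
            fb[i - 1, k] = (k - lo) / max(mid - lo, 1)   # rising edge
        for k in range(mid, hi):
            fb[i - 1, k] = (hi - k) / max(hi - mid, 1)   # falling edge
    return fb

def mfcc(signal, fs, frame_len=400, hop=160, nfft=512, n_filters=26, n_ceps=13):
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])    # pre-emphasis
    n_frames = 1 + (len(signal) - frame_len) // hop                   # framing
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)                                   # windowing
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft             # FFT -> power
    energies = power @ mel_filterbank(n_filters, nfft, fs).T          # Mel filter bank
    log_e = np.log(energies + 1e-12)
    # DCT-II of the log filter-bank energies; keep the first n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_e @ dct.T

fs = 16000
t = np.arange(fs) / fs
feats = mfcc(np.sin(2 * np.pi * 440 * t), fs)   # one second of a 440 Hz tone
```

One second of audio at 16 kHz with 25 ms frames and a 10 ms hop yields 98 frames of 13 coefficients each.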

MFCC (cont'd). If x(n) is the input signal, the short-time Fourier transform of frame a gives the power spectrum |X_a(k)|^2, which is then passed through the triangular filters of the Mel-frequency filter bank.

Cont'd. Human-ear perception of the frequency content of speech sounds does not follow a linear scale. Therefore, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the Mel scale. The mel frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. To compute the mels for a given frequency f in Hz, the following approximate formula is used: Mel(f) = 2595 * log10(1 + f/700).

Cont'd. The subjective spectrum is simulated with a filter bank, one filter for each desired mel-frequency component. Each filter has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. Finally, the log mel spectrum S(m) is converted back to the time domain with a discrete cosine transform (DCT), which yields the MFCCs.

Filtering. Ways to find the spectral envelope: filter banks, which can be uniform or non-uniform. (Figure: outputs of filters 1-4 sampling the spectral-envelope energy along the frequency axis.)

Filtering method. For each frame (e.g. 10-30 ms, with e.g. 5 ms frame overlap), a set of filter outputs is calculated. There are many different methods for setting the filter bandwidths: uniform or non-uniform. (Figure: input waveform cut into 30 ms time frames i, i+1, i+2, each giving filter outputs (v1, v2, ...).)

How to determine filter band ranges: uniform filter banks; log-frequency banks; Mel filter bands.

Uniform filter banks. The bandwidth is B = sampling frequency (Fs) / number of banks (N). For example, Fs = 10 kHz and N = 20 give B = 500 Hz. Simple to implement, but not very useful. (Figure: filter outputs v1, v2, v3, ... over uniform bands at 500 Hz, 1 kHz, 1.5 kHz, 2 kHz, ...)
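
The arithmetic, following the slide's numbers (note that a practical bank would only cover up to the Nyquist frequency Fs/2; the slide's edges run to Fs):

```python
import numpy as np

fs = 10_000                  # sampling frequency Fs (Hz), as in the slide
n_banks = 20                 # number of banks N
bandwidth = fs / n_banks     # B = Fs / N = 500 Hz

# uniform band edges: 0, 500, 1000, ... Hz (in practice the banks would
# stop at the Nyquist frequency Fs/2 = 5000 Hz)
edges = np.arange(n_banks + 1) * bandwidth
```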

Non-uniform filter banks: log frequency. A log-frequency scale is close to the human ear. (Figure: filter outputs v1, v2, v3 over bands with edges at 200, 400, 800, 1600, 3200 Hz.)

Inner ear and the cochlea (the human ear also has filter bands). http://universe-review.ca/I10-85-cochlea2.jpg http://www.edu.ipa.go.jp/chiyo/HuBEd/HTML1/en/3D/ear.html

Mel filter bands (found by psychological and instrumentation experiments). Frequencies below 1 kHz get narrower bands, on a linear scale (more filters below 1 kHz); higher frequencies get larger bands, on a log scale (fewer filters above 1 kHz). http://instruct1.cit.cornell.edu/courses/ece576/FinalProjects/f2008/pae26_jsc59/pae26_jsc59/images/melfilt.png

Mel scale (melody scale), from http://en.wikipedia.org: it measures the relative strength in perception of different frequencies. The mel scale, named by Stevens, Volkman and Newman in 1937, is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The reference point between this scale and normal frequency measurement is defined by assigning a perceptual pitch of 1000 mels to a 1000 Hz tone, 40 dB above the listener's threshold. …. The name mel comes from the word melody to indicate that the scale is based on pitch comparisons.

Critical band scale: Mel scale. Based on perceptual studies: a linear scale when the frequency is below 1 kHz, a log scale when the frequency is above 1 kHz. Popular scales are the "Mel" (stands for melody) and "Bark" scales. (Figure: Mel scale m(f) versus frequency f in Hz; below 1 kHz, m(f) is close to f (linear); above 1 kHz, m(f) < f (log scale).) http://en.wikipedia.org/wiki/Mel_scale

Work examples. Exercise 1: when the input frequency ranges from 200 to 800 Hz (delta f = 600 Hz), what is the delta Mel (delta m) on the Mel scale? Exercise 2: when the input frequency ranges from 6000 to 7000 Hz (delta f = 1000 Hz), what is the delta Mel (delta m) on the Mel scale?

Work examples: answers. Answer 1: delta m is also about 600, because the scale is (nearly) linear below 1 kHz. Answer 2: by observation of the Mel-scale diagram, the range maps from about 2600 to 2750 mel, so delta m is about 150; it is a log-scale change. Re-calculating with the formula M = 2595 log10(1 + f/700): M_low = 2595 log10(1 + 6000/700), M_high = 2595 log10(1 + 7000/700), so delta m = M_high - M_low = 2595 log10(1 + 7000/700) - 2595 log10(1 + 6000/700) = 156.7793, which agrees with the observation (the Mel scale is a log scale there).
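
Both answers can be checked directly against the formula. Note that even for Exercise 1 the formula gives about 576 mel rather than exactly 600, since the "linear below 1 kHz" behaviour is itself an approximation:

```python
import math

def mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f/700), the formula used above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Exercise 2: 6000 -> 7000 Hz (log region); matches the 156.7793 above
delta_high = mel(7000.0) - mel(6000.0)

# Exercise 1: 200 -> 800 Hz (near-linear region); the formula gives
# ~576 mel for the 600 Hz span, so "delta m = 600" is an approximation
delta_low = mel(800.0) - mel(200.0)
```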

Example of cepstrum. Download http://www.cse.cuhk.edu.hk/%7Ekhwong/www2/cmsc5707/demo_for_ch4_cepstrum.zip and run spCepstrumDemo in MATLAB; 'sor1.wav' has a sampling frequency of 22.05 kHz.

From signal to cepstrum (DFT = discrete Fourier transform): s(n) is the time-domain signal; x(n) = windowed(s(n)) suppresses the two sides of the frame; |X(w)| = DFT(x(n)) is the frequency-domain signal; take log|X(w)|; then c(n) = IDFT(log|X(w)|) gives the cepstrum, in which the glottal-excitation cepstrum and the vocal-tract cepstrum appear at different quefrencies. http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1

Liftering (to remove the glottal excitation). Low-time liftering: magnify (or inspect) the low-time region to find the vocal-tract filter cepstrum (used for speech recognition). High-time liftering: magnify (or inspect) the high-time region to find the glottal-excitation cepstrum (removed for speech recognition). The cut-off is found by experiment; frequency = Fs / quefrency, where Fs = sampling frequency = 22050 Hz.
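
A sketch of the selection step. The cut-off quefrency here is an illustrative choice; as the slide says, in practice it is found by experiment:

```python
import numpy as np

def lifter(ceps, n_cut, part="low"):
    """Keep the low-time (vocal tract) or high-time (glottal excitation)
    part of a length-N real cepstrum, zeroing the rest. The real cepstrum
    is symmetric, c[n] == c[N-n], so each region is kept together with
    its mirror half (assumes n_cut >= 2)."""
    out = np.zeros_like(ceps)
    n = len(ceps)
    if part == "low":
        out[:n_cut] = ceps[:n_cut]
        out[n - n_cut + 1:] = ceps[n - n_cut + 1:]      # mirror of 1..n_cut-1
    else:
        out[n_cut:n - n_cut + 1] = ceps[n_cut:n - n_cut + 1]
    return out

# toy cepstrum; the low-time and high-time parts reassemble the original
c = np.cos(np.arange(512) * 0.1)
n_cut = 30                         # cut-off quefrency (illustrative)
low = lifter(c, n_cut, "low")      # vocal-tract part, for recognition
high = lifter(c, n_cut, "high")    # glottal-excitation part
```

Because the two selections are complementary, low + high reconstructs the full cepstrum exactly.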

Reasons for liftering. Why do we need this? The spectrum of speech (input speech signal x -> Fourier transform -> spectrum of x) contains many ripples caused by vocal-cord vibrations (the glottal excitation). For recognition and reproduction, however, we are more interested in the speech envelope, so the excitation ripples are removed. http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/files/notes10.pdf

Liftering method: from the cepstrum of the signal x, select the high-time part (C_high) and the low-time part (C_low).