Feature Extraction for ASR Spectral (envelope) Analysis Auditory Model/ Normalizations.



Deriving the envelope (or the excitation)
[Diagram: excitation e(n) -> time-varying filter h_t(n) -> y(n) = e(n) * h_t(n)]
How can we get e(n) or h(n) from y(n)?

But first, why?
- Excitation/pitch: for vocoding; for synthesis; for signal transformation; for prosody extraction (emotion, sentence end, ASR for tonal languages …); for voicing category in ASR
- Filter (envelope): for vocoding; for synthesis; for phonetically relevant information for ASR

Spectral Envelope Estimation
- Filters
- Cepstral deconvolution (homomorphic filtering)
- LPC

Channel vocoder (analysis)
[Figure labels: e(n)*h(n); broad w.r.t. harmonics]

Bandpass power estimation
[Diagram: band-pass filter, rectifier, low-pass filter; signals labeled A, B, C]

Deriving spectral envelope with a filter bank
[Diagram: speech -> BP 1 ... BP N -> rectify -> LP 1 ... LP N -> decimate -> magnitude signals]
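
A minimal sketch of one channel of this chain (rectify, low-pass, decimate), assuming the band-pass stage has already been applied. The function name `band_envelope`, the one-pole smoother, and the test tone are illustrative choices, not taken from the slides.

```python
import math

def band_envelope(band_signal, smooth=0.99, decimate_by=100):
    """Rectify, low-pass (one-pole smoother), and decimate one band-pass output."""
    env = []
    state = 0.0
    for x in band_signal:
        rectified = abs(x)                                   # full-wave rectifier
        state = smooth * state + (1.0 - smooth) * rectified  # one-pole low-pass
        env.append(state)
    return env[::decimate_by]                                # keep every decimate_by-th sample

# Hypothetical band-pass output: a 1 kHz tone sampled at 8 kHz, amplitude 0.5
fs = 8000
tone = [0.5 * math.sin(2 * math.pi * 1000 * n / fs) for n in range(4000)]
envelope = band_envelope(tone)
```

Running the same function on each of the N band-pass outputs gives the N slowly varying magnitude signals. A real system would normally use a properly designed low-pass filter rather than this one-pole stand-in.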

Filterbank properties
- Original Dudley Voder/Vocoder: 10 filters, 300 Hz bandwidth (based on # fingers!)
- A decade later, Vaderson used 30 filters, 100 Hz bandwidth (better)
- Using variable frequency resolution, can use 16 filters with the same quality

Mel filterbank
- Warping function: B(f) = 1125 ln(1 + f/700)
- Based on listening experiments with pitch
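
The warping function can be sketched directly. Here `hz_to_mel` implements B(f) as written on the slide, with f assumed to be in Hz; `mel_to_hz` is its algebraic inverse. The helper `mel_center_frequencies`, the 16-filter count, and the 0 to 4000 Hz range are illustrative assumptions.

```python
import math

def hz_to_mel(f_hz):
    """B(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of the warping function."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies (Hz) equally spaced on the warped axis."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_center_frequencies(16, 0.0, 4000.0)
```

Equal spacing on the warped axis gives closely spaced filters at low frequencies and widely spaced ones at high frequencies, which is the variable frequency resolution the previous slide refers to.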

Towards other deconvolution methods
- Filters seem biologically plausible
- Other operations could potentially separate excitation from filter
- Periodic source provides harmonics (close together in frequency)
- Filter provides broad influence (envelope) on harmonic series
- Can we use these facts to separate?

“Homomorphic” processing
- Linear processing is well-behaved
- Some simple nonlinearities also permit simple processing, interpretation
- Logarithm a good example; multiplicative effects become additive
- Sometimes in additive domain, parts more separable
- Famous example: blind deconvolution of Caruso recordings

IEEE Oral History Transcripts: Oppenheim on Stockham’s Deconvolution of Caruso Recordings (1)
Oppenheim: Then all speech compression systems and many speech recognition systems are oriented toward doing this deconvolution, then processing things separately, and then going on from there. A very different application of homomorphic deconvolution was something that Tom Stockham did. He started it at Lincoln and continued it at the University of Utah. It has become very famous, actually. It involves using homomorphic deconvolution to restore old Caruso recordings.
Goldstein: I have heard about that.
Oppenheim: Yes. So you know that's become one of the well-known applications of deconvolution for speech. …
Oppenheim: What happens in a recording like Caruso's is that he was singing into a horn to make the recording. The recording horn has an impulse response, and that distorts the effect of his voice, like my talking like this. [cupping his hands around his mouth]
Goldstein: Okay.

IEEE Oral History Transcripts (2)
Oppenheim: So there is a reverberant quality to it. Now what you want to do is deconvolve that out, because what you hear when I do this [cupping his hands around his mouth] is the convolution of what I'm saying and the impulse response of this horn. Now you could say, "Well why don't you go off and measure it. Just get one of those old horns, measure its impulse response, and then you can do the deconvolution." The problem is that the characteristics of those horns changed with temperature, and they changed with the way they were turned up each time. So you've got to estimate that from the music itself. That led to a whole notion which I believe Tom launched, which is the concept of blind deconvolution. In other words, being able to estimate from the signal that you've got the convolutional piece that you want to get rid of. Tom did that using some of the techniques of homomorphic filtering. Tom and a student of his at Utah named Neil Miller did some further work. After the deconvolution, what happens is you apply some high pass filtering to the recording. That's what it ends up doing. What that does is amplify some of the noise that's on the recording. Tom and Neil knew Caruso's singing. You can use the homomorphic vocoder that I developed to analyze the singing and then resynthesize it. When you resynthesize it you can do so without the noise. They did that, and of course what happens is not only do you get rid of the noise but you get rid of the orchestra. That's actually become a very fun demo which I still play in my class. This was done twenty years ago, but it's still pretty dramatic. You hear Caruso singing with the orchestra, then you can hear the enhanced version after the blind deconvolution, and then you can also hear the result after you get rid of the orchestra. Getting rid of the orchestra is something you can't do with linear filtering. It has to be a nonlinear technique.

Log processing
- Suppose y(n) = e(n)*h(n)
- Then Y(f) = E(f)H(f)
- And log Y(f) = log E(f) + log H(f)
- In some cases, these pieces are separable by a linear filter
- If all you want is H, processing can smooth Y(f)
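
A small numerical check of the log-additivity step, as a sketch only. It assumes circular convolution so that the DFT product relation holds exactly on a short frame; the sequences `e` and `h` are made-up toy values, and the naive `dft` helper is written out to keep the example dependency-free.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

e = [1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]     # toy "excitation"
h = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]  # toy "filter"
y = circular_convolve(e, h)

log_Y = [math.log(abs(v)) for v in dft(y)]
log_E = [math.log(abs(v)) for v in dft(e)]
log_H = [math.log(abs(v)) for v in dft(h)]

# Largest deviation between log|Y| and log|E| + log|H| over all bins
max_diff = max(abs(ly - (le + lh)) for ly, le, lh in zip(log_Y, log_E, log_H))
```

Up to floating-point rounding, `max_diff` is zero: the convolution in time has become a sum in the log-magnitude spectrum.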

Source-filter separation by cepstral analysis
[Block diagram: windowed speech -> FFT -> log magnitude -> FFT -> time separation; outputs: spectral function, and excitation (pitch detection)]
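
The diagram's steps can be sketched as follows, under several assumptions: a naive DFT stands in for the FFT, the frame is a synthetic 64-sample signal built by circular convolution of a decaying impulse train (period 16) with a toy filter, and "time separation" is done by keeping the cepstral coefficients below an illustrative cutoff of 8 samples. All function names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive DFT; the inverse includes the 1/N scaling."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame):
    """Inverse transform of the log magnitude spectrum."""
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]

def lifter_low(cepstrum, n_keep):
    """Zero all but the low-time coefficients (kept symmetrically)."""
    n = len(cepstrum)
    return [c if (q < n_keep or q > n - n_keep) else 0.0
            for q, c in enumerate(cepstrum)]

def smoothed_log_spectrum(frame, n_keep):
    """Log spectral envelope from the low-time part of the cepstrum."""
    return [v.real for v in dft(lifter_low(real_cepstrum(frame), n_keep))]

# Synthetic frame: decaying impulse train (period 16) through a toy filter
N = 64
e = [0.0] * N
for m in range(4):
    e[16 * m] = 0.6 ** m
h = [0.5 ** t for t in range(N)]
frame = [sum(e[m] * h[(t - m) % N] for m in range(N)) for t in range(N)]

cep = real_cepstrum(frame)
pitch_lag = max(range(8, 33), key=lambda q: cep[q])   # high-time peak
envelope = smoothed_log_spectrum(frame, n_keep=8)     # low-time part
```

In this toy case the high-time peak sits at the excitation period, and the smoothed spectrum follows the filter's log magnitude up to a constant offset. Real speech frames would need windowing and a cutoff chosen for the sampling rate and expected pitch range.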

Cepstral features
- Typically truncated (smoothing)
- Corresponds to spectral envelope estimation
- Features also are roughly orthogonal
- Common transformation for many spectral features, e.g.,
  - filter bank energies
  - FFT power
  - LPC coefficients
- Used almost universally for ASR (in some form)

Key Processing Step for ASR: Cepstral Mean Subtraction
- Imagine a fixed filter h(n), so y(n) = h(n)*x(n)
- Same arguments as before, but
  - let x vary over time
  - let h be fixed over time
- Then average cepstra should represent the fixed component (including fixed part of x)
- (Think about it)
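
A minimal sketch of the idea, assuming the cepstra are stored as one list of coefficients per frame. In the cepstral domain a fixed filter adds the same vector to every frame, so subtracting the per-coefficient mean removes it. The function name and the toy numbers are illustrative.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    mean = [sum(f[i] for f in frames) / n_frames for i in range(n_coeffs)]
    return [[f[i] - mean[i] for i in range(n_coeffs)] for f in frames]

# Toy cepstra for three frames, and the same frames seen through a fixed channel
clean = [[1.0, 0.2], [0.5, -0.1], [0.0, 0.5]]
channel = [0.3, -0.4]   # cepstrum of the fixed filter h (made-up values)
filtered = [[c + h for c, h in zip(frame, channel)] for frame in clean]

norm_clean = cepstral_mean_subtraction(clean)
norm_filtered = cepstral_mean_subtraction(filtered)
```

Both normalized sequences agree up to rounding, so the fixed channel has been removed. As the slide notes, any fixed part of x is removed along with it.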

An alternative: Incorporate Production
- Assume simple excitation/vocal tract model
- Assume cascaded resonators for vocal tract frequency response (envelope)
- Find resonator parameters for best spectral approximation


Some LPC Issues
- Error criterion
- Model order

LPC Peak Modeling
- Total error constrained to be (at best) gain factor squared
- Error where model spectrum is larger contributes less
- Model spectrum tends to “hug” peaks
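
These points follow from the frequency-domain form of the LPC error. The sketch below uses notation that is assumed here rather than taken from the slides: P(ω) is the signal power spectrum, A(e^{jω}) the inverse filter, G the gain, and P̂(ω) the all-pole model spectrum.

```latex
\hat{P}(\omega) = \frac{G^2}{\left|A(e^{j\omega})\right|^2},
\qquad
E = \frac{1}{2\pi}\int_{-\pi}^{\pi} P(\omega)\,\left|A(e^{j\omega})\right|^2 d\omega
  = \frac{G^2}{2\pi}\int_{-\pi}^{\pi} \frac{P(\omega)}{\hat{P}(\omega)}\, d\omega
```

With G^2 set to the minimum error, the average of the ratio P/P̂ over frequency equals one. Regions where P exceeds P̂ contribute more to the error than regions where P̂ exceeds P, so the fit is tighter at spectral peaks than in valleys.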

LPC Spectrum

More effects of error criterion
- Globally tracks, but worse match in log spectrum for low values
- “Attempts” to model anti-aliasing filter, mic response
- Ill-conditioned for wide-ranging spectral values

Other LPC properties
- Behavior in noise
- Sharpness of peaks
- Speaker dependence

Model Order
- Too few, can’t represent formants
- Too many, model detail, especially harmonics
- Too many, low error, ill-conditioned matrices

LPC Model Order

Optimal Model Order
- Akaike Information Criterion (AIC)
- Cross-validation (trial and error)

Coefficient Estimation
- Minimize squared error: set derivatives to zero
- Compute in blocks or on-line
- For blocks, use autocorrelation or covariance methods (pertains to windowing, edge effects)
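
The minimization can be written out as a short derivation. The sign convention is an assumption (prediction as a positive weighted sum of past samples), as are the symbols s(n) for the speech samples, φ(i,k) for the covariance terms, and R for the autocorrelation of the windowed frame.

```latex
e(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k),
\qquad
E = \sum_{n} e^2(n)

\frac{\partial E}{\partial a_i} = 0
\;\Rightarrow\;
\sum_{k=1}^{p} a_k\, \phi(i,k) = \phi(i,0),
\quad
\phi(i,k) = \sum_{n} s(n-i)\, s(n-k),
\quad 1 \le i \le p
```

In the autocorrelation method the frame is windowed and the sum runs over all n, so φ(i,k) depends only on |i - k| and becomes R(|i - k|). In the covariance method the sum runs over a fixed range of n without windowing, so φ(i,k) is symmetric but not a function of |i - k| alone.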

Solving the Equations
- Autocorrelation method: Levinson or Durbin recursions, O(P^2) ops; uses Toeplitz property (constant along left-right diagonals), guaranteed stable
- Covariance method: Cholesky decomposition, O(P^3) ops; just uses symmetry property, not guaranteed stable
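
A sketch of the Levinson-Durbin recursion for the autocorrelation method. It assumes the convention that the prediction is a positive weighted sum of past samples, and that `r` holds autocorrelation values r[0..order]; the function names and the example values are illustrative.

```python
def autocorrelation(x, max_lag):
    """Autocorrelation of a (windowed) frame for lags 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_k a[k] r[|i-k|] = r[i] for i = 1..order.

    Returns (predictor coefficients a[1..order], reflection coefficients,
    final prediction error).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    reflection = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)         # error shrinks at each order
        reflection.append(k)
    return a[1:], reflection, err

# Autocorrelation of a first-order process with coefficient 0.5 (made-up example)
coeffs, refl, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

Each order reuses the previous solution, which is where the O(P^2) cost comes from. For a valid autocorrelation sequence every reflection coefficient has magnitude below one, which is the stability guarantee mentioned on the slide.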

LPC-based representations
- Predictor polynomial: a_i, 1 <= i <= p; direct computation
- Root pairs: roots of polynomial, complex pairs
- Reflection coefficients: recursion; interpolated values always stable (also called PARCOR coefficients k_i, 1 <= i <= p)
- Log area ratios: ln((1 - k)/(1 + k)); low spectral sensitivity
- Line spectral frequencies: freq. pts around resonance; low spectral sensitivity, stable
- Cepstra: can be unstable, but useful for recognition
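
Two of these conversions can be sketched briefly. `log_area_ratios` applies the formula from the slide to each reflection coefficient; note that sign conventions for reflection coefficients differ between texts. `lpc_to_cepstrum` uses the standard recursion for an all-pole model 1 / (1 - sum_k a_k z^-k), which is an assumed convention, and it omits the gain term c_0. Both function names are hypothetical.

```python
import math

def log_area_ratios(reflection):
    """ln((1 - k)/(1 + k)) for each reflection coefficient, |k| < 1."""
    return [math.log((1.0 - k) / (1.0 + k)) for k in reflection]

def lpc_to_cepstrum(a, n_cep):
    """Cepstral coefficients c_1..c_n_cep from predictor coefficients a_1..a_p."""
    p = len(a)
    c = [0.0] * (n_cep + 1)              # c[0] (gain term) is not computed here
    for n in range(1, n_cep + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

lars = log_area_ratios([0.0, 0.5])
cepstra = lpc_to_cepstrum([0.5], n_cep=3)   # single pole at z = 0.5
```

For the single-pole example the cepstrum is 0.5^n / n, which gives a quick sanity check on the recursion.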

Autocorrelation Analysis

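Spectral Estimation
[Table comparing Filter Banks, Cepstral Analysis, and LPC (columns) on these properties (rows): Reduced Pitch Effects; Excitation Estimate; Direct Access to Spectra; Less Resolution at HF; Orthogonal Outputs; Peak-hugging Property; Reduced Computation. Ten cells are marked X; the cell positions are not recoverable from the transcript.]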