A 12-WEEK PROJECT IN Speech Coding and Recognition by Fu-Tien Hsiao and Vedrana Andersen.





2 A 12-WEEK PROJECT IN Speech Coding and Recognition by Fu-Tien Hsiao and Vedrana Andersen

3 Overview An Introduction to Speech Signals (Vedrana) Linear Prediction Analysis (Fu) Speech Coding and Synthesis (Fu) Speech Recognition (Vedrana)

4 Speech Coding and Recognition AN INTRODUCTION TO SPEECH SIGNALS

5 AN INTRODUCTION TO SPEECH SIGNALS Speech Production Flow of air from lungs Vibrating vocal cords Speech production cavities Lips Sound wave Vowels (a, e, i), fricatives (f, s, z) and plosives (p, t, k)

6 AN INTRODUCTION TO SPEECH SIGNALS Speech Signals Sampling frequency 8–16 kHz Short-time stationarity assumption (frames of 20–40 ms)
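
The short-time framing above can be sketched in a few lines of Python (a minimal sketch; the 8 kHz rate, 30 ms frame length, and 50% overlap are illustrative choices, not values fixed by the slides):

```python
def frame_signal(signal, frame_len, hop):
    """Split a signal into frames of frame_len samples, advancing by hop."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames

fs = 8000                     # assumed 8 kHz sampling rate
frame_len = fs * 30 // 1000   # 30 ms frame -> 240 samples
hop = frame_len // 2          # 50% overlap
signal = [0.0] * fs           # one second of silence as a stand-in signal
frames = frame_signal(signal, frame_len, hop)
```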

7 AN INTRODUCTION TO SPEECH SIGNALS Model for Speech Production Excitation (periodic, noisy) Vocal tract filter (nasal cavity, oral cavity, pharynx)

8 AN INTRODUCTION TO SPEECH SIGNALS Voiced and Unvoiced Sounds Voiced sounds, periodic excitation, pitch period Unvoiced sounds, noise-like excitation Short-time measures: power and zero-crossing
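
The two short-time measures named above, power and zero-crossing rate, can be sketched as follows (the 100 Hz test tone and the noise amplitude are illustrative assumptions; voiced frames show high power and low zero-crossing rate, noise-like frames the opposite):

```python
import math
import random

def short_time_power(frame):
    """Average power of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# A voiced-like frame (100 Hz tone at 8 kHz) vs. an unvoiced-like noise frame
voiced = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(240)]
rng = random.Random(0)
unvoiced = [rng.uniform(-0.1, 0.1) for _ in range(240)]
```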

9 AN INTRODUCTION TO SPEECH SIGNALS Frequency Domain Pitch, harmonics (excitation) Formants, envelope (vocal tract filter) Harmonic product spectrum
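
A minimal sketch of the harmonic product spectrum idea: downsampled copies of the magnitude spectrum are multiplied together, so only the fundamental bin is reinforced by all of its harmonics (the toy spectrum with peaks at bins 10, 20, 30 is an assumed example):

```python
def harmonic_product_spectrum(mag, n_harmonics=3):
    """Multiply the magnitude spectrum with copies of itself downsampled
    by factors 2..n_harmonics; the fundamental bin is reinforced."""
    limit = len(mag) // n_harmonics
    hps = []
    for k in range(limit):
        p = 1.0
        for h in range(1, n_harmonics + 1):
            p *= mag[k * h]
        hps.append(p)
    return hps

# Toy magnitude spectrum: fundamental at bin 10 with harmonics at 20 and 30
mag = [0.1] * 100
for b in (10, 20, 30):
    mag[b] = 1.0
hps = harmonic_product_spectrum(mag)
pitch_bin = max(range(len(hps)), key=hps.__getitem__)
```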

10 AN INTRODUCTION TO SPEECH SIGNALS Speech Spectrograms Time varying formant structure Narrowband / wideband

11 Speech Coding and Recognition LINEAR PREDICTION ANALYSIS

12 LINEAR PREDICTION ANALYSIS Categories Vocal Tract Filter Linear Prediction Analysis Error Minimization Levinson-Durbin Recursion Residual sequence u(n)

13 LINEAR PREDICTION ANALYSIS Vocal Tract Filter (1) What if we assume an all-pole vocal tract filter? Input: periodic impulse train. Output: speech.

14 LINEAR PREDICTION ANALYSIS Vocal Tract Filter (2) Autoregressive model (all-pole filter), where p is called the model order. Speech is a linear combination of past samples plus an excitation part, A u_g(z).

15 LINEAR PREDICTION ANALYSIS Linear Prediction Analysis (1) Goal: how do we find the coefficients a_k of this all-pole model? All-pole model: input impulse A u_g(n), output speech s(n). Physical model vs. analysis system: the a_k are fixed but unknown, so we try to find estimates α_k of the a_k, leaving an error e(n).

16 LINEAR PREDICTION ANALYSIS Linear Prediction Analysis (2) What is really inside the ? box? A predictor P(z), an FIR filter, with ŝ(n) = α_1 s(n-1) + α_2 s(n-2) + … + α_p s(n-p). The predictive error is e(n) = s(n) - ŝ(n), i.e. the output of A(z) = 1 - P(z) applied to the original s(n). If α_k ≈ a_k, then e(n) ≈ A u_g(n).
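
The predictor and its error can be sketched directly from the formula above (the AR(1) test signal with coefficient 0.9 is an assumed example; a matching one-tap predictor should leave zero error after the first sample):

```python
def lp_predict(s, alpha):
    """Return s_hat(n) = sum_k alpha_k * s(n-k) and the error e(n) = s(n) - s_hat(n)."""
    s_hat, e = [], []
    for n in range(len(s)):
        pred = sum(a * s[n - 1 - k] for k, a in enumerate(alpha) if n - 1 - k >= 0)
        s_hat.append(pred)
        e.append(s[n] - pred)
    return s_hat, e

# A signal that is exactly AR(1): s(n) = 0.9 s(n-1), started from 1.0
s = [1.0]
for _ in range(9):
    s.append(0.9 * s[-1])
s_hat, e = lp_predict(s, [0.9])
```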

17 LINEAR PREDICTION ANALYSIS Linear Prediction Analysis (3) If we can find the predictor generating the smallest error e(n), which is then close to A u_g(n), we can use A(z) to estimate the filter coefficients: 1/A(z) is very similar to the vocal tract model.

18 LINEAR PREDICTION ANALYSIS Error Minimization (1) Problem: how do we find the minimum error? The energy of the error, E = Σ e(n)² with e(n) = s(n) - ŝ(n), is a function of the α_i. Since E is a quadratic function of the α_i, we find its smallest value by setting ∂E/∂α_i = 0 for each i.

19 LINEAR PREDICTION ANALYSIS Error Minimization (2) By differentiation we obtain a set of linear equations. We define φ(i, k) = Σ_n s(n-i) s(n-k), which is in fact an autocorrelation of s(n).

20 LINEAR PREDICTION ANALYSIS Error Minimization (3) Hence, let's write the linear equations in matrix form. The vector of linear prediction coefficients is our goal; how do we solve the system efficiently?

21 LINEAR PREDICTION ANALYSIS Levinson-Durbin Recursion (1) The L-D recursion exploits two properties of the matrix: it is symmetric and Toeplitz. Hence we can solve the system in O(p²) instead of O(p³). Don't forget our objective, which is to find the α_k that simulate the vocal tract filter.
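
A minimal Python sketch of the Levinson-Durbin recursion, solving the symmetric Toeplitz normal equations in O(p²) (the autocorrelation values r = [2.0, 1.0, 0.5] are an assumed toy input):

```python
def levinson_durbin(r, p):
    """Solve the order-p Toeplitz normal equations for the predictor
    coefficients alpha_1..alpha_p in O(p^2); returns (alpha, error energy)."""
    a = [0.0] * (p + 1)              # a[k] holds alpha_k; a[0] is unused
    e = r[0]                         # prediction error energy so far
    for i in range(1, p + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / e
        new_a = a[:]
        new_a[i] = k                 # reflection coefficient becomes alpha_i
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= 1.0 - k * k             # error energy shrinks at every order
    return a[1:], e

# Toy autocorrelation sequence r(0), r(1), r(2)
alpha, err = levinson_durbin([2.0, 1.0, 0.5], 2)
```

The same system can be solved by brute-force matrix inversion in O(p³); as the slides note, both give identical coefficients.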

22 LINEAR PREDICTION ANALYSIS Levinson-Durbin Recursion (2) In the exercise we solve the system both by brute force and by L-D recursion; there is no difference in the resulting parameters. Error energy vs. predictor order.

23 LINEAR PREDICTION ANALYSIS Residual sequence u(n) Knowing the filter coefficients, we can find the residual sequence u(n) by inverse filtering: passing s(n) through A(z) yields u(n). Compare the original s(n) with the residual u(n).

24 Speech Coding and Recognition SPEECH CODING AND SYNTHESIS

25 SPEECH CODING AND SYNTHESIS Categories Analysis-by-Synthesis Perceptual Weighting Filter Linear Predictive Coding Multi-Pulse Linear Prediction Code-Excited Linear Prediction (CELP) CELP Experiment Quantization

26 SPEECH CODING AND SYNTHESIS Analysis-by-Synthesis (1) Analyze the speech by estimating an LP synthesis filter, then compute a residual sequence serving as the excitation signal to reconstruct the speech. Encoder/decoder: parameters such as the LP synthesis filter, gain, and pitch are coded, transmitted, and decoded.

27 SPEECH CODING AND SYNTHESIS Analysis-by-Synthesis (2) Processing is frame by frame, shown without and with error minimization. [Block diagram: the excitation generator drives the LP synthesis filter to produce ŝ(n); the error e(n) = s(n) - ŝ(n) feeds the error minimization; LP analysis of s(n) supplies the LP parameters, and the encoder sends LP and excitation parameters to the channel.]

28 SPEECH CODING AND SYNTHESIS Perceptual Weighting Filter (1) Perceptual masking effect: within the formant regions, the ear is less sensitive to noise. Idea: design a filter that de-emphasizes the error in the formant regions. Result: synthetic speech with more error near the formant peaks but less error elsewhere.

29 SPEECH CODING AND SYNTHESIS Perceptual Weighting Filter (2) In the frequency domain: LP synthesis filter vs. PW filter. Perceptual weighting coefficient α: with α = 1 there is no filtering; as α decreases, the weighting effect grows stronger. The best α depends on perception.

30 SPEECH CODING AND SYNTHESIS Perceptual Weighting Filter (3) In the z-domain, LP filter vs. PW filter. Numerator: generates zeros at the original poles of the LP synthesis filter. Denominator: places the poles closer to the origin; α determines the distance.
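
The denominator trick can be sketched numerically: replacing z by z/α multiplies the k-th coefficient of A(z) by α^k, which moves the roots of A(z) toward the origin by the factor α (the example coefficients and α = 0.8 are assumed, not taken from the slides):

```python
def bandwidth_expand(a, alpha):
    """Coefficients of A(z/alpha): scaling a_k by alpha**k pulls every
    root of A(z) toward the origin by the factor alpha."""
    return [ak * alpha ** k for k, ak in enumerate(a)]

a = [1.0, -1.6, 0.8]                   # assumed A(z) = 1 - 1.6 z^-1 + 0.8 z^-2
a_weighted = bandwidth_expand(a, 0.8)  # denominator polynomial of the PW filter
```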

31 SPEECH CODING AND SYNTHESIS Linear Predictive Coding (1) Built on the methods above: the PW filter and analysis-by-synthesis. If the excitation signal ≈ an impulse train during voicing, we get a reconstructed signal very close to the original. More often, however, the residual is far from an impulse train.

32 SPEECH CODING AND SYNTHESIS Linear Predictive Coding (2) Hence many coding schemes try to improve on this; they differ primarily in the type of excitation signal. Two kinds: Multi-Pulse Linear Prediction and Code-Excited Linear Prediction (CELP).

33 SPEECH CODING AND SYNTHESIS Multi-Pulse Linear Prediction (1) Concept: represent the residual sequence by placing impulses so as to make ŝ(n) closer to s(n). [Block diagram: the multi-pulse excitation u(n) drives the LP synthesis filter; the error between s(n) and ŝ(n) passes through the PW filter into the error minimization; LP analysis supplies the filter.]

34 SPEECH CODING AND SYNTHESIS Multi-Pulse Linear Prediction (2) Step 1: estimate the LPC filter without excitation. Step 2: place one impulse (position and amplitude). Step 3: determine the new error. Step 4: repeat steps 2-3 until the error reaches the desired minimum.
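
The four steps above can be sketched as a greedy loop (the synthesis impulse response and target are assumed toys; real MPLP searches on perceptually weighted signals, which this sketch omits):

```python
def multipulse(target, h, n_pulses):
    """Greedy multi-pulse search: each step places the one impulse
    (position m, amplitude g) that removes the most error energy."""
    r = list(target)                         # error still to be modelled
    pulses = []
    for _ in range(n_pulses):
        best = None
        for m in range(len(target)):
            hi = min(len(target), m + len(h))
            num = sum(r[n] * h[n - m] for n in range(m, hi))
            den = sum(h[n - m] ** 2 for n in range(m, hi))
            g = num / den                    # optimal amplitude at position m
            drop = num * g                   # error energy this pulse removes
            if best is None or drop > best[0]:
                best = (drop, m, g)
        _, m, g = best
        pulses.append((m, g))
        for n in range(m, min(len(target), m + len(h))):
            r[n] -= g * h[n - m]             # subtract the pulse's contribution
    return pulses, r

h = [1.0, 0.5, 0.25]                       # assumed synthesis impulse response
target = [0.0, 2.0, 1.0, 0.5, 0.0, 0.0]    # exactly one impulse of amplitude 2 at n = 1
pulses, resid = multipulse(target, h, 1)
```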

35 SPEECH CODING AND SYNTHESIS Code-Excited Linear Prediction (1) The difference: represent the residual v(n) by codewords (found by exhaustive search) from a codebook of zero-mean Gaussian sequences, and account for the primary pitch pulses, which are predictable over consecutive periods.
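
A minimal sketch of the exhaustive codebook search (the filter and codebook sizes are assumed toys; the pitch predictor and perceptual weighting of real CELP are omitted here):

```python
import random

def make_codebook(size, length, seed=0):
    """Codebook of zero-mean Gaussian random sequences."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]

def synthesize(code, h):
    """Pass a codeword through a toy FIR synthesis filter h."""
    return [sum(h[k] * code[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(code))]

def celp_search(target, codebook, h):
    """Exhaustive search: for every codeword, compute the optimal gain and
    the resulting error energy; keep the codeword with the smallest error."""
    best = None
    for i, code in enumerate(codebook):
        y = synthesize(code, h)
        yy = sum(v * v for v in y)
        g = sum(t * v for t, v in zip(target, y)) / yy
        err = sum((t - g * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[0]:
            best = (err, i, g)
    return best[1], best[2]

h = [1.0, 0.6]                                        # assumed synthesis filter
codebook = make_codebook(16, 8)
target = synthesize([2 * c for c in codebook[5]], h)  # built from codeword 5, gain 2
idx, gain = celp_search(target, codebook, h)
```

Since the search filters every codeword, the cost grows linearly with the codebook size K, which matches the timing observations reported in the experiment slides.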

36 SPEECH CODING AND SYNTHESIS Code-Excited Linear Prediction (2) [Block diagram: the Gaussian excitation codebook output v(n) passes through the pitch synthesis filter (driven by a pitch estimate) to give u(n), then through the LP synthesis filter to give ŝ(n); the error against s(n) is shaped by the PW filter and minimized; LP analysis of s(n) supplies the LP parameters.]

37 SPEECH CODING AND SYNTHESIS CELP Experiment (1) [Plots: the original signal (blue), the excitation signal (below), and the reconstructed signal (green).]

38 SPEECH CODING AND SYNTHESIS CELP Experiment (2) Test the quality for different settings: 1. LPC model order: initial M = 10, test M = 2. 2. PW coefficient.

39 SPEECH CODING AND SYNTHESIS CELP Experiment (3) 3. Codebook (L, K). K is the codebook size and strongly influences the computation time: reducing K from 1024 to 256 cut the time from 13 s to 6 s. Initial setting (40, 1024), test setting (40, 16). L is the length of the random sequences and determines the number of subblocks in the frame.

40 SPEECH CODING AND SYNTHESIS Quantization With quantization: a 16000 bps CELP vs. a 9600 bps CELP. Trade-off: bandwidth efficiency vs. speech quality.

41 Speech Coding and Recognition SPEECH RECOGNITION

42 SPEECH RECOGNITION Dimensions of Difficulty Speaker dependent / independent Vocabulary size (small, medium, large) Discrete words / continuous utterance Quiet / noisy environment

43 SPEECH RECOGNITION Feature Extraction Overlapping frames Feature vector for each frame Mel-cepstrum, difference cepstrum, energy, diff. energy

44 SPEECH RECOGNITION Vector Quantization Vector quantization K-means algorithm Observation sequence for the whole word
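
The vector quantization step can be sketched with a plain k-means on scalar features (real systems cluster multi-dimensional cepstral vectors; the 1-D points and initial centroids here are assumed toys):

```python
def kmeans(points, centroids, iters=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

# Toy scalar "features": two well separated groups
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
codebook = kmeans(points, [0.0, 6.0])
# Observation sequence: each frame's feature mapped to its codebook index
obs = [min(range(len(codebook)), key=lambda i: abs(x - codebook[i])) for x in points]
```

The resulting index sequence `obs` is the discrete observation sequence fed to the HMM in the following slides.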

45 SPEECH RECOGNITION Hidden Markov Model (1) Changing states, emitting symbols. Model parameters: the initial state probability vector π(1), the state transition matrix A, and the observation probability matrix B. [State diagram of the model.]

46 SPEECH RECOGNITION Hidden Markov Model (2) Probability of transition State transition matrix State probability vector State equation

47 SPEECH RECOGNITION Hidden Markov Model (3) Probability of observing Observation probability matrix Observation probability vector Observation equation

48 SPEECH RECOGNITION Hidden Markov Model (4) Discrete observation hidden Markov model Two HMM problems Training problem Recognition problem

49 SPEECH RECOGNITION Recognition using HMM (1) Determining the probability that a given HMM produced the observation sequence. Straightforward computation sums over all possible state paths, S^T of them for S states and T time steps. [Trellis figure: states vs. time.]

50 SPEECH RECOGNITION Recognition using HMM (2) Forward-backward algorithm; only the forward part is needed here. Forward partial observation sequence; forward probability for each state i.

51 SPEECH RECOGNITION Recognition using HMM (3) Initialization, recursion, termination.
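
The initialization, recursion, and termination steps can be sketched directly (the two-state left-to-right model and its probabilities are assumed toy values):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: P(observations | model) summed over all state
    paths in O(N^2 T) time rather than enumerating all N^T paths."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Recursion: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O | model) = sum_i alpha_T(i)
    return sum(alpha)

pi = [1.0, 0.0]                 # always start in state 0
A = [[0.7, 0.3], [0.0, 1.0]]    # toy left-to-right transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]    # observation probabilities for symbols 0 and 1
p = forward(pi, A, B, [0, 1])   # probability of observing symbol 0 then 1
```

For recognition, this probability is evaluated under each word's HMM and the word with the highest probability wins; in practice the alphas are scaled at each step to avoid underflow.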

52 SPEECH RECOGNITION Training HMM No analytical solution is known. Forward-backward (Baum-Welch) reestimation, a hill-climbing algorithm, reestimates the HMM parameters so that the probability of the observation sequence increases. Method: use the forward and backward probabilities to calculate the state transition probabilities and observation probabilities, then reestimate the model to improve the probability. Scaling is needed to avoid numerical underflow.

53 SPEECH RECOGNITION Experiments Matrices A and B Observation sequences for words ‘one’ and ‘two’

54 Thank you!

