Speech Signal Representations I Seminar Speech Recognition 2002 F.R. Verhage.


1 Speech Signal Representations I Seminar Speech Recognition 2002 F.R. Verhage

2 Decomposition of the speech signal x[n] as a source e[n] passed through a linear time-varying filter h[n].
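The source-filter decomposition above can be illustrated with a small sketch. It assumes, for simplicity, that the filter is fixed over the short stretch of signal shown (the slide's filter is time-varying); the impulse-train source, the decaying impulse response, and the helper name `convolve` are all hypothetical choices made for illustration.

```python
def convolve(e, h):
    """Full linear convolution x[n] = sum_k e[k] * h[n - k]."""
    x = [0.0] * (len(e) + len(h) - 1)
    for i, e_i in enumerate(e):
        for j, h_j in enumerate(h):
            x[i + j] += e_i * h_j
    return x

# Hypothetical voiced source: a unit impulse every 8 samples (the pitch period).
e = [1.0 if n % 8 == 0 else 0.0 for n in range(24)]
# Hypothetical filter: a short, decaying impulse response (held fixed here).
h = [0.5 ** n for n in range(4)]

# The observed signal is the source passed through the filter.
x = convolve(e, h)
```

Each impulse in `e` triggers one copy of `h` in `x`, which is the picture the decomposition relies on.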

3 Estimation of the filter, inspired by:
• Speech production models
  – Linear Predictive Coding (LPC)
  – Cepstral analysis
• Speech perception models (part II)
  – Mel-frequency cepstrum
  – Perceptual Linear Prediction (PLP)
Speech recognizers estimate the filter characteristics and ignore the source.

4 Short-Time Fourier Analysis
• Spectrogram
  – Representation of a signal, based on short-time Fourier analysis, that highlights several of its properties
  – Two-dimensional: time on the horizontal axis, frequency on the vertical axis
  – Third "dimension": gray or color level indicating energy
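A minimal sketch of how the values behind a spectrogram could be computed, assuming a Hamming window, a direct (slow) DFT, and log energy in dB. The function names, the 8 kHz sampling rate, and the test tone are illustrative assumptions, not taken from the slides.

```python
import cmath
import math

def dft_power(frame):
    """Power spectrum of one frame for bins 0..N/2 (direct DFT, O(N^2))."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

def spectrogram(x, frame_len, hop):
    """List of columns (one per frame); each column holds log energy per bin."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    columns = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [x[start + n] * window[n] for n in range(frame_len)]
        # Small constant avoids log(0) for empty bins.
        columns.append([10 * math.log10(p + 1e-12) for p in dft_power(frame)])
    return columns

fs = 8000                                   # assumed sampling rate in Hz
x = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
spec = spectrogram(x, frame_len=160, hop=80)  # 20 ms frames, 10 ms hop
```

Time runs along the outer list and frequency along each column; the stored dB value plays the role of the gray or color level.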

5 Short-Time Fourier Analysis
• Spectrogram
  – Narrow band
    • Long windows (> 20 ms) → narrow bandwidth
    • Lower time resolution, better frequency resolution
  – Wide band
    • Short windows (< 10 ms) → wide bandwidth
    • Good time resolution, lower frequency resolution
  – Pitch synchronous
    • Requires knowledge of the local pitch period

6 Short-Time Fourier Analysis
• Spectrogram

7 Short-Time Fourier Analysis
• Window analysis
  – Series of short segments (analysis frames)
  – Short enough that the signal can be treated as stationary within a frame
  – Usually of constant length, 20-30 ms
  – Frames may overlap
  – Different types of window function w_m[n]:
    • Rectangular (equivalent to applying no window function)
    • Hamming
    • Hanning
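The three window types and the framing step might be sketched as follows. The symmetric window definitions are the common textbook forms; the 16 kHz sampling rate, 25 ms frame length and 10 ms hop are assumed values chosen for illustration.

```python
import math

def rectangular(length):
    return [1.0] * length

def hamming(length):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def hanning(length):
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frames(x, frame_len, hop, window):
    """Cut x into overlapping analysis frames and apply the window to each."""
    w = window(frame_len)
    return [[x[start + n] * w[n] for n in range(frame_len)]
            for start in range(0, len(x) - frame_len + 1, hop)]

fs = 16000                     # assumed sampling rate in Hz
frame_len = fs * 25 // 1000    # 25 ms -> 400 samples
hop = fs * 10 // 1000          # 10 ms -> 160 samples, so frames overlap
x = [math.sin(0.05 * n) for n in range(1600)]
framed = frames(x, frame_len, hop, hamming)
```

Passing `rectangular` instead of `hamming` leaves every frame unchanged, which is why the rectangular window is equivalent to no window function.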

8 Short-Time Fourier Analysis
• Window analysis
  – The window must be long enough, with N the window length and M the pitch period (both in samples):
    • Rectangular: N ≥ M
    • Hamming, Hanning: N ≥ 2M
  – The pitch period is not known in advance → prepare for the lowest pitch, i.e. the longest pitch period (50 Hz) → at least 20 ms for rectangular or 40 ms for Hamming/Hanning windows
  – But longer windows give a more averaged spectrum instead of distinct spectra → the rectangular window has better time resolution
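The 20 ms and 40 ms figures follow directly from the pitch period of a 50 Hz voice. A small sketch of the arithmetic (the function name and its string argument are hypothetical):

```python
def min_window_ms(lowest_pitch_hz, window="rectangular"):
    """Minimum window length in ms: one pitch period for a rectangular
    window, two pitch periods for Hamming/Hanning windows."""
    period_ms = 1000.0 / lowest_pitch_hz
    factor = 1 if window == "rectangular" else 2
    return factor * period_ms

rect_ms = min_window_ms(50)                  # one 20 ms period
hamming_ms = min_window_ms(50, "hamming")    # two 20 ms periods
```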

9 Short-Time Fourier Analysis


16 Short-Time Fourier Analysis
• Window analysis
  – The frequency response of a window is not completely zero outside its main lobe → spectral leakage
  – The second lobe of a Hamming window is approx. 43 dB below the main lobe → less spectral leakage
  – Hamming, Hanning and triangular windows offer less spectral leakage than the rectangular window → rectangular windows are rarely used despite their better time resolution
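The leakage difference can be checked numerically by measuring the highest side lobe of each window relative to its main-lobe peak. This is a sketch under stated assumptions: a 64-sample window, a zero-padded 1024-point direct DFT, and a simple "walk down to the first null" search; the function name is hypothetical.

```python
import cmath
import math

def peak_sidelobe_db(window, nfft=1024):
    """Highest side lobe relative to the main-lobe peak, in dB."""
    mags = []
    for k in range(nfft // 2 + 1):
        s = sum(w * cmath.exp(-2j * math.pi * k * n / nfft)
                for n, w in enumerate(window))
        mags.append(abs(s))
    # Walk down the main lobe until the magnitude stops decreasing,
    # i.e. just past the first null.
    k = 1
    while k < len(mags) - 1 and mags[k] < mags[k - 1]:
        k += 1
    # Everything from here on belongs to the side lobes.
    return 20 * math.log10(max(mags[k:]) / mags[0])

length = 64
rect = [1.0] * length
hamm = [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)]

rect_db = peak_sidelobe_db(rect)   # roughly -13 dB for a rectangular window
hamm_db = peak_sidelobe_db(hamm)   # far lower; compare with the slide's figure
```

The exact Hamming value depends on the window length and on how the window is defined, so it should be read as an order-of-magnitude check on the figure quoted above.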

17 Short-Time Fourier Analysis


21 Short-Time Fourier Analysis
Short-time spectrum of male voice speech
a) Time signal /ah/, local pitch 110 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

22 Short-Time Fourier Analysis
Short-time spectrum of female voice speech
a) Time signal /aa/, local pitch 200 Hz
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

23 Short-Time Fourier Analysis
Short-time spectrum of unvoiced speech
a) Time signal
b) 30 ms rectangular window
c) 15 ms rectangular window
d) 30 ms Hamming window
e) 15 ms Hamming window

24 Linear Predictive Coding
• LPC, a.k.a. auto-regressive (AR) modeling
• An all-pole filter is a good approximation of speech, with p the order of the LPC analysis
• Predicts the current sample as a linear combination of the past p samples
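The slide's equations did not survive in the transcript. A standard way of writing the all-pole model and the predictor is given here as a reconstruction, not necessarily in the author's notation; the unit gain and the symbols A(z) and \tilde{x}[n] are assumptions.

```latex
\[
H(z) = \frac{X(z)}{E(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}
\quad\Longleftrightarrow\quad
x[n] = \sum_{k=1}^{p} a_k\, x[n-k] + e[n]
\]
\[
\tilde{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k],
\qquad
e[n] = x[n] - \tilde{x}[n]
\]
```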

25 Linear Predictive Coding
• To estimate the predictor coefficients a_k, use a short-term analysis technique
• Per segment, minimize the total squared prediction error
• Take the derivative with respect to each a_k and equate it to 0; this gives a set of p linear equations, the Yule-Walker equations
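Written out, the minimization described above takes the following form. This is a sketch in commonly used notation; the correlation symbol \phi[i,k] is an assumption, and the summation range over n depends on the analysis method chosen.

```latex
\[
E = \sum_{n} e^{2}[n] = \sum_{n} \Bigl( x[n] - \sum_{k=1}^{p} a_k\, x[n-k] \Bigr)^{2}
\]
\[
\frac{\partial E}{\partial a_i} = 0
\;\Longrightarrow\;
\sum_{k=1}^{p} a_k\, \phi[i,k] = \phi[i,0], \quad i = 1, \dots, p,
\qquad
\phi[i,k] = \sum_{n} x[n-i]\, x[n-k]
\]
```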

26 Linear Predictive Coding
• Solution of the Yule-Walker equations:
  – Any standard matrix inversion package
  – Due to the special form of the matrix, there are efficient solutions:
    • Covariance method, using the Cholesky decomposition
    • Autocorrelation method, using windows; results in equations with Toeplitz matrices, solved by the Durbin recursion algorithm
    • Lattice method, equivalent to the Levinson-Durbin recursion; often used in fixed-point implementations because lack of precision does not result in unstable filters
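A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion, assuming the predictor convention x[n] ≈ a_1 x[n-1] + ... + a_p x[n-p]. The function names and the second-order test filter are hypothetical; the test signal is the impulse response of a known all-pole filter, so the recursion should recover its coefficients.

```python
def autocorrelation(x, max_lag):
    """R[k] = sum_n x[n] * x[n - k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, p):
    """Solve the Toeplitz normal equations for orders 1..p.

    Returns (predictor coefficients a_1..a_p, reflection coefficients,
    final prediction error energy)."""
    a = []                # coefficients of the current order (a[0] is a_1)
    err = r[0]
    reflection = []
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        k = (r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))) / err
        # Update lower-order coefficients, then append a_i = k.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        err *= 1.0 - k * k
        reflection.append(k)
    return a, reflection, err

# Hypothetical test signal: impulse response of 1 / (1 - 1.3 z^-1 + 0.4 z^-2).
true_a = [1.3, -0.4]
x = [1.0, 1.3]
for n in range(2, 200):
    x.append(true_a[0] * x[n - 1] + true_a[1] * x[n - 2])

coeffs, refl, err = levinson_durbin(autocorrelation(x, 2), 2)
```

The reflection coefficients produced along the way are the quantities the lattice method works with; each stays below 1 in magnitude here, which corresponds to a stable filter.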

27 Linear Predictive Coding


29 Linear Predictive Coding
• Spectral analysis via LPC
  – All-pole (IIR) filter
  – Spectral peaks at the frequencies of the roots of the denominator (the poles)
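A sketch of how the LPC spectrum could be evaluated from the predictor coefficients, assuming the model H(z) = G / (1 - Σ a_k z^-k). The function name, the unit gain, and the two-pole example (pole radius 0.95 at angle π/4) are illustrative assumptions.

```python
import cmath
import math

def lpc_spectrum_db(a, gain=1.0, n_points=512):
    """Magnitude of G / (1 - sum_k a_k e^{-jwk}) in dB for w in [0, pi]."""
    out = []
    for m in range(n_points):
        w = math.pi * m / (n_points - 1)
        denom = 1 - sum(a_k * cmath.exp(-1j * w * (k + 1))
                        for k, a_k in enumerate(a))
        out.append(20 * math.log10(gain / abs(denom)))
    return out

# Hypothetical resonance: a conjugate pole pair at radius r and angle theta.
r, theta = 0.95, math.pi / 4
a = [2 * r * math.cos(theta), -r * r]

spec = lpc_spectrum_db(a)
peak_w = math.pi * spec.index(max(spec)) / (len(spec) - 1)
```

For poles close to the unit circle the peak frequency `peak_w` lies very near the pole angle `theta`.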

30 Linear Predictive Coding
• Prediction error
  – Should be (approximately) the excitation
  – Unvoiced speech: expect white noise; OK
  – Voiced speech: expect an impulse train; not OK
    • All-pole assumption not altogether valid
    • Real speech not perfectly periodic
    • Pitch-synchronous analysis gives better results
  – LPC order
    • Larger p gives lower prediction errors
    • Too large a p results in fitting the individual harmonics → the separation between filter and source will not be as good

31 Linear Predictive Coding
• Prediction error
  – The inverse LPC filter gives the residual signal
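Inverse filtering amounts to subtracting the prediction from each sample, e[n] = x[n] - Σ a_k x[n-k]. A minimal sketch, assuming samples before the start of the signal are zero; the function name and the synthetic test signal (the impulse response of a known second-order all-pole filter) are hypothetical.

```python
def lpc_residual(x, a):
    """Residual e[n] = x[n] - sum_k a_k * x[n - k], with x[n] = 0 for n < 0."""
    p = len(a)
    e = []
    for n in range(len(x)):
        pred = sum(a[k] * x[n - 1 - k] for k in range(p) if n - 1 - k >= 0)
        e.append(x[n] - pred)
    return e

# Hypothetical signal: a single impulse driving 1 / (1 - 1.3 z^-1 + 0.4 z^-2).
a = [1.3, -0.4]
x = [1.0, 1.3]
for n in range(2, 50):
    x.append(a[0] * x[n - 1] + a[1] * x[n - 2])

e = lpc_residual(x, a)   # the inverse filter should give back the impulse
```

With the exact coefficients the residual is the original excitation; on real speech it is only an approximation of it.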

32 Linear Predictive Coding
• Alternatives for the predictor coefficients
  – Line Spectral Frequencies
    • Local sensitivity
    • Efficiency
  – Reflection coefficients
    • Guaranteed stable → useful when coefficients are interpolated over time
  – Log-area ratios
    • Flat spectral sensitivity
  – Roots of the polynomial
    • Represent resonance frequencies and bandwidths
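As an illustration of the last item, a pole at radius r and angle θ is commonly mapped to a resonance frequency F = θ·fs/(2π) and a bandwidth B = -ln(r)·fs/π. The sketch below handles only a single second-order section, A(z) = 1 - a1 z^-1 - a2 z^-2, so that the roots can be found in closed form; the function name, the 8 kHz sampling rate and the example pole are assumptions.

```python
import cmath
import math

def resonance_from_second_order(a1, a2, fs):
    """Frequency and bandwidth (Hz) of the pole pair of 1 - a1 z^-1 - a2 z^-2."""
    # Poles are the roots of z^2 - a1*z - a2 = 0.
    disc = cmath.sqrt(a1 * a1 + 4 * a2)
    root = (a1 + disc) / 2
    radius, angle = abs(root), abs(cmath.phase(root))
    freq_hz = angle * fs / (2 * math.pi)
    bandwidth_hz = -math.log(radius) * fs / math.pi
    return freq_hz, bandwidth_hz

fs = 8000                                   # assumed sampling rate in Hz
r, theta = 0.95, 2 * math.pi * 500 / fs     # hypothetical pole: 500 Hz resonance
a1, a2 = 2 * r * math.cos(theta), -r * r

freq_hz, bandwidth_hz = resonance_from_second_order(a1, a2, fs)
```

Higher-order polynomials need a general root finder; the mapping from each complex root to frequency and bandwidth stays the same.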

33 Cepstral Processing
  – A homomorphic transformation converts a convolution into a sum.
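The slide's equation is missing from the transcript. In commonly used notation, offered as a reconstruction rather than the author's exact formula (the operator symbol D and the hat notation are assumptions), the property and the cepstrum that realizes it read:

```latex
\[
x[n] = e[n] * h[n]
\;\Longrightarrow\;
\hat{x}[n] = D\bigl(x[n]\bigr) = \hat{e}[n] + \hat{h}[n]
\]
\[
\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \ln X(e^{j\omega})\, e^{j\omega n}\, d\omega,
\qquad
\ln X(e^{j\omega}) = \ln E(e^{j\omega}) + \ln H(e^{j\omega})
\]
```

The logarithm turns the product of the source and filter spectra into a sum, which the inverse transform carries back to the time-like domain.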

34 Cepstral Processing


