
EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University.




1 EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

2 EG-348_371_09 2 Features in speech: Acquisition (frame: 20/30 ms & sampling frequency: 8 kHz), then Feature extraction, giving feature vectors X1 ... Xi ... over time

3 EG-348_371_09 3 Features in speech: Acquisition (frame: 20/30 ms & sampling frequency: 8 kHz), then Feature extraction, giving feature vectors X1 ... Xi ...

4 EG-348_371_09 4 Speech production: Air from the lungs → Vocal fold → Vocal tract → Speech

5 EG-348_371_09 5 LPC Short and Long. Spectral envelope reflects morphological characteristics of the vocal tract. [Figure: noise, H1(z), H2(z), synthesised speech, set against: Air from the lungs → Vocal fold → Vocal tract → Speech]

6 EG-348_371_09 6 Features: building of statistical model [Figure: feature vectors labelled T1 and T2, grouped into T1 and T2]

7 EG-348_371_09 7 VT Shape & Some Vowels - Ladefoged ‘62

8 EG-348_371_09 8 Speech Processing - Applications
- Why?
  - Communications
  - Synthesis
  - Recognition
  - Speech & Speaker
- How?
  - Frame-based
  - Systems approach

9 EG-348_371_09 9 Some Books
- Flanagan - ‘Speech Analysis, Synthesis and Perception’, Springer-Verlag - a classic!
- Furui - several books on recognition
- Parsons - ‘Voice and Speech Processing’, McGraw Hill - one of the first text books on computer speech processing
- O’Shaughnessy - ‘Speech Comms - human and machine’, Addison-Wesley
- Rabiner & Juang - ‘Fundamentals of Speech Recognition’, Prentice Hall, 1993
- Ramachandran & Mammone (eds) - ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995

10 EG-348_371_09 10 Speech Communications
- Person-to-Person
- Person-to-Machine: speech/speaker recognition
- Machine-to-Person: speech synthesis

11 EG-348_371_09 11 ( Electronic ) Speech Communications perhaps separated by long distance (or in time)

12 EG-348_371_09 12 Telephony & Broadcasting [Figure: Acoustic Air Path; Electronic Link; Transmission Path]

13 EG-348_371_09 13 Speech Comms: Telephony. Electronic Link: Microphone → ADC → Analysis → Coding → Transmitter → Channel (Transmission Path) → Receiver → Decoding → (re-)Synthesis → DAC → Loudspeaker

14 EG-348_371_09 14 Speech Bit Rates [Figure: Message Creation, Language Coding, Human Acoustic generation, Transmission (Acoustic Space), Human Hearing Extraction, Language decoding, Message Realisation] Approx. bit rate in bps: tens, hundreds, thousands, tens of thousands

15 EG-348_371_09 15 Criteria in Speech Comms. Quality versus Bit-rate [Figure: Quality (Excellent, Good, Fair, Poor) against bit rate (4, 8, 16, 32, 64 kbps) for GSM, ADPCM, CELP] 4 Quality Measures: intelligibility, loudness, naturalness, ease-of-listening

16 EG-348_371_09 16 Low Bit Rate Speech Coding Compandent http://www.compandent.com/

17 EG-348_371_09 17 Speech Processing. The three main application areas are:
- Speech Comms. (the ‘electronic link’)
- Automatic Speech/Speaker recognition
- Speech Synthesis
Much of the underlying analysis is common, eg linear predictive coding

18 EG-348_371_09 18 What does speech look like?

19 EG-348_371_09 19 What does speech look like? Dynamic Range - for flexibility and robustness Time-varying - to convey information

20 EG-348_371_09 20 Frame-based Analysis. To capture time variations: 20-30 ms frames - ‘centi-second’ labelling. Spectral analysis: FFT, Filter-bank, Linear Predictive Coding
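As a minimal sketch of the frame-based idea on this slide, the following splits a signal into fixed-length frames. The function name `split_into_frames`, the 10 ms hop (100 frames/second) and the synthetic input are illustrative assumptions, not taken from the slides; only the 20 ms frame and 8 kHz sampling rate come from the deck.

```python
import numpy as np

def split_into_frames(signal, fs=8000, frame_ms=20.0, hop_ms=10.0):
    """Cut a 1-D signal into overlapping frames (one frame per row)."""
    frame_len = int(round(fs * frame_ms / 1000.0))   # 160 samples at 8 kHz
    hop_len = int(round(fs * hop_ms / 1000.0))       # assumed hop: 80 samples
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

# One second of a stand-in signal (a ramp, so frame starts are easy to read off)
one_second = np.arange(8000, dtype=float)
frames = split_into_frames(one_second)
print(frames.shape)
```

Each row can then be passed to whichever spectral analysis is in use (FFT, filter-bank or LPC).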

21 EG-348_371_09 21 Speech Analysis/Coding
- Two general cases:
  - Waveform coders
  - Source (voice) coders (vo-coders)
- Source coders, eg linear predictive coding (LPC):
  - Model the source, ie the vocal tract (VT)
  - Linear, time varying model of VT, plus excitation
[Block diagram: Excitation e_n (voiced or unvoiced) → H(z) → speech s_n]

22 EG-348_371_09 22 Systems Approach [Block diagram: Excitation (Voiced, with f0, or Unvoiced) → Vocal Tract Model with Time Varying Parameters → Speech]

23 EG-348_371_09 23 LPC Analysis/Synthesis
- Synthesis: input is the Excitation, output is Speech: e_n → H(z) → s_n, with impulse response h_n (E(z) in, S(z) out)
- Analysis: input is Speech, output is the Excitation: s_n → 1/H(z) → e_n (S(z) in, E(z) out)

24 EG-348_371_09 24 ‘Perfect’ Analysis/Synthesis [Block diagram: s_n → 1/H(z) → e_n → H(z) → s_n] Input s_n and output s_n are identical (within arithmetic limits)

25 EG-348_371_09 25 Practical Analysis/Synthesis

26 EG-348_371_09 26 Practical Analysis/Synthesis [Block diagram: Sending: s_n → 1/H(z) → e_n; Transmission; Receiving: e_n → H(z) → s_n] Parameters for Transmission: the input / excitation e_n and the source model H(z). Thus Analysis must derive these parameters, and Synthesis must use them to re-generate speech

27 EG-348_371_09 27 Linear Predictive Coding - LPC. Principle of linear prediction:
- The next value (or sample) in a series, ie at time n, is predicted or estimated by a weighted sum of previous values, ie those at time n-1, n-2, ...
- Thus for a predictor of order p, we have: $\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$
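A small illustration of the weighted-sum predictor, under stated assumptions: the helper `predict_next` and the test signal are hypothetical. It uses the fact that a pure sinusoid of angular frequency w obeys s_n = 2cos(w) s_{n-1} - s_{n-2}, so an order-2 predictor with a_1 = 2cos(w), a_2 = -1 predicts it exactly.

```python
import numpy as np

def predict_next(history, a):
    """Predict s_n as a_1*s_{n-1} + ... + a_p*s_{n-p}.

    history: past samples, most recent last; a: [a_1, ..., a_p].
    """
    p = len(a)
    recent = np.asarray(history, dtype=float)[-1:-p - 1:-1]  # s_{n-1}, ..., s_{n-p}
    return float(np.dot(a, recent))

w = 0.3                                  # assumed angular frequency (rad/sample)
s = np.sin(w * np.arange(10))
a = [2 * np.cos(w), -1.0]                # exact order-2 predictor for a sinusoid
s_hat = predict_next(s[:9], a)           # prediction of s[9] from s[0..8]
print(s_hat, s[9])
```

Real speech is not perfectly predictable, which is why the error term matters in what follows.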

28 EG-348_371_09 28 Linear Prediction. Transforming to the z-domain gives: $\hat{S}(z) = A'(z)\,S(z)$, where $A'(z) = \sum_{i=1}^{p} a_i z^{-i}$

29 EG-348_371_09 29 LPC Error Terms. Error is simply the difference between predicted and actual values: $e_n = s_n - \hat{s}_n$ [Block diagram: s_n → A’(z) → $\hat{s}_n$, which is subtracted from s_n to give e_n]

30 EG-348_371_09 30 Synthesis: e_n → H(z) → s_n, with parameters updated at frame rate [Block diagram: s_n is fed back through A’(z) and added to e_n to give s_n] NB ‘hat’ of approximation omitted for simplicity

31 EG-348_371_09 31 Analysis for Synthesis
- The Analysis and Synthesis must match
- What is needed for the Synthesis?
- Answer: e_n, the excitation, and H(z), the system
- Thus the Analysis must derive these terms (from s_n):
- The speech signal, s_n, is analysed to give e_n and H(z), ie the A’(z) parameters, for transmission.
[Block diagrams: Synthesis: e_n → H(z) → s_n. Analysis: s_n → 1/H(z) → e_n, implemented by subtracting the output of A’(z) from s_n]

32 EG-348_371_09 32 Derivation of LPC Coefficients - A(z). Recall: $\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$, where the a_i are the p prediction coefficients. The principle behind LPC is to find a set of p coefficients, a_1, a_2, a_3, ... a_p, which in some sense minimises the error signal e_n over a frame of N speech samples. This leads to a set of p coefficients for each frame.

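33 EG-348_371_09 33 Derivation of A(z) – (2). Minimisation of E_n is achieved by setting the p partial derivatives to zero: $\partial E_n / \partial a_i = 0$ for i = 1, 2, … p, where: $E_n = \sum_{m} e_m^2$ (summed over the frame). From which: $\sum_{k=1}^{p} a_k\, r_{|i-k|} = r_i$. In matrix form: $[R]\,\mathbf{a} = \mathbf{r}$ or $\mathbf{a} = [R]^{-1}\mathbf{r}$. The matrix [R] is symmetric Toeplitz, offering numerically efficient inversion techniques - Durbin’s recursion algorithm being one of the most popular.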
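A sketch of the autocorrelation method with Durbin's recursion, assuming the predictor convention of the earlier slides (prediction = sum of a_i s_{n-i}). The function names, the synthetic Hamming-windowed frame and the order p = 4 are illustrative choices only; the direct matrix solve is included purely as a cross-check.

```python
import numpy as np

def autocorr(frame, p):
    """Autocorrelation lags r_0 ... r_p of one frame."""
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the Toeplitz system [R] a = r for a_1..a_p; also return the final error."""
    a = np.zeros(p + 1)                 # a[1..p] hold the coefficients, a[0] unused
    err = r[0]
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err

# Synthetic stand-in for a 20 ms speech frame at 8 kHz (assumption)
rng = np.random.default_rng(0)
n = np.arange(160)
frame = np.hamming(160) * (np.sin(0.3 * n) + 0.1 * rng.standard_normal(160))

p = 4
r = autocorr(frame, p)
a_ld, final_err = levinson_durbin(r, p)

# Cross-check against a direct solve of the same normal equations
R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
a_direct = np.linalg.solve(R, r[1:p + 1])
print(a_ld, a_direct)
```

The recursion needs on the order of p² operations per frame rather than the p³ of a general solve, which is the efficiency the slide refers to.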

34 EG-348_371_09 34 Derivation of A(z) – (3)
- When N is very large, r gives the autocorrelation coefficients of s
- s comes from e convolved with h (excitation & vocal tract)
- we are interested here in separating e and h
- the predictor order, p, is small to reflect the short-term periodicities (formants)
- with higher predictor orders we will get the longer-term periodicities (pitch)
- 2 practical problems with evaluating a:
  - matrix singularities in R^-1
  - unstable resultant H(z)
- in practice both are solved by windowing, ie shaping the frame, eg with a Hamming window

35 EG-348_371_09 35 Speech Signal Characteristics
- Duration
- Dynamic Range
- Periodicities:
  - vocal tract
  - pitch
- Frame-based Analysis
  - frame size: quasi-stationary, yet able to capture transitions; typically 20 - 30 ms
  - frame rate: task dependent; more frames means more bandwidth/computation; up to 100 frames/second

36 EG-348_371_09 36 Harmonic Structures and Periodicities
- Harmonic Structures & Periodicities give potential for data reduction
- LPC is one way of gaining this compression
- Speech has two obvious separate structures
  - vocal tract resonances
  - pitch

37 EG-348_371_09 37 Harmonic Structures and Periodicities: Short Term [Block diagram: excitation e_n (voiced, period T_p, or unvoiced) → Vocal tract H(z), short term prediction of order p → speech s_n]

38 EG-348_371_09 38 Harmonic Structures and Periodicities [Block diagram: excitation e_n (voiced or unvoiced) → H_lt(z), long term prediction of order P (Pitch, period T_p) → ep_n → H_st(z) (Vocal tract) → speech s_n]

39 EG-348_371_09 39 Harmonic Structures and Periodicities. Two Structures: short-term (formants) & long-term - pitch (excitation) [Block diagram: e_n → H_lt(z) → ep_n → H_st(z) → s_n] eg 20 ms frame = 160 samples @ 8 kHz; a_i, eg p = 3; a_i, eg p = 10; Gain k. NB Representations of these parameters are transmitted

40 EG-348_371_09 40 Practical Coding Systems: Source Coders (Vocoders)
- Waveform & Source Coders (Vocoders)
- 2 periodicities/redundancies in source
  - short-term (formants)
  - long-term - pitch
- Excitation e_n
[Block diagram: e_n → H_lt(z) → ep_n → H_st(z) → s_n]

41 EG-348_371_09 41 ‘Perfect’ Analysis/Synthesis (1) [Block diagram: s_n → 1/H(z) → e_n → H(z) → s_n] Input s_n and output s_n are identical (within arithmetic limits)

42 EG-348_371_09 42 ‘Perfect’ Analysis/Synthesis (2) [Block diagrams: s_n → 1/H(z) → e_n → H(z) → s_n, redrawn with the analysis filter 1/H(z) = 1 – A’(z) and the synthesis filter H(z) = 1/(1 – A’(z))]

43 EG-348_371_09 43 ‘Perfect’ Analysis/Synthesis (3) [Block diagram: analysis filter 1 – A’(z): Original Speech s_n feeds a chain of z^-1 delays giving s_{n-1} ... s_{n-i} ... s_{n-p}; these are weighted by a_1 ... a_i ... a_p and summed, and the sum is subtracted from s_n to give the Residual e_n] Note the minus sign: in Matlab it is combined with a_i. What determines p?

44 EG-348_371_09 44 ‘Perfect’ Analysis/Synthesis (4) [Block diagram: analysis as before, Original Speech s_n → 1 – A’(z) → Residual e_n; then synthesis 1/(1 – A’(z)): past outputs s_{n-1} ... s_{n-i} ... s_{n-p}, weighted by a_1 ... a_i ... a_p, are summed and added to e_n to give the Re-Synth. speech s_n] Note: no minus sign in the synthesis sum.
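The analysis/synthesis pair in the diagram can be sketched directly from the difference equations e_n = s_n - sum(a_i s_{n-i}) and s_n = e_n + sum(a_i s_{n-i}). The function names and the random test signal are assumptions for illustration; the coefficient values are the second-order pair quoted on a later slide (1.5537, -0.8276), used here only as a convenient stable example.

```python
import numpy as np

def analyse(s, a):
    """Inverse filter 1 - A'(z): speech in, residual out (note the minus sign)."""
    e = np.zeros(len(s))
    for n in range(len(s)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        e[n] = s[n] - pred
    return e

def synthesise(e, a):
    """Synthesis filter 1/(1 - A'(z)): residual in, speech out (no minus sign)."""
    s = np.zeros(len(e))
    for n in range(len(e)):
        pred = sum(a[i] * s[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        s[n] = e[n] + pred
    return s

a = [1.5537, -0.8276]                    # a_1, a_2 (predictor convention)
rng = np.random.default_rng(1)
speech = rng.standard_normal(200)        # stand-in for a speech segment
residual = analyse(speech, a)
resynth = synthesise(residual, a)
print(np.max(np.abs(resynth - speech)))  # tiny: identical within arithmetic limits
```

In a practical coder the residual is not sent sample-for-sample, so the output is only "similar" to the input, as the next slide notes.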

45 EG-348_371_09 45 Practical System [Block diagram: s_n → 1/H(z) → e_n → Transmitted Data Frame → H(z) → s_n] Input s_n and output s_n are “similar”. What does the Transmitted Data Frame contain?

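46 EG-348_371_09 46 Analysis-by-Synthesis: LPAS. Integrated encoder & decoder at the encoder [Block diagram: LPAS Encoder: an Adaptive encoder drives a Basic decoder; the decoder output is subtracted from s_n and the Weighted error is fed back to the Adaptive encoder]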

47 EG-348_371_09 47 Log Spectral Estimates
- Comparisons between frames are very important in many situations
- Log spectral estimates are the most common (though in Comms an approximation is used to reduce computation)
In Comms, computation is expensive and parameter vector approximations to D are used

48 EG-348_371_09 48 Some Standards
Standard   Application         Coder      Bit rate (kb/s)
GSM        European Cellular   RPE-LTP    13
FS1016     Secure Voice        CELP       4.8
IS54       NA Cellular         VSELP      7.95
IS96       NA Cellular         QCELP      1-8
JDC-FR     Japanese Cellular   VSELP      6.7
JDC-HR     Japanese Cellular   PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16

49 EG-348_371_09 49 Low Bit Rate Speech Coding Compandent http://www.compandent.com/

50 EG-348_371_09 50 Criteria in Speech Comms. Quality versus Bit-rate [Figure: Quality (Excellent, Good, Fair, Poor) against bit rate (4, 8, 16, 32, 64 kbps) for GSM, ADPCM, CELP] 4 Quality Measures: intelligibility, loudness, naturalness, ease-of-listening

51 EG-348_371_09 51 CELP eg [Block diagram: CB Index → codebook → Gain → e_n → H_lt(z), long-term coefficients (pitch) → H_st(z), short-term coefficients (formants) → s_n] Excitation is represented by address, ie CB Index

52 EG-348_371_09 52 CELP – LPAS (Encoder) [Block diagram: CB Index → codebook → Gain → e_n → H_lt(z), long-term coefficients (pitch) → H_st(z), short-term coefficients (formants) → synthesised s_n; this Basic decoder sits inside the Adaptive encoder, its output is subtracted from the input s_n and the Weighted error drives the codebook search] Excitation is represented by address, ie CB Index

53 EG-348_371_09 53 Conversion of LPC Parameters. $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}$ and the a_i are to be Tx’d. Line Spectral Frequencies (LSF) present a clever way of representing the LPC coefficients, the a_i’s of A(z). The a_i’s are floating point numbers and their accuracy is important. Factorising A(z) tends to give complex roots in the z-domain. LSF’s map these complex roots on to the unit circle. LSF’s:
- Lead to efficient coding
- Ensure a minimum phase filter
- Bit errors are spectrum localised, minimising loss of speech quality
[Figure: z-plane, axes x and jy, roots marked x, angle θ, sampling frequency ω_s] LSF = ω_s · θ / 2π

54 EG-348_371_09 54 Line Spectral Frequencies. Consider $P(z) = A(z) + z^{-(n+1)} A(z^{-1})$ and $Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$; then P(z) and Q(z) lead to what is known as LSF’s. Clearly if P(z) and Q(z) are known then A(z) can be found: $A(z) = \{P(z) + Q(z)\}/2$. Roots of P(z) and Q(z) lie on the unit circle in the z-domain. The locations give:
- the LSF’s
- P(z) and Q(z), and whence A(z)
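A general-order sketch of this definition, assuming A(z) is supplied as the coefficient list [1, a_1, ..., a_p] (the Matlab-style convention of the previous slide) and that root-finding with `numpy.roots` is accurate enough for illustration. The function name `lsf_from_lpc` is hypothetical. The trivial roots at z = +1 and z = -1 are dropped and only angles in (0, π) are kept.

```python
import numpy as np

def lsf_from_lpc(a):
    """Return the sorted LSFs (radians) for A(z) given as [1, a_1, ..., a_p]."""
    a = np.asarray(a, dtype=float)
    padded = np.append(a, 0.0)             # A(z), padded to order p+1
    mirrored = np.append(0.0, a[::-1])     # z^-(p+1) * A(1/z)
    angles = []
    for poly in (padded + mirrored, padded - mirrored):   # P(z), then Q(z)
        theta = np.angle(np.roots(poly))
        # keep one of each conjugate pair, discard the roots at z = +1 and z = -1
        angles.extend(t for t in theta if 1e-6 < t < np.pi - 1e-6)
    return np.sort(np.array(angles))

print(lsf_from_lpc([1.0, -1.8, 0.9]))
print(lsf_from_lpc([1.0, 0.0, 0.5]))
```

Production coders usually avoid general root-finding and search the unit circle directly, but the result is the same set of angles.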

55 EG-348_371_09 55 LSF Evaluation. Consider one pair of complex roots, $A_1(z)$:
$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^{2}) = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)z^{-3}$
$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^{2}) = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)z^{-3}$
The roots at z = +1 and z = -1 (normalised frequencies 0 and 1) are discarded. It follows that the LSF’s, $\theta_1$ & $\theta_2$, are given by: $\cos(\theta_1) = -(a_1 + a_2 - 1)/2$ and $\cos(\theta_2) = -(a_1 - a_2 + 1)/2$
Show: $a_1 = -(\cos(\theta_1) + \cos(\theta_2))$ and $a_2 = \cos(\theta_2) - \cos(\theta_1) + 1$
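The second-order relations on this slide translate directly into a pair of conversion helpers. This is a sketch only: the function names are hypothetical and the test values are the (a_1, a_2) pairs used in the deck's LSF examples.

```python
import math

def lpc2_to_lsf(a1, a2):
    """LSFs of A(z) = 1 + a1 z^-1 + a2 z^-2, from the P(z) and Q(z) quadratics."""
    theta1 = math.acos(-(a1 + a2 - 1.0) / 2.0)   # from P(z)
    theta2 = math.acos(-(a1 - a2 + 1.0) / 2.0)   # from Q(z)
    return theta1, theta2

def lsf_to_lpc2(theta1, theta2):
    """Inverse mapping: recover a1, a2 from the two LSFs."""
    a1 = -(math.cos(theta1) + math.cos(theta2))
    a2 = math.cos(theta2) - math.cos(theta1) + 1.0
    return a1, a2

t1, t2 = lpc2_to_lsf(-1.8, 0.9)
print(t1, t2)                 # approximately 0.31756 and 0.554811
print(lsf_to_lpc2(t1, t2))    # recovers approximately (-1.8, 0.9)
```

The round trip also answers the "Show" exercise numerically: adding and subtracting the two cosine expressions isolates a_1 and a_2.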

56 EG-348_371_09 56 LSF Test Example. $A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} = (z^2 + a_1 z + a_2)z^{-2} = (z^2 + 2\cos(\phi)\, w_n z + w_n^2) z^{-2}$, where $w_n$ is the radius and $\phi$ is the angle from $\pi$. So: radius $= \sqrt{a_2}$ & $\theta = \pi - \phi$. Note: in P & Q all $w_n^2$ terms (of the multiple 2nd orders) are unity.
EG 1: $a_2 = 1$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$; roots already on circle and do not move (unstable system – not practical)
EG 2: $a_1 = 0$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$ and $\cos(\theta_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$, so LSF’s are symmetric about $\omega_s/4$, ie $\theta = \pi/2$

57 EG-348_371_09 57 LSF Review & Example (1). LSF’s/LSP’s are defined as: $P(z) = A(z) + z^{-(n+1)} A(z^{-1})$ and $Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$, thus $A(z) = \{P(z) + Q(z)\}/2$

58 EG-348_371_09 58 LSF Review & Example (2). For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})z^{-3} = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)z^{-3}$
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})z^{-3} = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)z^{-3}$
cf: $(s^2 + (2\cos(\phi)\, w_n)s + w_n^2)$

59 EG-348_371_09 59 LSF Review & Example (3). For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)z^{-3}$
$Q(z) = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)z^{-3}$
cf: $(s^2 + (2\cos(\phi)\, w_n)s + w_n^2)$
Thus: $(a_1 + a_2 - 1) = 2\cos(\phi_1) = -2\cos(\theta_1)$ & $(a_1 - a_2 + 1) = -2\cos(\theta_2)$
So, given: i) LPC coeffs., a_1 and a_2, then the LSFs θ_1 & θ_2 can be found; ii) LSFs, θ_1 & θ_2, then the LPC coeffs. a_1 and a_2 can be found
[Figure: θ_1 and θ_2 marked against P(z) and Q(z)]

60 EG-348_371_09 60 LSF Review & Example (4). For a second order, with P(z) corresponding to the first root and Q(z) to the second root:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})z^{-3} = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)z^{-3}$
for the second pair of $\theta_i$, 1.37 and 1.77:
$= (z^2 - 2\cos(1.37)z + 1)(z + 1)z^{-3} = (z^3 + (1 - 2\cos(1.37))z^2 + (1 - 2\cos(1.37))z + 1)z^{-3}$
Likewise
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^{2})z^{-3} = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1)z^{-3}$
$= (z^2 - 2\cos(1.77)z + 1)(z - 1)z^{-3} = (z^3 + (-1 - 2\cos(1.77))z^2 + (1 + 2\cos(1.77))z - 1)z^{-3}$
Then $A(z) = \{P(z) + Q(z)\}/2 = (z^3 - (\cos(1.37) + \cos(1.77))z^2 + (1 - \cos(1.37) + \cos(1.77))z)z^{-3}$

61 EG-348_371_09 61 LSF Examples
LPC coeffs.         LSF’s
a1      a2          θ1             θ2
0       0.5         1.31812        1.82348
-1.8    0.9         0.31756        0.554811
+1.8    0.9         π-0.554811     π-0.31756
(not given)         2.2274         2.3743

62 EG-348_371_09 62 LSF Examples
LPC coeffs.         LSF’s
a1      a2          θ1             θ2
0       0.5         1.31812        1.82348
-1.8    0.9         0.31756        0.554811
+1.8    0.9         π-0.554811     π-0.31756
(not given)         2.2274         2.3743
$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^{2})z^{-3} = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1)z^{-3}$
$= (z^2 + (-1.8 + 0.9 - 1)z + 1)(z + 1)z^{-3} = (z^2 - 1.9z + 1)(z + 1)z^{-3}$
cf: $(z^2 + (2\cos(\phi)\, w_n)z + w_n^2)$, thus $\cos(\phi) = -1.9/2$ or $\phi = 2.824$ and $\theta_1 = \pi - \phi = 0.318$

63 EG-348_371_09 63 Example Bit Allocation

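64 EG-348_371_09 64 Codebooks & VQ [Figure: vectors of p elements along the time axis are each replaced by an index i (0 … N-1) into a codebook of N = 2^L vectors; an identical book is held at the receiver] Data reduction: (p x B) to L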

65 EG-348_371_09 65 Codebook Compression
- Principle
  - representative data sets
  - data vector is replaced / represented by “nearest” vector, chosen from a “codebook” - a closed set of vectors
- Examples
  - LPC parameter sets
  - Excitation as in CELP
[Figure: codebook of N = 2^k vectors, each of M elements, addressed by index i; A(z); e_n → H(z) → s_n]
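A minimal sketch of the "nearest vector" principle, assuming squared Euclidean distance as the nearness measure (the slides do not state which measure is used). The function names and the toy 4-entry codebook are made up for illustration.

```python
import numpy as np

def vq_encode(vector, codebook):
    """Return the index of the codebook row closest to the data vector."""
    distances = np.sum((codebook - vector) ** 2, axis=1)   # squared Euclidean (assumed)
    return int(np.argmin(distances))

def vq_decode(index, codebook):
    """The receiver looks the index up in its identical copy of the codebook."""
    return codebook[index]

# Toy codebook: N = 2^k = 4 vectors (k = 2 bits), each with M = 2 elements
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])
x = np.array([0.9, 0.2])
i = vq_encode(x, codebook)
print(i, vq_decode(i, codebook))
```

Only the index i is transmitted, so the cost per vector is k bits however many elements the vector holds.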

66 EG-348_371_09 66 Codebook Compression - CELP [Figure: codebook of P vectors of time-domain samples, each spanning y ms from a start point; e_n → H(z) → s_n] e_n are time domain samples (integers). R samples per second (eg 8000 Hz). Frame rate governs vector size. P = 2^j. Bit rate = j/y bits/ms. NB e_n also includes gain

67 EG-348_371_09 67 Codebook Compression of H(z) [Figure: A[z] at time t, one vector every x ms along the time axis, mapped to index i] Vector with M elements, every x ms. Codebook with N = 2^k vectors. Bit rate = k/x bits per ms (not a function of M). In practice A[z] is converted to LSF’s.
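The bit-rate relation can be checked with a few lines of arithmetic. All numbers here are hypothetical and chosen only to make the comparison concrete: M = 10 coefficients at 16 bits each, a 20 ms frame, and a 10-bit codebook index.

```python
def bits_per_ms(bits_per_frame, frame_ms):
    """Bit rate in bits per ms, which is numerically the same as kbit/s."""
    return bits_per_frame / frame_ms

M, bits_per_coeff = 10, 16     # assumed: direct coding of M coefficients
k, x_ms = 10, 20.0             # assumed: codebook of N = 2^k vectors, one index every x ms

direct_rate = bits_per_ms(M * bits_per_coeff, x_ms)   # depends on M
vq_rate = bits_per_ms(k, x_ms)                        # k/x, independent of M
print(direct_rate, vq_rate)    # 8.0 and 0.5 kbit/s
```

Doubling M would double the direct rate but leave the codebook rate unchanged, which is the point of "not a function of M".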

68 EG-348_371_09 68 Codebook Generation
1) Initialise: form a single centroid of all training data, N = 1
2) Repeat
     Split centroids: N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
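The splitting procedure above can be sketched as follows. This is one plausible reading, not the deck's own implementation: the additive perturbation `eps` used to split each centroid, the squared Euclidean distance, the iteration cap and the two-cluster toy data are all assumptions.

```python
import numpy as np

def train_codebook(data, target_size, eps=0.01, max_iters=50):
    """Grow a codebook by repeated centroid splitting and re-clustering."""
    centroids = data.mean(axis=0, keepdims=True)            # 1) single centroid, N = 1
    while len(centroids) < target_size:                     # 2) until N large enough
        centroids = np.vstack([centroids + eps, centroids - eps])   # split: N -> 2N
        for _ in range(max_iters):
            # cluster every training vector to its nearest centroid
            d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            updated = centroids.copy()
            for j in range(len(centroids)):
                members = data[labels == j]
                if len(members) > 0:                        # leave empty cells unchanged
                    updated[j] = members.mean(axis=0)
            if np.allclose(updated, centroids):             # until convergence
                break
            centroids = updated
    return centroids

# Toy training data: two well separated clusters around (0, 0) and (10, 10)
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2))])
codebook = train_codebook(data, target_size=2)
print(codebook)
```

Because N doubles at each split, the final size is a power of two, matching the N = 2^k codebooks of the earlier slides.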

69 EG-348_371_09 69 VQ Performance on Unseen Data. Ramachandran & Mammone (eds) ‘Modern Methods of Speech Processing’ Kluwer Academic, 1995

70 EG-348_371_09 70 VQ Performance on Unseen Data. Ramachandran & Mammone (eds) ‘Modern Methods of Speech Processing’ Kluwer Academic, 1995

71 EG-348_371_09 71 LPC & FFT Spectra [Figure: Waveform against Time (ms); LPC and FFT spectra, Magnitude (dB) against Frequency (kHz), 0 to Fs/2]
LPC Roots: -0.6651 ± 0.6695i; -0.0560 ± 0.9709i; 0.7228 ± 0.6225i; 0.8714 ± 0.3694i; 0.5758; -0.4200
LSFs
θ2 of Q(z)   θ1 of P(z)
2.3743       2.2274
1.6540       1.5997
0.8261       0.6954
0.6106       0.3937

72 EG-348_371_09 72 LPC Spectra & LSF’s [Figure: Magnitude (dB) against Frequency (kHz), 0 to Fs/2]
LPC Roots: -0.6651 ± 0.6695i; -0.0560 ± 0.9709i; 0.7228 ± 0.6225i; 0.8714 ± 0.3694i; 0.5758; -0.4200
LSFs
θ2 of Q(z)   θ1 of P(z)
2.3743       2.2274
1.6540       1.5997
0.8261       0.6954
0.6106       0.3937

73 EG-348_371_09 73 LPC & FFT Spectra - 2nd Order [Figure: waveform against Time (ms); spectra against Frequency (kHz), 0 to Fs/2] A(z) coefficients: 1.5537, -0.8276. Roots: 0.7769 ± 0.4733i.
$H(0) = K/(1 - (1.5537 - 0.8276)) = K/0.274$
$H(\omega_s/2) = K/(1 - (-1.5537 - 0.8276)) = K/3.38$
$H(0)/H(\omega_s/2) = 3.38/0.274$, ie 21.8 dB
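The 21.8 dB figure can be reproduced numerically. This sketch assumes H(z) = K/(1 - a_1 z^-1 - a_2 z^-2) with a_1 = 1.5537 and a_2 = -0.8276, evaluated at z = 1 (zero frequency) and z = -1 (half the sampling frequency); the gain K cancels in the ratio.

```python
import math

a1, a2 = 1.5537, -0.8276

den_dc = 1.0 - (a1 + a2)        # denominator at z = 1, about 0.274
den_nyq = 1.0 - (-a1 + a2)      # denominator at z = -1, about 3.38

ratio_db = 20.0 * math.log10(den_nyq / den_dc)   # H(0)/H(ws/2) in dB
print(round(den_dc, 3), round(den_nyq, 2), round(ratio_db, 1))

# Pole position implied by the same coefficients: z^2 - a1 z - a2 = 0
radius = math.sqrt(-a2)
real_part = a1 / 2.0
imag_part = math.sqrt(radius ** 2 - real_part ** 2)
print(round(real_part, 4), round(imag_part, 4))
```

The pole radius of about 0.91 is what makes the low-frequency peak so much higher than the response at half the sampling frequency.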

74 EG-348_371_09 74 GSM
- Groupe Spécial Mobile - EU
- First digital cellular system in world
- See Hodge 1990
- Based on TDMA & FDMA at 900 MHz, and RPE-LPC (ie it is an ‘LPAS’ system)
- Now at 1800 MHz
- Carriers at 200 kHz
- Supporting 8 TDMA time slots each
- Time slots: 577 µs - 156.25 bit slots
- 8 time slots form 1 GSM frame of 4.62 ms
- Modulation: Gaussian minimum shift keying
- 26 bit training in every time slot
- Round-trip delay ~ 80 ms
- EU: GSM; US: D-AMPS

75 EG-348_371_09 75 Other Related Topics
- Spectral Lifting: $H(z) = (1 - a z^{-1})$
- Codebook Training
- Spectral Differences between 2 frames
- Cepstra
- Modeling Speech Space - HMM’s

76 EG-348_371_09 76 Pre-Emphasis Example [Figure Q1: panels (a) and (b); amplitude axis from -1 to 1, duration 30 ms]

77 EG-348_371_09 77 Pre-Emphasis Example [Figure: z-plane, axes x and jy, with the zero at a; distance 1 + a = 2 to the point at ω_s/2] $G(\omega_s/2) = 1 + a$, $G(0) = 1 - a$. For $G(\omega_s/2) > G(0)$, a must be > 0
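A short sketch of the first-order pre-emphasis filter H(z) = 1 - a z^-1 and its gain at the two band edges. The function names and the value a = 0.95 are assumptions for illustration; the slides do not fix a particular value.

```python
import numpy as np

def pre_emphasis(x, a=0.95):
    """Apply y_n = x_n - a * x_{n-1} (the first sample is passed unchanged)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= a * x[:-1]
    return y

def gain_at(a, omega):
    """Magnitude of 1 - a z^-1 on the unit circle, z = exp(j*omega)."""
    return abs(1.0 - a * np.exp(-1j * omega))

a = 0.95
print(gain_at(a, 0.0))       # G(0) = 1 - a
print(gain_at(a, np.pi))     # G(ws/2) = 1 + a
print(pre_emphasis([1.0, 1.0, 1.0], a=0.5))
```

With a positive, high frequencies are boosted relative to low ones, which flattens the typical downward tilt of the speech spectrum before LPC analysis.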

78 EG-348_371_09 78 Z-plane to Magnitude Spectrum [Figure: z-plane plot, Real Part against Imaginary Part, with 1 + a = 2 marked at ω_s/2; magnitude spectrum, Magnitude (dB) against Frequency (kHz), 0 to Fs/2]

79 EG-348_371_09 79 LPC Short and Long. Spectral envelope reflects morphological characteristics of the vocal tract. [Figure: noise, H1(z), H2(z), synthesised speech, set against: Air from the lungs → Vocal fold → Vocal tract → Speech]

80 EG-348_371_09 80 ST & LT Prediction [Block diagram: Speech s_n → STP, 1 – A’(z) (delay line z^-1 giving s_{n-1} ... s_{n-i} ... s_{n-p}, weights a_1 ... a_i ... a_p, sum subtracted from s_n) → Residual e_n → LTP, a second 1 – A’(z) stage of the same form → e`_n]

