Published by Kristen Burkman. Modified over 2 years ago.

1
EG-348_371_09 1 Multimedia Communications (371) Speech and Image Communications (348) John Mason Engineering Swansea University

2
Features in speech
Acquisition (frame: 20/30 ms, sampling frequency: 8 kHz), then feature extraction
[Diagram: along the time axis the signal is cut into frames, and each frame yields a feature vector X1 ... Xi ...]

4
Speech production
Air from the lungs -> vocal folds -> vocal tract -> speech

5
LPC: Short and Long
The spectral envelope reflects the morphological characteristics of the vocal tract
[Diagram: production chain (air from the lungs -> vocal folds -> vocal tract -> speech) alongside its model (noise -> H1(z) -> H2(z) -> synthesised speech)]

6
Features: building of statistical model
[Diagram: feature vectors labelled T1 and T2, interleaved in time, are grouped into a T1 set and a T2 set]

7
VT Shape & Some Vowels - Ladefoged ‘62

8
Speech Processing - Applications
Why? Communications; synthesis; recognition (speech & speaker)
How? Frame-based; systems approach

9
Some Books
Flanagan, ‘Speech Analysis, Synthesis and Perception’, Springer-Verlag (a classic!)
Furui, several books on recognition
Parsons, ‘Voice and Speech Processing’, McGraw-Hill (one of the first text books on computer speech processing)
O’Shaughnessy, ‘Speech Comms - human and machine’, Addison-Wesley
Rabiner & Juang, ‘Fundamentals of Speech Recognition’, Prentice Hall, 1993
Ramachandran & Mammone (eds), ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995

10
Speech Communications
Person-to-Person
Person-to-Machine: speech/speaker recognition
Machine-to-Person: speech synthesis

11
(Electronic) Speech Communications
perhaps separated by long distance (or in time)

12
Telephony & Broadcasting
[Diagram: the transmission path is either an acoustic air path or an electronic link]

13
Speech Comms: Telephony
Electronic link: microphone -> ADC -> analysis -> coding -> transmitter -> channel (transmission path) -> receiver -> decoding -> (re-)synthesis -> DAC -> loudspeaker

14
Speech Bit Rates
[Diagram: message creation -> language coding -> human acoustic generation -> transmission through acoustic space -> human hearing extraction -> language decoding -> message realisation]
Approx. bit rate in bps across the stages: tens, hundreds, thousands, tens of thousands

15
Criteria in Speech Comms.
Quality versus bit-rate
[Plot: quality (excellent, good, fair, poor) against bit rate in kbps, with GSM, ADPCM and CELP marked]
4 quality measures: intelligibility, loudness, naturalness, ease-of-listening

16
Low Bit Rate Speech Coding Compandent

17
Speech Processing
The three main application areas are:
Speech Comms. (the ‘electronic link’)
Automatic speech/speaker recognition
Speech synthesis
Much of the underlying analysis is common, eg linear predictive coding

18
What does speech look like?

19
What does speech look like?
Dynamic range: for flexibility and robustness
Time-varying: to convey information

20
Frame-based Analysis
To capture time variations: ms frames, ‘centi-second’ labelling
Spectral analysis: FFT, filter-bank, linear predictive coding
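A minimal sketch of frame-based acquisition, assuming the 20 ms frame and 8 kHz sampling rate quoted earlier in the deck. The 10 ms hop (one frame start per centi-second) and the function name `split_into_frames` are illustrative assumptions, not taken from the slides.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20, hop_ms=10):
    """Cut a sequence of samples into fixed-length, possibly overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    hop_len = sample_rate_hz * hop_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


signal = list(range(8000))            # one second of dummy samples at 8 kHz
frames = split_into_frames(signal)    # each frame holds 160 samples
```

Each frame would then go to a spectral analysis stage (FFT, filter-bank or LPC).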

21
Speech Analysis/Coding
Two general cases: waveform coders; source (voice) coders (vo-coders)
Source coders, eg linear predictive coding (LPC): model the source, ie the vocal tract (VT), with a linear, time-varying model of the VT plus excitation
[Diagram: excitation $e_n$ (voiced or unvoiced) -> H(z) -> speech $s_n$]

22
Systems Approach
[Diagram: excitation (voiced, with pitch $f_0$, or unvoiced) drives a vocal-tract model with time-varying parameters to give speech]

23
LPC Analysis/Synthesis
Synthesis: input is the excitation, output is speech: $e_n$ -> H(z) -> $s_n$, ie $S(z) = H(z)E(z)$, where H(z) has impulse response $h_n$
Analysis: input is speech, output is the excitation: $s_n$ -> 1/H(z) -> $e_n$, ie $E(z) = S(z)/H(z)$

24
‘Perfect’ Analysis/Synthesis
[Diagram: $s_n$ -> 1/H(z) -> $e_n$ -> H(z) -> $s_n$]
Input $s_n$ and output $s_n$ are identical (within arithmetic limits)

25
Practical Analysis/Synthesis

26
Practical Analysis/Synthesis
[Diagram: sending side $s_n$ -> 1/H(z) -> $e_n$; transmission; receiving side $e_n$ -> H(z) -> $s_n$]
Parameters for transmission: the input/excitation $e_n$ and the source model H(z)
Thus Analysis must derive these parameters, and Synthesis must use them to re-generate speech

27
Linear Predictive Coding - LPC
Principle of linear prediction: the next value (or sample) in a series, ie at time n, is predicted or estimated by a weighted sum of previous values, ie those at time n-1, n-2, ...
Thus for a predictor of order p, we have:
$\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$

28
Linear Prediction
Transforming to the z-domain gives:
$\hat{S}(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) = A'(z) S(z)$

29
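LPC Error Terms
Error is simply the difference between the predicted and actual values: $e_n = s_n - \hat{s}_n$
[Diagram: $s_n$ feeds A’(z) to form $\hat{s}_n$, which is subtracted from $s_n$ at a summing node to give $e_n$]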

30
Synthesis
[Diagram: $e_n$ -> H(z) -> $s_n$; equivalently, $e_n$ is added to the output of A’(z), which is fed from $s_n$, to give $s_n$]
Parameters are updated at the frame rate
NB the ‘hat’ of approximation is omitted for simplicity

31
Analysis for Synthesis
The Analysis and Synthesis must match: what is needed for the Synthesis?
Answer: $e_n$, the excitation, and H(z), the system
Thus the Analysis must derive these terms (from $s_n$): the speech signal $s_n$ is analysed to give $e_n$ and H(z), ie the A’(z) parameters, for transmission.
[Diagram: Synthesis: $e_n$ -> H(z) -> $s_n$. Analysis: $s_n$ -> 1/H(z) -> $e_n$, implemented as $s_n$ minus the output of A’(z)]

32
Derivation of LPC Coefficients - A(z)
Recall: $\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$, where the $a_i$ are the p prediction coefficients.
The principle behind LPC is to find a set of p coefficients, $a_1, a_2, a_3, \dots, a_p$, which in some sense minimises the error signal $e_n$ over a frame of N speech samples. This leads to a set of p coefficients for each frame.

33
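Derivation of A(z) – (2)
Minimisation of $E_n$ is achieved by setting the p partial derivatives to zero:
$\partial E_n / \partial a_i = 0$ for i = 1, 2, ..., p
where: $E_n = \sum_m e_m^2 = \sum_m \left( s_m - \sum_{k=1}^{p} a_k s_{m-k} \right)^2$, summed over the frame
From which: $\sum_{k=1}^{p} a_k r_{|i-k|} = r_i$, with $r_j = \sum_m s_m s_{m-j}$
In matrix form: $[R][a] = [r]$, or $[a] = [R]^{-1}[r]$
The matrix [R] is symmetric Toeplitz, offering numerically efficient inversion techniques, Durbin’s recursion algorithm being one of the most popular.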
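A minimal sketch of Durbin's recursion for the Toeplitz system named on this slide. It assumes the autocorrelation form of the normal equations, sum over k of a_k r_|i-k| = r_i, with the predictor written as a weighted sum of past samples (no sign folded into the a_i). The function name and the test values are illustrative.

```python
def levinson_durbin(r, order):
    """Solve sum_k a[k]*r[|i-k|] = r[i], i = 1..order, without a matrix inverse.

    r is the autocorrelation sequence r[0..order].
    Returns ([a_1, ..., a_p], final prediction error energy).
    """
    a = [0.0] * (order + 1)      # a[0] is unused; a[i] multiplies s[n-i]
    error = r[0]                 # prediction error energy at order 0
    for i in range(1, order + 1):
        # Reflection coefficient for order i
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / error
        # Update the lower-order coefficients, then set the new one
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        error *= (1.0 - k_i * k_i)
    return a[1:], error


coeffs, err = levinson_durbin([1.0, 0.8, 0.5], order=2)
```

The recursion needs on the order of p squared operations per frame, compared with p cubed for a general matrix inversion, which is why the Toeplitz structure matters.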

34
Derivation of A(z) – (3)
When N is very large, r is the autocorrelation of s
s comes from e convolved with h (excitation & vocal tract); we are interested here in separating e and h
The predictor order, p, is small to reflect the short-term periodicities (formants); with higher predictor orders we will get the longer-term periodicities (pitch)
2 practical problems with evaluating a: matrix singularities in $R^{-1}$; unstable resultant H(z)
In practice both are solved by windowing, ie shaping the frame, eg Hamming
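A minimal sketch of the windowing step, assuming the common Hamming definition 0.54 - 0.46 cos(2 pi n / (N - 1)) and autocorrelation computed over the windowed frame. The 500 Hz test tone and order 10 are illustrative values only.

```python
import math


def hamming(n_samples):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def autocorrelation(frame, order):
    """r[j] = sum_n x[n]*x[n-j] for j = 0..order, where x is the windowed frame."""
    window = hamming(len(frame))
    x = [s * w for s, w in zip(frame, window)]
    return [sum(x[n] * x[n - j] for n in range(j, len(x)))
            for j in range(order + 1)]


# 20 ms of a 500 Hz tone sampled at 8 kHz (160 samples)
frame = [math.sin(2 * math.pi * 500 * n / 8000) for n in range(160)]
r = autocorrelation(frame, order=10)
```

The resulting r[0..p] is the input a Durbin-style recursion would take to produce the p coefficients for the frame.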

35
Speech Signal Characteristics
Duration; dynamic range; periodicities (vocal tract, pitch)
Frame-based Analysis
Frame size: quasi-stationary, yet able to capture transitions; typically ms
Frame rate: task dependent; more means more band-width/computation; up to 100 frames/second

36
Harmonic Structures and Periodicities
Harmonic structures & periodicities give potential for data reduction
LPC is one way of gaining this compression
Speech has two obvious separate structures: vocal tract resonances; pitch

37
Harmonic Structures and Periodicities
[Diagram: short-term prediction: voiced (pitch period $T_p$) or unvoiced excitation $e_n$ -> vocal tract H(z), order p -> speech $s_n$]

38
Harmonic Structures and Periodicities
[Diagram: long-term prediction: excitation $e_n$ -> pitch filter $H_{lt}(z)$, pitch period $T_p$ -> $ep_n$ -> vocal tract filter $H_{st}(z)$ -> speech $s_n$]

39
Harmonic Structures and Periodicities
Two structures: short-term (formants) & long-term (pitch, excitation)
[Diagram: $e_n$ -> $H_{lt}(z)$ -> $ep_n$ -> $H_{st}(z)$ -> $s_n$, with gain k; $H_{lt}(z)$ has coefficients $a_i$, eg p = 3, and $H_{st}(z)$ has coefficients $a_i$, eg p = 10]
eg a 20 ms frame is 160 samples at 8 kHz
NB representations of these parameters are transmitted

40
Practical Coding Systems
Waveform & source coders (vocoders)
Source coders (vocoders): 2 periodicities/redundancies in the source: short-term (formants); long-term (pitch)
Excitation $e_n$
[Diagram: $e_n$ -> $H_{lt}(z)$ -> $ep_n$ -> $H_{st}(z)$ -> $s_n$]

41
‘Perfect’ Analysis/Synthesis (1)
[Diagram: $s_n$ -> 1/H(z) -> $e_n$ -> H(z) -> $s_n$]
Input $s_n$ and output $s_n$ are identical (within arithmetic limits)

42
‘Perfect’ Analysis/Synthesis (2)
[Diagram: the chain $s_n$ -> 1/H(z) -> $e_n$ -> H(z) -> $s_n$ redrawn with 1/H(z) = 1 – A’(z) and H(z) = 1/(1 – A’(z))]

43
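‘Perfect’ Analysis/Synthesis (3)
[Diagram: analysis filter 1 – A’(z) as a tapped delay line: $s_n$ passes through $z^{-1}$ delays to give $s_{n-1}$, ..., $s_{n-i}$, ..., $s_{n-p}$, weighted by $a_1$, ..., $a_i$, ..., $a_p$; the weighted sum is subtracted from $s_n$ to give $e_n$. Waveform panels: original speech, residual]
Note the minus sign: in Matlab it is combined with the $a_i$
What determines p?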

44
‘Perfect’ Analysis/Synthesis (4)
[Diagram: analysis filter 1 – A’(z) followed by synthesis filter 1/(1 – A’(z)); in the synthesis loop the weighted sum of $s_{n-1}$, ..., $s_{n-i}$, ..., $s_{n-p}$ is added to $e_n$ to give $s_n$. Waveform panels: original speech, residual, re-synthesised speech]
Note: no minus sign in the synthesis sum
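A minimal sketch of the ‘perfect’ analysis/synthesis pair, assuming the convention of these slides: the analysis filter is 1 – A’(z) (weighted sum subtracted) and the synthesis filter is 1/(1 – A’(z)) (weighted sum added). The coefficient values and the test signal are arbitrary illustrative choices.

```python
def analysis_filter(speech, a):
    """e[n] = s[n] - sum_i a_i*s[n-i], ie the FIR filter 1 - A'(z)."""
    residual = []
    for n in range(len(speech)):
        # a[i] holds a_(i+1); samples before the start of the signal count as 0
        prediction = sum(a[i] * speech[n - 1 - i]
                         for i in range(len(a)) if n - 1 - i >= 0)
        residual.append(speech[n] - prediction)
    return residual


def synthesis_filter(residual, a):
    """s[n] = e[n] + sum_i a_i*s[n-i], ie the IIR filter 1/(1 - A'(z))."""
    speech = []
    for n in range(len(residual)):
        prediction = sum(a[i] * speech[n - 1 - i]
                         for i in range(len(a)) if n - 1 - i >= 0)
        speech.append(residual[n] + prediction)
    return speech


a = [1.2, -0.6]                                   # a_1, a_2 (stable example)
original = [float((7 * n) % 11 - 5) for n in range(50)]
resynth = synthesis_filter(analysis_filter(original, a), a)
```

With the same coefficients on both sides the re-synthesised signal matches the original to within floating-point rounding; a practical coder loses this exactness once the residual and coefficients are quantised.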

45
Practical System
[Diagram: $s_n$ -> 1/H(z) -> transmitted data frame -> H(z) -> $s_n$]
Input $s_n$ and output $s_n$ are “similar”
What does the transmitted data frame contain?

46
Analysis-by-Synthesis: LPAS
Integrated encoder & decoder at the encoder
[Diagram: LPAS encoder: an adaptive encoder drives a basic decoder; the decoder output is subtracted from $s_n$, and the weighted error is fed back to the adaptive encoder]

47
Log Spectral Estimates
Comparisons between frames are very important in many situations
Log spectral estimates are the most common (though in Comms. an approximation is used to reduce computation)
In Comms. computation is expensive, and parameter-vector approximations to the distance D are used
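The slide's own definition of D did not survive extraction. As one common choice, the sketch below assumes D is the RMS difference in dB between the log magnitude spectra of two all-pole models H(z) = 1/(1 – A’(z)); the function names and the 256-point frequency grid are illustrative.

```python
import cmath
import math


def lpc_log_spectrum_db(a, n_points=256):
    """20*log10|H| for H(z) = 1/(1 - sum_i a_i z^-i), sampled from 0 up to pi."""
    spectrum = []
    for k in range(n_points):
        w = math.pi * k / n_points
        denom = 1 - sum(a_i * cmath.exp(-1j * w * (i + 1))
                        for i, a_i in enumerate(a))
        spectrum.append(-20 * math.log10(abs(denom)))
    return spectrum


def log_spectral_distance(a_x, a_y, n_points=256):
    """RMS difference, in dB, between the log spectra of two LPC frames."""
    sx = lpc_log_spectrum_db(a_x, n_points)
    sy = lpc_log_spectrum_db(a_y, n_points)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sx, sy)) / n_points)


d = log_spectral_distance([1.2, -0.6], [1.0, -0.5])
```

Evaluating every frequency point per comparison is the cost that motivates the cheaper parameter-vector approximations mentioned on the slide.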

48
Some Standards
Standard   Application         Coder      Bit rate (kb/s)
GSM        European cellular   RPE-LTP    13
FS1016     Secure voice        CELP       4.8
IS54       NA cellular         VSELP      7.95
IS96       "                   QCELP      1-8
JDC-FR     Japanese cellular   VSELP      6.7
JDC-HR     "                   PSI-CELP   3.67
G.728      (terrestrial)       LD-CELP    16

49
Low Bit Rate Speech Coding Compandent

50
Criteria in Speech Comms.
Quality versus bit-rate
[Plot: quality (excellent, good, fair, poor) against bit rate in kbps, with GSM, ADPCM and CELP marked]
4 quality measures: intelligibility, loudness, naturalness, ease-of-listening

51
CELP eg
[Diagram: CB index -> codebook -> $e_n$ -> gain -> $H_{lt}(z)$, long-term coefficients (pitch) -> $H_{st}(z)$, short-term coefficients (formants) -> $s_n$]
Excitation is represented by an address, ie the CB index

52
CELP – LPAS (Encoder)
[Diagram: the CELP decoder chain (CB index -> codebook -> $e_n$ -> gain -> $H_{lt}(z)$, long-term coefficients (pitch) -> $H_{st}(z)$, short-term coefficients (formants) -> $s_n$) embedded in the LPAS encoder: the basic decoder output is subtracted from the input $s_n$, and the weighted error drives the adaptive encoder]
Excitation is represented by an address, ie the CB index

53
Conversion of LPC Parameters
$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}$, and the $a_i$ are to be transmitted
Line Spectral Frequencies (LSFs) present a clever way of representing the LPC coefficients, the $a_i$ of A(z)
The $a_i$ are floating-point numbers and their accuracy is important
Factorising A(z) tends to give complex roots in the z-domain; LSFs map these complex roots on to the unit circle
LSFs: lead to efficient coding; ensure a minimum-phase filter; bit errors are localised in the spectrum, minimising loss of speech quality
[Diagram: z-plane (axes x, jy) with roots marked on the unit circle and the sampling frequency $w_s$; LSF = $w_s \theta / 2\pi$]

54
Line Spectral Frequencies
Consider $P(z) = A(z) + z^{-(n+1)} A(z^{-1})$ and $Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$, where n is the order of A(z); then P(z) and Q(z) lead to what are known as LSFs
Clearly if P(z) and Q(z) are known then A(z) can be found: $A(z) = \{P(z) + Q(z)\}/2$
Roots of P(z) and Q(z) lie on the unit circle in the z-domain
The locations give: the LSFs, P(z) and Q(z), and whence A(z)

55
LSF Evaluation
Consider one pair of complex roots, $A_1(z)$:
$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1) z^{-3}$
$Q_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - z^{-3}(1 + a_1 z + a_2 z^2) = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1) z^{-3}$
The roots at z = -1 (from $P_1$) and z = +1 (from $Q_1$) are discarded
It follows that the LSFs, $\theta_1$ & $\theta_2$, are given by:
$\cos(\theta_1) = -(a_1 + a_2 - 1)/2$ and $\cos(\theta_2) = -(a_1 - a_2 + 1)/2$
Show: $a_1 = -(\cos(\theta_1) + \cos(\theta_2))$ and $a_2 = \cos(\theta_2) - \cos(\theta_1) + 1$
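A minimal sketch of the second-order relations on this slide, in both directions. The function names and the example coefficients (a_1 = -1.2, a_2 = 0.6, a stable pair) are illustrative assumptions.

```python
import math


def lpc_to_lsf_2nd_order(a1, a2):
    """LSFs in radians for A(z) = 1 + a1 z^-1 + a2 z^-2."""
    theta1 = math.acos(-(a1 + a2 - 1) / 2)   # from the quadratic factor of P(z)
    theta2 = math.acos(-(a1 - a2 + 1) / 2)   # from the quadratic factor of Q(z)
    return theta1, theta2


def lsf_to_lpc_2nd_order(theta1, theta2):
    """Inverse mapping: recover a1 and a2 from the two LSFs."""
    a1 = -(math.cos(theta1) + math.cos(theta2))
    a2 = math.cos(theta2) - math.cos(theta1) + 1
    return a1, a2


theta1, theta2 = lpc_to_lsf_2nd_order(-1.2, 0.6)
a1, a2 = lsf_to_lpc_2nd_order(theta1, theta2)
```

For this example the two LSFs come out near 0.644 and 1.159 rad, one on each side of the angle of the root pair of A(z), and the inverse mapping returns the original coefficients.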

56
LSF Test Example
$A_1(z) = 1 + a_1 z^{-1} + a_2 z^{-2} = (z^2 + a_1 z + a_2) z^{-2} = (z^2 + 2\cos(\phi) w_n z + w_n^2) z^{-2}$
where $w_n$ is the radius and $\phi$ is the angle from $\pi$. So: radius $w_n = \sqrt{a_2}$ and $\theta = \pi - \phi$
Note: in P & Q all $w_n^2$ terms (of the multiple 2nd orders) are unity
EG 1: $a_2 = 1$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -a_1/2$: roots are already on the circle and do not move (unstable system, not practical)
EG 2: $a_1 = 0$, then $\cos(\theta_1) = -(a_1 + a_2 - 1)/2 = -(a_2 - 1)/2$ and $\cos(\theta_2) = -(a_1 - a_2 + 1)/2 = -(-a_2 + 1)/2$, so the LSFs are symmetric about $w_s/4$

57
LSF Review & Example (1)
LSFs/LSPs are defined as:
$P(z) = A(z) + z^{-(n+1)} A(z^{-1})$ and $Q(z) = A(z) - z^{-(n+1)} A(z^{-1})$
thus $A(z) = \{P(z) + Q(z)\}/2$

58
LSF Review & Example (2)
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1) z^{-3}$
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3} = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1) z^{-3}$
cf: $(s^2 + (2\cos(\phi) w_n) s + w_n^2)$

59
LSF Review & Example (3)
For a second order $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$:
$P(z) = (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1) z^{-3}$
$Q(z) = (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1) z^{-3}$
cf: $(s^2 + (2\cos(\phi) w_n) s + w_n^2)$
Thus: $(a_1 + a_2 - 1) = 2\cos(\phi_1) = -2\cos(\theta_1)$ & $(a_1 - a_2 + 1) = -2\cos(\theta_2)$
So, given:
i) the LPC coeffs. $a_1$ and $a_2$, then the LSFs $\theta_1$ & $\theta_2$ can be found
ii) the LSFs $\theta_1$ & $\theta_2$, then the LPC coeffs. $a_1$ and $a_2$ can be found
[Diagram: z-plane with the roots of P(z) at angles $\pm\theta_1$ and the roots of Q(z) at angles $\pm\theta_2$ on the unit circle]

60
LSF Review & Example (4)
For a second order, with P(z) corresponding to the first root and Q(z) to the second root, and for the second pair of $\theta_i$, 1.37 and 1.77:
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3}$
$= (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1) z^{-3}$
$= (z^2 - 2\cos(1.37) z + 1)(z + 1) z^{-3}$
$= (z^3 + (1 - 2\cos(1.37)) z^2 + (1 - 2\cos(1.37)) z + 1) z^{-3}$
Likewise
$Q(z) = 1 + a_1 z^{-1} + a_2 z^{-2} - (1 + a_1 z + a_2 z^2) z^{-3}$
$= (z^2 + (a_1 - a_2 + 1)z + 1)(z - 1) z^{-3}$
$= (z^2 - 2\cos(1.77) z + 1)(z - 1) z^{-3}$
$= (z^3 + (-1 - 2\cos(1.77)) z^2 + (1 + 2\cos(1.77)) z - 1) z^{-3}$
Then $A(z) = \{P(z) + Q(z)\}/2 = (z^3 - (\cos(1.37) + \cos(1.77)) z^2 + (1 - \cos(1.37) + \cos(1.77)) z) z^{-3}$
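A quick numerical check of this example, sketched by multiplying out the factors of P(z) and Q(z) for the LSF pair 1.37 and 1.77 and averaging them. The helper `poly_mul` is an illustrative name; coefficient lists run from the highest power of z down.

```python
import math


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, highest power first."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out


theta1, theta2 = 1.37, 1.77
P = poly_mul([1, -2 * math.cos(theta1), 1], [1, 1])    # (z^2 - 2cos(theta1) z + 1)(z + 1)
Q = poly_mul([1, -2 * math.cos(theta2), 1], [1, -1])   # (z^2 - 2cos(theta2) z + 1)(z - 1)
A = [(p + q) / 2 for p, q in zip(P, Q)]                # coefficients of z^3, z^2, z^1, z^0
a1, a2 = A[1], A[2]
```

The z^2 coefficient comes out as -(cos(1.37) + cos(1.77)), in agreement with a_1 = -(cos(theta_1) + cos(theta_2)), and the constant term cancels, leaving a_1 of roughly -0.002 and a_2 of roughly 0.603.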

61
LSF Examples
[Table: LPC coeffs. $a_1$, $a_2$ against LSFs $\theta_1$, $\theta_2$]

62
LSF Examples
[Table: LPC coeffs. $a_1$, $a_2$ against LSFs $\theta_1$, $\theta_2$]
$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$
$P(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + (1 + a_1 z + a_2 z^2) z^{-3}$
$= (z^2 + (a_1 + a_2 - 1)z + 1)(z + 1) z^{-3}$
$= (z^2 - 1.9z + 1)(z + 1) z^{-3}$
cf: $(z^2 + (2\cos(\phi) w_n) z + w_n^2)$
thus $\cos(\phi) = -1.9/2$, or $\phi = 2.824$, and $\theta_1 = \pi - \phi = 0.318$

63
Example Bit Allocation

64
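Codebooks & VQ
[Diagram: over time, each vector of p elements is looked up in a codebook of $N = 2^L$ vectors; only the index i (0 … N-1) is sent, and the receiver holds an identical book]
Data reduction: (p x B) to L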

65
Codebook Compression
Principle: representative data sets; a data vector is replaced / represented by the “nearest” vector, chosen from a “codebook”, a closed set of vectors
Examples: LPC parameter sets; excitation as in CELP
[Diagram: a codebook of $N = 2^k$ vectors, each with M elements, addressed by index i; codebooks can stand in for A(z) and for the excitation $e_n$ in the chain $e_n$ -> H(z) -> $s_n$]

66
Codebook Compression - CELP
[Diagram: a codebook of time-domain samples with $P = 2^j$ entries; the entry chosen by its start point supplies y ms of excitation $e_n$ to H(z), giving $s_n$]
$e_n$ are time-domain samples (integers), at R samples per second (eg 8000 Hz)
Frame rate governs vector size
$P = 2^j$; bit rate = j/y bits/ms
NB $e_n$ also includes gain

67
Codebook Compression of H(z)
[Diagram: A[z] at time t, a vector with M elements, is replaced by the index i of its nearest codebook entry, once every x ms along the time axis]
Vector with M elements, every x ms
Codebook with $N = 2^k$ vectors
Bit rate = k/x bits per ms (not a function of M)
In practice A[z] is converted to LSFs.
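A small worked example of the bit-rate formula, assuming a 1024-entry codebook and a 20 ms frame; both figures and the function name are illustrative, not values from the slides.

```python
import math


def vq_bit_rate_bps(codebook_size, frame_ms):
    """Bits per second when one index into N = 2**k vectors is sent every x ms."""
    k = math.log2(codebook_size)        # bits needed for one index
    return k / frame_ms * 1000          # k/x bits per ms, scaled to bits per second


rate = vq_bit_rate_bps(1024, 20)        # 10 bits every 20 ms -> 500.0 bps
```

The vector length M never enters the calculation, which is the point made on the slide: only the index is sent.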

68
Codebook Generation
1) Initialise: form a single centroid of all training data, N = 1
2) Repeat
     Split centroids: N -> 2N
     Repeat
       Cluster data to nearest centroid
     until convergence
   until N large enough
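A minimal sketch of the split-and-cluster procedure above. The splitting rule (a small multiplicative and additive perturbation `eps`), the squared Euclidean distance and the iteration cap are assumptions made for illustration; the slide does not specify them. The target size is assumed to be a power of two.

```python
def nearest(vector, centroids):
    """Index of the centroid closest to vector (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vector, cen)) for cen in centroids]
    return dists.index(min(dists))


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train_codebook(data, target_size, eps=0.01, max_iters=50):
    centroids = [mean(data)]                      # 1) single centroid, N = 1
    while len(centroids) < target_size:           # 2) repeat until N large enough
        # Split every centroid into a perturbed pair: N -> 2N
        centroids = [[c * (1 + s) + s for c in cen]
                     for cen in centroids for s in (eps, -eps)]
        for _ in range(max_iters):                # cluster until convergence
            clusters = [[] for _ in centroids]
            for vec in data:
                clusters[nearest(vec, centroids)].append(vec)
            # An empty cluster keeps its old centroid
            updated = [mean(cl) if cl else cen
                       for cl, cen in zip(clusters, centroids)]
            if updated == centroids:
                break
            centroids = updated
    return centroids


data = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
codebook = train_codebook(data, target_size=2)
```

Encoding a new vector is then a call to `nearest`, and only the returned index needs to be transmitted.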

69
VQ Performance on Unseen Data
[Figure from Ramachandran & Mammone (eds), ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995]

71
LPC & FFT Spectra
[Figure: waveform against time (ms); LPC and FFT spectra, magnitude (dB) against frequency (kHz), 0 to Fs/2; LPC roots listed as complex-conjugate pairs; LSFs: $\theta_1$ of P(z) and $\theta_2$ of Q(z)]

72
LPC Spectra & LSFs
[Figure: LPC spectra, magnitude (dB) against frequency (kHz), 0 to Fs/2; LPC roots listed as complex-conjugate pairs; LSFs: $\theta_1$ of P(z) and $\theta_2$ of Q(z)]

73
LPC & FFT Spectra - 2nd Order
[Figure: waveform against time (ms) and spectra against frequency (kHz), 0 to Fs/2, for a second-order A(z) with one complex-conjugate root pair]
$H(0) / H(w_s/2) = (K/0.274) / (K/3.38) = 12.3 = 21.8$ dB

74
GSM
Groupe Special Mobile - EU
First digital cellular system in the world (see Hodge 1990)
Based on TDMA & FDMA at 900 MHz, and RPE-LPC (ie it is an ‘LPAS’ system)
Now at 1800 MHz
Carriers at 200 kHz spacing, supporting 8 TDMA time slots each
Time slots: 577 µs, divided into bit slots
8 time slots form 1 GSM frame of 4.62 ms
Modulation: Gaussian minimum shift keying
26-bit training sequence in every time slot
Round-trip delay ~ 80 ms
EU: GSM; US: D-AMPS

75
Other Related Topics
Spectral lifting: $H(z) = 1 - a z^{-1}$
Codebook training
Spectral differences between 2 frames
Cepstra
Modelling speech space: HMMs

76
Pre-Emphasis Example
[Figure Q1: two waveform panels, (a) and (b), against time in ms]

77
Pre-Emphasis Example
[Diagram: z-plane (axes x, jy) with the zero at z = a; the distance from the zero to the point $w_s/2$ (z = -1) is 1 + a]
$G(w_s/2) = 1 + a$
$G(0) = 1 - a$
For $G(w_s/2) > G(0)$, a must be > 0
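A minimal sketch of the pre-emphasis filter, assuming the form H(z) = 1 - a z^-1 given for spectral lifting elsewhere in the deck. The value a = 0.95 is a commonly used figure and an assumption here, not taken from the slides.

```python
import cmath
import math


def pre_emphasis(samples, a=0.95):
    """y[n] = x[n] - a*x[n-1], ie H(z) = 1 - a z^-1, with x[-1] taken as 0."""
    previous = 0.0
    out = []
    for x in samples:
        out.append(x - a * previous)
        previous = x
    return out


def gain(a, w):
    """|H(e^{jw})| for H(z) = 1 - a z^-1, with w in radians per sample."""
    return abs(1 - a * cmath.exp(-1j * w))


g_dc = gain(0.95, 0.0)            # G(0) = 1 - a
g_nyquist = gain(0.95, math.pi)   # G(w_s/2) = 1 + a
```

With a positive a the gain at half the sampling frequency exceeds the gain at DC, so high frequencies are lifted relative to low ones.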

78
Z-plane to Magnitude Spectrum
[Figure: the zero at z = a plotted in the z-plane (real part against imaginary part), and the resulting magnitude (dB) against frequency (kHz), 0 to Fs/2]

79
LPC: Short and Long
The spectral envelope reflects the morphological characteristics of the vocal tract
[Diagram: production chain (air from the lungs -> vocal folds -> vocal tract -> speech) alongside its model (noise -> H1(z) -> H2(z) -> synthesised speech)]

80
ST & LT Prediction
[Diagram: speech $s_n$ -> STP filter 1 – A’(z) -> residual $e_n$ -> LTP filter 1 – A’(z) -> $e'_n$; each filter is a tapped delay line of $z^{-1}$ elements with weights $a_1$, ..., $a_i$, ..., $a_p$ whose weighted sum is subtracted from the filter input]
