Contents: Conversion Scheme; Analysis; Speech Production Model; Transformation; Preprocessing; Synthesis; Results, Conclusions & Future Plans
Conversion Scheme. Pipeline: Source Speaker → Speech Analysis → Source Parameters → Transformation Function → Target Parameters → Speech Synthesis → Target Speaker. The scheme requires a robust parameterization of speech. The transformation is performed on-line, based on a prior off-line coordination of training data, via codebooks, histogram equalization, or neural networks.
Conversion Scheme – Analysis: Speech Production Model. The model – glottal pulse model G(z), vocal tract model V(z), radiation model R(z), with a voiced/unvoiced excitation switch – was derived from the analytical solution of the acoustic speech-production equations. Voiced excitation: an impulse train at the pitch period, passed through the glottal pulse model. Unvoiced excitation: white random noise. Vocal tract model: a linear all-pole filter, varying slowly in time relative to the pitch period. Radiation model: models radiation at the lips; a differentiation filter with constant parameters.
Conversion Scheme – Analysis: Source Parameters Estimation. Pipeline: signal cleaning → phoneme segmentation → pitch estimation → LPC estimation → LSP conversion → glottal pulse estimation → global parameter estimation. Signal cleaning: noise reduction using the signal's energy and zero-crossing computation. Phoneme segmentation: manual or semi-automatic (using energy, zero-crossings and pitch), or automatic using Hidden Markov Models. Pitch estimation: evaluation of each phoneme's pitch contour. LPC estimation: calculation of the linear prediction coefficient set for each phoneme. LSP conversion: calculation of the Line Spectrum Pairs corresponding to each work frame. Glottal pulse parameter estimation: calculation for the corresponding work frames of the phoneme. Global parameter estimation: phoneme characteristics such as duration and global LSP.
Conversion Scheme – Transformation. The source parameters (pitch, glottal pulse parameters, LSP, duration) are mapped to the corresponding target parameters through the transformation function, using the source and target codebooks. Find the source codeword closest to the phoneme's LSP (given the distance measure); there is a one-to-one correspondence between source and target codebook entries. Transform the phoneme's duration according to the average source and target durations of the corresponding codeword. For each work frame, transform the LSP through secondary one-to-one source-target LSP codebooks corresponding to the n-th codewords of the primary books. For each work frame, transform the pitch and energy through histogram equalization, using the source and target histograms of the n-th codeword. The residue is substituted by the one corresponding to the target LSP, obtained via the secondary codebooks.
Conversion Scheme – Transformation: Codebook Creation (Training Stage). Given identical source and target utterances, the phoneme coordination is done manually (with the aid of a preliminary phoneme segmentation) or using HMM. For each phoneme the LSP and duration are extracted; for each work frame of every phoneme the LSP, residue, pitch and energy are extracted. Vector quantization is performed on the source phonemes' LSP, clustering similar phonemes. The LSP of the target phonemes corresponding to the source LSP in each quantization region are clustered to obtain the primary codebook, with the centroids of the phonemes' LSP as codewords. Averaging the source and target phoneme durations in each quantization region gives the codebook for phoneme durations. Source-target coordination at the work-frame level is achieved using Dynamic Time Warping; for each primary codeword, the itemized LSP pairs of the corresponding phonemes establish the secondary codebook. For each primary codeword, the pitch and energy information of every work frame of the corresponding phonemes is used to create source and target histograms. The normalized residues corresponding to the itemized LSP are kept as well.
Conversion Scheme – Synthesis: Target Speech Production. For each phoneme, the excitation of each work frame is (according to the model) either (1) an impulse pair with the given pitch and energy (voiced), or (2) the residue interpolated/decimated to a two-pitch length. The work frames are linearly interpolated according to the duration. The speech is produced by exciting the prediction filter V(z) with the corresponding coefficients.
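The final step, exciting the all-pole prediction filter with the generated excitation, can be sketched as a direct-form recursion. This is an illustrative minimal implementation, not the project's actual code:

```python
def synthesize(excitation, a, gain=1.0):
    """All-pole LPC synthesis: y[n] = G*e[n] - sum_{k>=1} a[k]*y[n-k],
    where a = [1, a1, ..., ap] are the prediction coefficients."""
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y

# A single impulse through a one-pole filter yields its impulse response.
impulse = [1.0] + [0.0] * 7
```

For voiced frames the excitation list would hold the pitch-spaced impulse pairs; for unvoiced frames, noise or the stored residue.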
Vocal Results. Vocal coding: source (S) and target (T) utterance examples. Conversion: source-to-target examples, compared across codebook variants (no codebook / phoneme codebook / clustered codebook) and excitation variants (non-modified pitch excitation, modified pitch excitation, residue excitation).
Conclusions. The parametric approach with codebooks attains waveform coding at about 5600 bps. The training-stage phoneme clustering allows global parameter (pitch, duration) conversion, and balances between a global work-frame search and single-phoneme correspondence. LSP conversion alone fails to capture significant voice characteristics. The quality difference between conversion based on the Euclidean and the Itakura-Saito distances is insignificant.
Future plans. The parametric approach limits the optimal conversion to 5600 bps quality. Improve the parametric model (GPP), or use non-parametric conversion with a residue codebook (CELP). A better clustering method (other than VQ) may improve global parameter conversion as well as phoneme recognition. Improve the LSP transformation and interpolation.
Conversion Scheme – Transformation: Dynamic Time Warping. DTW determines the optimal least-cost path through the grid, minimizing the sum of the visited nodes' costs. For a given phoneme we set the target work-frame parameters (LPC or LSP) along the i axis and the source parameters along the j axis; the node cost is the distance between the corresponding source and target parameters. Path constraints avoid distortion by forcing time to advance, with a limited stretching/contraction ratio. The optimal path determines the desired alignment through its node pairs.
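As a minimal sketch of the least-cost DTW alignment described above, using the basic symmetric local step (i-1,j), (i,j-1), (i-1,j-1) as the path constraint (the actual slope limits used in the work may differ):

```python
def dtw(source, target, dist=lambda x, y: abs(x - y)):
    """Least-cost DTW alignment between two work-frame parameter sequences.
    Local step constraint: (i-1,j), (i,j-1) or (i-1,j-1), so time always
    advances on at least one axis."""
    I, J = len(source), len(target)
    INF = float("inf")
    cost = [[INF] * (J + 1) for _ in range(I + 1)]
    cost[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # backtrack: follow the cheapest predecessor to recover node pairs
    path, i, j = [], I, J
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return cost[I][J], path[::-1]
```

The returned path is the list of aligned (source frame, target frame) index pairs used to build the secondary codebooks.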
Conversion Scheme – Transformation: Vector Quantization. VQ subdivides the space into quantization regions, each represented by a code vector. Given a set of LSP vectors (the training sequence), we find the code vectors and quantization regions that result in the smallest average distance d (Euclidean or Itakura-Saito). We use the LBG algorithm, with PNN initialization for the Euclidean distance and random initialization for the I-S distance.
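A minimal sketch of LBG codebook design with the Euclidean distance; for brevity the PNN/random initialization mentioned above is replaced by simple mean-splitting, and the codebook size is assumed to be a power of two:

```python
def lbg(training, n_codewords, iters=20):
    """LBG: grow the codebook by splitting each codeword in two, then
    refine with Lloyd iterations (nearest-codeword partition followed by
    a centroid update). Assumes n_codewords is a power of two."""
    dim = len(training[0])
    # start from the global centroid of the training sequence
    book = [[sum(v[d] for v in training) / len(training) for d in range(dim)]]
    while len(book) < n_codewords:
        # split every codeword into two slightly perturbed copies
        book = [[c + eps for c in cw] for cw in book for eps in (1e-3, -1e-3)]
        for _ in range(iters):
            cells = [[] for _ in book]
            for v in training:
                i = min(range(len(book)),
                        key=lambda k: sum((x - y) ** 2
                                          for x, y in zip(v, book[k])))
                cells[i].append(v)
            for i, cell in enumerate(cells):      # centroid update
                if cell:
                    book[i] = [sum(v[d] for v in cell) / len(cell)
                               for d in range(dim)]
    return book
```

In the training stage described earlier, the training vectors would be the source phonemes' LSP and the resulting centroids the primary codewords.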
Conversion Scheme – Analysis: LSP Conversion. Given the LPC polynomial A(z), define the sum and difference polynomials P(z) = A(z) + z^-(p+1) A(z^-1) and Q(z) = A(z) - z^-(p+1) A(z^-1); the LSP are the positive angles of the roots of P(z) and Q(z). For a stable vocal filter, the roots of P and Q lie on the unit circle and are interleaved. Close P-Q pairs correspond to dominant formants (poles of the vocal filter V(z)). Advantages of the LSP representation: robustness to errors; support of inter-vector operations.
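The LSP computation can be sketched numerically: form P(z) and Q(z) from the LPC coefficients, exploit their (anti)symmetry to obtain real-valued functions on the unit circle, and locate sign changes. This is an illustrative grid-search implementation, not the method used in the work:

```python
import math

def lpc_to_lsf(a, grid=4096):
    """LSP/LSF from LPC a = [1, a1, ..., ap]: build the sum polynomial
    P(z) = A(z) + z^-(p+1) A(1/z) (palindromic coefficients) and the
    difference polynomial Q(z) = A(z) - z^-(p+1) A(1/z) (anti-palindromic),
    then locate their unit-circle roots in (0, pi) by scanning for sign
    changes of the real functions implied by the coefficient symmetry."""
    p = len(a) - 1
    ext = list(a) + [0.0]
    P = [ext[k] + ext[p + 1 - k] for k in range(p + 2)]
    Q = [ext[k] - ext[p + 1 - k] for k in range(p + 2)]
    half = (p + 1) / 2.0

    def on_circle(coefs, use_cos, w):
        # e^{j w (p+1)/2} C(e^{-jw}) is real (cosine terms) for palindromic
        # C and purely imaginary (sine terms) for anti-palindromic C
        f = math.cos if use_cos else math.sin
        return sum(c * f((half - k) * w) for k, c in enumerate(coefs))

    roots = []
    for coefs, use_cos in ((P, True), (Q, False)):
        ws = [math.pi * (i + 0.5) / grid for i in range(grid)]
        vals = [on_circle(coefs, use_cos, w) for w in ws]
        for i in range(grid - 1):
            if vals[i] * vals[i + 1] < 0:      # sign change: interpolate
                t = vals[i] / (vals[i] - vals[i + 1])
                roots.append(ws[i] + t * (ws[i + 1] - ws[i]))
    return sorted(roots)
```

For a stable A(z) the sorted angles alternate between P-roots and Q-roots, illustrating the interleaving property stated above.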
Conversion Scheme – Transformation: Speech Distance Measures. Euclidean: the squared distance between the source and target LSP vectors, d_E = sum_i (w_s,i - w_t,i)^2. Itakura-Saito (gain-normalized), in matrix notation: d_IS(a_s, a_t) = (a_t' R_s a_t) / (a_s' R_s a_s), where a_s, a_t are the LPC vectors and R_s is the covariance matrix of the source process excited by normalized white noise. Motivation: the error variance of any random process passed through the error filter A(z) is sigma^2 = a' R a.
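A small sketch of the two measures, assuming the gain-normalized distance takes the log-ratio form of the prediction-error variances a'Ra (a concrete reading of the slide's matrix notation; the deck's exact formula was lost in extraction):

```python
import math

def euclidean_lsp(w_src, w_tgt):
    """Squared Euclidean distance between source and target LSP vectors."""
    return sum((s - t) ** 2 for s, t in zip(w_src, w_tgt))

def itakura(a_src, a_tgt, R):
    """Gain-normalized Itakura-style distance: log-ratio of the
    prediction-error variances a'Ra of the two LPC vectors, both taken
    over the source covariance matrix R."""
    def err_var(a):
        return sum(a[i] * sum(R[i][j] * a[j] for j in range(len(a)))
                   for i in range(len(a)))
    return math.log(err_var(a_tgt) / err_var(a_src))
```

The distance is zero when the two LPC vectors coincide and grows as the target filter whitens the source process less well than the source's own filter.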
Conversion Scheme – Transformation: Histogram Equalization. [Figure: source and target pitch histograms.] Given the histograms, we calculate the source and target histogram-equalization (cumulative) functions H_s and H_t. Given a source pitch value p_s, the target pitch value is calculated by p_t = H_t^-1(H_s(p_s)).
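The pitch mapping through histogram equalization can be sketched with empirical cumulative histograms; this illustrative version works directly on the samples rather than on binned histograms:

```python
import bisect

def make_pitch_converter(source_pitches, target_pitches):
    """Return f such that f(p) has the same cumulative-histogram rank
    among the target pitches as p has among the source pitches:
    p_target = H_target^-1(H_source(p))."""
    src = sorted(source_pitches)
    tgt = sorted(target_pitches)

    def convert(p):
        rank = bisect.bisect_left(src, p) / len(src)   # H_source(p)
        idx = min(int(rank * len(tgt)), len(tgt) - 1)  # H_target^-1
        return tgt[idx]
    return convert
```

In the conversion scheme, one such converter would be built per primary codeword from that codeword's source and target pitch (and energy) histograms.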
Conversion Scheme – Analysis: Pitch Estimation. Segmentation: the utterance is split into segments of constant length and overlap; for each segment a pitch value and a voiced/unvoiced decision are determined. Initialization: set two adjacent sub-segments of an arbitrary minimal length, the estimated pitch period. Calculation: increase the sub-segments' length, calculating their cross-correlation for each length; stop at an arbitrary maximal length. Determination: the pitch period is the length of the segment with the maximal cross-correlation value; the cross-correlation must reach a given threshold, otherwise the segment is classified as unvoiced.
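The adjacent-segment search above can be sketched as follows; the threshold value 0.5 is an assumed placeholder, not the one used in the work:

```python
import math

def estimate_pitch(x, min_len, max_len, threshold=0.5):
    """Grow two adjacent segments x[0:L] and x[L:2L] from min_len up to
    max_len, and pick the L maximizing their normalized cross-correlation.
    If the best value stays below the threshold the segment is declared
    unvoiced and None is returned."""
    best_len, best_cc = None, -1.0
    for L in range(min_len, max_len + 1):
        if 2 * L > len(x):
            break
        a, b = x[:L], x[L:2 * L]
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        cc = num / den if den else 0.0
        if cc > best_cc:
            best_cc, best_len = cc, L
    return best_len if best_cc >= threshold else None
```

When the two sub-segments span exactly one period each, their normalized cross-correlation peaks, which is what the determination step exploits.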
Conversion Scheme – Analysis: LPC Estimation. Segmentation: a work frame is a segment of twice the pitch period for voiced speech, or of constant duration for unvoiced speech; the segments overlap by half. Pre-emphasis: a constant-parameter HPF compensating for the spectral tilt due to lip radiation. Windowing: multiply each segment by a Hamming window; overlapping Hamming windows are approximately rectangular. Calculation: the gain and the denominator coefficients are estimated using linear prediction methods; the spectral envelope of V(f) matches the signal's FFT.
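The per-frame chain (pre-emphasis, Hamming window, autocorrelation, Levinson-Durbin) can be sketched as below; the pre-emphasis coefficient 0.95 is an assumed typical value, not taken from the work:

```python
import math

def lpc(frame, order, pre=0.95):
    """Autocorrelation-method LPC for one work frame: pre-emphasis HPF,
    Hamming window, autocorrelation, Levinson-Durbin recursion.
    Returns (a, e): a = [1, a1, ..., ap] and the final error power e."""
    # pre-emphasis: constant-parameter HPF against lip-radiation tilt
    x = [frame[0]] + [frame[n] - pre * frame[n - 1]
                      for n in range(1, len(frame))]
    N = len(x)
    # Hamming window
    x = [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)))
         for n, s in enumerate(x)]
    # autocorrelation lags 0..order
    r = [sum(x[n] * x[n + k] for n in range(N - k))
         for k in range(order + 1)]
    # Levinson-Durbin recursion
    a, e = [1.0] + [0.0] * order, r[0]
    for m in range(1, order + 1):
        k = -sum(a[i] * r[m - i] for i in range(m)) / e
        a = [a[i] + k * a[m - i] for i in range(m + 1)] + a[m + 1:]
        e *= 1 - k * k
    return a, e
```

The autocorrelation method guarantees a minimum-phase A(z), i.e. a stable synthesis filter V(z) = G/A(z).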
Conversion Scheme – Analysis: GPP Estimation. We use two methods of residue coding: (1) full residue preservation, obtained by passing the speech segment through the prediction-error filter A(z); (2) residue energy only, where the excitation is a pitch train (voiced) or noise (unvoiced).
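Full residue extraction by inverse filtering can be sketched as follows; with the true LPC of the frame, this recovers the excitation exactly:

```python
def prediction_error(frame, a):
    """Inverse (prediction-error) filtering: e[n] = sum_k a[k] * x[n-k],
    with a = [1, a1, ..., ap]."""
    return [sum(a[k] * frame[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(frame))]
```

For the energy-only variant, one would keep just the residue's energy and regenerate the excitation as a pitch train or noise at synthesis time.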