Voice Transformation: Speech Morphing
Gidon Porat and Yizhar Lavner
SIPL – Technion IIT
December 1, 2002
Project Goals
Gradually change a source speaker's voice to sound like the voice of a target speaker.
The inputs: two reference voice signals, one for each speaker.
The output: N voice signals that gradually change from source to target.
[Figure: source → interpolated signals → target]
Applications
● Multimedia and video entertainment: voice morphing, just like its "face" counterpart. While watching a face gradually change from one person's to another's (as is often done in video clips), we could simultaneously hear the voice changing as well.
● Forensic voice identification by synthesis: identifying a suspect's voice by creating a voice bank of different pitch, rate and timbre. A similar method was developed for face recognition to replace police sketch artists.
The Challenges
The source and target voice signals will never be of the same length; a time-varying method is needed.
The source voice's characteristics must be changed to those of the target speaker: pitch, duration, spectral parameters.
The result must be a natural-sounding speech signal.
[Figure: source and target waveforms]
The Challenges, cont.
Here are two identical words ("shade") from the source and the target speakers. The target speaker's word lasts longer than the source speaker's, and its "shape" is quite different.
Speech Modeling
[Figure: concatenated tube model with areas A1, A2, A3, …, Ak]
Sound transmission in the vocal tract can be modeled as sound passing through concatenated lossless acoustic tubes. A known mathematical relation between the areas of these tubes and the vocal tract's filter will help in the implementation of our algorithm.
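The relation alluded to is the classical one: each junction between adjacent tubes yields a reflection coefficient, and a step-up recursion turns the reflection coefficients into LPC predictor coefficients. A minimal sketch (the sign convention for the reflection coefficients varies between texts; the one below is an assumption):

```python
import numpy as np

def areas_to_reflection(areas):
    """Reflection coefficients at the junctions of a lossless tube model.
    r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k); the sign convention varies by text."""
    A = np.asarray(areas, dtype=float)
    return (A[1:] - A[:-1]) / (A[1:] + A[:-1])

def reflection_to_lpc(refl):
    """Step-up (Levinson) recursion: reflection coefficients -> LPC coefficients."""
    a = np.zeros(0)
    for k in refl:
        # raise the predictor order by one: a_i(j) = a_{i-1}(j) + k * a_{i-1}(i-j)
        a = np.concatenate([a + k * a[::-1], [k]])
    return a
```

A uniform tube (all areas equal) gives all-zero reflection coefficients and hence an all-zero predictor, i.e. a flat spectral envelope, which is a quick sanity check on the conventions.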
Speech Modeling – Synthesis
The basic synthesis of digital speech is done by the discrete-time system model.
Prototype Waveform Interpolation
PWI is a speech coding method based on the representation of a speech signal, or its residual error function, by a 3-D surface. The creation of such a surface has three main steps: sample, align, interpolate.
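The three steps can be sketched as follows. Pitch marks are assumed given, each cycle is resampled to a fixed phase axis, and alignment is done by circular cross-correlation with the previous cycle; the function name, the fixed cycle length, and the FFT-based alignment are illustrative assumptions, not the project's exact implementation:

```python
import numpy as np

def build_pwi_surface(residual, pitch_marks, phi_len=64):
    """Stack pitch cycles of a residual into a (num_cycles, phi_len) surface:
    sample one cycle per pair of pitch marks, resample it to a common phase
    axis, and circularly shift it to best match the previous cycle."""
    cycles = []
    for s, e in zip(pitch_marks[:-1], pitch_marks[1:]):
        cyc = np.asarray(residual[s:e], dtype=float)
        # sample: resample this pitch cycle onto phi_len phase points
        x = np.linspace(0, len(cyc) - 1, phi_len)
        cyc = np.interp(x, np.arange(len(cyc)), cyc)
        if cycles:
            # align: circular cross-correlation with the previous cycle via the DFT
            prev = cycles[-1]
            c = np.fft.ifft(np.fft.fft(cyc) * np.conj(np.fft.fft(prev))).real
            cyc = np.roll(cyc, -int(np.argmax(c)))
        cycles.append(cyc)
    return np.vstack(cycles)
```

Stacking the aligned cycles along the time axis gives the 3-D surface (time, phase, amplitude) the slide describes; interpolating between the stacked rows would fill in the surface at intermediate times.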
The Solution
Use 3-D surfaces that capture the residual error signal characteristics of each voiced phoneme, and interpolate between the two speakers. Unvoiced phonemes are not dealt with, due to their complexity and the fact that they carry little information about the speaker.
The Algorithm
[Figure: Speaker A surface + Speaker B surface → intermediate surface]
Once the surfaces for both source and target speakers are created (for each phoneme), an interpolated surface is created. The new error function, reconstructed from that surface, will then be the input of a new vocal tract filter.
The Waveform – Intermediate
After creating a surface for each speaker's phoneme, an intermediate surface is created.
The Waveform – Reconstruction
The new, intermediate error signal is evaluated from the new surface by the PWI reconstruction equations, assuming the pitch cycle changes slowly in time.
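The equations themselves did not survive on this slide. In standard PWI (after Kleijn), which is presumably what was shown, the residual is read off the surface $u$ along a phase track:

```latex
e(n) = u\big(n, \varphi(n)\big), \qquad
\varphi(n) = \varphi(n-1) + \frac{2\pi}{p(n)} \pmod{2\pi},
```

where $p(n)$ is the pitch period in samples. Assuming the pitch cycle changes slowly in time, $u(n,\cdot)$ at an arbitrary instant is obtained by linear interpolation between the two stored pitch cycles adjacent to time $n$.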
The Algorithm, cont.
The areas of the new tube model are an interpolation between the source's and the target's. Once the new areas are computed, the LPC parameters and V(z) can be calculated, and the signal can be synthesized.
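The area interpolation itself is a per-tube weighted average. Linear interpolation of the raw areas is an assumption here; interpolating log-areas would be an equally plausible reading of the slide:

```python
import numpy as np

def interpolate_areas(areas_src, areas_tgt, alpha):
    """Tube areas of the intermediate vocal-tract model:
    alpha = 0 gives the source's areas, alpha = 1 the target's."""
    return ((1.0 - alpha) * np.asarray(areas_src, dtype=float)
            + alpha * np.asarray(areas_tgt, dtype=float))
```

The interpolated areas then feed the area-to-reflection-to-LPC relation of the tube model, giving the new V(z).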
New Voiced Phoneme Creation – Block Diagram
The Transfer Function
The morphing factor α could be invariant, in which case the voice produced will be an intermediate between the two speakers, or it could vary in time (from α = 0 at t = 0 to α = 1 at t = T), yielding a gradual change from one voice to the other. For a listener to hear a "linear" change between the source's and the target's voices, the coefficient α (the relative part of the target) has to vary in time nonlinearly, with a fast transition toward α = 0.5 and slow transitions around it.
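One schedule with that shape moves quickly toward α = 0.5 near the endpoints and slowly around the midpoint. The odd power-law below and the exponent p are illustrative assumptions, not the project's actual mapping:

```python
import numpy as np

def morphing_factor(t, T, p=3.0):
    """Nonlinear morphing schedule alpha(t) on [0, T]:
    alpha(0) = 0, alpha(T/2) = 0.5, alpha(T) = 1, with a small slope
    around t = T/2 (slow near 0.5) and a large slope near the endpoints."""
    x = 2.0 * t / T - 1.0                      # map [0, T] -> [-1, 1]
    return 0.5 * (1.0 + np.sign(x) * np.abs(x) ** p)
```

With p > 1 the derivative vanishes at the midpoint and grows toward the ends, which is the "fast toward 0.5, slow around it" behavior the slide calls for; p = 1 recovers a plain linear ramp.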
The Algorithm, cont.
The final, new speech signal is created by concatenating the new voiced phonemes, in order, along with the source's/target's unvoiced phonemes and silent periods.
[Figure: segment sequence – 1 silent, 2 voiced, 3 unvoiced, 4 voiced]
Conclusion
The utterances produced by the algorithm were shown to contain intermediate features of the two speakers. Both intermediate sounds and gradually changing sounds were produced.
Algorithm Limitations
Although most of a speaker's individuality is concentrated in the voiced speech segments, degradation can be noticed when interpolating between two speech signals that differ greatly in their unvoiced speech segments, such as heavy breathing, long periods of silence, etc.
What was done in the second part of this project?
Basic implementation of the reconstruction algorithm.
Interpolation between two surfaces.
Final implementation of the algorithm.
Fixes in the algorithm, such as:
– Maximization of the cross-correlation between surfaces.
– Modifications to the morphing factor.
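The cross-correlation fix can be sketched as aligning the two surfaces along the phase axis before interpolating, so corresponding pulse shapes overlap. The brute-force search below is an illustrative assumption (an FFT-based search would be equivalent and faster):

```python
import numpy as np

def align_surfaces(surf_a, surf_b):
    """Circularly shift surface B along its phase axis to maximize its
    cross-correlation with surface A. Both surfaces are assumed to
    already share the same (time, phase) shape."""
    best_shift, best_corr = 0, -np.inf
    for s in range(surf_b.shape[1]):
        corr = np.sum(surf_a * np.roll(surf_b, s, axis=1))
        if corr > best_corr:
            best_corr, best_shift = corr, s
    return np.roll(surf_b, best_shift, axis=1), best_shift
```

Without this alignment, interpolating two misaligned surfaces averages out the pulse structure of the residual, which is one plausible source of the artifacts the fix addresses.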
Future Work Proposals
The effect of the unvoiced/silent segments of speech on voice individuality.
The effect of shifting the vocal tract's formants around the unit circle on the human ear, in order to find a robust mapping of the transform function α.