Auditory Morphing
Weyni Clacken, wc2121@columbia.edu
Speech and Audio Processing & Recognition

Objective
Simulate a person’s voice and speech characteristics by altering the parameters of another person’s recorded speech.
Identify the parameters that most affect how closely the synthesized speech resembles a particular speaker.
Manipulate these parameters and observe the effects.

Introduction
Auditory morphing transforms one speech example into another in a parameterized manner.
STRAIGHT is a versatile speech manipulation tool invented by Hideki Kawahara while he was at ATR (Advanced Telecommunications Research Institute, in Japan).
STRAIGHT is based on a simple channel vocoder. It decomposes input speech signals into source parameters and spectral parameters.
Four main types of vocoder arose: channel, homomorphic, formant and phase.
Channel vocoders:
  Split the modulator (formant) and carrier into frequency bands.
  For each frequency band, find the volume of the modulator band and modulate the carrier band with that volume.
  Mix the bands back together to form the output.
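The per-band step of a channel vocoder can be sketched as below. This is a minimal illustration, not STRAIGHT's implementation: it assumes the modulator and carrier have already been split into equal-length frequency bands by some filter bank, and the names `band_envelope`, `channel_vocode` and `frame_len` are hypothetical. The "volume" of a band is taken here to be its frame-wise RMS level, which is one possible choice.

```python
import math

def band_envelope(band, frame_len):
    """RMS level ("volume") of each frame of one frequency band."""
    levels = []
    for start in range(0, len(band), frame_len):
        frame = band[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return levels

def channel_vocode(modulator_bands, carrier_bands, frame_len=4):
    """Modulate each carrier band by its modulator band's level, then mix.

    Assumes every band is a list of samples of the same length and that
    band splitting (the filter bank) has been done beforehand.
    """
    n = len(carrier_bands[0])
    output = [0.0] * n
    for mod_band, car_band in zip(modulator_bands, carrier_bands):
        levels = band_envelope(mod_band, frame_len)
        for i in range(n):
            # Scale the carrier sample by the modulator level of its frame,
            # and accumulate into the mixed output.
            output[i] += car_band[i] * levels[i // frame_len]
    return output

# Two bands: the modulator is active in band 0 and silent in band 1,
# so only the band-0 carrier survives in the mix.
modulator = [[1.0] * 8, [0.0] * 8]
carrier = [[0.5] * 8, [0.25] * 8]
mixed = channel_vocode(modulator, carrier)
```

A shorter `frame_len` tracks the modulator's volume more closely in time, at the cost of a noisier envelope.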

Project Outline
Learn the characteristics of the target’s speech (specific sentences, used as training data).
Extract speech parameters using STRAIGHT.
Have the source repeat the same utterances and obtain their characteristics.
Formulate the morphing algorithm (mapping one point in feature space to another).
Train the system by matching the source’s characteristics to the target’s, and use dynamic time warping (DTW) to verify accuracy.

Project Outline (cont)
Obtain a speech signal from the source (a new sentence).
Modify the source’s speech signal to match the target’s characteristics.
Modify parameters: F0, frequency, temporal, etc.
Use STRAIGHT to synthesize speech using the new parameters.
Compare the synthesized speech with the true target. Use DTW to confirm accuracy: the resulting path should be a straight diagonal line through the similarity matrix.
Verify comparable synthesis through human recognition.

Sample Analysis from STRAIGHT

Project Details – Parameters
Morphing the speech toward a given target will be done through a combination of parameters (F0, frequency and temporal axes).
Each parameter has its own impact, but modifying any one of them alone does not prove successful.
For example, given a signal, we can scale the F0 to 0.3 times the original [Example 1] or to 3 times the original [Example 2].
[Original] [Example 1] [Example 2]
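The F0 scaling in the example can be sketched as follows. This is an illustration under stated assumptions rather than STRAIGHT's actual interface: the F0 contour is assumed to be a list of per-frame values in Hz, with 0 marking unvoiced frames, and `scale_f0` is a hypothetical helper. The contour values are made up for the example.

```python
def scale_f0(f0_contour, factor):
    """Scale voiced F0 values (Hz) by `factor`; unvoiced frames (0) stay 0."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [f0 * factor if f0 > 0 else 0.0 for f0 in f0_contour]

# Hypothetical contour: frames 0 and 3 are unvoiced.
f0 = [0.0, 120.0, 125.0, 0.0, 130.0]
lowered = scale_f0(f0, 0.3)  # F0 at 0.3 times the original
raised = scale_f0(f0, 3.0)   # F0 at 3 times the original
```

Only the source parameter changes here; the spectral and temporal parameters would be left untouched, which is consistent with the observation that F0 modification alone is not enough to reach a target speaker.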

Project Details - DTW
Speakers talk at different rates, so we have to align the speech signals in time. DTW may be able to help us achieve this.
DTW may also help us identify when the synthesis has reached reasonable accuracy.
Does DTW vary with the number of words? No, although differences do appear if one signal is not the same length as the other.
DTW is an algorithm that has been used in the past for voice recognition projects as well as for lining up signals to be compared.
The theory is based on creating a 2-D matrix with the reference signal on one axis and the test signal on the other. For each cell in the matrix, the "distance" is calculated between the coefficients of the corresponding frames of the reference and test signals.
Once the entire distance matrix has been filled, the next step is to find the lowest cumulative distance path (CDP) from the lower left-hand corner cell (representing the beginning of the signals) to the upper right-hand corner cell (representing the end of the signals).
If the test and reference signals were identical, the lowest CDP would be the y=x line (since there would be zero difference between the linear predictive coefficients of each frame). If they are not equal, the path shows which frames need to be picked from the test signal in order to map it onto the reference signal.
Filling the distance matrix is simple enough, but finding the lowest-distance path can become tedious and computationally expensive for long signals.
The algorithm is also used for word recognition: by comparing the cumulative distance in the upper right-hand corner to a threshold, one can check whether the test and reference words are in fact similar.
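A minimal sketch of the DTW procedure described above, assuming each signal has already been reduced to a sequence of per-frame coefficient vectors (for example linear predictive coefficients). The Euclidean frame distance, the three allowed steps (diagonal, vertical, horizontal) and the function names are illustrative choices, not necessarily those used in the project.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature frames (e.g. LPC vectors)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(reference, test):
    """Return (cumulative distance, warping path) for two frame sequences."""
    n, m = len(reference), len(test)
    inf = float("inf")
    # cost[i][j]: lowest cumulative distance of any path that ends by
    # pairing reference[i-1] with test[j-1]; row/column 0 is padding.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(reference[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # diagonal step
                                 cost[i - 1][j],      # advance reference only
                                 cost[i][j - 1])      # advance test only
    # Backtrack from the end of both signals to the beginning.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path

def is_similar(reference, test, threshold):
    """Word-recognition style check: is the final cumulative distance small?"""
    total, _ = dtw(reference, test)
    return total <= threshold

# Identical signals give the diagonal (y = x) path with zero distance.
ref = [[0.0], [1.0], [2.0]]
same_total, same_path = dtw(ref, ref)

# A test signal that lingers on its first frame is warped onto the reference.
slow = [[0.0], [0.0], [1.0], [2.0]]
slow_total, slow_path = dtw(ref, slow)
```

Filling the matrix takes time and memory proportional to the product of the two sequence lengths, which is why long signals become expensive. The threshold passed to `is_similar` would have to be tuned on real data.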

DTW results using the same signal

DTW results using different signals

Challenges/Concerns
What is a good frame size?
Is there an optimal general-purpose morphing method?
Is there any way to validate the synthesis without comparing the synthesized results with the actual target’s speech?
Can we quantify our results?
Is it more difficult to go from male to female or from female to male?

Applications
Audio recording
Editing movie clips
Archiving tapes
Toys and gaming machines

Questions

Auditory Morphing
This sound clip is taken from Kawahara’s website introducing STRAIGHT (male voice morphed to female voice).
