Auditory Morphing Weyni Clacken


Auditory Morphing Weyni Clacken wc2121@columbia.edu Speech and Audio Processing & Recognition

Objective To simulate a person's voice and speech characteristics by altering the parameters of another person's recorded speech. Identify the parameters that have the greatest effect on making synthesized speech resemble a particular speaker. Manipulate these parameters and observe the effects.

Introduction Auditory morphing transforms one speech example into another in a parameterized manner. STRAIGHT is a versatile speech manipulation tool invented by Hideki Kawahara while he was at ATR (the Advanced Telecommunications Research Institute International, in Japan). STRAIGHT is based on a simple channel vocoder: it decomposes input speech signals into source parameters and spectral parameters. Four main types of vocoders arose: channel, homomorphic, formant, and phase. Channel vocoders: Split the modulator (formant) and carrier into frequency bands. For each frequency band, find the volume of the modulator band and modulate the carrier band with that volume. Mix the bands back together to form the output.
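The channel vocoder steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not STRAIGHT's actual implementation: the band edges, envelope smoother, and function names are my own choices, and the FFT-masking bandpass is a crude stand-in for a real filter bank.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude zero-phase bandpass: zero every FFT bin outside [lo, hi) Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def channel_vocode(modulator, carrier, fs, n_bands=8, f_lo=100.0, f_hi=3800.0):
    """Split both signals into log-spaced bands, impose the modulator's
    per-band envelope ("volume") on the carrier band, and mix the bands
    back together to form the output."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    win = max(1, int(0.01 * fs))             # ~10 ms envelope smoother
    kernel = np.ones(win) / win
    out = np.zeros(len(carrier))
    for lo, hi in zip(edges[:-1], edges[1:]):
        mod_band = bandpass_fft(modulator, fs, lo, hi)
        car_band = bandpass_fft(carrier, fs, lo, hi)
        # Volume of the modulator band: smoothed rectified amplitude.
        envelope = np.convolve(np.abs(mod_band), kernel, mode="same")
        out += envelope * car_band           # modulate carrier, mix into output
    return out
```

Driving the carrier with noise or a buzz source reproduces the classic "robot voice" effect; STRAIGHT refines this basic scheme with much more careful source and spectral estimation.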

Project Outline Learn the characteristics of the target speech (specific sentences used as training data). Extract the parameters of the speech using STRAIGHT. Have the source repeat the same utterances and obtain their characteristics. Formulate a morphing algorithm (mapping one point in feature space to another). Train the system by matching the source's characteristics to the target's, and use DTW to verify accuracy.

Project Outline (cont) Obtain a speech signal from the source (a new sentence). Modify the source's speech signal to match the target's characteristics. Modify parameters: F0, frequency, temporal, etc. Use STRAIGHT to synthesize speech using the new parameters. Compare the synthesized speech with the true target. Use DTW to confirm accuracy: the resulting path should be a straight diagonal line through the similarity matrix. Verify comparable synthesis through human recognition.
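The diagonal-path criterion in the last step can be checked mechanically. A small sketch (the helper name and tolerance are my own, not part of any standard DTW library):

```python
def path_is_near_diagonal(path, tol=2):
    """Return True if every (i, j) step of a DTW warping path stays
    within `tol` frames of the y = x diagonal."""
    return all(abs(i - j) <= tol for i, j in path)

# A perfectly aligned synthesis would produce the identity path:
ideal = [(k, k) for k in range(100)]
# A synthesis spoken at twice the rate would drift off the diagonal:
skewed = [(k, min(99, 2 * k)) for k in range(100)]
```

In practice a tolerance of a few frames absorbs small timing jitter while still flagging genuine rate mismatches.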

Sample Analysis from STRAIGHT

Project Details – Parameters Morphing the speech to a given target will be done through a combination of parameters (F0, frequency, and temporal axes). Each parameter has its own impact, and on its own does not produce a successful morph. For example, given a signal, we can modify the F0 to 0.3 times the original [example 1] or to 3 times the original [example 2] [Original] [Example 1] [Example 2]
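STRAIGHT represents F0 as a per-frame contour, so the scaling just described amounts to multiplying that contour before resynthesis. A minimal sketch (the function name and the contour values are illustrative, not STRAIGHT's API):

```python
import numpy as np

def scale_f0(f0_contour, factor):
    """Scale a per-frame F0 contour. Frames marked unvoiced (F0 == 0)
    must stay unvoiced, so they are left at zero."""
    f0 = np.asarray(f0_contour, dtype=float)
    return np.where(f0 > 0, f0 * factor, 0.0)

f0 = np.array([0.0, 120.0, 125.0, 130.0, 0.0])  # Hz; 0 marks unvoiced frames
lower = scale_f0(f0, 0.3)    # deeper-sounding voice, as in example 1
higher = scale_f0(f0, 3.0)   # higher-sounding voice, as in example 2
```

The same pattern extends to the other axes: warping the frequency axis of the spectral envelope shifts formants, and warping the time axis changes the rate of articulation.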

Project Details - DTW Users speak at different rates, so we have to align the speech signals in time. DTW may be able to help us achieve this, and may also help us identify when we have reasonable accuracy in synthesis. Does DTW vary with the number of words? No, though differences do appear if one signal is not the same length as the other.

DTW is an algorithm that has been used in past voice recognition projects, as well as for lining up signals to be compared. The idea is to build a 2-D matrix with the reference signal on one axis and the test signal on the other. For each cell in the matrix, the "distance" is calculated between the coefficients of the corresponding frames of the reference and test signals. Once the entire distance matrix has been filled, the next step is to find the lowest cumulative distance path from the lower left-hand corner cell (representing the beginning of both signals) to the upper right-hand corner cell (representing the end). If the test and reference signals were identical, the lowest-cost path would be the y = x line (since there would be zero difference between the linear predictive coefficients of each frame). If they are not equal, the path shows which cells need to be picked from the test signal in order to map it to the original reference signal. Computing the distance matrix is simple enough, but finding the lowest-cost path can become tedious and computationally expensive when large signals are used.

The algorithm can also be used for word recognition: by comparing the value in the upper right-hand corner against a specific threshold, one can check whether the test and reference words are in fact similar.
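The procedure just described can be sketched directly in NumPy. This is a minimal illustration under my own assumptions (Euclidean frame distance, unconstrained step pattern); production systems add step constraints and banding to tame the cost on long signals.

```python
import numpy as np

def dtw(ref, test):
    """Dynamic Time Warping between two feature sequences.
    ref, test: 2-D arrays of shape (n_frames, n_coeffs).
    Returns the cumulative distance and the warping path."""
    n, m = len(ref), len(test)
    # Local distance matrix: Euclidean distance between every frame pair.
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)

    # Cumulative distance, filled from the start of both signals.
    cum = np.full((n, m), np.inf)
    cum[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                cum[i - 1, j] if i > 0 else np.inf,                 # insertion
                cum[i, j - 1] if j > 0 else np.inf,                 # deletion
                cum[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # match
            )
            cum[i, j] = dist[i, j] + best_prev

    # Backtrack from the end to recover the lowest-cost warping path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cum[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((cum[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((cum[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda c: c[0])
        path.append((i, j))
    path.reverse()
    return cum[-1, -1], path
```

Running it on a signal against itself gives zero cost and the y = x diagonal path, exactly as the slide predicts; warping one copy in time bends the path away from the diagonal.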

DTW results using the same signal

DTW results using different signals

Challenges/Concerns What is a good frame size? Is there an optimal general-purpose morphing method? Is there any way to validate the synthesis without comparing the synthesized results with the actual target's speech? Can we quantify our results? Is it more difficult to go from male to female, or from female to male, etc.?

Applications Audio Recording Editing movie clips Archiving tapes Toys and Gaming machines

Questions Auditory Morphing This sound clip is taken from Kawahara’s website introducing STRAIGHT. (Male voice morphed to female voice)