
A Framework for Cross-Lingual Voice Conversion using Artificial Neural Networks
Srinivas Desai, B. Yegnanarayana, Kishore Prahallad
International Institute of Information Technology, Hyderabad, India

Voice Conversion Framework
Conversion of the speech of speaker A into speaker B's voice, achieved by transforming spectral and excitation parameters.
- Spectral parameters: MFCCs, LPCCs, formants, etc.
- Excitation parameters: F0, residual, etc.
[Block diagram: Speaker A → Voice Conversion → Speaker B]

Modes of VC
Intra-Lingual Voice Conversion (ILVC)
- Parallel data: the source and the target speaker record the same set of utterances.
- Non-parallel data: the source and the target speaker record different sets of utterances.
Cross-Lingual Voice Conversion (CLVC)
- The source speaker and the target speaker record utterances in two different languages.

VC with parallel training data
[Block diagram]
Training: parallel data from the source and target speakers → feature extraction → alignment → mapping function.
Testing: source speech → feature extraction → conversion (using the mapping function) → synthesis.

Alignment
[Figure: plot of the source and target speech files after alignment]
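The slides do not name the alignment method, but dynamic time warping (DTW) over spectral features is the standard way to align a parallel utterance pair. A minimal sketch using librosa, with hypothetical file names:

```python
# Minimal sketch: aligning one parallel utterance pair with DTW.
# The alignment method is not named on the slide; DTW over MFCC
# frames is the standard choice. File names are placeholders.
import librosa

src, sr = librosa.load("source_arctic_a0001.wav", sr=16000)
tgt, _ = librosa.load("target_arctic_a0001.wav", sr=16000)

# 13-dimensional MFCCs, one column per frame
mfcc_src = librosa.feature.mfcc(y=src, sr=sr, n_mfcc=13)
mfcc_tgt = librosa.feature.mfcc(y=tgt, sr=sr, n_mfcc=13)

# Accumulated cost matrix and optimal warping path
D, wp = librosa.sequence.dtw(X=mfcc_src, Y=mfcc_tgt, metric="euclidean")

# wp maps source frame indices to target frame indices (end to start);
# reverse it and use it to build frame-aligned training pairs.
pairs = [(mfcc_src[:, i], mfcc_tgt[:, j]) for i, j in wp[::-1]]
```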

VC with non-parallel training data
[Block diagram]
Training: non-parallel data from the source and target speakers → feature extraction → clustering → mapping function.
Testing: source speech → feature extraction → conversion → synthesis.
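The slide labels this step only as "Clustering" without naming an algorithm. As one illustrative possibility (an assumption, not the authors' stated method), k-means can be run on each speaker's features and source centroids paired with their nearest target centroids to form pseudo-parallel training data:

```python
# Illustrative sketch only: forming pseudo-parallel pairs from
# non-parallel data via k-means clustering of each speaker's features.
import numpy as np
from sklearn.cluster import KMeans

def pseudo_parallel_pairs(feats_src, feats_tgt, n_clusters=64):
    """feats_src, feats_tgt: (n_frames, n_dims) feature matrices."""
    km_src = KMeans(n_clusters=n_clusters, n_init=10).fit(feats_src)
    km_tgt = KMeans(n_clusters=n_clusters, n_init=10).fit(feats_tgt)
    # Match each source centroid to its nearest target centroid,
    # yielding pseudo-parallel pairs of cluster means for training.
    pairs = []
    for c_src in km_src.cluster_centers_:
        dists = np.linalg.norm(km_tgt.cluster_centers_ - c_src, axis=1)
        pairs.append((c_src, km_tgt.cluster_centers_[np.argmin(dists)]))
    return pairs
```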

Limitations
- Requires parallel or pseudo-parallel data; hence, training data from both speakers is always needed.
- A model trained on such data can transform speech only between the speaker pairs it was trained on; hence, an arbitrary speaker's speech cannot be transformed.

Capturing speaker-specific characteristics (Hypothesis)
[Block diagram]
Training: target speaker data → formants & bandwidths → VTLN → ANN → MCEPs.
Testing: source speaker data → formants & bandwidths → VTLN → ANN (predicting target-speaker MCEPs).

Vocal Tract Length Normalization (VTLN)
[Figure: LP spectrum before and after VTLN]
Terms of the warping function: the formant/bandwidth frequency, the pitch value for frame i, and the sampling frequency.
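The warping function itself did not survive the transcript; the legend above suggests it involves the per-frame pitch and the sampling frequency. As an illustration only, here is one common VTLN formulation, a piecewise-linear frequency warp with a hypothetical breakpoint, applied to formant frequencies:

```python
# Illustration only: a piecewise-linear VTLN frequency warp.
# This is NOT the authors' exact function, which depends on the
# per-frame pitch value; alpha > 1 compresses the spectrum,
# alpha < 1 stretches it, and f_s is the sampling frequency.
import numpy as np

def vtln_warp(freqs_hz, alpha, f_s=16000.0):
    """Warp frequencies in [0, f_s/2]; assumed breakpoint at 7/8 Nyquist."""
    nyquist = f_s / 2.0
    f_break = 0.875 * nyquist
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    warped = np.where(
        freqs_hz <= f_break,
        alpha * freqs_hz,                      # linear region
        alpha * f_break                        # second segment maps
        + (nyquist - alpha * f_break)          # (f_break, nyquist]
        / (nyquist - f_break) * (freqs_hz - f_break),
    )
    return np.clip(warped, 0.0, nyquist)
```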

Artificial Neural Networks (ANN)
An ANN consists of interconnected processing nodes:
- Each node represents a model of an artificial neuron.
- Each interconnection between nodes has a weight associated with it.
Different topologies perform different pattern-recognition tasks:
- Feedforward networks for pattern mapping.
- Feedback networks for pattern association.
This work uses feedforward networks to map the source speaker's spectral features onto the target speaker's spectral space, as sketched below.
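As a concrete sketch of such a mapping network, the following uses the 42L 75N 75N 25L architecture from the results table later in the deck (42 input features, two 75-node hidden layers, 25 output MCEPs). The toolkit (PyTorch) and the tanh hidden activations are assumptions; the slides do not specify either.

```python
# Minimal sketch of a feedforward spectral-mapping network
# (42L 75N 75N 25L: 42 inputs -> 75 -> 75 -> 25 MCEPs).
import torch
import torch.nn as nn

mapper = nn.Sequential(
    nn.Linear(42, 75), nn.Tanh(),
    nn.Linear(75, 75), nn.Tanh(),
    nn.Linear(75, 25),              # linear output layer for MCEPs
)

optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x, y):
    """x: (batch, 42) input features; y: (batch, 25) target MCEPs."""
    optimizer.zero_grad()
    loss = loss_fn(mapper(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```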

Hypothesis Testing
Three types of experiments:
- Parallel data (ILVC): formant-related features from the source speaker and MCEPs from the target speaker.
- Non-parallel data (ILVC): both the formant-related features and the MCEPs from the target speaker.
- CLVC: both the formant-related features and the MCEPs from the target speaker.

Evaluation
Objective:
- Mel-Cepstral Distortion (MCD)
Subjective:
- Mean Opinion Score (MOS): 5 = excellent, 4 = good, 3 = fair, 2 = poor, 1 = bad.
- Similarity score: 5 = same speaker, 1 = different speakers.
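For reference, a minimal implementation of the standard MCD definition between time-aligned MCEP sequences (the constant 10/ln 10 and the exclusion of the 0th, energy, coefficient are the usual convention, assumed here):

```python
# Standard Mel-Cepstral Distortion between aligned MCEP frames:
# MCD[dB] = (10 / ln 10) * sqrt(2 * sum_d (mc_d^ref - mc_d^conv)^2),
# averaged over frames, conventionally skipping the 0th coefficient.
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_conv):
    """mcep_ref, mcep_conv: (n_frames, n_dims) time-aligned MCEP matrices."""
    diff = mcep_ref[:, 1:] - mcep_conv[:, 1:]   # drop the energy term
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))
```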

Database
ILVC: CMU ARCTIC databases
- SLT, CLB (US female)
- BDL, RMS (US male)
- JMK (Canadian male)
- AWB (Scottish male)
- KSP (Indian male)
CLVC:
- NK (Telugu female)
- PRA (Hindi female)

ILVC with parallel training data No.FeaturesANN architectureMCD [dB] 14 F4L 50N 12L 50N 25L F + 4 B8L 16N 4L 16N 25L F + 4 B + UVN8L 16N 4L 16N 25L F + 4 B + Δ + ΔΔ + UVN24L 50N 50N 25L F0 + 4 F + 4 B + UVN9L 18N 3L 18N 25L F0 + 4 F + 4 B + Δ + ΔΔ + UVN27L 50N 50N 25L F0 + Prob. of voicing + 4 F + 4 B + Δ + ΔΔ + UVN 30L 50N 50N 25L F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN 42L 75N 75N 25L (F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN) + (3L3R MCEP to MCEP error correction) (42L 75N 75N 25L) + (175L 525N 525N 175L) 5.615

ILVC with non-parallel training data

Speaker pair | MCD [dB]
SLT to SLT | 3.966
BDL to SLT | 6.153
RMS to SLT | 6.650
CLB to SLT | 5.405
JMK to SLT | 6.754
AWB to SLT | 6.758
KSP to SLT | 7.142

Speaker pair | MCD [dB]
BDL to BDL | 4.263
SLT to BDL | 6.887
RMS to BDL | 6.565
CLB to BDL | 6.444
JMK to BDL | 7.023
AWB to BDL | 7.017
KSP to BDL | 7.444

Target speaker | MOS | Similarity score
BDL | - | -
SLT | - | -

CLVC

Source speaker | Target speaker | MOS | Similarity score
NK (Telugu) | BDL (English) | - | -
PRA (Hindi) | BDL (English) | - | -

Conclusion
The proposed algorithm can capture speaker-specific characteristics, and hence can be used in both ILVC and CLVC tasks.

Thank You