2 A Framework for Cross-Lingual Voice Conversion using Artificial Neural Networks
Srinivas Desai, B. Yegnanarayana, Kishore Prahallad
International Institute of Information Technology, Hyderabad, India

3 Voice Conversion Framework
Conversion of speech of speaker A into speaker B's voice.
Conversion is achieved by transforming spectral and excitation parameters.
Spectral parameters: MFCC, LPCC, formants, etc.
Excitation parameters: F0, residual, etc.
[Diagram: Speaker A → Voice Conversion → Speaker B]
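A minimal sketch of extracting these frame-level parameters, assuming librosa as the toolkit and a hypothetical input file (the slide names neither):

```python
import librosa

# Load one utterance at 16 kHz (file name is hypothetical).
y, sr = librosa.load("speaker_a.wav", sr=16000)

# Spectral parameters: 13-dimensional MFCCs per analysis frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Excitation parameter: F0 track estimated with the pYIN algorithm
# (NaN where the frame is unvoiced).
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)

print(mfcc.shape, f0.shape)  # (13, num_frames) and (num_frames,)
```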

4 Modes of VC
Intra-Lingual Voice Conversion (ILVC)
  Parallel data: the source and the target speaker record the same set of utterances.
  Non-parallel data: the source and the target speaker record different sets of utterances.
Cross-Lingual Voice Conversion (CLVC)
  The source speaker and the target speaker record utterances in two different languages.

5 VC with parallel training data
[Diagram: Training: parallel data from the source and target speakers, feature extraction, alignment, then learning of the mapping function. Testing: feature extraction from source speech, conversion, synthesis in the target speaker's voice.]

6 Alignment
[Plot: source and target speech files after alignment]
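The alignment step pairs source and target frames carrying the same phonetic content. A minimal sketch using dynamic time warping, the usual choice for this step (the slide itself does not name the method):

```python
import librosa
import numpy as np

def align(src_feats, tgt_feats):
    """Align two (dim, frames) feature matrices; return frame-paired copies."""
    D, wp = librosa.sequence.dtw(X=src_feats, Y=tgt_feats, metric="euclidean")
    wp = wp[::-1]                          # the path is returned end-to-start
    src_aligned = src_feats[:, wp[:, 0]]   # source frame for each path step
    tgt_aligned = tgt_feats[:, wp[:, 1]]   # matching target frame
    return src_aligned, tgt_aligned
```

The aligned frame pairs then serve directly as input/output examples for training the mapping function.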

7 VC with non-parallel training data
[Diagram: Training: non-parallel data from the source and target speakers, feature extraction, clustering on each side, then learning of the mapping function. Testing: feature extraction from source speech, conversion, synthesis.]
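A minimal sketch of the clustering step, assuming k-means over frame vectors (the slide does not specify the clustering method):

```python
from sklearn.cluster import KMeans

def cluster_frames(feats, n_clusters=64):
    """Cluster (frames, dim) feature vectors; return labels and centroids."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(feats)
    return km.labels_, km.cluster_centers_
```

Matching source-side clusters to target-side clusters (for example, by nearest centroids) then yields pseudo-parallel frame pairs in place of a true alignment.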

8 Limitations
Requires parallel or pseudo-parallel data; hence, training data from both speakers is always needed.
A model trained on such data can transform speech only between the speaker pair it was trained on; hence, an arbitrary speaker's speech cannot be transformed.

9 Capturing speaker-specific characteristics (Hypothesis)
[Diagram: Training: target speaker data → formants & bandwidths → VTLN → ANN → MCEPs. Testing: source speaker data → formants & bandwidths → VTLN → ANN.]

10 Vocal Tract Length Normalization (VTLN)
[Plot: LP spectrum before and after VTLN. Symbols of the warping formula: the formant/bandwidth frequency, the pitch value for frame i, and the sampling frequency.]
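The slide's warping formula is not recoverable from this transcript, so the following is a generic piecewise-linear VTLN warp with a scalar warp factor alpha, plainly not necessarily the authors' pitch-based normalization:

```python
import numpy as np

def vtln_warp(freqs, alpha, fs=16000, breakpoint=0.85):
    """Scale frequencies by alpha below a knee frequency, then bend the warp
    linearly so that the Nyquist frequency fs/2 maps to itself."""
    freqs = np.asarray(freqs, dtype=float)
    nyq = fs / 2.0
    knee = breakpoint * nyq
    return np.where(
        freqs <= knee,
        alpha * freqs,
        alpha * knee + (nyq - alpha * knee) * (freqs - knee) / (nyq - knee),
    )
```

A per-frame alpha derived from the frame's pitch value and the sampling frequency would be one way to use the symbols listed on the slide.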

11 Artificial Neural Networks (ANN)
An ANN consists of interconnected processing nodes:
  Each node is a model of an artificial neuron.
  Each interconnection between nodes has an associated weight.
Different topologies perform different pattern recognition tasks:
  Feedforward networks for pattern mapping.
  Feedback networks for pattern association.
This work uses feedforward networks for mapping the source speaker's spectral features onto the target speaker's spectral space; see the sketch below.
[Diagram: feedforward network with N input nodes, M hidden nodes, and N output nodes mapping input X to output Y]
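A minimal sketch of such a feedforward mapping network. The 42L-75N-75N-25L shape mirrors experiment 8 in the results table (L = linear layer, N = nonlinear nodes); PyTorch is an assumption, since the original work predates it:

```python
import torch
import torch.nn as nn

class MappingNet(nn.Module):
    """Feedforward net mapping a 42-dim source feature frame to 25 MCEPs."""
    def __init__(self, in_dim=42, hidden=75, out_dim=25):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),   # first nonlinear layer
            nn.Linear(hidden, hidden), nn.Tanh(),   # second nonlinear layer
            nn.Linear(hidden, out_dim),             # linear output layer
        )

    def forward(self, x):
        return self.net(x)

# Training pairs one frame of source-side features with one frame of
# target-speaker MCEPs, minimising mean squared error:
model = MappingNet()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```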

12 Hypothesis Testing
Three types of experiments:
  Use of parallel data (ILVC): formant-related features from the source speaker and MCEPs from the target speaker.
  Use of non-parallel data (ILVC): both the formant-related features and the MCEPs from the target speaker.
  CLVC: both the formant-related features and the MCEPs from the target speaker.

13 Evaluation
Objective: Mel-Cepstral Distortion (MCD), computed as in the sketch below.
Subjective:
  Mean Opinion Score (5: excellent, 4: good, 3: fair, 2: poor, 1: bad)
  Similarity Score (5: same speakers, 1: different speakers)
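The MCD formula on the slide did not survive as text; this is the standard definition, assuming 25-dimensional MCEPs with the 0th (energy) coefficient excluded:

```python
import numpy as np

def mel_cepstral_distortion(mc_target, mc_converted):
    """Mean MCD in dB between two aligned (frames, 25) MCEP arrays:
    MCD = (10 / ln 10) * sqrt(2 * sum_{d=1..24} (mc_t[d] - mc_c[d])^2)."""
    diff = mc_target[:, 1:] - mc_converted[:, 1:]   # drop the 0th coefficient
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```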

14 Database
ILVC: CMU ARCTIC databases
  SLT, CLB (US Female)
  BDL, RMS (US Male)
  JMK (Canadian Male)
  AWB (Scottish Male)
  KSP (Indian Male)
CLVC:
  NK (Telugu Female)
  PRA (Hindi Female)

15 ILVC with parallel training data

No.  Features                                            ANN architecture         MCD [dB]
1    4 F                                                 4L 50N 12L 50N 25L       9.786
2    4 F + 4 B                                           8L 16N 4L 16N 25L        9.557
3    4 F + 4 B + UVN                                     8L 16N 4L 16N 25L        6.639
4    4 F + 4 B + Δ + ΔΔ + UVN                            24L 50N 50N 25L          6.352
5    F0 + 4 F + 4 B + UVN                                9L 18N 3L 18N 25L        6.713
6    F0 + 4 F + 4 B + Δ + ΔΔ + UVN                       27L 50N 50N 25L          6.375
7    F0 + Prob. of voicing + 4 F + 4 B + Δ + ΔΔ + UVN    30L 50N 50N 25L          6.105
8    F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN    42L 75N 75N 25L          5.992
9    (F0 + Prob. of voicing + 6 F + 6 B + Δ + ΔΔ + UVN)  (42L 75N 75N 25L)        5.615
     + (3L3R MCEP-to-MCEP error correction)              + (175L 525N 525N 175L)

16 ILVC with non-parallel training data

Speaker pair   MCD [dB]
SLT to SLT     3.966
BDL to SLT     6.153
RMS to SLT     6.650
CLB to SLT     5.405
JMK to SLT     6.754
AWB to SLT     6.758
KSP to SLT     7.142

Speaker pair   MCD [dB]
BDL to BDL     4.263
SLT to BDL     6.887
RMS to BDL     6.565
CLB to BDL     6.444
JMK to BDL     7.023
AWB to BDL     7.017
KSP to BDL     7.444

Target Speaker   MOS     Similarity Score
BDL              2.926   2.715
SLT              2.731   2.47

17 CLVC

Source Speaker   Target Speaker   MOS    Similarity Score
NK (Telugu)      BDL (English)    2.88   2.77
PRA (Hindi)      BDL (English)    2.62   2.15

18 Conclusion
The proposed algorithm could be used to capture speaker-specific characteristics, and hence can be applied to both ILVC and CLVC tasks.

19 Thank You

