2 Bayesian Speech Synthesis Framework Integrating Training and Synthesis Processes
Kei Hashimoto, Yoshihiko Nankaku, and Keiichi Tokuda
Nagoya Institute of Technology
September 23, 2010

3 Background
 Bayesian speech synthesis [Hashimoto et al., ’08]
   Represents the speech synthesis problem as the computation of a predictive distribution
   All processes can be derived from one single predictive distribution
 Approximation for estimating the posterior
   The posterior is assumed to be independent of the synthesis data
  ⇒ Training and synthesis processes are separated
 Integration of training and synthesis processes
   Derive an algorithm in which the posterior and the synthesis data are updated iteratively

4 Outline
 Bayesian speech synthesis
   Variational Bayesian method
   Speech parameter generation
 Problem & Proposed method
   Approximation of posterior
   Integration of training and synthesis processes
 Experiments
 Conclusion & Future work

5 Bayesian speech synthesis (1/2)
Model training and speech synthesis
 Notation on the original slide: model parameters; label sequence for synthesis; label sequence for training; training data; synthesis data
 ML: training and synthesis are carried out as two separate steps
 Bayes: training and synthesis are handled jointly through the predictive distribution
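The equations for the ML and Bayesian criteria were images on the original slide and did not survive extraction. The following LaTeX is a hedged reconstruction in my own notation (training data O with labels l, synthesis data o with labels s, model parameters λ), consistent with the slide's legend and the rest of the deck:

```latex
% ML: training and synthesis are two separate optimisations
\begin{align*}
  \hat{\lambda} &= \operatorname*{arg\,max}_{\lambda}\; p(O \mid \lambda, l) && \text{(training)}\\
  \hat{o}       &= \operatorname*{arg\,max}_{o}\; p(o \mid \hat{\lambda}, s) && \text{(synthesis)}
\end{align*}
% Bayes: a single predictive distribution couples training and synthesis
\begin{align*}
  \hat{o} = \operatorname*{arg\,max}_{o}\; p(o \mid O, s, l),
  \qquad
  p(o \mid O, s, l) = \int p(o \mid s, \lambda)\, p(\lambda \mid O, l)\, d\lambda
\end{align*}
```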

6 Bayesian speech synthesis (2/2)
Predictive distribution (marginal likelihood), evaluated with the variational Bayesian method [Attias; ’99]
 Notation on the original slide: HMM state sequence for the synthesis data; HMM state sequence for the training data; likelihood of the synthesis data; likelihood of the training data; prior distribution for the model parameters
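The predictive distribution itself was also an image; a hedged reconstruction in the same notation (q and Q denote the HMM state sequences of the synthesis and training data, p(λ) the prior), matching the slide's legend, is:

```latex
\begin{align*}
  p(o \mid O, s, l)
    \;\propto\;
    \sum_{q}\sum_{Q}\int
      \underbrace{p(o, q \mid s, \lambda)}_{\text{likelihood of synthesis data}}\;
      \underbrace{p(O, Q \mid l, \lambda)}_{\text{likelihood of training data}}\;
      \underbrace{p(\lambda)}_{\text{prior}}
    \, d\lambda
\end{align*}
```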

7 Variational Bayesian method (1/2)
Estimate the approximate posterior distribution
⇒ Maximize a lower bound on the log marginal likelihood, obtained via Jensen's inequality
 The approximate posterior replaces the true posterior distribution; the bound is an expectation with respect to the approximate posterior
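As a sketch (again in my own notation, not copied from the slide), the lower bound maximized here is obtained by introducing an arbitrary distribution Q over the hidden variables and applying Jensen's inequality:

```latex
\begin{align*}
  \log p(o, O \mid s, l)
    &= \log \sum_{q,Q}\int Q(q, Q, \lambda)\,
        \frac{p(o, q \mid s, \lambda)\, p(O, Q \mid l, \lambda)\, p(\lambda)}{Q(q, Q, \lambda)}\, d\lambda\\
    &\ge \sum_{q,Q}\int Q(q, Q, \lambda)\,
        \log\frac{p(o, q \mid s, \lambda)\, p(O, Q \mid l, \lambda)\, p(\lambda)}{Q(q, Q, \lambda)}\, d\lambda
      \;=\; \mathcal{F}
\end{align*}
```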

8 Variational Bayesian method (2/2)
 Random variables are assumed statistically independent in the approximate posterior
 Optimal posterior distributions are obtained by maximizing the lower bound (each with its own normalization term)
 Iterative updates, as in the EM algorithm
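Under the independence assumption the standard VB updates follow; this sketch uses the same notation as above and is not copied from the slide. Each factor is normalized (the normalization terms mentioned on the slide), and cycling through the updates monotonically increases the lower bound, as in EM:

```latex
% Factorisation assumption and the resulting coordinate-ascent updates
\begin{align*}
  Q(q, Q, \lambda) &\approx Q(q)\,Q(Q)\,Q(\lambda)\\
  Q(\lambda) &\propto p(\lambda)\exp\!\Bigl(\mathbb{E}_{Q(q)Q(Q)}\bigl[\log p(o, q \mid s, \lambda) + \log p(O, Q \mid l, \lambda)\bigr]\Bigr)\\
  Q(q) &\propto \exp\!\Bigl(\mathbb{E}_{Q(\lambda)}\bigl[\log p(o, q \mid s, \lambda)\bigr]\Bigr),
  \qquad
  Q(Q) \propto \exp\!\Bigl(\mathbb{E}_{Q(\lambda)}\bigl[\log p(O, Q \mid l, \lambda)\bigr]\Bigr)
\end{align*}
```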

9 Speech parameter generation
 Speech parameter generation based on the Bayesian approach
   The lower bound approximates the true marginal likelihood well
  ⇒ Generate synthesis data by maximizing the lower bound

10 Outline
 Bayesian speech synthesis
   Variational Bayesian method
   Speech parameter generation
 Problem & Proposed method
   Approximation of posterior
   Integration of training and synthesis processes
 Experiments
 Conclusion & Future work

11 Bayesian speech synthesis
 Maximize the lower bound of the log marginal likelihood consistently for:
   Estimation of posterior distributions
   Speech parameter generation
⇒ All processes are derived from the single predictive distribution

12 Approximation of posterior
 The posterior distribution depends on the synthesis data
  ⇒ but the synthesis data is not observed
 Conventional approximation: assume the posterior is independent of the synthesis data [Hashimoto et al., ’08]
  ⇒ Estimate the posterior from the training data only

13 Separation of training & synthesis
[Flow diagram] Training side (from the training data): update of the posterior distribution over the model parameters; update of the posterior distribution over the HMM state sequence of the training data. Synthesis side (for the synthesis data): update of the posterior distribution over the HMM state sequence of the synthesis data; generation of the synthesis data.

14 Use of generated data
 Problem:
   The posterior distribution depends on the synthesis data
   The synthesis data is not observed
 Proposed method:
   Use generated data instead of observed data for estimating the posterior distribution
   Iterative updates, as in the EM algorithm

15 Previous method
[Flow diagram, same as slide 13] The training block (posterior updates for the model parameters and the training-data state sequence, from the training data) is followed once by the synthesis block (posterior update for the synthesis-data state sequence and generation of the synthesis data).

16 Proposed method
[Flow diagram] The same posterior updates (model parameters, HMM state sequence of the training data, HMM state sequence of the synthesis data) and the generation of the synthesis data now form a single loop over both the training data and the generated synthesis data, so that generation feeds back into the posterior updates.

17
 Synthesis data can include several utterances
 The synthesis data affects the posterior distributions
 How many utterances should be generated in one update step?
 Two methods are discussed:
   Batch-based method: update posterior distributions using several test sentences
   Sentence-based method: update posterior distributions using one test sentence

18 Update method (1/2)
 Batch-based method
   The generated synthesis data of all test sentences is used to update the posterior distributions
   The synthesis data of all test sentences is generated using the same posterior distributions
[Figure: Sentence 1, Sentence 2, …, Sentence N processed with one shared set of posterior distributions]

19 Update method (2/2)
 Sentence-based method
   The generated synthesis data of one test sentence is used to update the posterior distributions
   The synthesis data of each test sentence is generated using different posterior distributions
[Figure: Sentence 1, Sentence 2, …, Sentence N each processed with its own posterior distributions]
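A minimal sketch contrasting the two schedules (not the authors' code): `estimate_posterior` and `generate` are hypothetical helpers standing in for VB posterior re-estimation from training plus generated data and for Bayesian speech parameter generation.

```python
# Hypothetical helpers: estimate_posterior(train, generated) re-estimates the VB
# posterior from the training data plus generated data; generate(posterior, sent)
# generates synthesis data for one test sentence.

def batch_update(train_data, test_sentences, estimate_posterior, generate, n_iter=3):
    """Batch-based: all test sentences share one posterior per iteration."""
    posterior = estimate_posterior(train_data, generated=[])
    generated = [generate(posterior, s) for s in test_sentences]
    for _ in range(n_iter):
        posterior = estimate_posterior(train_data, generated=generated)
        generated = [generate(posterior, s) for s in test_sentences]
    return generated

def sentence_update(train_data, test_sentences, estimate_posterior, generate, n_iter=3):
    """Sentence-based: each test sentence gets its own posterior."""
    outputs = []
    for s in test_sentences:
        posterior = estimate_posterior(train_data, generated=[])
        g = generate(posterior, s)
        for _ in range(n_iter):
            posterior = estimate_posterior(train_data, generated=[g])
            g = generate(posterior, s)
        outputs.append(g)
    return outputs
```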

20 Outline
 Bayesian speech synthesis
   Variational Bayesian method
   Speech parameter generation
 Problem & Proposed method
   Approximation of posterior
   Integration of training and synthesis processes
 Experiments
 Conclusion & Future work

21 Experimental conditions
Database: ATR Japanese speech database B-set
Speaker: MHT
Training data: 450 utterances
Test data: 53 utterances
Sampling rate: 16 kHz
Window: Blackman window
Frame size / shift: 25 ms / 5 ms
Feature vector: 24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM: 5-state left-to-right HSMM without skip transitions

22 Iteration process
 Update of posterior distributions and synthesis data:
1. Posterior distributions are estimated from the training data
2. Initial synthesis data is generated
3. Context clustering using the training data and the generated synthesis data
4. Posterior distributions are re-estimated from the training data and the generated synthesis data (the number of updates is 5)
5. Synthesis data is re-generated
6. Steps 3, 4, and 5 are iterated
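A minimal sketch of this pipeline, with hypothetical helpers `cluster_contexts`, `estimate_posteriors`, and `generate_all` standing in for Bayesian context clustering, VB posterior re-estimation, and parameter generation; it only mirrors steps 1-6 above, not the authors' implementation.

```python
def iterate_training_and_synthesis(train_data, test_labels,
                                   cluster_contexts, estimate_posteriors, generate_all,
                                   n_outer=3, n_posterior_updates=5):
    # 1. Estimate posterior distributions from the training data only
    tree = cluster_contexts(train_data, generated=None)
    posteriors = estimate_posteriors(tree, train_data, generated=None,
                                     n_updates=n_posterior_updates)
    # 2. Generate initial synthesis data
    generated = generate_all(posteriors, test_labels)
    for _ in range(n_outer):
        # 3. Context clustering using training data and generated synthesis data
        tree = cluster_contexts(train_data, generated=generated)
        # 4. Re-estimate posteriors from training + generated data (5 updates)
        posteriors = estimate_posteriors(tree, train_data, generated=generated,
                                         n_updates=n_posterior_updates)
        # 5. Re-generate the synthesis data; 6. repeat steps 3-5
        generated = generate_all(posteriors, test_labels)
    return generated
```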

23 Comparison of the number of updates
Data for estimation of posterior distributions:
Iteration 0: 450 training utterances
Iteration 1: 450 utterances + 1 utterance generated in Iteration 0
Iteration 2: 450 utterances + 1 utterance generated in Iteration 1
Iteration 3: 450 utterances + 1 utterance generated in Iteration 2

24 Experimental results
 Comparison of the number of updates

25 Comparison of Batch and Sentence
Data for estimation of posterior distributions (training & generation):
 ML: 450 utterances
 Baseline Bayes: 450 utterances
 Batch Bayes: 450 + 53 generated utterances
 Sentence Bayes: 450 + 1 generated utterance (53 different posterior dists.)

26 Experimental results
 Comparison of Batch and Sentence

27 Conclusions and future work
 Integration of training and synthesis processes
   Generated synthesis data is used for estimating the posterior distributions
   Posterior distributions and synthesis data are updated iteratively
   The proposed method outperforms the baseline method
 Future work
   Investigation of the relation between the amounts of training and synthesis data
   Experiments on various amounts of training data

28 Thank you

29 Advantage
 Represents the predictive distribution more exactly
 Optimizes the posterior distributions more accurately

30 Integration of training and synthesis
 Estimate the posterior from generated data instead of observed data
 Bayesian speech synthesis
   Synthesis and training processes are iterated
   The training process includes model selection

31 Prior distribution
 Conjugate prior distribution
  ⇒ The posterior distribution belongs to the same family as the prior distribution
 Determined using statistics of prior data
 Notation on the original slide: dimension of the feature vector; covariance of the prior data; number of prior data samples; mean of the prior data
(The slide also showed the conjugate prior distribution and the likelihood function as equations.)
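For a Gaussian output distribution with unknown mean and precision, one standard conjugate choice is a Gauss-Wishart (Normal-Wishart) prior; the sketch below uses my own hyperparameter symbols, which would be set from the prior-data statistics (mean, covariance, count, feature dimension) listed on the slide.

```latex
% Gauss-Wishart prior over the mean mu and precision Lambda of a Gaussian
% output distribution; phi, mu_0, eta, B_0 are hyperparameters (my notation).
\begin{align*}
  p(\mu, \Lambda)
    = \mathcal{N}\!\bigl(\mu \mid \mu_0, (\phi\Lambda)^{-1}\bigr)\,
      \mathcal{W}\!\bigl(\Lambda \mid \eta, B_0\bigr)
\end{align*}
```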

32 Relation between Bayes and ML
Comparison with the ML criterion:
 Expectations of the model parameters are used
 Can be solved in the same fashion as ML
(The slide showed the output distribution under ML and under the Bayesian criterion as equations.)
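A hedged sketch of the usual relation, in my own notation: under ML the output distribution is a Gaussian with point-estimated parameters, while under the VB criterion the corresponding term is the exponentiated expected log-Gaussian, which (for a Gauss-Wishart posterior) is again Gaussian-shaped in o with the expected parameters.

```latex
\begin{align*}
  \text{ML:}\quad & \mathcal{N}\bigl(o \mid \hat{\mu}, \hat{\Sigma}\bigr)\\
  \text{Bayes:}\quad & \exp\Bigl(\mathbb{E}_{Q(\mu,\Lambda)}\bigl[\log\mathcal{N}(o \mid \mu, \Lambda^{-1})\bigr]\Bigr)
    \;\propto\; \exp\Bigl(-\tfrac{1}{2}(o-\bar{\mu})^{\top}\bar{\Lambda}(o-\bar{\mu})\Bigr)
\end{align*}
% with \bar{\mu} = E[mu] and \bar{\Lambda} = E[Lambda] under the approximate
% posterior; hence generation can be solved in the same fashion as ML,
% using expected parameters.
```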

33 Impact of prior distribution
 The prior affects model selection as a set of tuning parameters
  ⇒ A technique for determining the prior distribution is required
 Maximizing the marginal likelihood
   Leads to an over-fitting problem, as in ML
   Tuning parameters are still required
 Determination of the prior distribution using cross validation [Hashimoto; ’08]

34 Speech parameter generation
 Speech parameters consist of static and dynamic features
  ⇒ Only the static feature sequence is generated
 Speech parameter generation based on the Bayesian approach
  ⇒ Maximize the lower bound
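In the ML case this maximization reduces to the well-known parameter-generation linear system with a delta-feature window; a minimal NumPy sketch is below. The delta window, the diagonal precisions, and the toy statistics are illustrative assumptions; in the Bayesian case the means and precisions would be the expected values under the approximate posterior rather than point estimates.

```python
import numpy as np

def generate_static_sequence(mean, prec, delta_win=(-0.5, 0.0, 0.5)):
    """Generate the static sequence c maximizing a Gaussian score over o = W c,
    where o stacks [static; delta] features per frame.

    mean, prec: (2T,) arrays of per-dimension means and diagonal precisions
                for the stacked static+delta observation vector.
    Returns c: (T,) static sequence solving (W^T P W) c = W^T P mean.
    """
    T = mean.shape[0] // 2
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                     # static row
        for k, w in enumerate(delta_win):     # delta row from the window
            tau = t + k - 1
            if 0 <= tau < T:
                W[2 * t + 1, tau] = w
    P = np.diag(prec)
    A = W.T @ P @ W
    b = W.T @ P @ mean
    return np.linalg.solve(A, b)

# Toy usage with made-up statistics for 5 frames of a 1-dimensional feature:
T = 5
mean = np.zeros(2 * T)
mean[0::2] = np.linspace(0.0, 1.0, T)   # static means
mean[1::2] = np.gradient(mean[0::2])    # delta means (illustrative)
prec = np.ones(2 * T)
print(generate_static_sequence(mean, prec))
```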

35 Bayesian context clustering
 Context clustering based on maximizing the lower bound
[Figure: a decision-tree node is split by a context question, e.g. "Is this phoneme a vowel?" (yes/no); the question is selected by the gain in the lower bound, and the gain is also used as the stopping condition for splitting]
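A minimal sketch of the greedy splitting loop the figure illustrates. `lower_bound_gain` and the `Node.split` method are hypothetical stand-ins: in the actual method the gain is the increase in the variational lower bound computed from the VB statistics of the data reaching the node.

```python
def grow_tree(root, questions, lower_bound_gain, min_gain=0.0):
    """Greedily split leaves by the context question with the largest gain
    in the lower bound; stop when no split improves the bound."""
    leaves = [root]
    while True:
        best = None  # (gain, leaf, question)
        for leaf in leaves:
            for q in questions:                  # e.g. "Is this phoneme a vowel?"
                gain = lower_bound_gain(leaf, q)
                if best is None or gain > best[0]:
                    best = (gain, leaf, q)
        if best is None or best[0] <= min_gain:  # stopping condition: no gain
            return leaves
        gain, leaf, q = best
        yes_child, no_child = leaf.split(q)      # assumes a Node.split(question) API
        leaves.remove(leaf)
        leaves.extend([yes_child, no_child])
```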

36 Use of generated data
 Problem: the synthesis data is not observed
 Proposed method: generated data is used for estimating the posterior distribution instead of observed data
   The synthesis data and the posterior distributions influence each other
   Iterative updates, as in the EM algorithm

37 Batch-based & Sentence-based
 Batch-based method
 Sentence-based method
[Figure: both methods illustrated over Sentence 1, Sentence 2, …, Sentence N]

