Published by Katherine Holland. Modified over 2 years ago.

1
Adaptation of orofacial clones to the morphology and control strategies of target speakers for speech articulation
Julián Andrés VALDÉS VARGAS
Jury: Michel DESVIGNES (President), Yves LAPRIE (Reviewer), Rudolph SOCK (Reviewer), Thierry LEGOU (Examiner), Pierre BADIN (Thesis Director)

2
gipsa-lab
Summary
– Context of visual articulatory feedback
– Articulatory data
– Individual models and characterisation
– Multi-speaker models
– Conclusions and perspectives

4
Context
Mastery of articulators for speech production
– Skill maintained/improved by a perception-action loop (Matthies et al., 1996)
Feedback in speech
– Auditory
– Proprioceptive

5
Vision of articulators
Augmented speech: visual feedback
– Display of articulators
Vision of lips and face
– Improves speech intelligibility (Sumby and Pollack, 1954)
– Speech imitation is faster (Fowler et al., 2003)
Vision of hidden articulations
– Increases intelligibility (Badin et al., 2010)

6
Visual articulatory feedback system
System of visual articulatory feedback (Ben Youssef et al., 2011)
Applications
– Speech rehabilitation
– Computer Aided Pronunciation Training (CAPT)
[Diagram: speech sound signal of a given speaker → visual articulatory feedback system → clone animation]

7
Problem of articulatory adaptation
– Animation of the clone is based on a single speaker
– Adaptation to several speakers
– Mismatch between clone animation and real speakers
[Diagram: speech sounds of speakers 1, 2, ..., n → visual articulatory feedback system → animation based on the reference speaker; acoustic adaptation (Atef BEN YOUSSEF) and articulatory adaptation → animation based on the entry speaker]

8
Inter-speaker variability
Morphology: different vocal tracts
– Size, vertical / horizontal length ratios
– Shape (e.g. concave / flat palates)
Articulatory control strategies
– Cope with morphology
– Different articulatory strategies to achieve sounds considered equivalent for speech communication purposes

9
Illustration of speaker differences
[Figure: articulations /a/, /i/, /u/ for Speaker PB, Speaker AA and Speaker YL]

10
Objectives
– Articulatory adaptation (initial objective)
– Normalization: extraction of common components (patterns) to control the articulators of several speakers
– To acquire knowledge about inter-speaker variability

12
Articulatory data
Type of data
– Articulatory data
– Building articulatory models
Inter-speaker variability
– 11 French speakers (6 males and 5 females)
Articulatory phonetic coverage
– 13 vowels
– 10 consonants in 5 vocalic contexts (vowel-consonant-vowel)
– 63 articulations in total

13
Recording methods
Several recording methods considered:
X-ray (Meyer, 1907; Mosher, 1927)
– Difficult to accurately identify the contours
Electro-Magnetic Articulography (EMA)
– No recording of the whole vocal tract
Magnetic Resonance Imaging (MRI) (Rokkaku et al., 1986)
– Tomographic (imaging by sections)
– Maintained vocal tract positions
– Speakers in supine position
– Gravitational effect is moderate (Engwall, 2003; 2006)

14
Decision to use MRI
– Whole vocal tract information, unlike EMA
– Contours easier to identify compared to X-ray
– No health hazard compared to X-ray
Recording parameters:
– Midsagittal image of the vocal tract
– Slice thickness: 4 mm
– Spatial resolution: 1 mm / pixel
– Acquisition time: seconds

15
MRI recording
The speaker is asked to go through several stages:
– Speakers lie in supine position
– Bed shifted into the MRI machine
– Setting up of alignment and recording properties
– Maintained pronunciation of articulations for 8-16 seconds
– Speakers are asked not to move their heads

16
Processing of MRI
Midsagittal contours manually edited
– Rigid contours are drawn once for a given speaker
– Positioning of palate using skull bones as reference (rotation and translation)
– Positioning of jaw by means of rototranslations
– Editing of deformable contours: lips, tongue, velum, etc.
Palates of all articulations are aligned
– Avoidance of noise introduced by head movement
[Figure: contours for /a/, /i/, /u/]

17
Contours modelled
– Upper tongue: 150 (x,y) points
– Lips: 100 (x,y) points
– Velum: 150 (x,y) points
Static data → articulatory study/models

19
Universal control parameters
Extraction of a common set of patterns (components)
Goals:
– Building individual-speaker articulatory models
– Controlling all individual articulatory models from a universal set of components
[Diagram: the articulator contours of individual speakers (/a/, /i/, /u/ for Speaker 1 and Speaker 2) are described either by individual articulatory models (M speaker1, M speaker2), each with its own control parameters (CP /a/, CP /i/, CP /u/), or by a universal model with a universal set of components and speaker-specific weights]

20
Method for individual models of speakers
Principal component analysis (PCA)
– Dimensionality reduction
– Extraction of orthogonal components
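
As an illustration of the PCA step named on this slide, the following is a minimal sketch, not the thesis implementation. It assumes each articulation is one row of a matrix holding the flattened (x, y) contour coordinates; the function names `fit_pca` and `reconstruct` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA via SVD. X: (n_articulations, n_measurements), one row per articulation."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # centre each measurement
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # orthonormal rows
    scores = Xc @ components.T                     # control parameters per articulation
    return mean, components, scores

def reconstruct(mean, components, scores):
    """Rebuild contours from control parameters."""
    return mean + scores @ components

# Synthetic stand-in: 63 articulations, 150 (x, y) points flattened to 300 values.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 300))
mean, comps, scores = fit_pca(X, n_components=4)
X_hat = reconstruct(mean, comps, scores)
```

Keeping the components as orthonormal rows means that projecting a contour and reconstructing it are both plain matrix products.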

21
Assessment of models
Evaluation of the model for an individual speaker X
– Variance explanation
– Root Mean Square Error (RMSE)
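
The formulas for these two measures did not survive in the transcript. The sketch below uses common definitions as an assumption: RMSE taken over all entries of the data matrix, and variance explained as one minus the ratio of residual to total sum of squares. The thesis may define them differently (for instance, per contour point).

```python
import numpy as np

def rmse(X, Xp):
    """Root mean square error between data X and prediction Xp (same shape)."""
    return float(np.sqrt(np.mean((X - Xp) ** 2)))

def variance_explained(X, Xp):
    """1 - residual sum of squares / total sum of squares about the column means."""
    residual = np.sum((X - Xp) ** 2)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    return float(1.0 - residual / total)

# Toy data: every predicted value is off by 0.1.
X = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
Xp = X + 0.1
err = rmse(X, Xp)
varex = variance_explained(X, Xp)
```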

22
Generalization properties of models
Performance of models in reconstructing data that was not used for training
Leave-one-out cross validation procedure (a.k.a. Jackknife)
– One observation is left out
– The left-out observation is reconstructed by inverting the model
– Validation of generalization properties
– Valuable predictors retained
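
The leave-one-out procedure can be sketched as follows for a plain PCA model. This is an illustrative assumption about the procedure, with "inverting the model" read as projecting the left-out observation onto the components; `loo_rmse` is a hypothetical name and the data are synthetic.

```python
import numpy as np

def loo_rmse(X, n_components):
    """Leave-one-out RMSE of a PCA model. X: (n_observations, n_measurements)."""
    sq_err = 0.0
    for i in range(X.shape[0]):
        train = np.delete(X, i, axis=0)            # drop observation i
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        comps = Vt[:n_components]
        scores = (X[i] - mean) @ comps.T           # invert the model for the left-out row
        x_hat = mean + scores @ comps              # reconstruct it
        sq_err += np.sum((X[i] - x_hat) ** 2)
    return float(np.sqrt(sq_err / X.size))

# Synthetic data lying exactly in a 2-dimensional affine subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 10)) + 5.0
err_two = loo_rmse(X, 2)
err_one = loo_rmse(X, 1)
```

With two components the left-out rows of this rank-2 toy data are recovered almost exactly, while one component leaves a visible error. Comparing such errors across component counts is one way to decide which predictors are worth retaining.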

23
Individual tongue models
Guided PCA model (Badin & Serrurier, 2006): 4 components extracted
First component extracted by linear regression: Jaw Height (predictor)
– Three degrees of freedom: x, y translation and rotation (Edwards & Harris, 1990)
– Normalized value of the y-coordinate of the lower incisor (Badin & Serrurier, 2006)
– Corr(Y, θ): 0.92
[Figure: (X, Y)]

24
Individual tongue models
Other 3 components extracted by PCA from the residue:
– Tongue Body (TB)
– Tongue Dorsum (TD)
– Tongue Tip (TT)
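
The two-stage extraction (jaw height by linear regression, then PCA on the residue) can be sketched as below. This is a simplified reading of guided PCA, not the authors' code: the predictor is assumed to be a z-scored jaw-height value per articulation, and all names and data are hypothetical.

```python
import numpy as np

def guided_pca(X, jaw_height, n_pca=3):
    """Stage 1: regress contours on jaw height. Stage 2: PCA on the residue."""
    mean = X.mean(axis=0)
    Xc = X - mean
    jh = (jaw_height - jaw_height.mean()) / jaw_height.std()  # normalised predictor
    coef = jh @ Xc / (jh @ jh)                 # least-squares slope per coordinate
    residue = Xc - np.outer(jh, coef)          # what jaw height does not explain
    _, _, Vt = np.linalg.svd(residue, full_matrices=False)
    comps = Vt[:n_pca]                         # e.g. TB, TD, TT after interpretation
    scores = residue @ comps.T
    return mean, jh, coef, comps, scores

# Synthetic stand-in for 63 articulations of a 300-value tongue contour.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 300))
jaw_height = rng.normal(size=63)
mean, jh, coef, comps, scores = guided_pca(X, jaw_height)
X_hat = mean + np.outer(jh, coef) + scores @ comps
```

Because the regression removes everything linearly related to jaw height, the PCA scores are uncorrelated with the jaw-height predictor by construction.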

27
Comparison between components
Nomograms: graphical representation of components, with variation between -3 and 3
[Figure: nomograms for Speaker LD, Speaker RL and Speaker AK]
JH component (Y-Tongue = Coefficients_LR * JH):
– Max. variance: LD
– Min. variance: RL, MG, AK
– Compensation strategy of MG
TB component:
– Represents more variance than other components
– Horizontal/diagonal back-front movement
TD component:
– Vertical/diagonal arching movement
TT component:
– Used in different proportions according to the speaker

28
Individual lips models
3 components extracted by Guided PCA model (Badin et al., 2012)
Jaw Height
– More influence on LL than UL
– Little influence on UL for RL
Protrusion
– ULP > LLP for speaker LD
– LLP > ULP for speaker RL
Lip height
– ULH > LLH for all speakers except speaker LD
[Figure: nomograms for Speaker RL and Speaker LD; percentages shown, assignment to panels not recoverable: 25.2%, 44.6%, 52.7%, 28.6%, 12.7%, 15.4%, 1.7%, 31%, 21.9%, 34.8%, 55%, 20.5%]

29
Individual velum models
2 components extracted by PCA (Serrurier & Badin, 2008):
– Velum levator (oblique movement): VL
– Superior pharyngeal constrictor (horizontal movement): VS
[Figure: VL and VS components]

30
Individual velum models: consonant /ʁ/
[Figure: /ʁa/ for Speaker AA and Speaker HL, VL and VS components]

31
Conclusions: individual models
Tongue PCA models: 4 components (JH, TB, TD, TT)
– Variance explained: 93%, RMSE: 0.13 cm
Lip models: 3 components (JH, Protrusion, Height)
– Variance explained: 94%, RMSE: 0.04 cm
Velum models: 2 components (VL, VS)
– Variance explained: 90%, RMSE: 0.08 cm

33
Literature on multi-speaker models
PARAFAC models: 2 components extracted
Studies based on EMA (Hoole, 1998; Geng, 2000; Hu, 2006)
– 6-7 speakers, vowels, 3-4 sensors on the tongue, 80%-96% variance explained
Study based on X-ray: Harshman (1977)
– 5 speakers, 10 vowels, 13 points, 92.7%
Studies based on MRI (Hoole, 2000; Zheng, 2003; Ananth, 2010)
– 3-9 speakers, 7-13 vowels, points, 71%-87% of variance explained

34
Multi-speaker decomposition methods
Extraction of a common set of components
PARAFAC (Harshman, 1970)
– Three-way factor analysis
– Diagonal speaker adaptation matrix
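
For reference, the standard PARAFAC model can be written as follows. The index convention (i for articulations, j for contour coordinates, k for speakers, r for components) is an assumption made here for illustration and may differ from the one used in the thesis.

```latex
% Assumed indices: i = articulation, j = contour coordinate, k = speaker, r = component
x_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} + e_{ijk}
\qquad\Longleftrightarrow\qquad
\mathbf{X}_k = \mathbf{A}\,\mathbf{D}_k\,\mathbf{B}^{\top} + \mathbf{E}_k,
\quad \mathbf{D}_k = \operatorname{diag}(c_{k1}, \dots, c_{kR})
```

In the slab form on the right, the matrices A and B are shared by all speakers, and the diagonal matrix D_k plays the role of the diagonal speaker adaptation matrix: speaker k can only rescale each common component.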

35
Multi-speaker decomposition methods
TUCKER 3
– Extension of PARAFAC
– Decomposition in all modes of variation
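
The standard Tucker3 model is given below as a reference, using the same assumed index convention (i for articulations, j for contour coordinates, k for speakers).

```latex
% Core array g_{pqr} couples the components of the three modes
x_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R}
          g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr} + e_{ijk}
```

Each mode has its own number of components (P, Q, R), and the core array g lets any combination of them interact. PARAFAC is the special case where P = Q = R and the core is superdiagonal (g_{pqr} = 1 if p = q = r, and 0 otherwise).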

36
Multi-speaker decomposition methods
Joint PCA (two-way analysis adapted to multi-speaker models) (Ananthakrishnan et al., 2010, KTH, Sweden)
– All speakers' articulatory measurements for one phoneme are considered as one set of data
– Forces common components
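
One way to read this description is sketched below: the measurements of all speakers for the same articulation are concatenated into one long row, so that a single PCA yields one shared set of control parameters per articulation and speaker-specific loadings. The data layout, the per-speaker centring and the function names are assumptions, not the published method's exact setup.

```python
import numpy as np

def joint_pca(data, n_components):
    """data: (n_speakers, n_articulations, n_measurements)."""
    S, A, M = data.shape
    means = data.mean(axis=1, keepdims=True)           # per-speaker mean contour
    centred = data - means
    stacked = np.concatenate(list(centred), axis=1)    # (A, S*M): one row per articulation
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # shared control parameters (A, k)
    loadings = Vt[:n_components].reshape(n_components, S, M)  # speaker-specific blocks
    return means, scores, loadings

def joint_reconstruct(means, scores, loadings):
    """Rebuild every speaker's contours from the shared control parameters."""
    return means + np.einsum("ak,ksm->sam", scores, loadings)

# Synthetic stand-in: 3 speakers, 6 articulations, 4 measurements each.
rng = np.random.default_rng(0)
data = rng.normal(size=(3, 6, 4))
means, scores, loadings = joint_pca(data, n_components=5)
recon = joint_reconstruct(means, scores, loadings)
```

With 6 centred articulations the stacked matrix has rank at most 5, so 5 components reproduce this toy data exactly. Fewer components trade accuracy for a smaller shared parameter set.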

37
Comparison of performance between methods
RMSE and Variance Explained (VarEx): multi-speaker models (red, green, black) vs. average of individual speaker models (blue)
[Figure: VarEx and RMSE plots]

38
Multi-speaker tongue models
Reference PCA model with 4 components
– Total number of components: 11 x 4 = 44
Student's t-test for RMSE at 5% significance level: determines whether the RMSEs of the models are significantly different from each other
– Joint PCA: 14 – 21 components (TUCKER)
– PARAFAC: 21 components
[Figure: VarEx and RMSE plots]

39
Multi-speaker tongue models
Individual models: reference PCA model with 44 (11 x 4) components
– VarEx: 93.23%, RMSE: 0.13 cm
Multi-speaker models: Joint PCA with 4 components
– VarEx: 72.16%, RMSE: 0.27 cm
– Interpretation of components: JH, TB, TD and TT
Equivalent solution: Joint PCA, 21 components
– VarEx: 94.88%, RMSE: 0.12 cm
– Lack of interpretation from the 5th component
Literature
– No. of components: 2
– VarEx: 71% - 96%
– Corpus: 7-15 vowels
– Speakers: 3-9
Present study
– Corpus: 63 articulations (vowels and consonants)
– Speakers: 11

40
Multi-speaker models: lips and velum
Lips and velum models comparable with tongue models
Lips
– Individual models: 33 components (3 * 11)
– Multi-speaker joint PCA models: equivalent with 21 components
– Reduced no. of components: 3 interpretable components (JH, protrusion, lip height)
Velum
– Individual models: 22 components (2 * 11)
– Multi-speaker joint PCA models: equivalent with 14 components
– Reduced no. of components: 2 components (oblique, horizontal)

42
Conclusions
Data: unique set of articulatory data for French
– MRI of the whole vocal tract for 11 French speakers
– Contours
– Vowels and consonants
– More speakers compared to the literature
Characterisation of different speakers' strategies
– Tongue
– Upper and lower lip
– Velum
Multi-speaker models (normalisation) of tongue, lips and velum contours
– No work in the literature on lips and velum

43
Perspectives
– More speakers
– Relation between articulatory strategies and acoustics
Cross-speaker velum variability
– Influence of the tongue movement
– Nasality
New modelling solutions: non-linear methods
– Kernel PCA
– Artificial Neural Networks (ANN)
– Support Vector Machines (SVM)

44
Acknowledgments
– Laurent Lamalle (IRMaGe, Grenoble)
– Speakers
– ARTIS project (GIPSA-lab, LORIA)

45
Thank you for your attention
Questions?

46
Grid system
Maeda S. (1979): fixed grid
Busset J. (2013): adaptive grid system
– Euclidean coordinates (intersections)
– Distances and extreme angles
– Polar coordinates (distances and angles for each grid line)
Beautemps et al. (2001): adapted to each articulation
– Euclidean coordinates
– Distances and TngAdv + TngBot

47
[Table: Corr(Y-jaw, Angle_rotation) per speaker (PB, YL, LH, RL, LD, BR, HL, AA, MG, AK, MGO), with an (X, Y) plot; values not recovered]

48
Acoustic simulation
Grid system → midsagittal function → vocal tract area function (series of areas and lengths of each sagittal section)
– α, β models (Beautemps et al., 1995; Heinz & Stevens, 1965)
– A = area of a given grid section, d = midsagittal distance
– α, β: coefficients depending on subject and vocal tract location
– α, β according to the speaker of reference: PB
Vocal tract area function → vocal tract acoustic transfer function (Fant, 1960; Badin & Fant, 1984) → formants
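
The α, β model is usually written as A = α · d^β (Heinz & Stevens, 1965). The sketch below applies that power law section by section. The coefficient and distance values are made up for illustration and are not those fitted for speaker PB.

```python
def area_from_distance(d_cm, alpha, beta):
    """Power-law αβ model: cross-sectional area from midsagittal distance."""
    return alpha * d_cm ** beta

def area_function(distances_cm, alphas, betas):
    """Apply the αβ model to every grid section (one α, β pair per section)."""
    return [area_from_distance(d, a, b)
            for d, a, b in zip(distances_cm, alphas, betas)]

# Hypothetical values for three grid sections.
distances = [0.5, 1.0, 2.0]      # midsagittal distances in cm
alphas = [1.5, 1.5, 1.5]         # illustrative coefficients only
betas = [2.0, 2.0, 2.0]
areas = area_function(distances, alphas, betas)
```

Allowing one (α, β) pair per section reflects the slide's note that the coefficients depend on the vocal tract location.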

49
No. of coefficients by method

50
"Essentially, all models are wrong, but some are useful" (George Edward Pelham Box)

52
Generalisation

53
Articulatory data
Estimation of non-visible landmarks (tongue tip and jaw attachment)
– Computed as the average position over the articulations in which the landmark is distinguishable
[Figure: examples of a non-distinguishable tongue tip and a non-distinguishable jaw attachment]

54
State of the art on articulatory normalisation
Articulatory normalisation based on linear decomposition methods
– PARAFAC tongue models, 2 components extracted
– Data: 7 – 15 vowels, 3 – 9 speakers
– Performance: 71% - 96% of variance explained
Geometric normalisation
– Scaling transformations: do not normalise the articulatory control strategies employed by different speakers
Challenge
– Modelling of other contours such as lips and velum
– Extension to consonants

55
Linear regression between pairs of speakers
Prediction of the PCA control parameters of a target speaker (π_TS) from the PCA control parameters of a source speaker (π_SS)
– Multi-linear regression
– Overfitted from the 10th component on (LOOCV)
– 10 components: % variance explained, 0.37 cm (RMSE)
[Figure: VarEx and RMSE plots]
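
The mapping between two speakers' control parameters can be sketched with ordinary least squares, as below. This is an illustrative assumption about the regression setup (an affine map with an intercept); the function names and the synthetic parameters are hypothetical.

```python
import numpy as np

def fit_speaker_mapping(pi_source, pi_target):
    """Least-squares affine map from source to target control parameters.

    pi_source: (n_articulations, k_source), pi_target: (n_articulations, k_target).
    """
    A = np.hstack([pi_source, np.ones((pi_source.shape[0], 1))])  # add intercept column
    W, *_ = np.linalg.lstsq(A, pi_target, rcond=None)
    return W                                                       # (k_source + 1, k_target)

def predict_target(pi_source, W):
    """Predict target-speaker control parameters from source-speaker ones."""
    A = np.hstack([pi_source, np.ones((pi_source.shape[0], 1))])
    return A @ W

# Synthetic check: the target parameters are an exact affine function of the source.
rng = np.random.default_rng(0)
pi_ss = rng.normal(size=(63, 4))
W_true = rng.normal(size=(5, 4))
pi_ts = np.hstack([pi_ss, np.ones((63, 1))]) @ W_true
W = fit_speaker_mapping(pi_ss, pi_ts)
pi_ts_hat = predict_target(pi_ss, W)
```

Each added source component adds a row of free coefficients per target parameter, which is consistent with the overfitting the slide reports when too many components are used with only 63 articulations.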

56
Individual tongue models: synergy jaw-tongue
Max, Min ~= speakers RL, MG, AK

57
Assessment of models
Evaluation of the model for an individual speaker X
– Variance explanation
– Root Mean Square Error (RMSE)
Xp = predicted speaker data, n = number of observations, m = number of articulator measurements

58
Multi-speaker models: lips and velum
Lips and velum models comparable with tongue models
Lips
– Individual models: 33 components (3 * 11)
– Multi-speaker joint PCA models: 21 components
– Reduced no. of components: 3 interpretable components
Velum
– Individual models: 22 components (2 * 11)
– Multi-speaker joint PCA models: 14 components
– Reduced no. of components: 2 components
Table columns: Contour | Average PCA (no. components, variance exp., RMSE) | Joint PCA according to Student's t-test (no. components, variance exp., RMSE) | Joint PCA with reduced no. of components (no. components, variance exp., RMSE). Values lost in the transcript are left blank.
Upper tongue | 44 (4*11), 93.23%, 0.13 cm | , %, 0.12 cm | 4, 72.16%, 0.27 cm
Upper lip | 33 (3*11), 94.89%, 0.03 cm | , %, 0.03 cm | 3, 74.28%, 0.08 cm
Lower lip | 33 (3*11), 94.50%, 0.05 cm | , %, 0.04 cm | 3, 69.26%, 0.15 cm
Velum | 22 (2*11), 90%, 0.08 cm | , %, 0.07 cm | 2, 76.01%, 0.14 cm
