
1  research & development
Learning optimal audiovisual phasing for an HMM-based control model for facial animation
O. Govokhina (1,2), G. Bailly (2), G. Breton (1)
(1) France Telecom R&D – Rennes   (2) GIPSA-lab, Dpt. Parole & Cognition – Grenoble
SSW6, Bonn, August 2007

2  Agenda
1. Facial animation
2. Data and articulatory model
3. Trajectory formation models
   - State of the art
   - First improvement: Task-Dynamics for Animation (TDA)
4. Multimodal coordination
   - AV asynchrony
   - PHMM: Phased Hidden Markov Model
   - Results and conclusions

3  1. Facial Animation

4  Facial Animation
Domain: visual speech synthesis.
- Control model: computes multiparametric trajectories from the phonetic input
- Shape model: specifies how the facial geometry is modified by the articulatory parameters
- Appearance model: final image rendering
[Diagram: motion capture provides AV data; the control, shape and appearance models are learned from these data at analysis time; at synthesis time, the phonetic input drives control, shape and appearance in sequence to produce the facial animation]
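The three-model decomposition above amounts to a simple pipeline. A minimal sketch, assuming illustrative interfaces and names (none of them are from the presentation):

```python
from typing import Protocol, Sequence


class ControlModel(Protocol):
    def trajectories(self, phones: Sequence[str]) -> list[list[float]]:
        """Phonetic input -> multiparametric (articulatory) trajectories."""
        ...


class ShapeModel(Protocol):
    def geometry(self, articulatory: list[float]) -> list[float]:
        """Articulatory parameters -> facial-geometry parameters."""
        ...


class AppearanceModel(Protocol):
    def render(self, geometry: list[float]) -> bytes:
        """Facial geometry -> rendered image."""
        ...


def animate(phones: Sequence[str], control: ControlModel,
            shape: ShapeModel, appearance: AppearanceModel) -> list[bytes]:
    # Chain the three models frame by frame: control -> shape -> appearance.
    return [appearance.render(shape.geometry(frame))
            for frame in control.trajectories(phones)]
```

Any concrete control, shape, and appearance implementations that honour these interfaces can be swapped in independently, which is the point of the decomposition.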

5  2. Data and articulatory model

6  Data and articulatory model
Audiovisual database (FT):
- 540 sentences, one female subject
- 150 colored beads, automatic tracking
- Cloning methodology developed at ICP (Badin et al., 2002; Revéret et al., 2000)
Visual parameters:
- 3 geometric parameters: lip aperture/closure, lip width, lip protrusion
- 6 articulatory parameters: jaw opening, jaw advance, lip rounding, upper-lip movements, lower-lip movements, throat movements
[Figure: the three geometric parameters (protrusion, aperture, width)]

7  3. Trajectory formation systems

8  Trajectory formation systems: state of the art
Control models:
- Visual-only
  - Coarticulation models: Massaro & Cohen; Öhman; ...
  - Triphone and kinematic models: Deng; Okadome, Kaburagi & Honda; ...
- From acoustics
  - Linear vs. nonlinear mappings: Yehia et al.; Berthommier
  - Nakamura et al.: voice-conversion techniques (GMM, HMM) used for speech-to-articulatory inversion
- Multimodal
  - Synthesis by concatenation: Minnis et al.; Bailly et al.; ...
  - HMM synthesis: Masuko et al.; Tokuda et al.; ...

9  Trajectory formation systems: concatenation
[Diagram: linguistic processing, then unit selection/concatenation guided by a prosodic model, then parametric synthesis]
Principles:
- Multi-represented multimodal segments
- Selection & concatenation costs; optimal selection by DTW
- Selection costs computed between features or more complex phonological structures, and between stored cues and cues computed by external models (e.g. prosody)
- Post-processing: smoothing
Advantages/disadvantages:
+ Quality of the synthetic speech (units come from natural speech); in a MOS test (rule-based vs. concatenation vs. linear acoustic-to-visual mapping), concatenation was judged almost equivalent to the original movements (Gibert et al., IEEE SS 2002)
- Requires a very large audiovisual database
- Bad joins and/or inappropriate units are very visible
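Cost-based selection of this kind is commonly solved by dynamic programming over the lattice of candidate units. A minimal sketch of that search (illustrative only, not the authors' implementation; `target_cost` and `join_cost` are hypothetical stand-ins for the selection and concatenation costs):

```python
def select_units(candidates, target_cost, join_cost):
    """Pick one stored unit per target segment, minimising the summed
    selection (target) and concatenation (join) costs.

    candidates[i] is the list of units competing for target segment i."""
    n = len(candidates)
    best = [[target_cost(0, u) for u in candidates[0]]]  # cumulative costs
    back = [[-1] * len(candidates[0])]                   # backpointers
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            # cheapest way to reach unit u from any unit of the previous segment
            k = min(range(len(candidates[i - 1])),
                    key=lambda k: best[i - 1][k]
                    + join_cost(candidates[i - 1][k], u))
            row.append(best[i - 1][k] + join_cost(candidates[i - 1][k], u)
                       + target_cost(i, u))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # backtrack from the cheapest final unit
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

With numeric stand-in units, a target cost such as the distance to a desired value and a join cost penalising discontinuities between consecutive units reproduce the trade-off the slide describes: a well-matching unit can lose to a slightly worse one that joins more smoothly.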

10  Trajectory formation systems: HMM-based synthesis
Principles:
- Learning: contextual phone-sized HMMs; static & dynamic parameters; Gaussian/multi-Gaussian pdfs
- Generation: selection of the HMMs; distribution of phone durations among states (z-scoring); solving linear equations; smoothing due to the dynamic pdfs
Advantages/disadvantages:
+ Statistical parametric synthesis: requires a relatively small database; easily modified for different applications (languages, speaking rate, emotions, ...)
- In a MOS test (concatenation vs. HMM vs. linear acoustic-to-visual mapping), HMM synthesis was on average rated better than concatenation, but under-articulated (Govokhina et al., Interspeech 2006)
[Diagram, learning: audio and visual parameters are segmented and used to train the HMMs and state-duration models. Diagram, synthesis: the phonetic input selects an HMM sequence; state durations are generated, then the visual parameter trajectories]
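The "z-scoring" step distributes an imposed phone duration among the HMM states. A common formulation in HMM-based synthesis sets each state duration to d_k = m_k + rho * v_k, where m_k and v_k are the state's mean and variance and rho is a shared scaling factor chosen so that the durations sum to the target. A minimal sketch under that assumption (not the authors' code; rounding means the sum may be off by a frame or two):

```python
def distribute_duration(total_frames, state_means, state_vars):
    """Split an imposed phone duration (in frames) among HMM states.

    Each state k gets d_k = m_k + rho * v_k, with rho the common
    z-score-like factor making the durations sum to total_frames."""
    rho = (total_frames - sum(state_means)) / sum(state_vars)
    # round to whole frames, never letting a state vanish entirely
    return [max(1, round(m + rho * v))
            for m, v in zip(state_means, state_vars)]
```

States with larger duration variance absorb more of the stretch or compression, so reliable (low-variance) states keep durations close to their trained means.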

11  First improvement: TDA (HMM + concatenation)
Task dynamics (Saltzman & Munhall, 1989): specifying phonetic targets with geometric/aerodynamic gauges.
Architecture:
- Planning (geometric goals): HMM synthesis driven by a geometric score; smooth, coherent trajectories
- Execution (articulatory parameters): synthesis by concatenation driven by an articulatory score; detailed articulation & intrinsic multimodal synchronization
- Dictionary: visual segments (geometric and articulatory)
Testing TDA: encouraging results (Govokhina et al., Interspeech 2006), but HMM planning fails to predict precise AV timing relations.

12  4. PHMM: Phased Hidden Markov Model

13  AV asynchrony
Possible/known asynchrony:
- Non-audible gestures: during silences (e.g. pre-phonatory gestures), plosives, etc.
- Visually salient gestures with little acoustic impact
- Anticipatory gestures: rounding within consonants (/stri/ vs. /stry/)
- Predominance of phonatory modes over articulation in determining phone boundaries
- Cause (articulation) precedes effect (sound)
Modeling synchrony:
- Few attempts in AV recognition
  - Coupled HMMs: Alissali, 1996; Luettin et al., 2001; Gravier et al., 2002
  - Non-significant improvements (Hazen, 2005); but there AV fusion is more problematic than timing
- Very few attempts in AV synthesis: Okadome et al.

14  PHMM: Phased Hidden Markov Model
For visual speech synthesis: state-of-the-art systems force gestures to synchronize with the sound boundaries.
Proposal: simultaneous automatic learning
- Classical HMM learning applied to the articulatory parameters
- A proposed audiovisual-delay learning algorithm: an iterative analysis-by-synthesis procedure based on the Viterbi algorithm
- Simple phasing model: an average delay associated with each context-dependent HMM
- Tested on the FT AV database

15  Results
Rapid convergence:
- Within a few iterations, but under constraints: a simple phasing model and minimal durations for gestures
Large improvement:
- 10% for context-independent HMMs
- Combines with context: a further, larger improvement for context-dependent HMMs
Significant delays:
- Largest for the first & last segments (pre-phonatory gestures, ~150 ms)
- Positive for vowels, glides and bilabials; negative for back and nasal consonants
- In accordance with Öhman's numerical theory of coarticulation: slow vocalic gestures expand whereas rapid consonantal gestures shrink

16  Illustration
Features:
- Prephonation
- Postphonation (see the final /o/)
- Rounding (see /ɥi/): the longer gestural duration enables complete protrusion
[Videos: original, HMM synthesis, PHMM synthesis]

17  Conclusions
Speech-specific trajectory formation models, trainable and parameterized by data:
- TDA: robustness & detailed articulation
- PHMM: learning the phasing relations between modalities
Perspectives:
- Combining TDA and PHMM, notably segmenting multimodal units using PHMM
- Subjective evaluation: intelligibility, adequacy & cognitive load
- PHMM:
  - More sophisticated phasing models: regression trees, etc.
  - Using state boundaries as possible anchor points
  - Applying to other gestures: CS, deictic/iconic gestures that should be coordinated with speech

18  Examples

19  Thank you for your attention
For further details, mail me at: oxana.govokhina@orange-ftgroup.com

20  PHMM learning algorithm
1. Classical context-dependent HMM learning on the articulatory parameters
2. Temporal information on the phonetic boundaries taken from the audio segmentation (SA)
3. Phoneme realignment on the articulatory parameters by the Viterbi algorithm
4. Average audiovisual delay, with a constraint of minimal phoneme duration (30 ms)
5. Visual segmentation SV(i) calculated from the average audiovisual-delay model and the audio segmentation (SA)
The procedure is iterated; stop when Corr(SV(i), SV(i-1)) → 1
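The delay-learning loop can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the Viterbi realignment of step 3 is abstracted as a caller-supplied `realign` function, and only the per-phone average delay, the 30 ms minimal-duration constraint, and the correlation stopping test come from the slide.

```python
import statistics

MIN_DUR = 0.030  # minimal phoneme duration from the slide (30 ms)


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a)
           * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den


def apply_delay_model(sa, labels, delays):
    # Step 5: shift each audio boundary by its phone's average delay,
    # then enforce the minimal-duration constraint.
    sv = [t + delays.get(lab, 0.0) for t, lab in zip(sa, labels)]
    for i in range(1, len(sv)):
        sv[i] = max(sv[i], sv[i - 1] + MIN_DUR)
    return sv


def learn_delays(sa, labels, realign, tol=1e-3, max_iter=20):
    """sa: audio boundary times; labels: phone label per boundary;
    realign: stand-in for the Viterbi realignment on the visual stream."""
    sv_prev = list(sa)
    for _ in range(max_iter):
        sv_obs = realign(sv_prev)  # step 3 (Viterbi in the paper)
        # Step 4: average audiovisual delay per phone class.
        delays = {lab: statistics.fmean(
                      [o - a for o, a, l in zip(sv_obs, sa, labels)
                       if l == lab])
                  for lab in set(labels)}
        sv = apply_delay_model(sa, labels, delays)
        if pearson(sv, sv_prev) > 1 - tol:  # segmentation has stabilised
            return sv, delays
        sv_prev = sv
    return sv, delays
```

With a synthetic realignment that shifts boundaries by fixed per-phone offsets, the loop recovers exactly those offsets as its delay model within a couple of iterations.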

21  Examples

