
1 Designing Facial Animation For Speaking Persian Language Hadi Rahimzadeh 81271003 rahimzad@ce.sharif.edu June 2005

2 System Description. Input: speech signal. Output: facial animation of a generic 3D face in the MPEG-4 standard, played back together with the speech stream.

3 Agenda: MPEG-4 standard, speech processing, different approaches, learning phase, face feature extraction, training neural networks, experimental results, conclusion.

4 MPEG-4 Standard. Multimedia communication standard (1999, Moving Picture Experts Group). High quality at low bit rates; interaction of users with media; object oriented, with per-object properties and scalable quality. SNHC (Synthetic/Natural Hybrid Coding) covers synthetic faces and bodies.

5 Facial Animation in MPEG-4. FDP (Face Definition Parameters): shape, defined by 84 feature points, and texture. FAP (Face Animation Parameters): animate the feature points; 68 parameters; high level and low level; global and local parameters; expressed in FAP units.

6 Face Definition Parameters

7 Face Animation Parameter Units
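
To make the FAP-unit idea concrete, here is a minimal sketch of converting a dimensionless FAP amplitude into a model-space displacement. The /1024 FAPU convention is from the MPEG-4 specification, but the face measurements and the fap_to_displacement helper are illustrative assumptions, not the author's code.

```python
# Minimal sketch: scaling a FAP amplitude into a displacement in model
# units via a FAPU (Face Animation Parameter Unit). MPEG-4 defines FAPUs
# as fractions of neutral-face distances, e.g. MNS0/1024 for the
# mouth-nose separation and MW0/1024 for the mouth width.

# Neutral-face measurements (hypothetical example values, model units).
MNS0 = 20.0   # mouth-nose separation
MW0 = 50.0    # mouth width

FAPU = {
    "MNS": MNS0 / 1024.0,  # unit for vertical mouth-region FAPs
    "MW": MW0 / 1024.0,    # unit for horizontal mouth-region FAPs
}

def fap_to_displacement(fap_value: float, unit: str) -> float:
    """Scale a dimensionless FAP amplitude into model units."""
    return fap_value * FAPU[unit]

# A hypothetical vertical lip FAP amplitude of 512 moves the feature
# point by half a mouth-nose separation.
print(fap_to_displacement(512.0, "MNS"))
```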

8 Speech Processing. Phases: noise reduction (simple noise), framing, feature extraction. Speech features: LPC, MFCC, Delta MFCC, Delta-Delta MFCC. (Figure: frame 1, frame 2, ... map to feature vectors X1, X2, ...)
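
As a concrete illustration of the framing and feature-extraction phases, here is a minimal sketch using the librosa library (an assumption; the original 2005 work predates it), with an assumed input file name and frame settings.

```python
import librosa
import numpy as np

# Load speech and frame it (assumed settings: 16 kHz audio, hop chosen
# to yield 50 feature vectors per second, matching the slides).
y, sr = librosa.load("speech.wav", sr=16000)  # hypothetical input file
hop = sr // 50

# MFCCs and their first/second temporal derivatives, as on the slide.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
d_mfcc = librosa.feature.delta(mfcc)            # Delta MFCC
dd_mfcc = librosa.feature.delta(mfcc, order=2)  # Delta-Delta MFCC

# One feature vector X_k per speech frame.
X = np.vstack([mfcc, d_mfcc, dd_mfcc]).T
print(X.shape)  # (num_frames, 39)
```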

9 Two Approaches. Phoneme-viseme mapping: transitions among visemes; discrete phonetic units; extremely stylized; language dependent. Acoustic-visual mapping: learns the relation between speech features and facial expressions; a function approximation problem; language independent; neural networks and HMMs serve as the learning machines for the mapping.

10 Learning Phase. From the speaker video, the speech stream goes through feature extraction and the image sequence goes through FAP extraction; the paired data train the neural network, and the resulting FAPs drive a FAP player.

11 Face Feature Extraction. Deformable-template-based approach, semi-automatic. Candide model: a parameterized wireframe model designed for model-based coding, with 113 vertices and 168 faces.

12 Candide Model. Parameters of the wireframe model (WFM): global (3D rotation, 2D translation, scale), shape units (lip width, eye distance, ...), and action units (lip shape, eyebrow, ...). Each parameter value is a real number; a texture is mapped onto the wireframe. A sketch of the parameterization follows below.
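
A minimal sketch of this parameterization, following the standard Candide formulation g = s R (g0 + S sigma + A alpha) + t; the vertex count matches the model, but the random bases below are placeholders for the real shape/action-unit data shipped with Candide.

```python
import numpy as np

def candide_vertices(g_bar, S, A, sigma, alpha, R, scale, t):
    """Standard Candide parameterization:
    g = scale * R @ (g_bar + S @ sigma + A @ alpha) + t

    g_bar : (113, 3) neutral vertex positions
    S     : (113, 3, n_shape) shape-unit basis (lip width, eye distance, ...)
    A     : (113, 3, n_action) action-unit basis (lip shape, eyebrow, ...)
    sigma, alpha : real-valued shape / action parameters
    R : (3, 3) global rotation; scale : float; t : (3,) translation
    """
    deformed = g_bar + S @ sigma + A @ alpha  # per-vertex deformation
    return scale * deformed @ R.T + t         # global transform

# Smoke test with random placeholder bases (not the real model data).
rng = np.random.default_rng(0)
g = candide_vertices(rng.normal(size=(113, 3)),
                     rng.normal(size=(113, 3, 4)),
                     rng.normal(size=(113, 3, 6)),
                     rng.normal(size=4), rng.normal(size=6),
                     np.eye(3), 1.0, np.zeros(3))
print(g.shape)  # (113, 3)
```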

13 New Face Generation

14 Transformation. (Figure: a point P inside the source triangle with vertices (a1, b1), (a2, b2), (a3, b3) is mapped to the point P* inside the target triangle with vertices (x1, y1), (x2, y2), (x3, y3).) Correspondences: (a1, b1) → (x1, y1), (a2, b2) → (x2, y2), (a3, b3) → (x3, y3).
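
The three vertex correspondences determine a unique 2D affine transform (six unknowns, six equations). Here is a minimal sketch of solving for it and warping a point; the sample triangle coordinates are made up.

```python
import numpy as np

def affine_from_triangle(src, dst):
    """Solve for the 2D affine map  p* = M @ p + t  that sends the three
    source triangle vertices `src` to the target vertices `dst`.
    src, dst : (3, 2) arrays. Six linear equations, six unknowns."""
    A = np.zeros((6, 6))
    b = np.zeros(6)
    for i, ((ax, ay), (x, y)) in enumerate(zip(src, dst)):
        A[2 * i] = [ax, ay, 1, 0, 0, 0]
        A[2 * i + 1] = [0, 0, 0, ax, ay, 1]
        b[2 * i], b[2 * i + 1] = x, y
    m = np.linalg.solve(A, b).reshape(2, 3)
    return m[:, :2], m[:, 2]  # M, t

# Example (made-up triangles): map P in the source to P* in the target.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = np.array([[2.0, 1.0], [4.0, 1.5], [2.5, 3.0]])
M, t = affine_from_triangle(src, dst)
P = np.array([0.2, 0.3])
print(M @ P + t)  # P*, the warped point
```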

15 Transformation (cont.)

16 New Face Generation

17 Model Adaptation: selecting optimal parameters. Global parameters: 3D rotation, 2D translation, scale. Lip parameters: upper lip, jaw open, lip width, vertical movement of the lip corners. A full search over the parameters is expensive, so the previous frame's information is used as the starting point (see the sketch below).
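
The slides do not spell out the search procedure, so the following is only one plausible reading: a greedy coordinate search initialized from the previous frame's parameters, with a placeholder cost function standing in for the image-vs-rendered-model mismatch.

```python
import numpy as np

def local_search(cost, params0, step=0.05, iters=20):
    """Greedy coordinate search started from the previous frame's
    parameters (params0), avoiding an expensive full grid search.
    `cost` scores how badly the rendered model matches the frame."""
    p = np.asarray(params0, dtype=float).copy()
    best = cost(p)
    for _ in range(iters):
        improved = False
        for i in range(len(p)):
            for d in (+step, -step):
                q = p.copy()
                q[i] += d
                c = cost(q)
                if c < best:
                    p, best, improved = q, c, True
        if not improved:
            step /= 2  # refine around the current optimum
    return p

# Placeholder quadratic cost; a real cost would compare pixel data.
target = np.array([0.3, -0.1, 0.8, 0.0])
prev_frame_params = np.zeros(4)
print(local_search(lambda p: np.sum((p - target) ** 2), prev_frame_params))
```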

18 Lip Reading. Color data is used to segment the lip area, and the extracted lip area is used to estimate the lip model parameters (upper lip, jaw open, mouth width, lip corners) through the related vertices of the Candide model. Two pixel regions are labeled in the first frame: lip and non-lip.

19 Lip Area Classification. Fisher Linear Discriminant (FLD): simple and fast. Given two point sets X and Y in n dimensions, let m1 and m2 be the means of the projections of X and Y onto a unit vector α, and s1², s2² the corresponding scatters. Find the α that maximizes the Fisher criterion J(α) = (m1 - m2)² / (s1² + s2²).

20 Estimating Lip Parameters. The FLD is trained on the first frame's pixels, using their color data; HSV works better than RGB, being more robust under different brightness conditions. A sketch of the training step follows below.
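
A minimal sketch of training the discriminant: the closed-form FLD direction α ∝ Sw⁻¹(μ1 - μ2) applied to HSV pixel data. The lip / non-lip sample arrays are random placeholders, and matplotlib's rgb_to_hsv stands in for whatever color conversion the original used.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def fit_fld(X, Y):
    """Fisher Linear Discriminant: the direction maximizing
    J(alpha) = (m1 - m2)^2 / (s1^2 + s2^2) is alpha ∝ Sw^-1 (mu1 - mu2)."""
    mu1, mu2 = X.mean(axis=0), Y.mean(axis=0)
    Sw = np.cov(X, rowvar=False) * (len(X) - 1) \
       + np.cov(Y, rowvar=False) * (len(Y) - 1)  # within-class scatter
    alpha = np.linalg.solve(Sw, mu1 - mu2)
    return alpha / np.linalg.norm(alpha)

# Placeholder training data: lip / non-lip pixels from the first frame,
# converted from RGB in [0, 1] to HSV for robustness to brightness.
rng = np.random.default_rng(1)
lip_rgb = rng.uniform(size=(200, 3)) * [1.0, 0.4, 0.4]     # reddish
nonlip_rgb = rng.uniform(size=(200, 3)) * [0.9, 0.8, 0.7]  # skin-ish
alpha = fit_fld(rgb_to_hsv(lip_rgb), rgb_to_hsv(nonlip_rgb))

# Classify a pixel by thresholding its projection onto alpha,
# using the midpoint of the two class means' projections.
threshold = 0.5 * (rgb_to_hsv(lip_rgb) @ alpha).mean() \
          + 0.5 * (rgb_to_hsv(nonlip_rgb) @ alpha).mean()
pixel = rgb_to_hsv(np.array([[0.8, 0.2, 0.2]]))[0]
print("lip" if pixel @ alpha > threshold else "non-lip")
```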

21 Lip Area Classification. A simple scanning approach estimates the lip parameters from the classified lip area: column scanning and row scanning (see the sketch below).
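
A minimal sketch of what row/column scanning over a binary lip mask could look like, assuming mouth width comes from the extreme lip columns, mouth opening from the extreme lip rows, and the corners from the outermost columns; the mask here is synthetic.

```python
import numpy as np

def scan_lip_mask(mask):
    """Row/column scanning of a binary lip mask.
    Columns containing lip pixels bound the mouth width; rows containing
    lip pixels bound the mouth opening; the outermost lip columns give
    the lip corner positions."""
    cols = np.flatnonzero(mask.any(axis=0))  # column scan
    rows = np.flatnonzero(mask.any(axis=1))  # row scan
    width = cols[-1] - cols[0] + 1
    height = rows[-1] - rows[0] + 1
    left_corner = (np.flatnonzero(mask[:, cols[0]]).mean(), cols[0])
    right_corner = (np.flatnonzero(mask[:, cols[-1]]).mean(), cols[-1])
    return width, height, (left_corner, right_corner)

# Synthetic elliptical "lip" mask for demonstration.
yy, xx = np.mgrid[0:60, 0:100]
mask = ((xx - 50) / 30) ** 2 + ((yy - 30) / 10) ** 2 <= 1.0
print(scan_lip_mask(mask))
```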

22 Generating FAPs from the Model. A FAP file is generated from the fitted model; the FAP file format was worked out by a trial-and-error approach. Open-source FAP players take the FAP file and the wave file as input.
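
A minimal sketch of writing such a file. The layout below (a header line with version, name, frame rate, and frame count, then per frame a 68-bit mask line and a line with the frame index plus values for the active FAPs) is an assumption based on common open-source FAP players of that era, not a spec quote; if a player expects a different layout, only the two formatting lines change.

```python
# Assumed ASCII FAP file layout; see the caveat in the text above.
NUM_FAPS = 68

def write_fap(path, name, fps, frames):
    """frames: list of dicts mapping FAP index (1-based) -> value."""
    with open(path, "w") as f:
        f.write(f"2.1 {name} {fps} {len(frames)}\n")
        for k, fap_values in enumerate(frames):
            # Mask line: which of the 68 FAPs are active this frame.
            mask = ["1" if i + 1 in fap_values else "0"
                    for i in range(NUM_FAPS)]
            f.write(" ".join(mask) + "\n")
            # Value line: frame index, then the active FAP amplitudes.
            values = [str(fap_values[i]) for i in sorted(fap_values)]
            f.write(" ".join([str(k)] + values) + "\n")

# Two example frames animating FAP 3 (open_jaw in MPEG-4) and one lip
# FAP (the second index here is illustrative).
write_fap("output.fap", "demo", 25,
          [{3: 200, 51: -80}, {3: 350, 51: -120}])
```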

23 Training Neural Networks. Data set: 60 videos; 45 sentences for training, 15 for testing. Multilayer perceptrons with one input layer, one hidden layer, and one output layer, trained with the back-propagation algorithm. Nine neurons in the output layer: five global parameters and four lip parameters.

24 Training Neural Networks (cont.). Four speech features: LPC, MFCC, Delta MFCC, Delta-Delta MFCC, with six networks per feature: one feature vector as input (30, 60, or 90 neurons in the hidden layer) or three feature vectors as input (90, 120, or 150 neurons in the hidden layer). Frame rates: video 25 fps, speech 50 fps, so two speech frames correspond to each video frame. A training sketch follows below.
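
A minimal sketch of one such network, using scikit-learn's MLPRegressor as a stand-in for the original back-propagation implementation; the data arrays are random placeholders with the slides' dimensions (39-dim MFCC+delta input, 9 outputs, 90 hidden neurons), and the frame counts are invented.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

# Placeholder data with the slides' dimensions: one 39-dim feature
# vector (MFCC + Delta + Delta-Delta) per speech frame, 9 outputs per
# frame (5 global + 4 lip parameters). Real data comes from the videos;
# the 4500/1500 frame split here is an invented 45-vs-15-sentence proxy.
X_train, y_train = rng.normal(size=(4500, 39)), rng.normal(size=(4500, 9))
X_test, y_test = rng.normal(size=(1500, 39)), rng.normal(size=(1500, 9))

# One hidden layer with 90 neurons, trained by back-propagation (SGD).
net = MLPRegressor(hidden_layer_sizes=(90,), activation="logistic",
                   solver="sgd", learning_rate_init=0.01, max_iter=500)
net.fit(X_train, y_train)
y_pred = net.predict(X_test)  # predicted animation parameters per frame
print(y_pred.shape)
```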

25 Generating Results From NNs Generating four lip parameters for each frame

26 Assessment Criterion. A performance metric measures the prediction accuracy of the audio-visual mapping: the correlation coefficient between the actual parameter trajectory x and the predicted trajectory y over the test set, G = Σ_k (x_k - x̄)(y_k - ȳ) / √(Σ_k (x_k - x̄)² Σ_k (y_k - ȳ)²), where k is the frame number and the sums run over the N frames of the test set; G is one if the two vectors are equal. A minimal implementation follows below.
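
A minimal implementation of this criterion; the trajectories below are synthetic placeholders for one predicted lip parameter.

```python
import numpy as np

def correlation_coefficient(x, y):
    """Correlation G between the actual trajectory x and the predicted
    trajectory y over the N test frames; G == 1 when the two vectors
    are equal (and for any positive affine rescaling of one of them)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Placeholder trajectories for one lip parameter over N frames.
N = 200
t = np.linspace(0, 4 * np.pi, N)
actual = np.sin(t)
predicted = np.sin(t) + 0.2 * np.random.default_rng(3).normal(size=N)
print(correlation_coefficient(actual, predicted))  # close to 1
print(correlation_coefficient(actual, actual))     # exactly 1
```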

27 Results For LPC Networks

28 Results For MFCC Networks

29 Results For Delta MFCC Networks

30 Results For Delta Delta MFCC Networks

31 Conclusion. Speech-driven facial animation is possible! Delta-Delta MFCC gives the best performance, and using the previous and next speech frames as additional input improves the performance further; combining different speech features is another promising option.

32 Future Work. More training data; speaker-independent training data; support for multiple languages; other speech features and combinations of speech features; facial emotions; HMMs for storing the mappings.

33 Thanks…

