Create Photo-Realistic Talking Face Changbo Hu 2001.11.26 * This work was done during visiting Microsoft Research China with Baining Guo and Bo Zhang.

1 Create Photo-Realistic Talking Face Changbo Hu 2001.11.26 * This work was done during visiting Microsoft Research China with Baining Guo and Bo Zhang

2 Outline
  Introduction to talking faces
  Motivations
  System overview
  Techniques
  Conclusions

3 Introduction
  What is a talking face?
    Face (lip) animation, driven by voice
    Applications
  The process of building a talking face
    Face model
    Motion capture
    Mapping between audio and video
    Rendering
  Photo-realistic?

4 Literature
  Waters, 93: DECface, 2D wireframe model
  Terzopoulos, 95: skin and muscle model
  Bregler, 97: Video Rewrite, sample-image based
  T. S. Huang, 98: mesh model from range data
  Poggio, 98: MikeTalk, viseme morphing
  Guenter, 99: Making Faces, 3D from multiple cameras
  Zhengyou Zhang, 00: 3D face modeling from video through epipolar constraint
  Cosatto, 00: planar quads model

5 Some Face models

6 Motivations
  Aim: a graphical interface for a conversational agent
    Photo-realistic
    Driven by Chinese
    Smooth connection between sentences
  Extended from “Video Rewrite”

7 System overview: Pipeline of the system(1)

8 System overview: Pipeline of the system (2)
  [Diagram: synthesis pipeline with the stages New text, TTS system, Wav sound, Segmentation, Triphone sequence, Train database, Synthesized triphone sequence, Lip motion sequence, Background sequence, Rewrite to faces]

9 Techniques
  Analysis:
    Audio process
    Image process
  Synthesis:
    Lip image
    Background image
    Stitch together

10 Audio part: Sound Segmentation
  Given the wav file and the script
  Use an HMM to train the segmentation system
  Segment the wav file into a phoneme sequence
  Example of the segmentation result (phoneme, start, end):
    SILOPEN    0   23
    SILOPEN   24   42
    s         43   61
    if4       62   74
    j         75   80
    ia1       81   97
    sh        98  109
    ang1     110  121
    y        122  130
    e4       131  133
    y        134  145
    in2      146  154
    h        155  164
    ang2     165  194

11 Annotation with Phoneme
  Use phonemes to annotate video frames
  Each phoneme in a sentence corresponds to a short span of the video sequence
  [Diagram: a training sentence yields a phoneme sequence (Phoneme1, Phoneme2, …); its audio frames and its video frames are each divided into frames for Phoneme1, frames for Phoneme2, …]
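
The annotation step can be illustrated with a minimal sketch. It assumes segment boundaries are given in seconds and that the video has a fixed frame rate; neither is stated on the slide, and the function name `annotate_frames` is hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, float, float]  # (phoneme, start_sec, end_sec): assumed layout


def annotate_frames(segments: List[Segment], fps: float,
                    n_frames: int) -> List[Optional[str]]:
    """Label each video frame with the phoneme covering the frame's midpoint.

    Frames not covered by any segment are labeled None.
    """
    labels: List[Optional[str]] = []
    for i in range(n_frames):
        t = (i + 0.5) / fps  # midpoint time of frame i, in seconds
        label = None
        for name, start, end in segments:
            if start <= t < end:
                label = name
                break
        labels.append(label)
    return labels


# Illustrative values only, not taken from the slides.
segments = [("s", 0.0, 0.2), ("if4", 0.2, 0.32)]
print(annotate_frames(segments, fps=25.0, n_frames=8))
```

Using the frame midpoint avoids assigning a frame to a phoneme that only touches its first instant; other conventions are equally plausible.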

12 Phoneme Distance Analysis
  Phoneme and triphone basics
  Chinese phonemes vs. English phonemes
  Distance metric definitions
  Results

13 Phoneme Basics
  Phonemes represent the basic elements of speech. All possible speech can be represented by combinations of phonemes.
    CH, JH, S, EH, EY, OY, AE, SIL…
  A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information.
    T-IY-P, IY-P-AA, P-AA-T…
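
As a small illustration of the triphone definition, the sketch below slides a window of three over a phoneme sequence. The function name `triphones` and the hyphen-joined string form are assumptions chosen to match the slide's notation.

```python
from typing import List


def triphones(phonemes: List[str]) -> List[str]:
    """Return every run of three consecutive phonemes, joined with '-'."""
    return ["-".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]


print(triphones(["T", "IY", "P", "AA", "T"]))
# ['T-IY-P', 'IY-P-AA', 'P-AA-T']
```

Consecutive triphones share two phonemes, which is what lets each one carry the context of its neighbors.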

14 Chinese Phoneme vs. English
  Chinese phonemes fall into two basic groups: initials and finals.
    Initials: B, P, M, F, …
    Finals: a3, o1, e2, eng3, iang4, ue5, …
  Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
    Different tones: a1, a2, a3, a4, a5.
  Chinese finals are not actually basic elements of speech.
    For example: iang1, iao1, uang1, iong1…
  The Chinese phoneme set is much larger than the English one.

15 Phoneme Distance Analysis
  Define the distance between any two phonemes.
  Since we synthesize only video, not sound, tone is ignored.
  Lip shape motion is the core element of the distance metric.
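
Ignoring tone amounts to merging finals that differ only in their trailing tone digit. A minimal sketch, assuming the labels follow the pattern shown in the slides (a final followed by one digit from 1 to 5); the helper name `strip_tone` is hypothetical.

```python
def strip_tone(phoneme: str) -> str:
    """Drop a trailing tone digit 1-5, e.g. 'iang4' -> 'iang'.

    Labels without a tone digit (initials such as 'B') pass through unchanged.
    """
    if len(phoneme) > 1 and phoneme[-1] in "12345":
        return phoneme[:-1]
    return phoneme


print(sorted({strip_tone(p) for p in ["a1", "a2", "a3", "a4", "a5", "B", "iang4"]}))
# ['B', 'a', 'iang']
```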

16 Phoneme Distance Analysis
  [Diagram: several example videos for Phoneme 1 and for Phoneme 2, laid out along a time axis]
  Time-align the videos of each phoneme to a uniform length.
  Average the aligned videos to get an average video per phoneme.
  By comparing the aligned average videos of two phonemes, we generate the distance matrix of the whole phoneme set.
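
The align, average, and compare steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each video is reduced to a list of per-frame lip-shape feature vectors, alignment is done by linear interpolation, and the comparison is the mean Euclidean distance between corresponding frames. All function names are hypothetical.

```python
import math
from typing import Dict, List

Frame = List[float]   # assumed: one lip-shape feature vector per video frame
Clip = List[Frame]


def resample(clip: Clip, length: int) -> Clip:
    """Linearly interpolate a clip to exactly `length` frames."""
    if len(clip) == 1 or length == 1:
        return [list(clip[0]) for _ in range(length)]  # repeat the first frame
    out = []
    for k in range(length):
        pos = k * (len(clip) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(clip) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(clip[lo], clip[hi])])
    return out


def average_clip(clips: List[Clip], length: int) -> Clip:
    """Time-align all clips to `length` frames and average them frame by frame."""
    aligned = [resample(c, length) for c in clips]
    dim = len(aligned[0][0])
    return [[sum(c[t][d] for c in aligned) / len(aligned) for d in range(dim)]
            for t in range(length)]


def clip_distance(a: Clip, b: Clip) -> float:
    """Mean Euclidean distance between corresponding frames of two aligned clips."""
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / len(a)


def distance_matrix(examples: Dict[str, List[Clip]],
                    length: int) -> Dict[str, Dict[str, float]]:
    """Distance between the average clips of every pair of phonemes."""
    avg = {p: average_clip(clips, length) for p, clips in examples.items()}
    return {p: {q: clip_distance(avg[p], avg[q]) for q in avg} for p in avg}


# Toy one-dimensional "lip opening" features; illustrative values only.
examples = {
    "b": [[[0.0], [1.0]], [[0.0], [0.5], [1.0]]],
    "a": [[[1.0], [1.0], [1.0]]],
}
print(distance_matrix(examples, length=3)["b"]["a"])  # 0.5
```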

17 Image part: Pose Tracking
  Assume a planar model for the face
  Standard minimization method to find the transform matrix (affine transform) [Black, 95]
  A mask restricts tracking to the part of the face of interest
  [Figures: template picture, mask image]
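
One common way to write such a masked affine registration objective is shown below. The slide does not give the formula, so every symbol here is an assumption: $T$ is the template image, $I$ the current frame, $M$ the set of pixels selected by the mask, $A$ a 2×2 matrix and $\mathbf{b}$ a translation that together form the affine transform, and $\rho$ an error function (squared error, or a robust alternative).

```latex
E(A, \mathbf{b}) \;=\; \sum_{\mathbf{x} \in M}
  \rho\bigl( I(A\mathbf{x} + \mathbf{b}) - T(\mathbf{x}) \bigr),
\qquad
(\hat{A}, \hat{\mathbf{b}}) \;=\; \arg\min_{A,\,\mathbf{b}} E(A, \mathbf{b})
```

Restricting the sum to $M$ keeps non-rigid regions such as the mouth from disturbing the pose estimate.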

18 Pose Tracking
  Motion prediction using parameters with physical meaning

19 Pose Tracking Some tracking results:

20 Lip Motion Tracking
  Using Eigen Points (Covell, 91)
  Feature points include the jaw, lips and teeth
  Training database labeled manually
  Automatic tracking through all pose-tracked images

21 Lip motion tracking

22 Lip Motion Tracking
  [Figures: training database (hand-labeled); automatic tracking results]

23 Synthesizing new sentences
  New text is converted to a wav file by the TTS system.
  The wav file is segmented into a phoneme sequence.
  DP is used to find an optimal video sequence from the training database.
  The triphone videos are time-aligned and stitched together.
  The lip sequence is transformed and pasted onto the background faces.

24 Lip sequence synthesis
  [Diagram: candidate triphones (Triphone 1 through Triphone C) arranged between the labels "New phoneme sequences" and "Optimal phoneme sequences"]

25 Dynamic Programming
  [Diagram: graph from Begin to End through the candidate nodes Triphone1 to Triphone5]
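
A shortest-path search of this kind can be sketched as a Viterbi-style dynamic program. The sketch assumes one non-empty list of database candidates per target position and a caller-supplied cost between consecutive candidates; the name `best_path` and the toy integer candidates are hypothetical and stand in for triphone video clips.

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")


def best_path(candidates: Sequence[Sequence[C]],
              edge_cost: Callable[[C, C], float]) -> List[C]:
    """Pick one candidate per position so the summed edge costs are minimal."""
    if not candidates:
        return []
    # cost[j]: cheapest total cost of any path ending at candidates[i][j]
    cost = [0.0] * len(candidates[0])
    back: List[List[int]] = []  # back[i-1][j]: best predecessor index of candidates[i][j]
    for i in range(1, len(candidates)):
        new_cost, pointers = [], []
        for cand in candidates[i]:
            options = [cost[k] + edge_cost(prev, cand)
                       for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(options)), key=options.__getitem__)
            new_cost.append(options[k_best])
            pointers.append(k_best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [candidates[-1][j]]
    for i in range(len(candidates) - 1, 0, -1):
        j = back[i - 1][j]
        path.append(candidates[i - 1][j])
    path.reverse()
    return path


# Toy example: candidates are numbers and the edge cost is their difference.
print(best_path([[0, 10], [9, 1], [2, 20]], lambda a, b: abs(a - b)))
# [0, 1, 2]
```

The work grows with the number of positions times the square of the candidates per position, rather than with the number of possible paths.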

26 Edge Cost Definition
  Two parts:
    1. Phoneme distance: the distances of the 3 phonemes added together
    2. Lip shape distance over the overlapping portion of the triphone videos
  The two parts are combined as a weighted sum.
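
The two-part cost can be sketched as below. Several details are assumptions not given on the slide: the phoneme distance is looked up in a nested dictionary, lip shapes are feature vectors per frame, the lip term is averaged over the overlapping frames, and the weights `w_phoneme` and `w_lip` are free parameters.

```python
import math
from typing import Dict, List, Sequence

PhonemeDist = Dict[str, Dict[str, float]]  # assumed form of the distance matrix


def phoneme_cost(target: Sequence[str], candidate: Sequence[str],
                 dist: PhonemeDist) -> float:
    """Sum of the three position-wise phoneme distances between two triphones."""
    return sum(dist[t][c] for t, c in zip(target, candidate))


def lip_cost(prev_tail: List[List[float]], cand_head: List[List[float]]) -> float:
    """Mean Euclidean distance between lip shapes on the overlapping frames."""
    return sum(math.dist(a, b) for a, b in zip(prev_tail, cand_head)) / len(prev_tail)


def edge_cost(target: Sequence[str], candidate: Sequence[str], dist: PhonemeDist,
              prev_tail: List[List[float]], cand_head: List[List[float]],
              w_phoneme: float = 1.0, w_lip: float = 1.0) -> float:
    """Weighted sum of the phoneme-distance part and the lip-shape part."""
    return (w_phoneme * phoneme_cost(target, candidate, dist)
            + w_lip * lip_cost(prev_tail, cand_head))


# Illustrative numbers only.
dist = {"b": {"b": 0.0, "p": 0.1}, "a": {"a": 0.0}, "o": {"o": 0.0}}
cost = edge_cost(("b", "a", "o"), ("p", "a", "o"), dist,
                 prev_tail=[[0.0, 0.0]], cand_head=[[3.0, 4.0]], w_lip=0.5)
print(round(cost, 6))  # 2.6
```

The weights trade off picking a clip spoken in the right phonetic context against picking one whose lips join smoothly with its neighbor.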

27 Background video generation
  The background is a video sequence of the virtual character saying something else
  Similarity measurement of background frames
  Select a “standard frame”: the frame with the maximal number of frames similar to it
  Filter out the frames with jerkiness
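
Selecting the standard frame can be sketched as a neighbor count. The distance function and the similarity threshold are assumptions supplied by the caller, since the slide does not define the similarity measure; `select_standard_frame` is a hypothetical name.

```python
from typing import Callable, Sequence, TypeVar

F = TypeVar("F")


def select_standard_frame(frames: Sequence[F],
                          distance: Callable[[F, F], float],
                          threshold: float) -> int:
    """Return the index of the frame that has the most other frames within
    `threshold` of it. Ties go to the earliest frame."""
    best_index, best_count = 0, -1
    for i, fi in enumerate(frames):
        count = sum(1 for j, fj in enumerate(frames)
                    if j != i and distance(fi, fj) <= threshold)
        if count > best_count:
            best_index, best_count = i, count
    return best_index


# Toy example: each frame is summarized by a single pose value.
poses = [0.0, 1.0, 1.1, 0.9, 5.0]
print(select_standard_frame(poses, lambda a, b: abs(a - b), threshold=0.15))  # 1
```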

28

29 Stitch the time-aligned result to background faces
  Write back with a mask
  Transform the synthesized lip to the background face
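
Writing back with a mask can be illustrated as per-pixel blending. The sketch assumes grayscale images stored as nested lists, a lip image already transformed into the background's pose, and a mask with values in [0, 1] (1 keeps the lip pixel, 0 keeps the background). A soft-edged mask is an assumption; the slide only says a mask is used.

```python
from typing import List

Image = List[List[float]]


def write_back(background: Image, lip: Image, mask: Image) -> Image:
    """Per pixel: out = mask * lip + (1 - mask) * background."""
    return [[m * l + (1.0 - m) * b for b, l, m in zip(b_row, l_row, m_row)]
            for b_row, l_row, m_row in zip(background, lip, mask)]


background = [[10.0, 10.0, 10.0]]
lip = [[200.0, 200.0, 200.0]]
mask = [[0.0, 0.5, 1.0]]
print(write_back(background, lip, mask))  # [[10.0, 105.0, 200.0]]
```

Intermediate mask values near the boundary blend the two images, which hides the seam between the pasted lip region and the background face.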

30 [Figures: mask image for write-back operation; original background frame; write-back result of the same frame]

31 More video results

32

33 Conclusion and Future Work
  Pose tracking and lip motion tracking
  Size of the training database
  Talking face with expression
  Real-time generation?
  Fast modeling for different persons

34 Animation

35 Thank you

