
1 Automatic Lip-Synchronization Using Linear Prediction of Speech
Christopher Kohnert, SK Semwal
University of Colorado, Colorado Springs

2 Topics of Presentation
- Introduction and Background
- Linear Prediction Theory
- Sound Signatures
- Viseme Scoring
- Rendering System
- Results
- Conclusions

3 Justification
Need:
- Existing methods are labor-intensive
- Poor results
- Expensive
Solution:
- Automatic method
- "Decent" results

4 Applications of an Automatic System
Typical applications benefiting from an automatic method:
- Real-time video communication
- Synthetic computer agents
- Low-budget animation scenarios (e.g. the video game industry)

5 Automatic Is Possible
- The spoken word can be broken into phonemes
- The phoneme set is comprehensive (it covers all speech sounds)
- Visemes are their visual correlates
- Used in lip-reading and traditional animation

6 Existing Methods of Synchronization
Text based:
- Analyze text to extract phonemes
Speech based:
- Volume tracking
- Speech-recognition front-end
- Linear Prediction
Hybrids:
- Text & speech
- Image & speech

7 Speech Based Is Best
- Needs no script
- Fully automatic
- Can use the original sound sample (best quality)
- Can use the source-filter model

8 Source-Filter Model
- Models a sound signal as a source passed through a filter
- Source: lungs & vocal cords
- Filter: vocal tract
- Implemented using Linear Prediction

9 Speech-Related Topics
- Phoneme recognition: how many phonemes to use?
- Mapping phonemes to visemes: use visually distinctive ones (e.g. vowel sounds); a sketch of such a mapping follows
- The coarticulation effect
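A minimal sketch of such a phoneme-to-viseme mapping. The phoneme symbols and viseme names here are hypothetical placeholders, not the paper's actual set:

```python
# Hypothetical phoneme-to-viseme mapping: phonemes that look alike
# on the lips collapse onto a single viseme (mouth shape).
PHONEME_TO_VISEME = {
    "AA": "open",    # as in "father"
    "IY": "smile",   # as in "see"
    "UW": "round",   # as in "blue"
    "M":  "closed",  # bilabials share one closed-lips viseme
    "B":  "closed",
    "P":  "closed",
    "F":  "teeth",   # labiodentals
    "V":  "teeth",
}

def viseme_for(phoneme: str) -> str:
    # Fall back to a neutral mouth shape for unrecognized phonemes.
    return PHONEME_TO_VISEME.get(phoneme, "neutral")
```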

10 The Coarticulation Effect
- The blending of sounds based on adjacent phonemes (common in everyday speech)
- An artifact of discrete phoneme recognition
- Causes poor visual synchronization (transitions are jerky and unnatural)

11 Speech Encoding Methods
- Pulse Code Modulation (PCM)
- Vocoding
- Linear Prediction

12 Pulse Code Modulation
- Raw digital sampling
- High-quality sound
- Very high bandwidth requirements

13 Vocoding
- Stands for VOice-enCODing
- Origins in military applications
- Models physical entities (tongue, vocal cords, jaw, etc.)
- Poor sound quality ("tin-can" voices)
- Very low bandwidth requirements

14 Linear Prediction
- A hybrid of PCM and vocoding
- Models the sound source and the filter separately
- Uses the original sound sample to calculate reconstruction parameters (minimum error)
- Low bandwidth requirements
- Pitch and intonation independence

15 Linear Prediction Theory
- Source-filter model
- P coefficients are calculated
[Diagram: source signal passed through a filter]

16 Linear Prediction Theory (cont.)
- The a_k coefficients are found by minimizing the error between the original sound s_t and the predicted sound s'_t = sum_{k=1..p} a_k * s_{t-k}:
  E = sum_t ( s_t - s'_t )^2
- The resulting equations can be solved using Levinson-Durbin recursion, as in the sketch below.
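A compact sketch of that computation, using the autocorrelation method and Levinson-Durbin recursion. The default order p = 16 is an assumption, not a value stated in the slides:

```python
import numpy as np

def lpc(frame: np.ndarray, order: int = 16) -> np.ndarray:
    """Predictor coefficients a_1..a_p for one analysis window.

    Minimizes E = sum_t (s_t - sum_k a_k * s_{t-k})^2 via the
    autocorrelation method. Assumes a non-silent frame.
    """
    n = len(frame)
    # Autocorrelation r[0..order] of the frame.
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])

    a = np.zeros(order + 1)  # inverse-filter coefficients, a[0] = 1
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this step of the recursion.
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i] += k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return -a[1:]  # so that s_t is approximated by sum_k lpc[k-1] * s_{t-k}
```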

17 Linear Prediction Theory (cont.)
- The coefficients represent the filter part
- The filter is assumed constant over small "windows" of the original sample (10-30 ms windows); see the sketch below
- Each window has its own coefficients
- The sound source is either a pulse train (voiced) or white noise (unvoiced)
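A sketch of the per-window analysis, reusing the `lpc` function above. The 20 ms window, 10 ms hop, and Hamming taper are assumptions (the slides only give the 10-30 ms range):

```python
import numpy as np

def lpc_frames(signal: np.ndarray, sample_rate: int, order: int = 16,
               win_ms: float = 20.0, hop_ms: float = 10.0) -> np.ndarray:
    """One coefficient set per short window over the whole sample."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    taper = np.hamming(win)  # soften window edges before analysis
    frames = [
        lpc(signal[s : s + win] * taper, order)
        for s in range(0, len(signal) - win + 1, hop)
    ]
    return np.array(frames)  # shape: (n_windows, order)
```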

18 Linear Prediction for Recognition
- Recognition on the raw coefficients is poor
- Better to FFT the values
- Take only the first "half" of the FFT'd values
- This is the "signature" of the sound (sketched below)
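A sketch of the signature extraction. The 32-point transform is an assumption chosen so the first half yields the 16 values mentioned on the next slide:

```python
import numpy as np

def sound_signature(coeffs: np.ndarray, n_fft: int = 32) -> np.ndarray:
    """Spectral signature of one window's LP coefficients.

    FFT the coefficient vector and keep only the first half of the
    magnitudes; for real input the second half mirrors the first.
    """
    spectrum = np.fft.rfft(coeffs, n=n_fft)
    return np.abs(spectrum[: n_fft // 2])  # 16 values for n_fft = 32
```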

19 Sound Signatures
- 16 values represent the sound
- Speaker independent
- Unique for each phoneme
- Easily recognized by machine

20 Viseme Scoring
- Phonemes were chosen judiciously
- Map one-to-one to visemes
- Visemes scored independently using history:
  V_i = 0.9 * V_{i-1} + 0.1 * (1 if matched at i, else 0)
- Ramps up and down with successive matches/mismatches (see the sketch below)
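A direct sketch of the slide's scoring formula, applied once per analysis window:

```python
def update_scores(scores: dict, matched: str) -> None:
    """One scoring step: V_i = 0.9 * V_{i-1} + 0.1 * (1 if matched, else 0).

    Every viseme decays toward 0; the matched one is nudged toward 1.
    """
    for viseme in scores:
        hit = 1.0 if viseme == matched else 0.0
        scores[viseme] = 0.9 * scores[viseme] + 0.1 * hit
```

Repeated matches ramp a score smoothly toward 1 and mismatches decay it toward 0, which smooths the mouth motion at the cost of some lag.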

21 Rendering System
- Uses Alias|Wavefront's Maya package
- Built-in support for "blend shapes"
- Blend-shape weights are mapped directly to viseme scores
- Very expressive and flexible
- An animation script is generated and later read in (sketched below)
- Rendered to a movie; QuickTime is used to add the original sound and produce the final movie
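A hedged sketch of generating such a script. `setKeyframe` is a standard Maya MEL animation command, but the node name "blendShape1", the viseme target aliases, and the one-window-per-frame timing are all assumptions about the rig, not details from the slides:

```python
def write_viseme_script(frame_scores: list, path: str = "visemes.mel") -> None:
    """Emit a MEL script that keys one blend-shape weight per viseme.

    frame_scores: list of {viseme_name: score} dicts, one per output
    frame (assumes one analysis window per frame; adjust for the
    actual frame rate and window hop).
    """
    with open(path, "w") as f:
        for frame, scores in enumerate(frame_scores, start=1):
            for viseme, score in scores.items():
                # Hypothetical node/attribute names; match the real rig.
                f.write(f"setKeyframe -time {frame} -value {score:.3f}"
                        f" blendShape1.{viseme};\n")
```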

22 Results (Timing)
- Precise timing can be achieved
- Smoothing introduces "lag"

23 Results (Other Examples)
- A female speaker using the male phoneme set
- Slower speech, male speaker

24 Results (Other Examples) (cont.)
- Accented speech with a fast pace

25 Results (Summary)
- Good with basic speech
- Good speaker independence (for normal speech)
- Poor performance when speech:
  - is too fast
  - is accented
  - contains phonemes not in the reference set (e.g. "w" and "th")

26 Conclusion
Linear Prediction provides several benefits:
- Speaker independence
- Easy to recognize automatically
Results are reasonable, but can be improved.

27 Future Work
- Identify the best set of phonemes and visemes
- Phoneme classification could be improved with a better matching algorithm (a neural net?)
- A larger phoneme reference set for more robust matching

28 Results
- Simple cases work very well
- Timing is good and very responsive
- Robust with respect to the speaker (cross-gender, multiple male speakers)
- Fails on: accents, fast speech, unknown phonemes
- Problems with noisy samples
- Scores can be smoothed, but smoothing introduces "lag"

29 End

30 Automatic Is Possible
- The spoken word can be broken into phonemes
- Phonemes are comprehensive
- Visemes are their visual correlates
- Used in lip-reading and traditional animation
- Physical speech (vocal cords, vocal tract) can be modeled
- Source-filter model

31 Sound Signatures (Speaker Independence)

32 Sound Signatures (For Phonemes)

33 Results (Normal Speech)
- Normal speech, moderate pace

