APGV04 Towards Perceptually Realistic Talking Heads: Models, Methods and McGurk David Marshall, Darren Cosker and Paul Rosin Cardiff School of Computer Science Susan Paddock and Simon Rushton Cardiff School of Psychology Cardiff University
APGV04 Context: A Talking Head Development of a Video-Realistic Talking Head Animation from Continuous Speech Perceptual Analysis -> Realism
APGV04 Contribution of this Paper: Perceptual Realism Test Perceptual Analysis via McGurk Test Perceptual Test with no prior bias Used to improve talking head synthesis
APGV04 Outline of Talk Video Realistic Talking Head (Overview) Perceptual Analysis and Testing The McGurk Effect + McGurk Test Results : Implications of McGurk Conclusions + Future Work
APGV04 Our Talking Head Image based synthesis Continuous Speech Flexible framework – emotion, behaviour BASIC IDEA: Train on input video and audio Extracting only low level image and audio features No phonetic labelling Synthesise new video using only input audio Unseen utterances Speaker Independent
APGV04 Hierarchical Facial Model Active Appearance Models – Control of shape and texture using single ‘appearance parameter’ Based on Principal Component Analysis (PCA) Non-linear Hierarchical PCA (developed at Cardiff) Greater Separation of Variation High Degree of Control – Sub-Facial variation not orthogonal in standard PCA model Coupling of Speech Model (Cardiff Idea)
APGV04 Building A Talking Head - Initialisation For Each Video Frame Extract: Shape – Key Landmark Points (Tracker Helps) Textures – Colour Pixel Values Normalised to Shape Speech Features – Mel-Cepstral, Linear Predictive Coding (LPC)
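The slide names Linear Predictive Coding as one of the per-frame speech features but does not say how the coefficients are computed. The following is a minimal sketch, assuming the common autocorrelation method with the Levinson-Durbin recursion; the function names, the model order and the toy input signal are illustrative assumptions, not details taken from the talk.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the (unnormalised) autocorrelation of one audio frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Estimate LPC coefficients with the Levinson-Durbin recursion.

    Returns (a, err) where a = [1, a1, ..., a_order] is the prediction-error
    filter, so the predictor is x[t] ~ -(a1*x[t-1] + ... + a_order*x[t-order]),
    and err is the residual prediction error energy.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this model order.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= (1.0 - k * k)
    return a, err


# Toy frame (assumption): a decaying exponential, which a first-order
# predictor with coefficient 0.9 describes almost exactly.
frame = [0.9 ** t for t in range(200)]
coeffs, residual = lpc(frame, order=2)
```

In a real pipeline each audio frame aligned with a video frame would be windowed first, and the resulting coefficient vector would be stored alongside that frame's shape and texture vectors.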
APGV04 Building A Talking Head - Tracking Semi Automated Hand Place Few Frames Build Interim Shape Model Track Other Frames Build Final Shape Model
APGV04 Building A Talking Head - Learning/Model Building Active Appearance Model (AAM)-> Shape (PCA) and Texture (PCA) Speech/Appearance Model (SAAM NEW) -> Speech (PCA) and AAM Nonlinear PCA: Gaussian Mixture Model (GMM) Model of Dynamics: Hidden Markov Model (HMM)
APGV04 Building A Talking Head - Synthesis + Reconstruction Input Speech -> Extract Speech Features + Find Best Clusters Bottom up reconstruction: Mouth Driven
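The "Find Best Clusters" step is not spelled out on the slide. One plausible reading, given that the model uses a Gaussian Mixture Model, is to pick the mixture component under which the incoming speech feature vector is most likely. The sketch below assumes diagonal covariances and a hypothetical dictionary layout for each cluster; it illustrates the idea only and is not the authors' implementation.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def best_cluster(x, clusters):
    """Index of the mixture component with the highest weighted likelihood.

    `clusters` is a list of dicts with keys "weight", "mean" and "var"
    (a hypothetical layout chosen for this example).
    """
    scores = [math.log(c["weight"]) + diag_gaussian_logpdf(x, c["mean"], c["var"])
              for c in clusters]
    return max(range(len(clusters)), key=scores.__getitem__)


# Two made-up clusters in a 2-D speech feature space.
clusters = [
    {"weight": 0.5, "mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"weight": 0.5, "mean": [5.0, 5.0], "var": [1.0, 1.0]},
]
chosen = best_cluster([4.6, 5.3], clusters)
```

Once a cluster is chosen, the appearance parameters would be reconstructed from the speech features within that cluster's local linear model, with the HMM constraining the sequence of clusters over time.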
APGV04 Talking Head Examples
APGV04 Talking Head Example: Independent Speaker
APGV04 Perceptual Analysis of Talking Heads Current Talking Head Analysis Methods Subjective Evaluation Analyse and Compare Trajectories Improved Perception in Noisy environments Forced Choice Testing
APGV04 Perceptual Analysis of Talking Heads Subjective and Trajectory Evaluation Analyse and Compare Trajectories Ground truth quantitative assessment Comparison to “seen” data No perceptual quality measurement Subjective Evaluation Does it “look good”? No formative comparison No feedback to improve model
APGV04 Perceptual Analysis of Talking Heads: Noisy Environment Evaluation Noisy Environment Evaluation Perceptual Evaluation Compare Performance of Synthetic v Real Talking Head in realistic situations Good overall test of talking head Lip-syncing, realism No Quantitative Measure of Performance
APGV04 Perceptual Analysis of Talking Heads: Forced Choice Testing Forced Choice Testing: Users Asked if Video is Real or Synthetic Only says if it looks realistic + lip sync is good Big Prior Introduced Users look for artefacts Randomness Bias in User selection Bored/Uninterested User No Quantitative Feedback for Model Improvement What makes it real/synthetic?
APGV04 Perceptual Analysis of Talking Heads: A New McGurk Test McGurk Test for Perceptual Analysis Subject doesn’t develop a prior Helps address strengths and weaknesses Suggests improvements based on these Complements other tests
APGV04 Perceptual Analysis of Talking Heads: The McGurk Effect McGurk and MacDonald (1976): Auditory Syllable Dubbed onto Videotape of a Different Syllable Gives Perception of an Entirely Different Syllable, e.g.: Audio ‘Ba’ Visual ‘Ga’ Perception ‘Da’ “Close Eyes – Illusion Vanishes” Raises Psychological Audio-Visual questions: How are Auditory and Visual Stimuli combined? Why combine when audio is enough?
APGV04 Perceptual Analysis of Talking Heads: Some More McGurk Effect Examples
APGV04 Perceptual Analysis of Talking Heads: Our McGurk Test McGurk Perceptual Evaluation Test: Mix Real and Synthetic tuples. What word do you perceive? Users asked to note any differences NO PRIORS as to real/synthetic forced choice User only asked about what they hear/perceive Best Viewing resolution Tested different resolutions (72x75, 36x289, 720x576 pixels)
APGV04 Perceptual Analysis of Talking Heads: Our McGurk Experimental Procedure Mix of Real and Synthetic McGurk Examples Real examples are a control Users Presented with a series of 60 (30 real, 30 synthetic) random examples Users asked only to focus on the mouth area Two initial example “training” sequences (not in trial) Soundproofed booths with adjustable volume and artificial lighting Replay option for all examples Users simply record the word they perceive Users asked three questions after viewing all clips “Did you notice anything about the videos that you can comment on?” “Could you tell that some of the videos were computer generated?” “Did you use the replay button at all?” 20 psychology undergraduate test subjects (4 male/16 female) with normal hearing/vision
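The presentation order described on this slide (30 real and 30 synthetic clips shown in random order) can be sketched as follows. The clip identifiers, the function name and the seed are hypothetical; the original test software was written in Macromedia Director, so this Python version only illustrates the randomisation step.

```python
import random


def build_trial_order(real_clips, synthetic_clips, seed=None):
    """Label each clip with its source and shuffle them into one trial list."""
    rng = random.Random(seed)  # a per-participant seed keeps the order reproducible
    trials = ([("real", clip) for clip in real_clips] +
              [("synthetic", clip) for clip in synthetic_clips])
    rng.shuffle(trials)
    return trials


# Placeholder clip names (assumption): 30 real and 30 synthetic McGurk tuples.
real_clips = [f"real_{i:02d}" for i in range(30)]
synthetic_clips = [f"synth_{i:02d}" for i in range(30)]
trial_order = build_trial_order(real_clips, synthetic_clips, seed=4)
```

The source label stays in the trial record for later analysis but is never shown to the participant, which is what keeps the test free of a real/synthetic prior.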
APGV04 Perceptual Analysis of Talking Heads: How is Our McGurk Test a Test How is this a test? Correct Lip Synch = McGurk Effect Incorrect Lip Synch = Audio/Other Audio should be dominant Questions Assess Behaviour/Output After the test procedure, participants are asked whether they noticed anything unnatural
APGV04 Perceptual Analysis of Talking Heads: Results Four Types of Analysis of Results: Standard McGurk Response From tuples form accepted audio and accepted McGurk response Original McGurk observation Enhanced McGurk Response Assemble a List of All Participants’ McGurk Responses Allows for greater variability in accents/articulation Allows for greater analysis and Improvement of Head Models Effects of Resolution on McGurk Effect End of Test Questions Analysis General overall response, qualitative analysis
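The standard and enhanced analyses differ only in which responses count as a McGurk response: a single accepted word, or a wider list assembled from what participants actually reported. A minimal scoring sketch under that reading is shown below. The trial layout is hypothetical, the responses are invented purely for illustration and are not results from the study, and the extra accepted response ‘tha’ is an assumption.

```python
from collections import Counter


def classify_response(response, audio_word, mcgurk_words):
    """Label a response as 'audio', 'mcgurk' or 'other' (case-insensitive)."""
    word = response.strip().lower()
    if word == audio_word.lower():
        return "audio"
    if word in {w.lower() for w in mcgurk_words}:
        return "mcgurk"
    return "other"


def mcgurk_rate(trials):
    """Fraction of McGurk responses per source ('real' or 'synthetic').

    Each trial is (source, response, audio_word, mcgurk_words).
    """
    counts = {}
    for source, response, audio_word, mcgurk_words in trials:
        label = classify_response(response, audio_word, mcgurk_words)
        counts.setdefault(source, Counter())[label] += 1
    return {source: c["mcgurk"] / sum(c.values()) for source, c in counts.items()}


# Audio 'Ba' + visual 'Ga' -> accepted McGurk response 'Da' (standard list);
# the enhanced list also accepts 'Tha' (assumption for this example).
standard = ["da"]
enhanced = ["da", "tha"]
responses = [("real", "Da"), ("real", "tha"), ("synthetic", "ba"), ("synthetic", "da")]

standard_rates = mcgurk_rate([(s, r, "ba", standard) for s, r in responses])
enhanced_rates = mcgurk_rate([(s, r, "ba", enhanced) for s, r in responses])
```

Comparing the per-source rates under both lists shows where the synthetic head falls short of the real footage, which is the feedback the test is meant to give the model builders.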
APGV04 Perceptual Analysis of Talking Heads: Standard McGurk Response
APGV04 Perceptual Analysis of Talking Heads: Enhanced McGurk Response
APGV04 Perceptual Analysis of Talking Heads: Image Resolution
APGV04 Perceptual Analysis of Talking Heads: End of Test Questions Results “Notice anything to comment on?” Some audio didn’t match video “Could you tell some synthetic?” No, 1 participant = some unnatural? “Did you use replay?” Few = once, One = twice
APGV04 Perceptual Analysis of Talking Heads: Overall Results Analysis Realistic behaviour Most users were unaware of synthetic output More McGurk effects in real output Points to some weakness in model Good Synthesis of /F/, /D/, /S/, /A/ and /E/ Poor Synthesis of /V/ Some weak real and synthetic McGurk responses Beige-Gaze-Deige -> 2X Audio v McGurk Mock-Dock-Knock -> 50:50 Audio:McGurk Resolution has effect on real only Due to overall lower synthetic McGurk response
APGV04 Conclusions Suggested a perceptual approach to analysis and development of a Talking Head Unbiased by prior forced choice making Insight into performance of algorithms Complements other tests
APGV04 Perceptual Analysis of Talking Heads: Future Work Talking Head Full Emotion Performance Driven Animation 3D Modelling Full 3D appearance modelling Other perceptual tests Longer videos – McGurk sentences Real/Synthesised correct lip synch: McGurk = bad synch? Emotion – A McGurk emotion test?
APGV04 Web Links Paper Downloads McGurk Video Clips and McGurk Test Software (Macromedia Director)