SecurePhone Workshop - 24/25 June 2004: Speaking Faces Verification. Kevin McTait, Raphaël Blouet, Gérard Chollet, Silvia Colón, Guido Aversano.


1 Speaking Faces Verification
Kevin McTait, Raphaël Blouet, Gérard Chollet, Silvia Colón, Guido Aversano

2 Outline
- Speaking faces verification problem
- State of the art in speaking faces verification
- Choice of system architecture
- Fusion of audio and visual modalities
- Initial results using the BANCA database (Becars: voice-only system)

3 Problem definition
- Detection and tracking of lips in the video sequence:
  - Locate head/face in image frame
  - Locate mouth/lips area (Region of Interest)
  - Determine/calculate lip contour coordinates and intensity parameters (visual feature extraction)
  - Other parameters: visible teeth, tongue, jaw movement, eyebrows, cheeks etc.
- Modelling parameters:
  - Model deformation of lip (or other) parameters over time: HMMs, GMMs…
- Fusion of visual and acoustic parameters/models
- Calculate the likelihood of the client model relative to the world model in order to accept/reject
- Augment the in-house speaker verification system (Becars) with visual parameters
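The accept/reject step above can be sketched as a log-likelihood ratio test. This is a minimal illustration, not the Becars implementation: the function name `verify`, the per-frame normalisation and the default threshold of 0.0 are all assumptions made for the example.

```python
def verify(client_loglik, world_loglik, num_frames, threshold=0.0):
    """Accept a claimed identity from two model scores.

    client_loglik / world_loglik: total log-likelihood of the observed
    sequence under the client model and the world (background) model.
    The ratio is normalised per frame (an assumption of this sketch) so
    that the threshold does not depend on utterance length.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    llr = (client_loglik - world_loglik) / num_frames
    return llr >= threshold, llr


accepted, llr = verify(-1200.0, -1300.0, 100)
print(accepted, llr)
```

In practice the threshold would be tuned on development data to trade false acceptances against false rejections.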

4 Limitations
- Limited device (storage and CPU processing power)
- Subject variability (aging, beard, glasses…), pose, illumination
- Low complexity algorithms
- Subspace transforms, learning methods
- Image based approaches, hue colouration/chromaticity clues
- Model based approaches

5 Active Shape Models
- Identification: based on spatio-temporal analysis of the video sequence
- Person represented by a deformable parametric model of the visible speech articulators (usually lips) with their temporal characteristics
- Active Shape Model consists of shape parameters (lip contours) and greyscale/colour intensity (for illumination)
- Model trained on a training set using PCA to recover the principal modes of deformation of the model
- Model used to track lips over time; model parameters recovered from lip tracking results
- Shape and intensity modelled by GMMs, temporal dependencies (state transition probabilities) by HMMs
- Verification: using the Viterbi algorithm, estimate the likelihood that the client model generated the observed sequence of features; if it is above a threshold, accept, else reject
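The PCA training step can be illustrated with a small sketch that recovers the mean shape and the first principal mode of deformation. It assumes each training shape is a flat list of landmark coordinates, and it uses plain power iteration instead of a full eigendecomposition; the names `first_mode` and `synthesise` are hypothetical and not taken from the slides.

```python
import math


def mean_shape(shapes):
    """Component-wise mean of equally sized shape vectors."""
    n = len(shapes)
    return [sum(s[d] for s in shapes) / n for d in range(len(shapes[0]))]


def first_mode(shapes, iterations=200):
    """Return (mean, dominant unit eigenvector, its variance) of the shapes."""
    mu = mean_shape(shapes)
    centred = [[x - m for x, m in zip(s, mu)] for s in shapes]
    dim = len(mu)
    # Sample covariance matrix of the centred shape vectors.
    cov = [[sum(c[i] * c[j] for c in centred) / (len(shapes) - 1)
            for j in range(dim)] for i in range(dim)]
    # Power iteration from an arbitrary non-zero start vector.
    v = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim))
                   for i in range(dim))
    return mu, v, variance


def synthesise(mu, mode, b):
    """Generate a shape as mean + b * mode (b is the shape parameter)."""
    return [m + b * p for m, p in zip(mu, mode)]
```

Restricting `b` to a few standard deviations of the mode's variance is one common way to keep generated shapes close to those seen in training.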

6 Active Shape Models
- Robust detection, tracking and parameterisation of visual features
- Statistical; avoids use of constraints, thresholds, penalties
- Model only allowed to deform to shapes similar to those seen in the training set (trained using PCA)
- Represent object by a set of labelled points representing contours, height, width, area etc.
- Model consists of 5 Bézier curves (B-spline functions), each defined by two end points $P_0$ and $P_2$ and one control point $P_1$: $P(t) = \theta_0(t) P_0 + \theta_1(t) P_1 + \theta_2(t) P_2$
- Figure panels: points distribution model, shape approximation
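A single curve segment of this form can be evaluated as follows. The slide does not spell out the blending functions θ, so this sketch assumes the standard quadratic Bernstein basis, θ_0(t) = (1-t)², θ_1(t) = 2t(1-t), θ_2(t) = t², for t in [0, 1]; the function names are illustrative only.

```python
def quadratic_bezier(p0, p1, p2, t):
    """Point on a quadratic Bezier curve with end points p0, p2 and control point p1."""
    b0 = (1 - t) ** 2       # weight of the first end point
    b1 = 2 * t * (1 - t)    # weight of the control point
    b2 = t ** 2             # weight of the second end point
    return tuple(b0 * a + b1 * b + b2 * c for a, b, c in zip(p0, p1, p2))


def sample_curve(p0, p1, p2, n=5):
    """Sample n >= 2 evenly spaced parameter values along the curve."""
    return [quadratic_bezier(p0, p1, p2, i / (n - 1)) for i in range(n)]


# Example: an arch resembling an upper lip contour.
for point in sample_curve((0.0, 0.0), (1.0, 2.0), (2.0, 0.0)):
    print(point)
```

The curve passes through the two end points but only bends towards the control point, which is why a handful of such segments can approximate a lip contour with few parameters.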

7 Spatio-temporal model
- Visual observation of speaker: $O = o_1, o_2, \dots, o_T$
- Assumption: feature vectors follow a normal distribution, as in the acoustic domain, and are modelled by GMMs
- Assumption: temporal changes are piece-wise stationary and follow a first order Markov process
- Each state in the HMM represents several consecutive feature vectors
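Under these two assumptions, scoring an observation sequence against a model reduces to a Viterbi pass. The sketch below is a simplification: each state emits from a single diagonal-covariance Gaussian instead of a full GMM, and all names and parameter layouts are assumptions made for the example.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_score(obs, log_init, log_trans, means, variances):
    """Return (best-path log-likelihood, best state sequence)."""
    n_states = len(log_init)
    delta = [log_init[s] + log_gauss(obs[0], means[s], variances[s])
             for s in range(n_states)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: delta[i] + log_trans[i][j])
            step.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j]
                             + log_gauss(o, means[j], variances[j]))
        back.append(step)
        delta = new_delta
    # Trace the best path backwards from the best final state.
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    path.reverse()
    return delta[last], path
```

Because consecutive frames tend to stay in the same state, each state ends up accounting for a run of feature vectors, which matches the piece-wise stationarity assumption.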

8 ASM: Training

9 ASM: Tracking

10 ASM: Lip Tracking Examples

11 Image Based Approach
- Hue and saturation levels to find lip region (ROI)
- Eliminate outliers (red blobs) by constraints (geometric, gradient, saturation)
- Motion constraints: difference image (1d), the pixelwise absolute difference between two adjacent frames
- Figure panels:
  - a) greyscale image
  - b) hue image
  - c) binary hue/saturation thresholding
  - d) accumulated difference image
  - e) binary image after thresholding
  - f) combined binary image: c AND e
- Find largest connected region
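The chain from panels c) to f) and the final region search can be sketched on small images stored as nested lists. This is an illustration only: the threshold values, the hue range, the 4-connectivity rule and every function name are assumptions, and a real system would use an image library and wrap-around hue handling for red tones.

```python
from collections import deque


def accumulated_difference(frames):
    """Sum of pixelwise absolute differences between adjacent greyscale frames."""
    h, w = len(frames[0]), len(frames[0][0])
    acc = [[0] * w for _ in range(h)]
    for prev, cur in zip(frames, frames[1:]):
        for y in range(h):
            for x in range(w):
                acc[y][x] += abs(cur[y][x] - prev[y][x])
    return acc


def binarise(img, thresh):
    """1 where the pixel value reaches the threshold, else 0."""
    return [[1 if v >= thresh else 0 for v in row] for row in img]


def hue_sat_mask(hue, sat, hue_lo, hue_hi, sat_min):
    """1 where hue lies in [hue_lo, hue_hi] and saturation is high enough."""
    return [[1 if hue_lo <= h <= hue_hi and s >= sat_min else 0
             for h, s in zip(hue_row, sat_row)]
            for hue_row, sat_row in zip(hue, sat)]


def logical_and(a, b):
    """Pixelwise AND of two binary images."""
    return [[p & q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def largest_region(mask):
    """Pixels (y, x) of the largest 4-connected region of ones."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            region, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best
```

Combining the colour mask with the motion mask discards red areas that do not move, such as clothing or background, before the largest remaining region is taken as the lip candidate.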

12 Image Based Approach (2)
- Derive lip dimensions using colour and edge information
- Markov random field framework to combine the two sources of information and segment the lips from the background
- Implementation close to completion

13 Other Approaches
- Deformable template/model/contour based:
  - Geometric shapes, shape models, eigenvectors, appearance models; deform in order to minimise an energy/distance function relating template parameters and image; template matching (correlation), best fit template, active shape models, active appearance models, model fitting problem
- Learning based approach:
  - MLP, SVMs…
- Knowledge based approach:
  - Subject rules or information to find and extract features, eye/nose detection, symmetry
- Visual motion analysis:
  - Motion analysis techniques, motion cues, difference images after thresholding and filtering
  - Optical flow, filter tracking (computationally expensive)
- Hue and saturation thresholding:
  - Intensity of ruddy areas, problem of removing outliers
- Image subspace transforms:
  - DCT, PCA, Discrete Wavelet, KLT (DWT + PCA analysis of ROI), FFT

14 Fusion of audio-visual information
- Instance of the general classifier problem (bimodal classifier)
- 2 observation streams: audio + video, providing information about hidden class labels
- Typically each observation stream is used to train a single modality classifier
- Aim: combine both streams to produce a bimodal classifier that recognises the pertinent classes with a higher level of accuracy
- 2 general types/levels of fusion:
  - Feature fusion
  - Decision fusion

15 Feature Fusion
- Feature fusion: HMM classifier, concatenated feature vector of audio and visual parameters (time-synchronous features, possibly including upsampling)
- Generation process of feature vector
- Using a single stream HMM with emission (class conditional observation) probabilities given by a Gaussian distribution
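Building the concatenated vector requires both streams at the same frame rate. A minimal sketch follows, assuming the visual stream is the slower one and that linear interpolation is an acceptable way to upsample it; the slides do not say which method was used, and the function names are hypothetical.

```python
def upsample_linear(frames, target_len):
    """Linearly interpolate a sequence of feature vectors to target_len frames."""
    n = len(frames)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)   # position on the source time axis
        i = min(int(pos), n - 2)               # left neighbour index
        frac = pos - i
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(frames[i], frames[i + 1])])
    return out


def fuse_features(audio, visual):
    """Concatenate each audio frame with the time-aligned visual frame."""
    visual_up = upsample_linear(visual, len(audio))
    return [list(a) + v for a, v in zip(audio, visual_up)]


audio = [[1.0], [2.0], [3.0]]      # e.g. 3 acoustic frames
visual = [[0.0], [10.0]]           # e.g. 2 video frames over the same interval
print(fuse_features(audio, visual))
```

The fused vectors can then be modelled by a single-stream HMM exactly as audio-only features would be, at the cost of a higher feature dimension.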

16 Decision Fusion
- State synchronous decision fusion
- Captures reliability of each stream
- HMM state level: combine single modality HMM classifier outputs
- Class conditional log-likelihoods from the 2 classifiers linearly combined with appropriate weights
- Various levels: state (phone, syllable, word…)
- Multi-stream HMM classifier, state emission probabilities
- Product HMMs, factorial HMMs…
- Other classifiers (SVMs, Bayesian classifiers, MLP…)
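The linear combination of class conditional log-likelihoods can be sketched as below. The constraint that the two weights sum to one, the default audio weight of 0.7 and the function names are assumptions of this example; the slides only state that appropriate weights are used.

```python
def fuse_scores(audio_loglik, visual_loglik, audio_weight=0.7):
    """Weighted sum of the two single-modality log-likelihoods."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik


def classify(scores_by_class, audio_weight=0.7):
    """Pick the class with the highest fused score.

    scores_by_class maps a class label to (audio_loglik, visual_loglik).
    """
    return max(scores_by_class,
               key=lambda c: fuse_scores(*scores_by_class[c], audio_weight))


scores = {"client": (-10.0, -30.0), "world": (-20.0, -12.0)}
print(classify(scores, audio_weight=0.7))
print(classify(scores, audio_weight=0.2))
```

The weight expresses how much each stream is trusted, so lowering the audio weight in noisy acoustic conditions lets the visual stream dominate the decision.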

17 BANCA: results

