Database and Visual Front End Makis Potamianos.


7 Database and Visual Front End Makis Potamianos

14 Active Appearance Model Visual Features Iain Matthews

15 Acknowledgments Cootes, Edwards, Taylor (Manchester); Sclaroff (Boston)

16 AAM Overview Shape & appearance model: landmarks define the shape; the region of interest is warped to the reference shape to give the appearance.
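A minimal numpy sketch of the shape-and-appearance parameterisation this slide implies: shape is the mean landmark vector plus a linear combination of shape modes, and appearance is the texture of the region of interest after warping to the reference shape. The mode matrices, parameter names, and the warp function are illustrative assumptions, not the workshop code.

```python
import numpy as np

def aam_shape(s_mean, S_modes, b_s):
    """Shape = mean landmark vector + linear combination of shape modes."""
    return s_mean + S_modes @ b_s            # (2 * n_landmarks,)

def aam_appearance(a_mean, A_modes, b_a):
    """Appearance = mean texture + linear combination of appearance modes,
    defined on the pixels of the reference shape."""
    return a_mean + A_modes @ b_a            # (n_reference_pixels,)

def shape_free_patch(image, b_s, s_mean, S_modes, warp_to_reference):
    """Sample the region of interest under the current shape and warp it to
    the reference frame (warp_to_reference is an assumed piecewise warp)."""
    landmarks = aam_shape(s_mean, S_modes, b_s)
    return warp_to_reference(image, landmarks)
```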

17 Relationship to DCT Features External feature detector vs. model-based learned tracking: face detector + DCT features vs. AAM tracker + AAM features; ROI 'box' vs. explicit shape + appearance modelling.

18 Training Data 4072 hand-labelled images = 2m 13s (out of 50h)

19 Final Model 33 33 mean

20 Fitting Algorithm Warp the image under the model to the reference frame; take the difference (error) between this and the current model projection (appearance); the predicted update to c (all model parameters) is the error image scaled by a learned weight matrix; apply the update and iterate until convergence.
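A hedged sketch of that fitting loop, assuming a pre-learned update (weight) matrix R that maps the residual image to a parameter update; the function names, the sign convention on R, and the convergence test are illustrative only.

```python
import numpy as np

def fit_aam(image, project_model, sample_under_model, R, c0,
            max_iters=30, tol=1e-4):
    """Iteratively refine all model parameters c by driving the residual
    between the image (warped to reference) and the model appearance to zero."""
    c = c0.copy()
    for _ in range(max_iters):
        model_app = project_model(c)              # current model projection (appearance)
        image_app = sample_under_model(image, c)  # image under model, warped to reference
        residual = image_app - model_app          # difference (error) image
        delta_c = -(R @ residual)                 # predicted update; sign depends on how R was trained
        c = c + delta_c
        if np.linalg.norm(delta_c) < tol:         # iterate until convergence
            break
    return c
```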

21 Tracking Results Worst sequence: mean mean-square error = 548.87; best sequence: mean mean-square error = 89.11.

22 Tracking Results Full-face AAM tracker on a subset of the VVAV database: 4,952 sequences, 1,119,256 images @ 30 fps = 10h 22m; mean mean-MSE per sentence = 254.21; tracking rate (m2p decode) ≈ 4 fps. Beard-area and lips-only models will not track: do these regions lack the sharp texture gradients needed to locate the model?

23 Features Use AAM full-face features directly (86-dimensional)

24 Audio Lattice Rescoring Results Lattice random path = 78.14%; DCT with LM = 51.08%; DCT no LM = 61.06%

25 Audio Lattice Rescoring Results AAM vs. DCT vs. Noise

26 Tracking Errors Analysis AAM vs. Tracking error

27 Analysis and Future Work Models are under-trained: little more than face detection on 2m of training data.

28 Analysis and Future Work Models are under-trained: little more than face detection on 2m of training data. Project the face through a more compact model to retain only useful articulation information, then reproject.

29 Analysis and Future Work Models are under-trained: little more than face detection on 2m of training data. Project the face through a more compact model to retain only useful articulation information, then reproject. Improve the reference shape for minimal information loss through the warping.

30 Asynchronous Stream Modelling Juergen Luettin

31 The Recognition Problem M: word (phoneme) sequence; M*: most likely word sequence; O_A: acoustic observation sequence; O_V: visual observation sequence.
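In the usual MAP formulation (a standard restatement in my notation, not quoted from the slide), the decoder searches for

```latex
M^{*} = \arg\max_{M} P(M \mid O_A, O_V)
      = \arg\max_{M} P(O_A, O_V \mid M)\, P(M).
```

The integration schemes on the following slides differ in how P(O_A, O_V | M) is factored.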

32 Integration at the Feature Level Assumption: conditional dependence between the modalities; integrate at the feature level.
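Concretely, feature-level integration is usually realised by concatenating the per-frame audio and visual feature vectors and modelling the joint vector with a single set of densities (a standard formulation, assumed here rather than taken from the slide):

```latex
o_t = \begin{bmatrix} o_{A,t} \\ o_{V,t} \end{bmatrix}, \qquad
P(O_A, O_V \mid M) = P\big(o_1, \dots, o_T \mid M\big).
```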

33 Integration at the Decision Level Assumption: conditional independence between the modalities; integrate at the unit level.
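Under that independence assumption the per-modality likelihoods are typically combined log-linearly with a stream weight λ (a common formulation, assumed for illustration):

```latex
\log P(O_A, O_V \mid M) \approx
\lambda \,\log P(O_A \mid M) + (1 - \lambda)\,\log P(O_V \mid M).
```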

34 Multiple Synchronous Streams Assumption: conditional independence; integration at the state level. Two streams in each state. X: state sequence; a_ij: transition probability from state i to j; b_j: observation probability density; c_jm: weight of the m-th multivariate Gaussian mixture component.
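The state-synchronous two-stream emission density is conventionally written as below (standard multi-stream HMM notation; the exponents λ_s are the global stream weights mentioned on the AVSR system slide):

```latex
b_j(o_t) = \prod_{s \in \{A,V\}}
  \Bigg[ \sum_{m=1}^{M_s} c_{jsm}\,
         \mathcal{N}\big(o_{st};\, \mu_{jsm}, \Sigma_{jsm}\big) \Bigg]^{\lambda_s}
```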

35 Multiple Asynchronous Streams Assumption: conditional independence; integration at the unit level. Decoding: individual best state sequences for audio and video.

36 Composite HMM definition [Product-state diagram, states 1-9.] Speech-noise decomposition (Varga & Moore, 1993); audio-visual decomposition (Dupont & Luettin, 1998).
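A hedged sketch of how a composite (product-state) HMM can be assembled from an audio HMM and a visual HMM for the same unit: composite states are pairs of per-stream states, transitions factorise as the product of the per-stream transition probabilities, and emissions multiply (optionally stream-weighted). This is an illustrative construction, not the workshop's HTK recipe, and lam is an assumed weighting.

```python
import numpy as np
from itertools import product

def composite_hmm(A_audio, A_video, b_audio, b_video, lam=0.5):
    """Build the product-state transition matrix and emission function.

    A_audio, A_video : per-stream transition matrices
    b_audio, b_video : per-stream emission functions b(state, obs) -> likelihood
    lam              : audio stream weight (assumed; 1 - lam for video)
    """
    n_a, n_v = A_audio.shape[0], A_video.shape[0]
    states = list(product(range(n_a), range(n_v)))       # (audio_state, video_state) pairs
    A = np.zeros((len(states), len(states)))
    for i, (ia, iv) in enumerate(states):
        for j, (ja, jv) in enumerate(states):
            A[i, j] = A_audio[ia, ja] * A_video[iv, jv]   # streams advance independently

    def b(i, o_a, o_v):
        ia, iv = states[i]
        return (b_audio(ia, o_a) ** lam) * (b_video(iv, o_v) ** (1.0 - lam))

    return states, A, b
```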

37 Stream Clustering

38 AVSR System 3-state HMMs with 12 mixture components, 7-state HMMs for the composite model; context-dependent phone models (silence, short pause), tree-based state clustering; cross-word context-dependent decoding, using lattices computed at IBM; trigram language model; global stream weights in multi-stream models, estimated on a held-out set.
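A minimal sketch of estimating a global stream weight on a held-out set by grid search over word error rate; the decode_wer callback and the grid are assumptions for illustration, not the estimation procedure actually used.

```python
import numpy as np

def tune_stream_weight(decode_wer, grid=np.linspace(0.0, 1.0, 21)):
    """decode_wer(lam) should rescore the held-out set with audio weight lam
    (visual weight 1 - lam) and return the word error rate."""
    wers = [decode_wer(lam) for lam in grid]
    best = int(np.argmin(wers))
    return grid[best], wers[best]
```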

39 Speaker independent word recognition

40 Conclusions The AV 2-stream asynchronous model beats the other models in noisy conditions. Future directions: transition matrices (context-dependent, pruning transitions with low probability, cross-unit asynchrony); stream weights (model-based, discriminative); clustering (taking stream-tying into account).

41 Phone Dependent Weighting Dimitra Vergyri

46 Weight Estimation Hervé Glotin

55 Visual Clustering June Sison

56 Outline Motivation for the use of visemes in triphone classification; definition of visemes; goals of viseme usage; inspection of phone trees (validity check).

57 Equivalence Classification Combats the problem of data sparseness; it must be sufficiently refined that the equivalence classification can serve as a basis for prediction. Decision trees are used to achieve the equivalence classification [co-articulation]. To derive an EC: 1] collect speech data realizing each phone; 2] classify [cluster] this speech into appropriately distinct categories.
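A hedged sketch of the greedy, likelihood-gain-driven question splitting that tree-based equivalence classification typically uses; the gain function is a placeholder (real systems compute it from per-state Gaussian sufficient statistics), and the names are illustrative.

```python
def grow_tree(states, questions, gain, min_gain=0.0):
    """Greedily split a pool of triphone states with yes/no context questions.

    states    : items to cluster (e.g. triphone states with accumulated stats)
    questions : predicates q(state) -> bool (e.g. "is the left context a bilabial viseme?")
    gain      : gain(parent, yes_set, no_set) -> log-likelihood gain of the split
    """
    best_q, best_gain, best_split = None, min_gain, None
    for q in questions:
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        if not yes or not no:
            continue                                   # question does not split the pool
        g = gain(states, yes, no)
        if g > best_gain:
            best_q, best_gain, best_split = q, g, (yes, no)
    if best_q is None:                                 # no question clears the threshold: leaf
        return {"leaf": states}
    yes, no = best_split
    return {"question": best_q,
            "yes": grow_tree(yes, questions, gain, min_gain),
            "no": grow_tree(no, questions, gain, min_gain)}
```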

59 Definition of visemes Canonical mouth shapes that accompany speech utterances; they complement the phonetic stream [examples].

61 Visual vs Audio Contexts 276 QS total: 84 single-phoneme QS, 116 audio QS, 76 visual QS. Root nodes: 123 total: 33 visual, 74 audio, 16 single-phoneme.

63 Visual Models Azad Mashari

64 Visual Speech Recognition The Model Trinity: Audio-Clustered Model (Question Set 1), Self-Clustered Model (Question Set 1), Self-Clustered Model (Question Set 2). The "Results" (from which we learn what not to do). The Analysis. Places to Go, Things to Do...

65 The Questions Set 1: original audio questions; 202 questions based primarily on voicing and manner. Set 2: audio-visual questions; 274 questions (includes Set 1), adding questions regarding place of articulation.

66 The Trinity Audio-Clustered model: decision trees generated from the audio data using question set 1; visual triphone models clustered using these trees. Self-Clustered (old): decision trees generated from the visual data using question set 1. Self-Clustered (new): decision trees generated from the visual data using question set 2.

67 Experiment I Three major factors: independence/complementarity of the two streams, quality of the representation, generalization. Speaker-independent test: noisy audio lattices rescored using the visual models.

68 Experiment I Rescoring noisy audio lattices using the visual models

69 Experiment I

70 Speaker variability of the visual models follows the variability of the audio models (we don't know why; the lattices?). This does not mean that they are not "complementary". Viseme clustering gives better results for some speakers only, with no overall gain (we don't know why). Are the new questions being used? Over-training? ~7000 clusters in the audio models for ~40 phonemes; the same number in the visual models, but there are only ~12 "visemes" -> experiments with fewer clusters. Is the greedy clustering algorithm making a less optimal tree with the new questions?

71 Experiment II Several ways to get fewer clusters: increase the minimum cluster size; increase the likelihood gain threshold; remove questions (especially those frequently used at higher depths, as well as unused ones); any combination of the above. Tripling the minimum likelihood gain threshold (single-mixture models) -> insignificant increase in error: ~7000 clusters -> 54.24%, ~2500 clusters -> 54.57%. Even fewer clusters (~150-200)? A different reduction strategy?

72 Places to Go, Things to See... Finding optimal clustering parameters: current values are optimized for MFCC-based audio models. Clustering with viseme-based questions only. Looking at errors in the recognition of particular phones/classes.

73 Visual Model Adaptation Jie Zhou

74 Visual Model Adaptation Problem: the speaker-independent system is not sufficient to accurately model each new speaker. Solution: use adaptation to make the speaker-independent system better fit the characteristics of each new speaker.

75 HMM Adaptation To get a new estimate of the adapted mean µ, we use the transformation µ = Wε, where W is the (n x n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector.
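A hedged numpy sketch of estimating one global mean transform of this form: it fits W by weighted least squares between the original Gaussian means and the means observed in the adaptation data, which simplifies the full MLLR maximum-likelihood solution (and omits the bias term MLLR usually includes). All names are illustrative.

```python
import numpy as np

def estimate_global_transform(orig_means, adapt_means, counts):
    """Estimate W such that adapted mean ~ W @ original mean (mu = W eps).

    orig_means  : (G, n) original Gaussian means (the eps vectors)
    adapt_means : (G, n) means of adaptation data aligned to each Gaussian
    counts      : (G,)   occupation counts, used as weights
    """
    w = np.sqrt(counts)[:, None]
    X = orig_means * w                          # weighted inputs
    Y = adapt_means * w                         # weighted targets
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves X @ B ~ Y in least squares
    return B.T                                  # W, an (n x n) transform

def apply_transform(orig_means, W):
    """Apply mu = W eps to every Gaussian mean."""
    return orig_means @ W.T
```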

76 [Diagram: speaker-independent data with mean ε mapped to speaker-specific data with mean µ.]

77 [Block diagram: VVAV HMM models (ε, σ) built from speaker-independent data are adapted with HEAdapt using speaker-adapted data, giving a transformed speaker-independent model (µ = Wε) that is used for recognition on the test data.]

78 Procedure Speaker adaptation of the visual models was performed using MLLR (the adaptation method), a global transform, and single-mixture triphones. Adaptation data: average 5 minutes per speaker. Test data: average 6 minutes per speaker.

79 Results Word error, %
Speaker   Speaker Independent   Speaker Adapted
AXK       44.05%                41.92%
JFM       61.41%                59.23%
JXC       62.28%                60.48%
LCY       31.23%                29.32%
MBG       83.73%                83.56%
MDP       30.16%                29.89%
RTG       57.44%                55.73%
BAE       36.81%                36.17%
CNM       84.73%                83.89%
DJF       71.96%                71.15%
Average   58.98%                55.49%

80 Future Better adaptation could be achieved by: employing multiple transforms instead of a single transform; attempting other methods of adaptation, such as MAP, given more data; using mixture Gaussians in the model.

81 Summary and Conclusions Chalapathy Neti

86 The End.

87 Extra Slides…

88 State-based Clustering

89 Error rate on DCT features Word error rate on small multi-speaker test set
                               Language Model   No Language Model
Lattice Depth 1, Clean Audio   24.79            27.79
Lattice Depth 3, Clean Audio   25.55            34.58
Lattice, Noisy Audio           49.79            55.00

90 Audio Lattice Rescoring Results
Visual Feature                    Word Error Rate, %
AAM - 86 features                 65.69
AAM - 30 features                 65.66
AAM - 30 + Δ + ΔΔ                 69.50
AAM - 86, LDA → 24, WiLDA ±7      64.00
DCT - 18 + Δ + ΔΔ                 61.80
DCT - 24, WiLDA ±7                58.14
Noise - 30                        61.37
DCT WiLDA no LM = 65.14; lattice random path = 78.32

91 Overview Shape Appearance

