Database and Visual Front End Makis Potamianos.


7 Database and Visual Front End Makis Potamianos

14 Active Appearance Model Visual Features Iain Matthews

15 Acknowledgments Cootes, Edwards, Taylor (Manchester); Sclaroff (Boston)

16 AAM Overview Shape & appearance model: landmarks define the shape; the region of interest is warped to the reference shape to give the appearance.
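A minimal numpy sketch of the shape-and-appearance parameterisation this slide implies: shape is the mean landmark vector plus a linear combination of shape modes, and appearance is the texture of the region of interest after warping to the reference shape. The mode matrices, parameter names, and the warp function are illustrative assumptions, not the workshop code.

```python
import numpy as np

def aam_shape(s_mean, S_modes, b_s):
    """Shape = mean landmark vector + linear combination of shape modes."""
    return s_mean + S_modes @ b_s            # (2 * n_landmarks,)

def aam_appearance(a_mean, A_modes, b_a):
    """Appearance = mean texture + linear combination of appearance modes,
    defined on the pixels of the reference shape."""
    return a_mean + A_modes @ b_a            # (n_reference_pixels,)

def shape_free_patch(image, b_s, s_mean, S_modes, warp_to_reference):
    """Sample the region of interest under the current shape and warp it to
    the reference frame (warp_to_reference is an assumed piecewise warp)."""
    landmarks = aam_shape(s_mean, S_modes, b_s)
    return warp_to_reference(image, landmarks)
```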

17 Relationship to DCT Features External feature detector vs. model-based learned tracking: face detector + DCT features vs. AAM tracker + AAM features; ROI 'box' vs. explicit shape + appearance modelling.

18 Training Data 4072 hand-labelled images = 2m 13s (out of 50h)

19 Final Model 33 33 mean

20 Fitting Algorithm Warp the image under the model to the reference frame; take the difference (error) between this and the current model projection (appearance); the predicted update to c (all model parameters) is the error image scaled by a learned weight matrix; apply the update and iterate until convergence.
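A hedged sketch of that fitting loop, assuming a pre-learned update (weight) matrix R that maps the residual image to a parameter update; the function names, the sign convention on R, and the convergence test are illustrative only.

```python
import numpy as np

def fit_aam(image, project_model, sample_under_model, R, c0,
            max_iters=30, tol=1e-4):
    """Iteratively refine all model parameters c by driving the residual
    between the image (warped to reference) and the model appearance to zero."""
    c = c0.copy()
    for _ in range(max_iters):
        model_app = project_model(c)              # current model projection (appearance)
        image_app = sample_under_model(image, c)  # image under model, warped to reference
        residual = image_app - model_app          # difference (error) image
        delta_c = -(R @ residual)                 # predicted update; sign depends on how R was trained
        c = c + delta_c
        if np.linalg.norm(delta_c) < tol:         # iterate until convergence
            break
    return c
```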

21 Tracking Results Worst sequence: mean mean-square error = 548.87; best sequence: mean mean-square error = 89.11.

22 Tracking Results Full-face AAM tracker on a subset of the VVAV database: 4,952 sequences, 1,119,256 images @ 30 fps = 10h 22m; mean mean-MSE per sentence = 254.21; tracking rate (m2p decode) ≈ 4 fps. Beard-area and lips-only models will not track: do these regions lack the sharp texture gradients needed to locate the model?

23 Features Use AAM full-face features directly (86-dimensional)

24 Audio Lattice Rescoring Results Lattice random path = 78.14%; DCT with LM = 51.08%; DCT no LM = 61.06%

25 Audio Lattice Rescoring Results AAM vs. DCT vs. Noise

26 Tracking Errors Analysis AAM vs. Tracking error

27 Analysis and Future Work Models are under-trained: little more than face detection on 2m of training data.

28 Analysis and Future Work Models are under-trained: little more than face detection on 2m of training data. Project the face through a more compact model to retain only useful articulation information, then reproject.

29 Analysis and Future Work Models are under-trained: little more than face detection on 2m of training data. Project the face through a more compact model to retain only useful articulation information, then reproject. Improve the reference shape for minimal information loss through the warping.

30 Asynchronous Stream Modelling Juergen Luettin

31 The Recognition Problem M: word (phoneme) sequence; M*: most likely word sequence; O_A: acoustic observation sequence; O_V: visual observation sequence.
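In the usual MAP formulation (a standard restatement in my notation, not quoted from the slide), the decoder searches for

```latex
M^{*} = \arg\max_{M} P(M \mid O_A, O_V)
      = \arg\max_{M} P(O_A, O_V \mid M)\, P(M).
```

The integration schemes on the following slides differ in how P(O_A, O_V | M) is factored.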

32 Integration at the Feature Level Assumption: conditional dependence between the modalities; integrate at the feature level.
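Concretely, feature-level integration is usually realised by concatenating the per-frame audio and visual feature vectors and modelling the joint vector with a single set of densities (a standard formulation, assumed here rather than taken from the slide):

```latex
o_t = \begin{bmatrix} o_{A,t} \\ o_{V,t} \end{bmatrix}, \qquad
P(O_A, O_V \mid M) = P\big(o_1, \dots, o_T \mid M\big).
```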

33 Integration at the Decision Level Assumption: conditional independence between the modalities; integrate at the unit level.
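Under that independence assumption the per-modality likelihoods are typically combined log-linearly with a stream weight λ (a common formulation, assumed for illustration):

```latex
\log P(O_A, O_V \mid M) \approx
\lambda \,\log P(O_A \mid M) + (1 - \lambda)\,\log P(O_V \mid M).
```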

34 Multiple Synchronous Streams Assumption: conditional independence; integration at the state level. Two streams in each state. X: state sequence; a_ij: transition probability from state i to j; b_j: observation probability density; c_jm: weight of the m-th multivariate Gaussian mixture component.
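The state-synchronous two-stream emission density is conventionally written as below (standard multi-stream HMM notation; the exponents λ_s are the global stream weights mentioned on the AVSR system slide):

```latex
b_j(o_t) = \prod_{s \in \{A,V\}}
  \Bigg[ \sum_{m=1}^{M_s} c_{jsm}\,
         \mathcal{N}\big(o_{st};\, \mu_{jsm}, \Sigma_{jsm}\big) \Bigg]^{\lambda_s}
```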

35 Multiple Asynchronous Streams Assumption: conditional independence; integration at the unit level. Decoding: individual best state sequences for audio and video.

36 Composite HMM definition [Product-state diagram, states 1-9.] Speech-noise decomposition (Varga & Moore, 1993); audio-visual decomposition (Dupont & Luettin, 1998).
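A hedged sketch of how a composite (product-state) HMM can be assembled from an audio HMM and a visual HMM for the same unit: composite states are pairs of per-stream states, transitions factorise as the product of the per-stream transition probabilities, and emissions multiply (optionally stream-weighted). This is an illustrative construction, not the workshop's HTK recipe, and lam is an assumed weighting.

```python
import numpy as np
from itertools import product

def composite_hmm(A_audio, A_video, b_audio, b_video, lam=0.5):
    """Build the product-state transition matrix and emission function.

    A_audio, A_video : per-stream transition matrices
    b_audio, b_video : per-stream emission functions b(state, obs) -> likelihood
    lam              : audio stream weight (assumed; 1 - lam for video)
    """
    n_a, n_v = A_audio.shape[0], A_video.shape[0]
    states = list(product(range(n_a), range(n_v)))       # (audio_state, video_state) pairs
    A = np.zeros((len(states), len(states)))
    for i, (ia, iv) in enumerate(states):
        for j, (ja, jv) in enumerate(states):
            A[i, j] = A_audio[ia, ja] * A_video[iv, jv]   # streams advance independently

    def b(i, o_a, o_v):
        ia, iv = states[i]
        return (b_audio(ia, o_a) ** lam) * (b_video(iv, o_v) ** (1.0 - lam))

    return states, A, b
```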

37 Stream Clustering

38 AVSR System 3-state HMMs with 12 mixture components, 7-state HMMs for the composite model; context-dependent phone models (silence, short pause), tree-based state clustering; cross-word context-dependent decoding, using lattices computed at IBM; trigram language model; global stream weights in multi-stream models, estimated on a held-out set.
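A minimal sketch of estimating a global stream weight on a held-out set by grid search over word error rate; the decode_wer callback and the grid are assumptions for illustration, not the estimation procedure actually used.

```python
import numpy as np

def tune_stream_weight(decode_wer, grid=np.linspace(0.0, 1.0, 21)):
    """decode_wer(lam) should rescore the held-out set with audio weight lam
    (visual weight 1 - lam) and return the word error rate."""
    wers = [decode_wer(lam) for lam in grid]
    best = int(np.argmin(wers))
    return grid[best], wers[best]
```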

39 Speaker independent word recognition

40 Conclusions The AV 2-stream asynchronous model beats the other models in noisy conditions. Future directions: transition matrices (context-dependent, pruning transitions with low probability, cross-unit asynchrony); stream weights (model-based, discriminative); clustering (taking stream-tying into account).

41 Phone Dependent Weighting Dimitra Vergyri

46 Weight Estimation Hervé Glotin

55 Visual Clustering June Sison

56 Outline Motivation for the use of visemes in triphone classification; definition of visemes; goals of viseme usage; inspection of phone trees (validity check).

57 Equivalence Classification Combats the problem of data sparseness; it must be sufficiently refined that the equivalence classification can serve as a basis for prediction. Decision trees are used to achieve the equivalence classification [co-articulation]. To derive an EC: 1] collect speech data realizing each phone; 2] classify [cluster] this speech into appropriately distinct categories.
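A hedged sketch of the greedy, likelihood-gain-driven question splitting that tree-based equivalence classification typically uses; the gain function is a placeholder (real systems compute it from per-state Gaussian sufficient statistics), and the names are illustrative.

```python
def grow_tree(states, questions, gain, min_gain=0.0):
    """Greedily split a pool of triphone states with yes/no context questions.

    states    : items to cluster (e.g. triphone states with accumulated stats)
    questions : predicates q(state) -> bool (e.g. "is the left context a bilabial viseme?")
    gain      : gain(parent, yes_set, no_set) -> log-likelihood gain of the split
    """
    best_q, best_gain, best_split = None, min_gain, None
    for q in questions:
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        if not yes or not no:
            continue                                   # question does not split the pool
        g = gain(states, yes, no)
        if g > best_gain:
            best_q, best_gain, best_split = q, g, (yes, no)
    if best_q is None:                                 # no question clears the threshold: leaf
        return {"leaf": states}
    yes, no = best_split
    return {"question": best_q,
            "yes": grow_tree(yes, questions, gain, min_gain),
            "no": grow_tree(no, questions, gain, min_gain)}
```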

59 Definition of visemes Canonical mouth shapes that accompany speech utterances; they complement the phonetic stream [examples].

61 Visual vs Audio Contexts 276 QS total: 84 single-phoneme QS, 116 audio QS, 76 visual QS. Root nodes: 123 total: 33 visual, 74 audio, 16 single-phoneme.

63 Visual Models Azad Mashari

64 Visual Speech Recognition The Model Trinity: Audio-Clustered Model (Question Set 1), Self-Clustered Model (Question Set 1), Self-Clustered Model (Question Set 2). The "Results" (from which we learn what not to do). The Analysis. Places to Go, Things to Do...

65 The Questions Set 1: original audio questions; 202 questions based primarily on voicing and manner. Set 2: audio-visual questions; 274 questions (includes Set 1), adding questions regarding place of articulation.

66 The Trinity Audio-Clustered model: decision trees generated from the audio data using question set 1; visual triphone models clustered using these trees. Self-Clustered (old): decision trees generated from the visual data using question set 1. Self-Clustered (new): decision trees generated from the visual data using question set 2.

67 Experiment I Three major factors: independence/complementarity of the two streams, quality of the representation, generalization. Speaker-independent test: noisy audio lattices rescored using the visual models.

68 Experiment I Rescoring noisy audio lattices using the visual models

69 Experiment I

70 Speaker variability of the visual models follows the variability of the audio models (we don't know why; the lattices?). This does not mean that they are not "complementary". Viseme clustering gives better results for some speakers only, with no overall gain (we don't know why). Are the new questions being used? Over-training? ~7000 clusters in the audio models for ~40 phonemes; the same number in the visual models, but there are only ~12 "visemes" -> experiments with fewer clusters. Is the greedy clustering algorithm making a less optimal tree with the new questions?

71 Experiment II Several ways to get fewer clusters: increase the minimum cluster size; increase the likelihood gain threshold; remove questions (especially those frequently used at higher depths, as well as unused ones); any combination of the above. Tripling the minimum likelihood gain threshold (single-mixture models) -> insignificant increase in error: ~7000 clusters -> 54.24%, ~2500 clusters -> 54.57%. Even fewer clusters (~150-200)? A different reduction strategy?

72 Places to Go, Things to See... Finding optimal clustering parameters: current values are optimized for MFCC-based audio models. Clustering with viseme-based questions only. Looking at errors in the recognition of particular phones/classes.

73 Visual Model Adaptation Jie Zhou

74 Visual Model Adaptation Problem: the speaker-independent system is not sufficient to accurately model each new speaker. Solution: use adaptation to make the speaker-independent system better fit the characteristics of each new speaker.

75 HMM Adaptation To get a new estimate of the adapted mean µ, we use the transformation µ = Wε, where W is the (n x n) transformation matrix, n is the dimensionality of the data, and ε is the original mean vector.
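A hedged numpy sketch of estimating one global mean transform of this form: it fits W by weighted least squares between the original Gaussian means and the means observed in the adaptation data, which simplifies the full MLLR maximum-likelihood solution (and omits the bias term MLLR usually includes). All names are illustrative.

```python
import numpy as np

def estimate_global_transform(orig_means, adapt_means, counts):
    """Estimate W such that adapted mean ~ W @ original mean (mu = W eps).

    orig_means  : (G, n) original Gaussian means (the eps vectors)
    adapt_means : (G, n) means of adaptation data aligned to each Gaussian
    counts      : (G,)   occupation counts, used as weights
    """
    w = np.sqrt(counts)[:, None]
    X = orig_means * w                          # weighted inputs
    Y = adapt_means * w                         # weighted targets
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves X @ B ~ Y in least squares
    return B.T                                  # W, an (n x n) transform

def apply_transform(orig_means, W):
    """Apply mu = W eps to every Gaussian mean."""
    return orig_means @ W.T
```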

76 [Diagram: speaker-independent data with mean ε mapped to speaker-specific data with mean µ.]

77 [Block diagram: VVAV HMM models (ε, σ) built from speaker-independent data are adapted with HEAdapt using speaker-adapted data, giving a transformed speaker-independent model (µ = Wε) that is used for recognition on the test data.]

78 Procedure Speaker adaptation of the visual models was performed using MLLR (the adaptation method), a global transform, and single-mixture triphones. Adaptation data: average 5 minutes per speaker. Test data: average 6 minutes per speaker.

79 Results Word error, %
Speaker   Speaker Independent   Speaker Adapted
AXK       44.05%                41.92%
JFM       61.41%                59.23%
JXC       62.28%                60.48%
LCY       31.23%                29.32%
MBG       83.73%                83.56%
MDP       30.16%                29.89%
RTG       57.44%                55.73%
BAE       36.81%                36.17%
CNM       84.73%                83.89%
DJF       71.96%                71.15%
Average   58.98%                55.49%

80 Future Better adaptation could be achieved by: employing multiple transforms instead of a single transform; attempting other methods of adaptation, such as MAP, given more data; using mixture Gaussians in the model.

81 Summary and Conclusions Chalapathy Neti

86 The End.

87 Extra Slides…

88 State-based Clustering

89 Error rate on DCT features Word error rate on small multi-speaker test set
                               Language Model   No Language Model
Lattice Depth 1, Clean Audio   24.79            27.79
Lattice Depth 3, Clean Audio   25.55            34.58
Lattice, Noisy Audio           49.79            55.00

90 Audio Lattice Rescoring Results
Visual Feature                    Word Error Rate, %
AAM - 86 features                 65.69
AAM - 30 features                 65.66
AAM - 30 + Δ + ΔΔ                 69.50
AAM - 86, LDA → 24, WiLDA ±7      64.00
DCT - 18 + Δ + ΔΔ                 61.80
DCT - 24, WiLDA ±7                58.14
Noise - 30                        61.37
DCT WiLDA no LM = 65.14; lattice random path = 78.32

91 Overview Shape Appearance

