
1 A Temporal Network of Support Vector Machines for the Recognition of Visual Speech
Mihaela Gordan*, Constantine Kotropoulos**, Ioannis Pitas**
* Faculty of Electronics and Telecommunications, Technical University of Cluj-Napoca, 15 C. Daicoviciu, 3400 Cluj-Napoca, Romania
** Department of Informatics, Aristotle University of Thessaloniki, Artificial Intelligence and Information Analysis Laboratory, Box 451, GR-54006 Thessaloniki, Greece
This work was supported by the European Union Research Training Network "Multi-modal Human-Computer Interaction" (HPRN-CT-2000-00111).

2 Brief Overview
Visual speech recognition (lipreading): an important component of audio-visual speech recognition systems and an emerging research field.
Support vector machines (SVMs): powerful classifiers for various visual classification tasks (face recognition, medical image processing, object tracking).
Goal of this work: to examine the suitability of SVMs for visual speech recognition by developing an SVM-based visual speech recognition system.
In brief: we use SVMs for viseme recognition and integrate them as nodes in a Viterbi decoding lattice.
The good results (slightly higher word recognition rate (WRR) with very simple input features; easy generalization to larger-vocabulary tasks) encourage the continuation of this research.

3 Contents
1. State of the art & research trends
2. Principles of the proposed visual speech recognition approach
3. SVMs and their use for mouth shape recognition
4. Modeling the temporal dynamics of visual speech
5. Block diagram of the proposed visual speech recognition system
6. Experimental results
7. Conclusions

4 1. State of the art & research trends
Visual speech recognition = recognizing the spoken words from visual examination of the speaker's face alone, mainly the mouth area.
State of the art: many reported methods, differing widely with respect to:
- the feature types (lip contour coordinates, GLDP, gray levels of the mouth image);
- the classifier used (TDNN, HMM);
- the class definition.
Active research trends in the area:
- find the most suitable features and classification techniques for efficient, speaker-independent discrimination between different mouth shapes;
- reduce the required processing of the mouth image to increase speed;
- find solutions that facilitate easy integration of the audio and visual recognizers.
Use of SVMs in speech recognition: recently employed in audio speech recognition with very good results; no attempts so far in visual speech recognition.

5 2. Principles of the proposed visual speech recognition approach - I
[Figure: example mouth shapes for the visemes "o" and "f"]
Visemes = the basic units of visual speech: the basic shapes of the mouth during speech production.
Discrimination between visemes is a pattern recognition problem:
- Feature vector = a representation of the mouth image (e.g. at pixel level: the gray levels of the pixels in the mouth image scanned in raster order);
- Pattern classes = the different visemes (mouth shapes) occurring during pronunciation of the words in the dictionary.

6 2. Principles of the proposed visual speech recognition approach - II
The proposed strategy: for a given visual speech recognition task (i.e. a given dictionary of words),
1. Find the phonetic description of each word;
2. Derive the viseme-to-phoneme mapping for the application (it will be one-to-many, because non-visible parts of the vocal tract are involved in speech production, and it depends on the nationality of the speaker; no universal viseme-to-phoneme mapping is currently available);
3. Use the phonetic word descriptions and the viseme-to-phoneme mapping to derive visemic word descriptions (visemic models = sequences of mouth shapes that could produce the phonetic word realization).
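Step 3 above amounts to applying the inverse (phoneme-to-viseme) mapping, which is a plain function, since each phoneme belongs to exactly one viseme class. A minimal Python sketch, using an entirely hypothetical mapping (the deck's actual mapping is application- and language-specific and is not reproduced here):

```python
# Hypothetical phoneme-to-viseme mapping, for illustration only.
PHONEME_TO_VISEME = {
    "w": "w", "ah": "ah", "n": "n",
    "f": "f", "ao": "oa", "r": "r",
}

def visemic_description(phonetic_description):
    """Derive the visemic word description from the phonetic one by
    replacing each phoneme with its viseme class."""
    return [PHONEME_TO_VISEME[p] for p in phonetic_description]

# "one" = /w ah n/  ->  visemic model ["w", "ah", "n"]
print(visemic_description(["w", "ah", "n"]))
```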

7 2. Principles of the proposed visual speech recognition approach - III
[Figures: viseme-to-phoneme mapping; phonetic and visemic word description models]

8 3. SVMs and their use for mouth shape recognition - I
SVMs = statistical learning classifiers based on the optimal hyperplane algorithm:
- minimize a bound on the empirical error & the complexity of the classifier;
- capable of learning in sparse, high-dimensional spaces with few training examples.
Classical SVMs solve 2-class pattern recognition problems:
$\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ = training examples; $\mathbf{x}_i \in \mathbb{R}^M$ = M-dimensional pattern; $y_i \in \{-1, +1\}$ indicates whether example $i$ is a negative / positive example.
Linear SVMs: the data to be classified are separable in their original domain.
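For a linear SVM the real-valued output is simply $g(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$, with the sign giving the class. A minimal sketch (the weight vector and bias here are placeholders, not trained values):

```python
def linear_svm_decision(w, b, x):
    """Decision value of a linear SVM: g(x) = w . x + b.
    The sign of g(x) gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Placeholder hyperplane parameters, for illustration only.
print(linear_svm_decision([1.0, -1.0], 0.5, [2.0, 1.0]))  # 1.5 -> positive class
```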

9 3. SVMs and their use for mouth shape recognition - II
Nonlinear SVMs: the data to be classified are not separable in their original domain.
=> We project the data into a higher-dimensional Hilbert space $\mathcal{H}$, where the data are linearly separable, via a nonlinear mapping $\Phi$, and express the dot product of the mapped data by a kernel function: $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$.
=> The decision function of the SVM classifier is:
$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right)$
where $\alpha_i$ = the non-negative Lagrange multipliers of the QP that maximizes the distance between the classes and the separating hyperplane; $\mathbf{w}$, $b$ = the hyperplane's parameters.
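The kernel-form decision function above can be sketched directly; the polynomial kernel shown is the type used later in the experiments (degree 3), and the support vectors, multipliers, and bias below are placeholder values, not trained ones:

```python
def poly_kernel(x, z, degree=3):
    """Polynomial kernel K(x, z) = (x . z + 1)^degree."""
    return (sum(xi * zi for xi, zi in zip(x, z)) + 1.0) ** degree

def svm_decision(support_vectors, alphas, labels, b, x, kernel=poly_kernel):
    """Real-valued SVM output f(x) = sum_i alpha_i y_i K(x_i, x) + b;
    its sign is the predicted class, its magnitude a confidence."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b
```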

10 3. SVMs and their use for mouth shape recognition - III
The real-valued output of the SVM gives the degree of confidence in the class assignment.
An SVM is a binary classifier, so one SVM must be trained for each mouth shape (viseme).
Features used: the gray levels of the pixels in the mouth image scanned in raster order.
The set of training patterns is common to all SVMs; only the labels assigned to each training pattern differ. Only unambiguous positive & negative examples are used.
The training patterns (mouth images) are preprocessed for normalization with respect to scale, translation, and rotation.
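The raster-order feature extraction is straightforward to sketch; the scaling of gray levels to [0, 1] here is an illustrative assumption, not stated in the deck:

```python
def mouth_features(image):
    """Flatten a 2-D grayscale mouth image (rows of 0-255 gray levels)
    into a feature vector by a raster-order scan, scaling to [0, 1].
    (The [0, 1] scaling is an assumption for illustration.)"""
    return [pixel / 255.0 for row in image for pixel in row]

# A toy 2x2 "mouth image".
print(mouth_features([[0, 255], [51, 102]]))
```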

11 4. Modeling the temporal dynamics of visual speech - I
Symbolic visemic description of a word = left-to-right sequence of visemes; it carries no information about the relative duration of each viseme in the word realization (which is strongly person-dependent).
Given:
- the symbolic visemic description of a word, and
- the total number of frames T in the word pronunciation,
we build the word model in the temporal domain by allowing any non-zero duration for each viseme => a temporal network of models for each symbolic visemic description, arranged as a Viterbi lattice.
[Figure: temporal network for the visemic model of "one"]
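The temporal expansion above can be sketched as enumerating every way to give each viseme a non-zero integer duration summing to T frames (the compositions of T into as many parts as there are visemes); each such assignment corresponds to one IN-to-OUT path of the lattice. A toy illustration, not the deck's implementation:

```python
from itertools import combinations

def viseme_alignments(visemes, T):
    """All assignments of non-zero frame durations to the visemes of a
    word model so that the durations sum to T frames."""
    V = len(visemes)
    alignments = []
    # Choosing V-1 cut points in 1..T-1 splits the T frames into V parts.
    for cuts in combinations(range(1, T), V - 1):
        bounds = (0,) + cuts + (T,)
        durations = [bounds[i + 1] - bounds[i] for i in range(V)]
        alignments.append(list(zip(visemes, durations)))
    return alignments

# Visemic model of "one" over T = 5 frames.
paths = viseme_alignments(["w", "ah", "n"], 5)
print(len(paths))  # 6 possible duration assignments
```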

12 4. Modeling the temporal dynamics of visual speech - II
[Figure: Viterbi lattice d for the visemic word model w_d, T = 5, showing the IN and OUT states, nodes k and k+1, and sub-path i]

13 4. Modeling the temporal dynamics of visual speech - III
Node k = the measure of confidence in the realization of the viseme o_k = "ah" at timeframe t_k = 3; it is the real-valued output of the SVM trained to recognize the viseme o_k.
Sub-path i = the transition probability from the state that generates o_k = "ah" at timeframe t_k = 3 to the state that generates o_{k+1} = "n" at timeframe t_{k+1} = 4. We assume equal transition probabilities.
Path l = any connected path between the states IN and OUT in the Viterbi lattice.
Confidence in path l of the Viterbi lattice d = the accumulated node confidences along the path, weighted by the transition probabilities.
Plausibility of producing the word model w_d = the maximum confidence over all paths l of lattice d.
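Since the transition probabilities are assumed equal, they can be dropped and the plausibility reduces to the best accumulated SVM confidence over all alignments, computable with a standard Viterbi recursion instead of explicit path enumeration. A sketch, accumulating confidences additively (an assumption; the deck's exact combination rule is in the omitted formulas):

```python
def word_plausibility(model, scores):
    """Best-path confidence through the lattice of a visemic word model.
    model  : left-to-right list of viseme labels, e.g. ["w", "ah", "n"]
    scores : scores[t][v] = SVM confidence for viseme v at frame t
    Equal transition probabilities are assumed, so they are omitted."""
    V, T = len(model), len(scores)
    NEG = float("-inf")
    dp = [NEG] * V                    # dp[k] = best score with visemes 0..k emitted
    for t in range(T):
        new = [NEG] * V
        for k in range(V):
            if t == 0:
                prev = 0.0 if k == 0 else NEG   # paths start in the first viseme
            else:
                prev = max(dp[k], dp[k - 1] if k > 0 else NEG)
            if prev != NEG:
                new[k] = prev + scores[t][model[k]]
        dp = new
    return dp[V - 1]                  # paths must end in the last viseme
```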

14 5. Block diagram of the proposed visual speech recognition system
[Figure: block diagram. The viseme SVMs (SVM_oa, SVM_ah, SVM_n, ...) feed the Viterbi lattices of the visemic word models ("w ah n" for "one", "oa ah n" for "one", "f ao r" for "four", ...); the lattices output confidences c_1, c_2, ..., c_D, and the decision is i = arg max_d c_d. Example result: i = 1, word "one".]
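The final decision stage of the diagram is just an arg-max over the per-word lattice confidences; a sketch with hypothetical confidence values:

```python
def recognize(confidences):
    """Final decision of the recognizer: the word whose lattice gave
    the highest confidence, i = arg max_d c_d."""
    return max(confidences, key=confidences.get)

# Hypothetical lattice confidences for the four-word task.
print(recognize({"one": 0.82, "two": 0.41, "three": 0.37, "four": 0.55}))  # one
```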

15 6. Experimental results - I
Task to be solved: visual speech recognition of the first four digits in English.
Experimental data: the visual part of the Tulips1 audiovisual speech database.
Implementation: in C++, using the publicly available SVMLight toolkit; the code for the Viterbi algorithm and additional modules was written and integrated into the visual speech recognizer.
Training strategy: 12 SVMs (one per viseme class) with a polynomial kernel of degree 3, C = 1000.
Test strategy: leave-one-out protocol => train the system 12 times on 11 subjects, each time leaving out one subject for testing => 24 test sequences/word x 4 words = 96 test sequences.
Performance evaluation in terms of:
- overall (average) WRR, compared to similar results from the literature;
- 95% confidence intervals for the WRR of the proposed approach and for the WRR of similar approaches from the literature.
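The fold and sequence counts of the leave-one-out protocol can be checked with a short sketch; the two-repetitions-per-subject figure is an assumption made to match the 24 sequences/word stated above:

```python
def leave_one_out_folds(subjects):
    """Leave-one-out protocol: one fold per subject, each time training
    on all remaining subjects and testing on the held-out one."""
    return [([s for s in subjects if s != held], held) for held in subjects]

subjects = list(range(1, 13))          # 12 subjects in Tulips1
folds = leave_one_out_folds(subjects)
# Assuming 2 repetitions of each word per subject:
# 12 subjects x 2 repetitions x 4 words = 96 test sequences.
test_sequences = len(folds) * 2 * 4
print(len(folds), test_sequences)
```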

16 6. Experimental results - II
Comparison: slightly higher WRR and confidence intervals compared to the literature.
Exception: lower WRR than the best result reported without delta features (87.5%), due to a much better localization of the ROI around the lip contour in that case. However, our computational complexity is much lower (no need to redefine the ROI in each frame).

17 7. Conclusions
We examined the suitability of SVM classifiers for visual speech recognition. The temporal character of speech was modeled by integrating SVMs with real-valued outputs as nodes in a Viterbi decoding lattice.
Performance evaluation of the system on a small visual speech recognition task shows:
- better WRR than the ones reported in the literature,
- even with very simple features: the gray levels of the mouth image used directly.
=> SVMs are a promising tool for visual speech recognition applications.
Future research goals: increase the WRR by:
- including delta features;
- examining other SVM kernels;
- learning the state transition probabilities in the Viterbi decoding lattice.

