
1 Adaptive Fusion of Acoustic and Visual Sources for Automatic Speech Recognition
Alexandrina Rogozan, UNIVERSITE du MAINE
rogozan@lium.univ-lemans.fr

2 Bio Sketch
- Assistant Professor in Computer Science and Electrical Engineering at the University of Le Mans, France, and member of the Speech Processing Group at LIUM
- 1999: Ph.D. in Computer Science from the University of Paris XI - Orsay: Heterogeneous Data Fusion for Audio-Visual Speech Recognition
- 1995-1997: Participant in the French AMIBE project: Improvement of the Robustness and Confidentiality of Man-Machine Communication by using Audio and Visual Data (Universities of Grenoble, Le Mans, Toulouse, Avignon, Paris 6 & INRIA)

3 Research Activity
- GOAL: Study the benefit of visual information for ASR
- METHOD: Develop different audio-visual ASR systems
- APPROACH: Copy the synergy observed in human speech perception
- EVALUATION: Test the accuracy of the recognition process on a speaker-dependent connected-letter task

4 Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

5 1. Audio-Visual Speech System Overview
- Goal: obtain the synergy of the acoustic and visual modalities
- Audio-visual fusion results > uni-modal results
[Diagram: a visual front end (Face Tracking -> Lip Localization -> Visual-feature Extraction) and Acoustic-feature Extraction feed a Joint Treatment stage implementing the integration strategies]

6 1. Unanswered questions in AV ASR
- When should audio-visual fusion take place: before or after categorization in each modality?
- How to take into account the differences in the temporal evolution of speech events in the acoustic and visual modalities?
- How to adapt the relative contribution of the acoustic and visual modalities during the recognition process?

7 1. Relative contribution of acoustic and visual modalities
- Speech features: place and manner of articulation, voicing
- Vary with the phonemic content. Ex: which modality distinguishes /m/ from /n/, and which /m/ from /p/? (/m/ and /n/ differ visibly in place of articulation; /m/ and /p/ look identical but differ audibly)
- Vary with the environmental context. Ex: acoustic cues to the place of articulation are the least robust ones
=> Exploit the complementary nature of the modalities

8 1. Differences in temporal evolution of phonemes in acoustic and visual modalities
- Anticipation and retention phenomena: temporal shifts of up to 250 ms [Abry & Lalouache, 1991]
- 'Natural asynchrony': handled with different phonemic frontiers in each modality
- Varies with the phonemic content
=> Exploit the 'natural asynchrony'

9 Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
   - One-level Fusion Architectures
   - Hybrid Fusion Architecture
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

10 2. One-level fusion architectures
- At the data (features) level: Acoustic Data + Visual Data -> Fusion -> Categorization -> Recognized Speech Unit
- At the results (decision) level: Acoustic Data -> Categorization and Visual Data -> Categorization, then Fusion -> Recognized Speech Unit

11 2. Fusion before categorization
- Concatenation, or Direct Identification (DI)
- Re-coding in the Dominant space (RD) or in a Motor space (RM) [Robert-Ribes, 1995]
- Problems: choice of the nature of the dominant space, and temporal 'resynchronization' in the common space
[Diagram: acoustic and visual data are adapted and fused in a common space before a single categorization step produces the recognized speech unit]
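
To make the DI (concatenation) route concrete, here is a minimal Python sketch that joins per-frame acoustic and visual feature matrices; the linear interpolation used to resynchronize the slower visual stream is an illustrative assumption, not the talk's actual method.

```python
import numpy as np

def di_concatenate(acoustic, visual):
    # acoustic: (Ta, Da) frames, e.g. MFCCs at the acoustic frame rate
    # visual:   (Tv, Dv) frames, e.g. lip height/width/area at video rate
    # The visual stream is usually slower, so interpolate it to the
    # acoustic frame rate before concatenating (a simple, hypothetical
    # 'resynchronization' into the common space).
    src = np.linspace(0.0, 1.0, visual.shape[0])
    dst = np.linspace(0.0, 1.0, acoustic.shape[0])
    visual_up = np.column_stack(
        [np.interp(dst, src, visual[:, d]) for d in range(visual.shape[1])]
    )
    return np.hstack([acoustic, visual_up])  # (Ta, Da + Dv) joint vectors
```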

12 2. Fusion after categorization: Separate Identification (SI)
- Parallel structure: acoustic and visual data are categorized independently; the fusion of the two results, with adaptation, yields the recognized speech unit
- Serial structure: the output of the first categorizer is re-evaluated by the second modality, with adaptation, before the final decision


14 2. Level of audio-visual fusion in speech perception
- Ex: lip image + larynx-frequency pulse train => voicing features recovered [Grant, 1985] (identification scores: 4.7%, 28.9%, 51.1%) => audio-visual fusion before categorization
- Ex: lip image (t) + speech signal (t + dT) => McGurk illusions persist despite the shift [Massaro, 1996]: visual /ga/ + acoustic /ba/ perceived as /da/ => audio-visual fusion after categorization
=> Flexibility and robustness of speech perception: adaptability of the fusion mechanisms

15 2. Hybrid-fusion model for audio-visual ASR
[Diagram: the acoustic (a) and visual (v) data streams are fused by DI in the continuous, time-varying space of data; the resulting audio-visual stream (av) is categorized, then combined with the visual results by SI, with adaptation, in the discrete, categorical space of results => sequence of phonemes]

16 Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
   - Structure of the DI-based Fusion
   - Structure of the SI-based Fusion
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

17 3. Implementation of the DI-based fusion
[Diagram: as in the hybrid-fusion model, with categorization implemented by phonemic HMMs: DI fusion of the acoustic (a) and visual (v) streams in the continuous data space, phonemic-HMM decoding, then SI combination with adaptation in the discrete result space => sequence of phonemes]

18 3. Characteristics of the DI-based fusion
- Synchronization of acoustic and visual speech events on the phonemic HMM states
- When the visual modality is too strong, it perturbs the acoustic modality at the TRANSITIONS between HMM states and during speech-unit LABELING
=> Necessity to adapt the DI-based fusion

19 3. Adaptation of the DI-based fusion
- To the RELEVANCE of speech features in each modality
- To the RELIABILITY of processing in each modality
=> Necessity to estimate the reliability of the global process a posteriori

20 3. Realization of the adaptation in the DI-based fusion
- Exponential weight λ:
  - global to the recognition hypothesis
  - selected a posteriori
  - according to the SNR and the phonemic content
[Diagram: the acoustic (A) and visual (V) streams are decoded by phonemic HMMs and fused under several candidate weights λi ... λj; a choice module selects among the weighted hypotheses => sequence of phonemes]
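
A sketch of the exponential (log-linear) weighting in Python. The SNR-to-weight table is purely hypothetical and the per-hypothesis a-posteriori selection is simplified to an SNR lookup; real values would be tuned on development data.

```python
import numpy as np

# Hypothetical lookup from estimated SNR (dB) to the acoustic weight.
SNR_TO_LAMBDA = [(-10, 0.2), (0, 0.5), (10, 0.8), (30, 0.95)]

def select_lambda(snr_db):
    # Choose the exponential weight a posteriori from the estimated SNR.
    snrs, lams = zip(*SNR_TO_LAMBDA)
    return float(np.interp(snr_db, snrs, lams))

def fuse_log_scores(log_p_a, log_p_v, snr_db):
    # Log-domain exponential weighting: P_av = P_a**lam * P_v**(1 - lam),
    # applied globally to one recognition hypothesis.
    lam = select_lambda(snr_db)
    return lam * log_p_a + (1.0 - lam) * log_p_v
```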

21 3. Choice of the hybrid-fusion architecture
DI + V => asynchronous fusion of information
[Diagram: the hybrid-fusion model with the retained DI + V path: DI fusion with phonemic-HMM categorization, followed by SI combination with the purely-visual stream in the discrete result space]

22 3. Implementation of the SI-based fusion
- Serial structure => visual evaluation of the DI solutions
[Diagram: the DI stream (av) is decoded by phonemic HMMs into the N best phonetic solutions, which are re-evaluated against the visual stream (v), with adaptation, in the discrete result space => sequence of phonemes]

23 3. Characteristics of the SI-based fusion
- Multiplication of the modality output probabilities
- Temporal shifts of up to 100 ms allowed between the phonemic frontiers of the two modalities => 'natural asynchrony' permitted
- When the visual modality is too strong, it perturbs the acoustic modality during speech-unit LABELING
=> Necessity to adapt the SI-based fusion
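
A sketch of the multiplicative SI combination with the 100 ms asynchrony tolerance; the segment format and the matching rule are assumptions for illustration, not the system's exact scoring.

```python
TOLERANCE_S = 0.100  # allowed shift between modality phonemic frontiers

def si_fuse(di_segments, v_segments):
    # Each segment: (label, start_s, end_s, log_prob).
    # Multiply modality probabilities (add log-probs) whenever a visual
    # segment carries the same label and its frontiers fall within the
    # tolerance of the DI segment's frontiers.
    total = 0.0
    for label, start, end, log_p in di_segments:
        total += log_p
        for v_label, v_start, v_end, v_log_p in v_segments:
            if (v_label == label
                    and abs(v_start - start) <= TOLERANCE_S
                    and abs(v_end - end) <= TOLERANCE_S):
                total += v_log_p  # log(P_di * P_v)
                break
    return total
```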

24 3. Realization of the adaptation in the SI-based fusion
- Exponential weight λ, calculated a posteriori according to the relative reliability of the acoustic and visual modalities:
  - dispersion of the 4-best solutions
  - variation of λ with the SNR on the test data
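
One way to turn the 4-best dispersion into a reliability-based weight, sketched below; the exact dispersion formula (mean gap to the best score) is an assumption.

```python
import numpy as np

def dispersion(log_scores):
    # Spread of the 4-best hypothesis scores: a flat top-4 list means the
    # modality is unsure, a wide spread means it is confident.
    top4 = np.sort(np.asarray(log_scores))[::-1][:4]
    return float(np.mean(top4[0] - top4[1:]))

def estimate_lambda(acoustic_nbest, visual_nbest, eps=1e-6):
    # Weight the acoustic stream by its reliability relative to the
    # visual stream; result lies in (0, 1).
    d_a = dispersion(acoustic_nbest)
    d_v = dispersion(visual_nbest)
    return d_a / (d_a + d_v + eps)
```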

25 Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
   - Visual Categorization
   - Parallel Structure for the SI-based Fusion
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

26 4. Type of interaction in the SI-based fusion
- Effective integration vs. coherence verification: depends on the ratio between the reliabilities of the audio-visual (DI) and purely-visual components
- IMPROVEMENT: reinforcement of the purely-visual component
  - discriminative learning
  - effective visual categorization

27 4. Discriminative learning of visual speech by Neural Networks (NN)
- Requires relevant visual differences between the classes to discriminate
- Inconsistent with phonemic classes because of visual doubles, e.g. /p/, /b/ and /m/ look alike
=> Use adapted classes: VISEMES
- Sources of variability: language, speech rate, differences among speakers

28 4. Definition of visemes
- Extraction of visual phonemes from the training data: the middle of each acoustic-phonemic segment anchors a visual segment of 140 ms
- Mapping of the extracted visual phonemes: Kohonen's algorithm for Self-Organising Maps (SOM)
- Identification of visemes at 3 resolution levels
Consonant visemes: n | p b m | f v | s | z t | d k | j | ch | r l g. Vowel visemes: e i | u a | o y.
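
A compact Kohonen SOM in NumPy, illustrating how fixed-length visual-phoneme vectors can be mapped onto a grid whose neighbouring units then group into visemes; the grid size, learning schedules and the flattened 140 ms input representation are all assumptions.

```python
import numpy as np

def train_som(data, grid=(6, 6), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    # data: (N, D) matrix, one row per flattened 140 ms visual segment.
    rng = np.random.default_rng(seed)
    H, W = grid
    weights = rng.normal(size=(H, W, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(H), np.arange(W),
                                  indexing="ij"), axis=-1)
    n_steps = epochs * data.shape[0]
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(data.shape[0])]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 0.5
            # Best-matching unit: grid cell whose weights are closest to x.
            dist = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)
            # Pull the BMU and its grid neighbours towards the sample.
            g = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=2)
                       / (2.0 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
            step += 1
    return weights  # neighbouring units now code similar lip shapes
```

Visemes are then read off by grouping neighbouring units of the trained map, which naturally supports several resolution levels.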

29 4. Reinforced purely-visual component in the SI parallel structure
- Getting rid of the temporal dependence between the DI and V components
- Effective visual categorization
- Difficulty for NNs to take into account the temporal dimension of speech
=> Towards hybrid HMM-NN based categorization

30 4. Hybrid HMM-NN based categorization
- NN + HMM: the NN produces a posteriori probabilities from the visible speech; the HMM decodes them into the recognized sequence of visemes
- HMM + NN: the HMM segments the visible speech; the NN classifies the segments into the recognized sequence of visemes
- HMM / NN: the HMM provides a segmentation and a first viseme hypothesis; the NN resolves the viseme confusions to produce the recognized sequence
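
A sketch of the NN + HMM variant: the network's a posteriori probabilities are divided by state priors to give scaled likelihoods, then decoded by Viterbi. The array shapes and the priors are assumptions; this is the standard hybrid recipe, not necessarily the exact system of the talk.

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, eps=1e-12):
    # Hybrid trick: divide P(state | frame) by P(state) so the scores
    # behave like (scaled) emission likelihoods inside the HMM.
    return np.log(posteriors + eps) - np.log(priors + eps)

def viterbi(log_emit, log_trans, log_init):
    # log_emit: (T, S), log_trans: (S, S), log_init: (S,)
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # scores[i, j]
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_emit[t]
    state = int(np.argmax(delta))
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(back[t, state])
        path.append(state)
    return path[::-1]  # best viseme-state sequence
```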

31 4. Reinforced purely-visual component in the SI parallel structure
- Non-homogeneity of the output scores => inconsistent with the previous multiplicative SI fusion
[Diagram: the purely-visual stream (v) is now categorized by a visemic HMM/NN in parallel with the phonemic HMMs on the DI stream (av); the two result streams are fused, with adaptation, in the discrete result space => sequence of phonemes]

32 4. Implementation of the SI-based fusion in a parallel structure
- The N phonemic solutions from the DI stream (av) are mapped to visemes (phonemes => visemes)
- Edit-distance based alignment of each mapped solution with the purely-visual viseme sequence (v)
- Likelihood-ratio calculation over the N solutions, then adaptation => sequence of phonemes
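
A sketch of the viseme-level re-evaluation: map each N-best phonemic solution to visemes and rank by edit distance to the purely-visual sequence. The phoneme-to-viseme table here is a tiny hypothetical excerpt; the real mapping comes from the SOM-based viseme definition.

```python
# Tiny, hypothetical phoneme-to-viseme table for illustration only.
PHONEME_TO_VISEME = {"p": "pbm", "b": "pbm", "m": "pbm", "f": "fv", "v": "fv"}

def edit_distance(seq_a, seq_b):
    # Plain Levenshtein distance between two symbol sequences.
    n, m = len(seq_a), len(seq_b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + sub)
    return d[n][m]

def rerank(nbest_phoneme_seqs, visual_viseme_seq):
    # Map each DI solution to visemes and rank by agreement with the
    # purely-visual sequence (smallest edit distance first).
    def to_visemes(phonemes):
        return [PHONEME_TO_VISEME.get(p, p) for p in phonemes]
    return sorted(nbest_phoneme_seqs,
                  key=lambda seq: edit_distance(to_visemes(seq),
                                                visual_viseme_seq))
```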

33 4. 'Phonetic Plus Post-categorical' model proposed by Burnham (1998)
- Two-level fusion architecture
- Visual categorization by comparison to visemic prototypes
- Optional use of the purely-visual component after categorization
[Diagram: audible and visual speech are fused before categorization; the visual stream is also categorized separately and optionally fused, with adaptation, after categorization => perceived speech]

34 Overview
1. Challenges in Audio-Visual ASR
2. Audio-Visual Fusion Models
3. Implementation of the Proposed Hybrid-Fusion Model
4. Improvements of the Hybrid-Fusion Model
5. Results and Comparisons on the AMIBE Database
6. Conclusions and Perspectives

35 5. Experiments
- Audio-visual data of the AMIBE project: connected letters, 'dining-hall' noise at SNRs of 10 dB, 0 dB and -10 dB
- Speech features:
  - visual: internal lip-shape height, width and area + Δ + ΔΔ
  - acoustic: 12 MFCC + energy + Δ + ΔΔ
- Speech modeling: HMM with a duration model [Suaudeau & André-Obrecht, 1994]; TDNN, SOM
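
A sketch of the acoustic front end (12 MFCC + energy + Δ + ΔΔ), assuming librosa is available; the sampling rate and frame settings are assumptions, as the slide does not give them.

```python
import numpy as np
import librosa  # assumed available; any MFCC front end would do

def acoustic_features(wav_path):
    # 12 MFCCs + log energy, each with first and second derivatives,
    # matching the feature set described on the slide.
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)         # (12, T)
    energy = np.log(librosa.feature.rms(y=y) + 1e-10)          # (1, T)
    base = np.vstack([mfcc, energy])                           # (13, T)
    feats = np.vstack([base,
                       librosa.feature.delta(base),            # Δ
                       librosa.feature.delta(base, order=2)])  # ΔΔ
    return feats.T  # one 39-dimensional vector per frame
```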

36 5. Results
The hybrid-fusion model DI+V achieves the audio-visual synergy: its accuracy exceeds that of both single-modality systems.

37 5. Results (recognition accuracy, %)
          -10 dB   0 dB   10 dB   clean
AUDIO       -2.1   67.9    88.0    91.5
VISUAL      30.9   30.9    30.9    30.9
DI          40.8   76.4    90.8    95.4
SI           6.3   81.6    89.4    91.9
DI+V        41.9   77.8    91.2    95.8

38 5. Comparisons
- Master-Slave Model proposed at IRIT, Univ. Toulouse [André-Obrecht et al., 1997]
- Product of Models proposed at LIUAPV, Univ. Avignon [Jourlin, 1998]

39 5. Master-Slave Model of IRIT (1997)
The acoustic HMM parameters are probabilistic functions of the master labial HMM.
[Diagram: a master labial HMM with states 'open lips', 'semi-open lips' and 'closed lips' drives a slave acoustic HMM]

40 5. Product of Models of LIUAPV (1998)
The audio-visual HMM parameters are computed from separate acoustic and visual HMMs.
[Diagram: a 3-state acoustic HMM (states 1-3, transitions T11...T33, distributions D1(A)...D3(A)) and a 3-state visual HMM (states 4-6, transitions T44...T66, distributions D4(V)...D6(V)) combine into a 9-state audio-visual HMM whose states pair one acoustic and one visual state (1,4), (1,5), ..., (3,6), and whose transition probabilities are products such as T11 x T44, T12 x T56, T23 x T66]
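
Multiplying the transition probabilities of the two chains, as on the slide, amounts to a Kronecker product of the transition matrices. A minimal sketch with illustrative numbers follows; the emission of a product state (i, k) would likewise be the product Di(A) * Dk(V), which is noted but not implemented here.

```python
import numpy as np

def product_hmm(trans_a, trans_v):
    # Audio-visual transitions: T_av[(i,k),(j,l)] = T_a[i,j] * T_v[k,l],
    # i.e. the Kronecker product of the two transition matrices.
    return np.kron(trans_a, trans_v)

# Example with 3-state left-to-right acoustic and visual chains
# (transition values are made up for illustration):
A = np.array([[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
V = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
AV = product_hmm(A, V)  # 9 x 9 audio-visual transition matrix
```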

41 6. Conclusion: Contributions
- Addressed the main problems in AV ASR: fusion, (a)synchrony, adaptation, visemes
- Proposed the hybrid-fusion DI+V model
- A posteriori adaptation of the audio-visual fusion to variations of both the context and the content
- Definition of specific visual units, visemes, by self-organization and grouping

42 6. Conclusion: Further Work
- Use visemes during the DI-based fusion as well
- Learn the temporal shifts between modalities for the SI-based fusion
- Define a dependency function between pre- and post-categorical weights
- Estimate the modality weights at a finer level
=> Training on substantial data and extensive testing

43 6. Perspectives
Towards a global platform for audio-visual speech communication:
- Preprocessing: source localization, speech-signal enhancement, scene analysis
- Recognition
- Synthesis
- Coding

