1 An Examination of Audio-Visual Fused HMMs for Speaker Recognition
David Dean*, Tim Wark*† and Sridha Sridharan*
Presented by David Dean
* Speech, Audio, Image and Video Research Laboratory
† CSIRO ICT Centre

2 Why audio-visual speaker recognition?
"Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need—in many potential applications of speech-based recognition—for robustness to speech variability, high recognition accuracy, and protection against impersonation." (Chibelushi, Deravi and Mason, 2002)

3 Early and late fusion
Most early approaches to audio-visual speaker recognition (AVSPR) used either early (feature) or late (decision) fusion.
Problems:
– Decision fusion cannot model temporal dependencies between the modalities
– Feature fusion suffers from problems with noise, and has difficulty modelling the asynchrony of audio-visual speech (Chibelushi et al., 2002)
[Diagrams: early fusion combines the acoustic (A1–A4) and visual (V1–V4) observations before a single set of speaker models; late fusion scores separate acoustic and visual speaker models and fuses the two decisions.]

4 Middle fusion: coupled HMMs
Middle fusion models can accept two streams of input, with the combination done within the classifier.
Most middle fusion is performed using coupled HMMs (shown here):
– Can be difficult to train
– Dependencies between hidden states are not strong (Brand, 1999)

5 Middle fusion: fused HMMs
Pan et al. (2004) used probabilistic models to investigate the optimal multi-stream HMM design:
– Maximise the mutual information between audio and video
They found that linking the observations of one modality to the hidden states of the other was better than linking just the hidden states (as the coupled HMM does).
The fused HMM approach therefore yields two designs, acoustic-biased and video-biased (a sketch of the acoustic-biased factorisation follows).
[Diagram: acoustic-biased FHMM]
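As a rough sketch of the factorisation (notation ours; Pan et al. (2004) give the full derivation), the acoustic-biased FHMM approximates the joint likelihood by conditioning the video observations on the best (Viterbi) state sequence of the audio HMM:

    % Acoustic-biased FHMM likelihood (sketch; notation ours)
    % O^a, O^v     : acoustic and visual observation sequences
    % \hat{q}^a_t  : Viterbi state of the audio HMM at frame t
    P(O^a, O^v) \approx P(O^a)\, P(O^v \mid \hat{q}^a)
                = P(O^a) \prod_{t=1}^{T} P(o^v_t \mid \hat{q}^a_t)

The video-biased design is the mirror image, conditioning the audio observations on the video HMM's state sequence.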

6 Choosing the dominant modality
The choice of the dominant modality (the one biased towards) should be based on which individual HMM can more reliably estimate the hidden state sequence for a particular application:
– Generally audio
Alternatively, both versions can be used concurrently and decision-fused (as in Pan et al., 2004).
This research looks at the relative performance of each biased FHMM design individually:
– If recognition can be performed using only one FHMM, decoding can be done in half the time compared to decision fusion of both FHMMs

7 Training FHMMs
Each biased FHMM (if both are needed) is trained independently:
1. Train the dominant HMM (audio for acoustic-biased, video for video-biased) independently on the training observation sequences for that modality.
2. Find the best hidden state sequence of the trained HMM for each training observation sequence using the Viterbi algorithm.
3. Estimate the coupling parameters between the dominant hidden state sequence and the training observation sequences of the subordinate modality, i.e. the probability of seeing a particular subordinate observation while within a particular dominant hidden state (a sketch of this estimate follows).
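As an illustration of step 3, here is a minimal numpy sketch of the coupling estimate. It assumes the dominant-state sequences have already been obtained by Viterbi alignment (step 2) and that the subordinate observations have been vector-quantised to discrete symbols (slide 12); the names and the add-one smoothing are our assumptions, not the authors' code.

    import numpy as np

    def estimate_coupling(state_seqs, vq_seqs, n_states, codebook_size):
        """Estimate P(subordinate VQ symbol | dominant hidden state).

        state_seqs : list of int arrays, Viterbi state sequences of the
                     dominant HMM over the training data (step 2)
        vq_seqs    : list of int arrays, frame-aligned VQ symbols of the
                     subordinate modality
        """
        counts = np.ones((n_states, codebook_size))   # add-one smoothing
        for states, symbols in zip(state_seqs, vq_seqs):
            for q, v in zip(states, symbols):
                counts[q, v] += 1
        return counts / counts.sum(axis=1, keepdims=True)  # rows sum to 1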

8 Decoding FHMMs
The dominant FHMM can be viewed as a special type of HMM that outputs observations in two streams.
This does not affect the decoding lattice, and the Viterbi algorithm can be used to decode, provided it has access to the observations in both streams (see the sketch below).
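A minimal log-domain sketch of that decoding: the lattice and backtracking are standard Viterbi, and the only change is that each frame's emission score sums the dominant stream's log-likelihood and the coupling log-probability learned in training. Function and argument names are ours.

    import numpy as np

    def viterbi_two_stream(log_pi, log_A, log_b_dom, log_b_sub):
        """Viterbi decode with a two-stream emission score.

        log_pi    : (N,)   initial state log-probabilities
        log_A     : (N, N) transition log-probabilities
        log_b_dom : (T, N) dominant-stream emission log-likelihoods
        log_b_sub : (T, N) log P(subordinate VQ symbol | state)
        """
        log_b = log_b_dom + log_b_sub            # joint per-frame emission
        T, N = log_b.shape
        delta = log_pi + log_b[0]                # best score ending in each state
        psi = np.zeros((T, N), dtype=int)        # best-predecessor pointers
        for t in range(1, T):
            scores = delta[:, None] + log_A      # scores[i, j]: state i -> j
            psi[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_b[t]
        path = np.empty(T, dtype=int)            # backtrack the best path
        path[-1] = delta.argmax()
        for t in range(T - 1, 0, -1):
            path[t - 1] = psi[t, path[t]]
        return delta.max(), path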

9 Experimental setup
[Block diagram: lip location & tracking feeds visual feature extraction; acoustic feature extraction runs in parallel. The features drive an acoustic HMM, a visual HMM, an acoustic-biased FHMM and a video-biased FHMM; speaker decisions are produced per model and via decision fusion of the HMM scores.]

10 Lip location and tracking
Lip tracking was performed as in Dean et al. (2005).

11 Feature extraction and datasets
Audio:
– MFCCs: 12 + 1 energy, plus deltas and accelerations = 43 features
Video:
– DCT: 20 coefficients, plus deltas and accelerations = 60 features (see the sketch after this slide)
Isolated speech from CUAVE (Patterson et al., 2002):
– 4 sequences for training, 1 for testing (for each of 36 speakers)
– Each sequence is 'zero one two … nine'
Testing was also performed on noisy data:
– Speech-babble-corrupted versions of the audio
– Poorly-tracked lip region-of-interest video features
[Images: examples of well-tracked and poorly-tracked lip regions]
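A hedged sketch of comparable feature extraction with librosa and scipy. The parameters are illustrative only (the slide's 43 audio features imply a slightly different static set than the 13 x 3 = 39 produced here), and the file name and DCT block selection are our assumptions.

    import librosa
    import numpy as np
    from scipy.fftpack import dct

    # Audio: 13 MFCCs (12 + an energy-like C0) plus deltas and accelerations.
    y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical file
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, T)
    audio_feat = np.vstack([mfcc,
                            librosa.feature.delta(mfcc),
                            librosa.feature.delta(mfcc, order=2)]).T

    # Video: 2-D DCT of a grayscale lip ROI, keeping 20 low-frequency
    # coefficients per frame (top-left block here; a zig-zag scan is
    # equally plausible). Deltas/accelerations follow over the sequence.
    def lip_dct(roi, n_coeffs=20):
        c = dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")
        return c[:5, :4].ravel()[:n_coeffs]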

12 Fused HMM design
Both acoustic- and video-biased FHMMs are examined.
Underlying HMMs are speaker-dependent word models for each digit:
– MLLR-adapted from speaker-independent background word models
– Trained using the HTK Toolkit (Young et al., 2002)
Secondary models are based on discrete vector-quantisation (VQ) codebooks:
– The codebook is generated from the secondary (subordinate) training data
– The number of occurrences of each discrete VQ value within each state is recorded to arrive at an estimate of the coupling probabilities
– A codebook size of 100 was found to work best for both modalities (a VQ sketch follows)
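A minimal sketch of the VQ step, using scikit-learn's k-means as a stand-in for whichever quantiser was actually used: build a 100-word codebook from the pooled subordinate training features, then map each frame to its nearest code word before counting occurrences per state as in the training sketch above.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(features, size=100, seed=0):
        """features: (n_frames, dim) pooled subordinate training features."""
        return KMeans(n_clusters=size, n_init=10, random_state=seed).fit(features)

    def quantise(codebook, frames):
        """Return the index of the nearest code word for each frame."""
        return codebook.predict(frames)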

13 Decision fusion
Fused HMM performance is compared to decision fusion of the scores of the normal HMMs from each stream.
The weight of each stream is set by an audio weight parameter α, which can range from:
– 0 (video only), to
– 1 (audio only)
Two decision fusion configurations were used (a fusion sketch follows):
– α = 0.5
– Simulated adaptive fusion: the best α for each noise level
[Diagram: audio score × α plus video score × (1 − α) gives the fused decision]
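A sketch of the score-level fusion (names ours): per-utterance log-likelihoods from the two HMMs are combined with weight alpha, and the 'simulated adaptive' configuration simply keeps whichever alpha scores best at each noise level after the fact.

    import numpy as np

    def fuse_scores(ll_audio, ll_video, alpha):
        """Weighted sum of per-speaker log-likelihoods, alpha in [0, 1]."""
        return alpha * ll_audio + (1.0 - alpha) * ll_video

    def identify(ll_audio, ll_video, alpha=0.5):
        """Index of the best-scoring speaker model under the fused score."""
        return int(np.argmax(fuse_scores(ll_audio, ll_video, alpha)))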

14 Speaker recognition: well-tracked video
[Plot: tier-1 recognition rate vs. audio noise level]
The video HMM, video-biased FHMM and decision fusion all perform at 100%.
The audio-biased FHMM performs much better than the audio HMM alone, but not as well as video at low noise levels.

15 Speaker recognition: poorly tracked video
Here the video is degraded through poor tracking.
The video-biased FHMM shows no real improvement over the video HMM.
The audio-biased FHMM is better than everything else at most audio-noise levels:
– Even better than simulated adaptive fusion

16 Video- vs. audio-biased FHMM
Adding video to audio HMMs to create an acoustic-biased FHMM provides a clear improvement over the HMM alone.
However, adding audio to video HMMs provides negligible improvement:
– The video HMM provides poor state alignment

17 Acoustic-biased FHMM vs. decision fusion
FHMMs can take advantage of the relationship between the modalities on a frame-by-frame basis; decision fusion can only combine two scores over an entire utterance.
The FHMM even works better than simulated adaptive fusion at most noise levels:
– Actual adaptive fusion would require estimation of the noise level
– The FHMM runs with no knowledge of the noise

18 Conclusion
Acoustic-biased FHMMs provide a clear improvement over acoustic HMMs.
Video-biased FHMMs do not improve upon video HMMs:
– Video HMMs are unreliable at estimating state sequences
The acoustic-biased FHMM performs better than simulated adaptive decision fusion at most noise levels:
– With around half the decoding cost, and greater savings still once the cost of real adaptive fusion is included

19 Future/continuing work
As the CUAVE database is quite small for speaker recognition experiments (only 36 subjects), research has continued on the XM2VTS database (Messer et al., 1999), which has 295 subjects.
Continuous GMM secondary models have replaced the VQ secondary models:
– The video DCT VQ couldn't handle session variability
Verification (rather than identification) allows system performance to be examined more easily.
The system is still under development.

20 References
M. Brand, "A Bayesian computer vision system for modeling human interactions," in ICVS '99, Gran Canaria, Spain, 1999.
C. Chibelushi, F. Deravi, and J. Mason, "A review of speech-based bimodal recognition," IEEE Transactions on Multimedia, vol. 4, no. 1, pp. 23–37, 2002.
D. Dean, P. Lucey, S. Sridharan, and T. Wark, "Comparing audio and visual information for speech processing," in ISSPA 2005, Sydney, Australia, 2005, pp. 58–61.
K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Second International Conference on Audio and Video-based Biometric Person Authentication (AVBPA '99), Washington, D.C., 1999, pp. 72–77.
H. Pan, S. Levinson, T. Huang, and Z.-P. Liang, "A fused hidden Markov model with application to bimodal speech processing," IEEE Transactions on Signal Processing, vol. 52, no. 3, pp. 573–581, 2004.
E. Patterson, S. Gurbuz, Z. Tufekci, and J. Gowdy, "CUAVE: A new audio-visual database for multimodal human-computer interface research," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '02), vol. 2, 2002, pp. 2017–2020.
S. Young, G. Evermann, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book, 3rd ed. Cambridge, UK: Cambridge University Engineering Department, 2002.

21 Questions?

