Audio Visual Speech Recognition
Presentation transcript:

AudioVisual Speech Recognition (AVSR)

Speech is essentially bimodal: it carries both acoustic and visual cues.
- H. McGurk and J. MacDonald, "Hearing lips and seeing voices", Nature, vol. 264, pp. 746-748, December 1976.
- T. Chen and R. Rao, "Audio-visual integration in multimodal communication", Proceedings of the IEEE, Special Issue on Multimedia Signal Processing, vol. 86, pp. 837-852, May 1998.
- D.G. Stork and M.E. Hennecke, editors, "Speechreading by Humans and Machines", Springer, Berlin, Germany, 1996.
Through their integration (fusion), the aim is to increase robustness and performance. There are far too many papers to cover, so let us look at a few, most of them from the point of view of integration.

AudioVisual Speech Recognition (AVSR)

● Tutorials.
- G. Potamianos, C. Neti, G. Gravier, A. Garg and A.W. Senior, "Recent advances in the automatic recognition of audio-visual speech", Proceedings of the IEEE, vol. 91, no. 9, pp. 1306-1326, September 2003.
- G. Potamianos, C. Neti, J. Luettin and I. Matthews, "Audio-visual automatic speech recognition: An overview", in G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-Visual Speech Processing, Chapter 10, MIT Press, 2004.
● Real Conditions.
- G. Potamianos and C. Neti, "Audio-visual speech recognition in challenging environments", in Proc. European Conference on Speech Technology, 2003.
- G. Potamianos, C. Neti, J. Huang, J.H. Connell, S. Chu, V. Libal, E. Marcheret, N. Haas and J. Jiang, "Towards practical deployment of audio-visual speech recognition", ICASSP'04, vol. 3, Montreal, Canada, 2004.

AudioVisual Speech Recognition (AVSR)

The goal: increase robustness and performance. This rests on two facts:
- The visual modality is largely unaffected by degradations of acoustic quality.
- The visual and acoustic modalities carry complementary information.
B. Dodd and R. Campbell, eds., "Hearing by Eye: The Psychology of Lip-Reading", Lawrence Erlbaum Associates Ltd., London, England, 1987.
But if the integration is not done well, the result is catastrophic fusion.
J.R. Movellan and P. Mineiro, "Modularity and catastrophic fusion: A Bayesian approach with applications to audio-visual speech recognition", Tech. Rep., Department of Cognitive Science, UCSD, San Diego, CA.

AVSR Integration (Fusion)

Early Integration (EI): fusion at the feature level, by concatenating the audio and visual features. But the features are not synchronous (e.g., around the voice onset time, VOT)!
- S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, vol. 2, September 2000.
- C.C. Chibelushi, J.S. Mason and F. Deravi, "Integration of acoustic and visual speech for speaker recognition", Eurospeech'93, Berlin, September 1993.
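To make the idea concrete, here is a minimal sketch of feature-level fusion in Python; the function name, feature choices (MFCCs, mouth-region DCT coefficients) and array shapes are illustrative assumptions, not something prescribed by the papers above:

    import numpy as np

    def early_fusion(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
        """Early integration: concatenate per-frame audio and visual features.

        audio_feats:  (T, Da) array, e.g., MFCCs resampled to the video rate.
        visual_feats: (T, Dv) array, e.g., DCT coefficients of the mouth ROI.
        Returns a (T, Da + Dv) array that is fed to a single model.
        """
        # The streams must already be frame-aligned; plain concatenation
        # cannot represent audio-visual asynchrony (e.g., around the VOT).
        assert audio_feats.shape[0] == visual_feats.shape[0]
        return np.concatenate([audio_feats, visual_feats], axis=1)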

AVSR Integration (Fusion)

Late Integration (LI): fusion at the decision level, by combining the scores of independently decoded streams. But all temporal information is lost!
- A. Adjoudani and C. Benoit, "Audio-visual speech recognition compared across two architectures", Eurospeech'95, Madrid, Spain, September 1995.
- S. Dupont and J. Luettin, "Audio-visual speech modeling for continuous speech recognition", IEEE Transactions on Multimedia, vol. 2, September 2000.
- M. Heckmann, F. Berthommier and K. Kroschel, "Noise adaptive stream weighting in audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, November 2002.
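Here is a matching sketch of decision-level fusion, assuming each stream has already been decoded into per-class log-likelihoods; the fixed stream weight is an illustrative assumption (noise-adaptive schemes such as Heckmann et al.'s estimate it from the signal):

    import numpy as np

    def late_fusion(audio_loglik: np.ndarray, visual_loglik: np.ndarray,
                    lambda_a: float = 0.7) -> int:
        """Late integration: weighted sum of independently computed scores.

        audio_loglik, visual_loglik: (C,) arrays of per-class (e.g., per-word)
        log-likelihoods from separate audio-only and visual-only models.
        lambda_a weights the acoustic stream; the visual stream gets
        1 - lambda_a. Only the final scores meet, so timing is discarded.
        """
        combined = lambda_a * audio_loglik + (1.0 - lambda_a) * visual_loglik
        return int(np.argmax(combined))  # index of the winning class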

AVSR Integration (Fusion)

Middle Integration (MI) allows:
- Specific word or sub-word models.
- Synchronous continuous speech recognition.
- J. Luettin, G. Potamianos and C. Neti, "Asynchronous stream modeling for large vocabulary audio-visual speech recognition", ICASSP'01, vol. 1, Salt Lake City, USA, May 2001.
- G. Potamianos, J. Luettin and C. Neti, "Hierarchical discriminant features for audio-visual LVCSR", ICASSP'01, vol. 1, Salt Lake City, USA, May 2001.

AVSR Integration, Dynamic Bayesian Networks

Multistream HMM:
- State synchrony.
- Weighting of the observations.
- A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, "Dynamic Bayesian Networks for audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.
- G. Potamianos, C. Neti, J. Luettin and I. Matthews, "Audio-visual automatic speech recognition: An overview", in G. Bailly, E. Vatikiotis-Bateson and P. Perrier, eds., Issues in Visual and Audio-Visual Speech Processing, Chapter 10, MIT Press, 2004.
[Figure: multistream HMM unrolled over time, t = 0, 1, 2, ..., T]
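In the standard state-synchronous formulation (notation mine, not from the slide), both streams share one state sequence, and the observation likelihood of state j is a weighted product of the per-stream likelihoods:

    b_j(o_t) = b_j^A(o_t^A)^{\lambda_A} \cdot b_j^V(o_t^V)^{\lambda_V}, \qquad \lambda_A + \lambda_V = 1,

where o_t^A and o_t^V are the acoustic and visual observations at time t, and the exponents weight the reliability of each modality (e.g., \lambda_A is lowered as the acoustic SNR drops).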

AVSR Integration, Dynamic Bayesian Networks

Product HMM:
- Asynchrony between the streams.
- Too many parameters.
- G. Gravier, G. Potamianos and C. Neti, "Asynchrony modeling for audio-visual speech recognition", in Human Language Technology Conference, 2002.
[Figure: product HMM unrolled over time, t = 0, 1, 2, ..., T; the slide author notes: "I am not sure about this graphical representation."]
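To see where the parameter blow-up comes from (standard formulation, notation mine): the product HMM runs on composite states (q_t^A, q_t^V), so its transition model is

    P((q_t^A, q_t^V) \mid (q_{t-1}^A, q_{t-1}^V)),

a table over N_A N_V composite states, i.e. on the order of (N_A N_V)^2 transition parameters. This is what buys bounded asynchrony between the streams, and also what makes the model expensive to train.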

AVSR Integration, Dynamic Bayesian Networks

Factorial HMM:
- The transition probabilities are independent for each stream.
- Z. Ghahramani and M.I. Jordan, "Factorial hidden Markov models", in Proc. Advances in Neural Information Processing Systems, vol. 8, 1996.
[Figure: factorial HMM unrolled over time, t = 0, 1, 2, ..., T]
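Concretely (standard formulation, notation mine), the factorial HMM factorizes the joint transition over the streams,

    P(q_t^A, q_t^V \mid q_{t-1}^A, q_{t-1}^V) = P(q_t^A \mid q_{t-1}^A) \cdot P(q_t^V \mid q_{t-1}^V),

so the chains evolve independently and interact only through the joint observation model.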

AVSR Integration, Dynamic Bayesian Networks

Coupled HMM (1/2):
- The backbones (the hidden state chains) depend on each other.
- M. Brand, N. Oliver and A. Pentland, "Coupled hidden Markov models for complex action recognition", in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 1997.
- S. Chu and T. Huang, "Audio-visual speech modeling using coupled hidden Markov models", ICASSP'02, 2002.
[Figure: coupled HMM unrolled over time, t = 0, 1, 2, ..., T]
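In the standard coupled-HMM formulation (notation mine), each chain's transition is conditioned on the previous states of both chains,

    P(q_t^A \mid q_{t-1}^A, q_{t-1}^V) \quad\text{and}\quad P(q_t^V \mid q_{t-1}^V, q_{t-1}^A),

which couples the backbones directly while keeping separate per-stream observation models; the transition tables need on the order of N_A^2 N_V + N_V^2 N_A parameters, far fewer than the product HMM's (N_A N_V)^2.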

AVSR Integration, Dynamic Bayesian Networks

Coupled HMM (2/2):
- A.V. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy, "Dynamic Bayesian Networks for audio-visual speech recognition", EURASIP Journal on Applied Signal Processing, vol. 11, pp. 1-15, 2002.
- A. Subramanya, S. Gurbuz, E. Patterson and J.N. Gowdy, "Audiovisual speech integration using coupled hidden Markov models for continuous speech recognition", ICASSP'03, 2003.

AVSR Integration, Dynamic Bayesian Networks

Implicit Modeling:
- J.N. Gowdy, A. Subramanya, C. Bartels and J. Bilmes, "DBN based multi-stream models for audio-visual speech recognition", ICASSP'04, Montreal, Canada, 2004.
- X. Lei, G. Ji, T. Ng, J. Bilmes and M. Ostendorf, "DBN based multi-stream models for Mandarin toneme recognition", ICASSP'05, Philadelphia, USA, 2005.