Multimodal Deep Learning

Presentation transcript:

Multimodal Deep Learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng
Stanford University

I'm going to play a video of a person speaking. Watch the video carefully and note what you hear, and then I want you to close your eyes and just listen. [McGurk demonstration] What happened for most of you is that when you watched the video you perceived the person saying /da/, whereas when you only listened to the clip, you probably heard /ba/. This is known as the McGurk effect, and it shows that speech perception works by a complex integration of the video and audio signals in our brain. In particular, the video gave us information about the place of articulation and the mouth motions, and that changed the way we perceived the sound.

McGurk Effect

Audio-Visual Speech Recognition. In this work, I'm going to talk about audio-visual speech recognition and how we can apply deep learning to this multimodal setting. For example, given a short clip of a person saying a letter, we have two sources of data: the images of his lips and the audio. Can we determine which letter he said, and how do we integrate these two sources of data?

Feature Challenge. So how do we solve this problem? A common machine learning pipeline goes like this: we take the inputs, extract some features, and feed them into our standard ML toolbox, e.g., a classifier such as an SVM. The hardest part is really the features: how we represent the audio and video data for use in our classifier. For audio, the speech community has developed many features, such as MFCCs, that work really well, but it is not obvious what features we should use for the lips.
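As a rough sketch of this pipeline (the feature dimensions, data, and classifier settings below are placeholders rather than the actual experimental setup):

```python
# Generic features-then-classifier pipeline: extract a feature vector per
# example, standardize it, and train a linear SVM. All data here is random
# filler standing in for MFCCs or lip features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 100))     # e.g., MFCCs or learned lip features
labels = rng.integers(0, 26, size=200)     # e.g., 26 letter classes

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
clf.fit(features, labels)
print(clf.predict(features[:5]))
```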

Representing Lips. So what do state-of-the-art features look like? Engineering these features took a long time. To this end, we address two questions in this work: Can we learn better representations for audio-visual speech recognition? And how can multimodal data (multiple sources of input) be used to find better features? Furthermore, what is interesting in this problem is the question of depth: the audio and video features are only related at a deep level.

Unsupervised Feature Learning. [illustration: a lip-image sequence and an audio segment each mapped to a vector of numbers] Concretely, our task is to convert a sequence of lip images into a vector of numbers, and similarly for the audio.
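Concretely, "a vector of numbers" here just means turning a short window of data into one input vector; a minimal sketch with assumed frame counts and sizes:

```python
# Flatten a hypothetical window of lip images into a single input vector;
# the audio segment would similarly become a vector of spectrogram values.
import numpy as np

lip_window = np.random.rand(4, 60, 80)   # 4 frames of 60x80 lip crops (assumed sizes)
video_vector = lip_window.reshape(-1)    # shape (4 * 60 * 80,) = (19200,)
print(video_vector.shape)
```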

Unsupervised Feature Learning. [illustration: the audio and video feature vectors concatenated into one vector] Now that we have multimodal data, one easy approach is to simply concatenate the two feature vectors. However, simple concatenation fails to model the interactions between the modalities; it is a very limited view of multimodal features. Instead, what we would like to do is the following.
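The concatenation baseline amounts to nothing more than stacking the two vectors, which is why it cannot by itself model cross-modal interactions; a trivial sketch with placeholder dimensions:

```python
import numpy as np

audio_feats = np.random.randn(200, 60)                 # placeholder audio features
video_feats = np.random.randn(200, 19200)              # placeholder video features
concat_feats = np.hstack([audio_feats, video_feats])   # (200, 60 + 19200)
# Each downstream weight still sees the modalities as independent inputs;
# nothing forces features that relate the audio to the video.
```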

Multimodal Features. [illustration: a single joint feature vector] Find better ways to relate the audio and visual inputs, and obtain features that arise out of relating them together.

Cross-Modality Feature Learning. Next, I'm going to describe a different feature learning setting. Suppose that at test time only the lip images are available and you do not get the audio signal, but at training time you have both audio and video. Can the audio at training time help you do better at test time, even though you don't have audio at test time? In other words, if our task is only lip reading (visual speech recognition), an interesting question to ask is: can we improve our lip-reading features if we had audio data during training?

Feature Learning Models

Feature Learning with Autoencoders. [diagram: separate autoencoders for the two modalities; the Audio Input and Video Input are each mapped through a hidden layer to an Audio Reconstruction and a Video Reconstruction, respectively]
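As a minimal sketch of a single-modality autoencoder (a plain dense network with assumed layer sizes, not the exact training recipe used in this work), where the hidden activations serve as the learned features:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encode an input into a hidden code and reconstruct the input from it."""
    def __init__(self, n_input, n_hidden):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_input, n_hidden), nn.Sigmoid())
        self.decoder = nn.Linear(n_hidden, n_input)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

video_ae = Autoencoder(n_input=60 * 80, n_hidden=512)   # hypothetical lip-image size
x = torch.rand(32, 60 * 80)
reconstruction, features = video_ae(x)
loss = nn.functional.mse_loss(reconstruction, x)
loss.backward()   # gradients for one reconstruction-training step
```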

Bimodal Autoencoder. [diagram: the Audio Input and Video Input feed a shared Hidden Representation, which produces both an Audio Reconstruction and a Video Reconstruction] Let's step back a bit and take a similar but related approach to the problem: what if we learn a single autoencoder over both modalities? As we will see, this shallow model still has a problem, but it sets up something interesting.

Shallow Learning. [diagram: a single layer of hidden units connected to both the Video Input and the Audio Input] There are different versions of these shallow models, and if you train a model of this form, this is what you usually get: looking at the hidden units, it turns out that most of them respond to only one modality, i.e., the model learns mostly unimodal features. So why doesn't this work? We think there are two reasons. First, the model has no incentive to form bimodal features. Second, we are trying to relate raw pixel values to values in the audio spectrogram, which is really difficult; we do not expect a change in one pixel value to tell us how the audio pitch is changing. Instead, we expect mid-level video features, such as mouth motions, to inform us about the audio content. The relations across the modalities are deep, so we really need a deep model.

Bimodal Autoencoder for Cross-Modality Learning. [diagram: only the Video Input is presented; the Hidden Representation must still produce both the Audio Reconstruction and the Video Reconstruction] This model is trained on clips that have both audio and video, but only the video is given as input. Cross-modality learning: learn better video features by using the audio as a cue.

Cross-Modality Deep Autoencoder. [diagram: the Video Input passes through a deep encoder to a Learned Representation, which is decoded into both an Audio Reconstruction and a Video Reconstruction] However, the connections between audio and video are arguably deep rather than shallow, so ideally we want to extract mid-level features before trying to connect the modalities. Since audio is very informative for speech recognition, the model will learn representations that can reconstruct the audio, and those representations should hopefully be good for speech recognition as well.
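A sketch of this idea (not the authors' exact architecture; the layer sizes and the mean-squared reconstruction loss are assumptions): only the video enters the encoder, but the learned representation must reconstruct both modalities.

```python
import torch
import torch.nn as nn

class CrossModalityDeepAE(nn.Module):
    def __init__(self, d_video, d_audio, d_mid=512, d_rep=256):
        super().__init__()
        self.video_encoder = nn.Sequential(
            nn.Linear(d_video, d_mid), nn.Sigmoid(),    # mid-level video features
            nn.Linear(d_mid, d_rep), nn.Sigmoid(),      # learned representation
        )
        self.video_decoder = nn.Sequential(nn.Linear(d_rep, d_mid), nn.Sigmoid(),
                                           nn.Linear(d_mid, d_video))
        self.audio_decoder = nn.Sequential(nn.Linear(d_rep, d_mid), nn.Sigmoid(),
                                           nn.Linear(d_mid, d_audio))

    def forward(self, video):
        rep = self.video_encoder(video)
        return self.audio_decoder(rep), self.video_decoder(rep), rep

model = CrossModalityDeepAE(d_video=60 * 80, d_audio=100)   # assumed dimensions
video = torch.rand(32, 60 * 80)
audio = torch.rand(32, 100)                                 # available only at training time
audio_rec, video_rec, rep = model(video)
loss = nn.functional.mse_loss(audio_rec, audio) + nn.functional.mse_loss(video_rec, video)
loss.backward()
```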

Cross-Modality Deep Autoencoder. [diagram: the same architecture with the Audio Input in place of the Video Input] What we would like, though, is not to have to train many separate versions of this model. It turns out that the separate models can be unified into a single one.

Bimodal Deep Autoencoders. [diagram: the Audio Input and Video Input pass through modality-specific layers, roughly "phonemes" on the audio side and "visemes" (mouth shapes) on the video side, into a Shared Representation that is decoded into both an Audio Reconstruction and a Video Reconstruction] The second model we present is the bimodal deep autoencoder. What we want it to do is learn representations that relate both the audio and the video data; concretely, representations that are robust to which input modality is present.

Bimodal Deep Autoencoders. [diagram: video-only input through the "visemes" (mouth shapes) pathway, still producing both the Audio Reconstruction and the Video Reconstruction]

Bimodal Deep Autoencoders. [diagram: audio-only input through the "phonemes" pathway, still producing both reconstructions]

Bimodal Deep Autoencoders. [diagram: both the Audio Input and the Video Input feeding the Shared Representation, which produces both reconstructions]
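A sketch of such an architecture (layer sizes and activation choices are assumptions, not the paper's exact configuration): modality-specific lower layers, a shared top layer, and decoders for both modalities.

```python
import torch
import torch.nn as nn

class BimodalDeepAE(nn.Module):
    def __init__(self, d_audio, d_video, d_mid=512, d_shared=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(d_audio, d_mid), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(d_video, d_mid), nn.Sigmoid())
        self.shared = nn.Sequential(nn.Linear(2 * d_mid, d_shared), nn.Sigmoid())
        self.audio_dec = nn.Sequential(nn.Linear(d_shared, d_mid), nn.Sigmoid(),
                                       nn.Linear(d_mid, d_audio))
        self.video_dec = nn.Sequential(nn.Linear(d_shared, d_mid), nn.Sigmoid(),
                                       nn.Linear(d_mid, d_video))

    def forward(self, audio, video):
        # Concatenate the modality-specific codes and map them to the shared layer.
        h = self.shared(torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=1))
        return self.audio_dec(h), self.video_dec(h)
```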

Training the Bimodal Deep Autoencoder. [diagram: the same network trained in three configurations: audio and video as input, audio only, and video only, in each case reconstructing both modalities] Train a single model to perform all three tasks. This is similar in spirit to denoising autoencoders (Vincent et al., 2008).
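One way to sketch this multi-task training (assuming the hypothetical BimodalDeepAE above, and zeroing out the missing modality as a simple stand-in for whichever corruption scheme is actually used):

```python
import torch

def bimodal_training_step(model, audio, video, optimizer):
    """Present each example with both modalities, audio only, and video only,
    and always ask for both reconstructions (denoising-autoencoder style)."""
    mse = torch.nn.functional.mse_loss
    optimizer.zero_grad()
    loss = 0.0
    for a_in, v_in in [(audio, video),                      # both modalities
                       (audio, torch.zeros_like(video)),    # audio only
                       (torch.zeros_like(audio), video)]:   # video only
        audio_rec, video_rec = model(a_in, v_in)
        loss = loss + mse(audio_rec, audio) + mse(video_rec, video)
    loss.backward()
    optimizer.step()
    return float(loss)

# usage sketch:
# model = BimodalDeepAE(d_audio=100, d_video=60 * 80)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# bimodal_training_step(model, torch.rand(32, 100), torch.rand(32, 60 * 80), opt)
```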

Evaluations

Visualizations of Learned Features. [figure: audio (spectrogram) and video features learned over 100 ms windows, shown at 0 ms, 33 ms, 67 ms, and 100 ms] The features correspond to mouth motions and are paired up with the audio spectrogram. The features are generic and are not speaker-specific.

Lip-reading with AVLetters. 26-way letter classification, 10 speakers, 60x80-pixel lip regions. Cross-modality learning protocol: feature learning uses audio + video; supervised learning and testing use video only. [diagram: the cross-modality deep autoencoder with video input]

Lip-reading with AVLetters
Feature Representation: Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%

Lip-reading with AVLetters
Feature Representation: Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
Video-Only Learning (Single-Modality Learning): 54.2%

Lip-reading with AVLetters
Feature Representation: Classification Accuracy
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
Video-Only Learning (Single-Modality Learning): 54.2%
Our Features (Cross-Modality Learning): 64.4%

Lip-reading with CUAVE. 10-way digit classification, 36 speakers. Cross-modality learning protocol: feature learning uses audio + video; supervised learning and testing use video only. [diagram: the cross-modality deep autoencoder with video input]

Lip-reading with CUAVE
Feature Representation: Classification Accuracy
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single-Modality Learning): 65.4%

Lip-reading with CUAVE
Feature Representation: Classification Accuracy
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single-Modality Learning): 65.4%
Our Features (Cross-Modality Learning): 68.7%

Lip-reading with CUAVE
Feature Representation: Classification Accuracy
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single-Modality Learning): 65.4%
Our Features (Cross-Modality Learning): 68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009): 64.0%
Visemic AAM (Papandreou et al., 2009): 83.0%

Multimodal Recognition. [diagram: the bimodal deep autoencoder with both the Audio Input and the Video Input] CUAVE: 10-way digit classification, 36 speakers. We evaluate in clean and noisy audio scenarios; in the clean audio scenario, audio alone performs extremely well. Protocol: feature learning, supervised learning, and testing all use audio + video.
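The noisy condition in the following tables mixes noise into the audio at 0 dB SNR; a small sketch of how a signal can be corrupted at a target SNR (white noise here is only a stand-in for the actual noise source):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = np.random.randn(16000)                 # hypothetical one second of 16 kHz audio
noise = np.random.randn(16000)
noisy = mix_at_snr(clean, noise, snr_db=0.0)   # 0 dB: equal signal and noise power
```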

Multimodal Recognition
Feature Representation: Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%

Multimodal Recognition
Feature Representation: Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%
Bimodal Deep Autoencoder: 77.3%

Multimodal Recognition
Feature Representation: Classification Accuracy (Noisy Audio at 0 dB SNR)
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%
Bimodal Deep Autoencoder: 77.3%
Bimodal Deep Autoencoder + Audio Features (RBM): 82.2%

Shared Representation Evaluation. Protocol: feature learning uses audio + video; supervised learning uses audio; testing uses video. [diagram: during supervised training, the audio is mapped to the shared representation and a linear classifier is trained on it; at test time, the video is mapped to the same shared representation and fed to that classifier]

Shared Representation Evaluation. Method: learned features + Canonical Correlation Analysis. Protocol: feature learning on audio + video; supervised learning on audio; testing on video. Reported accuracies: 57.3% and 91.7%.
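A sketch of the CCA step (placeholder data and dimensions; LinearSVC stands in for whichever linear classifier is used): project the audio and video features into a common space, train a linear classifier on one modality's projection, and test on the other's.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(300, 64))    # placeholder learned audio features
video_feats = rng.normal(size=(300, 64))    # placeholder learned video features
labels = rng.integers(0, 10, size=300)

cca = CCA(n_components=32).fit(audio_feats, video_feats)
audio_proj, video_proj = cca.transform(audio_feats, video_feats)

clf = LinearSVC(max_iter=5000).fit(audio_proj, labels)         # supervised learning on audio
print("cross-modal accuracy:", clf.score(video_proj, labels))  # testing on video
```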

McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
Model predictions (/ga/, /ba/, /da/) by input:
Audio /ga/ + Video /ga/: 82.6%, 2.2%, 15.2%
Audio /ba/ + Video /ba/: 4.4%, 89.1%, 6.5%

McGurk Effect
A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
Model predictions (/ga/, /ba/, /da/) by input:
Audio /ga/ + Video /ga/: 82.6%, 2.2%, 15.2%
Audio /ba/ + Video /ba/: 4.4%, 89.1%, 6.5%
Audio /ba/ + Video /ga/: 28.3%, 13.0%, 58.7%

Conclusion. We applied deep autoencoders to discover features in multimodal data. Cross-modality learning: we obtained better video features (for lip reading) by using audio as a cue. Multimodal feature learning: we learned shared representations that relate the audio and video data. [diagrams: the cross-modality deep autoencoder and the bimodal deep autoencoder]

Bimodal Learning with RBMs. [diagram: a single layer of hidden units connected to the concatenated Audio Input and Video Input] One simple approach is to concatenate the inputs together; each hidden unit then sees both the audio and visual inputs simultaneously. We tried this, so let's see what we get.
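A sketch of this shallow bimodal baseline, using scikit-learn's BernoulliRBM over the concatenated inputs (random data scaled to [0, 1] stands in for real spectrogram frames and lip images):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
audio = rng.random((500, 100))        # placeholder spectrogram frames
video = rng.random((500, 60 * 80))    # placeholder flattened lip images
x = np.hstack([audio, video])         # each hidden unit sees both modalities

rbm = BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=5, random_state=0)
hidden = rbm.fit_transform(x)         # hidden-unit activations as features
print(hidden.shape)                   # (500, 256)
```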