Multimodal Deep Learning


1 Multimodal Deep Learning
Jiquan Ngiam Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee & Andrew Ng Stanford University

2 I'm going to play a video of a person speaking
Watch the video carefully and note what you hear … and then I want you to close your eyes and only listen to the clip. What happened for most of you is that when you watched the video you should have perceived the person saying /da/, whereas when you only listened to the clip, you probably heard /ba/. This effect is known as the McGurk effect, and it shows that speech perception involves a complex integration of video and audio signals in our brain. In particular, the video gave us information about the place of articulation and the mouth motions, and that changed the way we perceived the sound.

3 McGurk Effect

4 Audio-Visual Speech Recognition
In this work, I'm going to talk about audio-visual speech recognition and how we can apply deep learning to this multimodal setting. For example, given a short video of a person saying letters, can we determine which letter was said from the images of the lips and from the audio, and how do we integrate these two sources of data?

5 Feature Challenge Classifier (e.g. SVM)
So how do we solve this problem? A common machine learning pipeline goes like this: take the inputs, extract some features, and feed them into our standard ML toolbox (e.g., a classifier such as an SVM). The hardest part is really the features: how we represent the audio and video data for use in our classifier. For audio, the speech community has developed many features, such as MFCCs, which work very well; it is much less obvious what features we should use for the lips.
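To make this pipeline concrete, here is a minimal Python sketch, assuming librosa and scikit-learn are available: hand-engineered MFCC audio features are averaged over time and fed to an SVM. The waveforms, labels, and MFCC settings are illustrative stand-ins, not the setup used in this work.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def audio_features(waveform, sr=16000):
    # MFCCs: the classic hand-engineered audio feature mentioned above.
    # Average over time to get one fixed-length vector per clip.
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

sr = 16000
clips = [np.random.randn(sr) for _ in range(4)]   # stand-in 1-second waveforms
labels = [0, 1, 0, 1]                             # stand-in class labels (e.g., letters)

X = np.stack([audio_features(c, sr) for c in clips])
clf = SVC(kernel="linear").fit(X, labels)         # the "standard ML toolbox" step
```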

6 Representing Lips Can we learn better representations for audio-visual speech recognition? How can multimodal data (multiple sources of input) be used to find better features? So what do state-of-the-art features look like? Engineering these features took a long time. To address this, we pose two questions in this work. [click] Furthermore, what is interesting in this problem is the deep question: audio and video features are only related at a deep level.

7 Unsupervised Feature Learning
Concretely, our task is to convert a sequence of lip images into a vector of numbers, and similarly for the audio.

8 Unsupervised Feature Learning
Now that we have multimodal data, one easy option is to simply concatenate the audio and video features. However, simply concatenating them like this fails to model the interactions between the modalities; it is a very limited view of multimodal features. Instead, what we would like to do [click] is the following.
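A tiny NumPy sketch of the naive concatenation just described; the two vectors are toy stand-ins for real audio and video feature vectors.

```python
import numpy as np

audio_feat = np.array([5.0, 1.1, 10.0])    # stand-in audio feature vector
video_feat = np.array([9.0, 1.67, 3.0])    # stand-in video feature vector

# Naive multimodal feature: concatenate the two vectors. A downstream model
# sees one flat vector, with nothing that explicitly relates the modalities.
joint_feat = np.concatenate([audio_feat, video_feat])
```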

9 Multimodal Features Find better ways to relate the audio and visual inputs, and obtain features that arise from relating them together.

10 Cross-Modality Feature Learning
Next I'm going to describe a different feature learning setting. Suppose that at test time only the lip images are available and you do not get the audio signal, but at training time you have both audio and video. Can the audio at training time help you do better at test time, even though you do not have audio at test time? (Lip reading on its own is not well defined.) So if our task is only to do lip reading, that is, visual speech recognition, an interesting question to ask is: can we improve our lip-reading features if we have audio data during training?

11 Feature Learning Models

12 Feature Learning with Autoencoders
Audio input and video input each feed their own autoencoder, which maps the input to a hidden representation and reconstructs the same modality (audio reconstruction from audio input, video reconstruction from video input).

13 Bimodal Autoencoder
Audio and video inputs feed a single hidden representation, which reconstructs both the audio and the video. Let's step back a bit and take a similar but related approach to the problem: what if we learn an autoencoder over both modalities? This still has the same problem as the shallow model, but now we can do something interesting.
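A minimal PyTorch sketch of a bimodal autoencoder along these lines; the dimensions, activations, and loss are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, hidden_dim=256):
        super().__init__()
        # One shared hidden layer sees the concatenated audio + video input.
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden_dim), nn.Sigmoid())
        # Separate decoders reconstruct each modality from the hidden code.
        self.audio_decoder = nn.Linear(hidden_dim, audio_dim)
        self.video_decoder = nn.Linear(hidden_dim, video_dim)

    def forward(self, audio, video):
        h = self.encoder(torch.cat([audio, video], dim=1))
        return self.audio_decoder(h), self.video_decoder(h)

model = BimodalAutoencoder()
audio = torch.randn(8, 100)     # stand-in batch of audio features
video = torch.randn(8, 300)     # stand-in batch of video (lip-image) features
audio_rec, video_rec = model(audio, video)
loss = (nn.functional.mse_loss(audio_rec, audio) +
        nn.functional.mse_loss(video_rec, video))
```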

14 Bimodal Autoencoder
Audio Input, Video Input, Hidden Representation, Audio Reconstruction, Video Reconstruction.

15 Shallow Learning
Mostly unimodal features learned. The hidden units sit directly on top of the video input and the audio input. If you train a shallow model of this form and look at the hidden units, it turns out that most of them respond to only one modality: the model learns mostly unimodal features. We think there are two possible reasons. First, the model has no incentive to relate the modalities. Second, we are trying to relate raw pixel values to values in the audio spectrogram, which is very difficult; we do not expect a change in one pixel value to tell us how the audio pitch is changing. Instead, we expect mid-level video features, such as mouth motions, to inform us about the audio content. The relations across the modalities are deep, and we really need a deep model to capture them.

16 Bimodal Autoencoder
Audio Input, Video Input, Hidden Representation, Audio Reconstruction, Video Reconstruction.

17 Bimodal Autoencoder: Cross-modality Learning
The model is trained on clips with both audio and video, but only the video input is given; from the hidden representation it must reconstruct both the audio and the video. Cross-modality learning: learn better video features by using audio as a cue.

18 Cross-modality Deep Autoencoder
Video Input, Learned Representation, Audio Reconstruction, Video Reconstruction. However, the connections between audio and video are arguably deep rather than shallow, so ideally we want to extract mid-level features before trying to connect the modalities together. Since audio is very informative for speech recognition, the model learns video representations that can reconstruct the audio, and these should therefore be good for speech recognition as well.
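A hedged PyTorch sketch of this cross-modality deep autoencoder idea: only the video goes in, mid-level features are extracted first, and the learned representation must reconstruct both modalities. All layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalityDeepAE(nn.Module):
    def __init__(self, video_dim=300, audio_dim=100, mid_dim=256, rep_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(          # video -> mid-level features -> representation
            nn.Linear(video_dim, mid_dim), nn.Sigmoid(),
            nn.Linear(mid_dim, rep_dim), nn.Sigmoid())
        self.audio_decoder = nn.Sequential(    # representation -> audio reconstruction
            nn.Linear(rep_dim, mid_dim), nn.Sigmoid(),
            nn.Linear(mid_dim, audio_dim))
        self.video_decoder = nn.Sequential(    # representation -> video reconstruction
            nn.Linear(rep_dim, mid_dim), nn.Sigmoid(),
            nn.Linear(mid_dim, video_dim))

    def forward(self, video):
        rep = self.encoder(video)
        return rep, self.audio_decoder(rep), self.video_decoder(rep)

model = CrossModalityDeepAE()
video = torch.randn(8, 300)     # stand-in lip-image features
audio = torch.randn(8, 100)     # paired audio, available only at training time
rep, audio_rec, video_rec = model(video)
loss = (nn.functional.mse_loss(audio_rec, audio) +
        nn.functional.mse_loss(video_rec, video))
```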

19 Cross-modality Deep Autoencoder
Audio Input, Learned Representation, Audio Reconstruction, Video Reconstruction. But we would rather not have to train many separate versions of this model; it turns out the separate models can be unified.

20 Bimodal Deep Autoencoders
Audio Input and Video Input map to a Shared Representation, which reconstructs both the audio ("phonemes") and the video ("visemes", i.e. mouth shapes). [pause] The second model we present is the bimodal deep autoencoder. What we want it to do is learn representations that relate the audio and the video data; concretely, representations that are robust to which input modality is present.

21 Bimodal Deep Autoencoders
Video Input only: the model still produces both the Audio Reconstruction and the Video Reconstruction; the video pathway captures "visemes" (mouth shapes).

22 Bimodal Deep Autoencoders
Audio Input only: the model still produces both the Audio Reconstruction and the Video Reconstruction; the audio pathway captures "phonemes".

23 Bimodal Deep Autoencoders
Audio Input and Video Input together map to the Shared Representation and reconstruct both modalities ("phonemes" and "visemes", mouth shapes).

24 Training Bimodal Deep Autoencoder
Three training cases share one model: (1) audio and video inputs, (2) audio input only, (3) video input only; in every case the shared representation must reconstruct both the audio and the video. Train a single model to perform all three tasks. Similar in spirit to denoising autoencoders (Vincent et al., 2008).
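A PyTorch sketch of this training scheme under assumed dimensions: a single bimodal deep autoencoder is updated on three versions of each batch (both modalities, audio only, video only), and in every case the clean audio and video are the reconstruction targets, much as a denoising autoencoder reconstructs clean inputs from corrupted ones.

```python
import torch
import torch.nn as nn

class BimodalDeepAE(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, mid_dim=256, shared_dim=128):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, mid_dim), nn.Sigmoid())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, mid_dim), nn.Sigmoid())
        self.shared = nn.Sequential(nn.Linear(2 * mid_dim, shared_dim), nn.Sigmoid())
        self.audio_dec = nn.Linear(shared_dim, audio_dim)
        self.video_dec = nn.Linear(shared_dim, video_dim)

    def forward(self, audio, video):
        h = self.shared(torch.cat([self.audio_enc(audio), self.video_enc(video)], dim=1))
        return self.audio_dec(h), self.video_dec(h)

model = BimodalDeepAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
audio = torch.randn(8, 100)                     # stand-in paired training batch
video = torch.randn(8, 300)

for audio_in, video_in in [
    (audio, video),                             # task 1: both modalities present
    (audio, torch.zeros_like(video)),           # task 2: audio only (video zeroed out)
    (torch.zeros_like(audio), video),           # task 3: video only (audio zeroed out)
]:
    audio_rec, video_rec = model(audio_in, video_in)
    # Regardless of which inputs were given, reconstruct the full clean pair.
    loss = (nn.functional.mse_loss(audio_rec, audio) +
            nn.functional.mse_loss(video_rec, video))
    opt.zero_grad()
    loss.backward()
    opt.step()
```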

25 Evaluations

26 Visualizations of Learned Features
Audio (spectrogram) and video features learned over 100 ms windows (video frames at 0, 33, 67, and 100 ms). The features correspond to mouth motions and are paired up with the audio spectrogram. The features are generic and are not speaker-specific.

27 Lip-reading with AVLetters
26-way letter classification, 10 speakers, 60x80-pixel lip regions. Cross-modality learning: video input to a learned representation that reconstructs both the audio and the video. Protocol: feature learning on audio + video; supervised learning on video; testing on video.
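As a sketch of this protocol (PyTorch + scikit-learn), a learned video encoder is treated as a fixed feature extractor and a standard supervised classifier is trained on top; the encoder, lip features, and labels below are random stand-ins, not the trained model or the AVLetters data.

```python
import torch
import torch.nn as nn
from sklearn.svm import SVC

# Stand-in for the video encoder obtained from cross-modality feature learning.
encoder = nn.Sequential(nn.Linear(300, 128), nn.Sigmoid())

video_train = torch.randn(40, 300)              # stand-in lip-image features (train)
video_test = torch.randn(10, 300)               # stand-in lip-image features (test)
y_train = torch.randint(0, 26, (40,)).numpy()   # stand-in labels for 26-way letters
y_test = torch.randint(0, 26, (10,)).numpy()

with torch.no_grad():                           # features are fixed; only the classifier is trained
    train_feats = encoder(video_train).numpy()
    test_feats = encoder(video_test).numpy()

clf = SVC(kernel="linear").fit(train_feats, y_train)
accuracy = clf.score(test_feats, y_test)        # video-only testing, as in the protocol above
```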

28 Lip-reading with AVLetters
Feature Representation / Classification Accuracy:
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%

29 Lip-reading with AVLetters
Feature Representation / Classification Accuracy:
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
Video-Only Learning (Single Modality Learning): 54.2%

30 Lip-reading with AVLetters
Feature Representation / Classification Accuracy:
Multiscale Spatial Analysis (Matthews et al., 2002): 44.6%
Local Binary Pattern (Zhao & Barnard, 2009): 58.5%
Video-Only Learning (Single Modality Learning): 54.2%
Our Features (Cross-Modality Learning): 64.4%

31 Lip-reading with CUAVE
10-way digit classification, 36 speakers. Cross-modality learning: video input to a learned representation that reconstructs both the audio and the video. Protocol: feature learning on audio + video; supervised learning on video; testing on video.

32 Lip-reading with CUAVE
Feature Representation / Classification Accuracy:
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single Modality Learning): 65.4%

33 Lip-reading with CUAVE
Feature Representation / Classification Accuracy:
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single Modality Learning): 65.4%
Our Features (Cross-Modality Learning): 68.7%

34 Lip-reading with CUAVE
Feature Representation / Classification Accuracy:
Baseline Preprocessed Video: 58.5%
Video-Only Learning (Single Modality Learning): 65.4%
Our Features (Cross-Modality Learning): 68.7%
Discrete Cosine Transform (Gurban & Thiran, 2009): 64.0%
Visemic AAM (Papandreou et al., 2009): 83.0%

35 Multimodal Recognition
Audio Input and Video Input feed a Shared Representation that reconstructs both the audio and the video. CUAVE: 10-way digit classification, 36 speakers. We evaluate in clean and noisy audio scenarios; in the clean audio scenario, audio alone performs extremely well. Protocol: feature learning, supervised learning, and testing all use audio + video.

36 Multimodal Recognition
Feature Representation / Classification Accuracy (noisy audio at 0 dB SNR):
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%

37 Multimodal Recognition
Feature Representation / Classification Accuracy (noisy audio at 0 dB SNR):
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%
Bimodal Deep Autoencoder: 77.3%

38 Multimodal Recognition
Feature Representation / Classification Accuracy (noisy audio at 0 dB SNR):
Audio Features (RBM): 75.8%
Our Best Video Features: 68.7%
Bimodal Deep Autoencoder: 77.3%
Bimodal Deep Autoencoder + Audio Features (RBM): 82.2%

39 Shared Representation Evaluation
Protocol: feature learning on audio + video; supervised learning on audio; testing on video. During supervised training, the audio is mapped to the shared representation and a linear classifier is trained on it; at test time, the video is mapped to the same shared representation and fed to that classifier.

40 Shared Representation Evaluation
Method: Learned Features + Canonical Correlation Analysis
Feature Learning: Audio + Video; Supervised Learning: Audio; Testing: Video; Accuracy: 57.3%
Feature Learning: Audio + Video; Supervised Learning: Video; Testing: Audio; Accuracy: 91.7%
Supervised training maps one modality through the shared representation into a linear classifier; testing maps the other modality through the same shared representation. Explain in phases!
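A sketch of this shared-representation evaluation using scikit-learn's CCA: fit CCA on paired audio/video features, train a linear classifier on the audio-side projection, and test it on the video-side projection. All features and labels below are random stand-ins, and the number of CCA components is an assumption.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((200, 100))     # stand-in learned audio features
video_feats = rng.standard_normal((200, 300))     # paired stand-in learned video features
labels = rng.integers(0, 10, 200)                 # stand-in digit labels

# Relate the two modalities: project both into a shared CCA space.
cca = CCA(n_components=32).fit(audio_feats, video_feats)
audio_shared, video_shared = cca.transform(audio_feats, video_feats)

# Supervised learning on audio, testing on video ("hearing to see").
clf = LinearSVC().fit(audio_shared, labels)
accuracy = clf.score(video_shared, labels)
```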

41 McGurk Effect A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
Audio and video input / model predictions for /ga/, /ba/, /da/:
Audio /ga/ + Visual /ga/: 82.6% / 2.2% / 15.2%
Audio /ba/ + Visual /ba/: 4.4% / 89.1% / 6.5%
Explain in phases.

42 McGurk Effect A visual /ga/ combined with an audio /ba/ is often perceived as /da/.
Audio and video input / model predictions for /ga/, /ba/, /da/:
Audio /ga/ + Visual /ga/: 82.6% / 2.2% / 15.2%
Audio /ba/ + Visual /ba/: 4.4% / 89.1% / 6.5%
Audio /ba/ + Visual /ga/: 28.3% / 13.0% / 58.7%
Explain in phases.

43 Conclusion We applied deep autoencoders to discover features in multimodal data. Cross-modality learning: we obtained better video features (for lip reading) by using audio as a cue. Multimodal feature learning: we learned shared representations that relate the audio and video data.


46 Bimodal Learning with RBMs
Audio Input and Video Input are concatenated and feed a single layer of Hidden Units. One simple approach is to concatenate the modalities; now each hidden unit sees both the audio and the visual inputs simultaneously. We tried this, and let's see what we get.
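As a rough illustration of this concatenation approach, here is a sketch that trains scikit-learn's BernoulliRBM on concatenated audio and video vectors, so every hidden unit is connected to both modalities; the data are random stand-ins and BernoulliRBM is a stand-in for the RBM variant used in the paper.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
audio = rng.random((500, 100))      # stand-in audio inputs (spectrogram-like), scaled to [0, 1]
video = rng.random((500, 300))      # stand-in video inputs (lip-image patches), scaled to [0, 1]

joint = np.hstack([audio, video])   # every hidden unit connects to both modalities
rbm = BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=10).fit(joint)
hidden = rbm.transform(joint)       # hidden-unit activations serve as the "bimodal" features
```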

