Multimodal Information Analysis for Emotion Recognition


1 Multimodal Information Analysis for Emotion Recognition
(Tele Health Care Application) Malika Meghjani, Gregory Dudek and Frank P. Ferrie. Good morning, my name is Malika. I am from the Department of Electrical and Computer Engineering at McGill University, and I will be presenting my research entitled “Multimodal Information Analysis for Emotion Recognition”. This work was done under the supervision of Prof. Gregory Dudek from the School of Computer Science and Prof. Frank Ferrie from ECSE.

2 Content Our Goal, Motivation, Proposed Approach, Results, Conclusion
In my presentation, I will talk about each of these in turn.

3 Our Goal Automatic emotion recognition using audio-visual information analysis. Create video summaries by automatically labeling the emotions in a video sequence.

4 Motivation
Map the emotional states of the patient to nursing interventions. Evaluate the role of nursing interventions in improving the patient's health. (Intervention: care provided to improve a situation, especially medical procedures or applications intended to relieve illness or injury.) We have recorded video sequences of a patient interacting with a nurse at the remote end of a video conference call. During the conversation, the patient shares various issues relating to her health and her life in general. Based on the patient's concerns and emotional state, the nurse provides the support she needs to better manage her health. Later, these recorded sequences are watched manually and an inference is drawn between the emotional state of the patient and the nurse's intervention, to evaluate how the intervention helped improve the patient's health. This is a very tedious task, which we aim to automate: we want a computer to look at the video sequences and classify them into one of the emotional classes.

5 Proposed Approach
Block diagram: Visual Feature Extraction → Visual-based Emotion Classification, and Audio Feature Extraction → Audio-based Emotion Classification, combined by Data Fusion (Decision-Level or Feature-Level) to give the Recognized Emotional State. Emotions are mainly expressed through facial expressions, gestures and/or the voice. Here we use two sources for emotion recognition: vocal information and facial expressions. First we perform emotion recognition using each of the individual modalities, and then we combine the two modalities to compare how well the system performs. There are basically two ways of combining the information: we can take the decision from each modality, measure how confident each one is, and combine the decisions based on that confidence measure (decision-level fusion); or we can combine the two sources before any decision is made, give the entire set of information to the recognizer, and perform the emotion classification on it (feature-level fusion).

6 Visual Analysis Face Detection (Viola-Jones face detector using Haar wavelets and a boosting algorithm; a minimal detection sketch follows below) Feature Extraction (Gabor filters: 5 spatial frequencies and 4 orientations, 20 filter responses for each frame) Feature Selection (select the most discriminative features across all emotional classes) SVM Classification (classification with probability estimates)
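A minimal face-detection sketch in Python using OpenCV's bundled Haar-cascade (Viola-Jones) detector. The synthetic frame and the crop handling are illustrative assumptions; in practice the frames would come from the recorded nurse-patient sequences (e.g. via cv2.VideoCapture).

    import cv2
    import numpy as np

    # Load the Haar cascade that ships with OpenCV (Viola-Jones detector).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    # Stand-in for one video frame from the recorded sequence.
    frame = np.zeros((240, 320, 3), dtype=np.uint8)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Detect faces and crop each region for the later Gabor feature extraction.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        face = gray[y:y + h, x:x + w]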

7 Visual Feature Extraction
Pipeline: face image convolved with a bank of frequency-domain Gabor filters (5 frequencies × 4 orientations) → feature selection → automatic emotion classification. A filter-bank sketch follows below.
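A sketch of such a Gabor filter bank using OpenCV. The 21×21 kernel size, the particular wavelengths and the use of magnitude responses are assumptions; the talk only specifies 5 spatial frequencies and 4 orientations (20 responses per frame).

    import cv2
    import numpy as np

    def gabor_features(face, wavelengths=(4, 6, 8, 12, 16), n_orient=4):
        """Convolve a face crop with a 5-frequency x 4-orientation Gabor bank
        (20 filters) and stack the magnitude responses into one feature vector."""
        feats = []
        for lam in wavelengths:
            for k in range(n_orient):
                theta = k * np.pi / n_orient
                # Arguments: ksize, sigma, theta, lambda (wavelength), gamma.
                kern = cv2.getGaborKernel((21, 21), lam / 2.0, theta, lam, 0.5)
                resp = cv2.filter2D(face.astype(np.float32), cv2.CV_32F, kern)
                feats.append(np.abs(resp).ravel())
        return np.concatenate(feats)

    # Example on a stand-in face crop at a fixed resolution.
    face = np.random.default_rng(0).integers(0, 256, size=(48, 48), dtype=np.uint8)
    vec = gabor_features(face)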

8 Audio Analysis Audio Pre-Processing (remove the leading and trailing edges of the recording) Feature Extraction (statistics of pitch and intensity contours and Mel Frequency Cepstral Coefficients) Feature Normalization (remove inter-speaker variability; a trimming and normalization sketch follows below) SVM Classification (classification with probability estimates)
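A sketch of the pre-processing and normalization steps, assuming the edges to be removed are leading/trailing silence and that normalization is a per-speaker z-score; both are plausible readings rather than details stated in the talk. A synthetic signal stands in for a recorded utterance.

    import numpy as np
    import librosa

    # Synthetic stand-in for a recorded utterance: silence around a tone.
    sr = 16000
    tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
    y = np.concatenate([np.zeros(sr // 2), tone, np.zeros(sr // 2)])

    # Pre-processing: trim the leading and trailing (silent) edges.
    y_trimmed, _ = librosa.effects.trim(y, top_db=30)

    def normalize_per_speaker(features, speaker_ids):
        """Z-score each speaker's feature rows separately to reduce
        inter-speaker variability (one common choice; the exact scheme
        used in the talk is not specified)."""
        out = np.empty_like(features, dtype=float)
        for spk in np.unique(speaker_ids):
            idx = speaker_ids == spk
            mu = features[idx].mean(axis=0)
            sd = features[idx].std(axis=0) + 1e-8
            out[idx] = (features[idx] - mu) / sd
        return out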

9 Audio Feature Extraction
Block diagram: audio signal → speech rate, pitch and intensity contours, and spectrum analysis via Mel Frequency Cepstral Coefficients (the short-term power spectrum of sound) → automatic emotion classification. MFCC computation: take the Fourier transform of the audio signal, map the powers of the spectrum onto the mel frequency scale, take the log of the powers, and perform a discrete cosine transform; the MFCCs are the amplitudes of the resulting spectrum. A feature-extraction sketch follows below.
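A feature-extraction sketch with librosa, which carries out the MFCC steps above internally. The number of coefficients (13), the pitch range passed to the YIN estimator and the use of RMS energy as the intensity contour are assumptions, as is the choice of summary statistics.

    import numpy as np
    import librosa

    # Synthetic one-second signal as a stand-in for a trimmed utterance.
    sr = 16000
    y = (0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)).astype(np.float32)

    # MFCCs: short-time Fourier transform -> mel-scaled power spectrum ->
    # log of the powers -> discrete cosine transform (the steps listed above).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # shape (13, n_frames)

    # Pitch and intensity contours.
    f0 = librosa.yin(y, fmin=65.0, fmax=400.0, sr=sr)       # pitch contour (Hz)
    rms = librosa.feature.rms(y=y)[0]                        # intensity proxy

    # Per-utterance statistics of the contours and coefficients, concatenated
    # into one fixed-length audio feature vector.
    audio_vec = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                                [f0.mean(), f0.std(), rms.mean(), rms.std()]])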

10 SVM Classification Maximize the margin
Add a penalty term for misclassified points. Map the data to a higher-dimensional space where it is easier to classify with a linear decision surface. A classification sketch with probability estimates follows below.
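A scikit-learn sketch of such a classifier on stand-in data. The RBF kernel, the value of the penalty parameter C and the feature-scaling step are assumptions; probability=True provides the per-class probability estimates used later for fusion.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Toy stand-in data; in the talk the inputs are the selected Gabor / audio
    # features and the labels are the emotional classes.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 40))
    y = rng.integers(0, 6, size=120)

    # RBF-kernel SVM: C penalizes misclassified points, the kernel performs the
    # implicit mapping to a higher-dimensional space, and probability=True
    # turns the margins into per-class probability estimates.
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=10.0, gamma="scale", probability=True))
    clf.fit(X, y)
    proba = clf.predict_proba(X[:5])          # per-class probability estimates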

11 Feature Selection The feature selection method is similar to the SVM classification (a wrapper method). It generates a separating plane by minimizing the weighted sum of distances of misclassified data points to two parallel bounding planes, and suppresses as many components of the normal to the separating plane as possible while still giving consistent classification results. Notation from the accompanying figure: A and B are the two point sets (classes), w is the normal to the separating plane, e is a vector of ones, and the bounding planes delimit the two classes; the objective trades the average error distance against the count of features selected. A sparse-plane sketch follows below.
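Not the exact wrapper formulation described above: as a related sketch, an L1-penalized linear SVM also builds a separating plane while driving many components of its normal w to zero, and the surviving components index the selected features. The stand-in data, the penalty value and the zero threshold are assumptions.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Stand-in feature matrix (e.g. Gabor responses) and binary labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 200))
    y = rng.integers(0, 2, size=100)

    # L1 penalty encourages a sparse normal vector w for the separating plane.
    sel = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=5000)
    sel.fit(X, y)
    w = sel.coef_.ravel()
    selected = np.flatnonzero(np.abs(w) > 1e-6)   # indices of retained features
    X_selected = X[:, selected]                   # reduced feature matrix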

12 Data Fusion
1. Decision Level: Obtain a probability estimate for each emotional class from the SVM margins. The probability estimates from the two modalities are multiplied and re-normalized to give the final decision-level emotion classification. 2. Feature Level: Concatenate the audio and visual features and repeat the feature selection and SVM classification process. A fusion sketch follows below.
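A sketch of both fusion schemes on stand-in arrays; the class count and feature dimensions are arbitrary assumptions.

    import numpy as np

    # Stand-in per-class probability estimates from the audio and visual SVMs
    # (rows: samples, columns: emotional classes).
    rng = np.random.default_rng(0)
    p_audio = rng.dirichlet(np.ones(6), size=10)
    p_visual = rng.dirichlet(np.ones(6), size=10)

    # Decision-level fusion: multiply the two modalities' estimates and
    # re-normalize so each row sums to one again.
    p_fused = p_audio * p_visual
    p_fused /= p_fused.sum(axis=1, keepdims=True)
    decision_level_pred = p_fused.argmax(axis=1)

    # Feature-level fusion: concatenate the audio and visual feature vectors,
    # then repeat feature selection and SVM classification on the joint vectors.
    X_audio = rng.normal(size=(10, 30))
    X_visual = rng.normal(size=(10, 400))
    X_fused = np.hstack([X_audio, X_visual])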

13 Database and Training Database: visual-only posed database and audio-visual posed database.
Training: audio segmentation based on the minimum window required for feature extraction; extraction of the corresponding visual key frame within the segmented window; image-based training and audio-segment-based training. A segmentation sketch follows below.
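A small helper sketching how audio windows could be paired with visual key frames. The one-second window, the choice of the centre frame as the key frame, and the sample/frame rates are assumptions, since the talk does not state them.

    import numpy as np

    def pair_audio_windows_with_key_frames(n_samples, sr, fps, win_sec=1.0):
        """Split an utterance into fixed-length audio windows (win_sec stands
        in for the minimum window needed for feature extraction) and return,
        for each window, the index of the video frame at its centre to use as
        the visual key frame."""
        hop = int(win_sec * sr)
        starts = np.arange(0, n_samples - hop + 1, hop)
        segments = []
        for s in starts:
            centre_time = (s + hop / 2.0) / sr
            segments.append((int(s), int(s + hop), int(round(centre_time * fps))))
        return segments

    # Example: a 5-second utterance at 16 kHz paired with 30 fps video.
    print(pair_audio_windows_with_key_frames(5 * 16000, 16000, 30))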

14 Experimental Results Statistics
Posed Visual Data Only (CKDB): 120 training examples, 20 subjects, 5 emotions + Neutral, 75% recognition rate, leave-one-subject-out cross-validation.
Posed Audio-Visual Data (EDB): 270 training examples, 9 subjects, 6 emotions, 82% recognition rate with Decision Level Fusion and 76% with Feature Level Fusion.

15 Time Series Plot
[Time-series plot of the classification output over a video sequence for the emotions Surprise, Sad, Angry, Disgust and Happy.] 75% leave-one-subject-out cross-validation results (*Cohn-Kanade database, posed visual-only database).

16 Feature Level Fusion (*eNTERFACE 2005, posed audio-visual database)

17 Decision Level Fusion (*eNTERFACE 2005, posed audio-visual database)

18 Confusion Matrix

19 Demo

20 Conclusion Combining the two modalities (audio and visual) improves overall recognition rates by 11% with Decision Level Fusion and by 6% with Feature Level Fusion. Emotions where vision wins: Disgust, Happy and Surprise. Emotions where audio wins: Anger and Sadness. Fear was equally well recognized by the two modalities. These results indicate that combining audio and visual information improves automated emotion recognition over either modality alone.

21 Things to do… Inference based on the temporal relations between instantaneous classifications. Tests on a natural audio-visual database (ongoing).

