Multimodal Information Analysis for Emotion Recognition

Presentation transcript:

Multimodal Information Analysis for Emotion Recognition (Tele-Health Care Application). Malika Meghjani, Gregory Dudek and Frank P. Ferrie. Good morning, my name is Malika. I am from the Department of Electrical and Computer Engineering at McGill University, and I will be presenting my research entitled "Multimodal Information Analysis for Emotion Recognition". This work was done under the supervision of Prof. Gregory Dudek from the School of Computer Science and Prof. Frank Ferrie from ECSE.

Content: Our Goal, Motivation, Proposed Approach, Results, Conclusion. In my presentation, I will talk about…

Our Goal: Automatic emotion recognition using audio-visual information analysis. Create video summaries by automatically labeling the emotions in a video sequence.

Motivation: Map the emotional states of the patient to nursing interventions, and evaluate the role of nursing interventions in improving the patient's health. (Intervention: care provided to improve a situation, especially medical procedures or applications intended to relieve illness or injury.) We have recorded video sequences of a patient interacting with a nurse at a remote end in a video conference call. During the conversation, the patient shares various issues relating to her health and life in general. Based on the patient's concerns and emotional state, the nurse provides the support required for the patient to better manage her health. Later, these recorded sequences are watched manually and an inference is drawn between the emotional state of the patient and the nurse's intervention, to evaluate how the intervention helped improve the patient's health. This is a very tedious task which we aim to automate: we want a computer to look at the video sequences and classify each one into one of the emotional classes.

Proposed Approach (block diagram: visual feature extraction → visual-based emotion classification, audio feature extraction → audio-based emotion classification, data fusion at the decision level or the feature level → recognized emotional state). Emotions are mainly expressed through facial expressions, gestures and/or the voice. Here we use two sources for emotion recognition: vocal information and facial expressions. First we perform emotion recognition using each of the individual modalities, and then we combine the two modalities to compare how well the system performs. There are basically two ways of combining the information. We can either take the decision of each modality, see how confident each one is, and then combine them based on the confidence measures (decision-level fusion). The second method combines the two sources before any decision is made, giving the entire set of information to the recognizer, which then performs the emotion classification (feature-level fusion).

Visual Analysis: Face Detection (Viola-Jones face detector using Haar wavelets and a boosting algorithm); Feature Extraction (Gabor filters: 5 spatial frequencies and 4 orientations, giving 20 filter responses for each frame); Feature Selection (select the most discriminative features across all the emotional classes); SVM Classification (classification with probability estimates).
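As a minimal sketch of this visual pipeline (using OpenCV; the filter parameters and face size are illustrative assumptions, not the values from the slides), the face can be detected with a Haar-cascade Viola-Jones detector and then filtered with a 5 × 4 Gabor bank:

```python
import cv2
import numpy as np

# Viola-Jones face detector shipped with OpenCV (Haar cascade).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def gabor_bank(n_freqs=5, n_orients=4, ksize=31):
    """Build a bank of 5 x 4 = 20 Gabor kernels; wavelengths and bandwidths
    here are illustrative choices, not the ones used by the authors."""
    kernels = []
    for f in range(n_freqs):
        lambd = 4.0 * (2 ** f)                # wavelength per frequency level (assumption)
        for o in range(n_orients):
            theta = o * np.pi / n_orients     # 4 evenly spaced orientations
            kernels.append(cv2.getGaborKernel(
                (ksize, ksize), sigma=0.56 * lambd, theta=theta,
                lambd=lambd, gamma=0.5, psi=0))
    return kernels

def visual_features(frame):
    """Detect the first face in a BGR frame and return 20 concatenated
    Gabor filter responses, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.3, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
    responses = [cv2.filter2D(face, cv2.CV_32F, k).ravel() for k in gabor_bank()]
    return np.concatenate(responses)
```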

Visual Feature Extraction: the detected face is filtered with a bank of frequency-domain Gabor filters (5 frequencies × 4 orientations); the responses are passed through feature selection and then to automatic emotion classification.

Audio Analysis: Audio Pre-Processing (remove leading and trailing edges); Feature Extraction (statistics of the pitch and intensity contours, and Mel-frequency cepstral coefficients); Feature Normalization (remove inter-speaker variability); SVM Classification (classification with probability estimates).
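The normalization step can be illustrated with a per-speaker z-score, one common way of removing inter-speaker variability (a sketch under that assumption; the slides do not state the exact scheme):

```python
import numpy as np

def normalize_per_speaker(features, speaker_ids):
    """Z-score each feature dimension within each speaker, so the classifier
    sees relative variation in pitch/intensity/MFCCs rather than absolute
    levels that differ from speaker to speaker."""
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        idx = speaker_ids == spk
        mu = features[idx].mean(axis=0)
        sigma = features[idx].std(axis=0) + 1e-8   # avoid division by zero
        out[idx] = (features[idx] - mu) / sigma
    return out
```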

Audio Feature Extraction: from the audio signal we extract speech rate, pitch, intensity and a spectrum analysis in the form of Mel-frequency cepstral coefficients (a representation of the short-term power spectrum of sound), which feed the automatic emotion classification. MFCC computation: take the Fourier transform of the audio signal; map the powers of the spectrum onto the mel scale; take the log of the powers; perform a discrete cosine transform. The MFCCs are the amplitudes of the resulting cepstrum.
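These are the steps of a standard MFCC computation; a minimal sketch with librosa (the library choice, sampling rate and pitch range are illustrative assumptions) could look like this:

```python
import librosa
import numpy as np

def audio_features(path, n_mfcc=13):
    """Pitch, intensity and MFCC statistics for one audio segment."""
    y, sr = librosa.load(path, sr=16000)
    # MFCCs: windowed FFT -> mel filterbank -> log energies -> DCT.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pitch contour (fundamental frequency) and intensity (RMS energy).
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    # Summarize each contour by simple statistics, as the slide suggests.
    def stats(x):
        return [np.mean(x), np.std(x), np.min(x), np.max(x)]
    return np.concatenate([stats(f0), stats(rms),
                           mfcc.mean(axis=1), mfcc.std(axis=1)])
```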

SVM Classification: maximize the margin; include a penalty term for misclassified points; map the data to a higher-dimensional space where it is easier to classify with a linear decision surface.
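For reference (not shown on the slide), these three points correspond to the standard soft-margin SVM objective, where C weights the penalty for misclassified points and φ is the implicit kernel map to the higher-dimensional space:

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\qquad \text{s.t.}\quad y_{i}\left(w^{\top}\phi(x_{i}) + b\right) \ge 1 - \xi_{i},\quad \xi_{i}\ge 0 .
```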

Feature Selection: the feature selection method is similar to SVM classification (a wrapper method). It generates a separating plane by minimizing a weighted sum of the distances of misclassified data points to two parallel bounding planes, and suppresses as many components of the normal to the separating plane as possible while still providing consistent classification results. In the formulation, A and B are the two point sets, 'w' is the normal to the separating plane, and e is a vector of ones.
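Assuming this refers to the robust linear programming formulation of Bradley and Mangasarian (an assumption; the slide does not cite it explicitly), the separating plane is obtained from

```latex
\min_{w,\,\gamma,\,y,\,z}\ \frac{e^{\top}y}{m} + \frac{e^{\top}z}{k}
\qquad \text{s.t.}\quad -Aw + e\gamma + e \le y,\quad Bw - e\gamma + e \le z,\quad y \ge 0,\ z \ge 0,
```

where A (m points) and B (k points) hold the two classes, the bounding planes are x^T w = γ ± 1, and y, z measure the average violations of those planes. Feature selection then adds a term that drives as many components of w to zero as possible while keeping the classification error low.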

Data Fusion. 1. Decision Level: obtain a probability estimate for each emotional class from the SVM margins; the probability estimates from the two modalities are multiplied and re-normalized to give the final decision-level emotion classification. 2. Feature Level: concatenate the audio and visual features and repeat the feature selection and SVM classification process.
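A minimal sketch of both fusion schemes, assuming scikit-learn-style classifiers with predict_proba (illustrative, not the authors' code):

```python
import numpy as np

def decision_level_fusion(p_audio, p_visual):
    """Multiply the per-class probability estimates from the two modalities
    and renormalize so they sum to one."""
    p = np.asarray(p_audio) * np.asarray(p_visual)
    return p / p.sum()

def feature_level_fusion(audio_feats, visual_feats, joint_clf):
    """Concatenate the audio and visual feature vectors and classify the
    joint vector with a single, already trained classifier."""
    joint = np.concatenate([audio_feats, visual_feats]).reshape(1, -1)
    return joint_clf.predict_proba(joint)[0]
```

In scikit-learn, predict_proba is available on SVC when it is constructed with probability=True, which matches the "classification with probability estimates" mentioned on the earlier slides.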

Database and Training. Databases: a visual-only posed database and an audio-visual posed database. Training: audio segmentation based on the minimum window required for feature extraction; extraction of the corresponding visual key frame within the segmented window; image-based training and audio-segment-based training.
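As a purely illustrative sketch (the window length and the key-frame rule are assumptions, not stated on the slide), pairing audio windows with visual key frames could be done as follows:

```python
def paired_training_samples(n_audio_samples, sr, fps, win_sec=2.0):
    """Yield (audio_start, audio_end) sample ranges of fixed-length windows
    and the index of the video frame at the centre of each window, used here
    as the visual key frame."""
    win = int(win_sec * sr)
    for start in range(0, n_audio_samples - win + 1, win):
        end = start + win
        mid_time = (start + end) / (2.0 * sr)     # centre of the window, in seconds
        key_frame = int(round(mid_time * fps))
        yield (start, end), key_frame
```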

Experimental Results (statistics):
Database | Training Examples | Subjects | Emotional States | Recognition Rate | Validation Method
Posed Visual Data Only (CKDB) | 120 | 20 | 5 + Neutral | 75% | Leave-one-subject-out cross validation
Posed Audio Visual Data (EDB) | 270 | 9 | 6 | 82% (decision level), 76% (feature level) | —

Time Series Plot (emotion classes: Surprise, Sad, Angry, Disgust, Happy); 75% leave-one-subject-out cross-validation results (Cohn-Kanade database, posed visual-only database).

Feature Level Fusion (eNTERFACE 2005, posed audio-visual database).

Decision Level Fusion (eNTERFACE 2005, posed audio-visual database).

Confusion Matrix

Demo

Conclusion: Combining the two modalities (audio and visual) improves the overall recognition rate by 11% with decision-level fusion and by 6% with feature-level fusion. Emotions where vision wins: Disgust, Happy and Surprise. Emotions where audio wins: Anger and Sadness. Fear was equally well recognized by the two modalities. Automated multimodal emotion recognition is therefore effective on these posed databases.

Things to do… Inference based on the temporal relation between instantaneous classifications. Tests on a natural audio-visual database (ongoing).