Spanish Expressive Voices: Corpus for Emotion Research in Spanish

R. Barra-Chicote (1), J. M. Montero (1), J. Macias-Guarasa (2), S. Lufti (1), J. M. Lucas (1), F. Fernandez (1), L. F. D'haro (1), R. San-Segundo (1), J. Ferreiros (1), R. Cordoba (1) and J. M. Pardo (1)
(1) Speech Technology Group, Universidad Politécnica de Madrid, Spain. (2) Universidad de Alcalá, Spain.

SPEECH CONTENT

Emotional Level Corpus
- The 15 reference sentences of the SESII-A corpus were played by the actors 4 times, gradually increasing the emotional level (neutral, low, medium and high).

Diphone Concatenation Synthesis Corpus
- The LOGATOMOS corpus consists of 570 logatomes (nonsense words) covering the main Spanish diphone distribution. They were grouped into 114 utterances to make the actors' performance easier, and the actors were asked to pause between words so that each word was recorded as if in isolation.
- This corpus allows studying the impact and viability of communicating affective content through voice using words with no semantic content. New voices for limited-domain expressive synthesizers based on concatenation synthesis can be built from it.

Unit Selection Synthesis Corpus
- QUIJOTE is a corpus of 100 utterances selected from the first part of Don Quijote de la Mancha that respects the allophonic distribution of the book. This wide range of allophonic units enables synthesis by unit selection.

Prosody Modeling
- In the SESII-B corpus, hot anger was additionally considered in order to evaluate different kinds of anger. The 4 original paragraphs of SES were split into 84 sentences.
- The PROSODIA corpus consists of 376 utterances divided into 5 sets. Its main purpose is to cover rich prosodic phenomena, making it possible to study prosody in speeches, interviews, short dialogues and question-answering situations.

FAR FIELD CONTENT
- A linear, harmonically spaced array of 12 microphones placed on the left wall.
- A roughly square array of 4 microphones placed on two tables in front of the speaker.

VIDEO CONTENT
- Files were recorded at 720x576 resolution and 25 frames per second.
- The video data has been aligned and linked to the speech and text data, providing a fully labeled multimedia database.

PHONETIC AND PROSODIC LABELING
- The speech has been phonetically labeled automatically using the HTK software.
- In addition, 5% of each sub-corpus in SEV has been manually labeled, providing reference data for studies on rhythm analysis or on the influence of the emotional state on automatic phonetic segmentation systems.
- The EGG signal has also been automatically pitch-marked and, for intonation analysis, the same 5% of each sub-corpus has been manually revised as well.

EVALUATION

Perceptual evaluation
- The close-talk speech of SESII-B, QUIJOTE and PROSODIA (3890 utterances) has been evaluated using a web interface.
- 6 evaluators participated for each voice. They could listen to each utterance as many times as they needed.
- Evaluators were asked to identify the emotion played in each utterance and its emotional level (choosing between very low, low, normal, high or very high).
- Each utterance was evaluated by at least 2 people.
- The Pearson coefficient of the identification rates between the evaluators was 98%. A kappa factor of 100% was used in the validation: 89.6% of the actress's utterances and 84.3% of the actor's utterances were validated (see the agreement sketch below).
- 60% of the utterances were labeled at least as high-level utterances.
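The agreement figures above (the 98% Pearson coefficient between evaluators and the kappa-based validation) can be derived from the raw listening-test labels. The following Python sketch shows one way to do it; the data layout, the variable names and the validation rule are assumptions made for illustration, not the evaluation code actually used for SEV.

# Sketch (assumed data layout, not the SEV evaluation code): computing the
# inter-evaluator agreement figures reported above. Pearson r is taken between
# the per-emotion identification rates of two evaluators, and Cohen's kappa
# between their raw labels; inputs and names are illustrative only.

from collections import Counter

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

EMOTIONS = ["neutral", "happiness", "cold anger", "hot anger",
            "surprise", "sadness", "disgust", "fear"]


def identification_rates(labels, intended):
    """Per-emotion identification rate of one evaluator against the intended emotion."""
    hits, totals = Counter(), Counter()
    for lab, ref in zip(labels, intended):
        totals[ref] += 1
        hits[ref] += int(lab == ref)
    return np.array([hits[e] / totals[e] if totals[e] else 0.0 for e in EMOTIONS])


def agreement_report(eval_a, eval_b, intended):
    """eval_a, eval_b: emotion labels given by two evaluators for the same utterances;
    intended: the emotion the actor was asked to play (all assumed inputs)."""
    rates_a = identification_rates(eval_a, intended)
    rates_b = identification_rates(eval_b, intended)
    r, _ = pearsonr(rates_a, rates_b)                      # poster reports 98%
    kappa = cohen_kappa_score(eval_a, eval_b, labels=EMOTIONS)
    # One possible validation rule: accept an utterance when both evaluators
    # agree on the emotion (the poster speaks of a kappa factor of 100%).
    validated = float(np.mean([a == b for a, b in zip(eval_a, eval_b)]))
    return {"pearson_r": r, "kappa": kappa, "validated_ratio": validated}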
Objective evaluation
- The whole database has been evaluated with an objective emotion identification experiment that yields a 95% identification rate (averaged over both speakers).
- The classifier is based on PLP speech features and their dynamic parameters.
- A 99% Pearson coefficient was obtained between the perceptual and the objective evaluation, and the mean square error between the confusion matrices of the two experiments is less than 5% (see the comparison sketch at the end of this section).

MAIN GOALS
- Record happiness, cold/hot anger, surprise, sadness, disgust, fear and neutral speech.
- 3 acted voices: 1 male, 1 female and a third male voice (not recorded yet).
- The main purpose of the 'near' (close-talk) speech recordings is emotional speech synthesis and the analysis of emotional patterns for emotion identification tasks.
- The main purpose of the 'far' (far-field) speech recordings is to evaluate the impact of capturing affective speech in more realistic conditions (with microphones placed far away from the speakers), also in tasks related to speech recognition and emotion identification.
- The main purpose of the video capture is to allow:
  - research on emotion detection using visual information, and face-tracking studies;
  - the study of specific head, body or arm behaviour that could be related to features such as the emotion intensity level, or that could give relevant information about each emotion played.
- Audio-visual sensor fusion for emotion identification, and even affective speech recognition, are envisaged as potential applications of this corpus.

SET          TEXT                        UTT/emo   min/emo
LOGATOMOS    Isolated words
SESII-A      Short sentences
SESII-B      Long sentences
QUIJOTE      Read speech
PROSODIA 1   A speech
PROSODIA 2   Interview (short answers)
PROSODIA 3   Interview (long answers)
PROSODIA 4   Question answering
PROSODIA 5   Short dialogues
TOTAL                                    1175      ~112

Table: Features related to SEV size for each speaker.

FORTHCOMING WORK
- Carry out an exhaustive perceptual analysis of every voice and evaluate the relevance of the segmental and prosodic information for each emotion.
- Implement high-quality emotional speech synthesis and speech conversion models and algorithms, taking advantage of the wide variety of contexts and situations covered, including recordings for diphone-based or unit-selection synthesis and special sub-corpora for complex prosodic modeling.
- Start work on close-talk and far-field emotion detection, emotional speech recognition and video-based emotion identification.
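As referenced from the objective evaluation bullets above, the comparison between the perceptual and the objective results can be expressed as a Pearson coefficient between per-emotion identification rates and a mean square error between confusion matrices. The Python sketch below illustrates one such computation; the 8x8 row-normalised matrix layout and all names are assumptions, since the poster does not specify how these figures were obtained.

# Sketch (assumed layout, not from the poster): comparing the objective
# (PLP-based classifier) and the perceptual (listening-test) evaluations.
# Confusion matrices are taken as 8x8 count matrices indexed as
# (intended emotion, perceived/predicted emotion); names are illustrative only.

import numpy as np
from scipy.stats import pearsonr


def compare_evaluations(cm_perceptual, cm_objective):
    """Pearson coefficient between per-emotion identification rates and
    mean square error between the two (row-normalised) confusion matrices."""
    cm_p = np.asarray(cm_perceptual, dtype=float)
    cm_o = np.asarray(cm_objective, dtype=float)

    # Row-normalise so each row sums to 1; the diagonal then holds the
    # identification rate of each intended emotion.
    cm_p = cm_p / cm_p.sum(axis=1, keepdims=True)
    cm_o = cm_o / cm_o.sum(axis=1, keepdims=True)

    # Pearson coefficient between the per-emotion identification rates
    # (the poster reports 99% between perceptual and objective evaluation).
    r, _ = pearsonr(np.diag(cm_p), np.diag(cm_o))

    # Mean square error between the full confusion matrices
    # (the poster reports a value below 5%).
    mse = float(np.mean((cm_p - cm_o) ** 2))

    return {"pearson_r": r, "mse": mse}

# Usage (hypothetical inputs): compare_evaluations(perceptual_counts, classifier_counts)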