Computational Extraction of Social and Interactional Meaning from Speech
Dan Jurafsky and Mari Ostendorf
Lecture 7: Dialog Acts & Sarcasm (Mari Ostendorf)
Note: Uncredited examples are from the Dialogue & Conversational Agents chapter.

H-H (Human-Human) Conversation Dynamics (figures from Stolcke et al., CL 2000, and the Jurafsky book)

Human-Computer Dialog
- Greeting (system): "Welcome to the Communicator..."
- Request (user): "I wanna go from Denver to..."
- Clarification Question (system): "What time do you want to leave Denver?"
- Inform (user): "I’d like to leave in the morning..."
- Response (system): "Eight flight options were returned. Option 1..."

Overview
- Dialog acts: definitions, important special cases, detection
- Role of prosody
- Sarcasm: in speech, in text

Overview: dialog acts (definitions, important special cases, detection); role of prosody; sarcasm

Speech/Dialog/Conversation Acts
- Characterize the purpose of an utterance
- Associated with sentences (or intonational phrases)
- Used for: determining and controlling the "state" of a conversation in a spoken language system; conversation analysis, e.g. extracting social information
- Many different tag sets, depending on application

Example Dialog Acts

Aside: Speech vs. Text
- Speech/dialog/conversation act inventories were developed when conversations were spoken
- Now conversations can also happen online or via text messaging
- Dialog acts are relevant there too, and researchers are starting to study them
- Some differences: text is impoverished relative to speech, so extra punctuation, emoticons, etc., are added; turn-taking & grounding differ

Overview: dialog acts (definitions, important special cases, detection); role of prosody; sarcasm

Special Cases
- Question detection → punctuation prediction
- 4-category general set (statement, question, incomplete, backchannel) → cross-domain training and transfer
- Agreement vs. disagreement → social analysis
- Error corrections (for communication errors) → human-computer dialogs

Questions: Harder than you’d think… (e.g., indirect speech acts)

Correction Example

Overview: dialog acts (definitions, important special cases, detection); role of prosody; sarcasm

Automatic Detection
Two problems:
- Classification given segmentation
- Segmentation (often multiple DAs per turn)
Best treated jointly, but this can be computationally complex; start with the known-segmentation case.
Example turn: "ok uh let me pull up your profile and I’ll be right with you here and you said you wanted to travel next week"
One segmentation:
1. ok uh let me pull up your profile and I’ll be right with you here
2. and you said you wanted to travel next week
A finer segmentation:
1. ok
2. uh let me pull up your profile and
3. I’ll be right with you here and
4. you said you wanted to travel
5. next week

Looking at Segmentation (from Stolcke et al., CL 2000)

More Segmentation Challenges

A: Ok, so what do you think?
B: Well that’s a pretty loaded topic.
A: Absolutely.
B: Well, here in uh – Hang on just a minute, the dog is barking -- Ok, here in Oklahoma, we just went through a major educational reform…

A: After all these things, he raises hundreds of millions of dollars. I mean uh the fella
B: but he never stops talking about it.
A: but ok
B: Aren’t you supposed to y- I mean
A: well that’s a little- the Lord says
B: Does charity mean something if you’re constantly using it as a cudgel to beat your enemies over the- I’m better than you. I give money to charity.
A: Well look, now I…

Knowledge Sources for Classification
- Words and grammar: "please," "would you" (cues to a request); aux inversion (cue to a Y/N question); "uh-huh," "yeah" (often backchannels)
- Prosody: rising final pitch (Y/N question, declarative question); pitch & energy can distinguish a backchannel ("yeah") from agreement; a pitch reset may indicate an incomplete; pitch accent type… (more on this)
- Conversational structure (context): answers follow questions

Feature extraction
- Words: n-grams as features; DA-dependent n-gram language model score; presence/absence of syntactic constituents
- Prosody (typically with normalization): speaking rate; mean and variance of log energy; fundamental frequency (mean, variance, overall contour trend, utterance-final contour shape, change in mean across utterance boundaries)
A small extraction sketch follows.
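To make the feature list concrete, here is a minimal sketch of utterance-level feature extraction. It assumes frame-level F0 and log-energy arrays already exist (e.g. from a pitch tracker); the function names, the 20% "final region" heuristic, and the line-fit trend are illustrative assumptions, not the exact features of any cited system.

```python
import numpy as np
from collections import Counter

def prosodic_features(f0, log_energy, duration_s, n_words):
    """Utterance-level prosody summary (illustrative feature set)."""
    voiced = f0[f0 > 0]                        # keep voiced frames only
    tail = voiced[-max(1, len(voiced) // 5):]  # last ~20% of voiced frames
    return {
        "speaking_rate": n_words / duration_s,
        "energy_mean": float(np.mean(log_energy)),
        "energy_var": float(np.var(log_energy)),
        "f0_mean": float(np.mean(voiced)),
        "f0_var": float(np.var(voiced)),
        # overall contour trend: slope of a line fit to the voiced F0 track
        "f0_trend": float(np.polyfit(np.arange(len(voiced)), voiced, 1)[0]),
        # utterance-final shape: positive = final rise, negative = final fall
        "f0_final_shape": float(np.mean(tail) - np.mean(voiced)),
    }

def word_ngram_features(words, max_n=2):
    """Bag of word n-grams up to order max_n."""
    return Counter(" ".join(words[i:i + n])
                   for n in range(1, max_n + 1)
                   for i in range(len(words) - n + 1))
```

In practice the raw prosody statistics would be normalized per speaker or per conversation, as the slide notes.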

Combining Cues with Context
With conversational structure, we need a sequence model:
- d = dialog act sequence d_1, …, d_T
- f = prosody features, w = word/grammar features
Direct model (e.g. conditional random field):
  argmax_d p(d|f,w), where p(d|f,w) = ∏_t p(d_t | f_t, w_t, d_{t-1})
Generative model (e.g. HMM, or hidden event model):
  argmax_d p(f,w|d) p(d) = argmax_d ∏_t p(f_t|d_t) p(w_t|d_t) p(d_t|d_{t-1})
Experimental results show a small gain from context. (A decoding sketch follows below.)
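Decoding the generative model above is a standard Viterbi pass over dialog act states. A toy log-domain sketch, where the observation and transition tables are assumed inputs (trained elsewhere), not real models:

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Best DA sequence under prod_t p(f_t|d_t) p(w_t|d_t) p(d_t|d_{t-1}).

    log_obs[t, j]   = log p(f_t|d_j) + log p(w_t|d_j)   shape (T, K)
    log_trans[i, j] = log p(d_j|d_i)                     shape (K, K)
    log_prior[j]    = log p(d_1 = d_j)                   shape (K,)
    """
    T, K = log_obs.shape
    delta = log_prior + log_obs[0]            # best score ending in each state
    back = np.zeros((T, K), dtype=int)        # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # rows: previous state, cols: current
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]                          # d_1 .. d_T as state indices
```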

Assuming Independent Segments
No sequence model, but the DA prior (unigram) is still important.
- Direct model: argmax_{d_t} p(d_t | f_t, w_t); features can extend beyond the utterance to approximately capture context; need to handle nonhomogeneous cues or make them homogeneous
- Generative model: argmax_{d_t} p(f_t|d_t) p(w_t|d_t) p(d_t); can predict d_t using separate w and f classifiers, then do classifier combination (sketch below)
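The "separate classifiers, then combine" option can be written out with Bayes' rule: dividing each classifier's posterior by the prior recovers a likelihood up to a constant, and the product with a single prior gives the generative score. A sketch, assuming words and prosody are independent given the DA; all numbers below are made up:

```python
import numpy as np

def combine_posteriors(p_d_given_w, p_d_given_f, prior):
    """Pick argmax_d p(w|d) p(f|d) p(d) from two per-class posterior vectors.

    Since p(w|d) is proportional to p(d|w)/p(d) (and likewise for f), the
    combined score is p(d|w)/p(d) * p(d|f)/p(d) * p(d).
    """
    score = (p_d_given_w / prior) * (p_d_given_f / prior) * prior
    return int(np.argmax(score))

# Toy usage over 4 classes (statement, question, incomplete, backchannel):
prior = np.array([0.55, 0.15, 0.10, 0.20])
print(combine_posteriors(np.array([0.2, 0.6, 0.1, 0.1]),
                         np.array([0.3, 0.5, 0.1, 0.1]), prior))  # -> 1 (question)
```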

Some Results (not directly comparable)
- 42 classes (Stolcke et al., CL 2000): hidden-event model, prosody & words (& context); 42-class accuracy 62-65% on Switchboard ASR output (68-71% on hand transcripts)
- 4 classes (Margolis et al., DANLP 2009): LIBLINEAR, n-grams + length (no prosody), hand transcripts; 4-class accuracy 89% Swbd, 84% MRDA; 4-class average recall 85% Swbd, 81% MRDA
- 2 classes (Margolis & Ostendorf, ACL 2011): LIBLINEAR, n-grams + prosody, hand transcripts; question F-measure 0.6 on MRDA (recall = 92%)
- 3 classes (Galley et al., ACL 2004): maxent, lexical/structural/duration features, hand transcripts; 3-class accuracy 86% MRDA

Backchannel "Universals"
- What do backchannels have in common across languages? Short length, low energy, NOT the words
- Example: English: uh-huh, right, yeah; Spanish: mmm, sí, ya (→ mmm, yes, already)
- Experiment (Margolis et al., 2009): cross-language DA classification for English vs. Spanish conversational telephone speech
- Classes: statement, question, incomplete, backchannel
- Use automatic translation in cross-language classification

Spanish vs. English DAs
- Backchannels: roughly 20% of DAs; lexical cues are useful within a language, so length is not used much; length is more important across languages
- Questions: "es que" often starts a statement in Spanish, but its translation "is that" indicates a question in English

Overview: dialog acts; role of prosody; sarcasm

Prosody
- Overall impact is small (from Stolcke et al., CL 2000)
- BUT it can be important for some distinctions
- Other examples: right, so, absolutely, ok, thank you, …
- "Oh." (disappointment) vs. "Oh!" (I get it)
- "Yeah": positive vs. negative

Question Detection (from Margolis & Ostendorf, ACL 2011)

Whatever! (Benus, Gravano & Hirschberg, 2007)
- Production: the 1st syllable is more likely to have a pitch accent for a negative interpretation
- Perception: listeners’ negativity judgments from prosody on "whatever" alone are similar to judgments made with full context

Overview: dialog acts; role of prosody; sarcasm (in speech, in text)

Sarcasm
- Changing the default (or literal) meaning
- Objectives of sarcasm: make someone else feel bad or stupid; display anger or annoyance about something; inside joke
- Why is it interesting? More accurate sentiment detection; more accurate agreement/disagreement detection; general understanding of communication strategies

Negative positives in talk shows:
- yeah and i don't think you’re going to be going back …
- yeah oh yeah that's right yeah yeah yeah but …
- yeah well i well m my understanding is …
- yeah it it it gosh you know is that the standard that prosecutors use the maybe possibly she's telling the truth standard
- yeah i i don't think it was just the radical right
- yeah larry i i want to correct something randi said of course

Negative positives (cont.):
- right right th that's right that's right
- yeah you know what you're right but
- right right but but you you can't say that punching him …
- right but the but the psychiatrists in this case were not just …
- senators are not polling very well right
- then as a columnist who's offering opinions on what i think the right policy is it seems to me…

Yeah, right. (Tepperman et al., 2006)
- 131 instances of "yeah right" in Switchboard & Fisher; 23% annotated as sarcastic
- Annotation: in isolation, very low agreement between human listeners (κ = 0.16)*; in context, still weak agreement (κ = 0.31); gold standard based on discussion
- Observation: laughter is much more frequent around sarcastic versions
* "Prosody alone is not sufficient to discern whether a speaker is being sarcastic."

Sarcasm Detector
- Features: prosody (relative pitch, duration & energy for each word); spectral (class-dependent HMM acoustic model score); context (laughter, gender, pause, Q/A dialog act, location in utterance)
- Classifier: decision tree (WEKA); implicit feature selection in tree training
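A hedged sketch of such a detector, using scikit-learn's decision tree as a stand-in for WEKA; the feature columns follow the slide's feature list, but all values, the depth limit, and the training rows are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# One row per "yeah right" instance:
# [rel_pitch, rel_duration, rel_energy, hmm_score,
#  laughter_nearby, is_male, follows_question]
X = [
    [0.90, 1.2, 1.1, -3.2, 1, 0, 1],   # sarcastic-looking instance
    [1.00, 0.8, 0.9, -2.1, 0, 1, 0],   # sincere-looking instance
    [0.95, 1.3, 1.2, -3.0, 1, 1, 1],   # sarcastic-looking instance
    [1.05, 0.9, 0.8, -2.3, 0, 0, 0],   # sincere-looking instance
]
y = [1, 0, 1, 0]                        # 1 = sarcastic, 0 = sincere

# Tree training does implicit feature selection: features that never
# appear as split nodes are simply unused.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[0.92, 1.1, 1.0, -3.1, 1, 1, 1]]))  # -> [1]
```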

Results
- Laughter is the most important contextual feature
- Energy seems a little more important than pitch

Let’s do our own experiment
Audio examples (male and female speakers): "absolutely", "yeah", "exactly"

Overview: dialog acts; role of prosody; sarcasm in speech and in text (Davidov, Tsur & Rappoport, 2010 [DTR10]; Gonzalez-Ibanez, Muresan & Wacholder, 2011 [GIMW11])

Sarcasm in Twitter & Amazon
Twitter examples (DTR10):
- "thank you Janet Jackson for yet another year of Super Bowl classic rock!"
- "He’s with his other woman: XBox 360. It’s 4:30 fool. Sure I can sleep through the gunfire"
- "Wow GPRS data speeds are blazing fast."
More Twitter examples:
- "That must suck."
- "I can't express how much I love shopping on black"
- "that's what I love about Miami. Attention to detail in preserving historic landmarks of the"
- "im just loving the positive vibes out of that!"
Amazon examples (DTR10), negative positives:
- "[I] Love The Cover" (book)
- "Defective by design" (music player)

Twitter #sarcasm issues
Problems (DTR10):
- Used infrequently
- Used in non-sarcastic cases, e.g. to clarify a previous tweet ("it was #Sarcasm")
- Used when sarcasm is otherwise ambiguous (a prosody surrogate?), so the tag is biased towards the most difficult cases
GIMW11 argues that the non-sarcastic cases are easily filtered by keeping only tweets with #sarcasm at the end (a one-function filter sketch follows).
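That end-of-tweet filter is simple to state in code; a sketch of the position test, where the exact set of tag variants accepted is an assumption:

```python
def keep_as_sarcasm_example(tweet: str) -> bool:
    """Keep a tweet only if a sarcasm hashtag is its final token."""
    return tweet.rstrip().lower().endswith(("#sarcasm", "#sarcastic"))

print(keep_as_sarcasm_example("Wow GPRS data speeds are blazing fast. #sarcasm"))  # True
print(keep_as_sarcasm_example("it was #Sarcasm, sorry for the confusion"))         # False
```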

DTR10 Study
Data:
- Twitter: 5.9M tweets, unconstrained context
- Amazon: 66k reviews, known product context
- Mechanical Turk annotation: κ = 0.34 on Amazon, κ = 0.41 on Twitter
Features:
- Patterns of high-frequency words + content word (CW) slots, e.g. "[COMPANY] CW does not CW much"
- Punctuation
Classifier: k-NN, with semi-supervised labeling of training samples (pattern sketch below)
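A rough sketch of the pattern idea: keep high-frequency words literally, abstract content words to a CW slot, and use the resulting pattern n-grams as features for a k-NN classifier. The tiny high-frequency word list here is illustrative only, not DTR10's actual list:

```python
from collections import Counter

HIGH_FREQ = {"does", "not", "much", "i", "love", "the", "so", "a"}  # assumed list

def to_pattern(tokens):
    """Replace content words with a CW slot; keep high-frequency words."""
    return [w if w.lower() in HIGH_FREQ else "CW" for w in tokens]

def pattern_ngrams(tokens, n=4):
    """Count pattern n-grams, the feature type used alongside punctuation."""
    pat = to_pattern(tokens)
    return Counter(tuple(pat[i:i + n]) for i in range(len(pat) - n + 1))

print(" ".join(to_pattern("Garmin does not offer much".split())))
# -> "CW does not CW much"
```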

DTR10 Results

Amazon results for different feature sets (on the gold standard):
  Punctuation              F = 0.28
  Patterns                 F = 0.77
  Patterns + punctuation   F = 0.81
  Enriched patterns        F = 0.40
  Enriched punctuation     F = 0.77
  All (SASI)               F = 0.83

SASI results for different evaluation paradigms:
  Amazon, Turk labels      F = 0.79
  Twitter, Turk labels     F = 0.83
  Twitter, #Gold           F = 0.55

GIMW11 Study
- Data: 2700 tweets, equal amounts of positive, negative, and sarcastic (no neutral)
- Annotation by hashtags: sarcasm/sarcastic; happy/joy/lucky; sadness/angry/frustrated
- Features: unigrams, LIWC classes (grouped), WordNet-Affect, interjections and punctuation, emoticons & ToUser
- Classifiers: SVM & logistic regression
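A minimal sketch of the unigram-plus-SVM side of this setup using scikit-learn; the two toy tweets are placeholders, and the real feature set also includes LIWC classes, WordNet-Affect, interjections, punctuation, and emoticons:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["Wow GPRS data speeds are blazing fast.",   # hashtag-labeled sarcastic
          "So happy about the sunshine today!"]        # hashtag-labeled positive
labels = ["sarcastic", "positive"]

# Unigram counts -> linear SVM, mirroring the unigram feature baseline
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LinearSVC())
model.fit(tweets, labels)
print(model.predict(["so happy about these blazing fast speeds"]))
```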

Results
- Automatic system accuracy: 3-way sarcastic/positive/negative (S-P-N) 57%; 2-way sarcastic/non-sarcastic (S-NS) 65%
- Equal difficulty separating sarcastic from positive and from negative
- Human S-P-N labeling (270-tweet subset): κ = 0.48; human "accuracy" 43% unanimous, 63% average
- New human S-NS labeling: κ = 0.59; human "accuracy" 59% unanimous, 67% average; automatic: 68%
- Accuracies & agreement go up for the subset with emoticons
- Conclusion: humans are not so good at this task either…

Summary
- Dialog acts: characterize the purpose of an utterance in conversation; useful for punctuation in transcription, social analysis, and dialog management in human-computer interaction; detection leverages words, grammar, prosody & context
- Prosody: matters for a small subset of DAs, but can matter a lot for those cases; is realized in continuous (range) and symbolic (accent) cues, which need contextual normalization
- Sarcasm: a difficult task (for both text and speech)!

Topics not covered
- Joint segmentation and classification
- Semi-supervised learning
- Domain-dependent tag set differences
- etc.