
1 That's What She Said: Double Entendre Identification Kiddon & Brun 2011

2 Introduction Double entendre: an expression that can be understood in two ways, one innocuous and one risqué. Double entendre identification has not previously been researched; it is a very difficult problem, since most cases require deep semantic and cultural understanding.

3 Introduction “That's what she said” (TWSS) jokes: a subset of double entendres, repopularized by the TV show “The Office” and now an Internet meme. Example, from a late-evening basketball practice: “I was trying all night, but I just could not get it in!”

4 Introduction TWSS as a metaphor identification problem: an analogical mapping between domains, in which terminology from the source domain is used to describe situations in the target domain. Terms in the source domain are literal; the same terms in the target domain are nonliteral.

5 Introduction Other research in computational metaphor identification: learning the selectional preferences of words in multiple domains to identify nonliteral usage, and SVMs trained on labeled data to distinguish metaphoric language from literal language.

6 Method Applied methods from metaphor identification. Mappings between two domains: an innocuous source and an erotic target. Selectional preferences: identify the adjectival selectional preferences of sexually explicit nouns relative to other nouns, and examine the relationship between structures in the erotic domain and nonerotic contexts. The goal for this domain is high precision (correctly identified TWSSs); low recall is tolerated (better to miss an opportunity than to make a socially awkward mistake).

7 Method DEviaNT: Double Entendre via Noun Transfer. An SVM model built on features that capture TWSS characteristics: TWSSs are likely to contain nouns that are euphemisms for sexually explicit nouns, and TWSSs share common structure with sentences in the erotic domain.

8 Method Created word classes for the algorithm. SN is a set of sexually explicit nouns: 76 manually selected nouns predominantly used in sexual contexts, in 9 categories based on which sexual object, body part, or participant they identify. SN⁻ ⊂ SN is the set of likely euphemism targets, with |SN⁻| = 61. BP is the set of body-part nouns; this approximation contains 98 body parts.

9 Method Corpora for comparison. Source domain: an erotic corpus from textfiles.com/sex/EROTICA, 1.5 million sentences; all unparsable text was removed, and the rest was parsed with the Stanford Parser. Target domain: the Brown Corpus (already tagged!).

10 Method Corpora modified to be more generic: all numbers replaced with the CD tag, proper nouns given the NNP tag, and nouns that are elements of SN tagged as SN.
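
A minimal sketch of this genericization step, assuming Penn Treebank tags from the parser; the sn_nouns set and all names are illustrative stand-ins, not from the paper:

    # Normalize one POS-tagged sentence: numbers collapse to a generic CD
    # token, proper nouns to a generic NNP token, and nouns in SN get the
    # special SN tag.
    sn_nouns = {"rod", "meat"}  # illustrative stand-in for the 76-noun SN set

    def normalize(tagged_sentence):
        out = []
        for token, tag in tagged_sentence:
            if tag == "CD":                        # any number
                out.append(("CD", "CD"))
            elif tag in ("NNP", "NNPS"):           # any proper noun
                out.append(("NNP", "NNP"))
            elif tag.startswith("NN") and token.lower() in sn_nouns:
                out.append((token.lower(), "SN"))  # sexually explicit noun
            else:
                out.append((token.lower(), tag))
        return out

    normalize([("She", "PRP"), ("grabbed", "VBD"), ("the", "DT"), ("rod", "NN")])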

11 Method NS(n) = noun sexiness function. For each noun, an adjective count vector contains the frequency of each adjective modifying that noun in the union of the erotica and Brown corpora. NS(n) is the maximum cosine similarity, over the nouns in SN⁻, between the tf-idf-weighted adjective count vectors of n and the SN⁻ noun. Nouns occurring fewer than 200 times, occurring fewer than 50 times with adjectives, or associated with 3× as many adjectives that never occur with nouns in SN were assigned NS(n) = 10⁻⁷ (so not sexy!). Example nouns with high NS are “rod” and “meat”.
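
The cosine-over-tf-idf computation can be sketched as below; the dict-based vectors and helper names are assumptions, and building the tf-idf-weighted adjective count vectors from the two corpora is taken as already done:

    import math

    # NS(n): maximum cosine similarity between n's tf-idf-weighted adjective
    # vector and that of each noun in SN-. Vectors are {adjective: weight}.
    def cosine(u, v):
        dot = sum(u[a] * v[a] for a in u.keys() & v.keys())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def noun_sexiness(vec_n, sn_minus_vectors):
        # sn_minus_vectors: adjective vectors for the 61 nouns in SN-
        return max((cosine(vec_n, v) for v in sn_minus_vectors), default=0.0)

The 10⁻⁷ floor for under-observed nouns would be applied by the caller, per the filtering rules on the slide above.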

12 Method AS(a) = adjective sexiness function. Measures how likely an adjective a is to modify a noun in SN: the relative frequency of a occurring in sentences with at least one noun in SN. Example adjectives with high AS are “hot” and “wet”.
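
This definition translates almost directly into code; a sketch, assuming sentences in the (token, tag) format produced after the SN tagging step:

    # AS(a): relative frequency with which adjective a occurs in a sentence
    # containing at least one SN-tagged noun.
    def adjective_sexiness(adjective, sentences):
        total = 0    # sentences containing the adjective
        with_sn = 0  # ...that also contain a noun tagged SN
        for sent in sentences:
            if any(tok == adjective for tok, _ in sent):
                total += 1
                if any(tag == "SN" for _, tag in sent):
                    with_sn += 1
        return with_sn / total if total else 0.0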

13 Method VS(v) = verb sexiness function. Measures how much more likely a verb phrase is to appear in an erotic than in a nonerotic context. S_E = the set of sentences in the erotic corpus; S_B = the set of sentences in the Brown Corpus. The verb phrase v is the substring containing the verb, bordered on each side by the closest noun or pronoun; where no such noun or pronoun exists, the verb is that endpoint of v.

14 Method VS(v) is defined as the approximate probability of v appearing in an erotic rather than a nonerotic context, estimated from counts in S_E and S_B normalized so that P(s ∈ S_E) = P(s ∈ S_B). That is, VS(v) is the probability that v ∈ s implies s is in an erotic context.
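
The slides do not reproduce the paper's exact estimator; one reasonable reading, sketched below, weights the two corpora as if they were the same size (so that P(s ∈ S_E) = P(s ∈ S_B)) and takes the share of v's normalized occurrences that fall in the erotic corpus:

    # VS(v): probability that a sentence containing verb phrase v is erotic,
    # with corpus sizes normalized. count_e/count_b are occurrence counts of
    # v in S_E and S_B; size_e/size_b are the corpus sentence counts.
    def verb_sexiness(count_e, count_b, size_e, size_b):
        rate_e = count_e / size_e  # relative frequency in the erotic corpus
        rate_b = count_b / size_b  # relative frequency in the Brown Corpus
        return rate_e / (rate_e + rate_b) if (rate_e + rate_b) else 0.0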

15 Method Features DEviaNT uses two categories of features in identification: noun euphemisms and structural elements.

16 Method Noun euphemisms: Does s contain a noun ∈ SN? Does s contain a noun ∈ BP? Does s contain a noun n such that NS(n) = 10⁻⁷? The average NS(n) over all n ∈ s such that n ∉ SN ∪ BP.

17 Method Structural elements: Does s contain a verb that never occurs in S_E? Does s contain a VP that never occurs in S_E? The average VS(v) over all VPs v ∈ s. The average AS(a) over all adjectives a ∈ s. Does s contain an adjective a that never occurs in a sentence of S_E ∪ S_B with a noun n ∈ SN?

18 Method Structural elements also capture basic structure via: the number of non-punctuation tokens; the number of punctuation tokens; for each pronoun and each POS tag, the number of times it occurs in s, bucketed as {0, 1, 2+}; and the category of the subject (noun, pronoun, etc.).
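
Putting slides 16-18 together, a sketch of assembling part of the feature vector for one sentence; the helper functions and the SN/BP sets are the assumed building blocks from above, and the per-POS counts, VP features, and subject-category features are omitted for brevity:

    # Build a partial DEviaNT-style feature dict for a tagged sentence.
    # ns and as_ are the NS/AS functions; sn and bp are sets of nouns.
    def features(sent, ns, as_, sn, bp):
        nouns = [t for t, tag in sent if tag.startswith("NN") or tag == "SN"]
        adjs = [t for t, tag in sent if tag.startswith("JJ")]
        plain = [n for n in nouns if n not in sn | bp]
        return {
            "has_sn_noun": any(tag == "SN" for _, tag in sent),
            "has_bp_noun": any(n in bp for n in nouns),
            "has_unsexy_noun": any(ns(n) <= 1e-7 for n in nouns),
            "avg_ns": sum(map(ns, plain)) / len(plain) if plain else 0.0,
            "avg_as": sum(map(as_, adjs)) / len(adjs) if adjs else 0.0,
            "n_word_tokens": sum(1 for t, _ in sent if t.isalnum()),
            "n_punct_tokens": sum(1 for t, _ in sent if not t.isalnum()),
        }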

19 Method SVM classifier: default parameters, with the option to fit logistic regression curves to the outputs for precision/recall analysis. MetaCost metaclassifier: relabels the training data to produce a single cost-sensitive classifier, with the cost of a false positive set to 100× that of a false negative (avoiding a socially awkward false positive matters more than catching every TWSS).
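
The paper's setup is a WEKA SVM wrapped in MetaCost; a rough scikit-learn analogue of the same cost asymmetry (an approximation, not the authors' exact configuration) uses per-class weights so that errors on negatives, i.e. false positives, cost about 100× more:

    from sklearn.svm import SVC

    # Weighting the negative class 100x penalizes false positives far more
    # than false negatives, steering the classifier toward high precision.
    # probability=True mirrors fitting logistic curves to the SVM outputs.
    clf = SVC(kernel="rbf", class_weight={0: 100, 1: 1}, probability=True)
    # clf.fit(X_train, y_train)  # y: 1 = TWSS, 0 = not a TWSS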

20 Evaluation Goal of the evaluation: show that their features can compete with the baselines. Training data. Positive: 2001 examples from twssstories.com, a website of user-submitted TWSS jokes. Negative: 2001 sentences, 667 from each of three sites: textsfromlastnight.com (racy texts), fmylife.com/intimacy (love-life stories), and wikiquotes.org (quotes from famous American speakers and films).

21 Evaluation Baselines: a Naïve Bayes classifier on unigram features; an SVM on unigram features; an SVM on unigram and bigram features; MetaCost versions of each; and DEviaNT with only the Basic Structure features. The baseline SVM models used the same parameters and kernel functions as DEviaNT. DEviaNT was also tested with unigram features added, but they did not improve performance.
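
For reference, a unigram Naive Bayes baseline is a few lines in scikit-learn; this is an illustrative reconstruction, not the authors' WEKA configuration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Naive Bayes over raw unigram counts; swap in ngram_range=(1, 2) and an
    # SVM estimator for the unigram+bigram SVM baseline.
    baseline = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())
    # baseline.fit(train_sentences, train_labels)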

22 Results DEviaNT and Basic Structure have the highest precision of the tested classifiers: DEviaNT exceeds 71.4% precision, while the unigram SVM without MetaCost topped out at 59.2%.

23 Results Compared the sentences that DEviaNT, Basic Structure, and the unigram SVM without MetaCost most confidently classified as TWSSs. DEviaNT returned 28 sentences, all tied for most likely to be a TWSS: 20 were true positives, and 2 of the 8 false positives were arguably TWSSs, such as “Yeah, but his hole really smells sometimes.”

24 Results Basic Structure returned 16 sentences: 11 true positives, 7 of which were also in DEviaNT's sure set.

25 Results The unigram SVM without MetaCost returned 130 sentences, 77 of them true positives.

26 Results DEviaNT was able to identify TWSSs that rest on noun euphemisms, such as “Don't you think these buns are a little too big for this meat?”, which Basic Structure missed. DEviaNT has much lower recall than the unigram SVM, but it accomplishes the goal of high precision. If the training data were a balanced subset of the test data, DEviaNT's precision would be 0.995.

27 Conclusion Experiments indicate that the euphemism and erotic-domain structure features contribute to improving the precision of TWSS identification. It may be possible to generalize this technique to other types of double entendres and humor.

