Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon — Dmitry Davidov, Oren Tsur, Ari Rappoport
Sarcasm: Definition “Sarcasm is a sophisticated form of speech act in which the speakers convey their message in an implicit way.” “The activity of saying or writing the opposite of what you mean, or of speaking in a way intended to make someone else feel stupid or angry.” – Macmillan English Dictionary (2007)
Examples
Twitter:
“This is what I get to study tonight…! Yippy #sarcasm”
“Ahhhh the feeling you get while driving back to boarding school. The best. #sarcasm”
Amazon:
“Finally pens for women! I don’t know what I have been doing all my life writing with men’s pens.”
“Defective by Design.”
SASI – Semi-Supervised Sarcasm Identification Trains a classifier to recognize sarcastic patterns in a semi-supervised setting, then uses it to score each sentence on a scale from 1 (absence of sarcasm) to 5 (clearly sarcastic).
Seed data for Training (Amazon) 80 positive and 505 negative examples, extended to 471 positive and 5020 negative examples using the Yahoo! BOSS API. Data was preprocessed to replace occurrences of author names, product names, company names, book titles, usernames, and links with [AUTHOR], [PRODUCT], [COMPANY], [TITLE], [USER], [LINK]. This reduces the specificity of the recognized patterns.
Seed data for Training (Twitter) Positive examples are the same as those used for Amazon (cross-domain); negative examples were hand-annotated. Data was preprocessed to replace occurrences of usernames, links, and hashtags with [USER], [LINK], [HASHTAG].
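The Twitter preprocessing step can be sketched with simple substitutions; the regexes below are our assumption of how the tagging might be done (the Amazon version would additionally need lists of author, product, company, and title names to match against):

```python
import re

def preprocess_tweet(text: str) -> str:
    """Replace Twitter-specific tokens with generic tags so extracted
    patterns generalize across users and topics (regexes are illustrative)."""
    text = re.sub(r"https?://\S+", "[LINK]", text)  # links first, before @/# rules
    text = re.sub(r"@\w+", "[USER]", text)          # @-mentions
    text = re.sub(r"#\w+", "[HASHTAG]", text)       # hashtags
    return text

print(preprocess_tweet("@bob check http://x.co #sarcasm"))
# → [USER] check [LINK] [HASHTAG]
```

Replacing links before mentions and hashtags avoids mangling URLs that happen to contain `#` fragments.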
Testing data 66,000 Amazon product reviews for 120 products; 5.9 million tweets.
Pattern extraction Words were classified as high-frequency words (HFW) or content words (CW) by corpus frequency: an HFW occurs at least 100 times per million words, a CW at most 1000 times per million (the ranges overlap, so a word can belong to both classes). Patterns such as “[COMPANY] CW does not CW much” and “about CW CW or CW CW” are extracted.
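The frequency thresholds can be sketched directly; the function below assumes we already have each word's corpus frequency per million words, and makes the overlap between the two classes explicit:

```python
def word_classes(freq_per_million: float) -> set:
    """Classify a word as HFW and/or CW from its corpus frequency.
    Thresholds follow the slide: HFW >= 100/M, CW <= 1000/M.
    A word in [100, 1000] per million falls into both classes."""
    classes = set()
    if freq_per_million >= 100:
        classes.add("HFW")
    if freq_per_million <= 1000:
        classes.add("CW")
    return classes

print(sorted(word_classes(500)))   # → ['CW', 'HFW']
print(sorted(word_classes(5000)))  # → ['HFW']
```

The overlap matters for pattern extraction: a mid-frequency word may anchor one pattern as an HFW and fill a CW slot in another.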
Pattern extraction (contd.) To reduce the number of patterns: – Remove patterns which occur in only one review – Remove ambivalent patterns.
Feature Vectors Each pattern is used as one element of the feature vector F = [p1, p2, p3, …, pn], where
pi = 1 (exact match)
pi = α (sparse match)
pi = γ · n/N (incomplete match, with n of the pattern’s N components matched)
pi = 0 (no match)
Classification Algorithm Feature vectors for seed data and test data are created and compared. For a vector v to be classified, with t1 … tk the k seed vectors at lowest Euclidean distance from v:
Label(v) = (1/k) · [ Σi Count(Label(ti)) · Label(ti) ] / [ Σj Count(Label(tj)) ]
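One reading of this weighted k-nearest-neighbor step is sketched below. Feature vectors are plain lists of floats and labels are the 1–5 sarcasm scores; Count() is read here as the number of neighbors sharing a label, so labels frequent among the k neighbors carry more weight. The role of the extra 1/k factor on the slide is ambiguous, so this sketch computes only the count-weighted mean, which keeps the score on the 1–5 scale:

```python
import math

def classify(v, seed, k=5):
    """seed: list of (feature_vector, label) pairs.
    Returns a count-weighted mean of the labels of v's k nearest neighbors."""
    nearest = sorted(seed, key=lambda s: math.dist(v, s[0]))[:k]
    counts = {}                       # label -> multiplicity among the k neighbors
    for _, label in nearest:
        counts[label] = counts.get(label, 0) + 1
    num = sum(counts[label] * label for _, label in nearest)
    den = sum(counts[label] for _, label in nearest)
    return num / den

seed = [([0.0, 0.0], 5), ([0.0, 1.0], 5), ([10.0, 10.0], 1)]
print(classify([0.0, 0.5], seed, k=2))  # → 5.0
```

Weighting by label counts pulls the score toward the majority label among the neighbors rather than a plain average.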
Baseline and Evaluation For the Amazon set, the baseline flags reviews with a low star rating but strongly positive wording. For the Twitter set, 1500 tweets tagged #sarcasm served as a (noisy) gold standard. Five-fold cross-validation was performed. A random sample of 90 positively and 90 negatively ranked sentences from the test data was annotated via Mechanical Turk (inter-annotator agreement: κ = 0.34 for Amazon, κ = 0.41 for Twitter).