Identifying Sarcasm in Twitter: A Closer Look
Roberto Gonzalez, Smaranda Muresan, Nina Wacholder
Aim of the study
To construct a corpus of sarcastic utterances that their authors have explicitly labeled as sarcastic (with #sarcasm or #sarcastic).
To illustrate how difficult it is to distinguish sarcastic sentences from positive or negative ones.
Data
The data for the study are divided into three sets of 900 tweets each: sarcastic, positive, and negative.
Each set is collected from Twitter using the corresponding hashtags:
  Sarcasm: #sarcasm, #sarcastic
  Positive: #happy, #joy, #lucky
  Negative: #sadness, #frustrated, #angry
Data Preprocessing
Tweets with #sarcasm or #sarcastic in the middle of the tweet were removed.
The remaining tweets were manually checked to see whether the tag was part of the tweet's content, e.g., “I really love #sarcasm”.
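The automatic part of this filter could be sketched as below. This is a minimal sketch, not the authors' code: the function name `ends_with_label_tag` is hypothetical, and it assumes that "not in the middle" means the label tag appears only in the run of hashtags that closes the tweet.

```python
SARCASM_TAGS = {"#sarcasm", "#sarcastic"}

def ends_with_label_tag(tweet, tags=SARCASM_TAGS):
    """Return True if a label hashtag occurs only in the trailing run of hashtags.

    Assumption: a tweet is kept when the label tag closes the tweet and
    does not also appear earlier in the text.
    """
    tokens = tweet.split()
    # Walk backwards over the hashtags that close the tweet.
    i = len(tokens)
    while i > 0 and tokens[i - 1].startswith("#"):
        i -= 1
    body, trailing = tokens[:i], tokens[i:]
    has_trailing = any(t.lower() in tags for t in trailing)
    in_body = any(t.lower().strip(".,!?") in tags for t in body)
    return has_trailing and not in_body

tweets = [
    "Great, another Monday #sarcasm",
    "I really love #sarcasm in the morning",
    "I really love #sarcasm",
]
kept = [t for t in tweets if ends_with_label_tag(t)]
```

Note that “I really love #sarcasm” passes this automatic check, because the tag is at the end. That is exactly the kind of tweet the manual check is meant to catch.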
Lexical features
Unigrams
Dictionary-based features
  LIWC (Pennebaker et al.)
    Linguistic Processes (adverbs, pronouns)
    Psychological Processes (positive and negative emotion)
    Personal Concerns (work, achievement)
    Spoken Categories (assent, non-fluencies)
  WordNet Affect
  List of interjections and punctuation
Pragmatic Features
Positive emoticons (smileys)
Negative emoticons (frowning faces)
ToUser (@user)
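The three pragmatic features could be extracted roughly as follows. This is an illustrative sketch only: the emoticon patterns are a small assumed subset (the study's emoticon lists are not given here), the function name `pragmatic_features` is hypothetical, and ToUser is assumed to mean that the tweet contains an @user mention.

```python
import re

# Assumed, deliberately small emoticon patterns for illustration.
POSITIVE = re.compile(r"[:;=]-?[)D]")      # :) :-) ;) =) :D
NEGATIVE = re.compile(r":-?\(|:'\(")       # :( :-( :'(
TO_USER = re.compile(r"(?<!\w)@\w+")       # @user, but not the "@" inside an e-mail address

def pragmatic_features(tweet):
    """Return the three binary pragmatic features for one tweet."""
    return {
        "positive_emoticon": bool(POSITIVE.search(tweet)),
        "negative_emoticon": bool(NEGATIVE.search(tweet)),
        "to_user": bool(TO_USER.search(tweet)),
    }

example = pragmatic_features("@bob nice one :)")
```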
Comparisons and χ² rankings
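For reference, a χ² ranking of features against one class can be computed from a 2x2 contingency table per feature. The sketch below is a generic illustration under stated assumptions: features are binary (present or absent in a tweet), the ranking is one class against the rest, and the names `chi_square` and `rank_features` are hypothetical. The slide does not say how the study computed its rankings.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 table.

    n11: tweets in the class that contain the feature
    n10: tweets outside the class that contain the feature
    n01: tweets in the class without the feature
    n00: tweets outside the class without the feature
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

def rank_features(docs, labels, target):
    """Rank features by chi-square against `target` (highest first).

    docs: list of sets of features; labels: one class label per doc.
    """
    vocab = set().union(*docs)
    scores = {}
    for f in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if f in d and y == target)
        n10 = sum(1 for d, y in zip(docs, labels) if f in d and y != target)
        n01 = sum(1 for d, y in zip(docs, labels) if f not in d and y == target)
        n00 = len(docs) - n11 - n10 - n01
        scores[f] = chi_square(n11, n10, n01, n00)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy data, invented for illustration only.
docs = [{"love", "yay"}, {"love"}, {"ugh"}, {"ugh", "yay"}]
labels = ["S", "S", "N", "N"]
ranked = rank_features(docs, labels, target="S")
```

A feature that appears equally often inside and outside the class (here "yay") scores 0, while a feature that perfectly separates the classes gets the highest score.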
Classification
Classifiers: Logistic Regression and Support Vector Machine trained with SMO (sequential minimal optimization)
Features used:
  Unigrams
  Dictionary features, presence (LIWC+_P)
  Dictionary features, frequency (LIWC+_F)
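The difference between the presence and frequency variants of the dictionary features can be illustrated as below. This is a sketch under assumptions: `TOY_DICTIONARY` is an invented stand-in (the LIWC and WordNet Affect word lists are not reproduced here), the function name `dictionary_features` is hypothetical, and "frequency" is assumed to mean the share of tokens in the tweet that fall in a category.

```python
from collections import Counter

# Hypothetical toy dictionary: category name -> words in that category.
TOY_DICTIONARY = {
    "posemo": {"love", "great", "happy"},
    "negemo": {"hate", "awful", "sad"},
    "interjection": {"wow", "oh", "yay"},
}

def dictionary_features(tweet, dictionary=TOY_DICTIONARY, mode="presence"):
    """Map a tweet to one value per dictionary category.

    mode="presence":  1 if any token belongs to the category, else 0.
    mode="frequency": category hits divided by the number of tokens (assumed definition).
    """
    tokens = [t.strip(".,!?").lower() for t in tweet.split()]
    counts = Counter()
    for tok in tokens:
        for category, words in dictionary.items():
            if tok in words:
                counts[category] += 1
    if mode == "presence":
        return {c: int(counts[c] > 0) for c in dictionary}
    if mode == "frequency":
        total = len(tokens) or 1  # avoid division by zero on empty input
        return {c: counts[c] / total for c in dictionary}
    raise ValueError("mode must be 'presence' or 'frequency'")

presence = dictionary_features("Oh great, I love Mondays", mode="presence")
frequency = dictionary_features("Oh great, I love Mondays", mode="frequency")
```

Vectors of this kind, alone or alongside unigrams, are what a logistic regression or SVM classifier would take as input.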
Classification Results
Comparison against human performance
Three judges were asked to classify tweets as sarcastic, positive, or negative (90 tweets per category).
  S-N-P (sarcastic, negative, positive): 50% agreement (κ = 0.4788)
  S-NS (sarcastic vs. non-sarcastic): 71.67% agreement (κ = 0.5861)
  Emoticon-based S-NS: 89% agreement (κ = 0.74)
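Agreement figures of this kind can be computed as in the sketch below. The slide does not name the κ statistic, so Fleiss' kappa, a common choice for more than two raters, is an assumption here. Percent agreement is likewise assumed to be the share of tweets on which all judges chose the same label, and the ratings are invented toy data.

```python
from collections import Counter

def full_agreement(ratings):
    """Fraction of items on which every judge chose the same label."""
    return sum(1 for item in ratings if len(set(item)) == 1) / len(ratings)

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels.

    Every item must be rated by the same number of judges (at least two).
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement: share of rater pairs that agree.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items
    # Chance agreement from the overall label proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Toy ratings from three hypothetical judges (S = sarcastic, N = not sarcastic).
toy = [["S", "S", "N"], ["N", "N", "N"], ["S", "S", "S"], ["S", "N", "N"]]
agreement = full_agreement(toy)
kappa = fleiss_kappa(toy)
```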
Human Comparison results