Distant Supervision for Emotion Classification in Twitter posts 1/17.


1 Distant Supervision for Emotion Classification in Twitter posts 1/17

2  Natural language and text processing to identify and extract subjective information  Classifying the polarity of a given text as positive, negative or neutral  In general: to discover how people feel about a particular topic 2/17

3  Customers: research products before purchasing  Marketers: research public opinion of their company or products; analyze customer satisfaction  Organizations: gather critical feedback on newly released products 3/17

4  Earlier studies relied on predefined datasets, typically keyword-based  Determining the emotion is subjective  The words can be ambiguous 4/17

5  An attempt to exploit the widespread use of emoticons and other emotional content  They are treated as noisy labels to obtain very large training sets  Machine learning algorithms (Naïve Bayes, MaxENT and SVM) have accuracy above 80% when trained with emoticon data 5/17

6  A web application for discovering the sentiment of a brand, product or topic on Twitter 6/17

7  Machine learning classifiers  Keyword-based  Naive Bayes  MaxENT  SVM  Feature Extractors  Unigrams  Bigrams  Unigrams and bigrams  Unigrams with part of speech tags 7/17

8  As a baseline, a publicly available list of keywords is used  For each tweet, the number of positive and negative keywords is counted  The classifier returns the polarity with the higher count 8/17
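The keyword baseline above can be sketched in a few lines of Python. This is only an illustration: the two word sets are tiny hypothetical stand-ins for the public keyword list, the tokenizer is a simple regular expression, and the `"tie"` result is an assumption, since the slide does not say what happens when the counts are equal.

```python
import re

# Hypothetical stand-ins for the publicly available keyword list.
POSITIVE_KEYWORDS = {"good", "great", "love", "happy"}
NEGATIVE_KEYWORDS = {"bad", "awful", "hate", "sad"}


def keyword_polarity(tweet):
    """Count positive and negative keywords and return the larger side."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    positive = sum(token in POSITIVE_KEYWORDS for token in tokens)
    negative = sum(token in NEGATIVE_KEYWORDS for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    # Assumption: the slide does not specify tie handling.
    return "tie"
```

For example, `keyword_polarity("I love this great phone")` counts two positive keywords and no negative ones.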

9  Multinomial Naïve Bayes model is used  Class c is assigned to tweet d, where  In this formula, f represents a feature and n_i(d) represents the count of feature f_i found in tweet d. There are a total of m features 9/17
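The formula this slide refers to is not present in the transcript. As an assumption, the standard multinomial Naive Bayes decision rule that matches the symbols defined on the slide (class c, tweet d, features f_i, counts n_i(d), m features) would read:

```latex
c^{*} = \arg\max_{c} P_{NB}(c \mid d),
\qquad
P_{NB}(c \mid d) = \frac{P(c)\,\prod_{i=1}^{m} P(f_i \mid c)^{\,n_i(d)}}{P(d)}
```

Here P(c) is the class prior and P(f_i | c) is the probability of feature f_i given class c. Both are estimated from the training tweets, and P(d) does not depend on c, so it does not affect the argmax.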

10  Feature-based models  Features like bigrams and phrases can be added  In this formula, c is the class, d is the tweet, and λ is a weight vector. The weights determine the significance of each feature in classification 10/17
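The formula for this slide is also missing from the transcript. Assuming the slide describes the maximum entropy (MaxENT) classifier listed on slide 7, the usual form consistent with the symbols c, d and the weight vector would be:

```latex
P_{ME}(c \mid d, \lambda) =
\frac{\exp\!\left[\sum_{i} \lambda_i f_i(c, d)\right]}
     {\sum_{c'} \exp\!\left[\sum_{i} \lambda_i f_i(c', d)\right]}
```

A larger weight λ_i means that feature f_i is a stronger indicator for class c. The denominator normalizes over all candidate classes c'.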

11  Input data are two sets of vectors of size m where each entry in the vector corresponds to the presence of a feature  E.g. Unigram feature extractor – a feature is a word found in a tweet  If the feature is present – value 1  If not – value 0 11/17
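The presence vectors described above can be sketched as follows. The slide appears to describe the input to the SVM classifier from slide 7, but that reading is an assumption. The function names and whitespace tokenization are hypothetical choices for illustration, not the presenters' implementation.

```python
def build_vocabulary(tweets):
    """Map every distinct word in the training tweets to a vector index."""
    words = sorted({word for tweet in tweets for word in tweet.lower().split()})
    return {word: index for index, word in enumerate(words)}


def presence_vector(tweet, vocabulary):
    """Return a vector of size m: 1 if the unigram is present, 0 if not."""
    vector = [0] * len(vocabulary)
    for word in set(tweet.lower().split()):
        if word in vocabulary:
            vector[vocabulary[word]] = 1
    return vector
```

Repeated words still produce a single 1, because the vector records presence rather than counts. Words outside the vocabulary are ignored.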

12  Analysis is done using the Twitter API  In the API, a query for ":)" returns tweets with positive emotion and a query for ":(" returns tweets with negative emotion 12/17

13  The training data is post-processed with filters:  Emoticons are stripped out for training purposes (MaxENT and SVM have better accuracies without them)  Tweets with both positive and negative emoticons are removed, e.g. "I'm turning 30 today :( but I still get birthday presents! :)"  Retweets are removed (the same tweet shouldn't be counted twice)  Tweets with ":P" are removed (they usually don't represent any distinct emotion)  Replicated tweets are removed 13/17
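The filters on this slide can be sketched as one pass over the raw tweets. This is a minimal illustration under stated assumptions: only the emoticons ":)" and ":(" named in the slides are used, a retweet is assumed to start with "RT ", and the function name is hypothetical.

```python
POSITIVE_EMOTICONS = (":)",)  # only the emoticons named in the slides
NEGATIVE_EMOTICONS = (":(",)


def label_and_clean(tweets):
    """Apply the filters and return (text_without_emoticons, label) pairs."""
    seen = set()
    examples = []
    for text in tweets:
        if text.startswith("RT "):  # assumption: retweets are marked "RT "
            continue
        if ":P" in text:  # no distinct emotion
            continue
        has_positive = any(e in text for e in POSITIVE_EMOTICONS)
        has_negative = any(e in text for e in NEGATIVE_EMOTICONS)
        if has_positive == has_negative:  # both kinds, or no emoticon at all
            continue
        cleaned = text
        for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
            cleaned = cleaned.replace(emoticon, "")  # strip the noisy label
        cleaned = " ".join(cleaned.split())
        if cleaned in seen:  # replicated tweet
            continue
        seen.add(cleaned)
        examples.append((cleaned, "positive" if has_positive else "negative"))
    return examples
```

The emoticon serves only as the noisy label. The text handed to the classifier no longer contains it.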

14  Unigram feature extractor: the simplest way to retrieve features; results are similar to Pang and Lee's work on different classifiers on movie reviews  Bigram feature extractor: used for negation phrases like "not good" or "not bad"; downside: bigrams are very sparse and accuracy can drop for both MaxENT and SVM  Unigrams and bigrams: accuracy improved for Naive Bayes and MaxENT, declined for SVM  Parts of speech: the same word may have many different meanings ("over" as a verb may have a negative connotation; "over" can also be a noun, without any emotion at all), but POS tags were not of much use 14/17
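The unigram, bigram and combined feature extractors can be illustrated with a short sketch. Whitespace tokenization and the function names are assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(tweet):
    """Combine single words with adjacent word pairs such as 'not good'."""
    tokens = tweet.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)
```

For "not good at all", the bigram "not good" appears as its own feature, which is how the negation is captured. The unigrams remain available when a particular bigram is too rare to be useful.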

15  Semantics: in "Djokovic beats Federer :)" the sentiment is positive for Djokovic, negative for Federer  Domain-specific tweets: classifiers could perform better if limited to particular domains (such as movies)  Handling neutral tweets  Internationalization: there are lots of tweets about the same subject in lots of different languages  Utilizing emoticon data in the set: emoticons are stripped out, and classifiers could perform better if they were included 15/17

16  From a tweet that says "Djokovic beats Federer" alone, one cannot extract the sentiment of the tweet  To be precise, semantics could be a solution  If (user.isFrom(Serbia)) then sentiment := positive else if (user.isFrom(Switzerland)) then sentiment := negative  Using semantics, we can gather more information than just by reading keywords 16/17
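The pseudocode on this slide can be written as a small Python function. The `user_country` parameter is hypothetical, standing in for whatever user metadata would be available. Returning `None` for other users is an assumption, since the slide gives no rule for them.

```python
def sentiment_for_user(user_country):
    """Mirror the slide's rule for the tweet 'Djokovic beats Federer'."""
    if user_country == "Serbia":
        return "positive"
    if user_country == "Switzerland":
        return "negative"
    # Assumption: the slide defines no sentiment for other users.
    return None
```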

