Sentiment and Textual Analysis of Create-Debate Data
Poorva Potdar
EECS 595, End Term Project

EUREKA!! Getting the Idea
Why sentiment analysis?
- A huge amount of opinionated text is available on the web.
- Sentiment analysis on the web can gauge the popularity of a product, a movie or a person.
Idea:
- Create Debate: an online debating forum where people argue for or against a topic.
- Mine the posts for the salient text features of agreement/disagreement posts.

Creating the Haystack ….
Math
- 14308 debates
- 983800 sentences
- 178290 posts
- 9194 users
- Labeled dataset: Neutral, Agreement, Disagreement
Structural Analysis: features of the language in a post that make it a high-scoring agreement/disagreement post.
Behavioral Analysis: aspects of a user's behavior that give the user a high rank on the forum.

What's the gain?
- Influence detection in a community
- Sub-group detection
- Stance identification: are there any visible groups with a particular stance?
- Predicting the crowd trend for a particular topic of interest
- Text summarization

Finding the needle - structural features ….
- Is there a correlation between the polarity of a post and its score?
- Is there a popular pattern in the dependency parse of agreement/disagreement posts?
- Emoticons?
- Are posts with formal text up-voted more often?

Experiment 1: Polarity Measure
Intuition: Is the number of +ve/-ve words indicative of how popular a post is?
Tool: Opinion Finder / WordNet.
Output of data processed by Opinion Finder, for the sentence "It think it's wrong to assume that in order to be a revolutionary thinker you have to be crazy":
- MPQAPOL: indicates the polarity of a word, like “bad”
- MPQASRC: indicates the opinion source in the sentence, like “It”
- MPQASD: direct subject expression in the sentence, like “said”
Result:
- No evident correlation between the number of polar words and the rank of the post.
- Authors use an equal distribution of positive and negative words while expressing agreement/disagreement.

Posts             Agreement Posts   Disagreement Posts
Positive words    -0.00824          0.012647
Negative words    -0.01024          0.01392
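The correlation in Experiment 1 can be illustrated with a minimal sketch. It assumes a tiny hand-made lexicon (`POSITIVE`, `NEGATIVE`) and made-up posts; the slides use Opinion Finder/WordNet and the real dataset, so the names and data below are hypothetical.

```python
import math

# Hypothetical mini-lexicons; the slides rely on Opinion Finder / WordNet instead.
POSITIVE = {"good", "great", "agree", "love"}
NEGATIVE = {"bad", "wrong", "crazy", "hate"}

def count_polar(text, lexicon):
    """Count the tokens of `text` found in `lexicon` (case-insensitive)."""
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    return sum(1 for t in tokens if t in lexicon)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        # Undefined for a constant series; this sketch returns 0.0 by convention.
        return 0.0
    return cov / (sx * sy)

# Made-up (post text, labeled score) pairs, for illustration only.
posts = [
    ("I agree, this is a great point", 5),
    ("That is wrong and crazy", 2),
    ("Good idea but bad timing", 3),
    ("I love it", 1),
]
pos_counts = [count_polar(text, POSITIVE) for text, _ in posts]
neg_counts = [count_polar(text, NEGATIVE) for text, _ in posts]
scores = [score for _, score in posts]
print(pearson(pos_counts, scores), pearson(neg_counts, scores))
```

A coefficient near zero, as in the table above, means the polar word counts carry little linear information about the score.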

Experiment 2: Readability Measure
Intuition: Do posts that are more readable/formal gain higher scores?
Tool: Flesch Toolkit, to compute the Flesch Readability measure for each post.
Calculated Pearson’s coefficient between the labeled score and the Flesch score of each post.
Result: The Flesch score correlates positively with the labeled score: the more formal the language of a post, the more points are associated with it.
- Eg 1: “good times...bring it back ! -------------=-=-=-=-=-=-=-=-=- =-=-=-==-=-=- ))))))))))))” [Flesch: 0, Labeled points: 1]
- Eg 2: “Vegetables is often seen as more healthy than eating meat.” [Flesch: 93.12, Labeled points: 29 (max)]

Posts                                          Agreement Posts   Disagreement Posts
Pearson’s correlation for Flesch readability   0.206974          0.169236
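As a rough illustration of the readability measure, here is a sketch of the standard Flesch Reading Ease formula, 206.835 - 1.015 (words / sentences) - 84.6 (syllables / words). The syllable counter is a crude vowel-group heuristic chosen for this sketch, so its scores will not match the Flesch Toolkit used in the slides exactly.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease of `text`; returns None when the text has no words."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not words:
        return None
    # A sentence is any run between ., ! or ? that contains at least one letter.
    sentences = [s for s in re.split(r"[.!?]+", text) if re.search(r"[A-Za-z]", s)]
    n_sentences = max(1, len(sentences))
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / n_sentences)
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease("Vegetables is often seen as more healthy than eating meat."))
```

Higher values mean easier text; a post made up only of symbols has no words to score, which this sketch reports as `None`.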

Experiment 3: Emoticon Analysis
Intuition: Do emoticons in agreement/disagreement posts correlate with the labeled scores?
Tool: CMU Ark Tagger [Stanford Parser doesn’t scale well].
Calculated Pearson’s coefficient between the labeled score and the number of +ve/-ve emoticons for agreement/disagreement posts.
Result: The number of positive emoticons correlates with the rank of disagreement posts.
Analysis: Authors tend to use expressive emoticons like smiles to give a sarcastic opinion on a particular argument.
- “Hey! What’s that supposed to mean?;)”
- “Sure If you say so :P”

Posts                Agreement Posts   Disagreement Posts
Positive emoticons   -0.02375          0.38943
Negative emoticons   -0.003527         0.03421
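The emoticon counts that feed this correlation could be obtained along the following lines. This is a sketch with small hand-picked emoticon lists, which are assumptions for illustration; the slides use the CMU Ark Tagger, which recognizes far more forms.

```python
import re

# Illustrative emoticon sets (assumed for this sketch, not taken from the slides).
POSITIVE_EMOTICONS = [":)", ":-)", ";)", ";-)", ":D", ":P"]
NEGATIVE_EMOTICONS = [":(", ":-(", ":'("]

def _compile(emoticons):
    """Build one regex matching any of the given emoticons literally."""
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))

_POS_RE = _compile(POSITIVE_EMOTICONS)
_NEG_RE = _compile(NEGATIVE_EMOTICONS)

def count_emoticons(text):
    """Return (positive_count, negative_count) for one post."""
    return len(_POS_RE.findall(text)), len(_NEG_RE.findall(text))

print(count_emoticons("Hey! What's that supposed to mean?;)"))
print(count_emoticons("Sure If you say so :P"))
```

Both example posts from the slide contain one positive emoticon even though their tone is sarcastic, which is the effect the analysis describes.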

Experiment 4: Dependency Parse
Intuition: Do highly ranked agreement/disagreement posts show a popular dependency pattern? Agreement posts tend to express agreement early in the post, while disagreement is mild.
Tool: Stanford Parser, for the syntactic and dependency parse of the posts.
Result: Many highly ranked agreement posts begin with the following dependency pattern:
- I->nsubj->+ve [I agree to, I like your point, I up-voted your argument]
- Example: “I have to agree. Blah blah”
  I->nsubj->have->xcomp->agree->End
  I->nsubj->+ve->xcomp->+ve->End
Components: Stanford Parser + ExtractDependencies; code to traverse PRP to PRP$; Sentiwordnet.

Posts                                            Agreement Posts
Pearson’s coeff with I->nsubj->+ve pattern       0.252146
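The pattern check can be sketched over dependency triples that have already been produced by a parser. The triple layout `(relation, governor, dependent)` and the small `POSITIVE_WORDS` set are assumptions made for this sketch; the slides use the Stanford Parser output and Sentiwordnet.

```python
# Hypothetical positive-word list standing in for a Sentiwordnet lookup.
POSITIVE_WORDS = {"agree", "like", "love", "support", "up-voted"}

def matches_agreement_pattern(deps):
    """Check one sentence for I->nsubj->+ve or I->nsubj->verb->xcomp->+ve.

    `deps` is an iterable of (relation, governor, dependent) triples,
    e.g. ("nsubj", "have", "I") for nsubj(have, I).
    """
    deps = list(deps)
    # Words whose nominal subject is the pronoun "I".
    heads = {gov.lower() for rel, gov, dep in deps
             if rel == "nsubj" and dep.lower() == "i"}
    for head in heads:
        if head in POSITIVE_WORDS:
            return True  # I -> nsubj -> +ve
        for rel, gov, dep in deps:
            if rel == "xcomp" and gov.lower() == head and dep.lower() in POSITIVE_WORDS:
                return True  # I -> nsubj -> head -> xcomp -> +ve
    return False

# "I have to agree." written out as hand-made triples.
sentence = [("nsubj", "have", "I"), ("aux", "agree", "to"), ("xcomp", "have", "agree")]
print(matches_agreement_pattern(sentence))
```

A per-post indicator built this way (1 if the first sentence matches, else 0) is one possible input for the Pearson coefficient reported above.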

Finding the needle - behavioral features ….
- Does the author start a neutral post?
- Time of entry into the discussion?
- Average number of times an author participates in a thread?
- Does the author participate in agreement/disagreement discussions?

Which Authors get the highest rank? - 1
Intuition: Does the average number of times an author participates in a thread correlate with the author's ranking?

Feature                                                      Pearson’s coefficient
Average number of times an author participates in a thread   0.489

Result:
- There is an evident positive correlation between an author’s points and the number of times the author posts per thread.
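The participation feature can be computed roughly as follows. This sketch assumes posts are available as `(thread_id, author)` pairs, a hypothetical layout, and averages only over the threads in which an author actually posted.

```python
from collections import defaultdict

def avg_posts_per_thread(posts):
    """Map each author to their average number of posts per thread joined.

    `posts` is an iterable of (thread_id, author) pairs, one per post.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for thread_id, author in posts:
        counts[author][thread_id] += 1
    return {author: sum(threads.values()) / len(threads)
            for author, threads in counts.items()}

# Made-up posts for illustration only.
posts = [("t1", "alice"), ("t1", "alice"), ("t2", "alice"), ("t1", "bob")]
print(avg_posts_per_thread(posts))
```

The resulting per-author values are what would then be correlated with each author's points.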

Which Authors get the highest rank? - 2
Intuition: Do authors who participate in some kind of discussion, or who start a new thread, get a high rank?

Feature                           Pearson’s coefficient
Authors who agree                 0.847
Authors who disagree              0.770
Authors who start a new thread    0.60

Result:
- Rating of authors who agree > rating of authors who disagree > rating of authors who start a new debate.
- Authors who participate more in discussions are more popular.

Which Authors get the highest rank? - 3
Intuition: Do authors who participate early or late in a discussion get a higher ranking?

Feature                           Pearson’s coefficient
Authors who participate early     0.1990
Authors who participate late      -0.00358

Result:
- Authors participating late in a discussion are likely to have a higher ranking.
- By intuition, authors who come late to a discussion already know the opinion bias.
- Participating early doesn’t help in ranking.

Get the Ranking of Authors w.r.t features
- Trained a linear regression model using Weka’s LibSVM and got a predicted ranking of all authors based on the features.
- Computed a correlation coefficient by comparing the predicted rankings with the gold standard rankings.

Comparison                                        SVM’s correlation coefficient
Gold standard rankings vs. predicted rankings     0.300

Result:
- The feature vector set shows a decent correlation with the actual rankings.

Future Work
- In this project, I essentially looked at some of the structural and behavioral features.
- The Opinion Finder tool also tells whether a sentence is subjective or objective. One future experiment: find whether there is a correlation between subjective/objective sentences and the score of a post.
- Does the length of the post matter?
- Going forward: consolidate all these features and results in a database and make it available as an open-source dataset.

Thank You!
