Project Discussion-3 -Prof. Vincent Ng -Jitendra Mohanty -Girish Vaidyanathan -Chen Chen
Agenda
Brief recap of last presentation
Things we have done so far
– Finding interests of user
– Finding user’s gender
– Finding trending topic
– Finding opinion-target pair
– Argument detection
Future plans
– Argument detection
– User profile construction
Brief recap of last presentation
Finding interests of user
– 20 categories of interest
Algorithm
– Neural network
– Support Vector Machine
– Passive Aggressive algorithm
Data
– Twitter data
– Blog data
Data Preparation
music         0.3305412193304446
photography   0.13342809041481524
art           0.10207545607595286
reading       0.3219854828471283
movie         0.19912786686170067
sport         0.2109817017635857
writing       0.1620484089090056
travel        0.12136726188833384
cooking       0.0761322551265421
fashion       0.0824524604642177
food          0.06361604062594872
politics      0.04773272983192118
god           0.05087903292578588
singing       0.055998675240802584
dancing       0.05427372836916623
family        0.05007865757734662
animal        0.04193690834322303
shopping      0.043261667540639745
game          0.046159578284988824
Social media  0.05710264123864985
We have 120,778 users in total. 60% are used as training data, 20% as development data (to tune the learning algorithm parameters), and 20% as testing data.
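The 60/20/20 split above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function name `split_users`, the fixed seed, and the use of integer user ids are all assumptions.

```python
import random

def split_users(user_ids, seed=0):
    """Shuffle user ids and split them 60/20/20 into train/dev/test.

    Hypothetical helper: the seed and rounding choices are assumptions.
    """
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)      # reproducible shuffle
    n_train = int(0.6 * len(ids))
    n_dev = int(0.2 * len(ids))
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]    # used to tune parameters
    test = ids[n_train + n_dev:]          # whatever remains (about 20%)
    return train, dev, test

train, dev, test = split_users(range(120778))
print(len(train), len(dev), len(test))
```

Splitting by user rather than by tweet keeps all text from one user in a single partition, which avoids leaking a user's writing between training and testing.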
Finding interests of user
Feature groups
– POS sequence: 1003
– Named entities: 14
– Social linguistic: 37063
– Bigram: 193985
– Unigram: 273985
– Unigram for description: 18855
– Bigram for description: 15754
– Ngram for user name: 17482
– Ngram for screen name: 19944
– Total: 578,085
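Finding interests of user
Measures
– Precision: the fraction of users predicted to have an interest who really have that interest.
– Recall: the fraction of users who really have an interest that the system predicts correctly.
– F score: the harmonic mean of precision and recall.
– Accuracy: the fraction of all users whose label for the interest is predicted correctly.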
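As a small sketch of how the four measures relate, the function below computes them for one interest category from gold and predicted labels. The name `binary_metrics` and the toy labels are illustrative assumptions, not part of the project.

```python
def binary_metrics(gold, predicted):
    """Precision, recall, F score and accuracy for one interest category.

    gold and predicted are equal-length sequences of 0/1 labels
    (1 = the user has the interest).
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}

# Toy labels for eight users (illustrative only).
m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(m)
```

Because true negatives count toward accuracy but not toward F score, a category that few users have can show high accuracy alongside a modest F score.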
Finding interests of user: Neural Network Result
Category     F score   Accuracy  Precision  Recall
Music        0.650616  0.773514  0.65797    0.643426
photography  0.535807  0.896423  0.65906    0.451391
art          0.504014  0.907932  0.574479   0.448947
reading      0.658378  0.764531  0.615774   0.707317
movie        0.574119  0.846456  0.632111   0.525873
sport        0.580578  0.832464  0.612107   0.552139
writing      0.604014  0.876677  0.662005   0.555365
travel       0.496454  0.888309  0.567406   0.441274
cooking      0.479604  0.930286  0.585219   0.406283
fashion      0.649794  0.947301  0.702558   0.604401
food         0.496868  0.950116  0.689455   0.388381
politics     0.604425  0.969656  0.769231   0.497778
god          0.532205  0.958809  0.643182   0.453889
singing      0.563721  0.961169  0.768061   0.445261
dancing      0.553832  0.960714  0.725369   0.447909
family       0.467952  0.957733  0.656433   0.363563
animal       0.532616  0.971229  0.8        0.399194
shopping     0.538219  0.969738  0.771739   0.413191
game         0.562929  0.968372  0.817276   0.429319
socialMedia  0.557978  0.963073  0.840299   0.417656
Finding interests of user: Support Vector Machine Result
Category     F score   Accuracy  Precision  Recall
Music        0.652349  0.76747   0.639563   0.665656
photography  0.524648  0.888227  0.600564   0.465771
art          0.486772  0.907642  0.578142   0.420342
reading      0.656884  0.766725  0.621858   0.69609
movie        0.57672   0.841033  0.605836   0.550273
sport        0.577859  0.837721  0.636838   0.528878
writing      0.599764  0.873862  0.648211   0.558054
travel       0.494236  0.887399  0.562183   0.440942
cooking      0.476624  0.928631  0.567197   0.410995
fashion      0.647953  0.950157  0.755798   0.567042
food         0.499016  0.947301  0.628345   0.413838
politics     0.592075  0.971022  0.85956    0.451556
god          0.522603  0.960217  0.686684   0.421812
singing      0.564531  0.961169  0.766709   0.44673
dancing      0.568401  0.962908  0.775296   0.448669
family       0.472056  0.95972   0.715461   0.352227
animal       0.52027   0.970608  0.788934   0.388105
shopping     0.550218  0.970152  0.770979   0.42774
game         0.563605  0.968331  0.813839   0.431065
socialMedia  0.569113  0.962577  0.796      0.442878
Finding interests of user: Passive Aggressive Result
Category     F score   Accuracy  Precision  Recall
Music        0.64241   -         0.614073   0.673487
photography  0.517897  -         0.52488    0.511097
art          0.484836  -         0.511926   0.460469
reading      0.651151  -         0.621134   0.684217
movie        0.559173  -         0.579553   0.540177
sport        0.572767  -         0.622685   0.530258
writing      0.596787  -         0.646874   0.553899
travel       0.483696  -         0.532721   0.442933
cooking      0.472289  -         0.556028   0.410471
fashion      0.647539  -         0.718926   0.589048
food         0.486872  -         0.57231    0.423629
politics     0.585903  -         0.769899   0.472889
god          0.51585   -         0.643114   0.430634
singing      0.553899  -         0.763427   0.438648
dancing      0.555715  -         0.748711   0.441825
family       0.452219  -         0.636765   0.350607
animal       0.519975  -         0.666142   0.426411
shopping     0.537906  -         0.708399   0.43356
game         0.55376   -         0.765794   0.433682
socialMedia  0.552083  -         0.763089   0.432493
Finding interests of user: Result Comparison
Algorithm               Micro F-score  Macro F-score
Support Vector Machine  0.58499        0.56396
Neural Network          0.58678        0.56504
Passive aggressive      0.57933        0.56027
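The difference between the two averages can be sketched as below, assuming the usual definitions: macro F averages the per-category F scores, while micro F pools the true-positive, false-positive and false-negative counts over all categories first. The function name and the counts are made up for illustration.

```python
def micro_macro_f(counts):
    """counts maps category -> (tp, fp, fn). Returns (micro_f, macro_f)."""
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    # Macro: average the per-category F scores, every category weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts, so frequent categories weigh more.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro

# Hypothetical counts: one frequent category, one rare category.
micro, macro = micro_macro_f({"music": (8, 2, 2), "animal": (1, 0, 3)})
print(micro, macro)
```

Under these definitions, a micro F that is higher than the macro F, as in the table, is what one would expect when the frequent categories are predicted better than the rare ones.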
Finding interests of user: Result Analysis
Recall analysis
– Recall is low because of the limited amount of data per user. Some users claim a certain interest but have not published any tweets or blogs related to it. Once they publish tweets or blogs related to that interest in the future, our system can make the right prediction.
Precision analysis
– In some cases people have published tweets or blogs related to a certain interest, but they do not specify it as their interest.
– Precision is higher than recall for most interest categories.
Finding interests of user
Including more features
– Tweet POS sequence: 1003
– Named entities: 14
– Social linguistic: 37063
– Bigram for tweets and blogs: 193985
– Unigram for tweets and blogs: 273985
– Unigram for description: 18855
– Bigram for description: 15754
– Ngram for user name: 17482
– Ngram for screen name: 19944
– Gender: 2
– Blog POS sequence: 3075
– Unigram for “About me”: 18753
– Bigram for “About me”: 12391
– Industry: 39
– Location: 4505
– Occupation: 332
– Total: 617180
Finding interests of user: Neural network result after including more features
Category     Improved F score  Old F score
Music        0.661115          0.650616
photography  0.556041          0.535807
art          0.515067          0.504014
reading      0.664573          0.658378
movie        0.583581          0.574119
sport        0.594504          0.580578
writing      0.617632          0.604014
travel       0.504436          0.496454
cooking      0.49019           0.479604
fashion      0.680579          0.649794
food         0.505812          0.496868
politics     0.613917          0.604425
god          0.542065          0.532205
singing      0.570498          0.563721
dancing      0.565947          0.553832
family       0.469649          0.467952
animal       0.529793          0.532616
shopping     0.54717           0.538219
game         0.571116          0.562929
socialMedia  0.56829           0.557978
Finding interests of user: Result after including more features
After including the additional features, we predict every category better than before except animal, whose F score drops slightly. Some categories, such as music, writing, and fashion, improve by more than 1 percentage point.
Model                           Micro F-score  Macro F-score
Neural Network                  0.58678        0.56504
Neural Network (More features)  0.59780        0.57582
Finding interests of user
Feature Analysis
– There are 16 feature groups in total. We delete one feature group at a time and observe the result.
– As the neural network gives the best result, we use it for this analysis.
Feature Analysis result
Neural Network              Micro F-score  Macro F-score
All features                0.5978         0.57582
remove tweet pos            0.6041         0.58344
remove ner                  0.59739        0.57412
remove social linguistic    0.60028        0.57929
remove bigram               0.55531        0.52591
remove unigram              0.59273        0.57005
remove description unigram  0.59579        0.57388
remove description bigram   0.59622        0.57491
remove name ngram           0.59945        0.57679
remove screen name ngram    0.60034        0.57693
remove gender               0.59737        0.57463
remove blog pos             0.60167        0.57937
remove aboutMe unigram      0.59443        0.57166
remove aboutMe bigram       0.59831        0.57594
remove industry             0.5986         0.57557
remove location             0.59847        0.57524
remove occupation           0.59905        0.57676
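The leave-one-group-out procedure behind this table can be sketched as a loop that removes each feature group in turn and compares the score with the all-features baseline. In the sketch below, `evaluate` is a hypothetical stand-in for "train the neural network and measure micro F-score"; here it simply looks up the micro F-scores reported in the table, so no model is actually trained.

```python
MICRO_F_ALL = 0.5978
# Micro F-scores from the table, keyed by the group that was removed.
MICRO_F_WITHOUT = {
    "tweet pos": 0.6041, "ner": 0.59739, "social linguistic": 0.60028,
    "bigram": 0.55531, "unigram": 0.59273, "description unigram": 0.59579,
    "description bigram": 0.59622, "name ngram": 0.59945,
    "screen name ngram": 0.60034, "gender": 0.59737, "blog pos": 0.60167,
    "aboutMe unigram": 0.59443, "aboutMe bigram": 0.59831,
    "industry": 0.5986, "location": 0.59847, "occupation": 0.59905,
}

def evaluate(groups):
    """Stand-in for train-and-score: looks the result up in the table."""
    missing = set(MICRO_F_WITHOUT) - set(groups)
    if not missing:
        return MICRO_F_ALL
    (removed,) = missing          # exactly one group is removed at a time
    return MICRO_F_WITHOUT[removed]

def leave_one_group_out(groups, evaluate):
    """Return {group: score without that group minus baseline score}."""
    baseline = evaluate(groups)
    deltas = {}
    for group in groups:
        kept = [g for g in groups if g != group]
        deltas[group] = evaluate(kept) - baseline
    return deltas

deltas = leave_one_group_out(list(MICRO_F_WITHOUT), evaluate)
for group, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{group:22s} {delta:+.5f}")
```

Read this way, removing the bigram group costs the most (about 4 points of micro F-score), while removing some groups, such as tweet POS and blog POS, slightly raises the score.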
Finding gender of user
Motivation
– Helps to construct the user’s profile
– Helps to compare opinions between genders
Data
– Tweet data
– Blog data
Feature groups
– POS sequence: 1003
– Named entities: 14
– Social linguistic: 37063
– Bigram: 193985
– Unigram: 273985
– Unigram for description: 18855
– Bigram for description: 15754
– Ngram for user name: 17482
– Ngram for screen name: 19944
– Interest features *
– Total number: 578085 + 20 (number of interest features)
Gender Distribution
Gender  Count  Percentage
Male    47404  39.25%
Female  73374  60.75%
Finding gender of user
Result: If we can improve our interest prediction, it may be possible to improve the gender prediction.
Neural Network                F-Score   Accuracy  Precision  Recall
No interest features          0.88528   0.909339  0.882414   0.888165
Real interest features        0.889391  0.912858  0.889251   0.889531
Predicted interest features   0.884091  0.908429  0.881505   0.886693
Support Vector Machine        F-Score   Accuracy  Precision  Recall
No interest features          0.868892  0.894809  0.85335    0.885012
Real interest features        0.873173  0.897996  0.855558   0.891528
Predicted interest features   0.868175  0.893857  0.849738   0.887429
Passive Aggressive Algorithm  F-Score   Accuracy  Precision  Recall
No interest features          0.872402  -         0.862544   0.882489
Real interest features        0.876564  -         0.852488   0.902039
Predicted interest features   0.870567  -         0.861458   0.879872
Finding Trending Topics
Motivation
– Helps in finding the interesting topics that attract people’s attention
– Trending topics are helpful in argument detection
Possible approaches
– Naïve approach
– Online clustering
– Latent Dirichlet Allocation (LDA)
Trending Topics Results
Naïve approach: a brief recap
– Visit every tweet in our dataset and find the words and phrases that occur frequently.
– Those are the probable trending topics in our dataset.
– The timeframe of a trending topic can be found in a similar fashion: keep track of the timestamp associated with every tweet and find the minimum and maximum timestamp for every word or phrase.
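The naïve approach recapped above can be sketched in a few lines. This is an illustration under assumptions: phrases are taken to be lowercase unigrams and bigrams split on whitespace, each phrase is counted once per tweet, and the function name `naive_trending` and the toy tweets are hypothetical.

```python
from collections import Counter

def naive_trending(tweets, top_k=3):
    """tweets: iterable of (timestamp, text) pairs.

    Returns {phrase: (count, first_timestamp, last_timestamp)} for the
    top_k most frequent words and two-word phrases.
    """
    counts = Counter()
    first_seen, last_seen = {}, {}
    for ts, text in tweets:
        words = text.lower().split()
        # Unigrams plus adjacent-word bigrams, counted once per tweet.
        phrases = set(words) | {" ".join(p) for p in zip(words, words[1:])}
        for phrase in phrases:
            counts[phrase] += 1
            first_seen[phrase] = min(ts, first_seen.get(phrase, ts))
            last_seen[phrase] = max(ts, last_seen.get(phrase, ts))
    return {phrase: (n, first_seen[phrase], last_seen[phrase])
            for phrase, n in counts.most_common(top_k)}

# Toy data: integer timestamps stand in for real tweet times.
topics = naive_trending([(1, "lady gaga concert"),
                         (5, "love lady gaga"),
                         (3, "watching movie")])
print(topics)
```

Nothing in this counting distinguishes a topical phrase from a merely frequent one, which is the weakness the next slides address.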
Some Results for the month of December in our dataset in chronological order using Naïve Approach
Problems in Naïve approach
Many irrelevant words or phrases with large counts will be considered trending topics. For example:
– Youtube video
– I arrived
– Watching movie
Solution for the problem in the previous slide ( part of future work ) Possible solutions – Online Clustering – Latent Dirichlet Allocation
Solution for the problem in the previous slide ( part of future work )
Online clustering algorithm
– All the tweets are ordered on the timeline.
– Tweets are represented as vectors of tf-idf (term frequency, inverse document frequency) weights.
– Tweets with the highest similarity are clustered together.
– Every cluster corresponds to a trending topic.
Online clustering works better because it uses tf-idf weights for all terms and finds the similarity between a tweet and all the current clusters.
For example, consider the following tweets:
– I love lady gaga
– I love gandhi
– Lady gaga is the best singer in the world
Solution for the problem in the previous slide ( part of future work )
Latent Dirichlet Allocation (LDA)
– LDA is a bag-of-words model.
– In LDA, each tweet is viewed as a mixture of various topics.
– If a tweet contains a particular trending topic, it has a high probability of belonging to that topic.
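A minimal sketch of online clustering on the three example tweets follows. Several things here are assumptions rather than the project's design: idf is estimated once from a small made-up background sample (a real system would update it as tweets arrive), similarity is cosine similarity against the sum of each cluster's member vectors, and the threshold of 0.3 is arbitrary.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Inverse document frequency for every word in a list of texts."""
    n = len(corpus)
    df = Counter(w for text in corpus for w in set(text.lower().split()))
    return {w: math.log(n / c) for w, c in df.items()}

def tfidf(text, idf):
    tf = Counter(text.lower().split())
    return {w: c * idf.get(w, 0.0) for w, c in tf.items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def online_cluster(tweets, idf, threshold=0.3):
    """Assign each tweet (in time order) to its most similar cluster."""
    clusters = []  # each cluster: {"centroid": vector, "members": [text]}
    for text in tweets:
        vec = tfidf(text, idf)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]),
                   default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(text)
            for w, v in vec.items():      # fold the tweet into the centroid
                best["centroid"][w] = best["centroid"].get(w, 0.0) + v
        else:
            clusters.append({"centroid": dict(vec), "members": [text]})
    return clusters

stream = ["I love lady gaga", "I love gandhi",
          "Lady gaga is the best singer in the world"]
# Hypothetical background sample used only to estimate idf.
background = stream + ["I love the weather", "I love my dog",
                       "The food is the best in town", "I love this song",
                       "The world is big"]
clusters = online_cluster(stream, idf_table(background))
for c in clusters:
    print(c["members"])
```

Because "I" and "love" are common in the background sample, they get low idf weights, so the two Lady Gaga tweets end up closer to each other than the two "I love" tweets do.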
Opinion-Target: What is it?
Opinion words in an opinionated sentence are, in most cases, adjectives that act directly on their target. For example:
I am so excited that the vacation is coming.
– Here the opinion word is excited
– And its target is I
The water is green and clear.
– Here the opinion word is green
– And its target is water
The Dream Lake is a beautiful place.
– Here the opinion word is beautiful
– And its target is Dream Lake
Why Opinion-Target pair?
Motivation
An opinionated sentence gives a sense of a person's general opinion on a *subject material* or *topic*, called the *target* in the research literature. The *subject material* or *topic* is diverse. For example, it could be a travel article that deals with several tourist attractions. The last two examples on the previous slide are tourism-related opinions written by us.
Opinions change over the course of time. Example:
– At time t1, user p’s view: place x has a really good scenic view, let’s go for it.
– At time t2 = t1 + (1 year), user p’s view: place y has a better scenic view than place x.
The user's opinion about tourist place x has changed from positive to negative over the one-year time frame. This suggests that by listening to the posts (tweets in our case), we can create a profile that shows whether a user's interest in a particular topic changes over a period of time.
Extraction of Opinion-Target Pair
- The Stanford parser was run over the tweets to give us the dependencies among the different entities in a tweet.
- The following 5 rules were applied to the dependency information from the previous step to generate opinion-target pairs.
- Direct Object Rule
  - dobj(opinion, target)
  - I love(opinion1) Firefox(target1) and defended(opinion2) it.
- Nominal Subject Rule
  - nsubj(opinion, target)
  - IE(target) breaks(opinion) with everything.
- Adjective Modifier Rule
  - amod(target, opinion)
  - The annoying(opinion) popup(target)
  - The opinion is the adjectival modifier of the target.
- Prepositional Object Rule
  - If prep(target1, IN) => pobj(IN, target2)
  - The prepositional object of a known target is also a target of the same opinion.
  - The annoying(op) popup(tar1) in IE(tar2)
- Recursive Modifiers Rule
  - If conj(adj2, opinion adj1) => amod(target, adj2)
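The first four rules above can be sketched over dependency triples as follows. This is a simplified illustration, not the project's implementation: the triples are written by hand in (relation, governor, dependent) form instead of coming from the Stanford parser, words are matched as plain strings rather than token positions, and the Recursive Modifiers Rule is left out.

```python
def extract_pairs(deps):
    """deps: list of (relation, governor, dependent) triples.

    Returns a list of (opinion, target) pairs.
    """
    pairs = []
    for rel, gov, dep in deps:
        if rel in ("dobj", "nsubj"):   # dobj/nsubj(opinion, target)
            pairs.append((gov, dep))
        elif rel == "amod":            # amod(target, opinion)
            pairs.append((dep, gov))
    # Prepositional Object Rule: prep(target1, IN) and pobj(IN, target2)
    # make target2 a target of the same opinion as target1.
    pobj = {gov: dep for rel, gov, dep in deps if rel == "pobj"}
    for rel, gov, dep in deps:
        if rel == "prep" and dep in pobj:
            for opinion, target in list(pairs):
                if target == gov:
                    pairs.append((opinion, pobj[dep]))
    return pairs

# "The annoying popup in IE", with hand-written dependencies.
popup = extract_pairs([("amod", "popup", "annoying"),
                       ("prep", "popup", "in"),
                       ("pobj", "in", "IE")])
# "IE breaks with everything" (nominal subject only).
breaks = extract_pairs([("nsubj", "breaks", "IE")])
print(popup, breaks)
```

A fuller version would also check the candidate opinion word against the subjectivity lexicon, since not every governor of a dobj or nsubj relation expresses an opinion.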
What to do with the extracted Opinion-Target pairs?
Once we have the opinion-target pairs, we use the subjectivity lexicon of (Wilson et al., 2005), which contains 8221 words, to determine the polarity of each opinion. The words in the lexicon are the opinion words.
Some samples from the lexicon:
type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=weaksubj len=1 word1=ability pos1=noun stemmed1=n priorpolarity=positive
type=weaksubj len=1 word1=above pos1=anypos stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=amazing pos1=adj stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=absolutely pos1=adj stemmed1=n priorpolarity=neutral
type=weaksubj len=1 word1=absorbed pos1=verb stemmed1=n priorpolarity=neutral
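A sketch of how lexicon lines in this format could be turned into polarity-target pairs is shown below. The function names are hypothetical, part of speech is ignored for simplicity, and the mapping of neutral to `*` is an assumption about the notation used in the slides.

```python
SAMPLE = [
    "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative",
    "type=strongsubj len=1 word1=amazing pos1=adj stemmed1=n priorpolarity=positive",
    "type=weaksubj len=1 word1=absorbed pos1=verb stemmed1=n priorpolarity=neutral",
]

# Assumed notation: "+" positive, "-" negative, "*" neutral.
SYMBOL = {"positive": "+", "negative": "-", "neutral": "*"}

def load_lexicon(lines):
    """Map each word to its prior polarity (first entry wins, POS ignored)."""
    lexicon = {}
    for line in lines:
        fields = dict(item.split("=", 1) for item in line.split())
        lexicon.setdefault(fields["word1"], fields["priorpolarity"])
    return lexicon

def polarity_target(opinion, target, lexicon):
    """Turn an opinion-target pair into a polarity-target string, or None."""
    polarity = lexicon.get(opinion.lower())
    if polarity not in SYMBOL:
        return None                    # opinion word not covered
    return target + SYMBOL[polarity]

lexicon = load_lexicon(SAMPLE)
print(polarity_target("amazing", "houses", lexicon))
```

Because the lookup uses only the word itself, the result is a prior polarity: it does not change with the sentence the word appears in.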
What do the extracted Opinion-Target pairs look like?
Tweet_id  Opinion-Target pair  Polarity-Target pair
Tweet:3   will-I               I+
Tweet:6   evil-I               I-
Tweet:7   best-books           books+
Tweet:7   cry-me               me-
Tweet:7   think-what           what*
Tweet:7   think-you            you*
Tweet:9   amazing-houses       houses+
Tweet:10  love-I               I+
For example, for Tweet:9 the extracted opinion-target pair is amazing-houses and the corresponding polarity-target pair is houses+, where amazing has positive prior polarity.
What Next ? Integration Next, we apply these polarity-target pairs to the tweets to get useful information about the interests of a person.
Integrating opinion-target pair with tweets of trending-topic
Trending topics are those that are immediately popular in the Twitter world. They help people discover the *most breaking* news stories from across the world. Polarity-target pairs are applied to the trending-topic tweets to find a person's opinion with respect to a trending topic. This tells us which user posted the tweet, the actual message, the topic the tweet is about, and the corresponding polarity.
For example, the following tweet has the tweet_id 746 in a general tweet file:
RT @GirlOnMission Hasn't Obama's warranty runs out yet? // It was a limited warranty covering nothing substantial anyway!
The above tweet has also been tagged as a trending-topic tweet under the trending topic *Obama*. The opinion-target and polarity-target pairs generated for this tweet using the five rules we discussed are as follows:
tweet_id: 746 limited-warranty warranty-
tweet_id: 746 substantial-nothing nothing+
The tweet_id matches in both cases, which gives us the opinion-target and polarity-target pairs that will be used for argument detection and profile construction.
Drawback
The polarity that we talked about is the prior polarity. It does not take the *context of the sentence* into consideration. However, Opinion-Finder does!
Why is prior polarity not always effective?
– Explained with an example in later slides
Opinion-Finder
Software developed at the University of Pittsburgh to predict the contextual polarity of a sentence. It was designed mainly for documents, and it has limitations on the size of the file it can process as well as in its sentence splitting module. We modified the software to suit our purpose, i.e. tweets. It is extremely slow. For example, it processes a 27 MB file in approximately 12 hours (the total size of the tweet files is 18 GB).
How is Opinion-Finder different from the conventional Opinion-Target/Polarity-Target pair?
Consider a tweet: No one is happy with Barack Obama's healthcare plan.
From intuition, we can say that the above tweet has a negative connotation. How is it depicted in Opinion-Finder?
Output of Opinion-Finder: No one is happy with Barack Obama 's healthcare plan.
Output of the conventional 5-rule system:
– Tweet:1 happy-one one+
What does the Contextual-Polarity output look like? @A_ClayChillin what the hell did she do, push him out the truck? i think its time for me to go back to bed, dnl going to bed at 3 and waking up at 7. yuck. Quebecor veut vendre Jobboom - secteurs-d-activite - LesAffaires.com - http://bit.ly/5dlkE5 http://bit.ly/5dlkE5 wak @mirandamia maacii ya keik boltantemnyaaa,, *senaaaaannggg* 90% of any pain comes from trying to keep the pain secret. You cannot keep a secret and let it go.
Status as of now
Opinion-Finder is slow. It runs a pipeline of internal modules, such as Document Preprocessing, Sentence Splitting, Tokenization, POS tagging, Feature Finder, Source Finder, etc. About 20% of the tweets have been processed with Opinion-Finder so far.
Argument Detection
Argument detection finds the arguments people use to support or oppose an issue.
– Example: Obama is bankrupting Americans, he does nothing to improve the economy just drain it.
– “Obama” is the issue
– “bankrupting Americans” and “does nothing to improve the economy just drain it” are the arguments
Argument Detection Motivation
– To discover the reason why people show a positive or negative opinion toward an issue.
– If people suddenly change their opinion toward an issue because of a particular event, we can infer what exactly the event is from the arguments they use.
– Arguments will be used as an attribute in the user’s profile.
– One step in argument detection classifies the polarity of the tweet toward the issue. From the result of polarity classification, we can infer the public’s attitude toward the issue.
Argument Detection Approach
Step 1. Given a trending topic, retrieve all the tweets associated with that trending topic. (Output from trending topic detection)
– Treating trending topics as issues, we will detect people’s arguments about the issue from the tweets that are relevant to that topic.
Step 2. Determine whether a tweet is subjective or neutral about the topic. (Ongoing)
– Although some tweets belong to a certain trending topic, they don’t show any subjective opinion about the topic. Example: Barack Obama makes banks an offer they can’t refuse.
Argument Detection Approach
Step 3. Polarity classification to decide whether the tweet is positive or negative toward the topic. (Ongoing)
– After we get an argument from a tweet, we need to know whether the argument is used to support or oppose the issue, so we should know the polarity of the tweet first.
Step 4. Get all opinion-target pairs from the tweets that show positive or negative opinion, separately. (Output from opinion-target pairs)
– We will collect the arguments from those opinion-target pairs.
Argument Detection Approach
Step 5. Determine whether an opinion-target pair can be used as an argument. (Ongoing)
– Some opinion-target pairs can’t be used as arguments. Example:
Tweet: I envy you guys with a leader like Obama.
Opinion-target pairs:
– envy-I I- (this opinion-target pair can’t be used as an argument)
– Use mention co-reference and Mutual Information (MI) to find useful opinion-target pairs.
Argument Detection Approach
– Find the targets that are co-referenced with the topic. Example:
Tweet: Obama is still the best president you Americans have had in a very long time:)
Opinion-target pairs:
– best-president president+
Argument:
– Positive president (best) (Obama and president are co-referenced)
Argument Detection Approach
– MI is a quantity that measures the mutual dependence of two random variables. Calculate the MI between the topic and the target of each opinion-target pair. If the MI value exceeds a certain threshold, keep that opinion-target pair. Example:
Tweet: Obama’s nasty army. They aren’t funded yet but take a good look…
Opinion-target pairs:
– nasty-army army-
Argument:
– Negative army (nasty and little). (MI between Obama and army is high)
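One way to compute the MI between a topic word and a target word is to treat "the tweet contains the word" as a binary variable for each of them and estimate the probabilities from tweet counts. The sketch below does that; the whitespace tokenization, the function name, and the toy tweets are assumptions for illustration.

```python
import math

def mutual_information(tweets, topic, target):
    """MI (in bits) between 'tweet mentions topic' and 'tweet mentions target'."""
    n = len(tweets)
    joint = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for text in tweets:
        words = set(text.lower().split())
        joint[(int(topic in words), int(target in words))] += 1
    mi = 0.0
    for (a, b), count in joint.items():
        if count == 0:
            continue                   # zero-probability cells contribute 0
        p_ab = count / n
        p_a = (joint[(a, 0)] + joint[(a, 1)]) / n
        p_b = (joint[(0, b)] + joint[(1, b)]) / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy data: 'army' appears exactly when 'obama' does ...
dependent = ["obama army", "obama army", "weather nice", "weather nice"]
# ... versus 'army' appearing independently of 'obama'.
independent = ["obama army", "obama nice", "weather army", "weather nice"]
mi_dep = mutual_information(dependent, "obama", "army")
mi_ind = mutual_information(independent, "obama", "army")
print(mi_dep, mi_ind)
```

MI is also high when two words systematically avoid each other, so a practical filter would probably also check that the topic and target actually co-occur.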
Argument Detection Approach
Step 6. Argument clustering (Ongoing)
– Use WordNet to group arguments that have the same meaning into the same cluster. The targets of opinion-target pairs may have similar meanings, so we can cluster them.
Example (we can combine troops and army into the same cluster):
– Argument:
» Negative troops (Little)
» Negative army (Nasty)
Future work
Trending Topic Detection
– Two more algorithms to overcome the shortcomings of the naïve approach.
Opinion-Target Pairs
– Run Opinion-Finder to parse all the tweets.
– Compare the results of lexicon polarity and contextual polarity.
Argument Detection
– Identify whether a tweet is subjective or objective toward a topic.
– Identify the polarity of the subjective tweets.
– Identify the opinion targets that are useful for argument detection.
– Cluster opinion targets.
Future work
User profile construction. A user’s profile will include the following content:
– All the tweets the user has published
– Location and description from the user’s Twitter account profile
– Predicted gender
– Predicted interests
– Opinion-target pairs
– Trending topics the user has discussed, and the user's opinion toward those trending topics
– Arguments the user uses to support or oppose a topic
– …..