Project Discussion-3 -Prof. Vincent Ng -Jitendra Mohanty -Girish Vaidyanathan -Chen Chen

Agenda
– Brief recap of last presentation
– Things we have done so far
  – Finding interests of user
  – Finding user's gender
  – Finding trending topics
  – Finding opinion-target pairs
  – Argument detection
– Future plans
  – Argument detection
  – User profile construction

Brief recap of last presentation
– Finding interests of user: 20 categories of interest
– Algorithms: neural network, support vector machine, passive-aggressive algorithm
– Data: Twitter data and blog data

Data Preparation
– Interest categories: music, photography, art, reading, movie, sport, writing, travel, cooking, fashion, food, politics, god, singing, dancing, family, animal, shopping, game, social media
– We have 120,778 users in total
– 60% are used as training data
– 20% are used as development data (to tune the learning-algorithm parameters)
– 20% are used as testing data
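For concreteness, a minimal sketch of this split (loading the 120,778 users into a list and the random seed are assumptions, not the project's actual code):

    import random

    def split_users(users, seed=0):
        """Shuffle and split users 60/20/20 into train/dev/test."""
        users = list(users)
        random.Random(seed).shuffle(users)
        n = len(users)
        n_train = int(0.6 * n)
        n_dev = int(0.2 * n)
        train = users[:n_train]
        dev = users[n_train:n_train + n_dev]   # used to tune learning parameters
        test = users[n_train + n_dev:]
        return train, dev, test

    # e.g., with the 120,778 users mentioned above:
    # train, dev, test = split_users(all_users)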

Finding interests of user: Feature groups
– POS sequence: 1003
– Named entities: 14
– Social linguistic
– Bigram
– Unigram
– Unigram for description
– Bigram for description
– Ngram for user name
– Ngram for screen name
– Total: 578,085

Finding interests of user: Measures
– Precision: the fraction of predicted users who really have that interest
– Recall: the fraction of relevant instances that are retrieved
– F score
– Accuracy
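These are the standard definitions; a small sketch computing them per category from true/false positive and negative counts:

    def metrics(tp, fp, fn, tn):
        """Per-category precision, recall, F score, and accuracy
        from true/false positive and negative counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0  # predicted users who really have the interest
        recall = tp / (tp + fn) if tp + fn else 0.0     # relevant users that are retrieved
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        return precision, recall, f_score, accuracy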

Finding interests of user: Neural Network results (table of F score, accuracy, precision, and recall for each of the 20 interest categories; numeric values not preserved in the transcript).

Finding interests of user: Support Vector Machine results (table of F score, accuracy, precision, and recall for each of the 20 interest categories; numeric values not preserved in the transcript).

Finding interests of user: Passive-Aggressive results (table of F score, accuracy, precision, and recall for each of the 20 interest categories; numeric values not preserved in the transcript).

Finding interests of user: Result comparison (table of micro and macro F scores for the support vector machine, neural network, and passive-aggressive models; numeric values not preserved in the transcript).
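Micro F score pools the true/false positive and negative counts across all 20 categories before computing F, while macro F score averages the per-category F scores; a sketch of the distinction:

    def micro_macro_f(per_category_counts):
        """per_category_counts: non-empty list of (tp, fp, fn) tuples,
        one per interest category."""
        def f1(tp, fp, fn):
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            return 2 * p * r / (p + r) if p + r else 0.0
        macro = sum(f1(*c) for c in per_category_counts) / len(per_category_counts)
        tp = sum(c[0] for c in per_category_counts)   # pool counts across categories
        fp = sum(c[1] for c in per_category_counts)
        fn = sum(c[2] for c in per_category_counts)
        micro = f1(tp, fp, fn)
        return micro, macro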

Finding interests of user: Result analysis
– Recall analysis: recall is low because of the small amount of data per user. Some users claim a certain interest but have not published any tweets or blogs related to it. Once they publish such tweets or blogs in the future, our system can make the right prediction.
– Precision analysis: we found cases where people published tweets or blogs related to a certain interest but did not list it as one of their interests. Precision is higher than recall for most interest categories.

Finding interests of user: Including more features
– Tweet POS sequence: 1003
– Named entities: 14
– Social linguistic
– Bigram for tweets and blogs
– Unigram for tweets and blogs
– Unigram for description
– Bigram for description
– Ngram for user name
– Ngram for screen name
– Gender: 2
– Blog POS sequence: 3075
– Unigram for "About me"
– Bigram for "About me"
– Industry: 39
– Location: 4505
– Occupation: 332
– Total

Finding interests of user: Neural network results after including more features (table comparing the improved F score against the old F score for each interest category; numeric values not preserved in the transcript).

Finding interests of user: Results after including more features. With the additional features we predict interests better than before in all categories; some categories, such as music, reading, writing, fashion, and politics, improve by more than one percentage point. (Table of micro and macro F scores for the neural network with and without the additional features; numeric values not preserved in the transcript.)

Finding interests of user: Feature analysis
– There are 16 feature groups in total. We delete one feature group at a time and observe how the result changes; a sketch of this loop follows.
– Since the neural network gives the best results, we use it for this analysis.
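A sketch of that leave-one-group-out loop; train_neural_network and evaluate are hypothetical helper names standing in for the actual training and scoring code:

    FEATURE_GROUPS = ["tweet pos", "ner", "social linguistic", "bigram", "unigram",
                      "description unigram", "description bigram", "name ngram",
                      "screen name ngram", "gender", "blog pos", "aboutMe unigram",
                      "aboutMe bigram", "industry", "location", "occupation"]

    def ablation_study(train, dev):
        """Retrain with each feature group removed and record dev scores."""
        results = {}
        for removed in FEATURE_GROUPS:
            kept = [g for g in FEATURE_GROUPS if g != removed]
            model = train_neural_network(train, feature_groups=kept)  # hypothetical helper
            results[removed] = evaluate(model, dev)                   # hypothetical: micro/macro F
        return results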

Feature analysis result: micro and macro F scores for the neural network with all features and with each feature group removed in turn (tweet POS, NER, social linguistic, bigram, unigram, description unigram, description bigram, name ngram, screen-name ngram, gender, blog POS, aboutMe unigram, aboutMe bigram, industry, location, occupation); numeric values not preserved in the transcript.

Finding gender of user
– Motivation: helps to construct the user's profile; helps to compare opinions between genders
– Data: tweet data and blog data
– Feature groups: POS sequence (1003), named entities (14), social linguistic, bigram, unigram, unigram for description, bigram for description, ngram for user name, ngram for screen name, and interest features*
– Total number: the feature count above plus the interest features

Gender distribution: count and percentage of male and female users (numeric values not preserved in the transcript).

Finding gender of user: Results. If we can improve our interest prediction, it may be possible to improve the gender prediction as well. (Tables of F score, accuracy, precision, and recall for the neural network, support vector machine, and passive-aggressive algorithm, each run with no interest features, with real interest features, and with predicted interest features; numeric values not preserved in the transcript.)

Finding Trending Topics
– Motivation: helps find the interesting topics that attract people's attention; trending topics are also helpful for argument detection
– Possible approaches: naïve approach, online clustering, Latent Dirichlet Allocation (LDA)

Trending Topics Results: Naïve approach, a brief recap
– Visit every tweet in our dataset and find the words and phrases that occur frequently; those are the probable trending topics in our dataset.
– The timeframe of a trending topic can be found in a similar fashion by keeping track of the timestamp associated with every tweet and finding the minimum and maximum timestamps for every phrase or word.
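A minimal sketch of this counting procedure (unigrams only; the (timestamp, text) tweet format and the top-k cutoff are assumptions):

    from collections import Counter

    def naive_trending(tweets, top_k=50):
        """tweets: iterable of (timestamp, text). Returns frequent words
        with the time span over which each one occurs."""
        counts = Counter()
        first_seen, last_seen = {}, {}
        for ts, text in tweets:
            for word in set(text.lower().split()):
                counts[word] += 1
                if word not in first_seen:
                    first_seen[word] = ts    # earliest timestamp for this word
                last_seen[word] = ts         # latest timestamp so far
        return [(w, c, first_seen[w], last_seen[w])
                for w, c in counts.most_common(top_k)]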

Some Results for the month of December in our dataset in chronological order using Naïve Approach

Topics over the month of December

Problems with the naïve approach: many irrelevant words or phrases with large counts will be considered trending topics, for example:
– Youtube video
– I arrived
– Watching movie

Solutions to the problem in the previous slide (part of future work)
– Online clustering
– Latent Dirichlet Allocation

Solutions to the problem in the previous slide (part of future work): Online clustering algorithm
– All the tweets are ordered on the timeline.
– Each tweet is represented as a vector of tf-idf (term frequency–inverse document frequency) weights.
– Tweets with the highest similarities are clustered together, and every cluster corresponds to a trending topic.
– Online clustering works better than the naïve approach because it weights terms by tf-idf and compares each tweet against all current clusters. For example, consider the following tweets:
  – I love lady gaga
  – I love gandhi
  – Lady gaga is the best singer in the world
  The first and third share the high-weight terms "lady" and "gaga" and end up in the same cluster, while the second shares only common, low-weight words.
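A sketch of such an online clusterer over tf-idf vectors; the similarity threshold of 0.5 and the additive centroid update are illustrative choices, not settled design decisions:

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse term-weight dicts."""
        dot = sum(u[w] * v.get(w, 0.0) for w in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def online_cluster(tweet_vectors, threshold=0.5):
        """tweet_vectors: tf-idf dicts in timeline order. Each cluster is a
        term-weight dict and corresponds to a candidate trending topic."""
        clusters = []
        for vec in tweet_vectors:
            best, best_sim = None, 0.0
            for c in clusters:
                sim = cosine(vec, c)
                if sim > best_sim:
                    best, best_sim = c, sim
            if best is not None and best_sim >= threshold:
                for w, x in vec.items():       # fold the tweet into the cluster
                    best[w] = best.get(w, 0.0) + x
            else:
                clusters.append(dict(vec))     # start a new cluster/topic
        return clusters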

Solutions to the problem in the previous slide (part of future work): Latent Dirichlet Allocation (LDA)
– LDA is a bag-of-words model.
– In LDA, each tweet is viewed as a mixture of various topics.
– If a tweet contains a particular trending topic, it has a high probability of belonging to that topic.
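As an illustration, topics could be fit with an off-the-shelf LDA implementation such as gensim's (shown here as one possible tool, not a committed choice; the number of topics is an assumption):

    from gensim import corpora, models

    def lda_topics(tokenized_tweets, num_topics=50):
        """tokenized_tweets: list of token lists, one per tweet."""
        dictionary = corpora.Dictionary(tokenized_tweets)
        corpus = [dictionary.doc2bow(t) for t in tokenized_tweets]
        lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
        # Per-tweet topic mixture: a tweet containing a trending topic
        # should assign high probability to that topic.
        return lda, [lda[bow] for bow in corpus]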

Opinion-Target: What is it? Opinion words in an opinionated sentence are, in most cases, adjectives that act directly on their targets. For example:
– "I am so excited that the vacation is coming." Here the opinion word is "excited" and its target is "I".
– "The water is green and clear." Here the opinion word is "green" and its target is "water".
– "The Dream Lake is a beautiful place." Here the opinion is "beautiful" and its target is "Dream Lake".

Why opinion-target pairs? Motivation
– An opinionated sentence gives a sense of a person's general opinion on a *subject material* or *topic*, called the *target* in the research literature.
– The *subject material* or *topic* is diverse; for example, a travel article may deal with several tourist attractions. The last two examples on the previous slide are tourism-related opinions.
– Opinions change over the course of time. Example:
  – At time t1, user p's view: place x has a really good scenic view, let's go for it.
  – At time t2 = t1 + 1 year, user p's view: place y has a better scenic view than place x.
  The user's opinion of tourist place x has changed from positive to negative over the one-year timeframe.
– This suggests that by listening to a user's posts (tweets in our case), we can build a profile that shows whether the user's interest in a particular topic changes over time.

Extraction of Opinion-Target Pairs
– The Stanford parser was run over the tweets to give us dependencies among the different entities in a tweet.
– The following five rules were applied to the dependency information from the previous step to generate opinion-target pairs (a sketch of rules 1–4 follows):
  – Direct Object Rule: dobj(opinion, target). "I love(opinion1) Firefox(target1) and defended(opinion2) it."
  – Nominal Subject Rule: nsubj(opinion, target). "IE(target) breaks(opinion) with everything."
  – Adjective Modifier Rule: amod(target, opinion). "The annoying(opinion) popup(target)" — the opinion is the adjectival modifier of the target.
  – Prepositional Object Rule: if prep(target1, IN) then pobj(IN, target2) — the prepositional object of a known target is also a target of the same opinion. "The annoying(op) popup(tar1) in IE(tar2)."
  – Recursive Modifiers Rule: if conj(adj2, opinion adj1) then amod(target, adj2).
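A sketch of applying the first four rules to dependency triples of the form (relation, head, dependent); the triple format and the opinion-word set are assumptions, and the recursive-modifiers rule is omitted for brevity:

    def extract_pairs(deps, opinion_words):
        """deps: list of (relation, head, dependent) triples from one tweet's
        dependency parse; opinion_words: set of candidate opinion terms.
        Returns (opinion, target) pairs."""
        pairs = []
        for rel, head, dep in deps:
            if rel == "dobj" and head in opinion_words:     # dobj(opinion, target)
                pairs.append((head, dep))
            elif rel == "nsubj" and head in opinion_words:  # nsubj(opinion, target)
                pairs.append((head, dep))
            elif rel == "amod" and dep in opinion_words:    # amod(target, opinion)
                pairs.append((dep, head))
        # Prepositional Object Rule: the pobj of a preposition attached to a
        # known target is also a target of the same opinion.
        targets = {t for _, t in pairs}
        prep_of = {dep: head for rel, head, dep in deps if rel == "prep"}
        for rel, head, dep in deps:
            if rel == "pobj" and head in prep_of and prep_of[head] in targets:
                for op, t in list(pairs):
                    if t == prep_of[head]:
                        pairs.append((op, dep))
        return pairs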

What to do with the extracted opinion-target pairs? Once we have the opinion-target pairs, we use the subjectivity lexicon of (Wilson et al., 2005), which contains 8,221 words, to assign a polarity to each opinion word. Some samples from the lexicon:
– type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
– type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
– type=weaksubj len=1 word1=ability pos1=noun stemmed1=n priorpolarity=positive
– type=weaksubj len=1 word1=above pos1=anypos stemmed1=n priorpolarity=positive
– type=strongsubj len=1 word1=amazing pos1=adj stemmed1=n priorpolarity=positive
– type=strongsubj len=1 word1=absolutely pos1=adj stemmed1=n priorpolarity=neutral
– type=weaksubj len=1 word1=absorbed pos1=verb stemmed1=n priorpolarity=neutral
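A small sketch of parsing these clue lines into a lookup table (assuming one clue per line, as in the samples above):

    def load_subjectivity_lexicon(path):
        """Parse Wilson et al. (2005) clue lines like:
        type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
        Returns {word: (type, pos, prior_polarity)}."""
        lexicon = {}
        with open(path) as f:
            for line in f:
                if not line.strip():
                    continue
                fields = dict(kv.split("=", 1) for kv in line.split())
                lexicon[fields["word1"]] = (fields["type"], fields["pos1"],
                                            fields["priorpolarity"])
        return lexicon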

What do the extracted opinion-target pairs look like?

Tweet_id   Opinion-target pair   Polarity-target pair
Tweet:3    will-I                I+
Tweet:6    evil-I                I-
Tweet:7    best-books            books+
Tweet:7    cry-me                me-
Tweet:7    think-what            what*
Tweet:7    think-you             you*
Tweet:9    amazing-houses        houses+
Tweet:10   love-I                I+

For example, for Tweet 9 the extracted opinion-target pair is amazing-houses and the corresponding polarity-target pair is houses+, where "amazing" has positive prior polarity.

What Next ? Integration Next, we apply these polarity-target pairs to the tweets to get useful information about the interests of a person.

Integrating opinion-target pairs with tweets of a trending topic
– Trending topics are those that are immediately popular in the Twitter world; they help people discover the *most breaking* news stories from across the world.
– Polarity-target pairs are applied to the tweets of a trending topic to find a person's opinion w.r.t. that topic. This gives us the user who posted the tweet, the actual message, the topic the tweet is about, and the corresponding polarity.
– For example, the following tweet has tweet_id 746 in a general tweet file:
  "Hasn't Obama's warranty runs out yet? // It was a limited warranty covering nothing substantial anyway!"
  This tweet has also been tagged as a trending-topic tweet under the trending topic *Obama*. The opinion-target and polarity-target pairs generated for it by the five rules discussed above are:
  tweet_id: 746 limited-warranty warranty-
  tweet_id: 746 substantial-nothing nothing+
– The matched tweet_id links the two views, giving us the opinion-target and polarity-target pairs that will be used for argument detection and profile construction.
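A sketch of this tweet_id join; the two dictionary formats are assumptions for illustration:

    def attach_polarity_to_topics(topic_tweets, polarity_pairs):
        """topic_tweets: {tweet_id: (user, topic, text)} for tweets tagged
        with a trending topic; polarity_pairs: {tweet_id: [(target, polarity), ...]}.
        Yields (user, topic, target, polarity) records for profile construction."""
        for tweet_id, (user, topic, text) in topic_tweets.items():
            for target, polarity in polarity_pairs.get(tweet_id, []):
                yield user, topic, target, polarity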

Drawback: the polarity discussed so far is the prior polarity; it does not take the *context of the sentence* into consideration. OpinionFinder, however, does. Why is prior polarity not always effective? This is explained with an example in the later slides.

OpinionFinder
– Software developed at the University of Pittsburgh to predict the contextual polarity of a sentence.
– It was designed mainly for documents and has limitations on the size of the file it can process, as well as in its sentence-splitting module. We modified the software to handle our data, i.e. tweets.
– It is extremely slow: it processes a 27 MB file in roughly 12 hours, and the full tweet collection is 18 GB.

How is OpinionFinder different from the conventional opinion-target/polarity-target pairs? Consider the tweet: "No one is happy with Barack Obama's healthcare plan." From intuition, we can say that this tweet has a negative connotation.
– Output of OpinionFinder: (the annotated output is not preserved in the transcript)
– Output of the conventional 5-rule system: Tweet:1 happy-one one+
The conventional system sees only the positive prior polarity of "happy" and misses the negation, while OpinionFinder captures the contextual (negative) polarity.

What does the contextual-polarity output look like? Sample tweets:
– what the hell did she do, push him out the truck?
– i think its time for me to go back to bed, dnl going to bed at 3 and waking up at 7. yuck.
– Quebecor veut vendre Jobboom - secteurs-d-activite - LesAffaires.com [French: "Quebecor wants to sell Jobboom"]
– maacii ya keik boltantemnyaaa,, *senaaaaannggg*
– 90% of any pain comes from trying to keep the pain secret. You cannot keep a secret and let it go.
(The polarity annotations themselves are not preserved in the transcript.)

Status as of now
– OpinionFinder is slow: it runs a pipeline of internal modules such as document preprocessing, sentence splitting, tokenization, POS tagging, the feature finder, and the source finder.
– Only about 20% of the tweets have been processed with OpinionFinder so far.

Argument Detection
Argument detection finds the arguments people use to support or oppose an issue.
– Example: "Obama is bankrupting Americans, he does nothing to improve the economy just drain it."
– "Obama" is the issue.
– "bankrupting Americans" and "does nothing to improve the economy just drain it" are the arguments.

Argument Detection: Motivation
– To discover why people show positive or negative opinions toward an issue.
– If people suddenly change their opinion about an issue because of a particular event, we can infer from the arguments they use what that event is.
– Arguments will be used as an attribute in the user's profile.
– One step of argument detection classifies the polarity of a tweet toward the issue; from that polarity classification we can infer the public's attitude toward the issue.

Argument Detection: Approach
– Step 1: Given a trending topic, retrieve all the tweets associated with it (output from trending-topic detection). Treating trending topics as issues, we detect people's arguments about an issue from the tweets relevant to that topic.
– Step 2: Determine whether a tweet is subjective or neutral about the topic (in progress). Some tweets belong to a trending topic but show no subjective opinion toward it. Example: "Barack Obama makes banks an offer they can't refuse."

Argument Detection: Approach (continued)
– Step 3: Polarity classification to decide whether the tweet is positive or negative toward the topic (in progress). Once we extract an argument from a tweet, we need to know whether it supports or opposes the issue, so we must first know the polarity of the tweet.
– Step 4: Collect all opinion-target pairs, separately, from the tweets that show a positive or negative opinion (output from opinion-target pair extraction). We collect the arguments from these opinion-target pairs.

Argument Detection: Approach (continued)
– Step 5: Determine whether an opinion-target pair can be used as an argument (in progress). Some opinion-target pairs cannot. Example:
  – Tweet: "I envy you guys with a leader like Obama."
  – Opinion-target pair: envy-I I- (this pair cannot be used as an argument)
– We use mention co-reference and mutual information (MI) to find the useful opinion-target pairs.

Argument Detection Approach – Find those targets which are co-referenced with the topic. Example: Tweet: Obama is still the best president you Americans have had in a very long time:) Opinion-target pairs: – best-president president+ Argument – Positive president (best) (Obama and president are co- referenced)

Argument Detection Approach – MI is a quantity that measures the mutual dependence of the two random variables. Calculate MI between topic and target from opinion-target pairs. If the value of MI exceeds a certain threshold, then consider this opinion-target pair. Example: Tweet: Obama’s nasty army. They aren’t funded yet but take a good look… Opinion-target pairs: – nasty-army army- Argument: – Negative army (nasty and little). (MI between Obama and army is high)

Argument Detection: Approach (continued)
– Step 6: Argument clustering (in progress). Use WordNet to cluster arguments with the same meaning: the targets of different opinion-target pairs may have similar meanings, so we can merge them. Example (we can combine "troops" and "army" into the same cluster):
  – Argument: negative troops (little)
  – Argument: negative army (nasty)
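A sketch of one way to realize this with WordNet path similarity via NLTK; the 0.5 threshold is an illustrative assumption:

    # Requires the WordNet corpus: nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def same_cluster(word1, word2, threshold=0.5):
        """Group two argument targets (e.g. 'troops' and 'army') together
        if any pair of their WordNet synsets is sufficiently similar."""
        best = 0.0
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                sim = s1.path_similarity(s2)   # None if the synsets are incomparable
                if sim is not None and sim > best:
                    best = sim
        return best >= threshold

    # e.g. same_cluster("troops", "army") should come out True, so
    # "negative troops (little)" and "negative army (nasty)" merge.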

Future work
– Trending topic detection: implement the two additional algorithms to overcome the shortcomings of the naïve approach.
– Opinion-target pairs: run OpinionFinder over all the tweets; compare the results of lexicon (prior) polarity and contextual polarity.
– Argument detection: identify whether a tweet is subjective or objective toward a topic; identify the polarity of the subjective tweets; identify the opinion-target pairs useful for argument detection; cluster the opinion targets.

Future work: User profile construction. A user's profile will include the following:
– All the tweets the user has published
– The location and description from the user's Twitter account profile
– Predicted gender
– Predicted interests
– Opinion-target pairs
– Trending topics the user has discussed, along with the user's opinion toward them
– Arguments the user has used to support or oppose a topic
– …

Thank You!