Opinion Spam and Analysis (Software Engineering Lab, G201592009 최효린) 1 / 35

Contents
ABSTRACT
1. INTRODUCTION
2. RELATED WORK
3. OPINION DATA AND ANALYSIS
  3.1 Review Data from Amazon.com
  3.2 Reviews, Reviewers and Products
  3.3 Review Ratings and Feedbacks
4. SPAM DETECTION
  4.1 Detection of Duplicate Reviews
  4.2 Detecting Type 2 & Type 3 Spam Reviews
5. ANALYSIS OF TYPE 1 SPAM REVIEWS
6. CONCLUSIONS
2 / 35

ABSTRACT "Evaluative text" on the Web: it is now very common for people to read opinions on the Web, but this raises problems of opinion spam and trustworthiness. For example, if one sees that the reviews of a product are mostly positive, one is very likely to buy it. This paper analyzes product reviews from Amazon (amazon.com) and presents key techniques for identifying opinion spam. 3 / 35

1. INTRODUCTION Reviews contain rich user opinions on products and services. They are used by potential customers to find opinions of existing users before deciding to purchase a product. They are also used by product manufacturers to identify product problems and/or to find marketing intelligence information about their competitors. Positive opinions can result in significant financial gains. 4 / 35

1. INTRODUCTION This gives good incentives for opinion spam. There are generally three types of spam reviews:
Type 1 (untruthful opinions): giving positive reviews to some target objects in order to promote them, or giving negative reviews to damage their reputations.
Type 2 (reviews on brands only): reviews that comment only on the brands, the manufacturers or the sellers of the products, not on the products themselves.
Type 3 (non-reviews): (1) advertisements and (2) other irrelevant texts containing no opinions.
5 / 35

2. RELATED WORK Current studies are mainly focused on mining opinions in reviews and/or classifying reviews as positive or negative based on the sentiments of the reviewers. Since our objective is to detect spam activities in reviews, we discuss some existing work on spam research: Web spam, spam in recommender systems, etc. 6 / 35

3. OPINION DATA AND ANALYSIS 3.1 Review Data from Amazon.com In this work, we use reviews from amazon.com. The reason for using this data set is that it is large and covers a very wide range of products. Our review data set was crawled in June 2006 (5.8 million reviews, 2.14 million reviewers and 6.7 million products). Each amazon.com review consists of 8 parts. 7 / 35

3. OPINION DATA AND ANALYSIS We used 4 main categories of products in our study, i.e., Books, Music, DVD and mProducts (manufactured products). 8 / 35

3. OPINION DATA AND ANALYSIS 3.2 Reviews, Reviewers and Products Before studying review spam, let us first look at some basic statistics about reviews, reviewers, products, ratings and feedback on reviews. Specifically, we show the following plots: 1. Number of reviews vs. number of reviewers 2. Number of reviews vs. number of products 9 / 35

3. OPINION DATA AND ANALYSIS (Figures: number of reviews vs. number of reviewers, and number of reviews vs. number of products.) 10 / 35

3. OPINION DATA AND ANALYSIS This slide shows the distribution of the number of feedbacks on reviews (readers give feedbacks to reviews to indicate whether they are helpful). 11 / 35

3. OPINION DATA AND ANALYSIS 3.3 Review Ratings and Feedbacks The review rating and the review feedback are two of the most important items of each review. Review Rating: Amazon uses a 5-point rating scale (where 1 is worst and 5 is best). Review Feedback: the percentage of positive feedbacks a review receives. 12 / 35
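As a concrete illustration, the two per-review quantities on this slide can be sketched as follows; the Review class and its field names are illustrative assumptions, not Amazon's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class Review:
    """Hypothetical review record; field names are assumptions, not Amazon's schema."""
    rating: int              # review rating on Amazon's 5-point scale (1 worst, 5 best)
    positive_feedbacks: int  # readers who marked the review as helpful
    total_feedbacks: int     # all helpful/unhelpful votes the review received

    def percent_positive_feedback(self) -> float:
        """Percentage of positive feedbacks, the quantity used on this slide."""
        if self.total_feedbacks == 0:
            return 0.0
        return 100.0 * self.positive_feedbacks / self.total_feedbacks

print(Review(rating=5, positive_feedbacks=8, total_feedbacks=10).percent_positive_feedback())  # 80.0
```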

4. SPAM DETECTION In general, spam detection can be regarded as a classification problem with two classes, spam and non-spam. To build a classification model, we need manually labeled training examples. That is where we have a problem. 13 / 35

4. SPAM DETECTION For the three types of spam, we can only manually label training examples for spam reviews of type 2 and type 3, as they are recognizable based on the content of a review. However, recognizing whether a review is an untruthful opinion spam (type 1) by manually reading it is extremely difficult, because one can carefully craft a spam review that looks just like any other innocent review. Thus, other ways have to be explored in order to find training examples for detecting possible type 1 spam reviews (see Section 5). 14 / 35

4. SPAM DETECTION 4.1 Detection of Duplicate Reviews In this work, we use 2-gram based review content comparison: the similarity score of two reviews is the Jaccard similarity of their 2-gram sets. Review pairs with a similarity score of at least 90% were chosen as duplicates. 15 / 35
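The slide does not give an implementation, but a minimal sketch of the 2-gram similarity check could look like this (assuming word-level bigrams and Jaccard similarity; the paper's exact tokenization may differ):

```python
def two_grams(text: str) -> set[str]:
    """Word-level 2-grams (bigrams) of a review's text."""
    words = text.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}

def similarity(review_a: str, review_b: str) -> float:
    """Jaccard similarity of the two reviews' 2-gram sets."""
    a, b = two_grams(review_a), two_grams(review_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Review pairs scoring at least 90% are flagged as duplicates, as on this slide.
r1 = "this camera takes great pictures and the battery lasts long"
r2 = "this camera takes great pictures and the battery lasts longer"
print(similarity(r1, r2), similarity(r1, r2) >= 0.9)  # 0.8 False
```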

4. SPAM DETECTION 16 / 35

4. SPAM DETECTION For spam removal, we can delete all duplicate reviews, or we may want to keep only the last copy and remove the rest. 17 / 35

4. SPAM DETECTION 4.2 Detecting Type 2 & Type 3 Spam Reviews As we mentioned earlier, these two types of reviews are recognizable manually. Thus, we employ the classification learning approach based on manually labeled examples. Although manually labeling them is extremely time-consuming, we are already able to achieve very good classification results (see Section 4.2.3). 18 / 35

4. SPAM DETECTION Model Building Using Logistic Regression The reason for using logistic regression is that it produces a probability estimate of each review being spam, which is desirable. To build a model, we need to create the training data. We have defined features to characterize reviews. 19 / 35
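The paper does not specify an implementation; a sketch using scikit-learn's LogisticRegression (an assumption, with toy feature values standing in for the F1..F35 features on the following slides) shows the probability output the slide highlights:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one review's feature vector (stand-ins for features like F1..F35);
# y marks manually labeled type 2/3 spam (1) vs. non-spam (0).
X = np.array([[25.0, 12.0, 1.0],
              [ 3.0, 60.0, 0.0],
              [30.0,  8.0, 1.0],
              [ 2.0, 55.0, 0.0]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba yields a probability estimate of each review being spam,
# which is the property the slide gives as the reason for choosing this model.
spam_probability = model.predict_proba(X)[:, 1]
print(spam_probability)
```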

4. SPAM DETECTION Feature Identification and Construction There are three main types of information related to a review: (1) the content of the review, (2) the reviewer who wrote the review, and (3) the product being reviewed. We thus have three types of features: (1) Review Centric Features, (2) Reviewer Centric Features, and (3) Product Centric Features. 20 / 35

4. SPAM DETECTION Review Centric Features: number of feedbacks (F1), length of the review title (F4), whether the review is the first review of the product (F8), etc. Textual features: percent of positive (F10) and negative (F11) opinion words, percent of times the brand name is mentioned (F13), etc. Rating related features: binary features indicating whether a bad review was written just after the first good review of the product and vice versa (F20, F21), etc. 21 / 35

4. SPAM DETECTION Reviewer Centric Features: ratio of the number of reviews that the reviewer wrote which were first reviews (F22), percent of times that the reviewer wrote a review with binary features F20 (F31) and F21 (F32), etc. Product Centric Features: price (F33) and sales rank (F34), etc. 22 / 35
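Putting the three feature groups together, a hedged sketch of the feature-construction step might look like the following; the dictionary keys are assumptions about the underlying data, and only a handful of the F1..F34 features named above are shown.

```python
def build_features(review: dict, reviewer: dict, product: dict) -> list[float]:
    """Illustrative subset of review-, reviewer- and product-centric features."""
    return [
        float(review["num_feedbacks"]),         # F1: number of feedbacks
        float(len(review["title"])),            # F4: length of the review title
        1.0 if review["is_first"] else 0.0,     # F8: review is the product's first
        float(reviewer["first_review_ratio"]),  # F22: ratio of the reviewer's reviews that were firsts
        float(product["price"]),                # F33: product price
        float(product["sales_rank"]),           # F34: product sales rank
    ]

example = build_features(
    {"num_feedbacks": 14, "title": "Great value", "is_first": True},
    {"first_review_ratio": 0.4},
    {"price": 19.99, "sales_rank": 1200},
)
print(example)
```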

4. SPAM DETECTION Results of Type 2 and Type 3 Spam Detection: without using the feedback features (w/o feedbacks), almost the same results can be achieved. This is important because feedbacks can be spammed too (see Section 5.2.3). 23 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS From the presentation in Section 4, we can conclude that type 2 and type 3 spam reviews are fairly easy to detect. Detecting type 1 spam reviews is, however, very difficult, as they are not easily recognizable manually. Let us recall what type 1 spam reviews aim to achieve:
1. To promote some target objects, e.g., one's own products (hype spam).
2. To damage the reputation of some other target objects, e.g., products of one's competitors (defaming spam).
24 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Table 4 gives a simple view of type 1 spam. Clearly, spam reviews in regions 1 and 4 (where the review's sentiment matches the product's true quality) are not so damaging, while those in regions 2, 3, 5 and 6 are very harmful. Thus, spam detection techniques should focus on identifying reviews in these harmful regions. 25 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS 5.1 Making Use of Duplicates We would like to build a classification model, but since we have no manually labeled training data, we have to look to other sources. A natural choice is the three types of duplicates discussed in Section 4, which are almost certainly spam reviews. It is reasonable to assume that they are a fairly good and random sample of many kinds of spam. 26 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Model Building Using Duplicates: to further confirm the model's predictive power, we want to show that it is also predictive of non-duplicate reviews that are more likely to be spam, i.e., outlier reviews (reviews whose ratings deviate from the product's average rating). We use the logistic regression model to predict outlier reviews. Is an outlier a harmful spam review, or does the reviewer truly have a different view from others? 27 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Predicting Outlier Reviews To show the effectiveness of the prediction, let us use lift curves to visualize the results. Figure 7 plots 5 lift curves for the following types of (outlier) reviews:
Negative deviation on the same brand (num = 378)
Positive deviation on the same brand (num = 755)
Negative deviation (num = 21005)
Positive deviation (num = 20007)
Bad product with average reviews (num = 192)
28 / 35
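The lift curves themselves are easy to reproduce; below is a minimal sketch, assuming reviews are ranked by the duplicate-trained model's spam probability, with toy numbers standing in for the paper's data:

```python
import numpy as np

def lift_curve(spam_prob: np.ndarray, is_target: np.ndarray, bins: int = 10) -> list[float]:
    """Cumulative share of target (e.g., outlier) reviews captured in the
    top-k fraction of reviews when ranked by predicted spam probability."""
    order = np.argsort(-spam_prob)       # most spam-like reviews first
    hits = np.cumsum(is_target[order])   # running count of target reviews found
    cutoffs = [max(1, int(len(order) * i / bins)) for i in range(1, bins + 1)]
    return [float(hits[c - 1]) / float(hits[-1]) for c in cutoffs]

# Toy example: 6 reviews, 3 of them outliers; a curve rising well above the
# diagonal means the model's spam score is predictive of outlier reviews.
probs = np.array([0.9, 0.1, 0.8, 0.4, 0.7, 0.2])
flags = np.array([1, 0, 1, 0, 1, 0])
print(lift_curve(probs, flags, bins=3))  # [0.666..., 1.0, 1.0]
```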

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Predicting Outlier Reviews: let us make several important observations. Reviewers who are biased towards a brand and give reviews with negative deviations on products of that brand are very likely to be spammers, and their reviews should be removed as spam. The other types of outlier reviews are much less likely to be spam. 29 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Only Reviews: when a product has only one review, it is very difficult for a customer to tell whether that review is trustworthy, since there is no other review on the same product to compare with. Figure 8 plots the lift curves for the following types of reviews:
Only reviews of the products (num = 7330)
The first reviews of the products (num = 9855)
The second reviews of the products (num = 9834)
30 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Reviews from Top-Ranked Reviewers Amazon ranks each reviewer based on how frequently he/she gets helpful feedbacks on his/her reviews. Basically, we discuss whether the ranking of reviewers is effective in keeping spammers out, using the following reviewer ranks:
Top 100 ranks (num reviews = 609)
Ranks between 100 and 1000 (num reviews = 1413)
Ranks beyond 2 million (num reviews = 1771)
31 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Reviews from Top-Ranked Reviewers: this is a very surprising result, because it shows that top-ranked reviewers are less trustworthy compared to bottom-ranked reviewers. ("Feedbacks can be spammed too"; see the next slide.) 32 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Reviews with Different Levels of Feedbacks Apart from reviewer rank, another feature that a customer looks at before reading a review is the percent of helpful feedbacks that the review gets. This is a very important result: it shows that defining the usefulness of a review based on its feedbacks can be unreliable. *Feedback spam: finally, we note that feedbacks can be spammed too, analogous to click fraud, where a person or robot clicks on online advertisements to give the impression of real customer clicks. 33 / 35

5. ANALYSIS OF TYPE 1 SPAM REVIEWS Reviews of Products with Varied Sales Ranks Product sales rank is an important feature of a product; Amazon displays the result of any product query sorted by sales rank. Figure 11 plots the lift curves for reviews of products with high and low sales ranks:
Products with sales rank <= 50 (num reviews = 3009)
Products with sales rank >= 100K (num reviews = 3751)
34 / 35

6. CONCLUSIONS In this paper, we studied the following:
Opinion spam in reviews, as an initial investigation
Identified three types of spam
Detection techniques for spam
The work so far is a basic analysis of opinion spam. In future work, we will look at more advanced spam detection methods and at opinion spam in areas other than reviews (forums, blogs).
35 / 35