Opinion Spam and Analysis (WSDM '08) Nitin Jindal and Bing Liu Date: 04/06/09 Speaker: Hsu, Yu-Wen Advisor: Dr. Koh, Jia-Ling.

Similar presentations
Psychological Advertising: Exploring User Psychology for Click Prediction in Sponsored Search Date: 2014/03/25 Author: Taifeng Wang, Jiang Bian, Shusen.

Linear Regression.
Entity-Centric Topic-Oriented Opinion Summarization in Twitter Date : 2013/09/03 Author : Xinfan Meng, Furu Wei, Xiaohua, Liu, Ming Zhou, Sujian Li and.
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
Sensor-Based Abnormal Human-Activity Detection Authors: Jie Yin, Qiang Yang, and Jeffrey Junfeng Pan Presenter: Raghu Rangan.
Text-Based Measures of Document Diversity Date : 2014/02/12 Source : KDD’13 Authors : Kevin Bache, David Newman, and Padhraic Smyth Advisor : Dr. Jia-Ling,
Bring Order to Your Photos: Event-Driven Classification of Flickr Images Based on Social Knowledge Date: 2011/11/21 Source: Claudiu S. Firan (CIKM’10)
Yehuda Koren , Joe Sill Recsys’11 best paper award
Review Spam Detection via Temporal Pattern Discovery Sihong Xie, Guan Wang, Shuyang Lin, Philip S. Yu Department of Computer Science University of Illinois.
Opinion Spam and Analysis Nitin Jindal and Bing Liu Department of Computer Science University of Illinois at Chicago.
Partitioned Logistic Regression for Spam Filtering Ming-wei Chang University of Illinois at Urbana-Champaign Wen-tau Yih and Christopher Meek Microsoft.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining and Summarizing Customer Reviews Advisor : Dr.
Product Review Summarization from a Deeper Perspective Duy Khang Ly, Kazunari Sugiyama, Ziheng Lin, Min-Yen Kan National University of Singapore.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Subject: Principles of Accounts Title: Accounting Ratio and Interpretation of Accounts.
Improving web image search results using query-relative classifiers Josip Krapac, Moray Allan, Jakob Verbeek, Frédéric Jurie.
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
Mining and Summarizing Customer Reviews
(ACM KDD 09’) Prem Melville, Wojciech Gryc, Richard D. Lawrence
On Sparsity and Drift for Effective Real- time Filtering in Microblogs Date : 2014/05/13 Source : CIKM’13 Advisor : Prof. Jia-Ling, Koh Speaker : Yi-Hsuan.
Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.
1 Entity Discovery and Assignment for Opinion Mining Applications (ACM KDD 09’) Xiaowen Ding, Bing Liu, Lei Zhang Date: 09/01/09 Speaker: Hsu, Yu-Wen Advisor:
Fundamentals of Data Analysis Lecture 9 Management of data sets and improving the precision of measurement.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
1 A Unified Relevance Model for Opinion Retrieval (CIKM 09’) Xuanjing Huang, W. Bruce Croft Date: 2010/02/08 Speaker: Yu-Wen, Hsu.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
INTELLIGENT ORACLE CEMNET, SCE, NTU Speaker: Zeng Zinan
Designing Ranking Systems for Consumer Reviews: The Economic Impact of Customer Sentiment in Electronic Markets Anindya Ghose Panagiotis Ipeirotis Stern.
Learning from Multi-topic Web Documents for Contextual Advertisement KDD 2008.
Evaluation of Spam Detection and Prevention Frameworks for Email and Image Spam - A State of Art Pedram Hayati, Vidyasagar Potdar Digital Ecosystems and.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
LOGO Finding High-Quality Content in Social Media Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis and Gilad Mishne (WSDM 2008) Advisor.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
By Gianluca Stringhini, Christopher Kruegel and Giovanni Vigna Presented By Awrad Mohammed Ali 1.
How Useful are Your Comments? Analyzing and Predicting YouTube Comments and Comment Ratings Stefan Siersdorfer, Sergiu Chelaru, Wolfgang Nejdl, Jose San.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Department of Electrical Engineering and Computer Science Kunpeng Zhang, Yu Cheng, Yusheng Xie, Doug Downey, Ankit Agrawal, Alok Choudhary {kzh980,ych133,
LOGO Identifying the Influential Bloggers in a Community Nitin Agarwal, Huan Liu, Lei Tang and Philip S. Yu WSDM 2008 Advisor : Dr. Koh Jia-Ling Speaker.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems Saeedeh Momtazi, Dietrich Klakow University of Saarland,Germany.
Date: 2015/11/19 Author: Reza Zafarani, Huan Liu Source: CIKM '15
Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
1 Finding Spread Blockers in Dynamic Networks (SNAKDD08)Habiba, Yintao Yu, Tanya Y., Berger-Wolf, Jared Saia Speaker: Hsu, Yu-wen Advisor: Dr. Koh, Jia-Ling.
Competition II: Springleaf Sha Li (Team leader) Xiaoyan Chong, Minglu Ma, Yue Wang CAMCOS Fall 2015 San Jose State University.
SUPPORTING SYNCHRONOUS SOCIAL Q&A THROUGHOUT THE QUESTION LIFECYCLE Matthew Richardson Ryen White Microsoft Research.
SEMANTIC VERIFICATION IN AN ONLINE FACT SEEKING ENVIRONMENT DMITRI ROUSSINOV, OZGUR TURETKEN Speaker: Li, HueiJyun Advisor: Koh, JiaLing Date: 2008/5/1.
{ Adaptive Relevance Feedback in Information Retrieval Yuanhua Lv and ChengXiang Zhai (CIKM ‘09) Date: 2010/10/12 Advisor: Dr. Koh, Jia-Ling Speaker: Lin,
A Framework for Detection and Measurement of Phishing Attacks Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 2/25/2016 Slide.
1 Blog Cascade Affinity: Analysis and Prediction 2009 ACM Advisor : Dr. Koh Jia-Ling Speaker : Chou-Bin Fan Date :
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
LOGO Comments-Oriented Blog Summarization by Sentence Extraction Meishan Hu, Aixin Sun, Ee-Peng Lim (ACM CIKM’07) Advisor : Dr. Koh Jia-Ling Speaker :
Finding similar items by leveraging social tag clouds Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: SAC 2012’ Date: October 4, 2012.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
Using Blog Properties to Improve Retrieval Gilad Mishne (ICWSM 2007)
Opinion Spam and Analysis, Software Engineering Lab, G 최효린 1 / 35.
QUERY-PERFORMANCE PREDICTION: SETTING THE EXPECTATIONS STRAIGHT Date : 2014/08/18 Author : Fiana Raiber, Oren Kurland Source : SIGIR’14 Advisor : Jia-ling.
Bootstrap and Model Validation
Click Through Rate Prediction for Local Search Results
Source: Procedia Computer Science(2015)70:
Speaker: Jim-an tsai advisor: professor jia-lin koh
Hussein Issa Alexander Kogan Rutgers University
iSRD Spam Review Detection with Imbalanced Data Distributions
Measuring the Latency of Depression Detection in Social Media
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Presentation transcript:

1 Opinion Spam and Analysis (WSDM '08) Nitin Jindal and Bing Liu Date: 04/06/09 Speaker: Hsu, Yu-Wen Advisor: Dr. Koh, Jia-Ling

2 Outline Introduction; Opinion Data and Analysis; Spam Detection; Analysis of Type 1 Spam Reviews; Conclusion and Future Work

3 Introduction It is now well recognized that user-generated content contains valuable information that can be exploited for many applications. In this paper, we focus on customer reviews of products. In particular, we investigate opinion spam in reviews.

4 *Three types of spam reviews Type 1 (untruthful opinions): also known as fake reviews or bogus reviews. Type 2 (reviews on brands only): do not comment on the specific products being reviewed, but only on the brands, the manufacturers, or the sellers. Type 3 (non-reviews): two main sub-types:  advertisements  other irrelevant reviews containing no opinions

5 Opinion Data and Analysis Each amazon.com review consists of 8 parts:  <Product ID> <Reviewer ID> <Rating> <Date> <Review Title> <Review Body> <Number of Helpful Feedbacks> <Number of Feedbacks>

6 *Reviews, Reviewers, and Products

7 *Review Ratings and Feedbacks

8 Spam Detection Spam detection can be regarded as a classification problem with two classes: spam and non-spam. We can only manually label training examples for spam reviews of type 2 and type 3, as they are recognizable based on the content of a review.  Recognizing whether a review is an untruthful opinion spam (type 1) is extremely difficult by manually reading the review.

9 In our analysis, we found a large number of duplicate and near-duplicate reviews.  1. Duplicates from different userids on the same product.  2. Duplicates from the same userid on different products.  3. Duplicates from different userids on different products.

10 *Detection of Duplicate Reviews Shingle method (2-grams); Jaccard similarity score > 90%  duplicates. The maximum similarity score of a reviewer: the maximum of the similarity scores between different reviews written by that reviewer.
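
A minimal sketch of this duplicate check, assuming whitespace word tokenization; the slide fixes only the 2-gram shingles and the 90% threshold, so the tokenizer and function names here are illustrative:

```python
def shingles(text, n=2):
    """Word-level n-gram shingles of a review (2-grams, per slide 10)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_similarity_score(reviews, threshold=0.9):
    """Maximum pairwise similarity among one reviewer's reviews;
    any pair above the 90% threshold counts as a duplicate."""
    sets = [shingles(r) for r in reviews]
    best = 0.0
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            best = max(best, jaccard_similarity(sets[i], sets[j]))
    return best, best > threshold
```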

11

12 *Detecting Type 2 & Type 3 Spam Model Building Using Logistic Regression  The reason for using logistic regression is that it produces a probability estimate of each review being spam, which is desirable.  The AUC (Area Under the ROC Curve) is employed to evaluate the classification results.

13 Feature Identification and Construction  There are three main types of information: (1) the content of the review, (2) the reviewer who wrote the review, and (3) the product being reviewed.  Correspondingly, three types of features: (1) review centric features, (2) reviewer centric features, (3) product centric features.  Ratings are divided into three types: good (rating ≥ 4), bad (rating ≤ 2.5), and average otherwise.
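
The rating buckets translate directly into code; a one-function sketch of the thresholds given on this slide:

```python
def rating_type(rating):
    """Bucket a rating per slide 13: good (>= 4), bad (<= 2.5), else average."""
    if rating >= 4:
        return "good"
    if rating <= 2.5:
        return "bad"
    return "average"
```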

14 *Review Centric Features* Number of feedbacks (F1), number of helpful feedbacks (F2), percent of helpful feedbacks (F3), length of the review title (F4), length of the review body (F5), position of the review among the product's reviews sorted by date, ascending (F6) and descending (F7), whether it is the first review (F8), whether it is the only review (F9)

15 *Review Centric Features* **Textual features Percent of positive opinion-bearing words (F10), percent of negative opinion-bearing words (F11), cosine similarity (F12), percent of times the brand name is mentioned (F13), percent of numeral words (F14), percent of capitalized words (F15), percent of all-capital words (F16) **Rating related features Rating of the review (F17), the deviation from the product rating (F18), whether the review is good, average, or bad (F19), whether a bad review was written just after the first good review of the product, and vice versa (F20, F21)

16 *Reviewer Centric Features* Ratio of the number of reviews that the reviewer wrote which were the first reviews (F22), ratio of the cases in which he/she was the only reviewer (F23), average rating given by the reviewer (F24), standard deviation in rating (F25), whether the reviewer always gave only good, average, or bad ratings (F26), gave both good and bad ratings (F27), gave good and average ratings (F28), gave bad and average ratings (F29), gave all three ratings (F30), percent of times the reviewer wrote a review with binary feature F20 (F31) and with F21 (F32)

17 *Product Centric Features* Price of the product (F33), sales rank of the product (F34), average rating (F35), standard deviation in ratings (F36)
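
To make the feature construction concrete, here is an illustrative sketch covering a small subset of the features above (F1-F5, F17, F18); the `review` dict keys are hypothetical stand-ins for the 8 review parts on slide 5, not the authors' code:

```python
def basic_review_features(review, product_avg_rating):
    """Illustrative subset of review centric features F1-F5, F17, F18."""
    f1 = review["num_feedbacks"]
    f2 = review["num_helpful_feedbacks"]
    return {
        "F1": f1,                                    # number of feedbacks
        "F2": f2,                                    # number of helpful feedbacks
        "F3": f2 / f1 if f1 else 0.0,                # percent of helpful feedbacks
        "F4": len(review["title"].split()),          # length of the review title
        "F5": len(review["body"].split()),           # length of the review body
        "F17": review["rating"],                     # rating of the review
        "F18": abs(review["rating"] - product_avg_rating),  # deviation from product rating
    }
```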

18 *Results of Type 2 and Type 3 Spam Detection We ran logistic regression on the data, using the 470 spam reviews as the positive class and the rest of the reviews as the negative class.
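
A sketch of this experiment with scikit-learn; the slides do not give the exact cross-validation setup, so the 10-fold AUC evaluation and the random stand-in data below are assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data: in the paper, X holds one F1-F36 feature vector per
# review and y marks the 470 manually labeled type 2/3 spam reviews
# as 1 and every other review as 0. Random data here just to run.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 36))
y = (rng.random(5000) < 470 / 5000).astype(int)

model = LogisticRegression(max_iter=1000)
# AUC under cross-validation; AUC suits the highly imbalanced classes.
auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC: {auc.mean():.3f}")
```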

19 Analysis of Type 1 Spam Reviews 1. To promote some target objects (hype spam). 2. To damage the reputation of some other target objects (defaming spam).

20 *Making Use of Duplicates When the same person writes the same review for different versions of the same product, it may not be spam. We propose to treat all duplicate spam reviews as positive examples and the rest of the reviews as negative examples.  We then use them to learn a model to identify non-duplicate reviews with similar characteristics, which are likely to be spam reviews.

21 *Model Building Using Duplicates Building the logistic regression model using duplicates and non-duplicates is not for detecting duplicate spam.  Our real purpose is to use the model to identify type 1 spam reviews that are not duplicated; a sketch follows.  To validate it, we check whether it can predict outlier reviews.
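
A minimal sketch of this idea, assuming `duplicate_flags` comes from the shingle-based detection on slide 10; the function name and interface are illustrative, not the authors' implementation:

```python
from sklearn.linear_model import LogisticRegression


def score_type1_candidates(X, duplicate_flags):
    """Fit on duplicates (positives) vs. the rest (negatives), then
    return each review's spam probability. A high score on a
    NON-duplicate review flags it as a likely type 1 spam candidate."""
    y = [1 if flag else 0 for flag in duplicate_flags]
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return model.predict_proba(X)[:, 1]
```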

22 Outlier reviews: reviews whose ratings deviate a great deal from the average product rating. Sentiment classification techniques may be used to automatically assign a rating to a review based solely on its review content.

23 *Predicting Outlier Reviews Negative deviation is defined as a rating more than 1 below the mean product rating, and positive deviation as a rating more than 1 above it. Spammers may not want their review ratings to deviate too much from the norm, which would make the reviews look suspicious.
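
The deviation rule on this slide is simple enough to state as code; a direct sketch:

```python
def deviation_label(review_rating, product_mean_rating):
    """Classify a review's rating deviation per slide 23."""
    d = review_rating - product_mean_rating
    if d < -1:
        return "negative deviation"   # outlier on the low side
    if d > 1:
        return "positive deviation"   # outlier on the high side
    return "within norm"
```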

24 The model built using duplicate spam as positive data is also predictive of non-duplicate spam reviews to a good extent.

25 *Some Other Interesting Reviews **Only Reviews We did not use the review position features (F6, F7, F8, F9, F20, and F21) or the features related to the number of reviews of a product (F12, F18, F35, and F36) in model building, to prevent overfitting.  "Only" reviews (the sole review of a product) are very likely to be candidates for spam.

26 **Reviews from Top-Ranked Reviewers Top-ranked reviewers generally write a large number of reviews, many more than bottom-ranked reviewers. Top-ranked reviewers also score high on some important indicators of spam reviews.  This suggests that top-ranked reviewers are less trustworthy than bottom-ranked reviewers.

27 **Reviews with Different Levels of Feedbacks If the usefulness of a review is defined based on the feedbacks that the review receives, people can be readily fooled by a spam review.  Feedbacks themselves can also be spammed (feedback spam).

28 **Reviews of Products with Varied Sales Ranks Spam activities are more limited to low-selling products.  It is difficult to damage the reputation of a high-selling or popular product by writing a spam review.

29 Conclusions & Future Work Results showed that the logistic regression model is highly effective. It is very hard to manually label training examples for type 1 spam. We will further improve the detection methods, and also look into spam in other kinds of media, e.g., forums and blogs.