iSRD Spam Review Detection with Imbalanced Data Distributions

Yan Zhu

Agenda
Overview
Objective
Sentiment Analysis and Imbalanced Data Distributions
iSRD: Methodology
Experiments and Results
Conclusion

Overview The growing use of the internet in all aspects of our lives has made us rely on it for many daily activities. Online product reviews are becoming vital for customers who want additional user-centered knowledge about products. However, some vendors pay customers to write good reviews so that they can boost their revenue from online sales.

Overview Examples of spam reviews include untruthful or fake reviews and reviews that are irrelevant to the product (such as advertisements). One of the most effective ways to distinguish spam from non-spam reviews is to use machine learning techniques. Non-spam reviews are usually the majority, while spam or fake reviews are relatively rare and difficult to obtain.

Objective The paper discusses sentiment analysis techniques for opinion mining, which capture users' sentiments through document-level sentiment classification based on supervised learning and through feature-based sentiment analysis. To address the problem of imbalanced datasets, the authors develop iSRD, a new classifier framework that deals with imbalanced review data.

Sentiment Analysis And Imbalanced Data Distributions
Sentiments often carry the real message that people want to deliver. Successfully analyzing and understanding sentiments is useful for many domain-specific applications. Sentiment analysis is categorized into three levels: document level, sentence level, and aspect level. Naïve Bayes and support vector machines have been shown to give good results in supervised learning when using bag-of-words features.

Review spam has been categorized into three types: fake reviews, which are untruthful; brand-only reviews, which describe and comment on the brand rather than the product or service; and non-reviews, which are not reviews at all and may be irrelevant text, questions, or advertisements.

The main challenge is that fake reviews are very hard to detect, even manually, because there is no clear way to distinguish between fake and true reviews. Machine learning is a well-suited technique: it generalizes from the provided representations and learns behavior from the given examples in order to classify unseen examples.

In machine learning, imbalanced data distributions often arise because of a lack of examples from the minority class. The problem appears when users try to train a good classifier from imbalanced training data: classifiers are inherently biased toward the majority class, which leads to incorrect generalization rules.
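To make the majority-class bias concrete, here is a minimal sketch. The counts (950 non-spam, 50 spam) are made up for illustration and are not taken from the paper. A degenerate classifier that always predicts the majority class looks accurate while catching no spam at all.

```python
# Illustrative counts only; they are not taken from the paper's dataset.
labels = ["non-spam"] * 950 + ["spam"] * 50

# A degenerate classifier that always predicts the majority class.
predictions = ["non-spam"] * len(labels)

# Overall accuracy: fraction of predictions that match the true label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Number of spam reviews that were actually flagged as spam.
caught_spam = sum(p == "spam" and y == "spam" for p, y in zip(predictions, labels))

print(f"accuracy = {accuracy:.2f}")
print(f"spam reviews caught: {caught_spam} of {labels.count('spam')}")
```

The accuracy is high only because non-spam dominates the data, which is why accuracy alone is a poor guide here.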

Instead of accuracy, one should focus on precision, recall, sensitivity, and specificity, which give a more accurate picture of performance on the minority class.
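As a rough sketch of how these measures follow from confusion-matrix counts, the helper below treats spam as the positive class. The function name `minority_class_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def minority_class_metrics(tp, fp, tn, fn):
    """Compute metrics from confusion-matrix counts, with spam as the positive class."""
    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g. no positive predictions).
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),      # flagged reviews that really are spam
        "recall": ratio(tp, tp + fn),         # same as sensitivity / true positive rate
        "specificity": ratio(tn, tn + fp),    # true negative rate
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }

# Hypothetical counts: 50 spam reviews (30 caught), 950 non-spam (20 misflagged).
metrics = minority_class_metrics(tp=30, fp=20, tn=930, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.96, yet precision and recall on the spam class are both only 0.6, which accuracy hides.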

Many methods exist to handle data with imbalanced distributions. Examples include sampling and re-weighting. When using those approaches, boosting and bagging are often used to combine classifiers trained from sampled datasets for prediction.

iSRD: Methodology The main idea is to use under-sampling to generate a relatively balanced dataset, and then use classifiers trained on the sampled datasets for prediction. The sampling is repeated multiple times, each time generating a balanced dataset, to reduce sample selection bias. A classifier is trained on each balanced dataset, and the ensemble of classifiers from all sampled datasets is used for spam classification.

The proposed iSRD framework for spam review detection with imbalanced data distributions. #S and #N denote the number of spam and non-spam reviews in the dataset. The benchmark dataset is first split into a FIT (training) set and a test set. For the training set, we use β to change the data imbalance level so that we can observe algorithm performance under different data imbalance conditions. After that, we use random under-sampling to generate m copies of balanced datasets, where m is equal to 10. We train a classifier on each balanced dataset and use majority voting to classify reviews in the test set.

iSRD: Methodology First, split the dataset into a training (FIT) set and a test set with similar data imbalance ratios. Use β to change the data imbalance level in the FIT set in order to evaluate performance at different imbalance levels. After obtaining the altered FIT dataset, apply random under-sampling to generate balanced datasets.

iSRD: Methodology We train a classifier on each balanced dataset and then use the majority vote of the m classifiers to predict the class labels of the reviews in the test set. To validate this design, our experiments record the performance of each classifier on the same test set and then compare the results.
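The majority-voting step can be sketched as follows. This is an illustration only: the three keyword rules stand in for the m trained classifiers and are hypothetical. The slides do not say how ties are broken; this sketch relies on `Counter.most_common`, which returns the label that received its first vote earliest when counts are equal.

```python
from collections import Counter

def majority_vote(classifiers, review):
    """Return the label predicted by most classifiers for one review."""
    votes = Counter(clf(review) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained classifiers: each is a callable that
# maps review text to "spam" or "non-spam".
toy_classifiers = [
    lambda text: "spam" if "best ever" in text else "non-spam",
    lambda text: "spam" if "!!!" in text else "non-spam",
    lambda text: "spam" if len(text.split()) < 4 else "non-spam",
]

first = majority_vote(toy_classifiers, "best ever hotel !!!")
second = majority_vote(toy_classifiers, "The room was clean and the staff were helpful.")
print(first)   # two of three rules vote spam
print(second)  # all three rules vote non-spam
```

Using an odd number of classifiers would avoid ties entirely; with m = 10, as in the slides, a tie-breaking rule has to be chosen.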

Experiments Data Collection
Collected review reports for multiple hotels located in different cities and countries. Two major data sources: the Opinion Based Entity Ranking Project Dataset (2012), and deceptive or fake reviews from the Deceptive Opinion Spam Corpus v1.4, which were gathered from Amazon MTurk (heterogeneous).

Data Collection Data Preprocessing
Form a dataset with two columns in which each row denotes a review: the first column holds the full text of the review, and the second column holds the class label. Convert the texts into a bag-of-words representation using the StringToWordVector filter in Weka. Store the dataset as an ARFF file to be used in the following steps.
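To show what a bag-of-words representation looks like, here is a small Python sketch. It is not the Weka StringToWordVector filter and does not reproduce that filter's options; the tokenizer and the function name `bag_of_words` are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(reviews):
    """Turn a list of review strings into (vocabulary, per-review count vectors)."""
    # Simplistic tokenizer: lowercase runs of letters and apostrophes.
    tokenized = [re.findall(r"[a-z']+", text.lower()) for text in reviews]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word, holding its count in this review.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vectors = bag_of_words(["Great hotel, great staff", "Terrible hotel"])
print(vocab)
print(vectors)
```

Each review becomes a fixed-length numeric vector, which is the form a classifier needs; word order is discarded.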

Data Collection Data Sampling
Randomly select examples to create an imbalanced test dataset whose imbalance ratio is very close to that of the training set.
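One way to keep the test set's imbalance ratio close to the training set's is to sample each class separately. The sketch below assumes reviews are stored as (text, label) pairs; the `test_fraction` parameter and its default are hypothetical, since the slides do not state the split proportion.

```python
import random

def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so that both parts keep a similar class ratio."""
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)

    fit, test = [], []
    for group in by_label.values():
        group = group[:]          # copy so the caller's data is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        fit.extend(group[n_test:])
    return fit, test

# Illustrative data: 90 non-spam and 10 spam reviews.
data = [(f"review {i}", "non-spam") for i in range(90)]
data += [(f"fake {i}", "spam") for i in range(10)]
fit_set, test_set = stratified_split(data)
print(len(fit_set), len(test_set))
```

Because each class is split at the same fraction, both sets keep roughly the original 9:1 ratio.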

Build five Fit datasets from the original Fit dataset by under-sampling the minority class (spam) while keeping all non-spam examples.

For each Fit dataset, apply random under-sampling to the majority class to create a set of balanced datasets: create 10 copies of randomly sampled datasets, each containing the same number of positive and negative samples.
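The two sampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions: reviews are (text, label) pairs, labels are the strings "spam" and "non-spam", non-spam is the majority, and the number of spam examples kept (10 here) is a made-up value, since the transcript does not give the five imbalance levels.

```python
import random

def undersample(examples, label, keep, rng):
    """Keep only `keep` randomly chosen examples of `label`; keep all others."""
    target = [e for e in examples if e[1] == label]
    others = [e for e in examples if e[1] != label]
    return others + rng.sample(target, keep)

def balanced_copies(fit, m=10, seed=0):
    """Create m balanced datasets by under-sampling the non-spam majority."""
    rng = random.Random(seed)
    n_spam = sum(1 for e in fit if e[1] == "spam")
    return [undersample(fit, "non-spam", n_spam, rng) for _ in range(m)]

# Illustrative original Fit set: 100 non-spam and 20 spam reviews.
original_fit = [(f"review {i}", "non-spam") for i in range(100)]
original_fit += [(f"fake {i}", "spam") for i in range(20)]

# Step 1: raise the imbalance by under-sampling spam, keeping all non-spam.
imbalanced_fit = undersample(original_fit, "spam", keep=10, rng=random.Random(1))

# Step 2: build 10 balanced copies by under-sampling non-spam.
copies = balanced_copies(imbalanced_fit, m=10)
print(len(imbalanced_fit), len(copies), len(copies[0]))
```

Each copy draws a different random subset of non-spam reviews, so the classifiers trained on them see different parts of the majority class.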

Results Instead of examining accuracy, other measures, such as precision, recall, sensitivity, and specificity, give a more accurate evaluation of algorithm performance on the minority samples. We compare the performance of our model with a decision-tree-based classifier (C4.5) using different statistical measurements.

Results The class of interest is the spam class, which is the positive class here, so the goal is to increase the true positive rate and decrease the false positive rate.

Conclusion In this research, we addressed the problem of detecting spam online reviews from imbalanced data distributions. We proposed a new classifier technique to overcome the problem of imbalanced data distributions in review spam detection, and proposed using random under-sampling to generate balanced training sets.

Conclusion The experiments show that our proposed method, iSRD, significantly outperforms the baseline classifier C4.5 in terms of TNR, FNR, sensitivity, AUC, and PRC, which are common measures for imbalanced data evaluation.

Reference Al Najada, H., & Zhu, X. (2014, August). iSRD: Spam review detection with imbalanced data distributions. In Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on (pp ). IEEE.
