Source: Procedia Computer Science(2015)70:


1 Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification
Source: Procedia Computer Science (2015) 70:
Authors: Kwang Leng Goh, Ashutosh Kumar Singh
Speaker: Jia Qing Wang
Date: 2016/10/13

2 Outline
Introduction
Related Work
Methodology
Experiment
Results & Discussions
Conclusion & Future Work

3 Introduction(1/3)
Example: the website of a municipal health bureau was attacked with spam advertisements for detectaphones (eavesdropping devices).

4 Introduction(2/3)

5 Introduction(3/3)
The intention of Web spam is to mislead search engines by boosting a page to an undeserved rank, which leads Web users to irrelevant information. Spam detection is therefore needed; however, the process is often time-consuming, expensive and difficult to automate because of the massive amounts of data and the multi-dimensional attributes involved. Machine learning methods provide an ideal solution because of their adaptive ability to learn the underlying patterns for classifying spam and non-spam [22].
The paper points out that …
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI /

6 Related Work(1/3)
Web spam detection problem → classification problem
Pipeline: extract features → dimensionality reduction (feature selection and feature extraction methods) → classifier → experiment result
Extract features: novel high-quality features for web pages (link features, content features, ……)
Dimensionality reduction: PCA, LDA

7 Related Work(2/3)
Datasets: WEBSPAM-UK2006 and WEBSPAM-UK2007
Feature sets:
A & B: the number of words in the page, number of words in the title, average word length, fraction of anchor text and visible text, …
C: in-degree, out-degree, PageRank, TrustRank, truncated PageRank
D: simple numeric transformations and combinations of the link-based features
Both datasets provide evaluation sets, SET 1 for training and SET 2 for testing, because the motivation behind the Web Spam Challenge series is to provide machine learning solutions for combating Web spam.
The distribution of feature vectors is shown as:
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI /
[24] Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: ICML, vol 96, pp 148–156
[26] Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Machine Learning 29(2-3):131–163
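As a rough illustration of the link-based features in group C, the sketch below computes in-degree, out-degree and a basic power-iteration PageRank on a tiny host graph. The graph, the host names, the damping factor of 0.85 and the function names are all assumptions made for illustration; this is not the feature extraction used to build the WEBSPAM-UK datasets.

```python
def degrees(graph):
    """Return (in_degree, out_degree) dicts for a dict-of-lists graph."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = {node: 0 for node in graph}
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    return in_deg, out_deg


def pagerank(graph, damping=0.85, iterations=50):
    """Basic power-iteration PageRank; dangling nodes spread rank uniformly."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        dangling = sum(rank[node] for node, targets in graph.items() if not targets)
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in graph}
        for node, targets in graph.items():
            for t in targets:
                new_rank[t] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical host graph: three "farm" hosts all link to one target host.
graph = {
    "farm1": ["target"],
    "farm2": ["target"],
    "farm3": ["target"],
    "target": ["farm1"],
    "honest": ["target", "farm1"],
}
in_deg, out_deg = degrees(graph)
ranks = pagerank(graph)
```

In this toy graph the heavily linked `target` host ends up with the highest PageRank, which is the kind of rank boosting that link-based features are meant to expose.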

8 Related Work(3/3) Performance Evaluation
Because the datasets are highly unbalanced binary classification problems, common evaluation measures are not well suited, so the paper uses AUC (area under the ROC curve) as the performance measure. The receiver operating characteristic (ROC) curve is obtained by plotting the true positive rate against the false positive rate at various threshold values. AUC measures how well the model predicts spamicity. A perfect model scores an AUC of 1, while an AUC of 0.5 is equivalent to flipping a coin.
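To make the AUC definition concrete, here is a minimal sketch that computes AUC as the fraction of (spam, non-spam) pairs in which the spam sample receives the higher score, which is equivalent to the area under the ROC curve. The labels and scores are made-up values, and the function name `auc` is an assumption for illustration.

```python
def auc(labels, scores):
    """AUC as the fraction of (spam, non-spam) pairs ranked correctly.

    Ties between a spam and a non-spam score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical unbalanced data: 1 = spam, 0 = non-spam.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
perfect = [0.9, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.1]
constant = [0.0] * 10  # a model that always says "non-spam"

# Accuracy of the constant model: it is right on every non-spam sample.
accuracy_constant = sum(1 for y in labels if y == 0) / len(labels)
```

On this toy data the constant model reaches 90% accuracy but only an AUC of 0.5, which shows why accuracy is misleading on unbalanced data while AUC is not.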

9 Methodology(1/1)
Several machine learning algorithms from the top 10 data mining algorithms are described and evaluated in the paper:
Support Vector Machine (SVM)
Multilayer Perceptron Neural Network (MLP)
Bayesian Network (BN)
C4.5 Decision Tree (DT) → as example
Random Forest (RF)
Naïve Bayes (NB)
K-nearest Neighbour (KNN)
Furthermore, several meta-algorithms are presented to enhance the AUC results of the selected machine learning algorithms:
Boosting algorithms → as example
Bagging
Dagging
Rotation Forest
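As one example of how a meta-algorithm wraps a base learner, the sketch below shows bagging: each base learner is trained on a bootstrap sample (drawn with replacement) and the ensemble predicts by majority vote. The 1-nearest-neighbour base learner, the one-feature toy data and all function names are assumptions chosen to keep the example short; the paper's own experimental setup is not reproduced here.

```python
import random
from collections import Counter


def train_1nn(sample):
    """Base learner: a 1-nearest-neighbour classifier over (feature, label) pairs."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def bagging(data, n_learners=25, seed=0):
    """Train each base learner on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        sample = [rng.choice(data) for _ in data]  # sampling with replacement
        learners.append(train_1nn(sample))

    def predict(x):
        votes = Counter(learner(x) for learner in learners)
        return votes.most_common(1)[0][0]  # majority vote
    return predict


# Hypothetical one-feature data: (feature value, label).
data = [(0.1, "non-spam"), (0.4, "non-spam"), (0.7, "non-spam"),
        (1.0, "non-spam"), (1.2, "non-spam"), (1.5, "non-spam"),
        (8.0, "spam"), (8.5, "spam"), (9.0, "spam"),
        (9.3, "spam"), (9.7, "spam"), (10.0, "spam")]
model = bagging(data)
```

Calling `model(9.5)` or `model(0.5)` returns the majority label among the 25 base learners.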

10 Experiment(1/3)
WEBSPAM-UK2007 (B+C, 137 features): 223 spam, 5248 non-spam
AUC = 0.693
DT decides the target class of a new sample based on features selected from the available data using the concept of information entropy.

11 Experiment(2/3)
Decision tree diagram labels: input (sample set); attributes/features; possible decision (information gain); class
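The information gain used to choose a split can be sketched as follows. This is a minimal illustration on made-up, binarised features; the feature names and functions are assumptions. Note that C4.5 itself ranks splits by gain ratio, a normalised variant, whereas the sketch shows plain information gain as named on the slide.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical samples with two binarised features.
rows = [
    {"many_outlinks": True, "long_title": True},
    {"many_outlinks": True, "long_title": False},
    {"many_outlinks": False, "long_title": True},
    {"many_outlinks": False, "long_title": False},
]
labels = ["spam", "spam", "non-spam", "non-spam"]

# The feature with the highest information gain becomes the split.
best = max(rows[0], key=lambda f: information_gain(rows, labels, f))
```

Here `many_outlinks` separates the two classes perfectly (gain of 1 bit) while `long_title` carries no information (gain of 0), so the tree would split on `many_outlinks` first.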

12 Experiment(3/3)
DT-based AdaBoost algorithm: AUC = 0.769 > 0.693 (DT alone)
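The idea behind boosting a tree learner can be sketched with discrete AdaBoost over decision stumps (one-level trees). This is a simplified illustration under stated assumptions: the one-feature toy data, the stump learner and the function names are hypothetical, and the sketch does not reproduce the paper's setup or its AUC figures.

```python
import math


def best_stump(xs, ys, weights):
    """Pick the threshold/polarity stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Discrete AdaBoost over stumps; labels must be +1 (spam) or -1 (non-spam)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified samples, then renormalise.
        preds = [polarity if x >= threshold else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data that no single stump separates: spam sits in the middle.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [-1, -1, -1, 1, 1, 1, -1, -1, -1]
single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
```

On this toy data a single stump classifies 6 of 9 training samples correctly, while three boosted stumps classify all 9, which mirrors in miniature how boosting can lift a weak tree learner.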

13 Results & Discussions(1/3)
B+C: AUC = 0.693

14 Results & Discussions(2/3)

15 Results & Discussions(3/3)

16 Conclusion & Future Work
Random Forest has proven to be a more powerful classifier than most top data mining tools, including SVM and MLP, in Web spam detection.
For future work, the features for Web spam detection are intended to be comprehensively compared and studied:
Comprehensively compare features
Add new features

17 Thanks

