Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification
Source: Procedia Computer Science (2015) 70:434-441
Authors: Kwang Leng Goh, Ashutosh Kumar Singh
Speaker: Jia Qing Wang
Date: 2016/10/13
Outline
Introduction
Related Work
Methodology
Experiment
Results & Discussions
Conclusion & Future Work
Introduction(1/3) The website of a municipal health bureau was attacked and filled with advertisements for eavesdropping devices (detectaphones).
Introduction(2/3)
Introduction(3/3)
The intention of Web spam is to mislead search engines by boosting a page to an undeserved rank, which leads Web users to irrelevant information.
Spam detection is needed; however, the process is often time-consuming, expensive and difficult to automate because of the massive amounts of data and the multi-dimensional attributes.
Machine learning methods provide an ideal solution because of their adaptive ability to learn the underlying patterns for classifying spam and non-spam [22].
The paper points out that …
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI 10.1145/1964114.1964121
Related Work(1/3)
The Web spam detection problem is treated as a classification problem.
Pipeline: extract features; dimensionality reduction; classifier; experiment result
Extract features: novel high-quality features for web pages (link features, content features, ……)
Dimensionality reduction: use feature selection and feature extraction methods (PCA, LDA)
Related Work(2/3)
Datasets: WEBSPAM-UK2006 and WEBSPAM-UK2007
Feature sets A & B (content-based): the number of words in the page, number of words in the title, average word length, fraction of anchor text and visible text…
Feature set C (link-based): in-degree, out-degree, PageRank, TrustRank, truncated PageRank
Feature set D: simple numeric transformations and combinations of the link-based features
Both datasets provide predefined evaluation sets, SET 1 for training and SET 2 for testing, because the motivation behind the Web Spam Challenge Series is to provide machine learning solutions to combat Web spam.
The distribution of feature vectors is shown as:
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI 10.1145/1964114.1964121
[24] Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: ICML, vol 96, pp 148–156
[26] Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Machine Learning 29(2-3):131–163
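To make the link-based features of set C concrete, here is a minimal sketch that computes in-degree, out-degree and PageRank on a toy host graph. The four-host graph, the damping factor of 0.85 and the function names are assumptions for illustration only; the datasets ship these features precomputed, and TrustRank and truncated PageRank are not covered here.

```python
def degrees(graph):
    """Return (in_degree, out_degree) dicts for a graph given as {node: [targets]}.

    Every link target is assumed to appear as a key of `graph`.
    """
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    return in_deg, out_deg


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their rank uniformly."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is shared by all nodes.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


# Hypothetical toy host graph: three hosts all link to "hub".
toy_graph = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "hub": [],
}
in_deg, out_deg = degrees(toy_graph)
ranks = pagerank(toy_graph)
```

In this toy graph "hub" has the highest in-degree and the highest PageRank, which is the kind of signal a link farm tries to manufacture for its target page.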
Related Work(3/3) Performance Evaluation
Web spam detection is a binary classification problem, and the datasets used are highly unbalanced, so common evaluation measures are not well suited. The paper therefore uses the area under the ROC curve (AUC) for performance evaluation.
The receiver operating characteristic (ROC) curve is obtained by plotting the true positive rate against the false positive rate at various threshold values.
AUC measures how well the model predicts spamicity. A perfect model scores an AUC of 1, while an AUC of 0.5 is equivalent to flipping a coin.
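As a rough illustration of the measure, AUC can be computed as the fraction of (spam, non-spam) pairs in which the spam sample receives the higher score, with ties counted as half. This pairwise form is equivalent to the area under the ROC curve. The function name and the toy labels and scores below are assumptions, not data from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    `labels` holds 1 (spam) or 0 (non-spam); ties between scores count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical spamicity scores for two spam and four non-spam hosts.
toy_labels = [1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)            # 7 of 8 pairs ranked correctly
coin_flip_auc = roc_auc(toy_labels, [0.5] * 6)       # constant score: every pair is a tie
```

The quadratic pair loop is only suitable for small examples; sorting-based implementations are used on real data.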
Methodology(1/1)
Several machine learning algorithms from the top 10 data mining algorithms are described and evaluated in the paper:
Support Vector Machine (SVM)
Multilayer Perceptron Neural Network (MLP)
Bayesian Network (BN)
C4.5 Decision Tree (DT)
Random Forest (RF)
Naïve Bayes (NB)
K-nearest Neighbour (KNN)
Furthermore, several meta-algorithms are presented to enhance the AUC results of the selected machine learning algorithms:
Boosting algorithms
Bagging
Dagging
Rotation Forest
Experiment(1/3)
Dataset: WEBSPAM-UK2007 (feature sets B+C, 137 features); Spam: 223, Non-spam: 5248
DT: AUC = 0.693
DT decides the target class of a new sample based on features selected from the available data, using the concept of information entropy.
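The class counts on this slide show how unbalanced the data are. The short calculation below uses only those two counts; the variable names are illustrative. It shows that a classifier which always answers "non-spam" would already look accurate, which is why accuracy is a poor measure here.

```python
spam, non_spam = 223, 5248       # class counts quoted on the slide
total = spam + non_spam

spam_fraction = spam / total     # share of spam samples
# Accuracy of a trivial classifier that labels every sample "non-spam".
majority_accuracy = non_spam / total

print(f"samples: {total}")
print(f"spam fraction: {spam_fraction:.2%}")
print(f"always-non-spam accuracy: {majority_accuracy:.2%}")
```

Such a trivial classifier assigns the same score to every sample, so its AUC is 0.5 despite the high accuracy.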
Experiment(2/3) Input: sample set; attributes/features; possible decision (information gain); class
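The role of information gain in choosing a split can be sketched as follows. This is a simplified illustration on categorical attributes, not the paper's setup: C4.5 itself normalizes the gain to a gain ratio and also handles numeric attributes. The attribute names and the four toy samples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(group) / len(labels) * entropy(group)
                    for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy samples with two made-up binary attributes.
rows = [
    {"many_outlinks": "yes", "long_title": "yes"},
    {"many_outlinks": "yes", "long_title": "no"},
    {"many_outlinks": "no", "long_title": "yes"},
    {"many_outlinks": "no", "long_title": "no"},
]
labels = ["spam", "spam", "non-spam", "non-spam"]

gains = {attribute: information_gain(rows, labels, attribute)
         for attribute in rows[0]}
best_attribute = max(gains, key=gains.get)
```

Here "many_outlinks" separates the classes perfectly (gain of 1 bit) while "long_title" carries no information (gain of 0), so the tree would split on "many_outlinks" first.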
Experiment(3/3) DT-based AdaBoost algorithm: AUC = 0.769 > 0.693 (DT alone)
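The boosting idea can be sketched with discrete AdaBoost over decision stumps. Note the substitution: one-level stumps stand in for the C4.5 trees used as base learner in the paper, and the one-dimensional toy data, function names and round counts are assumptions chosen so that no single stump can separate the classes.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best


def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity


def adaboost(X, y, rounds=10):
    """Discrete AdaBoost; labels must be +1 or -1. Returns [(alpha, stump), ...]."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)   # guard against log of infinity for a perfect stump
        if err >= 0.5:               # no better than chance: stop boosting
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight misclassified samples, down-weight correct ones, renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        norm = sum(w)
        w = [wi / norm for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical 1-D toy data: positives on both sides of the negatives.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]

single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
single_correct = sum(predict(single, row) == yi for row, yi in zip(X, y))
boosted_correct = sum(predict(boosted, row) == yi for row, yi in zip(X, y))
```

On this toy set one stump classifies 4 of 6 samples correctly, while the weighted vote of three stumps classifies all 6, mirroring in miniature how boosting can lift a weak base learner.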
Results & Discussions(1/3) B+C: AUC = 0.693
Results & Discussions(2/3)
Results & Discussions(3/3)
Conclusion & Future Work
Random Forest has proven to be a more powerful classifier than most top data mining tools, including SVM and MLP, in Web spam detection.
For future work, the features for Web spam detection are to be comprehensively compared and studied:
Comprehensively compare features
Add new features
Thanks