Source: Procedia Computer Science (2015) 70:434-441


Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification
Source: Procedia Computer Science (2015) 70:434-441
Authors: Kwang Leng Goh, Ashutosh Kumar Singh
Speaker: Jia Qing Wang
Date: 2016/10/13

Outline
Introduction
Related Work
Methodology
Experiment
Results & Discussions
Conclusion & Future Work

Introduction (1/3)
Motivating example: the website of a municipal health bureau was hijacked to display detectaphone (eavesdropping device) advertisements.

Introduction (2/3)

Introduction (3/3)
The intention of Web spam is to mislead search engines into boosting a page to an undeserved rank, leading Web users to irrelevant information. Spam detection is therefore needed; however, the process is often time-consuming, expensive, and difficult to automate because of the massive amounts of data and their multi-dimensional attributes. Machine learning methods provide an ideal solution due to their adaptive ability to learn the underlying patterns that separate spam from non-spam [22]. The paper points out that …
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI 10.1145/1964114.1964121

Related Work (1/3)
Web spam detection is cast as a classification problem:
Extract features: novel high-quality features for web pages, e.g. link features, content features, ……
Dimensionality reduction: feature selection and feature extraction methods such as PCA and LDA
Train a classifier
Report experiment results
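The dimensionality-reduction step of this pipeline can be sketched as follows. This is an illustrative example, not the paper's setup: the data is a random stand-in for a page-feature matrix, and the choice of 20 components is an assumption.

```python
# Illustrative sketch of the dimensionality-reduction step using PCA,
# one of the feature-extraction methods the slide mentions.
# The data below is a random stand-in, not real Web spam features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 137))   # 200 pages x 137 raw features (assumed)

pca = PCA(n_components=20)        # keep the 20 strongest components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)            # (200, 20)
```

The reduced matrix would then be fed to the classifier in place of the raw 137-dimensional feature vectors.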

Related Work (2/3)
Datasets: WEBSPAM-UK2006 and WEBSPAM-UK2007, with four feature sets:
A & B: the number of words in the page, number of words in the title, average word length, fraction of anchor text and visible text, …
C: in-degree, out-degree, PageRank, TrustRank, truncated PageRank
D: simple numeric transformations and combinations of the link-based features
Both datasets provide evaluated sets, SET 1 for training and SET 2 for testing, since the motivation behind the Web Spam Challenge series is to provide machine-learning solutions to combat Web spam. The distribution of feature vectors is shown as: (table not preserved in the transcript)
[22] Erdélyi M, Garzó A, Benczúr AA (2011) Web spam classification: a few features worth more. DOI 10.1145/1964114.1964121
[24] Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: ICML, vol 96, pp 148–156
[26] Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Machine Learning 29(2-3):131–163

Related Work (3/3)
Performance evaluation: because this is a binary classification problem and the datasets are very unbalanced, common evaluation measures are not well suited, so the paper uses AUC as the performance measure. The receiver operating characteristic (ROC) curve is obtained by plotting the true positive rate against the false positive rate at various threshold values. AUC measures how well the model predicts spamicity: a perfect model scores an AUC of 1, while an area of 0.5 is no better than flipping a coin.
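The ROC/AUC computation described above can be sketched with scikit-learn. The labels and spamicity scores below are made up for illustration; they are not from the paper's datasets.

```python
# Illustrative sketch: ROC curve and AUC for a binary spam classifier.
# Labels and scores below are invented toy values, not paper data.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])   # 1 = spam
y_score = np.array([0.1, 0.3, 0.2, 0.65, 0.8,
                    0.35, 0.6, 0.9, 0.15, 0.7])      # predicted spamicity

# ROC: true positive rate vs. false positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.3f}")  # 0.958 here: one spam page ranks below a non-spam page
```

An AUC of 1 would mean every spam page is scored above every non-spam page; 0.5 corresponds to random ranking, which is why AUC is robust to the heavy class imbalance in these datasets.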

Methodology (1/1)
Several machine learning algorithms from the top 10 data mining algorithms are described and evaluated in the paper:
Support Vector Machine (SVM)
Multilayer Perceptron Neural Network (MLP)
Bayesian Network (BN)
C4.5 Decision Tree (DT) → used as the worked example
Random Forest (RF)
Naïve Bayes (NB)
K-nearest Neighbour (KNN)
Furthermore, several meta-algorithms are presented to enhance the AUC results of the selected machine learning algorithms:
Boosting algorithms → used as the worked example
Bagging
Dagging
Rotation Forest

Experiment (1/3)
Dataset: WEBSPAM-UK2007 (B+C, 137 features); spam: 223, non-spam: 5248.
Decision tree alone: AUC = 0.693.
A DT decides the target class of a new sample based on features selected from the available data using the concept of information entropy.
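This experimental setup can be sketched as follows. Note this is not the paper's pipeline: the data is a synthetic stand-in with the same size and imbalance as WEBSPAM-UK2007 (B+C), and scikit-learn's entropy-criterion tree is used as a stand-in for C4.5.

```python
# Illustrative sketch, not the paper's exact pipeline: an entropy-based
# decision tree (close in spirit to C4.5) evaluated by AUC on an
# imbalanced synthetic dataset matching WEBSPAM-UK2007's shape.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 137 features, ~4% positives, like 223 spam / 5248 ham.
X, y = make_classification(n_samples=5471, n_features=137,
                           n_informative=20, weights=[0.96],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_tr, y_tr)
# Use the predicted probability of the spam class as the spamicity score.
scores = tree.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)
print(f"AUC = {auc:.3f}")
```

The `stratify=y` split preserves the spam/non-spam ratio in both halves, which matters when positives make up only about 4% of the data.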

Experiment (2/3)
Decision tree induction takes as input the sample set and its attributes/features, chooses the splitting attribute at each node by information gain, and outputs the predicted class.
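The entropy and information-gain computation behind that attribute choice can be sketched directly. The toy attribute and labels below are invented for illustration.

```python
# Illustrative sketch: the entropy / information-gain criterion a
# C4.5-style decision tree uses to pick the splitting attribute.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """Entropy reduction from splitting on one discrete attribute."""
    n = len(labels)
    split = {}
    for sample, label in zip(samples, labels):
        split.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# Toy data (invented): does a page with many anchor links tend to be spam?
samples = [{"many_anchors": v} for v in ["yes", "yes", "no", "no", "yes", "no"]]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]
print(information_gain(samples, labels, "many_anchors"))  # 1.0: a perfect split
```

The tree builder evaluates this gain for every candidate attribute, splits on the best one, and recurses on each partition until the leaves are (near-)pure.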

Experiment (3/3)
DT-based AdaBoost algorithm: AUC = 0.769 > 0.693, improving on the single decision tree.
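The boost from wrapping the tree in AdaBoost can be sketched as below. Again this is an assumption-laden stand-in: synthetic data, scikit-learn's default tree-stump base learner, and arbitrary sizes, not the paper's configuration.

```python
# Illustrative sketch: a single shallow tree vs. 100 AdaBoost-ed trees,
# mirroring the slide's DT -> DT+AdaBoost comparison. Data and parameters
# here are assumptions, not the paper's setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.95],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: one depth-1 tree (a decision stump).
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
single_auc = roc_auc_score(
    y_te, stump.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

# AdaBoost: 100 stumps, each reweighted toward the previous one's mistakes.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
boosted_auc = roc_auc_score(
    y_te, boosted.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
print(f"single stump AUC={single_auc:.3f}, boosted AUC={boosted_auc:.3f}")
```

Boosting reweights the training samples after each round so later trees focus on the pages the earlier trees misclassified, which is what lifts the ensemble's AUC above the single tree's.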

Results & Discussions (1/3)
B+C: AUC = 0.693

Results & Discussions (2/3)

Results & Discussions (3/3)

Conclusion & Future Work
Random Forest has proven to be a more powerful classifier than most top data mining tools, including SVM and MLP, in Web spam detection.
For future work, the features for Web spam detection are to be comprehensively compared and studied, and new features added.

Thanks