Know your Neighbors: Web Spam Detection using the Web Topology By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri.


1 Know your Neighbors: Web Spam Detection using the Web Topology By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri Presented by Sovandy Hang CS 4440, Fall 2007

2 Outline
- About me
- Introduction
- Keywords
- How the process works
- Conclusion
- Questions and answers

3 About Me
- 5th-year CS and IE major
- Graduating next summer
- Interest: Enterprise Resource Planning
- Think all software should be open source

4 Introduction
- Web search is a part of our lives.
- Many businesses rely on the web.
- There is a huge economic incentive for commercial websites to influence search results.
- Web spamming is cheap and often successful.
- Web spam degrades the quality of search engines.
- Web spam is annoying.

5 Keywords
- Web spam
- PageRank
- Spamdexing
- Spamicity
- Graph-based algorithm

6 Measurement Tool

7 How does it work?
Feature Extraction -> Classification -> Smoothing (Clustering, Propagation, Stacked Graphical Learning)

8 Feature Extraction
- The data set is obtained with a web crawler.
- For each page, its links and contents are obtained.
- From the data set, a full graph is built.
- For each host and page, certain features are computed.
- Link-based features are extracted from the host graph.
- Content-based features are extracted from individual pages.
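The step from crawled page links to a host graph could be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_host_graph`, the `(source_url, target_url)` input format, and the choice to weight each host-to-host edge by the number of page links are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def build_host_graph(page_links):
    """Collapse page-level links into a weighted host-level graph.

    page_links: iterable of (source_url, target_url) pairs (assumed format).
    Returns {source_host: {target_host: number_of_page_links}}.
    """
    host_graph = defaultdict(lambda: defaultdict(int))
    for src, dst in page_links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        # Skip malformed URLs and links that stay inside a single host.
        if src_host is None or dst_host is None or src_host == dst_host:
            continue
        host_graph[src_host][dst_host] += 1
    return {host: dict(out) for host, out in host_graph.items()}


# Hypothetical crawl output: four page-level links.
page_links = [
    ("http://a.example/index.html", "http://b.example/"),
    ("http://a.example/about.html", "http://b.example/news"),
    ("http://a.example/index.html", "http://a.example/about.html"),
    ("http://b.example/", "http://c.example/"),
]
host_graph = build_host_graph(page_links)
```

Here the two page links from a.example to b.example collapse into one host edge of weight 2, and the link inside a.example is dropped.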

9 Link-based Features
Some important link-based features are:
- Degree-related measures
- PageRank
- TrustRank
- Truncated PageRank
- Estimation of supporters
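Two of the features listed above, in-degree and PageRank, can be sketched on a small graph. This is a textbook power-iteration PageRank, not the implementation used in the paper; the damping factor of 0.85, the iteration count, and the `{node: [successors]}` graph format are assumptions.

```python
def in_degree(graph):
    """Count incoming links per node for a graph given as {node: [successors]}."""
    degree = {v: 0 for v in graph}
    for targets in graph.values():
        for t in targets:
            degree[t] = degree.get(t, 0) + 1
    return degree


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; scores sum to 1."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not graph.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in graph.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical host graph: three hosts point at "c".
example_graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
example_rank = pagerank(example_graph)
example_in_degree = in_degree(example_graph)
```

In this toy graph the host `c`, which receives links from three hosts, ends up with both the highest in-degree and the highest PageRank.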

10 Content-based Features
Some important content-based features are:
- Fraction of visible text
- Compression rate
- Corpus precision and corpus recall
- Query precision and query recall
- Independent trigram likelihood
- Entropy of trigrams
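Two of these features are easy to illustrate. The sketch below assumes that compression rate means compressed size divided by raw size, uses `zlib` as a stand-in compressor, and computes entropy over whitespace-separated word trigrams; the paper's exact definitions and tools may differ.

```python
import math
import zlib
from collections import Counter


def compression_rate(text):
    """Compressed size divided by raw size; lower means more repetitive text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def trigram_entropy(text):
    """Shannon entropy (in bits) of the distribution of word trigrams."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    total = len(trigrams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical pages: one keyword-stuffed, one with varied vocabulary.
stuffed = "cheap pills " * 200
varied = " ".join(f"word{i}" for i in range(400))
```

Keyword-stuffed text compresses much better and has far lower trigram entropy than text with a varied vocabulary, which is why both features carry a spam signal.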

11 Classification
- Create a base classifier from the link-based and content-based features.
- Apply a cost-sensitive decision tree to classify hosts as spam or non-spam.
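The cost-sensitive part can be illustrated independently of any particular tree learner. The sketch below shows one common way to make a decision cost-sensitive: at a leaf, pick the label with the lower expected misclassification cost. The function name, the cost values, and the leaf-probability input are assumptions; this is not the classifier configuration used in the paper.

```python
def min_cost_label(p_spam, cost_false_negative=2.0, cost_false_positive=1.0):
    """Return the label with the lower expected misclassification cost.

    p_spam: estimated probability that the host is spam, for example the
    fraction of spam training hosts in a decision-tree leaf (assumed input).
    """
    # Expected cost of answering "nonspam": we miss a spam host.
    cost_if_nonspam = p_spam * cost_false_negative
    # Expected cost of answering "spam": we flag a normal host.
    cost_if_spam = (1.0 - p_spam) * cost_false_positive
    return "spam" if cost_if_spam < cost_if_nonspam else "nonspam"
```

With equal costs, a leaf with 40% spam is labeled non-spam; if missing a spam host costs twice as much as a false alarm, the same leaf is labeled spam.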

12 Smoothing
- The base classifier has now labeled each host as spam or non-spam.
- Smoothing improves on the base classifier's predictions.
- Some smoothing techniques are:
  - Clustering
  - Propagation
  - Stacked graphical learning

13 Smoothing (Cont.)
Based on the topological dependencies of spam nodes:
- Links are not placed at random.
- Similar pages tend to link to each other more frequently than dissimilar pages.
Or, put differently:
- Spam tends to be clustered on the Web.
- Non-spam nodes tend to be linked by very few spam nodes, and usually link to no spam nodes.
- Spam nodes are mainly linked by spam nodes.


15 Smoothing - Clustering
- Split the graph into many clusters.
- Use the METIS graph clustering algorithm.
- If the majority of nodes in a cluster are spam, then all hosts in the cluster are labeled spam.
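The majority rule above can be sketched once a cluster assignment is available. The code assumes the clustering has already been computed elsewhere (for example by METIS) and is passed in as a plain `{host: cluster_id}` mapping; the function name and data formats are hypothetical.

```python
from collections import defaultdict


def smooth_by_cluster(predictions, cluster_of):
    """Relabel every host in a cluster as spam when most of its hosts are spam.

    predictions: {host: True if the base classifier says spam}.
    cluster_of:  {host: cluster id}, assumed to come from an external
                 graph clustering step.
    """
    members = defaultdict(list)
    for host, cluster_id in cluster_of.items():
        members[cluster_id].append(host)

    smoothed = dict(predictions)
    for hosts in members.values():
        spam_votes = sum(1 for h in hosts if predictions.get(h, False))
        if 2 * spam_votes > len(hosts):  # strict majority of spam hosts
            for h in hosts:
                smoothed[h] = True
    return smoothed


# Hypothetical base predictions and clusters.
base = {"a": True, "b": True, "c": False, "d": True, "e": False, "f": False}
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
smoothed = smooth_by_cluster(base, clusters)
```

Host `c` is relabeled as spam because two of the three hosts in its cluster are spam; cluster 1 has no spam majority, so its labels are left unchanged.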

16 Smoothing - Propagation
- Propagate predictions using random walks.
- Start from the nodes labeled as spam by the base classifier, then follow links forward or backward.
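One way to realize this idea is a random walk with restart whose restart set is the hosts the base classifier labeled as spam. The sketch below is an assumption about how such propagation might look, not the paper's exact procedure; the damping factor, iteration count, and function names are hypothetical. Backward propagation is obtained by running the same walk on the reversed graph.

```python
def reverse_graph(graph):
    """Flip every edge of a graph given as {node: [successors]}."""
    reversed_graph = {v: [] for v in graph}
    for v, targets in graph.items():
        for t in targets:
            reversed_graph.setdefault(t, []).append(v)
    return reversed_graph


def propagate_spam(graph, spam_seeds, damping=0.85, iterations=50):
    """Random walk with restart at the spam seeds; returns a score per node."""
    seeds = set(spam_seeds)
    if not seeds:
        raise ValueError("at least one spam seed is required")
    nodes = set(graph) | seeds
    for targets in graph.values():
        nodes.update(targets)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    score = dict(restart)
    for _ in range(iterations):
        new_score = {v: (1.0 - damping) * restart[v] for v in nodes}
        for v in nodes:
            targets = graph.get(v, [])
            if targets:
                share = damping * score[v] / len(targets)
                for t in targets:
                    new_score[t] += share
            else:
                # A walker stuck on a node without out-links restarts at the seeds.
                for s in seeds:
                    new_score[s] += damping * score[v] / len(seeds)
        score = new_score
    return score


# Hypothetical graph: s1 and s2 link to each other, s2 and good link to x.
walk_graph = {"s1": ["s2"], "s2": ["s1", "x"], "x": [], "good": ["x"]}
forward = propagate_spam(walk_graph, {"s1"})
backward = propagate_spam(reverse_graph(walk_graph), {"x"})
```

Walking forward from `s1` gives `good` a score of zero because nothing reachable from the seed links to it, while walking backward from `x` reaches every host that links to `x`, including `good`.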

17 Smoothing - Stacked Graphical Learning
- It is a machine learning process.
- It creates extra features in addition to the content-based and link-based ones.
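A typical extra feature of this kind is the average base-classifier prediction over a host's neighbors, which is then appended to the host's feature vector before retraining. The sketch below assumes that definition and treats both in-links and out-links as neighbors; the function name and data formats are hypothetical.

```python
def neighbor_average_feature(graph, base_prediction):
    """Average base prediction over each host's in- and out-neighbors.

    graph: {host: [hosts it links to]}.
    base_prediction: {host: predicted spamicity in [0, 1]} (assumed format).
    Hosts without neighbors get 0.0.
    """
    neighbors = {v: set() for v in base_prediction}
    for v, targets in graph.items():
        for t in targets:
            if v in neighbors and t in neighbors and v != t:
                neighbors[v].add(t)
                neighbors[t].add(v)
    feature = {}
    for v, linked in neighbors.items():
        if linked:
            feature[v] = sum(base_prediction[n] for n in linked) / len(linked)
        else:
            feature[v] = 0.0
    return feature


# Hypothetical base predictions for four hosts.
stack_graph = {"a": ["b", "c"], "b": ["c"]}
stack_base = {"a": 1.0, "b": 1.0, "c": 0.0, "d": 0.0}
extra = neighbor_average_feature(stack_graph, stack_base)
```

Host `c` was predicted non-spam, but both of its neighbors were predicted spam, so its extra feature is 1.0; a classifier retrained with this feature can use that neighborhood evidence.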

18 Conclusion
- The approach is based on the assumption that spammers tend to be linked together.
- Using both link-based and content-based features improves detection quality.
- It can be used on web datasets of any size.
- The paper does not explain each step very well.

19 Useful Reading
- "Using Spam Farm to Boost PageRank" by Ye Du, Yaoyun Shi, Xin Zhao
- "Using Annotations in Enterprise Search" by Pavel A. Dmitriev, Nadav Eiron, Marcus Fontoura, Eugene Shekita

20 Questions?

