KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 2008.10.30. 이 승 민.

-2/20- Contents
  1. Introduction
  2. Data Set
  3. Features
  4. Classification
  5. Smoothing
  6. Conclusion

-3/20- 1. Introduction
What is web spam?
  Web spam includes malicious attempts to influence the outcome of ranking algorithms.
  Web spam is not a new problem, and it is not likely to be solved in the near future.
How can it be detected?
  Traditional machine learning assumes that data instances are independent.
  On the Web, there are dependencies among pages and hosts.

-4/20- 1. Introduction
Host graph (figure)
  White node: non-spam node
  Black node: spam node
  An edge: more than 100 links

-5/20- 1. Introduction
Previous work
  Link spam
    Creating a link structure aimed at affecting the outcome of a link-based ranking algorithm
  Content spam
    Maliciously crafting the content of Web pages (e.g., inserting keywords)
    Detection methods are similar to those used in e-mail spam filtering
  Cloaking
    Sending different content to a search engine than to the regular visitors of a web site

-6/20- 1. Introduction
Overall scheme: Data Set -> Feature Extraction -> Classification -> Smoothing
  Data Set
    WEBSPAM-UK2006: 77.9 million pages, 3 billion links, 11,400 hosts
    Host-level labeling
  Feature Extraction (link + content based features)
    236 features in total
    Link features: 140
    Content features: 96
  Classification
    Decision tree (C4.5)
    Result: 45 features, F-measure 0.723
  Smoothing (3 techniques, using the link structure)
    Graph clustering
    Propagation using random walks
    Stacked graphical learning: 40 features, F-measure 0.763

-7/20- 2. Data Set
Data Set
  WEBSPAM-UK2006, a publicly available dataset
  77.9 million pages and over 3 billion links in about 11,400 hosts
  Table columns: Host name, Judgment, Spamicity, Label for the host

-8/20- 2. Data Set
Measures
  Confusion matrix (rows: true label "In", columns: predicted label "Out")
    In \ Out        Non-spam (-)   Spam (+)
    Non-spam (-)    a              b
    Spam (+)        c              d
  Precision, recall, F-measure
    P = d / (b + d)
    R = d / (c + d)
    F = 2PR / (P + R)
  True positive rate, false positive rate, ROC curve
    TP = d / (c + d)
    FP = b / (a + b)
Validation
  Tenfold cross-validation
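The measures on this slide can be computed directly from the four confusion-matrix cells. The following is a minimal sketch; the class name `Confusion` and the example counts are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    a: int  # true non-spam, predicted non-spam
    b: int  # true non-spam, predicted spam
    c: int  # true spam, predicted non-spam
    d: int  # true spam, predicted spam

    def precision(self) -> float:
        return self.d / (self.b + self.d)

    def recall(self) -> float:
        # Identical to the true positive rate TP = d / (c + d).
        return self.d / (self.c + self.d)

    def f_measure(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)

    def false_positive_rate(self) -> float:
        return self.b / (self.a + self.b)


# Hypothetical counts, for illustration only.
m = Confusion(a=90, b=10, c=5, d=15)
print(m.precision(), m.recall(), m.f_measure(), m.false_positive_rate())
```

Note that recall and the true positive rate are the same quantity, so only the false positive rate adds information for the ROC curve.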

-9/20- 3. Features
Link-based features (140 features)
  Using most of the 163 features proposed by Becchetti et al. [4]
  Degree-related measures (16/17)
    Measures related to the in-degree and out-degree
    16 degree-related features
  PageRank (11/28)
    Link-based ranking algorithm that computes a score for each page
    11 PageRank-based features
  TrustRank ( /35)
    Algorithm that estimates a TrustRank score for each page
    The same algorithm is also used to estimate the spam mass of a page

-10/20- 3. Features
  Truncated PageRank ( /60)
    A variant of PageRank that diminishes the influence of a page on the PageRank score of its neighbors
  Estimation of supporters
    x is a d-supporter of y if the shortest path from x to y has length d
    N_d(x) is the set of the d-supporters of page x
      |N_d(x)| is an increasing function with respect to d
    Bottleneck number of page x: $b_d(x) = \min_{j \le d} \{ |N_j(x)| / |N_{j-1}(x)| \}$
    (Figure annotations: 2.2 and 1.3~1.7)
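As a rough illustration of the supporter counts and the bottleneck number, the sketch below runs a breadth-first search over incoming links. It rests on several assumptions that are not stated on the slide: supporters are counted cumulatively (all nodes that reach x within j hops, which is one reading that makes |N_d(x)| non-decreasing in d), x itself is counted at distance 0, and the bottleneck number is taken as the minimum growth ratio |N_j(x)| / |N_{j-1}(x)| for j up to d. The function names and the toy graph are hypothetical; an exact search like this would not scale to a Web-sized graph.

```python
from collections import deque


def supporter_counts(in_links, x, max_d):
    """Return [|N_0|, ..., |N_max_d|], counted cumulatively (assumption)."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if dist[node] == max_d:
            continue  # do not explore beyond the requested distance
        for src in in_links.get(node, ()):
            if src not in dist:
                dist[src] = dist[node] + 1
                queue.append(src)
    counts = [0] * (max_d + 1)
    for d in dist.values():
        counts[d] += 1
    for j in range(1, max_d + 1):
        counts[j] += counts[j - 1]  # turn per-level counts into cumulative counts
    return counts


def bottleneck_number(in_links, x, d):
    """Minimum growth ratio of the supporter set up to distance d."""
    counts = supporter_counts(in_links, x, d)
    return min(counts[j] / counts[j - 1] for j in range(1, d + 1))


# Toy graph (hypothetical): in_links[v] lists the nodes that link to v.
toy = {"x": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"]}
print(supporter_counts(toy, "x", 3), bottleneck_number(toy, "x", 3))
```

A small bottleneck number means the neighborhood of x stops growing quickly, which is the kind of isolation one would expect around a tightly linked group of hosts.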

-11/20- 3. Features
Content-based features (24 features)
  Using most of the features proposed by Ntoulas et al. [22]
  Number of words in the page
  Number of words in the title
  Average word length
  Fraction of anchor text
    If page A has a link with the anchor text “computer” pointing to page B, we may conclude that page B talks about “computer”
  Compression rate
    Some search engines give higher weight to pages containing the query keywords several times, so spam pages tend to repeat them

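-12/20- 3. Features
  Corpus precision & recall
    Computed against the k most frequent words in the dataset, for k = 100, 200, 500, 1000
  Query precision & recall
    Computed against the q most popular terms in a query log, for q = 100, 200, 500, 1000
  Example: k = 100, a page with 200 words, 10 words in common
    P = 10/200 = 0.05 (fraction of the words in the page that are popular terms)
    R = 10/100 = 0.1 (fraction of the popular terms that appear in the page)
  (Figure: axes are precision and fraction of pages)
  Independent trigram likelihood: $I = -\frac{1}{k} \sum_{w \in T} \log p_w$
    $p_w$: probability distribution of trigrams in a page
    $T$: set of all trigrams in a page
    $k = |T|$: number of distinct trigrams
  Entropy of trigrams: $H = -\sum_{w \in T} p_w \log p_w$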
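The two trigram features can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the formulas used are H = -sum(p_w log p_w) for the entropy and I = -(1/k) sum(log p_w) for the independent likelihood, where p_w is the empirical frequency of trigram w in the page and k is the number of distinct trigrams. Whitespace tokenization, the natural logarithm, and returning 0.0 for pages with fewer than three words are choices made here for simplicity.

```python
import math
from collections import Counter


def trigram_distribution(words):
    """Empirical probability of each distinct word trigram in a page."""
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    total = len(trigrams)
    return {t: n / total for t, n in Counter(trigrams).items()}


def trigram_entropy(p):
    # H = -sum over distinct trigrams of p_w * log(p_w)
    return -sum(q * math.log(q) for q in p.values())


def independent_likelihood(p):
    # I = -(1/k) * sum over distinct trigrams of log(p_w); 0.0 if no trigrams (assumption)
    if not p:
        return 0.0
    return -sum(math.log(q) for q in p.values()) / len(p)


varied = trigram_distribution("cheap flights to london".split())
repeated = trigram_distribution("buy buy buy buy buy".split())
print(trigram_entropy(varied), trigram_entropy(repeated))
```

A page that repeats the same phrase has a single trigram with probability 1, so its entropy is zero, while a page with varied text spreads the probability mass over many trigrams.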

-13/20- 3. Features
From page features to host features
  Content-based feature vector c(h) of host h: $c(h) = \langle c(\hat{h}), c(h^*), E[c(p)], \mathrm{Var}[c(p)] \rangle$
    $P$: the set of pages of host h
    $\hat{h}$: the home page of host h
    $h^*$: the page with the largest PageRank among all pages in P
    $c(p)$: the 24-dimensional content feature vector of page p
    $E[c(p)]$: the average of all vectors c(p), p in P
    $\mathrm{Var}[c(p)]$: the variance of c(p), p in P
  96 (= 4 × 24) content features
  In total, 140 + 96 = 236 link and content features
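The aggregation from page vectors to a host vector can be sketched as a concatenation of four blocks: the home page vector, the vector of the page with the largest PageRank, the per-feature mean, and the per-feature variance. The function name, the dictionary inputs, and the use of the population variance are assumptions made for this illustration; the slide does not say which variance estimator is used.

```python
from statistics import fmean, pvariance


def host_content_vector(page_vectors, home_page, pagerank):
    """Concatenate home-page, top-PageRank, mean and variance feature blocks.

    page_vectors: dict mapping page -> list of content features (same length)
    home_page:    key of the host's home page in page_vectors
    pagerank:     dict mapping page -> PageRank score
    """
    top_page = max(page_vectors, key=lambda p: pagerank[p])
    columns = list(zip(*page_vectors.values()))  # one tuple per feature
    mean = [fmean(col) for col in columns]
    var = [pvariance(col) for col in columns]    # population variance (assumption)
    return list(page_vectors[home_page]) + list(page_vectors[top_page]) + mean + var


# Toy host with two pages and two features per page (the paper uses 24).
pages = {"/": [1.0, 2.0], "/a": [3.0, 6.0]}
ranks = {"/": 0.2, "/a": 0.8}
print(host_content_vector(pages, "/", ranks))
```

With 24 features per page, the resulting vector has 4 × 24 = 96 entries, matching the count on the slide.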

-14/20- 4. Classification
C4.5 (decision tree)
  The resulting tree used 45 unique features (Table 1)
  18 of them are content features

-15/20- 5. Smoothing
Smoothing
  Using the link structure of the graph in a different way: to smooth the predictions instead of computing features
  Topological dependency
    Non-spam nodes usually link to no spam nodes
    Spam nodes are mainly linked by spam nodes
  Figures: (a) fraction of spam nodes in out-links, (b) fraction of spam nodes in in-links

-16/20- 5. Smoothing
Clustering
  Using the METIS graph clustering algorithm [18]
  Partitioning the 11,400 hosts of the graph into 1,000 clusters

-17/20- 5. Smoothing
Propagation
  Using propagation by random walks [32]
  The walk follows a link with probability α and returns to a spam node with probability 1 - α
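One way to picture this propagation is a random walk with restart, computed by power iteration. The sketch below is only an illustration under assumptions that the slide does not specify: the walk follows out-links, the restart distribution sums to 1 and is concentrated on hosts predicted as spam, a walker on a host without out-links jumps back to the restart distribution, and α = 0.85 with 100 iterations are arbitrary defaults. The function name and the toy graph are hypothetical.

```python
def propagate_spamicity(out_links, restart, alpha=0.85, iterations=100):
    """Random walk with restart.

    out_links: dict mapping every node -> list of nodes it links to
               (every target must also be a key of out_links)
    restart:   dict node -> restart probability, summing to 1
    """
    nodes = list(out_links)
    score = {n: restart.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        # With probability 1 - alpha the walker returns to a restart node.
        nxt = {n: (1 - alpha) * restart.get(n, 0.0) for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                # With probability alpha the walker follows one out-link uniformly.
                share = alpha * score[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # No out-links: send the mass back to the restart nodes (assumption).
                for m in nodes:
                    nxt[m] += alpha * score[n] * restart.get(m, 0.0)
        score = nxt
    return score


# Toy host graph: "s" is predicted spam; "b" and "c" are unreachable from it.
graph = {"s": ["a"], "a": ["s"], "b": ["c"], "c": ["b"]}
print(propagate_spamicity(graph, {"s": 1.0}))
```

Hosts close to the restart nodes accumulate score, while hosts that cannot be reached from them stay at zero; a threshold on the resulting score would then be needed to turn it into a label.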

-18/20- 5. Smoothing
Stacked graphical learning
  Meta-learning scheme proposed recently by Kou [8]
  Uses a base learning scheme and generates a set of extra features
  An extra feature: the average predicted spamicity of r(h), $f(h) = \frac{1}{|r(h)|} \sum_{g \in r(h)} p(g)$
    p(h): prediction for h
    r(h): set of pages related to h
  The tree uses 40 features, of which 20 are content features
  (Figure annotation: 5.5%)
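The extra feature can be sketched as the mean of the base classifier's predictions over the related set r(h). In this illustration the function name and the input dictionaries are hypothetical, and falling back to the host's own prediction when r(h) is empty is an assumption made here, not something stated on the slide.

```python
def average_neighbor_spamicity(predictions, related):
    """Compute f(h) = mean of p(g) for g in r(h).

    predictions: dict host -> predicted spamicity p(h) from the base classifier
    related:     dict host -> iterable of related hosts r(h)
    """
    feature = {}
    for h, own in predictions.items():
        neighbors = [g for g in related.get(h, ()) if g in predictions]
        if neighbors:
            feature[h] = sum(predictions[g] for g in neighbors) / len(neighbors)
        else:
            feature[h] = own  # no related hosts: keep the host's own prediction (assumption)
    return feature


base = {"a": 0.9, "b": 0.1, "c": 0.5}
links = {"a": ["b", "c"], "b": ["a"]}
print(average_neighbor_spamicity(base, links))
```

In a stacked scheme, this value would be appended to each host's feature vector and the classifier trained again on the extended vectors.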

-19/20- 6. Conclusion
Contributions
  First paper that integrates link and content features
  Diverse smoothing algorithms, especially stacked graphical learning
Discussion
  Low detection rate compared to intrusion detection
  Publicly available dataset
  Feature selection using a statistical approach
  Research for each Web spam category

-20/20- Thank you! Questions?
