KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 2008.10.30. 이 승 민.

Presentation transcript:

KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민

“Know your neighbors: Web spam…” CS710, TS & IS Lab
Contents
  1. Introduction
  2. Data Set
  3. Features
  4. Classification
  5. Smoothing
  6. Conclusion

1. Introduction
What is web spam?
  - It includes malicious attempts to influence the outcome of ranking algorithms
  - Web spam is not a new problem, and it is not likely to be solved in the near future
How to detect it?
  - Traditional machine learning assumes that data instances are independent
  - On the Web, there are dependencies among pages and hosts

1. Introduction: Host graph
[Figure: host graph. White node: non-spam node. Black node: spam node. An edge: more than 100 links.]

1. Introduction: Previous work
Link spam
  - Creating a link structure that aims to affect the outcome of a link-based ranking algorithm
Content spam
  - Maliciously crafting the content of Web pages (e.g., inserting keywords)
  - Detected with methods similar to those used in spam filtering
Cloaking
  - Sending different content to a search engine than to the regular visitors of a web site

1. Introduction: Overall scheme
Data set: WEBSPAM-UK2006
  - 77.9 million pages, 3 billion links, 11,400 hosts
  - Host-level labeling
Feature extraction: 236 features
  - Link features: 140
  - Content features: 96
Classification: decision tree (C4.5)
  - Link + content based features
  - Result: 49 features (F-measure value missing in the transcript)
Smoothing: 3 smoothing techniques using the link structure
  - Graph clustering
  - Propagation using random walks
  - Stacked graphical learning: 40 features (F-measure value missing in the transcript)

2. Data Set
WEBSPAM-UK2006, a publicly available dataset
  - 77.9 million pages and over 3 billion links in about 11,400 hosts
  - Label columns: host name, judgment, spamicity, label for the host

2. Data Set: Measures
Confusion matrix (rows: true label, columns: prediction)

                      Predicted non-spam (-)   Predicted spam (+)
  True non-spam (-)             a                      b
  True spam (+)                 c                      d

Precision, recall, F-measure
  P = d / (b+d)    R = d / (c+d)    F = 2PR / (P+R)
True positive rate, false positive rate, ROC curve
  TP = d / (c+d)   FP = b / (a+b)
Validation
  - Tenfold cross-validation
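The measures on this slide can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name `spam_metrics` and the example counts are hypothetical and only illustrate the formulas.

```python
def spam_metrics(a, b, c, d):
    """Evaluation measures from confusion-matrix counts.

    a: non-spam predicted non-spam   b: non-spam predicted spam
    c: spam predicted non-spam       d: spam predicted spam
    """
    precision = d / (b + d) if (b + d) else 0.0
    recall = d / (c + d) if (c + d) else 0.0  # identical to the true positive rate
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    false_positive_rate = b / (a + b) if (a + b) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "true_positive_rate": recall,
        "false_positive_rate": false_positive_rate,
    }


# Hypothetical counts, for illustration only.
metrics = spam_metrics(a=80, b=5, c=10, d=30)
```

Recall and the true positive rate share the same formula, so the sketch returns one value under both names.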

3. Features: Link-based features (140 features)
Uses most of the 163 features of Becchetti et al. [4]
Degree-related measures (16/17)
  - Measures related to the in-degree and out-degree
  - 16 degree-related features
PageRank (11/28)
  - Link-based ranking algorithm that computes a score for each page
  - 11 PageRank-based features
TrustRank ( /35)
  - Algorithm that estimates a TrustRank score for each page
  - The same algorithm is also used to estimate the spam mass of a page
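As a reminder of how a PageRank score is obtained, here is a minimal power-iteration sketch. It is not the feature extraction code used in the paper; the damping factor 0.85, the iteration count, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a dict: node -> list of linked nodes."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the random-jump share first.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its score uniformly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Toy graph (hypothetical): "c" collects links from every other node.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

TrustRank follows the same iteration but restarts only from a trusted seed set instead of from all nodes.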

3. Features: Link-based features (continued)
Truncated PageRank ( /60)
  - A variant of PageRank that diminishes the influence of a page on the PageRank score of its neighbors
Estimation of supporters
  - x is a d-supporter of y if the shortest path from x to y has length d
  - $N_d(x)$ is the set of the d-supporters of page x
      – $|N_d(x)|$ is an increasing function with respect to d
  - $b_d(x) = \min_{j \le d} \frac{|N_j(x)|}{|N_{j-1}(x)|}$
      – Bottleneck number of page x
[Figure annotation: ~1.7]
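The supporter counts and the bottleneck number can be illustrated on a small graph. This sketch makes two assumptions that are not stated on the slide: supporters are counted cumulatively (all nodes within distance at most j, including x itself at distance 0), so that the counts grow with d, and the bottleneck number is the smallest growth ratio between consecutive distances. It uses exact breadth-first search, which is a stand-in for whatever estimation method is used at Web scale.

```python
from collections import deque


def supporter_counts(in_links, x, max_d):
    """Return [|N_0|, ..., |N_max_d|] for node x.

    in_links maps a node to the nodes that link to it.
    Assumption: N_j holds every node within distance <= j of x, x included.
    """
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        if dist[u] == max_d:
            continue  # do not explore beyond the requested distance
        for v in in_links.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    counts = [0] * (max_d + 1)
    for d in dist.values():
        counts[d] += 1
    for j in range(1, max_d + 1):
        counts[j] += counts[j - 1]  # make the counts cumulative
    return counts


def bottleneck_number(in_links, x, d):
    """Minimum growth ratio |N_j| / |N_(j-1)| for 1 <= j <= d."""
    counts = supporter_counts(in_links, x, d)
    return min(counts[j] / counts[j - 1] for j in range(1, d + 1))


# Hypothetical in-link graph: y is linked by a and b, a by four pages, b by one.
in_links = {"y": ["a", "b"], "a": ["c", "d", "e", "f"], "b": ["g"]}
b2 = bottleneck_number(in_links, "y", 2)
```

A small bottleneck number means the supporter set stops growing at some distance.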

3. Features: Content-based features (24 features)
Uses most of the features of Ntoulas et al. [22]
  - Number of words in the page
  - Number of words in the title
  - Average word length
  - Fraction of anchor text: if page A has a link with the anchor text “computer” pointing to page B, we may conclude that page B talks about “computer”
  - Compression rate: some search engines give higher weight to pages containing the query keywords several times, so spam pages tend to repeat them
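A few of these content features are simple enough to sketch. The tokenizer (`\w+`), the use of zlib, and the definition of compression rate as uncompressed size divided by compressed size are assumptions made for this example; the slide does not specify them.

```python
import re
import zlib


def content_features(title, body):
    """Compute a handful of page-level content features."""
    words = re.findall(r"\w+", body)
    raw = body.encode("utf-8")
    compressed = zlib.compress(raw)
    return {
        "num_words": len(words),
        "num_title_words": len(re.findall(r"\w+", title)),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        # Assumed definition: uncompressed bytes / compressed bytes.
        "compression_rate": len(raw) / len(compressed) if raw else 0.0,
    }


# Hypothetical pages: keyword stuffing compresses far better than normal prose.
stuffed = content_features("Cheap pills", "cheap pills " * 200)
normal = content_features(
    "Web spam", "Web spam detection combines link analysis with content analysis."
)
```

Under this definition a page that repeats the same keywords many times gets a much higher compression rate than ordinary text.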

3. Features: Content-based features (continued)
Corpus precision & recall
  - Based on the k most frequent words in the dataset, for k = 100, 200, 500, 1000
Query precision & recall
  - Based on the q most popular terms in a query log, for q = 100, 200, 500, 1000
Example (k = 100): P = 10/100 = 0.1, R = 10/200 = 0.05
[Figure: histogram of precision against fraction of pages]
Independent trigram likelihood: $-\frac{1}{k} \sum_{w \in T} \log p_w$
  - $p_w$: probability distribution of trigrams in a page
  - $T$: set of all trigrams in a page
  - $k = |T|$: number of distinct trigrams
Entropy of trigrams: $H = -\sum_{w \in T} p_w \log p_w$
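The two trigram features can be sketched as follows, assuming word-level trigrams, whitespace tokenization, natural logarithms, and the usual forms of the formulas: the independent likelihood is the negated mean log-probability over the distinct trigrams, and the entropy is the negated sum of p log p. These choices are assumptions for illustration.

```python
import math
from collections import Counter


def trigram_features(text):
    """Independent trigram likelihood and trigram entropy of one page."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return {"independent_likelihood": 0.0, "entropy": 0.0}
    counts = Counter(trigrams)
    total = len(trigrams)
    probs = [c / total for c in counts.values()]  # p_w for each distinct trigram
    k = len(counts)                               # number of distinct trigrams
    independent_lh = -sum(math.log(p) for p in probs) / k
    entropy = -sum(p * math.log(p) for p in probs)
    return {"independent_likelihood": independent_lh, "entropy": entropy}


# Hypothetical pages: fully repetitive text has zero trigram entropy.
varied = trigram_features("a b c d e")
repetitive = trigram_features("x x x x x")
```

Repetitive, machine-generated text concentrates probability on few trigrams, which lowers the entropy.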

3. Features: From page features to host features
Content-based feature vector c(h) of host h: $c(h) = \langle c(\hat{h}), c(h^*), E[c(p)], \mathrm{Var}[c(p)] \rangle$
  - $\hat{h}$: the home page of host h
  - $h^*$: the page with the largest PageRank among all pages in P (the pages of host h)
  - $c(p)$: the 24-dimensional content feature vector of page p
  - $E[c(p)]$: the average of all vectors $c(p)$, $p \in P$
  - $\mathrm{Var}[c(p)]$: the variance of $c(p)$, $p \in P$
96 (= 4 × 24) content features
In total, 140 + 96 = 236 link and content features
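The aggregation from page vectors to one host vector can be sketched as a concatenation of four blocks: the home page vector, the vector of the page with the largest PageRank, the per-feature mean, and the per-feature variance. The input layout (`url -> (pagerank, features)`) and the use of the population variance are assumptions for this example.

```python
def host_content_vector(pages, home_url):
    """Build a host-level vector from page-level content vectors.

    pages: dict url -> (pagerank, list of content features)
    Returns home-page features + top-PageRank-page features + mean + variance.
    """
    vectors = [features for _, features in pages.values()]
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    # Assumption: population variance (divide by n, not n - 1).
    variance = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n for i in range(dim)]
    home = pages[home_url][1]
    top = max(pages.values(), key=lambda entry: entry[0])[1]
    return list(home) + list(top) + mean + variance


# Hypothetical host with three pages and two content features per page.
pages = {
    "h/": (0.2, [1.0, 2.0]),
    "h/a": (0.5, [3.0, 4.0]),
    "h/b": (0.1, [5.0, 6.0]),
}
host_vector = host_content_vector(pages, "h/")
```

With 24 content features per page, the same concatenation yields the 4 × 24 = 96 host-level content features.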

4. Classification
C4.5 (decision tree)
  - The resulting tree used 45 unique features (Table 1)
  - 18 of them are content features
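C4.5 picks splits by gain ratio, that is, information gain normalized by the entropy of the split itself. The sketch below evaluates a single threshold split on one numeric feature; it is only the split criterion, not a full C4.5 implementation (no tree growing, pruning, or missing-value handling), and the feature values and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split: value <= threshold vs. value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # degenerate split carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


# Hypothetical feature values (e.g. a spamicity-like score) and host labels.
values = [0.1, 0.2, 0.8, 0.9]
labels = ["nonspam", "nonspam", "spam", "spam"]
best = gain_ratio(values, labels, threshold=0.5)
```

A threshold that separates the two classes perfectly and evenly reaches the maximum gain ratio of 1.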

5. Smoothing
Smoothing
  - Uses the link structure of the graph in a different way
Topological dependency
  - Non-spam nodes usually link to no spam nodes
  - Spam nodes are mainly linked by spam nodes
[Figure: (a) fraction of spam nodes in out-links; (b) fraction of spam nodes in in-links]

5. Smoothing: Clustering
  - Uses the METIS graph clustering algorithm [18]
  - Partitions the 11,400 hosts of the graph into 1,000 clusters
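The slide does not say how the clusters are used to smooth predictions. One plausible scheme, sketched here purely as an assumption, is to relabel a whole cluster when the average predicted spamicity of its hosts is very high or very low. The cluster assignment is taken as given (it would come from a partitioner such as METIS, which is not invoked here), and the thresholds 0.25 and 0.75 are hypothetical.

```python
from collections import defaultdict


def smooth_by_cluster(cluster_of, predicted, lower=0.25, upper=0.75):
    """Relabel all hosts of a cluster when its average spamicity is extreme.

    cluster_of: dict host -> cluster id
    predicted:  dict host -> predicted spamicity in [0, 1]
    """
    members = defaultdict(list)
    for host, cluster in cluster_of.items():
        members[cluster].append(host)
    smoothed = {}
    for hosts in members.values():
        average = sum(predicted[h] for h in hosts) / len(hosts)
        for h in hosts:
            if average >= upper:
                smoothed[h] = 1.0           # cluster looks like spam overall
            elif average <= lower:
                smoothed[h] = 0.0           # cluster looks like non-spam overall
            else:
                smoothed[h] = predicted[h]  # mixed cluster: keep the prediction
    return smoothed


# Hypothetical clusters and base-classifier predictions.
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
predicted = {"a": 0.9, "b": 0.8, "c": 0.6, "d": 0.1, "e": 0.6}
smoothed = smooth_by_cluster(cluster_of, predicted)
```

Here host "c" is pulled to spam by its cluster, while the mixed cluster of "d" and "e" is left unchanged.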

5. Smoothing: Propagation
  - Uses propagation by random walks [32]
  - The walk follows a link with probability α and returns to a spam node with probability 1 - α
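This kind of propagation resembles a random walk with restart whose restart distribution is concentrated on hosts predicted as spam. The sketch below is written under that assumption; the value of α, the iteration count, the forward direction of the walk, and the handling of hosts without out-links are all choices made for illustration.

```python
def propagate_spamicity(out_links, predicted, alpha=0.3, iterations=50):
    """Random walk with restart biased toward predicted-spam hosts.

    out_links: dict host -> list of linked hosts
    predicted: dict host -> predicted spamicity (at least one value > 0)
    """
    nodes = sorted(predicted)
    total = sum(predicted.values())
    restart = {u: predicted[u] / total for u in nodes}
    score = dict(restart)
    for _ in range(iterations):
        # With probability 1 - alpha the walk jumps back to a spam host.
        new_score = {u: (1.0 - alpha) * restart[u] for u in nodes}
        for u in nodes:
            targets = [v for v in out_links.get(u, []) if v in predicted]
            if targets:
                # With probability alpha the walk follows one out-link.
                share = alpha * score[u] / len(targets)
                for v in targets:
                    new_score[v] += share
            else:
                # No out-links: send the mass back to the restart distribution.
                for v in nodes:
                    new_score[v] += alpha * score[u] * restart[v]
        score = new_score
    return score


# Hypothetical graph: "s" is predicted spam and links to "t"; "g" and "h" are separate.
out_links = {"s": ["t"], "t": ["s"], "g": ["h"], "h": ["g"]}
predicted = {"s": 1.0, "t": 0.0, "g": 0.0, "h": 0.0}
propagated = propagate_spamicity(out_links, predicted)
```

Host "t" receives a non-zero score because a predicted spam host links to it, while the unconnected pair "g" and "h" stays at zero.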

5. Smoothing: Stacked graphical learning
  - Meta-learning scheme proposed recently by Kou [8]
  - Uses a base learning scheme and generates a set of extra features
  - The extra feature: average predicted spamicity of r(h)
      – p(h): prediction for h
      – r(h): set of pages related to h
  - The tree uses 40 features, of which 20 are content features
[Figure annotation: 5.5%]
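The extra feature can be sketched as the mean of the base predictions p(g) over all g in r(h). How r(h) is chosen (in-links, out-links, or both) is not stated on the slide, so the `related` mapping is taken as given, and the default value 0.0 for hosts with no related hosts is an assumption.

```python
def stacked_feature(related, predicted):
    """Average predicted spamicity of the hosts related to each host.

    related:   dict host -> list of related hosts, r(h)
    predicted: dict host -> base-classifier prediction, p(h)
    """
    feature = {}
    for h in predicted:
        neighbours = [g for g in related.get(h, []) if g in predicted]
        if neighbours:
            feature[h] = sum(predicted[g] for g in neighbours) / len(neighbours)
        else:
            feature[h] = 0.0  # assumption: no related hosts means no evidence
    return feature


# Hypothetical relations and base predictions.
related = {"a": ["b", "c"], "b": ["a"], "c": []}
predicted = {"a": 0.2, "b": 0.9, "c": 0.5}
extra = stacked_feature(related, predicted)
```

The base classifier is then retrained with this value appended to each host's feature vector.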

6. Conclusion
Contributions
  - First paper that integrates link and content features
  - Diverse smoothing algorithms, especially stacked graphical learning
Discussion
  - Low detection rate compared to intrusion detection
  - Publicly available dataset
  - Feature selection using a statistical approach
  - Research for each web spam category

Thank you! Questions?