Adversarial Information Retrieval: The Manipulation of Web Content



Outline: Introduction, Examples, TrustRank and Other Methods

What is Adversarial IR?
– Gathering, indexing, retrieving and ranking information
– A subset of the information has been manipulated maliciously
– The motive is financial gain

What is the Goal of AIR?
– Detect the bad sites or communities
– Improve precision on search engines by eliminating the bad guys

Simplest form
First-generation engines relied heavily on tf/idf.
– The top-ranked pages for the query "maui resort" were the ones containing the most occurrences of "maui" and "resort"
SEOs responded with dense repetitions of chosen terms.
– e.g., maui resort maui resort maui resort
– Often, the repetitions were in the same color as the background of the web page
– The repeated terms were indexed by crawlers but were not visible to humans in browsers
Pure word density cannot be trusted as an IR signal.
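To make the keyword-stuffing problem concrete, here is a minimal sketch of a ranker that scores pages by raw query-term counts. It is a deliberately naive stand-in for early tf-based ranking (no idf, no length normalization); the function name and both sample pages are made up for illustration.

```python
from collections import Counter


def term_frequency_score(document: str, query: str) -> int:
    """Sum the raw counts of each query term in the document.

    Assumption: whitespace tokenization and no idf or length
    normalization, which is exactly what makes the score easy to game.
    """
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())


# Hypothetical pages: one honest, one stuffed with repeated terms.
honest_page = "our maui resort offers ocean views and a quiet beach"
stuffed_page = "maui resort " * 20 + "cheap pills"

honest_score = term_frequency_score(honest_page, "maui resort")
stuffed_score = term_frequency_score(stuffed_page, "maui resort")

print(honest_score, stuffed_score)
```

The stuffed page wins by a wide margin even though it is the less useful result, which is why word density alone is not a trustworthy signal.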

Search Engine Spamming
– Link spam
– Link bombing
– Spam blogs
– Comment spam
– Keyword spam
– Malicious tagging

Spamming
Online tutorials for "search engine persuasion techniques"
– "How to boost your PageRank"
Artificial links and Web communities
Latest trend: "Google bombing"
– A community of people creates (genuine) links with a specific anchor text pointing to a specific page, usually to make a political point

Google Bombing

Our Focus: Link Manipulation

TrustRank
Observation
– Good pages tend to link to good pages
– A human is the best spam detector
Algorithm
– Select a small subset of pages and let a human classify them
– Propagate the goodness of pages

Propagation
Trust function T
– T(p) returns the probability that p is a good page
Initial values
– T(p) = 1, if p was found to be a good page
– T(p) = 0, if p was found to be a spam page
Iterations
– Propagate trust following out-links
– Only a fixed number of iterations, M

Propagation (2)
Problem with propagation
– Pages reachable from good seeds might not be good
– The further away we are from good seed pages, the less certain we are that a page is good
– Solution: reduce trust as we move further away from the good seed pages (trust attenuation)

Trust attenuation – dampening
– Propagate a dampened trust score β < 1 at the first step
– At the n-th step, propagate a trust of β^n

Trust attenuation – splitting
– The parent's trust value is split among its child nodes
– Observation: the more links a page has, the less care went into choosing them
– Mix dampening and splitting? β^n · (split trust)
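The two attenuation schemes can be combined in a few lines. The following is a minimal sketch, not the method from any specific paper: it assumes the graph is a dict mapping every page to a list of its out-links, that trust arriving at a page from several parents is simply added, and that propagation stops after a fixed number of steps. The function name and the toy graph are hypothetical.

```python
def propagate_trust(out_links, seeds, beta=0.85, steps=3):
    """Propagate trust from good seeds with dampening and splitting.

    At every step a page passes beta * (its incoming trust) to its
    children, split evenly among them. Assumption: contributions that
    reach the same page are summed.
    """
    trust = {page: 0.0 for page in out_links}
    frontier = {}
    for seed in seeds:
        trust[seed] = 1.0
        frontier[seed] = 1.0

    for _ in range(steps):
        next_frontier = {}
        for page, value in frontier.items():
            children = out_links[page]
            if not children:
                continue  # nothing to propagate from a page without out-links
            share = beta * value / len(children)  # dampen, then split
            for child in children:
                next_frontier[child] = next_frontier.get(child, 0.0) + share
        for page, value in next_frontier.items():
            trust[page] += value
        frontier = next_frontier
    return trust


# Hypothetical toy graph: the seed links to a and b, and a links to c.
toy_graph = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
toy_trust = propagate_trust(toy_graph, seeds=["seed"], beta=0.85, steps=2)
print(toy_trust)
```

In the toy graph, a and b each receive half of the dampened seed trust, and c, two steps away, receives less than a even though it is a's only child.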

Selection – Inverse PageRank
The seed set S should:
– be as small as possible
– cover a large part of the Web
Coverage is related to out-links in the same way that PageRank is related to in-links.
– Inverse PageRank: perform PageRank on a graph with inverted links
– G' = (V, E'), where (p, q) ∈ E' ⇔ (q, p) ∈ E
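Seed selection by Inverse PageRank can be sketched as follows: invert every edge, run an ordinary PageRank on the inverted graph, and take the top-scoring pages as seed candidates. This is an illustrative implementation under simple assumptions (dict-of-lists graph, uniform random jump, dangling pages spread their score uniformly); the function names and the toy graph are hypothetical.

```python
def invert_graph(out_links):
    """Build G' = (V, E') where (p, q) is in E' iff (q, p) is in E."""
    inverted = {page: [] for page in out_links}
    for page, targets in out_links.items():
        for target in targets:
            inverted[target].append(page)
    return inverted


def pagerank(out_links, alpha=0.85, iterations=20):
    """Plain power-iteration PageRank with a uniform random jump."""
    pages = list(out_links)
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        nxt = {page: (1.0 - alpha) / n for page in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                share = alpha * score[page] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Assumption: a dangling page spreads its score uniformly.
                for target in pages:
                    nxt[target] += alpha * score[page] / n
        score = nxt
    return score


def select_seeds(out_links, k):
    """Return the k pages with the highest Inverse PageRank."""
    scores = pagerank(invert_graph(out_links))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy graph: "hub" reaches every other page through out-links.
toy_graph = {"hub": ["a", "b", "c"], "a": ["c"], "b": [], "c": []}
print(select_seeds(toy_graph, 1))
```

The page with the best out-link coverage ("hub" here) ranks first, which is the property wanted from a seed: trust starting there reaches many pages.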

Algorithm
1. Select seeds (s) and order them by preference
2. Invoke the oracle (a human) on the first L seeds
3. Initialize and normalize the oracle response d
4. Compute the TrustRank score (as in the PageRank formula): t* = β · T · t* + (1 − β) · d
T is the adjacency matrix of the Web graph, normalized by out-degree (that is, the transition matrix). β is the dampening factor (usually 0.85).
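Step 4 of the algorithm can be sketched as a short power iteration. This is a minimal illustration under stated assumptions, not a reference implementation: the graph is a dict of out-link lists, the oracle's answers arrive as a list of pages judged good, d puts equal weight on each good seed, and trust that reaches a page without out-links is simply dropped. The function name and the toy graph are hypothetical.

```python
def trustrank(out_links, good_seeds, beta=0.85, iterations=20):
    """Iterate t = beta * T * t + (1 - beta) * d for a fixed number of steps.

    T is applied implicitly: each page splits its current trust evenly
    among its out-links. d is the normalized oracle response.
    """
    pages = list(out_links)
    d = {page: 0.0 for page in pages}
    for seed in good_seeds:
        d[seed] = 1.0 / len(good_seeds)

    t = dict(d)  # start the iteration from the seed distribution
    for _ in range(iterations):
        nxt = {page: (1.0 - beta) * d[page] for page in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                share = beta * t[page] / len(targets)
                for target in targets:
                    nxt[target] += share
        t = nxt
    return t


# Hypothetical toy graph: "spam" links to a good page, but nothing links to it.
toy_graph = {"good": ["ok"], "ok": ["good"], "spam": ["good"]}
toy_scores = trustrank(toy_graph, good_seeds=["good"])
print(toy_scores)
```

Because trust flows only along out-links from the seeds, a page that no trusted page links to ends up with a score of zero, no matter how many links it creates itself.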

Algorithm - example
– s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
– Ordering = [2, 4, 5, 1, 3, 6, 7]
– L = 3: {2, 4, 5}, d = [0, 0.5, 0, 0.5, 0, 0, 0]
– β = 0.85, M = 20
– t* = [0, 0.18, 0.12, 0.15, 0.13, 0.05, ]
– NB: max = 0.18
– Issues with pages 1 and 5

Issues with TrustRank
Coverage of the seed set may not be broad enough
– Many different topics exist, each with good pages
TrustRank has a bias towards communities that are heavily represented in the seed set
– This inadvertently helps spammers that fool these communities

Bias towards larger partitions
– Divide the seed set into n partitions, each with m_i nodes
– t_i: TrustRank score calculated by using partition i as the seed set
– t: TrustRank score calculated by using all the partitions as one combined seed set
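One way to see the bias is through the linearity of the TrustRank equation in the seed vector. The derivation below is a sketch under two assumptions that are not stated on the slide: the partitions are disjoint, and every seed receives equal weight in the normalized vector. Here d_i denotes the normalized seed vector of partition i, d the normalized vector of the combined seed set, and β and T are the dampening factor and transition matrix from the TrustRank formula.

```latex
\begin{align*}
t &= (1-\beta)\,(I - \beta T)^{-1}\, d,
  \qquad t_i = (1-\beta)\,(I - \beta T)^{-1}\, d_i, \\
d &= \sum_{i=1}^{n} \frac{m_i}{\sum_{j=1}^{n} m_j}\; d_i
  \quad\Longrightarrow\quad
t = \sum_{i=1}^{n} \frac{m_i}{\sum_{j=1}^{n} m_j}\; t_i .
\end{align*}
```

Under these assumptions the combined score is a weighted average of the per-partition scores, with weights proportional to partition size, so a partition with many seeds dominates the result.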

Basic ideas
Use pages labeled with topics as seed pages
– Pages listed in highly regarded topic directories
Trust should be propagated by topic
– A link between two pages is usually created in a topic-specific context

Topical TrustRank
– Partition the seed set into topically coherent groups
– TrustRank is calculated for each topic
– The final ranking is generated by a combination of these topic-specific trust scores
Note
– TrustRank is essentially biased PageRank
– Topical TrustRank is fundamentally the same as Topic-Sensitive PageRank, but for demoting spam

Combination of trust scores
Simple summation
– The default mechanism just seen
Quality bias
– Each topic is weighted by a bias factor
– Summation of these weighted topic scores
– One possible bias: the average PageRank value of the seed pages of the topic
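Both combination schemes reduce to a weighted sum over topics. The sketch below assumes the per-topic TrustRank scores have already been computed and are given as nested dicts; the function name, the topic names, the scores and the bias weights are all made up for illustration (in the quality-bias variant the weights could be, for example, the average PageRank of each topic's seed pages).

```python
def combine_topic_scores(topic_scores, topic_weights=None):
    """Combine per-topic trust scores into one score per page.

    topic_scores: {topic: {page: trust}}
    topic_weights: {topic: bias factor}; None means simple summation.
    """
    topics = list(topic_scores)
    if topic_weights is None:
        topic_weights = {topic: 1.0 for topic in topics}
    pages = set().union(*(topic_scores[topic] for topic in topics))
    return {
        page: sum(
            topic_weights[topic] * topic_scores[topic].get(page, 0.0)
            for topic in topics
        )
        for page in pages
    }


# Hypothetical per-topic scores for two pages.
scores_by_topic = {
    "travel": {"a": 0.4, "b": 0.1},
    "health": {"a": 0.0, "b": 0.3},
}
simple_sum = combine_topic_scores(scores_by_topic)
quality_biased = combine_topic_scores(
    scores_by_topic, {"travel": 2.0, "health": 1.0}
)
print(simple_sum, quality_biased)
```

With simple summation the two pages tie; once the "travel" topic is given a larger bias factor, page a moves ahead, showing how the choice of bias changes the final ranking.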

Further Improvements
Seed weighting
– Instead of assigning an equal weight to each seed page, assign a weight proportional to its quality / importance
Seed filtering
– Filter out low-quality pages that may exist in topic directories
Finer topics
– Use lower layers of the topic directory
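Seed weighting only changes how the seed vector d is built. As a minimal sketch, assuming each seed comes with some non-negative quality estimate (the function name and the numbers below are hypothetical), the weights are normalized so that they sum to 1:

```python
def seed_distribution(seed_quality):
    """Turn per-seed quality estimates into a normalized seed vector d."""
    total = sum(seed_quality.values())
    if total <= 0:
        raise ValueError("at least one seed needs a positive quality value")
    return {page: quality / total for page, quality in seed_quality.items()}


# Hypothetical quality estimates: seed "a" is considered three times as important as "b".
weighted_d = seed_distribution({"a": 3.0, "b": 1.0})
print(weighted_d)
```

The resulting vector can replace the uniform d in the TrustRank formula, so that more important seeds inject more trust.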