Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.

Slides:



Advertisements
Similar presentations
The Inside Story Christine Reilly CSCI 6175 September 27, 2011.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Natural Language Processing WEB SEARCH ENGINES August, 2002.
What is WEB SPAM Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”
CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.
The PageRank Citation Ranking “Bringing Order to the Web”
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
CS345 Data Mining Link Analysis 3: Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
1 Chapter 19: Information Retrieval. ©Silberschatz, Korth and Sudarshan19.2Database System Concepts - 5 th Edition, Sep 2, 2005 Chapter 19: Information.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Chapter 19: Information Retrieval
1 CS/INFO 430 Information Retrieval Lecture 18 Web Search 4.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
Information Retrieval
Chapter 5: Information Retrieval and Web Search
 Search engines are programs that search documents for specified keywords and returns a list of the documents where the keywords were found.  A search.
WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University.
Designing for Search Engines MIS 314 MIS 314 Mr. David Auer.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Todd Friesen April, 2007 SEO Workshop Web 2.0 Expo San Francisco.
“ The Initiative's focus is to dramatically advance the means to collect,store,and organize information in digital forms,and make it available for searching,retrieval,and.
HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub.
Search Optimization Techniques Dan Belhassen greatBIGnews.com Modern Earth Inc.
Designing for Search Engines MIS 314 MIS 314 Professor Sandvig Professor Sandvig.
Search Engine Optimization. Introduction SEO is a technique used to optimize a web site for search engines like Google, Yahoo, etc. It improves the volume.
Adversarial Information Retrieval The Manipulation of Web Content.
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.
1 Chapter 19: Information Retrieval Chapter 19: Information Retrieval Relevance Ranking Using Terms Relevance Using Hyperlinks Synonyms., Homonyms,
Promotion & Cataloguing AGCJ 407 Web Authoring in Agricultural Communications.
Web spamming Detecting Spam Web Pages through Content Analysis Alexandros Ntoulas et al, 2006, International World Wide Web Conference.
Using Hyperlink structure information for web search.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
Search Engine Marketing Gay, Charlesworth & Esen Chapter 6.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
윤언근 DataMining lab.  The Web has grown exponentially in size but this growth has not been isolated to good-quality pages.  spamming and.
Search Engine Optimization 101 What is SEM? SEO? How can I use SEO on my blogs and/or my personal web space?
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Search Engine Optimization: A Survey of Current Best Practices Author - Niko Solihin Resource -Grand Valley State University April, 2013 Professor - Soe-Tsyr.
Link Analysis in Web Mining Hubs and Authorities Spam Detection.
Web Search Algorithms By Matt Richard and Kyle Krueger.
Search Engine Marketing SEM = Search Engine Marketing SEO = Search Engine Optimization optimizing (altering/changing) your page in order to get a higher.
Search Engines Reyhaneh Salkhi Outline What is a search engine? How do search engines work? Which search engines are most useful and efficient? How can.
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
Search Engine and SEO Presented by Yanni Li. Various Components of Search Engine.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
What is WEB SPAM Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”
IR Theory: Web Information Retrieval. Web IRFusion IR Search Engine 2.
Search Engine Optimization Miami (SEO Services Miami in affordable budget)
Database System Concepts, 5th Ed. ©Sang Ho Lee Chapter 19: Information Retrieval.
Search Engine Optimization
PRESENTATION ON SEO.
WEB SPAM.
Information Retrieval
PRESENTATION ON SEO.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
SEARCH ENGINE OPTIMIZATION SEO. What is SEO? It is the process of optimizing structure, design and content of your website in order to increase traffic.
Information Retrieval
Web Spam
Chapter 5: Information Retrieval and Web Search
Junghoo “John” Cho UCLA
Junghoo “John” Cho UCLA
Chapter 31: Information Retrieval
Chapter 19: Information Retrieval
Presentation transcript:

Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25

Objective of a search engine Providing results for a query: 2/25 relevance similarity between documents and query importance query-independent popularity (e.g. link structure) RANKING

Web spam misleading search engine into ranking 3/25 altering the relevance or importance of a page without improving its true value Ranking: estimates the VALUE of a page SPAM

Boosting techniques and objectives Term spamming: altering relevance 4/25 Link spamming: altering importance

Term spamming (relevance) Based on TFIDF metric: 5/25 term frequency, the number of occurrences of term t in the document TF(t): IDF(t): inverse document frequency, related to the number of document in the collection that contain t TFIDF(p,q) = S TF(t)·IDF(t) t  p and t  q

Altering TFIDF 6/25 making a page very relevant for a specific query repeating some targeted words making a page relevant for a large number of query adding a large number of distinct terms

Altering TFIDF 7/25 ● Spammers don't have real control over terms IDF increasing TFIDF = increasing TF ● Some search engines ignore IDF scores altogether

Term spamming: where ? 8/25 ● Body spam spam terms included in document body ● Title spam higher weight than body ● Meta tag spam heavy spamming, low priority

Term spamming: where ? 9/25 ● Anchor text spam ● URL spam buy-canon-rebel-20d-lens-case.camerasx.com buy-nikon-d100-d70-lens-case.camerasx.com url splitted to get page relevance free, great deals, cheap, inexpensive, cheap, free terms are added to the target page

Term spamming: how ? 10/25 ● Repetition ● Dumping including large sets of unrelated (and rare) words terms are repeated to get increased relevance easy to filter

Term spamming: how ? 11/25 ● Weaving adding spam terms at random position within text ● Phrase stitching merging phrases from different contexts or sources harder to filter!

Link spamming (importance) 12/25 Based on web graph structure the value of a page is (also) relative to incoming links Basic idea: people link pages they consider important on their sites

Ranking algorithms: HITS 13/25 Assigns two values to each page: ● Hub score: links to important authority pages ● Authority score: linked by important hubs circular definition

Ranking algorithms: HITS 14/25 Adding many links to important sites ( to the target page t. increasing hub score: increasing authority score: Increasing the value of n hub pages, also adding links to the target page t.

Ranking algorithms: PageRank 15/25 Ranking factors for a group of pages: PR() = PR static () + PR in () - PR out () - PR sink () PR static (): random jump PR in (): incoming links PR out (): outgoing links PR sink (): pages without outgoing links

Ranking algorithms: PageRank 16/25 Web from a spammer's eyes: Popularity based on number of incoming links

Link spamming: techniques 17/25 Outgoing links: ● Adding links to well known pages ● Directory cloning

Link spamming: techniques 18/25 Incoming links: ● Creating a “honey pot” ● Infiltrating a web directory ● Posting links on blogs, guestbook, wikis, etc. ● Link exchanging ● Buying expired domains ● Creating own spam farm

Hiding techniques: content hiding 19/25 Examples: hidden text... (tinyimg.gif is a 1x1 pixel/transparent image) document.getElementById("inv").style.display = none;

Hiding techniques: cloaking 20/25 Return a page to regular web browsers, another one to web crawlers ● Mantaining a list of IP used by crawlers ● Filtering user agent

Hiding techniques: redirection 21/25 Redirecting URL as soon as the page is loaded <meta http-equiv="refresh" content="0;url=target.html"> location.replace("target.html") easy to parse by SE crawlers don't execute scripts

Spam prevalence: statistics 22/25 Data set (DS1) ● large set of URLs; ● crawling guided by hash functions; ● manual spam evaluation. Data set (DS2) ● Single breadth-first search started at the Yahoo! home page; ● manual spam evaluation.

Spam prevalence: statistics 23/25 Data set (DS3) – Source: AltaVista. Pages grouped into about 31 million websites Sites ordered using descending PageRank, then splitted this list in 20 buckets

Spam prevalence: statistics 24/25 ● Crawls performed at different times; ● Different crawling strategies; ● Maybe the average number of pages per site is different for spam and non-spam sites; ● Classification of spam could be subjective.

Conclusions 25/25 Web content (estimated): 10-15% SPAM Countermeasures: ● Identification, followed by removal of spam pages from indexes; ● Prevention, application of various techniques during crawling; ● Counterbalancing, variation of ranking methods.