Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings Baoning Wu and Brian D. Davison Lehigh University Symposium on Applied.

Presentation transcript:

Undue Influence: Eliminating the Impact of Link Plagiarism on Web Search Rankings Baoning Wu and Brian D. Davison Lehigh University Symposium on Applied Computing 2006

Motivation: Link-based ranking algorithms are important to current popular search engines (e.g., HITS for Teoma). Link farms degrade the performance of link-based ranking algorithms.

HITS algorithm: Each page has two measures. The authority score a shows how good the page is for a query; the hub score h shows how likely the page is to point to good authority pages. E is the adjacency matrix. $a = E^{T} h$, $h = E a$
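
The two update rules can be sketched as a simple power iteration. This is a minimal illustration, not the authors' implementation: the function names `hits` and `normalize`, the fixed iteration count, the L2 normalization, and the three-page example graph are all assumptions made for the sketch.

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]

def hits(adj, iterations=50):
    """Iterate a = E^T h and h = E a, where adj[i][j] = 1 if page i links to page j."""
    n = len(adj)
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Authority of page j: sum of hub scores of the pages linking to j (a = E^T h).
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub score of page i: sum of authority scores of the pages i links to (h = E a).
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        auth = normalize(auth)
        hub = normalize(hub)
    return auth, hub

# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
example_adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
authority, hub_score = hits(example_adj)
best_authority = max(range(3), key=lambda j: authority[j])
best_hub = max(range(3), key=lambda i: hub_score[i])
```

In this toy graph page 2 receives the most links and page 0 points to the most pages, so they end up as the top authority and the top hub respectively.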

Example: for query “weather” calculator.html

Factors that degrade HITS Mutually reinforcing relationships Duplicate pages Link farms

Complete hyperlink: Definition: a link together with its anchor text, treated as a unit. Duplication of a complete link is a much stronger sign of copying behavior on the Web than a duplicate link target alone.
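
To make the definition concrete, the sketch below represents a complete link as a (target URL, anchor text) pair and finds the pairs that appear on more than one page. The function name `complete_link_index`, the example URLs, and the lowercase and whitespace normalization of anchor text are assumptions for illustration; the slides do not say how anchor text is compared.

```python
from collections import defaultdict

def complete_link_index(pages):
    """Map each complete link (target, anchor text) to the set of pages containing it."""
    index = defaultdict(set)
    for doc, links in pages.items():
        for target, anchor in links:
            # Assumption: anchor text is compared case-insensitively, ignoring edge whitespace.
            index[(target, anchor.strip().lower())].add(doc)
    return index

# Hypothetical pages, each listed with its outgoing (target URL, anchor text) pairs.
pages = {
    "a.example/index.html": [("http://shop.example/", "Cheap rental cars"),
                             ("http://news.example/", "news")],
    "b.example/index.html": [("http://shop.example/", "cheap rental cars"),
                             ("http://news.example/", "headlines")],
}
index = complete_link_index(pages)
shared = {link: docs for link, docs in index.items() if len(docs) > 1}
```

Both pages link to the same two targets, but only the link to `http://shop.example/` repeats the anchor text, so only that one counts as a duplicated complete link.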

Document - Complete link Matrix

Bipartite graph: Two disjoint sets X and Y; every edge starts at an element of X and ends at an element of Y.

Link farms: Link farms are usually densely connected via multiple overlapping small bipartite cores. Task: detect densely connected bipartite components in the “document - complete link” matrix.

Algorithm for finding bipartite components
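
The transcript does not preserve the algorithm itself, so the following is only one plausible sketch of the task, not the authors' procedure. It assumes an iterative pruning in which a complete link survives only if at least k remaining documents contain it, and a document survives only if it keeps at least l such links. That reading of k and l, the function name `prune_bipartite`, and the example data are all assumptions.

```python
def prune_bipartite(doc_links, k=2, l=2):
    """Repeatedly drop rare links and sparse documents until nothing changes.

    doc_links maps a document to its complete links, each a (target, anchor) pair.
    Assumed thresholds: a link needs >= k documents, a document needs >= l links.
    """
    docs = {doc: set(links) for doc, links in doc_links.items()}
    changed = True
    while changed:
        changed = False
        # Count how many remaining documents contain each complete link.
        counts = {}
        for links in docs.values():
            for link in links:
                counts[link] = counts.get(link, 0) + 1
        for doc in list(docs):
            kept = {link for link in docs[doc] if counts[link] >= k}
            if len(kept) < l:
                del docs[doc]        # too few shared links: not part of a dense component
                changed = True
            elif kept != docs[doc]:
                docs[doc] = kept     # discard links that too few documents share
                changed = True
    return docs

# Hypothetical "document - complete link" data.
x, y, z, w = ("t1", "cars"), ("t2", "hotels"), ("t3", "maps"), ("t4", "news")
doc_links = {"A": [x, y], "B": [x, y], "C": [x, z], "D": [w]}
components = prune_bipartite(doc_links, k=2, l=2)
```

With k=2 and l=2, documents A and B remain because they share two complete links, while C (one shared link) and D (none) are pruned.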

Result: k=2 and l=2

Adjustment: document-document matrix

Final matrix

Weighted adjacency matrix

Experiment: HITS result of “rental car”

Experiment: B&H HITS result of “rental car” about_travelguides/addlisting.html

Experiment: CL-HITS result of “rental car”

Experiment: B&H HITS result of “translation online”

Experiment: CL-HITS result of “translation online” /worldlingo_translator.html

Duplicate example: BH-HITS result of “maps”

Duplicate example: CL-HITS result of “maps”

User evaluation

Category            HITS    BHITS   CL-HITS  CL-POP
Quite relevant      12.9%   24.5%   48.4%    46.3%
Relevant            10.7%   18.3%   28.8%    26.2%
Not sure             6.6%   10.5%    6.7%     6.4%
Irrelevant          26.8%   14.8%   11.3%    12.7%
Totally irrelevant  42.8%   31.9%    4.6%     8.1%
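
As a rough way to compare the four methods, the sketch below adds the “Quite relevant” and “Relevant” shares for each method. Treating that sum as a single relevance figure is an assumption made here for illustration; the slides do not state how their precision numbers were derived. The percentages are copied from the user evaluation slide.

```python
# Shares of judgments per method, copied from the user evaluation slide (percent).
evaluation = {
    "HITS":    {"quite_relevant": 12.9, "relevant": 10.7},
    "BHITS":   {"quite_relevant": 24.5, "relevant": 18.3},
    "CL-HITS": {"quite_relevant": 48.4, "relevant": 28.8},
    "CL-POP":  {"quite_relevant": 46.3, "relevant": 26.2},
}

# Assumption: a result counts as relevant if judged "Quite relevant" or "Relevant".
relevant_share = {
    method: round(scores["quite_relevant"] + scores["relevant"], 1)
    for method, scores in evaluation.items()
}
best_method = max(relevant_share, key=relevant_share.get)

for method, share in relevant_share.items():
    print(f"{method}: {share}%")
```

Under this assumption CL-HITS has the largest combined share and plain HITS the smallest.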

Discussion: Using the link alone (without its anchor text), the precision at 10 is 66.4%, much lower than when using the “complete link”. Random anchor texts.

Questions?