Improved Algorithms for Topic Distillation in a Hyperlinked Environment (ACM SIGIR ‘98) Ruey-Lung, Hsiao Nov 23, 2000.



Topic Distillation on the WWW
Definition: given a typical user query, find quality documents related to the query topic.
Characteristics:
  More general than finding a precise query match.
  Not as ambitious as trying to exactly satisfy the user's information need.
  When the query is ambiguous, it should return relevant documents for (some of) the main query topics.

Related Research
  [1] HITS: Authoritative Sources in a Hyperlinked Environment (’97)
  [2] Topic Distillation: Improved Algorithms for Topic Distillation in a Hyperlinked Environment (’98)
  [3] Related Pages: Finding Related Pages in the World Wide Web (’99)
  [4] Web Community: Inferring Web Communities from Link Topology (’98)
  [5] Reputation: What Is This Page Known For? Computing Web Page Reputations (’00)

HITS (Hyperlink Induced Topic Search) Algorithm
Start with a root set S_σ such that:
  S_σ is relatively small (typically up to 200 pages).
  S_σ is rich in relevant pages.
  S_σ contains most (or many) of the strongest authorities.
Recursively compute the degree of authority a(p) and of hub h(p) for each element p, where q → p means that q links to p:
  $a(p) = \sum_{q \to p} h(q)$
  $h(p) = \sum_{p \to q} a(q)$
(Figure labels: set S, set T.)
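The two update rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the slides' implementation: the function names `hits` and `normalize`, the fixed iteration count, and the Euclidean normalization after each round are assumptions added here so that the scores stay bounded.

```python
from math import sqrt


def normalize(scores):
    """Scale scores so that their squares sum to 1 (assumed normalization)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=20):
    """Run the a(p)/h(p) updates on a list of (source, target) links.

    `iterations` is an arbitrary choice for this sketch.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that link to p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p links to
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: two hub-like pages pointing at two targets.
auth, hub = hits([("h1", "a"), ("h2", "a"), ("h1", "b")])
```

On this toy graph, page `a` ends up with the highest authority score because both hub-like pages link to it, and `h1` ends up as the stronger hub because it links to both targets.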

HITS (Hyperlink Induced Topic Search)
Premises:
  The implicit annotation provided by human creators of hyperlinks contains sufficient information to infer authority.
  Sufficiently broad topics contain embedded communities of hyperlinked pages.
Problems:
  Mutually reinforcing relationships: certain arrangements of documents “conspire” to dominate the computation.
  Automatically generated links: no human opinion is expressed by the link.
  Non-relevant documents: the graph contains documents not relevant to the query topic.

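Improved Algorithm
Improved connectivity analysis:
  To counter mutually reinforcing relationships, a set of documents on one host should have the same influence on a single document on another host as one document would. Each edge therefore carries an authority weight and a hub weight:
  $a(p) = \sum_{q \to p} h(q) \times \mathrm{auth\_wt}(q, p)$
  $h(p) = \sum_{p \to q} a(q) \times \mathrm{hub\_wt}(p, q)$
Pruning nodes from the neighborhood graph:
  Relevance of document D_j to query Q:
  $\mathrm{Similarity}(Q, D_j) = \dfrac{\sum_{i=1}^{t} w_{iq} \times w_{ij}}{\sqrt{\sum_{i=1}^{t} w_{iq}^{2} \times \sum_{i=1}^{t} w_{ij}^{2}}}$
  Relevance threshold, one of:
    Median weight
    Start set median weight
    Fixed fraction of maximum weight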
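The similarity formula and the three threshold choices can be illustrated with a short Python sketch. Several details here are assumptions, not taken from the slides: term weights are stored as sparse dictionaries, the function names `similarity` and `prune` and the mode strings are hypothetical, the default `fraction` of 0.1 is an arbitrary example value, and nodes whose relevance weight equals the threshold are kept.

```python
from math import sqrt
from statistics import median


def similarity(query_weights, doc_weights):
    """Cosine-style similarity between two {term: weight} dictionaries."""
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    q_sq = sum(w * w for w in query_weights.values())
    d_sq = sum(w * w for w in doc_weights.values())
    denom = sqrt(q_sq * d_sq)
    return dot / denom if denom else 0.0


def prune(relevance, start_set, mode="median", fraction=0.1):
    """Return the nodes kept after thresholding their relevance weights.

    relevance: {node: relevance weight}; start_set: nodes of the start set.
    """
    if mode == "median":
        threshold = median(relevance.values())
    elif mode == "start_median":
        threshold = median(relevance[n] for n in start_set)
    elif mode == "max_fraction":
        threshold = fraction * max(relevance.values())
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: a node exactly at the threshold is kept.
    return {n for n, weight in relevance.items() if weight >= threshold}


# Hypothetical relevance weights for three nodes.
weights = {"x": 0.1, "y": 0.5, "z": 0.9}
kept = prune(weights, start_set={"x", "y"}, mode="median")
```

With these example weights the median is 0.5, so `x` is pruned and `y` and `z` are kept.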

Partial Content Analysis
Selectively analyze, and prune if needed, the nodes that are most influential in the outcome.
Query Q formation:
  Use 30 documents, selected by the heuristic in_degree + 2*num_query_matches + has_out_links.
Pruning:
  Degree-based pruning:
    Use 4*in_degree + out_degree as a measure of influence.
    Fetch the top 100 nodes, score them against Q, and prune them if needed.
  Iterative pruning:
    Use the connectivity analysis itself (imp) to select nodes to prune.
    Pruning happens over a sequence of rounds; each round runs imp for 10 iterations to get a ranked list.
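A minimal sketch of the two heuristics and of degree-based pruning follows. The function names, the `relevance_of` callback (standing in for fetching a page and scoring it against Q), the explicit `threshold` parameter, and the data layout are all assumptions made for illustration; only the two scoring formulas and the budget of 100 nodes come from the slide.

```python
def query_formation_score(in_degree, num_query_matches, has_out_links):
    """Heuristic used to pick the documents that form the query Q."""
    return in_degree + 2 * num_query_matches + (1 if has_out_links else 0)


def influence(in_degree, out_degree):
    """Degree-based measure of how much a node can affect the outcome."""
    return 4 * in_degree + out_degree


def degree_based_pruning(nodes, relevance_of, threshold, budget=100):
    """Score only the `budget` most influential nodes and prune the weak ones.

    nodes: {name: (in_degree, out_degree)}
    relevance_of: hypothetical callback returning a node's relevance to Q
    Returns (kept node names, pruned node names).
    """
    ranked = sorted(nodes, key=lambda n: influence(*nodes[n]), reverse=True)
    fetched = ranked[:budget]
    pruned = {n for n in fetched if relevance_of(n) < threshold}
    kept = [n for n in nodes if n not in pruned]
    return kept, pruned


# Hypothetical example: only the two most influential nodes are fetched.
graph = {"a": (5, 0), "b": (1, 1), "c": (0, 3)}
scores = {"a": 0.9, "b": 0.1, "c": 0.0}
kept, pruned = degree_based_pruning(graph, scores.get, threshold=0.5, budget=2)
```

Note that `c` survives even though its relevance is low: it falls outside the fetch budget, so it is never scored. That is the trade-off of analyzing content only partially.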

Evaluation
[Table: Average Precision at Top 5 and 10 ranked authority documents. Columns: At 5 and At 10, each for All, Rare, and Popular queries. Algorithms: base, imp, med, start, max, pca0, pca1, grouped as Without Regulation, With Regulation, and Partial. Annotations: 26%, 36%.]
[Table: Average Precision at Top 5 and 10 ranked hub documents. Same columns and algorithms. Annotations: 23%, 33%.]

Finding Related Pages in the WWW
Appears in the 8th WWW Conference.
Definition: a related web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com.
Algorithms:
  Companion algorithm: derived from HITS.
  Cocitation algorithm: finds pages that are frequently cocited with the input URL u.
Evaluation: the two proposed algorithms are 73% and 51% better than Netscape’s “What’s Related”.

Companion Algorithm
Takes as input a URL u and consists of four steps:
  1. Build a vicinity graph for u.
  2. Contract duplicates and near-duplicates in this graph.
  3. Compute edge weights based on host-to-host connections.
  4. Compute hub and authority scores.

Cocitation Algorithm
Degree of cocitation: the number of common parents of two nodes.
(Figure: u and its sibling set.)
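The degree of cocitation defined above can be computed directly from a link list. The sketch below is an assumption-laden illustration: the function name `cocitation_degrees`, the (parent, child) edge format, and the ranking by descending degree are choices made here, and it does not reproduce any further selection rules the original algorithm may apply.

```python
from collections import defaultdict


def cocitation_degrees(edges, u):
    """Rank pages by how many parents they share with the input URL u.

    edges: iterable of (parent, child) links.
    Returns a list of (page, degree) pairs, highest degree first.
    """
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    u_parents = parents.get(u, set())
    degrees = {}
    for page, page_parents in parents.items():
        if page == u:
            continue
        common = len(page_parents & u_parents)  # number of common parents
        if common:
            degrees[page] = common
    # Sort by degree (descending), then by name for a stable order.
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link structure around the input page "u".
links = [("p1", "u"), ("p2", "u"), ("p1", "x"),
         ("p2", "x"), ("p2", "y"), ("p3", "z")]
related = cocitation_degrees(links, "u")
```

Here `x` shares both parents with `u`, `y` shares one, and `z` shares none, so `z` does not appear in the result.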