Improved Algorithms for Topic Distillation in a Hyperlinked Environment (ACM SIGIR ‘98) Ruey-Lung, Hsiao Nov 23, 2000.


1 Improved Algorithms for Topic Distillation in a Hyperlinked Environment (ACM SIGIR ‘98)
Ruey-Lung, Hsiao Nov 23, 2000

2 Topic Distillation on the WWW
Definition: Given a typical user query, find quality documents related to the query topic.
Characteristics: More general than finding a precise query match. Not as ambitious as trying to exactly satisfy the user's information need. When the query is ambiguous, it should return relevant documents for (some of) the main query topics.

3 Related Research
[1] HITS: Authoritative sources in a hyperlinked environment ’97
[2] Topic Distillation: Improved Algorithms for Topic Distillation in a Hyperlinked Environment ’98
[3] Related Page: Finding Related Pages in the World Wide Web ’99
[4] Web Community: Inferring Web Communities from link topology ’98
[5] Reputation: What is this page known for? Computing Web Page Reputations ’00

4 HITS (Hyperlink Induced Topic Search)
Algorithm
Start with a root set S. S_σ is relatively small (typically up to 200 pages). S_σ is rich in relevant pages. S_σ contains most (or many) of the strongest authorities.
Recursively compute the degree of authority a(p) and hub h(p) for each element:
$a(p) = \sum_{q \to p} h(q)$
$h(p) = \sum_{p \to q} a(q)$
(Diagram labels: set S, set T.)
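The two update rules above can be sketched as a short iteration. This is a minimal illustration, not the presenter's code: the function name `hits`, the edge-list representation, the fixed iteration count, and the L2 normalization after each round are all assumptions made for the example.

```python
import math


def _normalize(scores):
    """Scale scores to unit L2 norm (assumed normalization; leaves all-zero scores unchanged)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=20):
    """Return (authority, hub) score dicts for a list of directed (source, target) edges."""
    nodes = {node for edge in edges for node in edge}
    auth = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all edges q -> p
        new_auth = {node: 0.0 for node in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all edges p -> q, using the updated authority scores
        new_hub = {node: 0.0 for node in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: h1 links to a1 and a2, h2 links only to a1.
example_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
example_auth, example_hub = hits(example_edges)
```

In this toy graph, a1 ends up with a higher authority score than a2 because two hubs point to it, and h1 ends up as the stronger hub because it points to both authorities.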

5 HITS (Hyperlink Induced Topic Search)
Premises: The implicit annotation provided by the human creators of hyperlinks contains sufficient information to infer authority. Sufficiently broad topics contain embedded communities of hyperlinked pages.
Problems:
Mutually Reinforcing Relationships: certain arrangements of documents “conspire” to dominate the computation.
Automatically Generated Links: no human opinion is expressed by the link.
Non-relevant Documents: the graph contains documents not relevant to the query topic.

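6 Improved Algorithm
Improved Connectivity Analysis
Mutually reinforcing relationships should all have the same influence on a single document, so each edge is weighted by auth_wt and hub_wt:
$a(p) = \sum_{q \to p} h(q) \times \mathrm{auth\_wt}(q,p)$
$h(p) = \sum_{p \to q} a(q) \times \mathrm{hub\_wt}(p,q)$
Pruning Nodes from the Neighborhood Graph
Relevance threshold: Median Weight, Start Set Median Weight, or Fixed Fraction of Maximum Weight.
Similarity of query Q and document D_j:
$\mathrm{Similarity}(Q, D_j) = \dfrac{\sum_{i=1}^{t} w_{iq} \times w_{ij}}{\sqrt{\sum_{i=1}^{t} w_{iq}^{2} \times \sum_{i=1}^{t} w_{ij}^{2}}}$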
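The slide names auth_wt and hub_wt but does not define them. The sketch below assumes a fractional scheme: if k documents on one host link to the same document, each of those edges gets auth_wt = 1/k, and if one document links to l documents on the same host, each of those edges gets hub_wt = 1/l. The function names, the `host_of` mapping, and the L2 normalization are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def edge_weights(edges, host_of):
    """Assumed scheme: split one unit of influence across edges that share a host."""
    # k = number of edges from documents on host(q) into the single document p
    into_doc = Counter((host_of[q], p) for q, p in edges)
    # l = number of edges from the single document p to documents on host(q)
    out_of_doc = Counter((p, host_of[q]) for p, q in edges)
    auth_wt = {(q, p): 1.0 / into_doc[(host_of[q], p)] for q, p in edges}
    hub_wt = {(p, q): 1.0 / out_of_doc[(p, host_of[q])] for p, q in edges}
    return auth_wt, hub_wt


def normalize(scores):
    """Scale scores to unit L2 norm (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {node: v / norm for node, v in scores.items()}


def weighted_hits(edges, host_of, iterations=10):
    """Edge-weighted authority/hub iteration over (source, target) edges."""
    edges = sorted(set(edges))  # drop duplicate edges so the counts are per link
    auth_wt, hub_wt = edge_weights(edges, host_of)
    nodes = {node for edge in edges for node in edge}
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # a(p) = sum over q -> p of h(q) * auth_wt(q, p)
        new_auth = dict.fromkeys(nodes, 0.0)
        for q, p in edges:
            new_auth[p] += hub[q] * auth_wt[(q, p)]
        # h(p) = sum over p -> q of a(q) * hub_wt(p, q)
        new_hub = dict.fromkeys(nodes, 0.0)
        for p, q in edges:
            new_hub[p] += new_auth[q] * hub_wt[(p, q)]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub
```

Under this assumption, three pages on one host that all link to the same document contribute together as much authority as a single page on another host would, which is the "same influence" property the slide asks for.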

7 Partial Content Analysis
Selectively analyze, and prune if needed, the nodes that are most influential in the outcome.
Query Q formation (use 30 documents). Heuristic: in_degree + 2*num_query_matches + has_out_links
Pruning
Degree Based Pruning: Use 4*in_degree + out_degree as a measure of influence. Fetch the top 100 nodes, score them against Q, and prune them if needed.
Iterative Pruning: Use the connectivity analysis itself (imp) to select nodes to prune. Pruning happens over a sequence of rounds; each round runs imp for 10 iterations to get a ranked list.
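The two scoring formulas on this slide are simple enough to write out directly. The sketch below is illustrative only: the function names and the `degrees` dictionary layout are assumptions, and has_out_links is treated as 1 when a document has any outgoing link and 0 otherwise.

```python
def start_doc_score(in_degree, num_query_matches, has_out_links):
    """Heuristic from the slide: in_degree + 2*num_query_matches + has_out_links."""
    return in_degree + 2 * num_query_matches + (1 if has_out_links else 0)


def influence(in_degree, out_degree):
    """Degree-based influence measure from the slide: 4*in_degree + out_degree."""
    return 4 * in_degree + out_degree


def top_influential(degrees, limit=100):
    """Return up to `limit` nodes with the highest influence.

    `degrees` maps a node to an (in_degree, out_degree) pair (assumed layout).
    """
    ranked = sorted(degrees, key=lambda node: influence(*degrees[node]), reverse=True)
    return ranked[:limit]


# Hypothetical degrees for three nodes: influence is a=4, b=3, c=9.
example_degrees = {"a": (1, 0), "b": (0, 3), "c": (2, 1)}
example_top = top_influential(example_degrees, limit=2)
```

Only the nodes returned by `top_influential` would then be fetched and scored against Q, which is what keeps the content analysis partial.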

8 Evaluation
[Table: Average Precision at Top 5 and 10 ranked authority documents. Columns: All, Rare, Popular, each At 5 and At 10. Rows: base, imp, med, start, max, pca0, pca1. Row groups: Without Regulation, With Regulation, Partial. Figures called out: 26%, 36%.]
[Table: Average Precision at Top 5 and 10 ranked hub documents. Columns: All, Rare, Popular, each At 5 and At 10. Rows: base, imp, med, start, max, pca0, pca1. Row groups: Without Regulation, With Regulation, Partial. Figures called out: 23%, 33%.]

9 Finding Related Pages in the WWW
Appears in the 8th WWW conference.
Definition: A related web page is one that addresses the same topic as the original page. For example, is a page related to
Algorithms: Companion algorithm: derived from HITS. Cocitation algorithm: finds pages that are frequently cocited with the input URL u.
Evaluation: The two proposed algorithms are 73% better and 51% better than Netscape’s “What’s Related”.

10 Companion Algorithm
Takes as input a URL u and consists of four steps:
1. Build a vicinity graph for u.
2. Contract duplicates and near-duplicates in this graph.
3. Compute edge weights based on host-to-host connections.
4. Compute hub/authority scores.

11 Cocitation Algorithm
Degree of co-citation: the number of common parents of two nodes.
(Diagram labels: u, sibling set.)
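The degree of co-citation can be computed directly from a list of links. The sketch below is a minimal illustration under assumed names: `cocitation_degrees` and the (parent, child) edge representation are hypothetical, and any ranking or cutoffs the full algorithm may apply are not shown.

```python
from collections import defaultdict


def cocitation_degrees(edges, u):
    """Map each node that shares a parent with u to its number of common parents.

    `edges` is an iterable of (parent, child) links (assumed representation).
    """
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    # Look up u's parents once, without inserting a new key into the defaultdict.
    u_parents = parents.get(u, set())
    degrees = {}
    for node, node_parents in parents.items():
        if node == u:
            continue
        shared = len(u_parents & node_parents)
        if shared:
            degrees[node] = shared
    return degrees


# Hypothetical links: p1 and p2 both cite u and a; p2 also cites b; p3 cites only c.
example_links = [
    ("p1", "u"), ("p1", "a"),
    ("p2", "u"), ("p2", "a"), ("p2", "b"),
    ("p3", "c"),
]
example_degrees = cocitation_degrees(example_links, "u")
```

Here a shares two parents with u and b shares one, so a is the more strongly co-cited sibling; c shares no parent with u and is left out.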

