Published by Daniel Stafford. Modified over 9 years ago.
1
Improving Web Search Results Using Affinity Graph
Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma
Microsoft Research Asia
SIGIR 2005
2
INTRODUCTION
The top search results can hardly cover a sufficient variety of topics (redundancy). Related approach: re-ranking based on MMR.
There is no indication of how informative a returned document is on the query topic (coverage). Related approach: subtopic retrieval.
This work proposes two novel metrics: diversity and information richness.
3
BACKGROUND
The most famous works on link analysis: PageRank and the HITS algorithm.
Explicit link analysis and implicit link analysis: two web pages are implicitly linked if they are visited sequentially by the same end-user.
DirectHit and Small Web Search.
4
AFFINITY RANKING
5
Diversity: Given a set of documents R, we use diversity Div(R) to denote the number of different topics contained in R.
Information Richness: Given a document collection D = {d1, …, dn}, we use information richness InfoRich(di) to denote the richness of information contained in the document di with respect to the entire collection D.
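The diversity definition above can be illustrated with a minimal sketch. It assumes every document in R already carries a set of topic labels (for example, directory categories); the function name and input shape are hypothetical and not taken from the slides.

```python
def diversity(result_topics):
    """Count the distinct topics covered by a result set R.

    result_topics: one collection of topic labels per document in R
    (an assumed input shape, chosen for illustration only).
    """
    distinct = set()
    for topics in result_topics:
        distinct.update(topics)
    return len(distinct)


# Three documents covering three distinct topics in total.
print(diversity([{"sports"}, {"sports", "news"}, {"tech"}]))
```

A result list that repeats one topic ten times scores 1 under this measure, which is the redundancy problem the talk sets out to address.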
6
Affinity Graph Construction
According to the vector space model, the similarity between a pair of documents di and dj can be calculated as the cosine of their term vectors.
To further measure the significance of the similarity between each document pair, we define a directed affinity of dj to di.
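The formulas on this slide did not survive extraction, so the following is a sketch under stated assumptions rather than the authors' exact definition. It assumes cosine similarity for the vector space model, and it assumes the affinity of dj to di is the dot product of the two term vectors scaled by the length of di alone, which makes the measure asymmetric. The `threshold` parameter and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine_similarity(d_i, d_j):
    """Symmetric vector-space similarity of two term vectors."""
    denom = norm(d_i) * norm(d_j)
    return dot(d_i, d_j) / denom if denom else 0.0


def affinity(d_i, d_j):
    """ASSUMED asymmetric affinity of d_j to d_i: (d_i . d_j) / ||d_i||."""
    n_i = norm(d_i)
    return dot(d_i, d_j) / n_i if n_i else 0.0


def build_affinity_graph(docs, threshold=0.0):
    """Return matrix M with M[i][j] = affinity(d_i, d_j) when it exceeds
    the (hypothetical) threshold, else 0. The diagonal stays 0."""
    n = len(docs)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                a = affinity(docs[i], docs[j])
                if a > threshold:
                    M[i][j] = a
    return M


docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
M = build_affinity_graph(docs)
print(M[0][1], M[1][0])  # the two directions differ
```

Because only one vector's length appears in the denominator, a short document can have high affinity toward a long one that contains it without the reverse holding, which is what makes the resulting graph directed.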
7
Information Richness Computation
After obtaining the affinity graph, we apply a link analysis algorithm similar to PageRank.
The affinity matrix M is normalized so that each row sums to 1.
8
Information Richness Computation
The score of document di can be deduced from the scores of all other documents linked to it, with a damping factor c (similar to the random jumping factor in PageRank).
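The equation for this slide was lost in extraction. One reading that is consistent with the PageRank analogy and with the row-normalized matrix (written here as \tilde{M}, a notation assumed for this sketch) is:

```latex
\mathrm{InfoRich}(d_i) \;=\; c \sum_{j \neq i} \mathrm{InfoRich}(d_j)\,\tilde{M}_{ji} \;+\; \frac{1 - c}{n}
```

Here n is the number of documents in the collection D and c is the damping factor. The first term collects the score passed on by every document dj that links to di, and the second term is the uniform share each document receives from random jumps.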
9
Information Richness Computation
Information can choose where to flow according to the following two rules:
With probability c, the information flows into one of the document nodes that di links to.
With probability 1 − c, the information flows randomly into any document in the collection.
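The two flow rules describe a random walk, and its stationary scores can be approximated by power iteration. The sketch below is an assumption-laden illustration: the default `c=0.85`, the convergence tolerance, and the choice to treat a document with no outgoing affinity as jumping uniformly are all hypothetical and not taken from the slides.

```python
def row_normalize(M):
    """Scale each row to sum to 1; rows with no links become uniform
    (an assumption made here so that the total score is preserved)."""
    n = len(M)
    out = []
    for row in M:
        s = sum(row)
        out.append([x / s for x in row] if s > 0 else [1.0 / n] * n)
    return out


def information_richness(M, c=0.85, tol=1e-10, max_iter=1000):
    """Power iteration: rich[i] = (1 - c)/n + c * sum_j rich[j] * P[j][i]."""
    n = len(M)
    P = row_normalize(M)
    rich = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            (1 - c) / n + c * sum(rich[j] * P[j][i] for j in range(n))
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, rich))
        rich = new
        if delta < tol:
            break
    return rich


# Star-shaped graph: documents 0 and 2 both link only to document 1.
star = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(information_richness(star))
```

In the star example the middle document collects flow from both neighbours and ends up with the largest score, matching the intuition that a document many others are similar to is information-rich.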
10
Diversity Penalty
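Only the title of this slide survived, so the following is a hedged sketch of one way a diversity penalty can work, not a transcription of the authors' algorithm. It assumes a greedy scheme: repeatedly pick the document with the highest current score, then lower every remaining document's score in proportion to its normalized affinity toward the picked document. The function name and the exact penalty term are assumptions.

```python
def affinity_rank(info_rich, P):
    """Greedy ranking with an ASSUMED diversity penalty.

    info_rich: information richness score per document.
    P: row-normalized affinity matrix, P[j][i] = affinity weight of j toward i.
    Returns document indices in the order they are selected.
    """
    scores = dict(enumerate(info_rich))
    order = []
    while scores:
        best = max(scores, key=scores.get)
        order.append(best)
        del scores[best]
        # Penalize documents that are close to the one just selected.
        for j in scores:
            scores[j] -= P[j][best] * info_rich[best]
    return order


# Documents 0 and 1 are near-duplicates; document 2 is unrelated to both.
info_rich = [0.5, 0.45, 0.2]
P = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(affinity_rank(info_rich, P))
```

Without the penalty the near-duplicate would take second place on its raw score; with it, the unrelated document moves up, which is the behaviour a diversity-aware ranking is meant to produce.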
11
Re-ranking Method
The re-ranking mechanism combines the results of full-text search and Affinity Ranking.
Score-combination
12
Re-ranking Method
Rank-combination
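The combination formulas themselves are missing from the transcript. As a rough sketch, both schemes can be read as weighted linear combinations: one over scores, the other over rank positions. The weight names `alpha` and `beta`, the max-normalization of scores, and the tie-breaking rule are assumptions made for this illustration; the score variant further assumes non-negative scores.

```python
def score_combination(fulltext_scores, affinity_scores, alpha=0.5, beta=0.5):
    """Order documents by a weighted sum of max-normalized scores (higher is better)."""
    max_ft = max(fulltext_scores.values()) or 1.0
    max_ar = max(affinity_scores.values()) or 1.0
    combined = {
        d: alpha * fulltext_scores[d] / max_ft + beta * affinity_scores[d] / max_ar
        for d in fulltext_scores
    }
    return sorted(combined, key=combined.get, reverse=True)


def rank_combination(fulltext_order, affinity_order, alpha=0.5, beta=0.5):
    """Order documents by a weighted sum of rank positions (lower is better)."""
    ft_rank = {d: r for r, d in enumerate(fulltext_order, start=1)}
    ar_rank = {d: r for r, d in enumerate(affinity_order, start=1)}
    combined = {d: alpha * ft_rank[d] + beta * ar_rank[d] for d in ft_rank}
    # Ties fall back to the full-text order.
    return sorted(combined, key=lambda d: (combined[d], ft_rank[d]))


print(rank_combination(["a", "b", "c"], ["c", "a", "b"]))
print(score_combination({"a": 2.0, "b": 1.0}, {"a": 0.1, "b": 0.4}))
```

Rank positions are insensitive to the scale of the underlying scores, which makes the rank variant easier to tune when full-text scores and affinity scores live on very different ranges.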
13
EXPERIMENTS
Yahoo! Directory: contains a total of 292,216 categories (including leaf and non-leaf categories). All categories are organized into a 16-level hierarchy. We downloaded 792,601 documents in total.
ODP (Open Directory Project): we downloaded the directory in August 2004. ODP includes a total of 172,565 categories. We downloaded 1,547,000 documents in total.
14
EXPERIMENTS
Newsgroup dataset
The Newsgroup data is composed of 256,449 posts collected from 117 commercial application newsgroups, with a total size of about 400M.
The title and content of each post are given a 3:1 weighting ratio in the indexing process.
There are no explicit links among the posts.
A large number of posts are very likely to be devoted to the same topic.
15
Affinity Ranking vs. K-Means Clustering
16
The top 1000 search results of each query are passed to the AR or K-Means algorithm to re-rank the top 10 results.
For the K-Means algorithm, we set K=10 and use the top 1 document of each cluster to construct the top 10 results.
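The K-Means baseline step can be sketched as follows, assuming cluster assignments are already available and that "top 1 document of each cluster" means the cluster member ranked highest by the original search. Both the function name and that interpretation are assumptions.

```python
def top_per_cluster(ranked_docs, cluster_of, k=10):
    """Walk the original ranking and keep the first document seen from each cluster.

    ranked_docs: document ids in original search order.
    cluster_of: mapping from document id to cluster id (e.g. from K-Means with K=10).
    """
    seen = set()
    picks = []
    for doc in ranked_docs:
        cluster = cluster_of[doc]
        if cluster not in seen:
            seen.add(cluster)
            picks.append(doc)
            if len(picks) == k:
                break
    return picks


ranked = ["d1", "d2", "d3", "d4"]
clusters = {"d1": 0, "d2": 0, "d3": 1, "d4": 2}
print(top_per_cluster(ranked, clusters))
```

In this toy input d2 is dropped because it shares a cluster with the higher-ranked d1.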
17
Affinity Ranking vs. K-Means Clustering
18
Affinity Ranking in Newsgroup dataset
We compare our approach with the Okapi system in three aspects: diversity, information richness, and relevance.
19
Affinity Ranking in Newsgroup dataset
Four researchers are hired to label the top 50 search results for each of the 20 queries based on the following steps:
20
Affinity Ranking in Newsgroup dataset
N is the number of users.
X could be the diversity, information richness, or relevance of the top search results.
A and F represent results from our ranking scheme and from full-text search, respectively.
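The formula these symbols belong to is not in the transcript. One plausible reading, offered purely as an assumption, is a relative improvement of A over F averaged across the N users. The function name and the exact form of the average are hypothetical.

```python
def mean_relative_improvement(x_a, x_f):
    """ASSUMED metric: average over users of (X_A - X_F) / X_F.

    x_a, x_f: per-user values of X (diversity, information richness, or
    relevance) for Affinity Ranking and full-text search; x_f must be positive.
    """
    if not x_a or len(x_a) != len(x_f):
        raise ValueError("need one non-empty score list per system, equal length")
    if any(f <= 0 for f in x_f):
        raise ValueError("full-text scores must be positive")
    return sum((a - f) / f for a, f in zip(x_a, x_f)) / len(x_a)


# Two users: one sees a 50% gain, the other sees no change.
print(mean_relative_improvement([6, 5], [4, 5]))
```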
21
Improvement in Top 10 Search Results
The top 10 search results always receive the most attention from end-users.
In this experiment, we use the rank-combination scheme with the two combination weights set to 0 and 1, respectively.
22
Improvement within Top 50 Search Results
24
A Case Study
This example is extracted from our experiments on the Newsgroup search for the query “Outlook print error”.
25
CONCLUSIONS
Proposed two new metrics: diversity and information richness.
A novel ranking scheme, Affinity Ranking, is proposed to re-rank the search results.
Our experiments showed that the proposed metrics and the new ranking method can effectively improve search performance.
Future work includes scaling our Affinity Ranking computation, for example, to the Web scale.