
1 윤언근 2008. 10. 29 DataMining lab

2  The Web has grown exponentially in size, but this growth has not been limited to good-quality pages.
▪ Spamming and malicious sites have grown along with it.
 To date, most work on web page ranking has focused on dynamic ranking (query-dependent ranking).

3  However, static ranking (query-independent ranking) is also crucially important for a search engine, and provides several benefits:
 Relevance
▪ A page's static rank is a query-independent indicator of its overall quality, and is itself a useful feature for the dynamic ranker.
 Efficiency
▪ The search engine's index is ordered by static rank.
▪ By traversing the index from high-quality to low-quality pages, the dynamic ranker can abort the search early.
 Crawl Priority
▪ Search engines need a way to prioritize their crawl: to determine which pages to re-crawl, and how often to seek out new pages.
▪ The static rank of a page is used to determine this prioritization.

4  Google was the first commercially successful search engine.
 Its ranking was built on the PageRank algorithm.
 The PageRank algorithm is widely regarded as the best method for static ranking.

5  A machine learning approach to static ranking has advantages beyond the quality of the ranking itself.
 Because the measure consists of many features, it is harder for malicious users to manipulate.
▪ By adjusting the ranking model in advance of the spammers' attempts, the spammers' actions can be blocked.
 This paper's contribution is a systematic study of static features, including PageRank, for ranking.

6  The basic idea
 A link from one Web page to another is treated as an endorsement:
▪ when creating a page, an author presumably chooses to link to pages deemed to be of good quality.
 PageRank score for page j:
▪ PR(j) = \frac{1-\alpha}{N} + \alpha \sum_{i \in B_j} \frac{PR(i)}{|F_i|}
▪ where F_i is the set of pages that page i links to, B_j is the set of pages that link to page j, N is the total number of pages, and \alpha is the damping factor.
 Though much work has been done on optimizing the PageRank computation, it remains a relatively slow, expensive property to compute.
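To make the computation concrete, here is a minimal power-iteration sketch of PageRank in Python. The toy graph, the fixed iteration count, and the uniform handling of dangling pages are illustrative assumptions, not the paper's 5-billion-page implementation.

```python
# Minimal PageRank power-iteration sketch.
# `graph` maps each page to the list of pages it links to (its F_i set).
def pagerank(graph, alpha=0.85, iters=50):
    pages = list(graph)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}               # uniform initial scores
    for _ in range(iters):
        nxt = {p: (1.0 - alpha) / n for p in pages}
        for i, outlinks in graph.items():
            if not outlinks:                        # dangling page: spread evenly
                for p in pages:
                    nxt[p] += alpha * pr[i] / n
            else:
                share = alpha * pr[i] / len(outlinks)
                for j in outlinks:                  # page i passes rank to its outlinks
                    nxt[j] += share
        pr = nxt
    return pr

# Tiny example: B and C both link to A, so A ends up with the highest score.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))
```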

7  RankNet is a modification of the standard neural network back-propagation algorithm.
 The RankNet cost function is based on the difference between a pair of network outputs.
 For each pair of feature vectors (x_i, x_j) in the training set, RankNet computes the network outputs o_i and o_j.
 For pairs where the ranking of item i > the ranking of item j, the cost is C_{ij} = \log(1 + e^{o_j - o_i}): the larger o_j - o_i is, the larger the cost.
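A minimal sketch of this pairwise cost, with o_i and o_j standing in for the two network outputs (the network itself is omitted):

```python
import math

# RankNet pairwise cost C_ij = log(1 + exp(o_j - o_i)) for a pair
# where item i should be ranked above item j.
def ranknet_cost(o_i: float, o_j: float) -> float:
    return math.log(1.0 + math.exp(o_j - o_i))

print(ranknet_cost(2.0, -1.0))  # correctly ordered pair -> small cost (~0.05)
print(ranknet_cost(-1.0, 2.0))  # incorrectly ordered pair -> large cost (~3.05)
```

The cost approaches zero when the pair is ordered correctly with a large margin, and grows roughly linearly in o_j - o_i when the pair is ordered wrongly, which is what drives the back-propagation updates.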

8  Work in machine learning has been done on the problems of classification and regression.
 Let X = {x_i} be a collection of feature vectors (vectors of real numbers), and Y = {y_i} a collection of associated classes, where y_i is the class of the object described by feature vector x_i.
 Classification problem
▪ learn a function f that maps y_i = f(x_i), for all i.
▪ When y_i is real-valued as well, this is called regression.
 Static ranking as a regression problem
▪ If we let x_i = the features of page i, and y_i = the value (rank) of page i, we can learn a regression function that maps each page's features to its rank.
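A minimal sketch of this framing, with hypothetical features and ratings, and ordinary least squares standing in for the learner:

```python
import numpy as np

# Fit f with y_i ~ f(x_i), then order pages by predicted value.
X = np.array([[120, 0.30],     # e.g. number of words, top-term frequency
              [450, 0.10],
              [80,  0.55]])
y = np.array([2.0, 4.0, 1.0])  # value (rank) assigned to each page

A = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
w, *_ = np.linalg.lstsq(A, y, rcond=None)  # solve min ||A w - y||^2

scores = A @ w
print(np.argsort(-scores))  # page indices, best predicted value first
```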

9  Recent work on the ranking problem
 attempts to optimize the ordering of the objects, rather than the value predicted for each object in isolation.
 The goal of the ranking problem:
▪ given the set of pairs Z = {(i, j) : item i is ranked above item j}, learn a scoring function f such that f(x_i) > f(x_j) for as many pairs (i, j) in Z as possible.

10  PageRank
▪ This paper optionally uses the PageRank of a page as a feature.
 Popularity
▪ The number of times the page has been visited by users over some period of time.
 Anchor text and inlinks
▪ Features based on the information associated with links to the page.
 Page
▪ Simple features such as the number of words and the frequency of the most common term.
 Domain
▪ Features computed as averages across all pages in the domain.
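A sketch of what a single page's feature vector drawn from these five feature sets might look like; every field name and value here is hypothetical:

```python
# One page's features, grouped by the five feature sets above.
page_features = {
    "pagerank": 0.00042,             # PageRank (optionally included)
    "visits_per_week": 1530,         # Popularity: visit counts over a period
    "num_inlinks": 87,               # Anchor text and inlinks
    "unique_anchor_phrases": 12,
    "num_words": 1245,               # Page: simple page-local statistics
    "top_term_frequency": 38,
    "domain_avg_num_words": 980.5,   # Domain: averages over all domain pages
    "domain_avg_pagerank": 0.00031,
}

# Fixed key order so every page yields a vector with the same layout.
feature_vector = [page_features[k] for k in sorted(page_features)]
print(feature_vector)
```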

11  This paper uses fRank (feature-based ranking), which applies RankNet to the set of features above.

12  This paper needed the correct ordering for a set of pages.
 It employed a dataset of 28,000 queries randomly selected from the MSN search engine.
▪ The probability that a query is selected is proportional to its frequency.
 Each page judged for a query is assigned a human rating from 0 to 4.

13  Common queries are more likely to be judged than uncommon queries.
▪ As a result, the judged pages tend to be of higher quality.
 The data is converted from query-dependent to query-independent:
▪ the query is removed;
▪ each page keeps the maximum rating over all the judgments it received.
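A minimal sketch of this conversion; the (query, page, rating) triples are made up:

```python
# Drop the query and keep, for each page, the maximum human judgment
# it received across all judged queries.
judgments = [
    ("seattle weather", "pageA", 3),
    ("seattle weather", "pageB", 1),
    ("coffee shops",    "pageA", 4),
    ("coffee shops",    "pageC", 2),
]

page_value = {}
for _query, page, rating in judgments:   # the query itself is discarded
    page_value[page] = max(page_value.get(page, 0), rating)

print(page_value)   # {'pageA': 4, 'pageB': 1, 'pageC': 2}
```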

14  Because this paper evaluated the pages on queries that occur frequently,
 the data indicates the correct ordering for the index, and
 assigns high value to pages that are relevant to common queries.

15  This paper chose pairwise accuracy to evaluate the quality of a static ranking.
 Pairwise accuracy is the fraction of page pairs, among those ordered by the human judgments, that the static ranking places in the same order.
 If s(x) is the static ranking assigned to page x, and H(x) is the human judgment of relevance for x, then consider the following sets:
▪ H_p = {(x, y) : H(x) > H(y)} and S_p = {(x, y) : s(x) > s(y)}
▪ pairwise accuracy = |H_p ∩ S_p| / |H_p|
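A minimal sketch of the measure, with made-up scores and judgments:

```python
# Pairwise accuracy: the fraction of page pairs with different human
# judgments that the static ranking orders the same way.
def pairwise_accuracy(s, H):
    """s[x] = static rank score of page x; H[x] = human judgment of x."""
    agree = total = 0
    pages = list(H)
    for a in range(len(pages)):
        for b in range(a + 1, len(pages)):
            x, y = pages[a], pages[b]
            if H[x] == H[y]:
                continue              # judgments give no order for this pair
            total += 1
            if (H[x] > H[y]) == (s[x] > s[y]):
                agree += 1
    return agree / total

print(pairwise_accuracy({"p1": 0.9, "p2": 0.2, "p3": 0.5},
                        {"p1": 4, "p2": 0, "p3": 3}))  # 1.0: fully consistent
```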

16  This measure was chosen for two reasons:
 first, the discrete human judgments provide only a partial ordering over Web pages, so measures that require a total ordering do not directly apply;
 second, pairwise accuracy directly measures the fraction of pairs of documents that are ordered correctly.

17  This paper trained fRank using the following parameters:
 a 2-layer network
 10 hidden nodes
 input weights
▪ all initialized to zero
 output weights
▪ drawn from a uniform random distribution over the range [-0.1, 0.1]
 This paper used 'tanh' as the transfer function from the input to the hidden layer, and a linear function from the hidden layer to the output.
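A sketch of this setup; the number of input features (20) is an assumption for illustration, and the training loop itself is omitted:

```python
import numpy as np

# 2-layer network, 10 hidden nodes, zero-initialized input weights,
# output weights uniform in [-0.1, 0.1], tanh hidden / linear output.
rng = np.random.default_rng(0)
n_features, n_hidden = 20, 10

W1 = np.zeros((n_hidden, n_features))       # input -> hidden, all zeros
b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.1, 0.1, size=n_hidden)  # hidden -> output
b2 = 0.0

def forward(x: np.ndarray) -> float:
    h = np.tanh(W1 @ x + b1)   # 'tanh' transfer, input to hidden
    return float(W2 @ h + b2)  # linear transfer, hidden to output

print(forward(rng.normal(size=n_features)))  # one static-rank score (0.0 before training)
```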

18  The learning rate starts at an initial rate k (0.001) and is reduced over the course of training.
▪ In general, lower training rates require more training iterations.
▪ A higher training rate allows the network to converge more rapidly; however, the chance of reaching a non-optimal solution is greater.
 In the experiments, computing fRank for all 5 billion Web pages was approximately 100 times faster than computing PageRank for the same set.
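The slide gives only the initial rate k = 0.001, so the following is a sketch under the assumption of a simple multiplicative per-epoch decay; the paper's actual schedule may differ:

```python
# Assumed learning-rate schedule: start at k and shrink each epoch.
k, decay, n_epochs = 0.001, 0.9, 20

rate = k
for epoch in range(n_epochs):
    # ... one back-propagation pass over the training pairs at `rate` ...
    rate *= decay  # lower the rate: slower convergence, but safer
print(f"final rate after {n_epochs} epochs: {rate:.6f}")
```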

19  Basic Results  Results for individual feature sets

20  Ablation Results

21  fRank performance as feature sets are added

22  Top ten URLs for PageRank vs. fRank
 PageRank's top URLs are dominated by technology sites.
 fRank's top URLs are dominated by consumer-oriented sites.

