Accelerating Ranking-System Using WebGraph

1 Accelerating Ranking-System Using WebGraph
Project Report by Padmaja Adipudi

2 Outline of My Talk
Needle Search Engine/Ranking-System
Ranking-System Issue/Resolution
Accelerating Ranking-System using WebGraph
Ranking Algorithms Overview: Google’s PageRank, ClusterRank, SourceRank & Truncated PageRank
Experimental Results: Efficiency Measure, Quality Measure
Conclusion: Which algorithm is better in terms of Efficiency & Quality

3 Search Engine The Web is a terrific place to find information on any topic. A Search Engine is a useful application for information retrieval on the WWW. A Search Engine has five basic components: a Crawler, a Parser, a Ranking-System, a Repository and a Front-End.

4 Ranking-System Determines the importance of a Web page.
Google's PageRank algorithm is the best-known Ranking-System and is based on the URL link structure. In Google’s PageRank, the importance of a Web page is based on the importance of its parent Web pages.

5 Needle Search Engine A Search Engine developed by former students at UCCS. The ClusterRank algorithm is implemented as its Ranking-System. Yi Zhang, a former student, developed the cluster ranking system, which takes an average of 3 hours to rank 300,000 URLs.

6 Ranking-System Issue The major issue with the current ranking system is its long update time: 3 hours for 300K URLs. As the number of pages increases, this will become a severe problem.

7 Project Goal Accelerate the existing Ranking-System of the Needle Search Engine at UCCS using a package called “WebGraph”. Upgrade the Needle Search Engine system up to 1 Million Web pages from the 50K Web pages (crawled).

8 Steps to reach Goal
Use the WebGraph package to represent the graph efficiently using compression techniques.
Compute the Page-Rank using three algorithms, namely ClusterRank, SourceRank and Truncated PageRank.
Compare the time and quality results of ClusterRank with those of SourceRank and Truncated PageRank, and choose the best algorithm for the Needle Search Engine.

9 Work Flow Compressed Graph -> ClusterRank, SourceRank, Truncated PageRank -> Page Rank Results

10 Why Truncated & Source Algorithms
These are the most recent papers available in the page-ranking area. Their authors used the WebGraph package for their experiments while developing the algorithms.

11 Node Graph Node graph is used in ranking system.
The Node graph consists of nodes and directed links from node to node. URLs are represented by nodes and hyperlinks are represented as directed links between nodes. Compression techniques are used to represent the Node graph in an efficient manner.

12 Google’s PageRank Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd from Stanford University, 1999. The importance of a page is based on the number of incoming links and on how important the pages behind those links are.
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PR(Tn): Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page in the web, all the way up to PR(Tn) for the last page.
C(Tn): Each page spreads its vote out evenly amongst all of its outgoing links. The count of outgoing links for page 1 is C(T1), C(Tn) for page n, and so on for all pages.
PR(Tn)/C(Tn): If page A has a back link from page Tn, the share of the vote page A gets is PR(Tn)/C(Tn).
d: All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by 0.85 (the factor d).
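The formula on this slide can be sketched as a simple iteration. This is a minimal illustration rather than the Needle implementation: the function name, the starting value of 1.0, the fixed iteration count and the toy graph are all assumptions made for the example.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = list(out_links)
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in out_links.items():
            if not targets:
                continue  # a page with no out-links casts no votes in this sketch
            share = pr[source] / len(targets)  # PR(T) / C(T)
            for target in targets:
                new_pr[target] += d * share
        pr = new_pr
    return pr


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph page C collects votes from both A and B, so it should end up with the highest rank, while B, which receives only half of A's vote, should end up lowest.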

13 ClusterRank Yi Zhang, a student at UCCS is the author, 2006.
The algorithm is based on Google’s PageRank. It is designed to speed up the PageRank calculation and also to group similar Web pages together into clusters. The original PageRank algorithm is applied to the clusters. The rank is then distributed to the members of each cluster by weighted average.

14 ClusterRank (Cont’d) Group all pages into clusters.
Perform first-level clustering for dynamically generated pages. URLs are grouped based on the “?” and “#” symbols. Example: All URLs below will be grouped into one Cluster

15 ClusterRank (Cont’d) Perform second level clustering on virtual directory and graph density. URLs are grouped based on the last “/” symbol of the URL. Density is calculated for the proposed clusters. Approve the cluster based on the pre-set threshold value.
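The two grouping rules can be sketched as key functions: pages that share a key fall into the same proposed cluster. This is only an illustration under assumptions. The function names and the example URLs are hypothetical, and the density check against the threshold is left out because its formula is not given on the slides.

```python
from collections import defaultdict


def first_level_key(url):
    """Strip everything from the first '?' or '#' onward."""
    for marker in ("?", "#"):
        url = url.split(marker, 1)[0]
    return url


def second_level_key(url):
    """Keep the URL up to and including its last '/' (the virtual directory)."""
    base = first_level_key(url)
    return base[: base.rfind("/") + 1]


def group_by(urls, key):
    """Collect URLs that share the same key into one proposed cluster."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[key(url)].append(url)
    return dict(clusters)


# Hypothetical URLs, not taken from the Needle crawl
urls = [
    "http://example.edu/courses/list.php?id=1",
    "http://example.edu/courses/list.php?id=2#top",
    "http://example.edu/courses/syllabus.html",
]
first_level = group_by(urls, first_level_key)
second_level = group_by(urls, second_level_key)
```

Here the two list.php URLs share a first-level key, and all three URLs share the second-level key for the courses directory.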

16 ClusterRank (Cont’d) Calculate the rank for each cluster using the original PageRank algorithm. Distribute the rank number to its members by weighted average using: PR = CR * Pi/Ci.
The notations here are:
PR: The rank of a member page
CR: The cluster rank from the previous stage
Pi: The incoming links of this page
Ci: Total incoming links of this cluster
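The distribution step PR = CR * Pi/Ci can be written out as follows. This is a small sketch with hypothetical names and numbers; the even split for a cluster with no incoming links is an assumption, since the slides do not say how that case is handled.

```python
def distribute_cluster_rank(cluster_rank, member_in_links):
    """Apply PR = CR * Pi / Ci to every member page of one cluster."""
    total_in_links = sum(member_in_links.values())  # Ci
    if total_in_links == 0:
        # Assumption: with no incoming links at all, split the rank evenly.
        even_share = cluster_rank / len(member_in_links)
        return {page: even_share for page in member_in_links}
    return {
        page: cluster_rank * in_links / total_in_links  # CR * Pi / Ci
        for page, in_links in member_in_links.items()
    }


# Hypothetical cluster with rank 2.0 and three member pages
page_ranks = distribute_cluster_rank(2.0, {"a.html": 6, "b.html": 3, "c.html": 1})
```

With these numbers the page holding 6 of the cluster's 10 incoming links receives 60% of the cluster rank, and the member ranks add back up to the cluster rank.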

17 SourceRank James Caverlee, Ling Liu, and S. Webb from Georgia Institute of Technology, 2007. The Web graph is represented as Sources. A Source is a logical collection of Web pages. The algorithm assigns a score to each page based on the overall quality of the source that the page belongs to, through a random walk over Web sources.

18 SourceRank (Cont’d) Group all pages into Sources based on “Domain”.
URLs are grouped based on the first “/” symbol of the URL. Example: All URLs below will be grouped into one Source

19 SourceRank (Cont’d) Calculate the rank for each Source with the original PageRank algorithm. Distribute the rank number to its members by weighted average using: PR = SR * Si.
The notations here are:
PR: The rank of a member page
SR: The source rank from the previous stage
Si: Total incoming unique links of this source
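Grouping pages into Sources by domain, and collapsing page-level links into unique source-level links, might look like the sketch below. It assumes the host name is the source identifier and that links inside one source are dropped. Both are choices made for this example, and the URLs are hypothetical.

```python
from urllib.parse import urlsplit


def source_of(url):
    """Use the host name as the source identifier (an assumption of this sketch)."""
    return urlsplit(url).netloc.lower()


def source_edges(page_links):
    """Collapse page-level links into unique links between distinct sources."""
    edges = set()
    for from_url, to_url in page_links:
        src, dst = source_of(from_url), source_of(to_url)
        if src != dst:  # assumption: links within one source are ignored
            edges.add((src, dst))
    return edges


# Hypothetical page-level links
links = [
    ("http://cs.example.edu/a.html", "http://www.example.org/x.html"),
    ("http://cs.example.edu/b.html", "http://www.example.org/y.html"),
    ("http://cs.example.edu/a.html", "http://cs.example.edu/b.html"),
]
edges = source_edges(links)
```

The resulting source graph is much smaller than the page graph, which is consistent with the short SourceRank running times reported later in the talk.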

20 Truncated PageRank L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates from Italy, 2006. In PageRank, a Web page can gain a high Page-Rank score from supporters (in-links) that are topologically “close” to the target node. Spammers can afford to influence only a few levels. Truncated PageRank is similar to PageRank, except that the supporters that are too “close” to a target node do not contribute towards its ranking.

21 Truncated PageRank (Cont’d)
The notations here are: C: Normalization constant; α: The damping factor
$$PR(p) = \sum_{t} C\,\alpha^{t} \cdot M^{t} = \sum_{t} \mathrm{damping}(t) \cdot M^{t}$$
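One way to read the damping-function view is sketched below. Rank mass is pushed along the links one hop at a time, and the mass that arrives after t hops is weighted by damping(t). In this sketch damping(t) is zero for the first few hops, so close supporters contribute nothing, and C times the damping factor to the power t after that. The truncation distance of 2, the choice of normalization constant, the step limit and the toy graph are assumptions for illustration, not values from the talk.

```python
def truncated_pagerank(out_links, alpha=0.85, truncation=2, max_steps=100):
    """Sum damping(t) * (rank mass after t link hops), skipping hops t <= truncation."""
    pages = list(out_links)
    # Assumed normalization: makes damping(t) sum to 1 over all t > truncation.
    c = (1.0 - alpha) / alpha ** (truncation + 1)
    mass = {page: 1.0 / len(pages) for page in pages}  # distribution at t = 0
    score = {page: 0.0 for page in pages}
    for t in range(1, max_steps + 1):
        next_mass = {page: 0.0 for page in pages}
        for source, targets in out_links.items():
            for target in targets:
                next_mass[target] += mass[source] / len(targets)
        mass = next_mass
        damping = 0.0 if t <= truncation else c * alpha ** t
        for page in pages:
            score[page] += damping * mass[page]
    return score


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
scores = truncated_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Setting truncation to 0 drops only the t = 0 term, which gives a ranking close to ordinary PageRank; larger values discount more levels of nearby supporters.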

22 WebGraph Package Paolo Boldi and Sebastiano Vigna from Italy, 2004.
Represents the Node graph in an efficient manner using the Differential compression technique, which allows applications to compactly encode a new version of data with respect to a previous or reference version of the same data. WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link. WebBase is a repository of Web pages crawled by Ubi crawler from Stanford University.
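The intuition behind storing differences rather than absolute values can be shown with a sorted successor list. This is a deliberately simplified sketch and not the BVGraph format: the real format combines gaps with reference lists, intervals and variable-length codes, none of which are reproduced here. The function names and the node numbers are hypothetical.

```python
def to_gaps(successors):
    """Store the first successor as-is and every later one as a difference."""
    gaps, previous = [], None
    for node in successors:
        gaps.append(node if previous is None else node - previous)
        previous = node
    return gaps


def from_gaps(gaps):
    """Rebuild the sorted successor list by running sums."""
    successors, total = [], 0
    for gap in gaps:
        total += gap
        successors.append(total)
    return successors


# Hypothetical sorted successor list of one node
successors = [1000, 1001, 1002, 1010, 1500]
gaps = to_gaps(successors)
```

Because pages tend to link to pages with nearby numbers, most gaps are small integers, and small integers can be written with far fewer bits than full node identifiers.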

23 WebGraph Package (Cont’d)
Node graph initial representation: Node graph with Reference compression:

24 WebGraph Package (Cont’d)
Node graph with Differential compression: Differential compression makes it possible to code a link in less than a bit (not possible with plain Reference compression).

25 WebGraph Package (Cont’d)
Link Structure From DB -> Graph in ASCII format -> Graph in BV format -> PageRank Module

26 BVGraph Details BVGraph: Boldi Vigna Graph
A BVGraph is generated from a graph that is represented in ASCII format. The first line contains the number of nodes ‘n’; then ‘n’ lines follow, the i-th line containing the successors of node ‘i’ in increasing order (nodes are numbered from 0 to n-1). The successors are separated by a single space.
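Producing this ASCII layout from adjacency lists takes only a few lines. The sketch below uses a hypothetical helper name and a hypothetical four-node graph; it shows the text layout only and does not call the WebGraph package.

```python
def to_ascii_graph(successors_by_node):
    """Render adjacency lists in the ASCII layout described above."""
    lines = [str(len(successors_by_node))]  # first line: number of nodes n
    for successors in successors_by_node:   # line i: successors of node i
        lines.append(" ".join(str(s) for s in sorted(set(successors))))
    return "\n".join(lines) + "\n"


# Hypothetical 4-node graph: 0 -> 1, 0 -> 3, 1 -> 2, 3 -> 0 (node 2 has no successors)
text = to_ascii_graph([[3, 1], [2], [], [0]])
```

A node without successors still gets its own (empty) line, so the file always has n + 1 lines.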

27 BVGraph Details (Cont’d)
For example, consider a graph of three vertices, a, b, and c (a:0, b:1, c:2), consisting of the following edges: (a, b) (a, c) (b, c) (b, a) (c, b). This graph could be expressed as below:
3
1 2
0 2
1

28 BVGraph – Current Implementation
The URLLinkStructure table in the database holds the linking information.
An ASCII graph is generated from the data in the URLLinkStructure table, and then the BVGraph is generated from it.
The ASCII graph is stored as basename.graph-txt.
The BVGraph is generated using the command: java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph basename bvbasename

29 BVGraph – Current Implementation (Cont’d)
The graph can be generated for incoming links as well as outgoing links. BVnode-in, BVnode-out and BVSource-in graphs are generated. A BVGraph can be loaded using two loading methods, load and loadOffline. The load method is used for small graphs. The loadOffline method is used for large graphs.

30 ClusterRank Using BVGraph
Time per iteration for 300K URLs:
Without BVGraph: 9452 seconds
With BVGraph: 7737 seconds
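As a quick check on the two per-iteration times reported for 300K URLs (9452 seconds without BVGraph, 7737 seconds with it), the relative time saved works out as follows:

```latex
\frac{9452 - 7737}{9452} = \frac{1715}{9452} \approx 0.181
```

That is a reduction of roughly 18% per iteration for ClusterRank on this data set.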

31 ClusterRank Using BVGraph (Cont’d)
Time gain using WebGraph for 300K URLS

32 Time Measure for Algorithms (in Seconds)
Data set 1: Average InLinks per Node: 4.6; Clusters: 48271; Average InLinks per Cluster: 16.35; Sources: 425; Source InLinks: 75217
Data set 2: Node InLinks: 78.06; Average InLinks per Cluster: 109.35; Sources: 14892; Average InLinks per Source: 670.8
Data set 3: URLs: 4 M; Node InLinks: 5.82; Cluster InLinks: 32.54; Sources: 482
Time in seconds for data sets 1, 2 and 3:
Cluster Rank: 422, 6780, 2520
Source Rank: 3, 660, 21
Truncated PageRank: 2, 12, 17

33 Time Measure for Algorithms (Cont’d)

34 Time Measure for Algorithms (Cont’d)

35 Time Measure for Algorithms (Cont’d)

36 Time Measure for Algorithms (Cont’d)

37 Node In-Link Distribution across Nodes (4M URLs)

38 Node In-Link Distribution across Nodes (4M URLs)

39 Cluster In-Link Distribution across Clusters (4M URLs)

40 Source In-Link Distribution across Sources (4M URLs)

41 Quality Measure for Algorithms
A survey on the quality of the ranking algorithms was performed by a group of people, using 25 search keywords. The keywords were obtained from Google’s Keyword tool at:
Listed below are the keywords identified.
pictures, university, faculty, stadium, undergraduate, map, admissions, scholarships, loan, mba, alumni, computer, graduate, business, research, students, technology, accommodation, campus, vacations, dean, department, aid, gpa, parking

42 Quality Measure for Algorithms (Cont’d)
The survey was performed to identify the following from each keyword search:
First page accuracy
Second page accuracy
Result order on the first page
Result order on the second page
Overall, are the important pages showing up early?
Overall, what percentage of the result hits are relevant?

43 Quality Measure For Algorithms (Cont’d)
Quality measure per algorithm, on a scale of 1 to 5 (1 being the best):
ClusterRank: 2.06
SourceRank: 1.65
Truncated PageRank: 2.94

44 Conclusion The ClusterRank computation can be accelerated using WebGraph. The SourceRank algorithm takes less time for the Page-Rank calculation than ClusterRank and is close to Truncated PageRank for the existing 4M URLs. SourceRank has the best quality score of the three algorithms. Considering both Efficiency and Quality, SourceRank is the best of the three for the existing data, based on the experiments performed.

45 Success Criteria
Identified the efficiency of each Page-Rank computation algorithm using the time measures generated by experiments.
Identified the quality of each algorithm using manual survey results.
Implemented the efficient algorithm for the Needle Search Engine at UCCS.
Upgraded the existing Needle Search Engine to 1 Million pages (crawled, actual URLs are 4 Million) from the current 50K URLs (crawled, actual URLs are 300K).

46 References
[1] Paolo Boldi, Sebastiano Vigna. The WebGraph Framework 1: Compression Techniques.
[2] Yen-Yu Chen, Qingqing Gan, Torsten Suel. I/O-Efficient Techniques for Computing PageRank.
[3] Taher H. Haveliwala. Efficient Computation of PageRank.

47 References (Cont’d)
[4] Yi Zhang. Design and Implementation of a Search Engine with the Cluster Rank Algorithm.
[5] John A. Tomlin. A New Paradigm for Ranking Pages on the World Wide Web.
[6] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web.

48 References (Cont’d)
[7] Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo. Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms.
[8] Gonzalo Navarro. Compressing Web Graphs like Texts.
[9] The Spiders Apprentice.

49 References (Cont’d)
[10] James Caverlee, Ling Liu, S. Webb. Spam-Resilient Web Ranking via Influence Throttling.
[11] G. Jeh, J. Widom. SimRank: A Measure of Structural-Context Similarity.
[12] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, R. Baeza-Yates. Using Rank Propagation and Probabilistic Counting for Link-Based Spam Detection. Technical report, 2006.

