The Anatomy of a Large-Scale Hypertextual Web Search Engine A review by: Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith November 2005.

1 The Anatomy of a Large-Scale Hypertextual Web Search Engine A review by: Adam Chamberlain, Adrian Hudnott, Rob Garrood & Ben Smith November 2005

2 Agenda Introduction Overview of Google PageRank –Motivation & Description –Example –Issues & Comparison –Further Work Application Conclusions

3 Introduction About the paper –Brin & Page, 1998, Stanford University –Details a prototype search engine, Google –Covers both architecture and algorithms –Cited in the web metrics literature in relation to web page significance; also relevant to Web Graph Properties PageRank –Covered in a separate paper by Page, Brin, Motwani & Winograd (1998) –Is the primary metric used in the paper

4 Overview: What is Google? Web search engine –Tackles the scalability and manipulation issues faced by previous crawlers Academic –Built on strong understanding of web metrics –Use of hyperlink structures Transparent –Initially released into the public domain –Support for informatics research

5 Overview: Architecture [Diagram of components: URL Server, Crawler, Store Server, Repository, Indexer, URL Resolver, Sorter, PageRank, Searcher]

6 Overview: Google Architecture (Explanation for handout only.) URL Server: Finds pages to crawl. Crawler: Downloads pages and places them in the repository. Store Server: Document compression. Repository: Cached copies of most web pages. Indexer: Creates the forward index (documents → words) and extracts hyperlink tags into the Anchors file. URL Resolver: Converts relative URLs into absolute URLs and creates the Links file. Links file: Ordered pairs of document IDs where a hyperlink exists between them. Sorter: Re-sorts the forward index to create the inverted index (words → documents) and creates the Lexicon. Lexicon: Dictionary of all possible search keywords. Doc Index: Maps document identifier codes to URLs. PageRank: An influential web metric used to sort Google's matches. Searcher: Performs searches!
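As a rough illustration of the URL Resolver step described above, the sketch below turns relative links into absolute URLs and emits ordered pairs of document IDs, in the spirit of the Links file. The function name `resolve_links`, the `doc_ids` mapping and the example URLs are hypothetical; only `urllib.parse.urljoin` is a real standard-library call, and nothing here is taken from Google's implementation.

```python
from urllib.parse import urljoin

def resolve_links(anchors, doc_ids):
    """Turn (source_url, href) anchors into sorted (from_id, to_id) pairs.

    anchors: iterable of (source_url, href), where href may be relative.
    doc_ids: hypothetical mapping from absolute URL to document ID.
    Links whose source or target has no document ID are skipped.
    """
    links = set()
    for source_url, href in anchors:
        target = urljoin(source_url, href)  # relative -> absolute URL
        if source_url in doc_ids and target in doc_ids:
            links.add((doc_ids[source_url], doc_ids[target]))
    return sorted(links)

# Made-up example data.
doc_ids = {
    "http://example.org/a/index.html": 1,
    "http://example.org/a/page.html": 2,
    "http://example.org/b.html": 3,
}
anchors = [
    ("http://example.org/a/index.html", "page.html"),
    ("http://example.org/a/index.html", "../b.html"),
    ("http://example.org/b.html", "/a/index.html"),
]
links = resolve_links(anchors, doc_ids)
print(links)
```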

7 Overview: Forward Index Indexer identifies keyword hits in a document Maps document (page) IDs to word IDs in Lexicon Word IDs partially sorted into barrels –64 of these –Word IDs within a barrel are unsorted. –Individual document may spread over barrels. However, the forward index on its own is not useful for search!
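The barrel layout can be pictured with a small sketch. It assumes each barrel covers a contiguous range of word IDs of equal width, which is a simplification chosen for illustration; `barrel_for`, `fill_barrels` and the sample data are hypothetical names, and only the figure of 64 barrels comes from the slide.

```python
NUM_BARRELS = 64  # figure quoted on the slide

def barrel_for(word_id, lexicon_size, num_barrels=NUM_BARRELS):
    """Return the index of the barrel whose word-ID range contains word_id."""
    width = -(-lexicon_size // num_barrels)  # ceiling division
    return word_id // width

def fill_barrels(doc_hits, lexicon_size, num_barrels=NUM_BARRELS):
    """Distribute (doc_id, word_id) hits over barrels, in arrival order.

    Entries inside a barrel are left unsorted, so one document's hits
    can end up spread over several barrels.
    """
    barrels = [[] for _ in range(num_barrels)]
    for doc_id, word_ids in doc_hits.items():
        for word_id in word_ids:
            index = barrel_for(word_id, lexicon_size, num_barrels)
            barrels[index].append((doc_id, word_id))
    return barrels

# Made-up example: document 7 contains word IDs 639, 5 and 10.
barrels = fill_barrels({7: [639, 5, 10]}, lexicon_size=640)
```

With a 640-word lexicon each barrel covers ten word IDs, so the three hits of document 7 land in three different barrels.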

8 Overview: Inverted Index Want to know in what documents a keyword occurs Need the Inverted Index Sorts the forward index into its inverted form Function performed by the Sorter
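The inversion performed by the Sorter can be sketched as follows. This is a minimal in-memory version, assuming the forward index is a dictionary from document ID to a list of (word ID, position) hits; the function name `invert` and the sample data are hypothetical, and the real system works on disk-based barrels rather than Python dictionaries.

```python
from collections import defaultdict

def invert(forward_index):
    """Build word_id -> [(doc_id, position), ...] from doc_id -> [(word_id, position), ...]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):  # keeps each posting list ordered by doc ID
        for word_id, position in forward_index[doc_id]:
            inverted[word_id].append((doc_id, position))
    return dict(inverted)

# Made-up forward index: word 0 = "web", 1 = "search", 2 = "engine".
forward_index = {
    1: [(0, 0), (1, 1), (2, 2)],  # "web search engine"
    2: [(1, 0), (0, 2)],          # "search the web" ("the" is not indexed here)
}
inverted_index = invert(forward_index)
print(inverted_index[0])  # documents and positions where word 0 occurs
```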

9 Overview: Ranking System Proximity of keyword hits –This is the sum of the distances between them Hits have types –Types: body text, heading text, anchor text, URL, … –Relative font size factor used Count how many hits occur of each type and range of proximity values –Apply a function to each type-proximity count These form a type-proximity vector, C

10 Overview: Ranking System (2) V = C·W (dot product) is computed. –W is the importance associated with each type-proximity class. Combine V with the PageRank score The effect of each additional hit declines as the hit count grows –Prevents large scale manipulation [Plot: f(x) against hit count x]
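A minimal sketch of this scoring scheme follows. The weights, the logarithmic count function and the linear blend with PageRank are all assumptions made for illustration: the slides do not give Google's actual function f, the values of W, or how V and PageRank are combined.

```python
import math

# Hypothetical importance W for each (hit type, proximity range) class.
WEIGHTS = {
    ("anchor", "near"): 4.0,
    ("heading", "near"): 3.0,
    ("body", "near"): 1.0,
    ("body", "far"): 0.5,
}

def count_weight(hit_count):
    """Assumed f(x): grows with the hit count but flattens out quickly."""
    return math.log1p(hit_count)

def ir_score(counts, weights=WEIGHTS):
    """V = C . W, where C holds f(count) for each type-proximity class."""
    return sum(count_weight(counts.get(cls, 0)) * w for cls, w in weights.items())

def final_score(counts, pagerank, mix=0.5):
    """Assumed combination: a simple weighted blend of V and PageRank."""
    return mix * ir_score(counts) + (1.0 - mix) * pagerank

# Repeating a keyword 100 times in the body earns far less than 100x the score.
once = ir_score({("body", "near"): 1})
stuffed = ir_score({("body", "near"): 100})
print(once, stuffed)
```

Because f flattens, stuffing a page with one keyword gives diminishing returns, which is the anti-manipulation property the slide points to.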

11 PageRank: Motivation Academic Citation Analysis* attempted, but… –Web has no formal quality control or peer review –Possible to inflate citation counts artificially –Web pages vary more than academic papers Consider: –One link from the University's main page, or one link from Yahoo's main page… –Which citation should carry the higher weight? *Also known as bibliometrics

12 PageRank: Description Informal Definition: –A page has a high rank if the sum of the ranks of its backlinks is high –Handles the Yahoo case on the previous slide Intuitive Definition: –Corresponds to the Random Surfer Model –User keeps following links from page to page, then gets bored and restarts at a random location Now for the maths…

13 PageRank: Description (2) Formal Definition: –c is a damping factor (0.85 was used) –N_v is the number of out-links from page v –B_u is the set of pages that link to page u (its backlinks) –cE(u) corresponds to the surfer getting bored
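The formula shown on this slide is not preserved in the transcript. The symbols listed match the definition given in Page et al. (1998), so it was presumably the first equation below, reproduced here on that assumption with R(u) denoting the rank of page u. The second line is the form given in Brin & Page (1998), where d plays the role of the damping factor, T_1 … T_n are the pages linking to A, and C(T) is the number of out-links of T.

```latex
% Page et al. (1998), assumed to be the formula on the slide
R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c\,E(u)

% Brin & Page (1998), the same idea with damping factor d
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```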

14 PageRank: Example Considering an example network Calculating A: c = damping factor N = out-degree R = PageRank [Diagram: example network of five pages, A, B, C, D and E]

15 PageRank: Example (2) Initially set all PageRank to 1 First Iteration:
In-Links  Rank (R)  Out-Links (N)  R/N
B         1         1              1
C         1         2              0.5
E         1         2              0.5
[Diagram: the same example network of pages A, B, C, D and E]
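As a worked check, and assuming the update takes the (1 - c) + c·Σ form used by Brin & Page (1998) with c = 0.85 (an inference from the numbers, since the formula itself is not preserved in the transcript), the three in-links listed above give the first-iteration value for A:

```latex
R(A) = (1 - c) + c \left( \frac{R(B)}{N_B} + \frac{R(C)}{N_C} + \frac{R(E)}{N_E} \right)
     = 0.15 + 0.85 \left( \frac{1}{1} + \frac{1}{2} + \frac{1}{2} \right)
     = 0.15 + 0.85 \times 2
     = 1.85
```

This agrees with the value 1.8500 listed for A in iteration 1 on the next slide.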

16 PageRank: Example (3) Repeat process for B, C, D and E Feed computed values into next iteration
Iteration  1       2       3       4       5       6
A          1.8500  1.2479  1.1967  1.5230  1.3412  1.2954
B          0.4333  0.4333  0.6380  0.4930  0.4807  0.5593
C          0.8583  0.7981  0.9772  0.9084  0.8668  0.9277
D          1.0000  1.7225  1.2107  1.1672  1.4445  1.2900
E          0.8583  0.7981  0.9772  0.9084  0.8668  0.9277
Order      ADCEB   DACEB   DACEB   ADCEB   DACEB   ADCEB
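The iteration can be sketched in a few lines of Python. The slide's network diagram is not preserved, so the edge list below is an assumption: it is one link structure whose out-degrees agree with the previous slide (B: 1, C: 2, E: 2) and that reproduces the table's figures under the update R(u) = (1 - c) + c·Σ R(v)/N_v with c = 0.85. The names `OUT_LINKS`, `pagerank_step` and `iterate` are hypothetical.

```python
C_DAMP = 0.85  # damping factor quoted on the slides

# Assumed links (page -> pages it links to); inferred from the numbers,
# not read from the original diagram.
OUT_LINKS = {
    "A": ["D"],
    "B": ["A"],
    "C": ["A", "E"],
    "D": ["B", "C", "E"],
    "E": ["A", "C"],
}

def pagerank_step(ranks, out_links, c=C_DAMP):
    """Compute every page's new rank from the previous iteration's ranks."""
    new_ranks = {}
    for page in out_links:
        incoming = sum(
            ranks[src] / len(targets)
            for src, targets in out_links.items()
            if page in targets
        )
        new_ranks[page] = (1.0 - c) + c * incoming
    return new_ranks

def iterate(out_links, iterations, c=C_DAMP):
    """Start with every rank at 1 and return the ranks after each iteration."""
    ranks = {page: 1.0 for page in out_links}
    history = []
    for _ in range(iterations):
        ranks = pagerank_step(ranks, out_links, c)
        history.append(ranks)
    return history

history = iterate(OUT_LINKS, 6)
for number, ranks in enumerate(history, start=1):
    print(number, {page: round(rank, 4) for page, rank in sorted(ranks.items())})
```

Each pass uses only the previous pass's values, which is what "feed computed values into next iteration" describes.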

17 PageRank: Analysis Converges in roughly log n iterations –In practice the time to build a full-text index is the bigger constraint Rank Sinks –Caused by two pages that point to each other but not to any other pages: rank accumulates –Solved by random surfer model Manipulation – Google Bombing –"French Military Victories" links to "Defeats" –"Miserable Failure" links to George Bush biography
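The rank-sink effect can be seen on a toy graph. The sketch below assumes the update R(u) = (1 - c) + c·Σ R(v)/N_v, with c = 1 standing in for "no random surfer"; the three-page graph and the names `step`, `run` and `SINK` are hypothetical and chosen only to make the effect visible.

```python
def step(ranks, out_links, c):
    """One update of every page; each page must have at least one out-link."""
    new_ranks = {page: 1.0 - c for page in out_links}
    for src, targets in out_links.items():
        share = ranks[src] / len(targets)
        for dst in targets:
            new_ranks[dst] += c * share
    return new_ranks

def run(out_links, c, iterations):
    """Start with every rank at 1 and apply the update repeatedly."""
    ranks = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        ranks = step(ranks, out_links, c)
    return ranks

# P links into a two-page loop (X <-> Y) that never links back out.
SINK = {"P": ["X"], "X": ["Y"], "Y": ["X"]}

undamped = run(SINK, c=1.0, iterations=10)   # no random jumps
damped = run(SINK, c=0.85, iterations=200)   # random surfer with c = 0.85
print(undamped)
print(damped)
```

Without damping, all of the rank ends up inside the loop and P drops to zero. With c = 0.85, every page keeps a baseline of 1 - c, so P retains a non-zero rank even though the loop still holds most of the total.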

19 PageRank: Comparison Web Graph Properties –Uses graph of the entire web: depends on full crawl –More sophisticated than simply summing in/out-degrees Web Page Significance –Uses Boolean Spread Activation – match all words –Enhanced citation analysis – building on work of Kleinberg, Egghe & Rousseau –Doesn't suffer from the Tightly Knit Communities effect of Kleinberg's Hubs & Authorities

20 PageRank: Further Work Personalised PageRank, Haveliwala, 1999 –In-memory, block-oriented algorithm PageRank can be computed in an hour on a PIII 450 MHz using less than 100 MB of main memory –Compute PageRank on the client-side Use local information: bookmarks, searches, history Provide the link structure of the web on a DVD –11/11/05, Personalized Search released
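One way to picture client-side personalisation is to bias the "bored surfer" jump towards pages the user cares about. The sketch below is an illustration under that assumption, not Haveliwala's block-oriented algorithm: it iterates R(u) = (1 - c)·E(u) + c·Σ R(v)/N_v with a preference vector E that sums to 1. The function name, the three-page web and the bookmark are all hypothetical.

```python
def personalised_pagerank(out_links, preference, c=0.85, iterations=100):
    """PageRank with a non-uniform jump distribution.

    preference: page -> probability that a bored surfer restarts there
    (must sum to 1). Every page must have at least one out-link.
    """
    pages = list(out_links)
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - c) * preference.get(page, 0.0) for page in pages}
        for src, targets in out_links.items():
            share = c * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        ranks = new_ranks
    return ranks

# Made-up web of three pages.
WEB = {"home": ["news", "blog"], "news": ["home"], "blog": ["home"]}

uniform = personalised_pagerank(WEB, {page: 1 / 3 for page in WEB})
bookmarked = personalised_pagerank(WEB, {"blog": 1.0})  # user bookmarked "blog"
print(uniform)
print(bookmarked)
```

With a uniform preference, "news" and "blog" are ranked equally; once the jump is biased towards the bookmarked page, "blog" moves ahead of "news" while the ranks still sum to 1.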

21 PageRank: Further Work (2) Topic Sensitive PageRank, Haveliwala, 2002 –Improve Google by giving weight to the informational relationship between sites –A) Uniform Results Similar to current Google but with topics –B) Personalised to a particular user Based on previous searches and the user's surfing habits

22 Applications: Google Google Inc. –Largest search engine Technologies utilised by others (e.g. Yahoo!) Biggest ever technology IPO, 2004 –Redefining search Set a trend for other search providers Raised importance of quality web search results Combining information retrieval methods –Business model based on advertising Potential area for conflict Over 100 factors now influence results

23 Applications: PageRank Back-link prediction –Desire for optimal web crawling strategy –Better indicator than citation counts! Improving user navigation –The PageRank Proxy –Providing PageRank information with links Establishing trust –Wealth of authors on the web, who to trust? –Use PageRank to rate trust

24 Applications: The Future Internal Development –Project no longer in academic realm Lacks the transparency initially intended Role of PageRank unclear Likely focus on extensions and results tuning External Development –APIs Allowing innovative use of Google technologies –Open Source Code Focused on developing infrastructure

25 Conclusions Academic Background –Success from strong academic understanding –Raised profile of informatics and search –Good platform for future research Success as a failure –Intention for transparency and use in academia –Commercial success has removed transparency –Potentially bad for further research in this area

26 Summary We have seen: –The architecture used by Google –PageRank as a web metric –Strengths and potential manipulations –The commercial success of Google –Applications –Potential areas of future research

27 References Work by Brin & Page (now at Google) –Brin, S., Page, L. (1998), The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 30(1-7):107-117. –Page, L., Brin, S., Motwani, R. and Winograd, T. (1998), The PageRank citation ranking: Bringing order to the Web, Stanford Digital Library Technologies Project. –More papers at: on many aspects of web metrics and search in general PageRank – –Take a look at the example at: –

28 References (2) Further Developments –Haveliwala, T. H. (1999), Efficient computation of PageRank, Technical report, Stanford University, Stanford, CA. –Haveliwala, T. H. (2002), Topic-sensitive PageRank, in Proceedings of the Eleventh International World Wide Web Conference, Honolulu, Hawaii, May 2002. Commercial Aspect – – Web Metrics –Dhyani, D., Ng, W. K. and Bhowmick, S. S. (2002), A survey of Web metrics, ACM Computing Surveys, 34(4):469-503.
