
1 The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical and Computer Engineering

2 Overview @ Stanford University –Presented as a prototype of a large-scale search engine –26 million pages, 147 GB –Google ~ googol Issues –Scaling –Exploiting structure in Hypertext PageRank Algorithm Architecture Data Structures, Crawling, Indexing, Searching Results

3 PageRank Algorithm using link graph Anchor Text –Associate the anchor text of a link with the page it points to Information Retrieval –TREC => well controlled, homogeneous collections –Not equipped to handle Hypertext documents –Vector Space Model not enough
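A minimal sketch of PageRank over a link graph, included only to make the idea on this slide concrete. It iterates the formula given in the Brin & Page paper, PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)); the slide itself does not state the formula. The function name, the dictionary representation of the graph, the iteration count, and the handling of pages with no outgoing links are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to (assumed format)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the (1 - d) term.
        new_rank = {page: 1.0 - d for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # assumption: pages without out-links pass on no rank
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 3))
```

In the toy graph, page c is linked from both a and b, so it ends up with a higher score than b, which is linked only from a.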

4 Architecture URL Server Distributed Crawlers Storeserver Repository Indexer Barrels URL Resolver Sorter DumpLexicon Searcher

5 Data Structures BigFiles Repository Document Index Lexicon Hit Lists Forward Index Inverted Index

6 Repository Full HTML of every webpage Compressed using zlib Prefixed by docID, length, URL Files stored one after another
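The record layout on this slide can be sketched as follows. This is an illustration, not the authors' on-disk format: the field widths, byte order, and the order of the header fields are assumptions, and the function names are hypothetical. Only the use of zlib and the idea of a docID, length, and URL prefix on records stored back to back come from the slide.

```python
import struct
import zlib

# Assumed header: docID (4 bytes), URL length (2 bytes), compressed length (4 bytes).
HEADER = struct.Struct("<IHI")


def pack_record(doc_id, url, html):
    """Build one repository record: header, URL, zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def iter_records(blob):
    """Walk records that were stored one after another in a single blob."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


if __name__ == "__main__":
    repo = pack_record(1, "http://example.com/", "<html>one</html>")
    repo += pack_record(2, "http://example.com/two", "<html>two</html>")
    for record in iter_records(repo):
        print(record)
```

Because every record carries its own lengths, the repository can be scanned sequentially without any other index.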

7 Document Index Fixed width ISAM index Stores document status, pointer to repository, document checksum If document has been crawled, ptr to variable length docinfo file stored Otherwise ptr to URLlist stored

8 Hit Lists Plain and Fancy hits 2 bytes for each hit Length of hit list stored before the hits themselves
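A sketch of how a plain hit might fit into 2 bytes. The field sizes (1 capitalization bit, 3 bits of font size, 12 bits of word position) are taken from the Brin & Page paper rather than from this slide; the order of the fields within the 16 bits and the function names are assumptions.

```python
FANCY_FLAG = 7  # per the paper, font-size value 111 marks a fancy hit


def encode_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into one 16-bit integer (bit order is an assumption)."""
    if not 0 <= font_size < FANCY_FLAG:
        raise ValueError("font_size must be 0..6 for a plain hit")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # large positions are clamped to fit 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit):
    """Unpack the 16-bit integer into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


if __name__ == "__main__":
    hit = encode_plain_hit(True, 3, 100)
    print(hit, decode_plain_hit(hit))
```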

9 Forward Index Stored in 64 barrels. If a document contains words that fall into a barrel, then the docID is recorded into that barrel, followed by the list of wordIDs and their hitlists. Each wordID is stored as a relative difference from the minimum wordID in the barrel (24 bits for the wordID, 8 for the hitlist length).
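The 24 + 8 bit packing described on this slide can be sketched like this. The byte order, the function names, and the choice to reject hit lists longer than 255 entries are assumptions made for the example; only the bit widths and the relative wordID come from the slide.

```python
import struct


def pack_forward_entry(word_id, barrel_min_word_id, hits):
    """Pack one wordID entry: 24-bit relative wordID, 8-bit hit count, 2-byte hits."""
    delta = word_id - barrel_min_word_id
    if not 0 <= delta < (1 << 24):
        raise ValueError("wordID does not fit in this barrel's 24-bit range")
    if len(hits) > 255:
        raise ValueError("this sketch does not handle hit lists longer than 255")
    header = (delta << 8) | len(hits)
    return struct.pack(f"<I{len(hits)}H", header, *hits)


def unpack_forward_entry(data, barrel_min_word_id):
    """Recover (word_id, hits) from bytes produced by pack_forward_entry."""
    (header,) = struct.unpack_from("<I", data, 0)
    count = header & 0xFF
    hits = list(struct.unpack_from(f"<{count}H", data, 4))
    return barrel_min_word_id + (header >> 8), hits


if __name__ == "__main__":
    entry = pack_forward_entry(1005, 1000, [45156, 7])
    print(len(entry), unpack_forward_entry(entry, 1000))
```

Storing the difference from the barrel's minimum wordID is what lets the wordID fit in 24 bits, leaving the remaining 8 bits of a 32-bit word for the hit list length.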

10 Inverted Index Same barrels as forward index, but processed by the sorter. For every wordID, doclist of docIDs generated, with corresponding hitlists. Two sets of inverted barrels, one for hitlists with anchor or title text, another for all hitlists.
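Conceptually, the sorter turns docID-ordered forward entries into wordID-ordered doclists. The in-memory sketch below illustrates only that regrouping; the data layout and the function name are assumptions, and the real system works on barrel files on disk rather than Python dictionaries.

```python
from collections import defaultdict


def invert(forward_barrel):
    """forward_barrel: list of (doc_id, [(word_id, hitlist), ...]) (assumed layout).

    Returns {word_id: [(doc_id, hitlist), ...]} with doclists ordered by docID.
    """
    inverted = defaultdict(list)
    for doc_id, words in sorted(forward_barrel, key=lambda entry: entry[0]):
        for word_id, hitlist in words:
            inverted[word_id].append((doc_id, hitlist))
    # Order the result by wordID, as a sorted barrel would be.
    return dict(sorted(inverted.items()))


if __name__ == "__main__":
    forward = [
        (2, [(10, [1]), (11, [2, 3])]),
        (1, [(10, [4])]),
    ]
    print(invert(forward))
```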

11 Indexing the Web Parser – flex used to generate a lexical analyzer – “involved a fair amount of work” Indexing Documents into barrels –Every word hashed into wordID –Occurrences translated into hitlists and written into forward barrels –Lexicon needs to be shared Extra words written into a log, processed by one final indexer

12 Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
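Steps 3 to 8 above can be sketched as a merge-style intersection of sorted doclists, run first over the short barrels and then over the full barrels. This is a simplified illustration: the dictionary representation of a barrel, the function names, and the caller-supplied rank function are assumptions, and the sketch always continues into the full barrels.

```python
def intersect(doclists):
    """Yield docIDs present in every sorted doclist (steps 4 and 7)."""
    if not doclists or any(not doclist for doclist in doclists):
        return
    positions = [0] * len(doclists)
    while True:
        current = [doclist[pos] for doclist, pos in zip(doclists, positions)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            yield target
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the lists that are still behind the largest docID.
            positions = [pos + 1 if doc_id < target else pos
                         for pos, doc_id in zip(positions, current)]
        if any(pos >= len(doclist) for doclist, pos in zip(doclists, positions)):
            return  # reached the end of some doclist


def search(word_ids, short_barrel, full_barrel, rank, k=10):
    """Return the top k matching docIDs, best rank first."""
    scores = {}
    for barrel in (short_barrel, full_barrel):  # step 6: short first, then full
        doclists = [barrel.get(word_id, []) for word_id in word_ids]
        for doc_id in intersect(doclists):
            if doc_id not in scores:
                scores[doc_id] = rank(doc_id)  # step 5
    # Step 8: sort the matches by rank and keep the top k.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:k]


if __name__ == "__main__":
    short = {1: [2, 5], 2: [5, 9]}
    full = {1: [2, 5, 7, 9], 2: [5, 7, 9]}
    print(search([1, 2], short, full, rank=lambda doc_id: float(doc_id), k=2))
```

The rank function in the usage example simply returns the docID as a score, which is a stand-in chosen to keep the example short.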

13 Ranking… Count weight generated for each word in query Dot product taken with type weight vector (for single word queries) or with type-prox weight vector (for multiple word queries) Combined with PageRank to give final score.
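The single-word case on this slide can be sketched as a dot product between count weights and type weights. Everything numeric here is an assumption: the slide gives no weight values, the capped count-weight function is only a stand-in for a weight that grows with the count and then levels off, and the way the IR score is combined with PageRank (a weighted sum below) is not specified in the slide.

```python
def count_weight(count, cap=8):
    """Assumed shape: grows linearly with the count, then stops growing at cap."""
    return float(min(count, cap))


def ir_score(hit_counts, type_weights, cap=8):
    """Dot product of count weights with type weights over the hit types."""
    return sum(
        count_weight(count, cap) * type_weights.get(hit_type, 0.0)
        for hit_type, count in hit_counts.items()
    )


def final_score(ir, pagerank, pagerank_weight=1.0):
    """Assumed combination rule: a weighted sum of IR score and PageRank."""
    return ir + pagerank_weight * pagerank


if __name__ == "__main__":
    # Hypothetical type weights and per-type hit counts for one document.
    weights = {"title": 10.0, "anchor": 8.0, "plain": 1.0}
    counts = {"title": 1, "plain": 20}
    ir = ir_score(counts, weights)
    print(ir, final_score(ir, pagerank=2.0, pagerank_weight=0.5))
```

Capping the count weight means that repeating a word many times in the body text cannot outweigh a single occurrence in a more heavily weighted position such as the title.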

14 Results High quality pages zlib – 3:1 ratio 9 days to download 26 million pages –Indexer and crawler ran simultaneously Future work: –Query caching, smart disk allocation, updates –User context, relevance feedback

15 Footnote … foot in mouth!! “we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.”

