
The Anatomy of a Large-Scale Hypertextual Web Search Engine. S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages 107-117, April 1998.


1 The Anatomy of a Large-Scale Hypertextual Web Search Engine
  S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages 107-117, April 1998
  Presented 2006. 3. 14 by Young Geun Han

2 Contents
  System Anatomy
    Crawling the Web
    Indexing the Web
    Searching
  Results and Performance
    Storage Requirements
    System Performance
    Search Performance
  Conclusions

3 Crawling the Web (1)
  Crawler
    The most fragile application
    Involves interacting with many web servers and name servers
  Running a web crawler
    Raises tricky performance, reliability, and social issues

4 Crawling the Web (2)
  Tricky performance
    Google has a fast distributed crawling system
    Each crawler keeps roughly 300 connections open at once
    At peak speeds, Google can crawl over 100 web pages per second using four crawlers (roughly 600K per second of data)
    Each crawler maintains its own DNS cache
    The crawler uses asynchronous IO and a number of queues to move page fetches through the states: looking up DNS, connecting to host, sending request, receiving response
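
The crawling scheme above can be sketched as a small asyncio program: each URL passes through the four stages (DNS lookup, connect, send request, receive response), a per-crawler in-memory DNS cache avoids repeated lookups, and a semaphore caps the number of simultaneously open connections. All network activity is stubbed out with `asyncio.sleep(0)`, and every name here is illustrative rather than taken from the paper's code.

```python
# Minimal sketch of an asynchronous crawler with a per-crawler DNS cache.
# Network calls are stand-ins; only the structure mirrors the description.
import asyncio

dns_cache: dict[str, str] = {}          # host -> IP, one cache per crawler

async def resolve(host: str) -> str:
    if host not in dns_cache:
        await asyncio.sleep(0)          # stand-in for a real async DNS query
        dns_cache[host] = f"10.0.0.{len(dns_cache) + 1}"  # fake resolved IP
    return dns_cache[host]

async def fetch(url: str) -> str:
    host = url.split("/")[2]
    ip = await resolve(host)            # stage 1: looking up DNS
    await asyncio.sleep(0)              # stage 2: connecting to host
    await asyncio.sleep(0)              # stage 3: sending request
    return f"page from {ip}"            # stage 4: receiving response

async def crawl(urls: list[str], max_connections: int = 300) -> list[str]:
    # the semaphore plays the role of "roughly 300 connections open at once"
    sem = asyncio.Semaphore(max_connections)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

pages = asyncio.run(crawl([
    "http://a.example/1", "http://a.example/2", "http://b.example/1",
]))
```

With two distinct hosts in the URL list, the cache ends up with exactly two entries, no matter how many pages were fetched.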

5 Crawling the Web (3)
  Reliability issues
    There are many people who do not know what a crawler is
    Because of this, running a crawler generates a fair amount of email and phone calls
    Some people assume the crawler's visits mean we like their web site very much
    There are some people who do not know about the robots exclusion protocol

6 Crawling the Web (4)
  Social issues
    Because of the huge amount of data involved, unexpected things will happen
    Problems that are easy to fix may not come up until tens of millions of pages have been downloaded
    It is impossible to test a crawler without running it on a large part of the Internet
    Crawlers need to be designed to be very robust and carefully tested

7 Indexing the Web (1)
  Parsing
    Any parser must handle a huge array of possible errors
    Use flex to generate a lexical analyzer for maximum speed
  [Figure: system architecture showing the URL Server, Crawler, Store Server, Repository, Indexer, Barrels, and Sorter]

8 Indexing the Web (2)
  Indexing Documents into Barrels
    After each document is parsed, it is encoded into a number of barrels
    Every word is converted into a wordID by using an in-memory hash table, the lexicon
    New additions to the lexicon hash table are logged to a file
    The words in the current document are translated into hit lists
    The words are written into the forward barrels
    For parallelization, each indexer writes a log of new words to a file instead of sharing the lexicon
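
The barrel-encoding step above can be sketched in a few lines: an in-memory dictionary plays the lexicon, assigning a fresh wordID to each new word, and a forward barrel maps docID to per-wordID hit lists. Hits are simplified to word positions only (the real hits also carry font and capitalization), and the log-to-file step for new lexicon entries is elided; the data layout here is illustrative, not the paper's on-disk format.

```python
# Hedged sketch of indexing documents into a forward barrel.
from collections import defaultdict

lexicon: dict[str, int] = {}                          # word -> wordID
forward_barrel: dict[int, dict[int, list[int]]] = {}  # docID -> wordID -> hits

def word_id(word: str) -> int:
    if word not in lexicon:
        lexicon[word] = len(lexicon)   # new addition (would be logged to a file)
    return lexicon[word]

def index_document(doc_id: int, text: str) -> None:
    hits: dict[int, list[int]] = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        hits[word_id(word)].append(pos)   # hit list: positions within the doc
    forward_barrel[doc_id] = dict(hits)

index_document(1, "large scale web search")
index_document(2, "web search engine")
```

Note that "web" gets the same wordID in both documents because the lexicon is shared across the indexing run, which is exactly why the paper logs new words rather than sharing the lexicon across parallel indexers.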

9 Indexing the Web (3)
  Sorting
    Takes each of the forward barrels
    Sorts it by wordID to produce an inverted barrel
    The sorting phase is parallelized
    Because the barrels do not fit into main memory, they are subdivided into baskets that do
    Each basket is sorted and its contents are written into the inverted barrel
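
The basket scheme above is an external sort, which can be sketched as follows: the forward barrel is treated as a list of (wordID, docID, hits) postings too large to sort at once, it is cut into fixed-size baskets that each "fit in memory", each basket is sorted by wordID, and the sorted runs are merged into the inverted barrel. The basket size and posting layout are illustrative assumptions.

```python
# Minimal external-sort sketch of the inverted-barrel construction.
import heapq

def invert(postings, basket_size=2):
    baskets = []
    for i in range(0, len(postings), basket_size):
        basket = sorted(postings[i:i + basket_size])   # a basket fits in memory
        baskets.append(basket)
    # merge the sorted runs into one inverted barrel, ordered by wordID
    return list(heapq.merge(*baskets))

# (wordID, docID, hit positions) postings as they appear in a forward barrel
forward = [(3, 1, [0]), (1, 1, [2]), (2, 2, [1]), (1, 2, [0]), (3, 2, [4])]
inverted = invert(forward)
```

`heapq.merge` streams the runs rather than concatenating them, which is the property that lets the real system write the inverted barrel sequentially without holding everything in memory.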

10 Searching (1)
  1. Parse the query
  2. Convert words into wordIDs
  3. Seek to the start of the doclist in the short barrel for every word
  4. Scan through the doclists until there is a document that matches all the search terms
  5. Compute the rank of that document for the query
  6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4
  7. If we are not at the end of any doclist, go to step 4
  8. Sort the documents that have matched by rank and return the top k
  Figure 4. Google Query Evaluation
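
The loop above can be sketched with the short/full barrel split reduced to two dictionaries mapping wordID to a doclist. For brevity this sketch always consults both tiers rather than falling back to the full barrel only when a short doclist is exhausted, and the doclist scan is done with set intersection instead of seek-and-scan; the ranking function is a stand-in parameter.

```python
# Hedged sketch of the query-evaluation steps (Figure 4), simplified.
def evaluate(query_word_ids, short_barrel, full_barrel, rank, k=10):
    matched = []
    for barrel in (short_barrel, full_barrel):        # steps 3 and 6
        doclists = [barrel.get(w, []) for w in query_word_ids]
        # step 4: find documents matching all the search terms
        common = set(doclists[0]).intersection(*map(set, doclists[1:]))
        for doc in common:
            matched.append((rank(doc, query_word_ids), doc))  # step 5
    matched.sort(reverse=True)                        # step 8: sort by rank
    return [doc for _, doc in matched[:k]]            # return the top k

short = {1: [10, 20], 2: [20, 30]}   # wordID -> doclist, short barrel
full = {1: [40, 50], 2: [50]}        # wordID -> doclist, full barrel
top = evaluate([1, 2], short, full, rank=lambda doc, q: 1.0 / doc)
```

Here only document 20 matches both words in the short barrels and only document 50 in the full barrels, so those two are ranked and returned.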

11 Searching (2)
  The Ranking System
    Every hit list includes position, font, and capitalization information
    Hits from anchor text and the PageRank of the document are also factored in
    The ranking function is designed so that no particular factor can have too much influence
    For a single-word search
      To rank a document, Google looks at that document's hit list for the word and computes an IR score, which is combined with PageRank
    For a multi-word search
      Hits occurring close together in a document are weighted higher than hits occurring far apart
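
A toy version of these ideas can be sketched as follows: type-weighted, damped hit counts give an IR score, a proximity bonus rewards hits that occur close together, and the result is combined with PageRank. All weights, the damping via `log1p`, and the multiplicative combination are illustrative assumptions, not the paper's actual values or formula.

```python
# Hedged sketch of an IR score combined with PageRank and proximity.
import math

TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "plain": 1.0}  # assumed weights

def ir_score(hits):
    # hits: list of (type, position); counts are damped with log1p so that
    # no single factor (e.g. many plain-text hits) can dominate the score
    score = 0.0
    for hit_type, weight in TYPE_WEIGHTS.items():
        n = sum(1 for t, _ in hits if t == hit_type)
        score += weight * math.log1p(n)
    return score

def proximity_bonus(positions_a, positions_b):
    # hits occurring close together weigh more than hits occurring far apart
    gap = min(abs(a - b) for a in positions_a for b in positions_b)
    return 1.0 / (1 + gap)

def rank(hits_per_word, pagerank):
    base = sum(ir_score(h) for h in hits_per_word)
    if len(hits_per_word) == 2:                  # simplest multi-word case
        pos = [[p for _, p in h] for h in hits_per_word]
        base += proximity_bonus(pos[0], pos[1])
    return base * pagerank                       # combine IR score with PageRank

score = rank(
    [[("title", 0), ("plain", 7)], [("plain", 8)]],  # hits for a 2-word query
    pagerank=0.8,
)
```

The key property to notice is relative: with identical hit types, a document whose query words appear one position apart outranks one where they are fifty positions apart.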

12 Searching (3)
  Feedback
    Google has a user feedback mechanism because figuring out the right values for the many ranking parameters is very difficult
    When the ranking function is modified, this mechanism gives developers some idea of how the change affects the search results

13 Results and Performance (1)
  [Figure: Google's results for a sample search, discussed on the next slide]

14 Results and Performance (2)
  Google's results for a search
    A number of results are from the whitehouse.gov domain
    Most major commercial search engines do not return any results from whitehouse.gov
    One result has no title because its page was not crawled
    Instead, Google relied on anchor text to determine it was a good answer to the query
    There are no results about a Bill other than Clinton or about a Clinton other than Bill

15 Results and Performance (3)
  Storage Requirements
  Table 1. Statistics

16 Results and Performance (4)
  System Performance
    In total it took roughly 9 days to download the 26 million pages (including errors)
    The last 11 million pages were downloaded in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second
    The indexer ran just faster than the crawlers, at roughly 54 pages per second
    Using four machines, the whole process of sorting takes about 24 hours
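
The crawl-rate figures above are internally consistent, as a quick arithmetic check shows:

```python
# 11 million pages in 63 hours, expressed per day and per second.
pages = 11_000_000
hours = 63
per_day = pages / hours * 24          # just over 4 million pages per day
per_second = pages / (hours * 3600)   # about 48.5 pages per second
```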

17 Results and Performance (5)
  Search Performance
    Google answers most queries in between 1 and 10 seconds
    The search time is mostly dominated by disk IO over NFS
  Table 2. Search Times

18 Conclusions (1)
  Google
    A scalable search engine
    Incorporates PageRank, anchor text, and proximity information
    A complete architecture for gathering web pages, indexing them, and performing search queries over them

19 Conclusions (2)
  Future Work
    Improve search efficiency and scale to approximately 100 million web pages
    Smart algorithms to decide which old web pages should be recrawled and which new ones should be crawled
  High Quality Search
    Google makes heavy use of hypertextual information, consisting of link structure and link text
    Google also uses proximity and font information
    The analysis of link structure via PageRank allows Google to evaluate the quality of web pages

20 Conclusions (3)
  Scalable Architecture
    Google is efficient in both space and time
    Google's major data structures make efficient use of available storage space
    The crawling, indexing, and sorting operations are efficient in time
    Google overcomes a number of bottlenecks
  A Research Tool
    Not only a high-quality search engine but also a research tool
    A necessary research tool for a wide range of applications

