Presentation of Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page (1997) Presenter: Scott White.




1 Presentation of Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page (1997) Presenter: Scott White

2 Presentation Overview
–Problem
–Design Goals
–Google Search Engine Features
–Google Architecture
–Scalability
–Conclusions

3 Problem
–The web is vast and growing exponentially
–The web is heterogeneous: ASCII, HTML, images, Java applets, etc.
–Human-maintained lists can't keep up
–Previous search methodologies relied on keyword matching, producing low-quality matches
–Human attention is confined to ~10-1000 documents → users' ability to locate documents is getting harder

4 Solution = Google
"Our main goal is to improve the quality of web search engines"
–Google ← googol = 10^100
–Originally part of the Stanford digital library project known as WebBase; commercialized in 1999

5 Specific Design Goals
–Deliver results that have very high precision, even at the expense of recall
–Make search engine technology transparent, i.e. advertising shouldn't bias results
–Bring search engine technology into the academic realm in order to support novel research activities on large web data sets
–Make the system easy to use for most people, e.g. users shouldn't have to specify more than a couple of words

6 Google Search Engine Features
Two main features increase result precision:
–Uses the link structure of the web (PageRank)
–Uses the text surrounding hyperlinks to improve document retrieval
Other features include:
–Takes word proximity within documents into account
–Uses font size, word position, etc. to weight words
–Stores the full raw HTML of pages

7 PageRank For Dummies
Intuition:
–Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps.
–Occasionally, the surfer gets bored and, instead of following a link pointing outward from the current page, jumps to another random page.
–At some point, the percentage of time spent at each page converges to a fixed value.
–This value is known as the PageRank of the page.
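The random-surfer intuition above can be checked directly by simulation. The sketch below uses a hypothetical four-page web graph and a boredom probability of 0.15 (both illustrative assumptions, not values from the paper); visit frequencies approximate PageRank.

```python
import random

# Hypothetical 4-page web graph: page -> pages it links to (illustrative only).
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],  # nothing links back to D
}

def simulate_surfer(links, steps=200_000, boredom=0.15, seed=42):
    """Random walk with occasional random jumps; returns visit frequencies."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if rng.random() < boredom or not links[page]:
            page = rng.choice(pages)        # bored: jump to a random page
        else:
            page = rng.choice(links[page])  # follow a random outgoing link
    return {p: visits[p] / steps for p in pages}

freq = simulate_surfer(LINKS)
```

Page C, with three incoming links, accumulates a much larger share of the surfer's time than page D, which is reached only by random jumps.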

8 PageRank For Techies
–N(p): # outgoing links from page p
–B(p): set of pages that point to p
–d: tendency to get "bored", 0 ≤ d ≤ 1
–R(p): PageRank of p

R(p) = (1-d) · Σ_{q ∈ B(p)} R(q)/N(q) + d
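The formula above has a fixed point that can be found by simple iteration. A minimal sketch, using the same hypothetical four-page graph as before and assuming every page has at least one outgoing link (the paper handles dangling links separately):

```python
def pagerank(links, d=0.15, iters=50):
    """Iterate R(p) = (1-d) * sum(R(q)/N(q) for q in B(p)) + d.

    d is the "boredom" (random-jump) probability. In this unnormalised
    form the ranks sum to the number of pages, so the average rank is 1.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    # B(p): set of pages that point to p
    backlinks = {p: [q for q in pages if p in links[q]] for p in pages}
    rank = {p: 1.0 for p in pages}
    for _ in range(iters):
        rank = {p: (1 - d) * sum(rank[q] / len(links[q]) for q in backlinks[p]) + d
                for p in pages}
    return rank

# Same illustrative graph as in the random-surfer sketch.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
```

After 50 iterations the ranks are essentially converged: C (three backlinks) ranks highest, and D (no backlinks) gets only the baseline rank d.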

9 Why do we need d?
–In the real world, virtually all web graphs are not connected, i.e. they have dead-ends, islands, etc.
–Without d we get "rank leaks" in graphs that are not connected, i.e. numerical instability

10 Justifications for using PageRank
–Attempts to model user behavior
–Captures the notion that the more a page is pointed to by "important" pages, the more it is worth looking at
–Takes into account the global structure of the web

11 Google Architecture Implemented in C and C++ on Solaris and Linux

12 Preliminary
A "hitlist" is defined as the list of occurrences of a particular word in a particular document, including additional meta info:
–position of word in doc
–font size
–capitalization
–descriptor type, e.g. title, anchor, etc.
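As a concrete illustration, a hit can be modeled as a small record carrying the fields listed above. This is a hypothetical, simplified representation; the real system hand-packed each hit into two bytes rather than using objects.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One occurrence of a word in a document (simplified sketch)."""
    position: int      # word position within the document
    font_size: int     # font size relative to the rest of the document
    capitalized: bool  # was the word capitalized?
    kind: str          # descriptor type: "plain", "title", "anchor", "url", ...

# A hitlist is simply the list of such records for one (word, document) pair.
hitlist = [Hit(3, 2, False, "title"), Hit(47, 1, True, "plain")]
```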

13 Google Architecture (cont.)
–URL Server: keeps track of URLs that have been and need to be crawled
–Crawlers: multiple crawlers run in parallel; each crawler keeps its own DNS lookup cache and ~300 connections open at once
–Store Server: compresses and stores web pages
–Repository: contains the full HTML of every web page; each document is prefixed by docID, length, and URL
–Indexer: uncompresses and parses documents; stores link information in the anchors file
–Anchors file: stores each link and the text surrounding it
–URL Resolver: converts relative URLs into absolute URLs

14 Google Architecture (cont.)
–URL Resolver: maps absolute URLs into docIDs stored in the Doc Index; stores anchor text in "barrels"; generates a database of links (pairs of docIDs)
–Indexer: parses and distributes hit lists into "barrels"
–Barrels: partially sorted forward indexes, sorted by docID; each barrel stores hitlists for a given range of wordIDs
–Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID
–Lexicon: in-memory hash table that maps words to wordIDs; contains a pointer to the doclist in the barrel that each wordID falls into
–Doc Index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc.; also contains URL info if the doc has been crawled (if not, just the URL)
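The forward-then-inverted indexing pipeline described above can be sketched in miniature. This is a toy illustration under obvious simplifications: the lexicon is a plain dict acting as the in-memory hash table, positions stand in for full hitlists, and barrels/partitioning are omitted entirely.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {docID: text}. Returns (lexicon, inverted_index).

    lexicon maps word -> wordID (the in-memory hash table);
    inverted_index maps wordID -> {docID: [positions]}, i.e. a doclist
    with per-document hit positions. Barrels and sorting are omitted.
    """
    lexicon = {}
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))  # assign wordIDs on first sight
            inverted[word_id][doc_id].append(pos)
    return lexicon, inverted

lex, idx = build_index({1: "web search engine", 2: "large scale web search"})
```

Given a wordID from the lexicon, the inverted index yields the list of documents containing it along with the hit positions, which is exactly what query evaluation needs.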

15 Google Architecture (cont.)
–The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create a new lexicon used by the Searcher; the lexicon stores ~14 million words
–The new lexicon keyed by wordID, the inverted doc index keyed by docID, and the PageRanks are used to answer queries
–2 kinds of barrels: short barrels contain hitlists that include title or anchor hits; full barrels contain all hitlists

16 Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
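The core of the loop above is intersecting the doclists of all query words and ranking the matches. A minimal sketch, simplified to a single barrel (the short-barrel/full-barrel two-pass fallback of step 6 is omitted) and an in-memory index; `rank_fn` stands in for the rank computation of step 5:

```python
def evaluate_query(query, lexicon, inverted, rank_fn, k=10):
    """Intersect the doclists of every query word, rank each matching
    document with rank_fn, and return the top k docIDs (simplified)."""
    words = query.lower().split()                       # step 1: parse
    word_ids = [lexicon[w] for w in words if w in lexicon]  # step 2: word -> wordID
    if len(word_ids) < len(words):
        return []                                       # a term occurs nowhere
    doclists = [set(inverted[wid]) for wid in word_ids]
    matches = set.intersection(*doclists)               # step 4: docs with all terms
    return sorted(matches, key=rank_fn, reverse=True)[:k]  # steps 5 & 8

# Tiny hand-built index: lexicon word -> wordID, wordID -> {docID: positions}.
lexicon = {"web": 0, "search": 1}
inverted = {0: {1: [0], 2: [2]}, 1: {1: [1], 2: [3]}}
top = evaluate_query("web search", lexicon, inverted, rank_fn=lambda d: d)
```

Here `rank_fn=lambda d: d` is a placeholder ordering by docID; in the real system this is where the IR score and PageRank are combined.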

17 Single Word Query Ranking
–The hitlist is retrieved for the single word
–Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
–Each hit type is assigned its own weight
–The type-weights make up a vector of weights
–The # of hits of each type is counted to form a count vector
–The dot product of the two vectors is used to compute the IR score
–The IR score is combined with PageRank to compute the final rank
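The dot-product scoring can be sketched directly. The type weights, the linear combination with PageRank, and the `alpha` parameter below are all illustrative assumptions: the paper specifies neither the actual weight values, the count damping it applies, nor the exact combination function.

```python
# Hypothetical type weights; the paper does not publish the real values.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 5.0, "plain": 1.0}

def ir_score(hit_counts, type_weights=TYPE_WEIGHTS):
    """Dot product of the count vector and the type-weight vector.
    (The paper additionally damps large counts; omitted here.)"""
    return sum(type_weights[t] * n for t, n in hit_counts.items())

def final_rank(hit_counts, pagerank, alpha=0.5):
    """One simple way to combine IR score and PageRank; alpha and the
    linear form are assumptions, not the paper's actual function."""
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank

score = ir_score({"title": 1, "plain": 4})  # 8.0*1 + 1.0*4
```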

18 Multi-word Query Ranking
–Similar to single-word ranking, except now proximity must be analyzed
–Hits occurring closer together are weighted higher
–Each proximity relation is classified into 1 of 10 values, ranging from a phrase match to "not even close"
–Counts are computed for every type of hit and proximity

19 Scalability
Cluster architecture combined with Moore's Law makes for high scalability. At the time of writing:
–~24 million documents indexed in one week
–~518 million hyperlinks indexed
–Four crawlers collected 100 documents/sec

20 Summary of Key Optimization Techniques
–Each crawler maintains its own DNS lookup cache
–flex is used to generate a lexical analyzer with its own stack for parsing documents
–Parallelization of the indexing phase
–In-memory lexicon
–Compression of the repository
–Compact encoding of hitlists, accounting for major space savings
–The indexer is optimized so it is just faster than the crawler, so that crawling is the bottleneck
–The document index is updated in bulk
–Critical data structures are placed on local disk
–The overall architecture is designed to avoid disk seeks wherever possible

21 Storage Requirements
At the time of publication, Google had the following statistical breakdown for storage requirements: (table not included in transcript)

22 Conclusion
–The writing is not very clear:
 –Weak presentation of the PageRank model
 –Sentences are often too long and dense
 –Poor presentation structure
–No formal user evaluation of search result quality
–Still, this is one of the seminal papers in IR:
 –Highly cited
 –The PageRank link analysis algorithm is still one of the best available
 –Today's Google architecture is still very similar to the one described in this paper
 –The success of Google is based, in large part, on ideas discussed in this paper




