The Anatomy of a Large-Scale Hypertextual Web Search Engine
By Sergey Brin and Lawrence Page, developers of Google (1997)
The Problem
The Web continues to grow rapidly, and so does the number of its inexperienced users.
Human-maintained lists:
– Cannot keep up with the volume of changes
– Are subjective
– Do not cover all topics
Automated search engines:
– Return too many low-relevance matches
– Are misled by advertisers for commercial purposes
Keeping up with the Web
Requirements:
– Fast crawling technology
– Efficient use of storage (indices and possibly documents)
– Handling hundreds to thousands of queries per second
Mitigating factors:
– Technology performance improves (exceptions: disk seek time, OS stability) while its cost tends to decline.
Design Goal: Search Quality
By 1994: All that is needed is a complete index of the Web.
By 1997: An index may be complete and still return many junk results that drown out the relevant ones.
– Index size has increased, and so has the number of matches, but…
– people are still willing to look at only a handful of results.
Only the most relevant documents should be returned.
In theory, the Web’s link structure and link text help in finding such relevant documents.
Design Goal: Academic Research
The Web’s orientation has shifted from academic to commercial.
Search engine development had remained an obscure and proprietary area.
Google wants to make it more understandable at the academic level and to promote continuing research.
By caching parts of the Web, Google itself serves as a research platform from which new results can be derived quickly.
Prioritizing Pages: PageRank
The Web can be described as a huge graph.
Such a graph can be used to quickly calculate the importance of each page; that importance then helps rank the results that match the keywords given by the user.
This resource had gone largely unused until Google.
An Example of PageRank
The rank of a page A is defined in terms of:
– The PageRank of each page Ti pointing to it: PR(Ti)
– The number of links going out of each page Ti: C(Ti)
– A damping factor d between 0 and 1
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
This formula is evaluated with an iterative algorithm in a short time.
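The iteration mentioned on this slide can be sketched as follows. This is a minimal illustration and not Google's implementation: the three-page graph, the value d = 0.85, and the fixed number of iterations are all assumptions chosen for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # start every page at rank 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links here contributes PR(T) divided by its outgoing link count C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical graph: A and B link to each other and to C; C links back to A.
links = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
ranks = pagerank(links)
```

In this toy graph A ends up with the highest rank because it is referenced by both other pages, one of which (C) passes on its whole rank through a single link.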
Intuition behind PageRank
1. A “random surfer” starts from a random page and clicks random links again and again. The probability of the surfer visiting page A is P(A) = PR(A). http://www.bleb.org/random
2. At times the surfer asks to start again from another random page. How often this happens is governed by the damping factor d; random restarts make it harder to mislead the system intentionally.
Intuition behind PageRank
Pages: www.google.com, www.uga.edu, www.cs.uga.edu, www.osu.edu, www.johnsmith.com
Index looks like:
… www.google.com
… www.google.com
… www.google.com
… www.google.com
… www.uga.edu
… www.uga.edu
… www.osu.edu
… www.cs.uga.edu
… www.johnsmith.com
The more references a page has, the more likely the “random surfer” is to reach it. That likelihood is the page’s PageRank.
d exists so that the decision is not always based on page references alone, since someone could inflate them intentionally.
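The random surfer can also be simulated directly. The sketch below rests on several assumptions: the link structure is invented so that each page receives the number of references listed in the index above, and the surfer follows a link with probability d and restarts with probability 1 - d, which is the reading that matches the PageRank formula.

```python
import random
from collections import Counter


def random_surfer(links, d=0.85, steps=100_000, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(links)
    visits = Counter()
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])  # follow a random link on the page
        else:
            page = rng.choice(pages)  # restart from a random page
    return {p: visits[p] / steps for p in pages}


# Hypothetical links: google is referenced 4 times, uga twice, the others once.
links = {
    "www.google.com": ["www.uga.edu", "www.osu.edu", "www.johnsmith.com"],
    "www.uga.edu": ["www.google.com", "www.cs.uga.edu"],
    "www.cs.uga.edu": ["www.google.com", "www.uga.edu"],
    "www.osu.edu": ["www.google.com"],
    "www.johnsmith.com": ["www.google.com"],
}
freq = random_surfer(links)
most_visited = max(freq, key=freq.get)
```

With this graph the surfer spends the largest share of time on www.google.com, the page with the most references.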
Anchor Text
Often describes a page better than the page itself does.
Is associated not only with the page where it is found, but also with the page it points to.
Makes it possible to index non-text content.
Downside: The destinations of these links are not verified, so they may not even exist.
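A minimal sketch of how anchor text might be credited to the page a link points to, which is what lets an image or other non-text target be found by words. The function name, the URLs, and the tuple layout are hypothetical; for brevity the sketch credits only the target page.

```python
from collections import defaultdict


def index_anchor_text(links):
    """links: iterable of (source_url, target_url, anchor_text) tuples."""
    index = defaultdict(set)
    for _source, target, text in links:
        for word in text.lower().split():
            index[word].add(target)  # credit the word to the page the link points to
    return index


# Hypothetical links pointing at an image, which has no text of its own.
links = [
    ("http://example.org/a", "http://example.org/logo.png", "Company logo"),
    ("http://example.org/b", "http://example.org/logo.png", "our logo"),
]
anchor_index = index_anchor_text(links)
```

Here a query for "logo" can return the image even though the image itself contains no words.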
Other Features
Takes into account the in-page position of hits.
Takes into account the presentation of words (large size, bold, etc.), weighting them accordingly.
The HTML of pages is cached in a repository.
World Wide Web Worm was one of the first search engines.
Many early search engines went on to become public companies.
Details of such search engines are usually confidential.
There is known work on post-processing the results of major search engines.
Research on Information Retrieval
Produced results based on controlled sets of documents in a specific area.
Methods tuned on even the largest benchmark (TREC-96) would not scale well to a much bigger and more heterogeneous place like the Web.
Given a popular topic, users should not need to give many details about it in order to get relevant results.
The Web vs. Controlled Collections
The Web is a completely uncontrolled collection of documents varying in their…
– languages: both human and programming
– vocabulary: from zip codes to product numbers
– format: text, HTML, PDF, images, sounds
– source: human or machine-generated
– external meta information: source reputation, update frequency, etc. are all valuable but hard to measure.
Any type of content + influence of search engines + intentional for-profit misleading ≠ controlled!
Architecture Overview
Implemented in C/C++; runs on Solaris or Linux.
1. A URLServer sends lists of URLs to be fetched by a set of crawlers.
2. Fetched pages are given a docID and sent to a StoreServer, which compresses and stores them in a repository.
3. The indexer reads pages from the repository and parses them to convert their words into hits.
4. Its output goes to barrels, which are partially sorted indexes.
5. It also builds the anchors file from the links in each page, recording where each link points from and to.
Architecture Overview
6. The URLResolver reads the anchors file, converts relative URLs into absolute URLs, and assigns docIDs.
7. The forward index is updated with the docIDs that the links point to.
8. The links database is also created as pairs of docIDs. It is used to calculate PageRank.
9. The sorter takes the barrels and re-sorts them by wordID to produce the inverted index. It also produces a list of wordIDs with their offsets into the inverted index.
10. This list is converted into a lexicon (vocabulary).
11. The searcher is run by a Web server and uses the lexicon, inverted index and PageRank to answer queries.
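The URL resolution step can be sketched as follows: relative links are made absolute, every URL gets a docID, and the result is a list of docID pairs of the kind the links database holds. The function name, the record layout, and the sequential docID scheme are assumptions for illustration; only `urllib.parse.urljoin` is a real library call.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors):
    """anchors: list of (from_url, href, anchor_text). Returns (doc_ids, link_pairs)."""
    doc_ids = {}

    def doc_id(url):
        # Hand out the next free integer the first time a URL is seen.
        return doc_ids.setdefault(url, len(doc_ids))

    link_pairs = []
    for from_url, href, _text in anchors:
        to_url = urljoin(from_url, href)  # relative URL -> absolute URL
        link_pairs.append((doc_id(from_url), doc_id(to_url)))
    return doc_ids, link_pairs


# Hypothetical anchors file entries.
anchors = [
    ("http://example.org/dir/a.html", "b.html", "next"),
    ("http://example.org/dir/b.html", "../index.html", "home"),
    ("http://example.org/dir/a.html", "/index.html", "home"),
]
doc_ids, link_pairs = resolve_anchors(anchors)
```

The two different spellings of the home page link resolve to the same absolute URL and therefore to the same docID.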
Data Structures
BigFiles: Virtual files spanning multiple file systems, going beyond what the OS supports.
Repository: Contains the actual HTML, compressed about 3:1 using open-source zlib.
– Stored like variable-length data in a DBMS
– Independent of other data structures
– Other data structures can be rebuilt from it
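A minimal sketch of a variable-length, zlib-compressed repository record. The header layout (docID, URL length, compressed body length) is a hypothetical format chosen for the example, not the exact one used by Google.

```python
import struct
import zlib

HEADER = "<IHI"  # assumed layout: docID, URL length, compressed body length


def pack_record(doc_id, url, html):
    """Build one variable-length record: fixed header, then URL, then compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return struct.pack(HEADER, doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(record):
    """Recover (doc_id, url, html) from a record built by pack_record."""
    doc_id, url_len, body_len = struct.unpack_from(HEADER, record, 0)
    start = struct.calcsize(HEADER)
    url = record[start:start + url_len].decode("utf-8")
    body = record[start + url_len:start + url_len + body_len]
    return doc_id, url, zlib.decompress(body).decode("utf-8")


html = "<html><body>" + "<p>hello web</p>" * 200 + "</body></html>"
record = pack_record(7, "http://example.org/", html)
```

Because each record carries its own lengths, the repository can be read sequentially without any other data structure, which is what allows the other structures to be rebuilt from it.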
Data Structures
Document Index: Indexed sequential file with status information about each document.
To avoid slow disk seek operations, lookups for the URL resolver are done in batch mode. Otherwise they would take months.
Lexicon: The list of words. It fits in 256 MB of main memory and holds 14 million words plus a hash table of pointers.
Data Structures
Hit Lists: Record the occurrences of a word in a document, plus details. They account for most of the space used.
– Fancy hits: URL, title, anchor text
– Plain hits: Everything else
– Details are encoded compactly as bit fields.
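As an illustration of such bit fields, the original paper describes a plain hit as 16 bits: one capitalization bit, three bits of font size, and twelve bits of word position. The sketch below packs those fields into one integer; the order of the fields within the 16 bits and the clamping of large positions to 4095 are assumptions.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 capitalization bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size < 7:
        # The paper reserves the font value 7 (binary 111) to flag a fancy hit.
        raise ValueError("font_size must be in 0..6")
    position = min(position, 4095)  # assumption: clamp positions that do not fit in 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 42)
```

Two bytes per occurrence is what keeps the hit lists, the largest structure in the system, manageable.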
Data Structures
Forward Index: Stores, for each docID, the wordIDs that occur in the document together with their hits. Stored in partially sorted indexes called “barrels”.
Inverted Index: The same barrels after sorting by wordID. Each wordID points to the docIDs that contain it, and each docID to its hits.
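The relationship between the two indexes can be sketched with plain dictionaries. The IDs and hit positions are made up, and a real barrel is a compact on-disk structure rather than a nested dictionary.

```python
from collections import defaultdict

# Hypothetical forward index: docID -> {wordID: [hit positions]}
forward_index = {
    1: {10: [0], 11: [1]},
    2: {11: [0, 2]},
}


def invert(forward):
    """Regroup a forward index by wordID: wordID -> {docID: [hit positions]}."""
    inverted = defaultdict(dict)
    for doc_id in sorted(forward):
        for word_id, hits in forward[doc_id].items():
            inverted[word_id][doc_id] = hits
    return dict(sorted(inverted.items()))  # ordered by wordID


inverted_index = invert(forward_index)
```

The content is identical in both structures; only the grouping key changes from docID to wordID, which is what makes lookups by query word fast.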
A Simple Inverted Index
Example: Pages containing the phrases "i love you", "god is love", "love is blind", and "blind justice". Each entry is (page number, character offset):
blind (3,8);(4,0)
god (2,0)
i (1,0)
is (2,4);(3,5)
justice (4,6)
love (1,2);(2,7);(3,0)
you (1,7)
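The example index above can be reproduced with a few lines of code. This is a sketch under the assumption that each entry is a (page number, character offset) pair, which is the reading that matches the numbers on the slide; the helper `pages_with_all` is a hypothetical addition showing how such an index answers a multi-word query.

```python
import re
from collections import defaultdict

pages = {1: "i love you", 2: "god is love", 3: "love is blind", 4: "blind justice"}


def build_inverted_index(pages):
    """Map each word to a list of (page number, character offset) pairs."""
    index = defaultdict(list)
    for page_no in sorted(pages):
        for match in re.finditer(r"\w+", pages[page_no]):
            index[match.group()].append((page_no, match.start()))
    return dict(index)


def pages_with_all(index, words):
    """Return the page numbers that contain every word in the query."""
    sets = [{page for page, _offset in index.get(word, [])} for word in words]
    return set.intersection(*sets) if sets else set()


index = build_inverted_index(pages)
both = pages_with_all(index, ["love", "is"])
```

A query for "love" and "is" intersects the two posting lists and returns pages 2 and 3.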
Crawling the Web
Distributed crawling system: 3 crawlers with about 300 concurrent connections each, reaching about 100 pages per second at roughly 600 KB/sec.
Stress on DNS lookups is reduced by keeping a DNS cache in each crawler.
Social consequences arise from people's lack of knowledge about crawlers (“This page is copyrighted and should not be indexed”).
A crawler can run into any kind of behavior on the net, so intensive testing is required.
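A per-crawler DNS cache can be as simple as a dictionary in front of the resolver. The class below is a hypothetical sketch; it ignores expiry of cached entries, which a real crawler would need to handle. The resolver is passed in so that the example runs without network access.

```python
import socket


class DnsCache:
    """Remember resolved host names so each host is looked up only once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver
        self._cache = {}
        self.lookups = 0  # number of real resolver calls made

    def resolve(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


# Stand-in resolver for the example; 192.0.2.1 is a documentation-only address.
cache = DnsCache(resolver=lambda host: "192.0.2.1")
addresses = [cache.resolve("example.org") for _ in range(3)]
```

Three requests for the same host cost a single lookup, which matters when hundreds of connections are open at once.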
Indexing the Web
Parser: Must be robust enough to expect and deal with a huge number of special situations.
Indexing into barrels: Word > wordID > update lexicon > hit lists > forward barrels. There is high contention for the lexicon.
Sorting: The inverted index is generated from the forward barrels by sorting them one at a time, which avoids temporary storage, using a two-phase multiway merge sort (TPMMS).
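The chain "word > wordID > update lexicon > hit lists > forward barrels" can be sketched as follows. The function, the in-memory dictionaries, and the rule that assigns wordIDs to barrels by range are assumptions made for the example.

```python
def index_document(doc_id, words, lexicon, barrels, words_per_barrel=2):
    """Add one parsed document to the forward barrels, growing the shared lexicon."""
    for position, word in enumerate(words):
        # Word -> wordID: reuse the existing ID or add the word to the lexicon.
        word_id = lexicon.setdefault(word, len(lexicon))
        # Each barrel covers a range of wordIDs (assumed rule for this sketch).
        barrel = barrels.setdefault(word_id // words_per_barrel, {})
        # Append the hit (here just the word position) to this document's hit list.
        barrel.setdefault(doc_id, {}).setdefault(word_id, []).append(position)


lexicon, barrels = {}, {}
index_document(1, ["love", "is", "blind"], lexicon, barrels)
index_document(2, ["blind", "justice"], lexicon, barrels)
```

Every indexed word reads or writes the same lexicon, which illustrates why the slide reports high contention for it when several indexers run in parallel.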
Searching
The ranking system is better because more data (font, position and capitalization) is maintained about documents. This and PageRank help in finding better results.
Feedback: Selected users can grade search results, which helps evaluate changes to the ranking function.
Results and Performance
“The most important measure of a search engine is the quality of its search results.”
Results are correct even for non-commercial or rarely referenced sites.
Cost-effective: A significant part of 1997’s Web was held in a 53 GB repository, and all other data fit in an additional 55 GB.
Google’s Philosophy
“…it is crucial to have a competitive search engine that is transparent and in the academic realm.”
Google is probably the only leading IT company that is loved by everyone, and it remains attached to its principles despite its amazing profit potential.
The Near Future
http://labs1.google.com/gvs.html
http://froogle.google.com/froogle/about.html