The Anatomy of a Large-Scale Hypertextual Web Search Engine


The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin, Lawrence Page Presented By: Paolo Lim April 10, 2007 CS 331 - Data Mining

AKA: The Original Google Paper
Larry Page and Sergey Brin

Presentation Outline
- Design goals of Google search engine
- Link Analysis and other features
- System architecture and major structures
- Crawling, indexing, and searching the web
- Performance and results
- Conclusions
- Final exam questions

Linear Algebra Background
- PageRank involves knowledge of:
  - Matrix addition/multiplication
  - Eigenvectors and eigenvalues
  - Power iteration
  - Dot product
- Not discussed in detail in presentation
- For reference:
  - http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf
  - http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf

Google Design Goals
- Scaling with the web’s growth
- Improved search quality
  - Number of documents increasing rapidly, but user’s ability to look at documents lags
  - Lots of “junk” results, little relevance
- Academic search engine research
  - Development and understanding in academic realm
  - System that a reasonable number of people can actually use
  - Support novel research activities on large-scale web data by other researchers and students

Link Analysis Basics
- PageRank Algorithm
  - A Top 10 IEEE ICDM data mining algorithm
  - Large basis for ranking system (discussed later)
  - Tries to incorporate ideas from academic community (publishing and citations)
- Anchor Text Analysis
  - <a href=http://www.com> ANCHOR TEXT </a>

Intuition: Why Links, Anyway?
- Links represent citations
- Quantity of links to a website makes the website more popular
- Quality of links to a website also helps in computing rank
- Link structure largely unused before Larry Page proposed it to his thesis advisor

Naïve PageRank
- Each link’s vote is proportional to the importance of its source page
- If page P with importance I has N outlinks, then each link gets I / N votes
- Simple recursive formulation:
  - PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn), where p1, …, pn are the pages that link to A
  - PR(X) = PageRank of page X
  - C(X) = number of links going out of page X

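Naïve PageRank Model (from http://www. stanford
- The web in 1839
- [Figure: three-page link graph. Yahoo (y) links to itself and to Amazon, each link carrying y/2; Amazon (a) links to Yahoo and to M’soft, each link carrying a/2; M’soft (m) links to Amazon, carrying m.]
- Flow equations:
  - y = y/2 + a/2
  - a = y/2 + m
  - m = a/2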

Solving the flow equations
- 3 equations, 3 unknowns, no constants
  - No unique solution
  - All solutions equivalent modulo scale factor
- Additional constraint forces uniqueness
  - y + a + m = 1
  - y = 2/5, a = 2/5, m = 1/5
- Gaussian elimination works for small examples, but we need a better method for large graphs
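The small system above can be checked by hand or by machine. The following is a minimal sketch of exact Gaussian elimination, assuming the unknowns are ordered (y, a, m) and that the redundant third flow equation is replaced by the constraint y + a + m = 1; the function name `gauss_solve` is illustrative, not from the slides.

```python
from fractions import Fraction as F

def gauss_solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over Fractions."""
    n = len(A)
    # Augmented matrix [A | b] with exact rational entries.
    aug = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# y = y/2 + a/2   ->   y/2 - a/2      = 0
# a = y/2 + m     ->  -y/2 + a   - m  = 0
# y + a + m = 1   (replaces the redundant equation m = a/2)
A = [[F(1, 2), F(-1, 2), 0],
     [F(-1, 2), 1, -1],
     [1, 1, 1]]
b = [0, 0, 1]
y, a, m = gauss_solve(A, b)
print(y, a, m)
```

Elimination costs on the order of N^3 operations for N pages, which is why the slides move on to an iterative method for web-scale graphs.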

Matrix formulation
- Matrix M has one row and one column for each web page
- Suppose page j has n outlinks
  - If j → i, then $M_{ij} = 1/n$
  - Else $M_{ij} = 0$
- M is a column stochastic matrix
  - Columns sum to 1
- Suppose r is a vector with one entry per web page
  - $r_i$ is the importance score of page i
  - Call it the rank vector
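As an illustration of this definition, here is a short sketch that builds M from a link graph. The dictionary layout (`outlinks` mapping each page to the list of pages it links to) and the function name `link_matrix` are assumptions made for the example; the graph is the three-page Yahoo / Amazon / M’soft example, with links read off the flow equations, and each page is assumed to link to a given target at most once.

```python
from fractions import Fraction

def link_matrix(outlinks, pages):
    """Return M as a list of rows, with M[i][j] = 1/n if page j links to page i."""
    index = {p: k for k, p in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0)] * size for _ in range(size)]
    for j, src in enumerate(pages):
        targets = outlinks.get(src, [])
        for dst in targets:
            # Page j splits its vote evenly among its n = len(targets) outlinks.
            M[index[dst]][j] = Fraction(1, len(targets))
    return M

pages = ["y", "a", "m"]
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = link_matrix(outlinks, pages)

# Column sums: every column of a column stochastic matrix sums to 1.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(M)
print(column_sums)
```

A page with no outlinks would produce an all-zero column here, which is exactly the dead-end case discussed later.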

Example (from http://www. stanford
- Suppose page j links to 3 pages, including i
- [Figure: in the product $Mr = r$, column j of M has the entry 1/3 in row i, so page j contributes $r_j/3$ to $r_i$.]

Eigenvector formulation
- The flow equations can be written r = Mr
- So the rank vector is an eigenvector of the stochastic web matrix
  - In fact, its first or principal eigenvector, with corresponding eigenvalue 1

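Example (from http://www. stanford
- [Figure: the Yahoo / Amazon / M’soft link graph]
- Matrix M, with rows and columns ordered y, a, m:
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
- r = Mr:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$$
- Written out as flow equations:
  - y = y/2 + a/2
  - a = y/2 + m
  - m = a/2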

Power Iteration
- Simple iterative scheme (aka relaxation)
- Suppose there are N web pages
- Initialize: $r_0 = [1, \ldots, 1]^T$
- Iterate: $r_{k+1} = M r_k$
- Stop when $|r_{k+1} - r_k|_1 < \epsilon$
  - $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
  - Can use any other vector norm, e.g., Euclidean

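Power Iteration Example (from http://www. stanford
- [Figure: the Yahoo / Amazon / M’soft link graph]
- Matrix M, with rows and columns ordered y, a, m:
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$
- Successive iterates, starting from $[1, 1, 1]^T$:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 3/2 \\ 1/2 \end{pmatrix}, \begin{pmatrix} 5/4 \\ 1 \\ 3/4 \end{pmatrix}, \begin{pmatrix} 9/8 \\ 11/8 \\ 1/2 \end{pmatrix}, \ldots, \begin{pmatrix} 6/5 \\ 6/5 \\ 3/5 \end{pmatrix}$$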
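The iteration in this example can be reproduced with a few lines of code. This is a sketch under the slide's own setup (start vector of all ones, L1 stopping rule); the helper names `step` and `power_iteration` and the tolerance value are choices made for the illustration, and exact fractions are used so the early iterates can be compared digit for digit.

```python
from fractions import Fraction as F

def step(M, r):
    """One multiplication r -> M r."""
    n = len(r)
    return [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]

def power_iteration(M, r0, eps, max_iter=10_000):
    """Iterate r -> M r until the L1 change between iterates drops below eps."""
    r = r0
    for _ in range(max_iter):
        r_next = step(M, r)
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Rows and columns ordered y, a, m.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]
r0 = [F(1), F(1), F(1)]

r1 = step(M, r0)
r2 = step(M, r1)
r3 = step(M, r2)
r = power_iteration(M, r0, eps=F(1, 10**9))
print(r1, r2, r3)
print([float(x) for x in r])
```

The limit is 3 times the normalized solution (2/5, 2/5, 1/5) because the start vector sums to 3 and multiplication by a column stochastic matrix preserves that sum.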

Random Surfer
- Imagine a random web surfer
  - At any time t, surfer is on some page P
  - At time t+1, the surfer follows an outlink from P uniformly at random
  - Ends up on some page Q linked from P
  - Process repeats indefinitely
- Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  - p(t) is a probability distribution on pages

The stationary distribution
- Where is the surfer at time t+1?
  - Follows a link uniformly at random
  - p(t+1) = Mp(t)
- Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
  - Then p(t) is called a stationary distribution for the random walk
- Our rank vector r satisfies r = Mr
  - So it is a stationary distribution for the random surfer
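One way to see the link between the surfer and the rank vector is to simulate the walk and count visits. The sketch below assumes the three-page Yahoo / Amazon / M’soft graph from the earlier example; the function name `simulate_surfer`, the step count, and the fixed seed are arbitrary choices for the illustration.

```python
import random

def simulate_surfer(outlinks, start, steps, seed=0):
    """Follow random outlinks for `steps` steps and return visit frequencies."""
    rng = random.Random(seed)
    visits = {page: 0 for page in outlinks}
    page = start
    for _ in range(steps):
        # At each time step, follow one outlink chosen uniformly at random.
        page = rng.choice(outlinks[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(outlinks, start="y", steps=200_000)
print(freq)
```

For this graph the visit frequencies should settle near the normalized solution y = 2/5, a = 2/5, m = 1/5, up to sampling noise.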

Spider traps
- A group of pages is a spider trap if there are no links from within the group to outside the group
  - Random surfer gets trapped
- Spider traps violate the conditions needed for the random walk theorem

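Microsoft becomes a spider trap (from http://www. stanford
- [Figure: the Yahoo / Amazon / M’soft link graph, with M’soft now linking only to itself]
- Matrix M, with rows and columns ordered y, a, m:
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}$$
- Successive iterates, starting from $[1, 1, 1]^T$:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1/2 \\ 3/2 \end{pmatrix}, \begin{pmatrix} 3/4 \\ 1/2 \\ 7/4 \end{pmatrix}, \begin{pmatrix} 5/8 \\ 3/8 \\ 2 \end{pmatrix}, \ldots, \begin{pmatrix} 0 \\ 0 \\ 3 \end{pmatrix}$$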

Random teleports
- The Google solution for spider traps
- At each time step, the random surfer has two options:
  - With probability β, follow a link at random
  - With probability 1-β, jump to some page uniformly at random
  - Common values for β are in the range 0.8 to 0.9
- Surfer will teleport out of spider trap within a few time steps

Matrix formulation
- Suppose there are N pages
  - Consider a page j, with set of outlinks O(j)
  - We have $M_{ij} = 1/|O(j)|$ when j → i and $M_{ij} = 0$ otherwise
  - The random teleport is equivalent to
    - adding a teleport link from j to every other page with probability $(1-\beta)/N$
    - reducing the probability of following each outlink from $1/|O(j)|$ to $\beta/|O(j)|$
    - Equivalent: tax each page a fraction $(1-\beta)$ of its score and redistribute evenly

Page Rank
- Construct the N×N matrix A as follows
  - $A_{ij} = \beta M_{ij} + (1-\beta)/N$
- Verify that A is a stochastic matrix
- The page rank vector r is the principal eigenvector of this matrix
  - satisfying r = Ar
- Equivalently, r is the stationary distribution of the random walk with teleports

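Previous example with β = 0.8 (from http://www. stanford
- [Figure: the Yahoo / Amazon / M’soft link graph, with M’soft as a spider trap]
- Matrix A, with rows and columns ordered y, a, m:
$$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$
- Successive iterates, starting from $[1, 1, 1]^T$:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1.00 \\ 0.60 \\ 1.40 \end{pmatrix}, \begin{pmatrix} 0.84 \\ 0.60 \\ 1.56 \end{pmatrix}, \begin{pmatrix} 0.776 \\ 0.536 \\ 1.688 \end{pmatrix}, \ldots, \begin{pmatrix} 7/11 \\ 5/11 \\ 21/11 \end{pmatrix}$$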
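The numbers in this example can be checked mechanically. The sketch below builds the teleport matrix entry by entry as (teleport factor × link matrix entry) + (1 − teleport factor)/N, using a factor of 0.8 written as the exact fraction 4/5; the names `google_matrix`, `step`, and `beta` are illustrative assumptions, not names from the slides.

```python
from fractions import Fraction as F

def google_matrix(M, beta):
    """Blend the link matrix with uniform teleports: beta*M[i][j] + (1-beta)/N."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def step(A, r):
    """One multiplication r -> A r."""
    n = len(r)
    return [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]

# Spider-trap graph: M'soft links only to itself. Rows and columns ordered y, a, m.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(0)],
     [F(0),    F(1, 2), F(1)]]
beta = F(4, 5)  # 0.8

A = google_matrix(M, beta)
r1 = step(A, [F(1), F(1), F(1)])

# The limit quoted in the example is a fixed point of r -> A r.
r_star = [F(7, 11), F(5, 11), F(21, 11)]
is_fixed_point = step(A, r_star) == r_star
print(A)
print(r1, is_fixed_point)
```

With teleports the trap page still collects the largest score, but Yahoo and Amazon keep nonzero rank instead of draining to zero.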

Dead ends
- Pages with no outlinks are “dead ends” for the random surfer
  - Nowhere to go on next step

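Microsoft becomes a dead end (from http://www. stanford
- [Figure: the Yahoo / Amazon / M’soft link graph, with M’soft having no outlinks]
- Matrix A with β = 0.8, rows and columns ordered y, a, m. Non-stochastic!
$$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 1/15 \end{pmatrix}$$
- Successive iterates, starting from $[1, 1, 1]^T$:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0.6 \\ 0.6 \end{pmatrix}, \begin{pmatrix} 0.787 \\ 0.547 \\ 0.387 \end{pmatrix}, \begin{pmatrix} 0.648 \\ 0.430 \\ 0.333 \end{pmatrix}, \ldots, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$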

Dealing with dead-ends
- Teleport
  - Follow random teleport links with probability 1.0 from dead-ends
  - Adjust matrix accordingly
- Prune and propagate
  - Preprocess the graph to eliminate dead-ends
  - Might require multiple passes
  - Compute page rank on reduced graph
  - Approximate values for dead ends by propagating values from reduced graph
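The pruning half of "prune and propagate" can be sketched as follows. This is a minimal illustration, assuming the graph is a dictionary that has a key for every page and maps it to the pages it links to; the name `prune_dead_ends` is hypothetical, and the later propagation of scores back to the removed pages is not shown.

```python
def prune_dead_ends(outlinks):
    """Repeatedly remove pages with no outlinks; return (reduced graph, removal order)."""
    graph = {page: set(targets) for page, targets in outlinks.items()}
    removed = []
    while True:
        dead = [page for page, targets in graph.items() if not targets]
        if not dead:
            return graph, removed
        for page in dead:
            del graph[page]
            removed.append(page)
        # Removing a dead end can turn its predecessors into dead ends,
        # which is why several passes may be needed.
        for targets in graph.values():
            targets.difference_update(dead)

# M'soft as a dead end: one pass is enough.
reduced, removed = prune_dead_ends({"y": ["y", "a"], "a": ["y", "m"], "m": []})
print(reduced, removed)

# A chain a -> b -> c needs three passes and leaves nothing behind.
chain_reduced, chain_removed = prune_dead_ends({"a": ["b"], "b": ["c"], "c": []})
print(chain_reduced, chain_removed)
```

Keeping the removal order is useful because scores would be propagated back in the reverse order, from the reduced graph outward.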

Anchor Text
- Can be a more accurate description of the target site than the target site’s text itself
- Can point at non-HTTP or non-text
  - Images
  - Videos
  - Databases
- Possible for non-crawled pages to be returned in the process
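To make the idea concrete, here is a small sketch that pulls (target URL, anchor text) pairs out of a page using Python's standard `html.parser` module. The class name `AnchorExtractor` and the sample HTML are made up for the example; a production crawler would also need to resolve relative URLs and cope with malformed markup.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # list of (href, anchor text)
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorExtractor()
parser.feed('<p>See <a href="http://example.com/cat.jpg">a photo of a cat</a>.</p>')
print(parser.anchors)
```

Note that the target here is an image: the anchor text gives the index something to say about a document that has no text of its own.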

Other Features
- List of occurrences of a particular word in a particular document (Hit List)
- Location information and proximity
- Keeps track of visual presentation details:
  - Font size of words
  - Capitalization
  - Bold/Italic/Underlined/etc.
- Full raw HTML of all pages is available in repository

Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)
- Implemented in C and C++ on Solaris and Linux

Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)
- [Figure: architecture diagram; callouts:]
  - Multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.
  - Keeps track of URLs that have been crawled and URLs that still need to be crawled.
  - Compresses and stores web pages.
  - Stores each link and the text surrounding the link.
  - Converts relative URLs into absolute URLs.
  - Uncompresses and parses documents. Stores link information in anchors file.
  - Contains full HTML of every web page. Each document is prefixed by docID, length, and URL.

Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)
- [Figure: architecture diagram; callouts:]
  - Maps absolute URLs into docIDs stored in Doc Index. Stores anchor text in “barrels”. Generates database of links (pairs of docIDs).
  - Parses and distributes hit lists into “barrels.”
  - Partially sorted forward indexes, sorted by docID. Each barrel stores hit lists for a given range of wordIDs.
  - In-memory hash table that maps words to wordIDs. Contains pointer to doclist in the barrel which the wordID falls into.
  - Creates inverted index whereby the document list containing docIDs and hit lists can be retrieved given a wordID.
  - DocID-keyed index where each entry includes info such as pointer to doc in repository, checksum, statistics, status, etc. Also contains URL info if doc has been crawled. If not, just contains URL.
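The step from forward index to inverted index can be sketched in a few lines. The data layout below (a dictionary from docID to a dictionary from wordID to hit list, with hits as word positions) and the function name `invert` are simplifying assumptions for illustration; the real system stores compactly encoded hit lists in barrels on disk and sorts them in place.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn {docID: {wordID: hits}} into {wordID: [(docID, hits), ...]} sorted by docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)

# Hypothetical forward index: hits are word positions within the document.
forward = {
    2: {10: [1, 7], 11: [3]},
    1: {10: [4]},
}
inverted = invert(forward)
print(inverted)
```

Keeping each doclist sorted by docID is what lets query evaluation intersect several doclists with a single forward scan.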

Google Architecture (from http://www.ics.uci.edu/~scott/google.htm)
- [Figure: architecture diagram; callouts:]
  - 2 kinds of barrels. Short barrels contain hit lists that include title or anchor hits. Long barrels hold all hit lists.
  - The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create a new lexicon used by the searcher. Lexicon stores ~14 million words.
  - New lexicon keyed by wordID, inverted doc index keyed by docID, and PageRanks are used to answer queries.

Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
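The core of this loop, scanning several doclists for documents that contain every query word and then keeping the k best, can be sketched as below. It is a simplified illustration: doclists are assumed to be plain Python lists of docIDs sorted in ascending order, the names `matching_docs` and `evaluate` and the `rank` callback are hypothetical, and the fallback from short barrels to full barrels is left out.

```python
import heapq

def matching_docs(doclists):
    """Yield docIDs that appear in every sorted doclist (the scan in step 4)."""
    if not doclists:
        return
    positions = [0] * len(doclists)
    while all(pos < len(dl) for pos, dl in zip(positions, doclists)):
        current = [dl[pos] for pos, dl in zip(positions, doclists)]
        target = max(current)
        if all(doc == target for doc in current):
            yield target
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the lists that are behind the largest current docID.
            positions = [pos + (doc < target) for pos, doc in zip(positions, current)]

def evaluate(query_words, index, rank, k):
    """Return the top-k matching docIDs, best first (steps 3 to 8, single barrel)."""
    doclists = [index.get(word, []) for word in query_words]
    return heapq.nlargest(k, matching_docs(doclists), key=rank)

# Hypothetical index and per-document scores.
index = {"web": [1, 3, 5, 8], "search": [2, 3, 8, 9]}
scores = {3: 0.2, 8: 0.7}
top = evaluate(["web", "search"], index, rank=scores.get, k=2)
print(top)
```

Because every list is only ever advanced, the scan touches each doclist entry at most once.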

Single Word Query Ranking
- Hitlist is retrieved for single word
- Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
- Each hit type is assigned its own weight
  - Type-weights make up vector of weights
- Number of hits of each type is counted to form count-weight vector
- Dot product of type-weight and count-weight vectors is used to compute IR score
- IR score is combined with PageRank to compute final rank
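The dot-product step can be illustrated with a toy calculation. Everything numeric here is an assumption made for the example: the type weights, the cap used to turn raw counts into count-weights, and the names `TYPE_WEIGHTS`, `count_weight`, and `ir_score` are all hypothetical, since the slides do not give the actual values. How the IR score is combined with PageRank is not specified, so that step is not shown.

```python
from collections import Counter

# Hypothetical type weights; the real values are not given in the slides.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "url": 3.0, "large_font": 2.0, "plain": 1.0}

def count_weight(count, cap=8):
    """Turn a raw hit count into a count-weight; here it grows linearly, then is capped."""
    return min(count, cap)

def ir_score(hit_types):
    """Dot product of the type-weight vector and the count-weight vector."""
    counts = Counter(hit_types)
    return sum(TYPE_WEIGHTS[t] * count_weight(counts[t]) for t in TYPE_WEIGHTS)

score = ir_score(["title", "plain", "plain", "anchor"])
capped = ir_score(["plain"] * 20)
print(score, capped)
```

Capping the count-weight means that repeating a word many times in plain text cannot outweigh a single hit in a more significant position such as the title.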

Multi-word Query Ranking
- Similar to single-word ranking, except now must analyze proximity of words in a document
- Hits occurring closer together are weighted higher than those farther apart
- Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close”
- Each type and proximity pair has a type-prox weight
- Counts converted into count-weights
- Take dot product of count-weights and type-prox weights to compute the IR score

Scalability
- Cluster architecture combined with Moore’s Law make for high scalability.
- At time of writing:
  - ~24 million documents indexed in one week
  - ~518 million hyperlinks indexed
  - Four crawlers collected 100 documents/sec

Key Optimization Techniques
- Each crawler maintains its own DNS lookup cache
- Use flex to generate lexical analyzer with own stack for parsing documents
- Parallelization of indexing phase
- In-memory lexicon
- Compression of repository
- Compact encoding of hit lists for space saving
- Indexer is optimized so it is just faster than the crawler, so that crawling is the bottleneck
- Document index is updated in bulk
- Critical data structures placed on local disk
- Overall architecture designed to avoid disk seeks wherever possible

Storage Requirements (from http://www.ics.uci.edu/~scott/google.htm)
- At the time of publication, Google had the following statistical breakdown for storage requirements:
- [Table: storage statistics; not preserved in this transcript]

Conclusions
- Search is far from perfect
  - Topic/Domain-specific PageRank
  - Machine translation in search
  - Non-hypertext search
- Business potential
  - Brin and Page worth around $15 billion each… at 32 years old!
  - If you have a better idea than how Google does search, please remember me when you’re hiring software engineers!

Possible Exam Questions
- Given a web/link graph, formulate a Naïve PageRank link matrix and do a few steps of power iteration.
  - Slides 14 – 16
- What are spider traps and dead ends, and how does Google deal with these?
  - Spider Trap: Slides 19 – 21
  - Dead End: Slides 25 – 27
- Explain the difference between single and multiple word search query evaluation.
  - Slides 35 – 36

References
- Brin, Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine.
- Brin, Page, Motwani, Winograd. The PageRank Citation Ranking: Bringing Order to the Web.
- http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf
- www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf
- http://www.ics.uci.edu/~scott/google.htm

Thank you!