
1 Sergey Brin, Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine
Rogier Brussee, ICI, 21 11 2005

2 Context
Written in 1997; Brin and Page were PhD students.
Indexed 24*10^6 pages (<< 10^100).
Describes their effort to create a web search engine open to academia.
AltaVista, Lycos and Yahoo ruled; the Internet bubble was still growing.

3 Disclaimer
What Google does in 2005 is not necessarily what Google did in 1997; we can only guess what still applies.
The principles sound right and have probably survived.
Lots of room for tweaking: a dark art.
The data structures, described down to the bit level, should have changed.
Scaled up tremendously: index > 10^10 pages?? (still << 10^100), and so did the hardware and OS.
The business model changed; "ads should not drive search results" is still the stated policy.

4 What does Google do?
Preprocess:
  Crawl
  Index words, anchors and links in docs
  Invert the index (i.e. sort)
  Value content (PageRank + "looks" weights)
At query time:
  Look up the query
  Rank results (PageRank + IR measure)
(A toy sketch of the index/invert step follows below.)
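A minimal toy sketch, in Python, of the index/invert step named above: build a forward index docID → wordIDs via a lexicon, then "sort"/invert it into wordID → docIDs. The function names, variable names and mini-corpus are illustrative assumptions, not the paper's actual structures.

```python
from collections import defaultdict

def build_indexes(docs):
    """Toy version of the index/invert step: docs maps docID -> text.
    Forward index: docID -> list of wordIDs; inverted index: wordID -> sorted docIDs."""
    lexicon = {}                       # word -> wordID
    forward = {}                       # forward "barrels": docID -> [wordID, ...]
    for doc_id, text in docs.items():
        word_ids = []
        for word in text.lower().split():
            word_ids.append(lexicon.setdefault(word, len(lexicon)))
        forward[doc_id] = word_ids
    inverted = defaultdict(set)        # the "sorting" step regroups hits by wordID
    for doc_id, word_ids in forward.items():
        for wid in word_ids:
            inverted[wid].add(doc_id)
    return lexicon, forward, {wid: sorted(ds) for wid, ds in inverted.items()}

# Hypothetical mini-corpus
lexicon, fwd, inv = build_indexes({1: "web search engine", 2: "search the web"})
print(inv[lexicon["search"]])          # docIDs containing "search" -> [1, 2]
```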

5 Google Architecture (in 1997)
URL server: hands out URLs to crawl.
Crawler: gets and parses content, and caches DNS.
Store server: compresses and formats pages.
Repository: big database with the compressed content.
Indexer: decompresses pages from the repository and creates a hitlist index: docID → wordID + metadata dressing (capitals, typeface), anchors (URLs + texts), and metadata about the docs (title, headers, size, content type).
Barrels, row 1: storage systems that store the docID → wordID index, divided up by wordID.
Barrels, row 2: storage systems that store the inverted index: wordID → docID.
Lexicon: wordID → word (+ metadata) and vice versa; relatively small (~200 MB).
Sorter: inverts the barrels.
Anchors: anchor text from links found by the indexer, in a DB.
Links: DB of what links to what.
URL resolver: finds and parses URLs, creates docIDs.
Doc index: DB of URL → docID and vice versa.
PageRank: computes the PageRank.
Searcher: searches the indices and ranks the results.
(A hedged sketch of a few of these data structures follows below.)
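A hedged sketch of a few of the data structures this slide names (a hit with its metadata dressing, the lexicon, the doc index). Field names, types and granularity are illustrative assumptions; the real layout is bit-packed and, per the Disclaimer slide, has surely changed.

```python
from dataclasses import dataclass, field

@dataclass
class Hit:                   # one word occurrence with its "metadata dressing"
    word_id: int
    position: int            # position of the word in the document
    capitalised: bool        # capitalisation of this occurrence
    relative_font_size: int  # typeface/size relative to the rest of the document
    in_anchor: bool          # True if the hit comes from anchor text pointing at the doc

@dataclass
class Lexicon:               # wordID <-> word, small enough to keep in memory
    word_to_id: dict = field(default_factory=dict)
    id_to_word: dict = field(default_factory=dict)

    def get_id(self, word: str) -> int:
        if word not in self.word_to_id:
            wid = len(self.word_to_id)
            self.word_to_id[word] = wid
            self.id_to_word[wid] = word
        return self.word_to_id[word]

@dataclass
class DocIndex:              # URL <-> docID, as in the slide's doc-index component
    url_to_id: dict = field(default_factory=dict)
    id_to_url: dict = field(default_factory=dict)
```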

6 Ranking I
Google ranks words differently depending on:
  Capitalisation
  Typeface (with respect to the document average)
  Whether the word is in the title, an anchor, ...
For phrases, the proximity of the words is also important.
This gives an IR score (the precise formula is not mentioned).
And then there is PageRank!
(A purely illustrative sketch of such a score follows below.)
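A purely illustrative sketch of how such an IR score might be assembled from these signals and then combined with PageRank (slides 4 and 7 say both contribute and no single factor dominates). The weights, the proximity bonus and the combination formula are assumptions; the slide itself says the precise formula is not given.

```python
# Purely illustrative weights: the slide states the precise formula is not mentioned.
HIT_WEIGHTS = {
    "plain": 1.0,
    "capitalised": 1.3,      # capitalisation
    "large_font": 2.0,       # typeface larger than the document average
    "title": 4.0,            # word appears in the title
    "anchor": 5.0,           # word appears in anchor text pointing at the doc
}

def ir_score(hit_kinds, proximity_bonus=0.0):
    """Sum per-hit weights for one query against one document and add a
    proximity bonus for phrases whose words occur close together (hypothetical)."""
    return sum(HIT_WEIGHTS[kind] for kind in hit_kinds) + proximity_bonus

def final_rank(ir, pagerank, alpha=0.5):
    """Hypothetical combination of the IR score with PageRank; the slides only say
    that both contribute and that no single factor dominates."""
    return alpha * ir + (1 - alpha) * pagerank

print(final_rank(ir_score(["title", "anchor", "plain"], proximity_bonus=1.5), pagerank=0.02))
```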

7 Ranking II
Together these determine the rank you see when googling.
No single factor is dominant.

8 PageRank
Named after Lawrence Page.
A measure of the collectively defined importance of a web page.
A probabilistic model of a user doing random surfing before Google gives a recommendation.
PageRank is the probability of finding the user at a given page, in this model, after an infinite number of clicks.
A quantitative version of the effect of information scent.
Really pioneered by ants!
"Go to the ant, thou sluggard; consider her ways, and be wise." (Proverbs 6:6)

9 Ant model for PageRank
[Diagram: n pages 1, 2, ..., k connected by links; labels d, (1-d) and (1-d)/n on the arrows.]
Chance d to follow a link; chance (1-d) to jump to a random page out of the n pages.
Put 1000 ants on every page and let the ants follow links according to the rules above. If we wait long enough we get a stationary distribution: the number of ants on a node divided by the total number of ants is the PageRank.
(A Monte Carlo sketch of this model follows below.)
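A small Monte Carlo sketch of the ant model, assuming a damping factor d = 0.85 and a hypothetical 4-page link graph (neither is given on the slide); 1000 ants per page as on the slide. Each ant follows a link with probability d and otherwise jumps to a random page; the fraction of ants per page approximates the PageRank.

```python
import random
from collections import Counter

def ant_pagerank(links, d=0.85, ants_per_page=1000, rounds=100, seed=0):
    """Monte Carlo ant model: follow a link with probability d, otherwise jump to a
    uniformly random page; count where the ants end up after many rounds."""
    rng = random.Random(seed)
    n = len(links)
    ants = [page for page in range(n) for _ in range(ants_per_page)]
    for _ in range(rounds):
        for a, page in enumerate(ants):
            if links[page] and rng.random() < d:
                ants[a] = rng.choice(links[page])   # follow one of the outgoing links
            else:
                ants[a] = rng.randrange(n)          # jump to a random page out of n
    counts = Counter(ants)
    total = len(ants)
    return [counts[p] / total for p in range(n)]

# Toy 4-page web (hypothetical): 0 -> 1,2   1 -> 2   2 -> 0   3 -> 2
print(ant_pagerank([[1, 2], [2], [0], [2]]))
```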

10 Mathematical Explanation
We have an initial ant distribution p = (p_1, ..., p_n) on the n pages, normalised so that sum_i p_i = 1 and p_i >= 0.
We have a Markov chain with transition probabilities
  t_ij = d/k_j + (1-d)/n   if one of the k_j links on page j points to page i,
  t_ij = (1-d)/n           otherwise.
This gives the transition matrix T = (t_ij), i,j = 1,...,n. Note that t_ij > 0 and sum_i t_ij = 1.
After one "round" the ant distribution is Tp = (sum_j t_ij p_j)_{i=1,...,n}. Note that (Tp)_i > 0 and sum_i (Tp)_i = 1.
After N rounds the distribution is T^N p. Define p^(0) = lim_{N -> infinity} T^N p (the limit exists).
Then T p^(0) = p^(0): a stationary distribution of the Markov chain. PageRank is this stationary distribution.
Existence of the fixed point (the Perron-Frobenius theorem) is a direct consequence of Brouwer's fixed point theorem: the simplex Delta = { x : sum_i x_i = 1, x_i >= 0 } is mapped to itself by T, and Delta is topologically a closed (n-1)-dimensional disk.
Connectedness of the Markov graph implies uniqueness: it suffices to see that the fixed point is isolated, because by linearity there would otherwise be a whole eigenspace of fixed points. However, on the subspace sum_i x_i = 0, which is an invariant complement to the fixed point p^(0), T is contracting in the L_1 norm.
(A power-iteration sketch of this computation follows below.)
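A power-iteration sketch of the computation above using numpy. The damping factor d = 0.85, the handling of pages without outgoing links, and the toy link graph are assumptions not stated on the slide.

```python
import numpy as np

def pagerank(links, d=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the Markov chain on the slide.
    links[j] is the list of pages that page j links to (0-based indices).
    Returns the stationary distribution p with T p = p."""
    n = len(links)
    # Build the transition matrix T = (t_ij): column j describes the ants on page j.
    T = np.full((n, n), (1.0 - d) / n)
    for j, out in enumerate(links):
        if out:                          # page j has k_j outgoing links
            for i in out:
                T[i, j] += d / len(out)
        else:                            # dangling page: jump uniformly (an assumption;
            T[:, j] += d / n             # the slide does not say how these are handled)
    p = np.full(n, 1.0 / n)              # start with the ants spread uniformly
    for _ in range(max_iter):
        p_next = T @ p                   # one "round" of ant moves
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Same toy 4-page web as in the ant-model sketch (hypothetical)
print(pagerank([[1, 2], [2], [0], [2]]))
```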

