
1 The Anatomy of a Large-Scale Search Engine. Sergey Brin, Lawrence Page. Dept. of Computer Science, Stanford University.

2 Outline: Introduction, Design Goals, System Features, System Anatomy, Results & Performance, Conclusion.

3 Introduction. Google is a large-scale search engine designed to crawl and index the web efficiently (a crawler downloads the web pages) and to give better search results; the name comes from "googol", 10^100. Engineering a search engine is a challenging task. There are two ways of finding information on the web. The first is high-quality, human-maintained lists (e.g., Yahoo!), which are too slow to improve and can't cover esoteric topics.

4 Introduction (cont.) Human-maintained lists are also expensive to build and maintain. The second way is keyword search engines, which return too many low-quality matches, and people try to mislead automated search engines. Challenges in creating a search engine: fast crawling technology, efficient use of storage space, and handling queries quickly.

5 Introduction (cont.) Scaling with the web: hardware performance keeps improving, with notable exceptions such as disk seek time and operating system robustness. Google's data structures are optimized for fast and efficient access. Google is a centralized search engine.

6 Design Goals. Improved search quality: junk results often wash out any results a user is interested in, and as the collection size grows we need tools with very high precision. Use of hypertextual information: quality filtering based on link structure and link text. Support novel research on large-scale web data.

7 System Features. PageRank: bringing order to the web. Most web search engines have largely ignored the link graph; Google builds maps containing 518 million hyperlinks. PageRank corresponds well with people's idea of importance (citation importance). Example: if pages B and C link to page A, then B and C are backlinks of A.

8 For this example: PR(A) = (1 - d) + d(PR(T1)/3 + PR(T2)/1 + PR(T3)/1 + PR(T4)/2), where each denominator is that page's outgoing-link count. Motivation: pages that are cited from many places are worth looking at, and pages that are cited from an important place are worth looking at.
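
A quick worked instance of this formula (the rank values are assumed for illustration, not taken from the slides): with d = 0.85, PR(T1) = 0.3, PR(T2) = PR(T3) = 0.2, and PR(T4) = 0.4, we get PR(A) = 0.15 + 0.85 * (0.3/3 + 0.2/1 + 0.2/1 + 0.4/2) = 0.15 + 0.85 * 0.7 = 0.745.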

9 System Features. PR(A) = (1 - d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)). Assume page A has pages T1 … Tn pointing to it; C(A) is the number of links going out of page A; the damping parameter d is set between 0 and 1 (usually 0.85). Random surfer model: the surfer is given a random URL and clicks randomly on links; at each page, the surfer follows a link with probability d, and with probability 1 - d gets bored and requests another random page.
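
Here is a minimal sketch of the fixed-point iteration the formula suggests, on a toy three-page graph (the graph, iteration count, and pure-Python style are illustrative assumptions, not the original implementation):

```python
# Minimal PageRank sketch following the slide's formula.
# links[p] lists the pages that p points to (toy graph, assumed data).
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # arbitrary initial rank for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming  # the slide's formula
        pr = new_pr
    return pr

# Slide 7's example: B and C are backlinks of A.
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
```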

10 System Features. Differences from traditional methods: links from pages are not counted equally, and counts are normalized by the number of links on a page. Anchor text: associate the text of a link with the page it points to. Advantages: anchor text often provides a more accurate description of a page; it can exist for documents that can't be indexed (images, non-text documents); and it can return pages that haven't even been crawled. A sketch follows.
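
A sketch of the anchor-text idea (the structures and URLs are illustrative stand-ins; the real system stored anchors in barrels):

```python
from collections import defaultdict

# Index each link's text under the page it points TO, not the page it is
# on, so the target is searchable even if never crawled or without text.
anchor_index = defaultdict(list)  # target URL -> anchor texts

def add_link(source_url, target_url, anchor_text):
    anchor_index[target_url].append(anchor_text)

add_link("http://a.example/", "http://b.example/logo.gif", "company logo")
print(anchor_index["http://b.example/logo.gif"])  # image found via anchors
```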

12 System Features. Other features: location information (use of proximity in search); visualization information (relative font size); and the full raw HTML of pages is available, so users can view a cached version of any page.

13 System Anatomy

14 System Anatomy. The system is designed to avoid disk seeks. Web pages are fetched, compressed, and stored in the repository. The indexer parses each document into hits (stored in barrels) and anchors.
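
A minimal sketch of the fetch-compress-store step (the paper compresses the repository with zlib; the record layout and helper names below are assumptions for illustration):

```python
import zlib

repository = []  # append-only list of (docID, URL, compressed page) records

def store_page(doc_id, url, html):
    # Compress each page before storing it, as the repository does.
    repository.append((doc_id, url, zlib.compress(html.encode("utf-8"))))

def fetch_page(doc_id):
    for stored_id, url, compressed in repository:
        if stored_id == doc_id:
            return url, zlib.decompress(compressed).decode("utf-8")
    return None

store_page(1, "http://example.com/", "<html>hello</html>")
print(fetch_page(1))
```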

15 Major Data Structures: Hit Lists, Forward Index, Inverted Index. Also covered: Crawling the Web, Indexing the Web, Life of a Query, The Ranking System.

21 Hit List. What is a hit list? A hit list is a list of occurrences of a particular word in a particular document, including position, font, and capitalization information. Hit lists are stored in both the forward and inverted indices and use a hand-optimized compact encoding (less space, less bit manipulation) of 2 bytes per hit:
Plain hit: cap: 1 bit | imp: 3 bits | position: 12 bits
Fancy hit: cap: 1 bit | imp: 3 bits | type: 4 bits | pos: 8 bits
Anchor hit: cap: 1 bit | imp: 3 bits | type: 4 bits | hash: 4 bits | pos: 4 bits
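
A sketch of packing a plain hit into the 2 bytes described above (the slides give only the field widths; the bit order chosen here is an assumption):

```python
# Plain hit: cap (1 bit) | imp (3 bits) | position (12 bits) = 16 bits.
def pack_plain_hit(cap, imp, position):
    assert 0 <= cap <= 1 and 0 <= imp <= 7 and 0 <= position <= 4095
    return (cap << 15) | (imp << 12) | position

def unpack_plain_hit(hit):
    return hit >> 15, (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(cap=1, imp=3, position=42)
print(hex(hit), unpack_plain_hit(hit))  # 0xb02a (1, 3, 42)
```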

23 Forward Index. Given a docID, it returns that document's wordIDs and hit lists. It is partially sorted and stored in forward barrels; each barrel holds a range of wordIDs, so duplicated docIDs can appear across barrels. Instead of storing the actual wordID, each wordID is stored as a relative difference from the minimum wordID of its barrel, so 24 bits suffice (2^24 ≈ 16 million wordIDs per barrel). Record layout:
docID | wordID: 24 | nhits: 8 | hit hit hit hit | wordID: 24 | nhits: 8 | hit hit hit hit | null wordID
docID | wordID: 24 | nhits: 8 | hit hit hit hit
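
A sketch of the relative-wordID trick (names are illustrative; on disk this is a packed 24-bit field, not a Python int):

```python
BARREL_RANGE = 1 << 24  # 2^24, about 16 million wordIDs per barrel

def to_relative(word_id, barrel_min):
    # Store the offset from the barrel's minimum wordID in 24 bits.
    offset = word_id - barrel_min
    assert 0 <= offset < BARREL_RANGE, "wordID outside this barrel's range"
    return offset

def from_relative(offset, barrel_min):
    return barrel_min + offset

barrel_min = 5 * BARREL_RANGE  # this barrel starts at wordID 83,886,080
print(to_relative(barrel_min + 1234, barrel_min))  # stored as 1234
```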

24 Inverted Index. Given a wordID, it returns docIDs. It is stored in the same barrels as the forward index and is sorted by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into. There are two sets of inverted barrels: one for hit lists that include title or anchor hits, and another for all hit lists. Record layout:
wordID | ndocs -> docID: 27 | nhits: 5 | hit hit hit hit
wordID | ndocs -> docID: 27 | nhits: 5 | hit hit hit hit
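
A sketch of the lexicon-to-barrel lookup (in-memory Python stand-ins for the on-disk structures; all the numbers are made up):

```python
lexicon = {"stanford": 0}  # word -> pointer into the inverted barrel
# Inverted barrel: per wordID, (ndocs, list of (docID, nhits, hits)).
inverted_barrel = [
    (2, [(27, 2, [0xB02A, 0xB02B]), (99, 1, [0xB001])]),
]

def doclist(word):
    pointer = lexicon.get(word)  # the lexicon holds a pointer per wordID
    if pointer is None:
        return []
    ndocs, postings = inverted_barrel[pointer]
    return postings[:ndocs]

print(doclist("stanford"))  # [(27, 2, [...]), (99, 1, [...])]
```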

25 Crawling the Web. Crawling uses a fast distributed crawling system. The URL server and crawlers are implemented in Python: one single URL server feeds 3 crawlers, each keeping about 300 connections open at the same time, for roughly 600 KB/sec of data. Each crawler maintains an internal cached DNS lookup, since fetching a page means looking up DNS, connecting to the host, sending the request, and receiving the response; asynchronous IO is used to manage all these events.
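
A sketch of the two ideas on this slide, a DNS cache plus many concurrent connections under asynchronous IO (asyncio stands in for the original custom event loop; the host list and limits are illustrative):

```python
import asyncio
import socket

dns_cache = {}  # internal cached DNS lookup, as on the slide

def resolve(host):
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]

async def crawl(host, sem):
    async with sem:  # cap the number of simultaneously open connections
        reader, writer = await asyncio.open_connection(resolve(host), 80)
        writer.write(f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode())
        await writer.drain()        # send the request
        page = await reader.read()  # receive the response until EOF
        writer.close()
        return host, len(page)

async def main(hosts):
    sem = asyncio.Semaphore(300)  # ~300 open connections per crawler
    return await asyncio.gather(*(crawl(h, sem) for h in hosts))

print(asyncio.run(main(["example.com"])))  # needs network access
```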

26 Indexing the Web. Parsing: the parser must handle errors such as HTML typos, non-ASCII characters, and HTML tags nested hundreds deep, so the authors developed their own parser. Indexing documents into barrels: words are turned into wordIDs using an in-memory hash table, the lexicon; new additions are logged to files.
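
A sketch of the in-memory lexicon with a log of new additions (the log path and ID scheme are assumptions):

```python
lexicon = {}  # in-memory hash table: word -> wordID

def word_to_id(word, log_path="new_words.log"):
    if word not in lexicon:
        lexicon[word] = len(lexicon)      # assign the next free wordID
        with open(log_path, "a") as log:  # new additions are logged to a file
            log.write(f"{lexicon[word]}\t{word}\n")
    return lexicon[word]

print([word_to_id(w) for w in "the anatomy of the engine".split()])
# -> [0, 1, 2, 0, 3]; repeated words reuse their wordID
```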

27 Indexing the Web (cont.) Parallelization: a shared lexicon of 14 million words, plus a log of all the extra words. Sorting: the sorter creates the inverted index, producing two types of barrels (one for titles and anchors, one for the full text). Sorters run in parallel, and the sorting is done in main memory.
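
A sketch of what the sorter does, inverting docID-ordered forward records into a wordID-ordered index (an in-memory stand-in for the on-disk bucket sort):

```python
from collections import defaultdict

forward_barrel = [  # (docID, wordID, hits) records, ordered by docID
    (1, 7, [1, 9]), (1, 42, [5]), (2, 42, [3]),
]

def invert(records):
    inverted = defaultdict(list)
    for doc_id, word_id, hits in records:
        inverted[word_id].append((doc_id, hits))
    return dict(sorted(inverted.items()))  # re-sorted by wordID

print(invert(forward_barrel))
# {7: [(1, [1, 9])], 42: [(1, [5]), (2, [3])]}
```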

28 Searching.
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all of the search terms.
5. Compute the rank of that document.
6. If we're at the end of the short barrels, start at the doclist of the full barrel for every word and go to step 4.
7. If we're not at the end of any doclist, go to step 4.
8. Sort the matched documents by rank and return the top K.
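
A sketch of this loop over toy in-memory doclists (the index layout and rank function are illustrative stand-ins, and the seek/scan over on-disk barrels is collapsed into set operations):

```python
def search(query, short_index, full_index, rank, k=10):
    words = query.lower().split()            # steps 1-2: parse the query
    if not words:
        return []
    for index in (short_index, full_index):  # steps 3 and 6: short, then full
        doclists = [set(index.get(w, ())) for w in words]
        matches = set.intersection(*doclists)  # step 4: docs with all terms
        if matches:                            # step 5: rank each match
            ranked = sorted(matches, key=lambda d: rank(d, words),
                            reverse=True)
            return ranked[:k]                  # step 8: top K by rank
    return []

short_index = {"bill": [1, 2], "clinton": [2]}
full_index = {"bill": [1, 2, 3], "clinton": [2, 3]}
print(search("Bill Clinton", short_index, full_index, lambda d, w: 1.0))
```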

30 The Ranking System. Google uses PageRank(TM) to determine the relative importance of each page it crawls on the web. Among the characteristics it evaluates are the text included in links to a site, the text on each page, and the PageRank of the sites linking to the site being evaluated. For a single-word search, Google checks the hit list for that word. In a multi-word search, hits occurring close together in a document are weighted higher than hits occurring far apart.
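
A sketch of proximity weighting for a two-word query (the 1/(1+distance) weighting is an assumed stand-in; the real system binned distances and combined this with many other signals):

```python
def proximity_score(positions_a, positions_b):
    # Hits for different words that occur close together score higher.
    score = 0.0
    for pa in positions_a:
        for pb in positions_b:
            score += 1.0 / (1 + abs(pa - pb))  # nearer pairs weigh more
    return score

# "bill" at positions 3 and 40, "clinton" at position 4:
# the adjacent pair (3, 4) dominates the score.
print(proximity_score([3, 40], [4]))
```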

32 Results. Example query: "Bill Clinton". Google returns results from whitehouse.gov, including the email address of the president. All the results are high-quality pages, there are no broken links, and there is no Bill without Clinton and vice versa.

33 Storage Requirements. Using compression on the repository, about 55 GB is needed for all the data used by the search engine, and most queries can be answered by just the short inverted index. With better compression, a high-quality search engine could fit onto a 7 GB drive of a new PC.

35 System Performance. It took 9 days to download 26 million pages, i.e., 48.5 pages per second. The indexer and crawler ran simultaneously; the indexer runs at 54 pages per second. The sorters ran in parallel on 4 machines, and the whole sorting process took 24 hours.

36 Conclusion. Google is a scalable search engine that delivers high-quality search results. Search techniques: PageRank, anchor text, proximity information. Search features: catalog, site search, cached links, similar pages, who links to you, file types. Speed comes from efficient algorithms and thousands of low-cost PCs networked together.

