The Anatomy of a Large-Scale Search Engine. Sergey Brin, Lawrence Page. Dept. of Computer Science, Stanford University.

Presentation transcript:

1 The Anatomy of a Large-Scale Search Engine
Sergey Brin, Lawrence Page
Dept. of Computer Science, Stanford University

2 Outline
– Introduction
– Design Goals
– System Features
– System Anatomy
– Results & Performance
– Conclusion

3 Introduction
– Google: a large-scale search engine
  – Designed to crawl and index the web efficiently (the crawler downloads web pages)
  – Aims to give better results than existing search engines
  – The name is a play on googol, 10^100
– Engineering a search engine is a challenging task
– Two ways of finding pages on the web
  – High-quality human-maintained lists (Yahoo!): slow to improve and unable to cover esoteric topics

4 Introduction (Cont.)
– Human-maintained lists are also expensive to build and maintain
– Automated search engines that match keywords
  – Return too many low-quality matches
  – People try to mislead automated search engines
– Challenges in creating a search engine
  – Fast crawling technology
  – Efficient use of storage space
  – Handling queries quickly

5 Introduction (Cont.)
– Scaling with the web
  – Hardware performance keeps improving, with notable exceptions: disk seek time and operating system robustness
  – Google's data structures are optimized for fast and efficient access
– Google is a centralized search engine

6 Design Goals
– Improved search quality
  – Junk results often wash out any results that a user is interested in
  – As the collection grows, we need tools with very high precision
– Use of hypertextual information
  – Link structure and link text support quality filtering
– Support novel research on large-scale web data

7 System Features
– PageRank: bringing order to the web
  – Most web search engines have largely ignored the link graph
  – Google built maps containing as many as 518 million hyperlinks
  – PageRank corresponds well with people's idea of importance (citation importance)
– Figure: pages B and C link to page A, so B and C are backlinks of A

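8 For this example: PR(A) = (1-d) + d (PR(T1)/3 + PR(T2) + PR(T3) + PR(T4)/2)
– Each PR(Ti) is divided by the number of links going out of Ti; the formula implies T1 has 3 outgoing links, T2 and T3 have 1 each, and T4 has 2
– Motivation
  – Pages that are cited from many places are worth looking at
  – Pages that are cited from an important place are worth looking at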

9 System Features
– PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  – T1 ... Tn are the pages that point to page A
  – C(A) is the number of links going out of page A
  – d is a damping factor between 0 and 1, usually set to 0.85
– Random surfer model
  – Starts at a random URL
  – Clicks randomly on links
  – After a while gets bored and starts again at a new random URL
  – At each page the surfer follows a link with probability d, and gets bored and requests another random page with probability 1-d

10 System Features
– Differences from traditional citation counting
  – Links from different pages are not counted equally
  – Each page's contribution is normalized by the number of links on that page
– Anchor text: the text of a link is associated with the page the link points to
  – Anchors often describe a page more accurately than the page itself
  – Anchors can exist for documents that cannot be indexed, such as images and other non-text documents
  – Pages that have not been crawled can still be returned
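
The anchor-text idea can be sketched as an index keyed by the words of a link that points at the link's target rather than its source. This is an illustrative Python sketch only: the function name `index_anchors`, the tuple format, and the example URLs are hypothetical, and real anchor handling would also record hit details.

```python
from collections import defaultdict


def index_anchors(links):
    """Map each anchor word to the pages that links with that word point to.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples,
    a hypothetical format chosen for this sketch.
    """
    anchor_index = defaultdict(list)
    for _source, target, text in links:
        for word in text.lower().split():
            if target not in anchor_index[word]:
                anchor_index[word].append(target)
    return anchor_index


links = [
    ("http://a.example/", "http://img.example/logo.png", "Company logo"),
    ("http://b.example/", "http://img.example/logo.png", "logo image"),
]
anchors = index_anchors(links)
```

The image at `http://img.example/logo.png` has no indexable text of its own, yet a query for "logo" can still return it because the words come from the pages that link to it.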

12 System Features
– Other features
  – Location information: proximity is used in search
  – Visual presentation information, such as relative font size
  – The full raw HTML of pages is available
    – Users can view a cached version of pages

13 System Anatomy

14 System Anatomy
– Designed to avoid disk seeks
– Web pages are fetched, compressed and stored in a repository
– The indexer parses the documents into hits (stored in barrels) and anchors

15 Major Data Structures
– Hit lists
– Forward index
– Inverted index
– Crawling the web
– Indexing the web
– Life of a query
– The ranking system

21 Hit List
– What is a hit list?
  – A list of occurrences of a particular word in a particular document, including position, font, and capitalization information
  – Stored in both the forward and the inverted indices
  – Uses a hand-optimized compact encoding, which needs less space than a simple encoding and less bit manipulation than Huffman coding
  – Each hit takes 2 bytes
– Hit layouts (field: width in bits)
  Plain:   Cap: 1 | Imp: 3 | Position: 12
  Fancy:   Cap: 1 | Imp: 3 | Type: 4 | Position: 8
  Anchor:  Cap: 1 | Imp: 3 | Type: 4 | Hash: 4 | Position: 4
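
To make the 2-byte encoding concrete, here is a minimal Python sketch that packs a plain hit into 16 bits using the field widths listed on this slide. The bit order (capitalization in the top bit, then importance, then position), the function names, and the clamping of large positions are assumptions of this sketch; the slide gives only the field widths.

```python
def pack_plain_hit(capitalized, importance, position):
    """Pack a plain hit into 16 bits: Cap (1) | Imp (3) | Position (12).

    The bit order is an assumption; the slide only lists the widths.
    """
    if not 0 <= importance <= 0b111:
        raise ValueError("importance must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 0xFFF)  # assumption: clamp positions beyond 12 bits
    return (int(capitalized) << 15) | (importance << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, importance, position) from a packed plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


packed = pack_plain_hit(True, 3, 100)
fields = unpack_plain_hit(packed)
```

Fancy and anchor hits could be packed the same way by splitting the low 12 bits into the narrower Type, Hash, and Position fields.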

23 Forward Index
– Given a docID, returns its wordIDs and hit lists
– Partially sorted and stored in forward barrels
– Each barrel holds a range of wordIDs
– A docID is duplicated in every barrel that holds one of its words
– Instead of the actual wordID, each wordID is stored as the difference from the minimum wordID of its barrel, so 24 bits are enough (2^24 is about 16 million values)
– Record layout (field: width in bits)
  docID | wordID: 24, nhits: 8, hit hit hit hit | wordID: 24, nhits: 8, hit hit hit hit | null wordID
  docID | wordID: 24, nhits: 8, hit hit hit hit
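
The relative wordID encoding can be sketched with Python's `struct` module. This is an illustration under stated assumptions, not the original storage format: the function names, the big-endian byte order, and raising an error for hit lists longer than 255 entries are choices made for the sketch. It packs the 24-bit relative wordID and the 8-bit hit count into one 32-bit header, followed by 2-byte hits.

```python
import struct


def encode_entry(word_id, barrel_min_word_id, hits):
    """Encode one forward-barrel entry: wordID (24 bits), nhits (8 bits), hits."""
    relative_id = word_id - barrel_min_word_id
    if not 0 <= relative_id < 1 << 24:
        raise ValueError("wordID does not fall into this barrel's range")
    if len(hits) >= 1 << 8:
        raise ValueError("hit list too long for an 8-bit count (sketch limitation)")
    header = (relative_id << 8) | len(hits)
    return struct.pack(f">I{len(hits)}H", header, *hits)


def decode_entry(data, barrel_min_word_id):
    """Decode an entry produced by encode_entry back into (wordID, hits)."""
    (header,) = struct.unpack_from(">I", data, 0)
    nhits = header & 0xFF
    hits = list(struct.unpack_from(f">{nhits}H", data, 4))
    return barrel_min_word_id + (header >> 8), hits


entry = encode_entry(1_000_123, 1_000_000, [45156, 7])
decoded = decode_entry(entry, 1_000_000)
```

Storing the difference from the barrel's minimum wordID is what lets a wordID far above 2^24 still fit into 24 bits, as long as each barrel covers a range of at most 2^24 wordIDs.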

24 Inverted Index
– Given a wordID, returns the docIDs that contain it
– Consists of the same barrels as the forward index, after they have been processed by the sorter
– For every valid wordID, the lexicon contains a pointer into the barrel that the wordID falls into
– Two sets of inverted barrels: one for hit lists that include title or anchor hits, another for all hit lists
– Layout (field: width in bits)
  Lexicon:          wordID, ndocs | wordID, ndocs
  Inverted barrels: docID: 27, nhits: 5, hit hit hit hit | docID: 27, nhits: 5, hit hit hit hit
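
The relationship between the two indices can be sketched as a regrouping of the same records by wordID instead of docID. This Python sketch keeps everything in dictionaries, which is an assumption made for readability; the function name `invert` and the nested-dictionary input format are hypothetical, and the real system sorts barrels on disk.

```python
from collections import defaultdict


def invert(forward_index):
    """Turn {docID: {wordID: hits}} into {wordID: [(docID, hits), ...]}.

    Each doclist comes out sorted by docID because documents are visited
    in docID order.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


forward = {
    2: {10: [4], 11: [9]},
    1: {10: [1, 7]},
}
inverted = invert(forward)
# A lexicon-like summary: how many documents contain each word (ndocs).
ndocs = {word_id: len(doclist) for word_id, doclist in inverted.items()}
```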

25 Crawling the Web
– Fast distributed crawling system
– The URL server and the crawlers are implemented in Python
– A single URL server feeds about 3 crawlers
– Each crawler keeps about 300 connections open at the same time; at peak the system downloads about 600 KB of data per second
– Each crawler keeps its own DNS cache to avoid repeated DNS lookups
– Each connection is in one of four states: looking up DNS, connecting to host, sending request, receiving response
– Asynchronous IO is used to manage events
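
The crawler design on this slide (many open connections, a DNS cache, asynchronous IO) can be sketched with Python's `asyncio`. This is a modern illustration, not the original crawler: the function name `crawl`, the injected `resolve` and `fetch` coroutines, and the stand-in functions used in the demonstration are all assumptions, and no real network access takes place.

```python
import asyncio
from urllib.parse import urlsplit


async def crawl(urls, resolve, fetch, max_connections=300):
    """Fetch all URLs with at most `max_connections` in flight at once.

    `resolve(host)` and `fetch(address, url)` are caller-supplied coroutines
    (hypothetical interfaces for this sketch).
    """
    dns_cache = {}  # host -> task that resolves the host exactly once
    limit = asyncio.Semaphore(max_connections)
    pages = {}

    async def worker(url):
        host = urlsplit(url).hostname
        async with limit:
            if host not in dns_cache:
                dns_cache[host] = asyncio.ensure_future(resolve(host))
            address = await dns_cache[host]  # looking up DNS (cached)
            pages[url] = await fetch(address, url)  # connect, send, receive

    await asyncio.gather(*(worker(url) for url in urls))
    return pages


dns_lookups = []


async def fake_resolve(host):
    """Stand-in for a DNS lookup; records which hosts were resolved."""
    dns_lookups.append(host)
    await asyncio.sleep(0)
    return "192.0.2.1"


async def fake_fetch(address, url):
    """Stand-in for an HTTP request; returns a dummy page."""
    await asyncio.sleep(0)
    return f"<html>{url} via {address}</html>"


urls = ["http://example.com/a", "http://example.com/b", "http://example.org/"]
pages = asyncio.run(crawl(urls, fake_resolve, fake_fetch, max_connections=2))
```

Caching the lookup task rather than its result means two simultaneous requests to the same host share one DNS lookup instead of racing to start two.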

26 Indexing the Web
– Parsing
  – The parser must know how to handle errors
    – HTML typos
    – Non-ASCII characters
    – HTML tags nested hundreds deep
  – The authors developed their own parser
– Indexing documents into barrels
  – Words are turned into wordIDs
  – An in-memory hash table, the lexicon, holds the mapping
  – New additions to the lexicon are logged to a file
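
The lexicon step can be sketched as a hash table that hands out wordIDs and records every word it has not seen before. This is a minimal Python sketch under assumptions: the class name `Lexicon` and its methods are hypothetical, and an in-memory list stands in for the log file mentioned on the slide.

```python
class Lexicon:
    """In-memory word -> wordID table that logs newly added words."""

    def __init__(self, base_words):
        self.word_ids = {}
        for word in base_words:
            self.word_ids.setdefault(word, len(self.word_ids))
        self.new_word_log = []  # assumption: a list stands in for the log file

    def word_id(self, word):
        """Return the wordID for `word`, assigning and logging a new one if needed."""
        wid = self.word_ids.get(word)
        if wid is None:
            wid = len(self.word_ids)
            self.word_ids[word] = wid
            self.new_word_log.append(word)
        return wid


lexicon = Lexicon(["search", "engine"])
ids = [lexicon.word_id(w) for w in ["search", "google", "engine", "google"]]
```

Keeping the new words in a separate log is what allows several indexers to share one fixed base lexicon and reconcile the extra words afterwards.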

27 Indexing the Web
– Parallelization
  – Indexers share a base lexicon fixed at 14 million words and write a log of all the extra words
– Sorting
  – Creates the inverted index
  – Produces two types of barrels: one for titles and anchors, one for full text
  – Sorters run in parallel
  – The sorting is done in main memory

28 Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all of the search terms.
5. Compute the rank of that document.
6. If we are at the end of the short barrels, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the matched documents by rank and return the top k.
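
The query steps above can be sketched in a few lines of Python. This is a simplification under stated assumptions: set intersection replaces the synchronized scan through the doclists, both barrel sets are always consulted, and the names `matching_docs`, `search`, and the dictionary formats are hypothetical.

```python
import heapq


def matching_docs(word_ids, doclists):
    """Return the docIDs that appear in the doclist of every query word."""
    lists = [doclists.get(word_id, []) for word_id in word_ids]
    if not lists:
        return set()
    result = set(lists[0])
    for doclist in lists[1:]:
        result &= set(doclist)
    return result


def search(word_ids, short_barrels, full_barrels, rank, k=10):
    """Match in the short barrels, then the full barrels, and return the top k."""
    matches = matching_docs(word_ids, short_barrels)
    matches |= matching_docs(word_ids, full_barrels)
    return heapq.nlargest(k, matches, key=rank)


short_barrels = {1: [10, 20], 2: [20]}          # title and anchor hits only
full_barrels = {1: [10, 20, 30], 2: [20, 30, 40]}
scores = {10: 0.1, 20: 0.5, 30: 0.9, 40: 0.2}   # made-up rank values
top = search([1, 2], short_barrels, full_barrels, scores.get, k=10)
```

Document 30 matches only in the full barrels but still ranks first here because its made-up score is highest; documents 10 and 40 are dropped because each contains only one of the two query words.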

30 The Ranking System
– PageRank determines the relative importance of each page Google crawls on the web
– Ranking takes into account the text included in the links to a page, the text on each page, and the PageRank of the pages linking to the page being evaluated
– Single-word search: check the document's hit list for that word
– Multi-word search: hits occurring close together in a document are weighted higher than hits occurring far apart
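
The proximity idea for multi-word queries can be illustrated with a small Python sketch. The scoring rule used here (the reciprocal of the smallest gap between the two words) is an assumption made for illustration, not the weighting Google used, and the function names are hypothetical.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any position of word A and any position of word B.

    Both lists must be sorted; a two-pointer scan avoids comparing all pairs.
    """
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_weight(positions_a, positions_b):
    """Higher weight for closer hits; 0.0 if either word is missing."""
    gap = min_distance(positions_a, positions_b)
    if gap == float("inf"):
        return 0.0
    return 1.0 / max(gap, 1)


near = proximity_weight([1, 50], [3, 100])  # closest hits are 2 positions apart
far = proximity_weight([1], [101])          # closest hits are 100 positions apart
```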

32 Results
– Example query: "Bill Clinton"
  – Results come from the whitehouse.gov domain, including an address of the president
  – All the results are high-quality pages
  – No broken links
  – No results about a Bill other than Clinton or a Clinton other than Bill

33 Storage Requirements
– The repository is compressed
– About 55 GB for all the data used by the search engine
– Most queries can be answered using just the short inverted index
– With better compression, a high-quality search engine could fit onto the 7 GB drive of a new PC

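35 System Performance
– Downloading 26 million pages took about 9 days
– Once running smoothly, the crawler downloaded the last 11 million pages in 63 hours, about 48.5 pages per second
– The indexer and the crawler ran simultaneously
– The indexer runs at about 54 pages per second
– The sorters run in parallel; using 4 machines, the whole sorting process takes about 24 hours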
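
A quick arithmetic check of the crawl figures, as a small Python sketch (the helper name `pages_per_second` is hypothetical). Averaged over the whole 9 days, 26 million pages come to roughly 33 pages per second; the 48.5 pages per second figure matches the rate the original paper reports for the last 11 million pages, which were downloaded in 63 hours.

```python
def pages_per_second(pages, hours):
    """Average download rate for `pages` pages fetched over `hours` hours."""
    return pages / (hours * 3600)


overall_rate = pages_per_second(26_000_000, 9 * 24)  # whole crawl, 9 days
late_rate = pages_per_second(11_000_000, 63)         # last 11 million pages
```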

36 Conclusion
– Scalable search engine
– High-quality search results
– Search techniques
  – PageRank
  – Anchor text
  – Proximity information
– Search features
  – Catalog, site search, cached links, similar pages, who links to you, file types
– Speed: efficient algorithms and thousands of low-cost PCs networked together