The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages 107-117, April.


The Anatomy of a Large-Scale Hypertextual Web Search Engine
S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages 107-117, April
Young Geun Han

Contents
- System Anatomy
  - Crawling the Web
  - Indexing the Web
  - Searching
- Results and Performance
  - Storage Requirements
  - System Performance
  - Search Performance
- Conclusions

Crawling the Web (1)
- Crawler
  - The most fragile application
  - Involves interacting with many web servers and name servers
- Running a web crawler
  - Raises tricky performance and reliability issues as well as social issues

Crawling the Web (2)
- Tricky performance
  - Google has a fast distributed crawling system
  - Each crawler keeps roughly 300 connections open at once
  - At peak speeds, Google can crawl over 100 web pages per second using four crawlers (roughly 600K per second of data)
  - Each crawler maintains its own DNS cache
  - The crawler uses asynchronous IO and a number of queues
  - Each connection is in one of four states: looking up DNS, connecting to host, sending request, receiving response
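The crawler design on this slide can be sketched with Python's asyncio. This is a minimal sketch and not the original implementation: the `Crawler` class, the `crawl` helper, and the simulated network calls (`asyncio.sleep(0)`) are assumptions made for illustration. It shows a cap on open connections, a DNS cache private to one crawler, and the four connection states.

```python
import asyncio
from enum import Enum
from urllib.parse import urlsplit


class State(Enum):
    LOOKING_UP_DNS = 1
    CONNECTING_TO_HOST = 2
    SENDING_REQUEST = 3
    RECEIVING_RESPONSE = 4


class Crawler:
    """One crawler: a bounded number of connections and a private DNS cache."""

    def __init__(self, max_connections=300):
        self.slots = asyncio.Semaphore(max_connections)
        self.dns_cache = {}    # host -> in-flight or finished lookup
        self.dns_lookups = 0   # number of real (uncached) lookups
        self.trace = []        # (url, State) transitions, for inspection

    async def _lookup(self, host):
        self.dns_lookups += 1
        await asyncio.sleep(0)          # stand-in for a real DNS round trip
        return "addr-of-" + host        # hypothetical address format

    async def resolve(self, host):
        if host not in self.dns_cache:
            # Cache the in-flight lookup so concurrent fetches share it.
            self.dns_cache[host] = asyncio.ensure_future(self._lookup(host))
        return await self.dns_cache[host]

    async def fetch(self, url):
        host = urlsplit(url).hostname
        async with self.slots:          # at most max_connections at once
            for state in State:
                self.trace.append((url, state))
                if state is State.LOOKING_UP_DNS:
                    await self.resolve(host)
                else:
                    await asyncio.sleep(0)   # stand-in for socket IO
        return url


async def crawl(urls, max_connections=300):
    crawler = Crawler(max_connections)
    pages = await asyncio.gather(*(crawler.fetch(u) for u in urls))
    return crawler, pages
```

Calling `asyncio.run(crawl(urls))` drives all fetches on one event loop. The slide's queues are replaced here by the event loop's own scheduling, which is a simplification.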

Crawling the Web (3)
- Social issues
  - Many people do not know what a crawler is
  - Running a crawler generates a fair amount of email and phone calls
  - Some site owners assume that a crawler visit means we like their web site very much
  - Some people do not know about the robots exclusion protocol

Crawling the Web (4)
- Reliability issues
  - Because of the huge amount of data involved, unexpected things will happen
  - Some easy-to-fix problems did not come up until we had downloaded tens of millions of pages
  - It is impossible to test a crawler without running it on a large part of the Internet
  - Crawlers need to be designed to be very robust and carefully tested
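The robots exclusion protocol mentioned in the crawling slides lets a site tell crawlers which paths to avoid. The sketch below uses Python's standard `urllib.robotparser` to check URLs against a robots.txt body. The helper `make_robots_checker`, the user agent name `example-crawler`, and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="example-crawler"):
    """Return a function that says whether this crawler may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules without any network IO
    return lambda url: parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that keeps every crawler out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robots_checker(ROBOTS_TXT)
```

Here `allowed("http://site.example/index.html")` is true, while any URL under `/private/` is refused.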

Indexing the Web (1)
- Parsing
  - Any parser must handle a huge array of possible errors
  - Google uses flex to generate a lexical analyzer for maximum speed
[Figure: architecture diagram showing URL Server, Store Server, Crawler, Repository, Indexer, Barrels, Sorter]
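To illustrate what error-tolerant parsing means in practice, here is a small sketch that extracts words from malformed HTML. It swaps in Python's standard `html.parser` for the flex-generated lexer named on the slide, so it shows the idea only and not the original technique. `WordExtractor` and `extract_words` are hypothetical names.

```python
from html.parser import HTMLParser


class WordExtractor(HTMLParser):
    """Collect the words of a page, ignoring tags, even badly nested ones."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.split())


def extract_words(html):
    parser = WordExtractor()
    parser.feed(html)
    parser.close()      # flush any text still buffered at end of input
    return parser.words
```

Input such as `"<html><b>Large <i>scale</b> search<p>engine"` has mismatched and unclosed tags, yet the words still come out in document order.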

Indexing the Web (2)
- Indexing Documents into Barrels
  - After each document is parsed, it is encoded into a number of barrels
  - Every word is converted into a wordID by using an in-memory hash table, the lexicon
  - New additions to the lexicon hash table are logged to a file
  - The words in the current document are translated into hit lists, which are written into the forward barrels
  - For parallelization, each indexer writes a log of its new words to a file instead of sharing the lexicon
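The indexing step above can be sketched as follows. This is a simplified illustration under stated assumptions: hits record only word positions, wordIDs are assigned in order of first appearance, and a wordID is routed to a barrel with a modulo, which is a stand-in and not the scheme the slides describe. `Lexicon` and `index_document` are hypothetical names.

```python
from collections import defaultdict


class Lexicon:
    """In-memory hash table from word to wordID, with a log of new words."""

    def __init__(self):
        self.word_ids = {}
        self.new_word_log = []   # stand-in for the log file of additions

    def word_id(self, word):
        if word not in self.word_ids:
            self.word_ids[word] = len(self.word_ids)
            self.new_word_log.append(word)
        return self.word_ids[word]


def index_document(doc_id, text, lexicon, forward_barrels):
    """Translate one parsed document into hit lists and store them."""
    hit_lists = defaultdict(list)            # wordID -> positions in this doc
    for position, word in enumerate(text.lower().split()):
        hit_lists[lexicon.word_id(word)].append(position)
    for word_id, hits in hit_lists.items():
        # Assumed routing rule: pick the barrel by wordID modulo barrel count.
        barrel = forward_barrels[word_id % len(forward_barrels)]
        barrel.append((doc_id, word_id, hits))
```

Each forward barrel ends up holding `(docID, wordID, hits)` records in document order, which is the input the sorting phase expects.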

Indexing the Web (3)
- Sorting
  - The sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel
  - The sorting phase is parallelized
  - Because the barrels do not fit into main memory, the sorter subdivides them into baskets that do
  - It sorts each basket and writes its contents into the inverted barrel
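A minimal sketch of the sorting phase, assuming a forward barrel is a list of `(docID, wordID, hits)` records and a basket is a fixed number of records. The final merge of sorted baskets with `heapq.merge` is an assumption about how basket contents are combined. The slides only say that each basket is sorted and written out.

```python
import heapq


def by_word_then_doc(record):
    doc_id, word_id, _hits = record
    return (word_id, doc_id)


def sort_barrel(forward_barrel, basket_size):
    """Turn a forward barrel into an inverted barrel: wordID -> doclist."""
    # Cut the barrel into baskets small enough to sort in memory.
    baskets = [
        sorted(forward_barrel[i:i + basket_size], key=by_word_then_doc)
        for i in range(0, len(forward_barrel), basket_size)
    ]
    inverted = {}
    # Merge the sorted baskets so records arrive in wordID, then docID, order.
    for doc_id, word_id, hits in heapq.merge(*baskets, key=by_word_then_doc):
        inverted.setdefault(word_id, []).append((doc_id, hits))
    return inverted
```

Because baskets are independent until the merge, several sorters could each handle different baskets or barrels, which matches the parallelization bullet.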

Searching (1)
1. Parse the query
2. Convert words into wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all the search terms
5. Compute the rank of that document for the query
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4
7. If we are not at the end of any doclist go to step 4
8. Sort the documents that have matched by rank and return the top k
Figure 4. Google Query Evaluation
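The eight steps above can be approximated in a few lines. This is a simplified sketch, not the original algorithm: doclists are plain lists of docIDs, the scan in steps 4 and 7 is replaced by a set intersection, and the ranking function is passed in by the caller. The names `matching_docs` and `evaluate_query` are hypothetical.

```python
def matching_docs(word_ids, barrel):
    """Documents that appear in the doclist of every query word."""
    doclists = [set(barrel.get(word_id, ())) for word_id in word_ids]
    return set.intersection(*doclists) if doclists else set()


def evaluate_query(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                        # step 1
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]               # step 2
    scores = {}
    for barrel in (short_barrel, full_barrel):           # steps 3 and 6
        for doc_id in matching_docs(word_ids, barrel):   # steps 4 and 7
            if doc_id not in scores:
                scores[doc_id] = rank(doc_id, word_ids)  # step 5
    # Step 8: best rank first, ties broken by docID for a stable order.
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ranked[:k]
```

The short barrels are consulted first, and the full barrels then supply any matching documents not already found.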

Searching (2)
- The Ranking System
  - Every hit list includes position, font and capitalization information
  - Hits from anchor text and the PageRank of the document are factored in
  - The ranking function is designed so that no particular factor can have too much influence
- For a single word search
  - To rank a document, Google looks at that document's hit list for the word and computes an IR score, which is combined with PageRank
- For a multi-word search
  - Hits occurring close together in a document are weighted higher than hits occurring far apart
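A toy version of such a ranking function is sketched below. Every number in it is an assumption made for illustration: the type weights, the count cap, the proximity bonus, and the PageRank weight are all hypothetical values. The sketch shows three ideas from the slide: capping counts so that no factor dominates, combining an IR score with PageRank, and rewarding hits that occur close together.

```python
from collections import Counter

TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "plain": 1.0}  # hypothetical
COUNT_CAP = 4             # hypothetical: extra hits beyond this add nothing
PAGERANK_WEIGHT = 10.0    # hypothetical


def ir_score(hits):
    """hits: list of (position, hit_type) for one word in one document."""
    counts = Counter(hit_type for _position, hit_type in hits)
    return sum(TYPE_WEIGHTS[t] * min(n, COUNT_CAP) for t, n in counts.items())


def proximity_bonus(hits_a, hits_b):
    """Larger when the closest pair of hits for two words is nearer."""
    gap = min(abs(pa - pb) for pa, _ in hits_a for pb, _ in hits_b)
    return 1.0 / max(gap, 1)


def rank(hit_lists, pagerank):
    """hit_lists: one non-empty hit list per query word, in query order."""
    score = sum(ir_score(hits) for hits in hit_lists)
    for hits_a, hits_b in zip(hit_lists, hit_lists[1:]):
        score += proximity_bonus(hits_a, hits_b)
    return score + PAGERANK_WEIGHT * pagerank
```

With a single hit list the proximity loop does nothing, so the same function covers the single-word case.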

Searching (3)
- Feedback
  - Google has a user feedback mechanism because figuring out the right values for many parameters is very difficult
  - When the ranking function is modified, this mechanism gives developers some idea of how the change affects the search results

Results and Performance (1)

Results and Performance (2)
- Google's results for a search
  - A number of results are from the whitehouse.gov domain
  - Most major commercial search engines do not return any results from whitehouse.gov
  - One result has no title because the page was not crawled
  - Instead, Google relied on anchor text to determine this was a good answer to the query
  - There are no results about a Bill other than Clinton or about a Clinton other than Bill

Results and Performance (3)
- Storage Requirements
Table 1. Statistics

Results and Performance (4)
- System Performance
  - In total it took roughly 9 days to download the 26 million pages (including errors)
  - The last 11 million pages were downloaded in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second
  - The indexer ran just faster than the crawlers
  - The indexer runs at roughly 54 pages per second
  - Using four machines, the whole process of sorting takes about 24 hours
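The throughput figures on this slide can be checked with a little arithmetic. The sketch below only re-derives rates from the page counts and durations quoted above. The helper name `pages_per_second` is hypothetical.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def pages_per_second(pages, hours):
    return pages / (hours * SECONDS_PER_HOUR)


# Last 11 million pages in 63 hours.
late_rate = pages_per_second(11_000_000, 63)
late_per_day = late_rate * SECONDS_PER_HOUR * HOURS_PER_DAY

# Whole crawl: 26 million pages in roughly 9 days.
overall_rate = pages_per_second(26_000_000, 9 * HOURS_PER_DAY)

INDEXER_RATE = 54   # pages per second, as quoted on the slide
```

`late_rate` comes out at about 48.5 pages per second and `late_per_day` at about 4.19 million pages, matching the slide. The overall rate is lower, around 33 pages per second, and the indexer's 54 pages per second exceeds both, which is consistent with the indexer keeping ahead of the crawlers.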

Results and Performance (5)
- Search Performance
  - Google answers most queries in between 1 and 10 seconds
  - The search time is mostly dominated by disk IO over NFS
Table 2. Search Times

Conclusions (1)
- Google
  - A scalable search engine
  - Uses PageRank, anchor text, and proximity information
  - A complete architecture for gathering web pages, indexing them, and performing search queries over them

Conclusions (2)
- Future Work
  - Improve search efficiency and scale to approximately 100 million web pages
  - Smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled
- High Quality Search
  - Google makes heavy use of hypertextual information consisting of link structure and link text
  - Google also uses proximity and font information
  - The analysis of link structure and PageRank allows Google to evaluate the quality of web pages

Conclusions (3)
- Scalable Architecture
  - Google is efficient in both space and time
  - Google's major data structures make efficient use of available storage space
  - The crawling, indexing, and sorting operations are efficient in time
  - Google overcomes a number of bottlenecks
- A Research Tool
  - Not only a high quality search engine but a research tool
  - A necessary research tool for a wide range of applications