The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page.


The Original Google Paper
• “Google” is a common spelling of googol, or 10^100, which fits well with the authors’ goal of building very large-scale search engines.

Outline
• Design goals
• System features
• System anatomy
• Results and performance
• Paper analysis

Design Goals

1. Scale with the rapid growth of the web.

Design Goals
2. Improve search quality.
• The number of documents on the web is increasing rapidly, but users’ ability to look at them is not keeping pace.
• Search engines of the time (1998) returned many “junk” results with little relevance.
3. Support academic search engine research.
• Push more development and understanding into the academic realm.
• Build a system that a reasonable number of people can actually use.
• Build an architecture that supports novel research on large-scale web data.

System Features

1. Uses the link structure of the Web to calculate a quality ranking for each page, called its PageRank.
• PageRank is a probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page.
• Each link counts as a vote, and the importance of the page casting the vote matters: votes from important pages carry more weight and so give the linked page greater value.

PageRank: Bringing Order to the Web
• PR(A): PageRank of webpage A
• PR(T_i): PageRank of a webpage T_i pointing to A
• C(T_i): number of outbound links on webpage T_i
• L(A): set of webpages linking to A
• d: damping factor, a value between 0 and 1; the probability that a random surfer keeps following links, so 1 - d is the probability that the surfer stops clicking and jumps to a random page
• Note that PageRanks form a probability distribution over webpages, so the PageRanks of all webpages sum to 1.
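The formula these symbols belong to is not preserved in the transcript. One form consistent with the definitions above, and with the note that PageRanks sum to 1, is the normalized one sketched below. N, the total number of webpages, is an assumption here, since it is not defined on the slide. The original paper writes the first term as (1 - d) without dividing by N, in which case the values add up to the number of pages rather than to 1.

```latex
PR(A) = \frac{1 - d}{N} + d \sum_{T_i \in L(A)} \frac{PR(T_i)}{C(T_i)}
```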

PageRank: Bringing Order to the Web
• Assume a universe of 4 webpages: A, B, C, and D.
• Because a random surfer will eventually stop clicking, a damping factor d is applied; it is generally set to 0.85.
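The four-page universe can be worked through with a few lines of code. This is a minimal sketch, not the authors' implementation: the link structure between A, B, C, and D is hypothetical (the slide's diagram is not preserved), the normalized form with (1 - d) / N is assumed so that the ranks sum to 1, and a page with no outbound links is assumed to spread its rank evenly over all pages.

```python
def pagerank(links, d=0.85, iterations=100):
    """Power iteration for normalized PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts with the "random jump" share (1 - d) / N.
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page, outlinks in links.items():
            # Assumption: a page without outbound links votes for every page.
            targets = outlinks if outlinks else pages
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += d * share
        rank = new_rank
    return rank


# Hypothetical link structure for the four-page universe.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

With these links, A collects votes from every other page and ends up ranked first, while D, which nothing links to directly, ends up last.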

System Features
2. Uses the anchor text of links on webpages.
• E.g. Yahoo!
• The text of a link is associated not only with the webpage it appears on; it also describes, sometimes more accurately, the webpage it points to.
• Anchors may exist for documents that text-based search engines generally cannot index, such as images, programs, and databases.
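A small sketch of the anchor-text idea: link text found on one page is filed under the URL it points to, so even a non-text target such as an image becomes searchable. The class and function names and the example URLs are hypothetical; only Python's standard `html.parser` module is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """Maps each link target to the anchor texts that point at it."""
    anchor_index = defaultdict(list)
    for html in pages.values():
        collector = AnchorCollector()
        collector.feed(html)
        for target_url, text in collector.anchors:
            anchor_index[target_url].append(text)
    return dict(anchor_index)


# Hypothetical page linking to an image that has no text of its own.
pages = {
    "http://example.com/links":
        '<p>Try <a href="http://example.com/logo.png">company logo</a>.</p>',
}
anchor_index = index_anchor_text(pages)
print(anchor_index)
```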

System Features
3. Stores location information for all hits and therefore makes extensive use of proximity in search.
4. Keeps track of the visual presentation of text on webpages, such as font size. Words in a larger or bolder font are given more weight.
5. Stores the complete raw HTML of webpages in a repository.

System Anatomy

Major Data Structures
1. BigFiles
• Virtual files spanning multiple file systems and addressable by 64-bit integers.
2. Repository
• Contains the full compressed HTML of every page.
• Documents are stored one after another, each prefixed with its docID, length, and URL.
• Compressed with zlib, chosen for its speed, rather than bzip, which offers a higher compression ratio.
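The repository layout can be illustrated with a short sketch: each record is a small header followed by the URL and the zlib-compressed HTML, and records are simply concatenated. The header field widths and their order are assumptions made for the example, not the layout used by Google.

```python
import struct
import zlib

# Assumed header: 64-bit docID, 32-bit URL length, 32-bit compressed length.
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Builds one repository record: header, URL, compressed page."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(html.encode("utf-8"))
    header = HEADER.pack(doc_id, len(url_bytes), len(compressed))
    return header + url_bytes + compressed


def read_records(blob):
    """Walks the records stored one after another in blob."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, page_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + page_len]).decode("utf-8")
        offset += page_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://example.com/", "<html>one</html>")
    + pack_record(2, "http://example.com/b", "<html>two</html>")
)
records = list(read_records(repository))
print(records)
```

Because every record carries its own lengths, the repository can be rebuilt or rescanned sequentially without any other data structure.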

Major Data Structures
3. Document Index
• Keeps information about each document.
• A fixed-width index, ordered by docID.
• Stores the document status, a pointer into the repository, and a checksum.
• If the document has been indexed, the entry points to a variable-width file, docinfo, which contains its URL and title. Otherwise it points to the URLlist, which contains only the URL.
4. Lexicon
• Contains a list of null-separated words (about 14 million) and a hash table of pointers.

Major Data Structures
5. Hit Lists
• A list of the occurrences of a particular word in a particular document, including position, font, and capitalization information.
• Hit lists account for most of the space used in both the forward and the inverted indices.
6. Forward Index
• Stored in a number of barrels.
• If a document contains words that fall into a particular barrel, its docID is recorded in that barrel, followed by a list of wordIDs with their hit lists.
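Since hit lists dominate index size, each hit is worth packing tightly. The sketch below packs one hit into 16 bits. The bit widths (1 bit for capitalization, 3 bits for font size, 12 bits for position) follow the plain-hit layout described in the original paper and are an assumption here, since the slide does not state them; the function names are hypothetical.

```python
def pack_hit(capitalized, font_size, position):
    """Packs one hit into a 16-bit integer.

    Assumed layout: 1 bit capitalization, 3 bits font size, 12 bits position.
    """
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    # Assumption: positions that do not fit in 12 bits are clamped.
    position = min(position, 0xFFF)
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Returns (capitalized, font_size, position) for a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF


hit = pack_hit(True, 3, 42)
print(hex(hit), unpack_hit(hit))
```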

Major Data Structures
7. Inverted Index
• Consists of the same barrels as the forward index, except that they have been processed by the sorter.

Crawling the Web
1. Several distributed crawlers.
• A URLserver serves lists of URLs to the crawlers.
• Each crawler keeps ~300 connections open.
• At peak, a system of 4 crawlers can crawl ~100 pages per second, or ~600 KB of data per second.
• Each crawler maintains its own DNS cache for fast lookups.
2. The parser handles a huge array of possible errors, including HTML errors, non-ASCII characters, and HTML tags nested hundreds deep.

Indexing the Web
3. Indexing Documents into Barrels
• After each document is parsed, it is encoded into a number of barrels.
• Every word is converted into a wordID using an in-memory hash table, the lexicon.
• Once words are converted into wordIDs, their occurrences in the current document are translated into hit lists and written into the forward barrels.
4. Sorting
• The sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel.
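The two indexing steps can be sketched end to end on a toy collection. This is a simplified illustration under stated assumptions: a single barrel, hits reduced to word positions, and a lexicon that simply assigns the next free wordID to each new word. All function names and the sample documents are hypothetical.

```python
def build_forward_barrel(documents, lexicon):
    """Forward index: one entry per docID listing (wordID, hit list) pairs."""
    forward = []
    for doc_id, text in sorted(documents.items()):
        hits_by_word = {}
        for position, word in enumerate(text.lower().split()):
            # Assumption: unseen words get the next free wordID.
            word_id = lexicon.setdefault(word, len(lexicon))
            hits_by_word.setdefault(word_id, []).append(position)
        forward.append((doc_id, sorted(hits_by_word.items())))
    return forward


def invert(forward):
    """Sorter: reorders forward entries by wordID to get the inverted index."""
    postings = [
        (word_id, doc_id, hits)
        for doc_id, words in forward
        for word_id, hits in words
    ]
    postings.sort(key=lambda posting: (posting[0], posting[1]))
    inverted = {}
    for word_id, doc_id, hits in postings:
        inverted.setdefault(word_id, []).append((doc_id, hits))
    return inverted


lexicon = {}
documents = {1: "web search engine", 2: "search the web"}
forward = build_forward_barrel(documents, lexicon)
inverted = invert(forward)
print(lexicon)
print(inverted)
```

The forward index answers "which words occur in this document", while the sorted result answers "which documents contain this word", which is the question a query needs answered.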

Searching
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.
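The query steps above can be sketched as follows. This is a simplified illustration, not Google's implementation: each barrel is modelled as a dictionary from wordID to {docID: hit list}, the doclist scan is replaced by a set intersection, and the ranking function is a hypothetical stand-in that just counts hits.

```python
import heapq


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    """Returns the top-k docIDs that contain every query word."""
    # Steps 1-2: parse the query and convert words into wordIDs.
    word_ids = [lexicon.get(word) for word in query.lower().split()]
    if not word_ids or any(word_id is None for word_id in word_ids):
        return []
    matches = {}
    # Steps 3-7: look in the short barrels first, then in the full barrels.
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(word_id, {}) for word_id in word_ids]
        # Documents that match all search terms (dicts iterate over docIDs).
        common = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in sorted(common):
            if doc_id not in matches:
                hit_lists = [doclist[doc_id] for doclist in doclists]
                matches[doc_id] = rank(doc_id, hit_lists)
    # Step 8: sort the matched documents by rank and return the top k.
    return heapq.nlargest(k, matches, key=matches.get)


def hit_count_rank(doc_id, hit_lists):
    """Hypothetical ranking function: more hits means a higher rank."""
    return sum(len(hits) for hits in hit_lists)


lexicon = {"bill": 0, "clinton": 1}
short_barrel = {0: {1: [0]}, 1: {1: [1]}}
full_barrel = {
    0: {1: [0, 5], 2: [3, 4]},
    1: {1: [6], 2: [9, 10], 3: [1]},
}
results = search("bill clinton", lexicon, short_barrel, full_barrel,
                 hit_count_rank, k=2)
print(results)
```

Document 3 contains only one of the two words, so it never reaches the ranking step.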

Results and Performance

• Qualitative feedback from users on the search results has generally been positive.
• The current version of Google answers most queries in between 1 and 10 seconds.
• Because Google takes the proximity of word occurrences into account, its results are more relevant than those of search engines that only require every query word to appear somewhere on the page. (E.g. a search for ‘bill clinton’ gives lower importance to pages where ‘bill’ and ‘clinton’ occur independently.)

Future Work
• Search times in the current version of Google are dominated by disk I/O. Planned remedies include query caching and hardware, software, and algorithmic optimizations.
• Improve search efficiency and quickly scale to ~100 million web pages.
• Develop Google into a large-scale research tool for searchers and researchers.

Analysis of the Research Paper
• Pros
• One of the first descriptions of the PageRank algorithm, which changed how search engines rank and index the web.
• Using the citation graph and anchor text to rank pages closely resembles the way users themselves judge websites.
• Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.
• The paper states that Google does not compromise PageRank for monetary gain, which lends more credibility to its search results. This holds true to date.

Analysis of the Research Paper
• Cons
• One of the first flaws found in the PageRank approach was the “Google bomb”:
• A page is ranked higher for a phrase if the sites that link to it consistently use that phrase as anchor text.
• A Google bomb is created when a large number of sites link to a page in this manner.
• Ranking quality is insufficient when only PageRank and anchor text are used. (Google today uses more than 200 different parameters to judge the quality of a webpage.)

Thank You Presented by: Nilay Khandelwal