
By Sergey Brin and Lawrence Page, developers of Google (1997): The Anatomy of a Large-Scale Hypertextual Web Search Engine

The Problem The Web continues to grow rapidly, and so does its population of inexperienced users. Human-maintained lists: –Cannot keep up with the volume of changes –Are subjective –Do not cover all topics Automated search engines: –Return too many low-relevance matches –Are manipulated by advertisers for commercial purposes

Internet Growth

Keeping up with the Web Requirements: –Fast crawling technology –Efficient use of storage (for indices and possibly the documents themselves) –Index queries handled at a rate of hundreds to thousands per second Mitigating factor: hardware performance keeps improving (exceptions: disk seek time and OS robustness) while its cost tends to decline.

Design Goals

Design Goal: Search Quality By 1994, some believed that a complete index of the Web was all that was needed. By 1997, an index could be complete and still return many junk results that bury the relevant ones. –Index sizes have grown, and so has the number of matches, but… –people are still willing to look at only a handful of results. Only the most relevant documents should be returned. The hypothesis: the Web's link structure and link text can help find those relevant documents.

Design Goal: Academic Research The Web has shifted from academic to commercial use, while search engine development has remained an obscure, proprietary area. Google aims to make it more understandable at the academic level and to promote continuing research. By caching large parts of the Web, Google itself becomes a research platform from which new results can be derived quickly.

Google Features

Prioritizing Pages: PageRank The Web can be described as a huge graph. This graph can be used to compute quickly the importance of each result matching the keywords given by the user. This resource had gone largely unused until Google.

An Example of PageRank The rank of page A is defined in terms of: –The PageRank of each page T1…Tn pointing to it: PR(Ti) –The number of outbound links on each such page: C(Ti) –A "damping factor" between 0 and 1: d PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) This formula is computed quickly with a simple iterative algorithm, sketched below.
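
A minimal sketch of that iteration; the example graph, damping factor, and iteration count are illustrative choices, not values from the paper:

def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it points to
    pages = list(links)
    pr = {p: 1.0 for p in pages}                     # initial guess
    for _ in range(iterations):
        # PR(A) = (1-d) + d * sum of PR(T)/C(T) over pages T linking to A
        pr = {p: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in pages if p in links[t])
              for p in pages}
    return pr

# Tiny illustrative graph: A and B link to each other, C links to A.
print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))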

Intuition behind PageRank A "random surfer" starts from a random page and keeps clicking random links, again and again. The probability that the surfer is visiting page A is its PageRank: P(A) = PR(A). At times the surfer gets bored and restarts at a random page; the damping factor d models this, and it helps prevent deliberate manipulation of the rankings.

Intuition behind PageRank [link-graph illustration omitted] The more pages link to a page, the more likely the random surfer is to reach it; that likelihood is the page's PageRank. The damping factor d ensures the decision is not based on incoming links alone, since those can be manufactured intentionally.

Anchor Text Often describes a page better than the page itself does. It is associated not only with the page where it appears, but also with the page it points to (see the sketch below). This makes it possible to index non-text content. Downside: the destinations of these links are not verified, so they may not even exist.
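
A minimal sketch of the idea, with hypothetical names (not Google's internals): anchor text is filed under the page the link points to, whether or not that page was ever fetched:

from collections import defaultdict

anchor_index = defaultdict(list)   # target URL -> anchor texts pointing at it

def record_link(source_url, target_url, anchor_text):
    # The text is indexed under the *target*, even if that target was
    # never fetched (e.g. an image, or a page the crawler never reached).
    anchor_index[target_url].append((anchor_text, source_url))

record_link("http://a.example/", "http://b.example/logo.gif", "company logo")
print(anchor_index["http://b.example/logo.gif"])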

Other Features Google takes into account the in-page position of hits and the visual presentation of words (large size, bold, etc.), weighting them accordingly. The full HTML of pages is cached in a repository.

Related Research (1997)

The World Wide Web Worm was one of the first search engines. Many early search engines have since become public companies, and the details of their engines are usually confidential. There is, however, published work on post-processing the results of major search engines.

Research on Information Retrieval Has produced results based on controlled collections of documents in specific areas. Even the largest benchmark (TREC-96) does not scale to something as much bigger and more heterogeneous as the Web. Given a popular topic, users should not need to supply many details about it to get relevant results.

The Web vs. Controlled Collections The Web is a completely uncontrolled collection of documents varying in their… –languages: both human and programming –vocabulary: from zip codes to product numbers –format: text, HTML, PDF, images, sounds –source: human- or machine-generated –external meta information: source reputation, update frequency, etc., all valuable but hard to measure. Any type of content + the influence of search engines + deliberate for-profit manipulation ≠ a controlled collection!

System Anatomy

Architecture Overview Implemented in C/C++; runs on Solaris or Linux. 1. A URLServer sends lists of URLs to be fetched by a set of crawlers 2. Fetched pages are given a docID and sent to a StoreServer, which compresses and stores them in a repository 3. The indexer extracts pages from the repository and parses them, classifying their words into hits 4. Its output goes to barrels, or partially sorted indexes 5. It also builds the anchors file from the links in each page, recording to and from information

Architecture Overview (cont.) 6. The URLResolver reads the anchors file, converting relative URLs into absolute ones and assigning docIDs 7. The forward index is updated with the docIDs the links point to 8. The links database is also created, as pairs of docIDs; this is used to calculate PageRank 9. The sorter takes the barrels and sorts them by wordID, producing the inverted index; a list of wordIDs points to the corresponding offsets in it 10. This list is converted into a lexicon (vocabulary) 11. The searcher is run by a Web server and uses the lexicon, inverted index, and PageRank to answer queries

Data Structures BigFiles: virtual files spanning multiple filesystems, going beyond OS limits. Repository: contains the actual HTML, compressed about 3:1 using the open-source zlib library. –Stored like variable-length data in a DBMS –Independent of the other data structures –The other data structures can be rebuilt from it (see the sketch below)
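
A small sketch of the repository idea, assuming an illustrative record layout (docID, URL, length, compressed HTML); zlib is the same open-source library the paper names:

import zlib

def store(repository, doc_id, url, html):
    # Each entry: docID, URL, original length, zlib-compressed HTML.
    repository.append((doc_id, url, len(html),
                       zlib.compress(html.encode("utf-8"))))

def restore(entry):
    doc_id, url, _length, payload = entry
    return doc_id, url, zlib.decompress(payload).decode("utf-8")

repo = []
store(repo, 0, "http://example.com/", "<html>hello</html>" * 100)
print(restore(repo[0])[2][:18])  # other structures could be rebuilt from here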

Data Structures Document Index: an indexed sequential file holding status information about each document. To avoid slow disk seeks, updates from the URL resolver are applied in batch mode; doing them one at a time would take months. Lexicon: the list of words, kept in 256 MB of main memory, holding 14 million words plus a hash table of pointers.

Data Structures Hit Lists: record the occurrences of a word in a document, plus details, and account for most of the space used. –Fancy hits: occurrences in a URL, title, anchor text, or meta tag –Plain hits: everything else –The details are bit-packed into two bytes per hit (sketched below)
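
A sketch of the two-byte plain-hit encoding the paper describes (one capitalization bit, three bits of font size, twelve bits of word position); the exact field order within the 16 bits is an assumption here:

def pack_plain_hit(capitalized, font_size, position):
    # 1 bit capitalization | 3 bits font size | 12 bits word position
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(True, 3, 42)
print(hex(hit), unpack_plain_hit(hit))  # two bytes per occurrence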

Data Structures Forward Index: maps documents to the wordIDs they contain, stored in partially sorted indexes called "barrels". Inverted Index: the same barrels after sorting by wordID; it maps each wordID to the docIDs and hits where the word occurs.

A Simple Inverted Index Example: pages containing the phrases "i love you", "god is love", "love is blind", and "blind justice": blind (3,8);(4,0) god (2,0) i (1,0) is (2,4);(3,5) justice (4,6) love (1,2);(2,7);(3,0) you (1,7) Each pair is (docID, character offset).
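
The same example reconstructed in code: a minimal inverted index mapping each word to (docID, character offset) pairs, reproducing the postings above:

from collections import defaultdict

docs = {1: "i love you", 2: "god is love",
        3: "love is blind", 4: "blind justice"}

inverted = defaultdict(list)
for doc_id, text in docs.items():
    offset = 0
    for word in text.split():
        inverted[word].append((doc_id, offset))
        offset += len(word) + 1   # character offset, counting the space

for word in sorted(inverted):
    print(word, inverted[word])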

Crawling the Web A distributed crawling system: typically around 3 crawlers, each keeping hundreds of connections open, reaching on the order of 100 pages per second. Stress on DNS lookup is reduced by giving each crawler its own DNS cache. Crawling has social consequences, since many site owners do not understand it ("This page is copyrighted and should not be indexed"). Almost any behavior can be expected from software and pages across the net, so intensive testing is required.
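
A minimal sketch of the per-crawler DNS cache: memoize hostname lookups so repeated fetches from the same site skip the DNS round trip (a real crawler would also honor TTLs and handle lookup failures):

import socket
from functools import lru_cache

@lru_cache(maxsize=10_000)
def resolve(hostname):
    return socket.gethostbyname(hostname)  # cached after the first call

print(resolve("example.com"))
print(resolve("example.com"))              # served from the cache, no lookup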

Indexing the Web Parser: must be hardened to expect and handle a huge number of special situations. Indexing into barrels: word → wordID → lexicon update → hit lists → forward barrels; there is high contention for the lexicon. Sorting: the inverted index is generated by sorting the forward barrels individually, avoiding temporary storage, using a two-phase multiway merge sort (TPMMS; sketched below).
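
A toy sketch of the two-phase idea behind TPMMS, with made-up postings: sort each barrel in memory, then do a single multiway merge of the sorted runs:

import heapq

barrels = [                    # (wordID, docID) postings, one list per barrel
    [(5, 2), (1, 3), (5, 1)],
    [(2, 4), (1, 1)],
]

runs = [sorted(b) for b in barrels]           # phase 1: sort each barrel
inverted_order = list(heapq.merge(*runs))     # phase 2: one multiway merge
print(inverted_order)                         # postings now ordered by wordID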

Searching The ranking system benefits from the extra data (font, position, and capitalization) maintained about every hit; combining this with PageRank yields better results (a sketch follows). Feedback: selected users can grade search results to help tune the ranking function.
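
A hedged sketch of combining hit-based evidence with PageRank into one score; the type weights and the combination function are illustrative guesses, since the paper only says the IR score and PageRank are combined:

TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "body": 1.0}  # assumed values

def ir_score(hit_types):
    # hit_types: the hit type of each occurrence of the query word in one doc
    return sum(TYPE_WEIGHTS.get(t, 0.0) for t in hit_types)

def final_rank(hit_types, pagerank, alpha=0.5):
    # alpha is an assumed mixing parameter; the paper does not give one
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank

print(final_rank(["title", "body", "body"], pagerank=4.2))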

Results and Performance "The most important measure of a search engine is the quality of its search results." Results are good even for non-commercial or rarely referenced sites. Storage is cost-effective: a significant part of the 1997 Web was held in a 53 GB repository, and all other data fit in roughly 55 GB more.

Google’s Philosophy “…it is crucial to have a competitive search engine that is transparent and in the academic realm.” Google is perhaps unique among leading IT companies in being widely liked, and it remains attached to its principles despite its enormous profit potential.

The Near Future