Anatomy of a Large-Scale Hypertextual Web Search Engine

Presentation transcript:

Anatomy of a Large-Scale Hypertextual Web Search Engine
ECE 7995: Term Paper, November 05, 2001
Gowri V Pai, Graduate Student, Computer Engineering Department, Wayne State University

Google Search Engine Authors and Founders Larry Page & Sergey Brin

Introduction
Need for Search Engine Technology
- More than 50 Billion Pages on the World Wide Web
- Simplicity and Convenience
- Quick Data Retrieval
Popular Search Engines
- Google
- Yahoo Search
- MSN Search
- AltaVista

Objectives
- Search engine design challenges
- Features of a quality search engine
- Search engine system anatomy
- Search engine applications
- Performance Metrics
- Interactive close

Search Engine Design Challenges
Obstacles for Information Retrieval
- Rapid growth in the number of web users
- Rapid growth in the amount of information on the web
  1994: World Wide Web Worm had an index of web pages
  1997: top search engines claimed to index from 2 million (WebCrawler) to 100 million web documents
  2001: available data expected to multiply severalfold
- Low-quality results returned by keyword matching
- Advertiser gimmicks to mislead users

Search Engine Design Challenges
Technical Challenges
- Need for a fast crawling technology to gather web documents and keep them up to date
- Efficient utilization of space to store the indices and the documents themselves
- Need for an efficient indexing system to handle gigabytes of data
- Quick query handling capabilities
- Improved search quality

Introduction to Google Search Engine
The name Google derives from googol, the number 1 followed by 100 zeros (10^100)
Features
- Stores all of the actual documents it crawls in compressed form
- Embraces the concept of a “PageRank”
- Anchor Propagation
- Location information of all hits
- Visual presentation details, such as font size
- HTML of pages available in repository

Basic Terminology: PageRank
- B and C are backlinks of A, that is, pages that link to A
- The PageRank of a page reflects how often a random surfer visits that page, or equivalently the probability that a random surfer visits it

Basic Terminology: PageRank Computation Example
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
- T1 ... Tn: pages pointing to page A
- C(Tn): number of links going out of page Tn
- d: damping factor, usually set to 0.85
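The formula above can be evaluated by simple iteration. The following is a minimal sketch, assuming a tiny hand-made link graph; the function name `pagerank`, the starting value of 1.0 per page, and the fixed iteration count are illustrative choices, not details from the slides.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each backlink T contributes PR(T) divided by its outgoing link count C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Toy graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this toy graph C ends up ranked above B, because C is pointed to by both A and B while B only receives half of A's rank.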

Basic Terminology: Anchor Propagation in Google
- Anchor text is associated not only with the page the link is on, but also with the page the link refers to
- Advantages
  Web pages which have not actually been crawled can still be returned
  Works for documents that cannot be indexed by text, e.g. images, programs, databases
  Better quality results

Search Engine System Anatomy
[Architecture diagram. Components: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Barrels, Lexicon, Sorter, Searcher, PageRank, Doc Index, Links]

Search Engine System Anatomy: Terminology
Repository
- Contains the full HTML of every page in compressed form
- Documents are stored one after another, each prefixed by docID, length and URL
Document Index
- Keeps information about each document
- Includes the current document status, a pointer into the repository and various statistics
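The repository layout described above can be sketched as a sequence of length-prefixed records. This is only an illustration: the field widths in `HEADER`, the use of `zlib`, and the function names are assumptions made for the example, not the actual on-disk format.

```python
import struct
import zlib

# Assumed header: docID (4 bytes), compressed length (4 bytes), URL length (2 bytes).
HEADER = struct.Struct("<IIH")


def pack_record(doc_id, url, html):
    """Build one repository record: header, then URL, then compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body


def read_records(blob):
    """Walk the records stored one after another and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(blob):
        doc_id, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repository = (
    pack_record(1, "http://a.example/", "<html>A</html>")
    + pack_record(2, "http://b.example/", "<html>B</html>")
)
docs = list(read_records(repository))
```

Because every record carries its own lengths, the whole repository can be read back with a single sequential scan and no other data structure.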

Search Engine System Anatomy: Terminology
URL Resolver
- Converts URLs into docIDs
- A checksum of the URL is computed and a binary search is performed on the checksum file
Lexicon
- Like a dictionary, with 14 million words
- 2 parts: 1) list of words 2) hash table of pointers
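A rough sketch of the URL-to-docID lookup: compute a checksum, then binary-search a sorted checksum file. CRC32 is used here only because it ships with Python; the slides do not say which checksum is used, and this sketch ignores checksum collisions.

```python
import bisect
import zlib


def url_checksum(url):
    # Assumption: CRC32 stands in for whatever checksum the real system uses.
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_file(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum, like a sorted file on disk."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def resolve(checksum_file, url):
    """Binary-search the sorted checksum list; return the docID or None."""
    target = url_checksum(url)
    index = bisect.bisect_left(checksum_file, (target,))
    if index < len(checksum_file) and checksum_file[index][0] == target:
        return checksum_file[index][1]
    return None


checksum_file = build_checksum_file({
    "http://a.example/": 1,
    "http://b.example/": 2,
    "http://c.example/": 3,
})
doc_id = resolve(checksum_file, "http://b.example/")
```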

Search Engine System Anatomy: Terminology
Forward Index
- A partially sorted index; the first step towards creating the inverted index
- The index is stored in a number of barrels, each holding a range of wordIDs
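To make the barrel idea concrete, here is a small sketch that assigns wordIDs from an in-memory lexicon and files each word occurrence list into the barrel covering its wordID range. `NUM_BARRELS`, `WORDS_PER_BARREL`, and the tuple layout are invented for the example.

```python
from collections import defaultdict

NUM_BARRELS = 4        # arbitrary for this sketch; the slides give no count
WORDS_PER_BARREL = 2   # assumed width of each barrel's wordID range (tiny for the demo)


def barrel_for(word_id):
    """Each barrel holds a contiguous range of wordIDs; the last barrel takes the overflow."""
    return min(word_id // WORDS_PER_BARREL, NUM_BARRELS - 1)


def add_document(barrels, lexicon, doc_id, words):
    """Record (docID, wordID, positions) in the barrel that owns each wordID."""
    hits = defaultdict(list)
    for position, word in enumerate(words):
        word_id = lexicon.setdefault(word, len(lexicon))  # new words get the next free ID
        hits[word_id].append(position)
    for word_id, positions in hits.items():
        barrels[barrel_for(word_id)].append((doc_id, word_id, positions))


barrels = [[] for _ in range(NUM_BARRELS)]
lexicon = {}
add_document(barrels, lexicon, 1, ["web", "search", "engine", "web"])
add_document(barrels, lexicon, 2, ["search", "ranking"])
```

Entries inside a barrel stay in docID order, which is why the forward index counts as only partially sorted.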

Search Engine System Anatomy: Terminology
Inverted Index
- Consists of the same barrels as the forward index, after they have been processed by the sorter
- For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into
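The sorter's job can be sketched as re-sorting one forward barrel by wordID so that each wordID leads to its list of documents. The entry layout `(docID, wordID, positions)` and the use of a dictionary key in place of a lexicon pointer are assumptions of this sketch.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Turn forward entries (docID, wordID, positions) into wordID -> doclist."""
    # Sort by wordID first, then docID, which is what makes the barrel "inverted".
    ordered = sorted(forward_barrel, key=lambda entry: (entry[1], entry[0]))
    doclists = defaultdict(list)
    for doc_id, word_id, positions in ordered:
        doclists[word_id].append((doc_id, positions))
    return dict(doclists)


# A forward barrel in docID order, as it would look before sorting.
forward_barrel = [(1, 0, [0, 3]), (1, 1, [1]), (2, 1, [0])]
inverted_barrel = invert_barrel(forward_barrel)
```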

Search Engine System Anatomy: Terminology
Hit List
- A list of the occurrences of a particular word in a particular document, including position, font and capitalization information
- Accounts for most of the space in both the forward and the inverted index
- Google uses a hand-optimized compact encoding: it requires less space than a simple encoding and less bit manipulation than Huffman coding
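As an illustration of a compact encoding, the sketch below packs one hit into 16 bits. The split into 1 capitalization bit, 3 font-size bits and 12 position bits follows the description of a plain hit in the Brin and Page paper rather than the slides; the order of the fields inside the word is an assumption.

```python
MAX_POSITION = 4095  # largest value that fits in 12 bits


def encode_hit(capitalized, font_size, position):
    """Pack a hit into 16 bits: [1 bit capitalization][3 bits font size][12 bits position]."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    position = min(position, MAX_POSITION)  # clamp positions that do not fit
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_hit(hit):
    """Unpack the three fields from a 16-bit hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


packed = encode_hit(True, 3, 42)
unpacked = decode_hit(packed)
```

Two bytes per hit is what keeps the hit lists, and therefore both indices, small.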

Major Search Engine Operations
Search engine applications
- Crawling
- Indexing
- Searching

Major Search Engine Operations: Crawling
- Involves interacting with hundreds of thousands of web servers
- Google has a fast distributed crawling system; each crawler keeps roughly 300 connections open at once
- At peak speeds, the system can crawl over 100 web pages/sec using 4 crawlers
- Each crawler maintains its own DNS cache, so it does not need a DNS lookup before crawling each document
- URL servers and crawlers are implemented in Python
- Crawlers follow the Robots Exclusion Protocol
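Two of the crawler points above, the per-crawler DNS cache and the Robots Exclusion Protocol, can be sketched with the Python standard library. `DnsCache`, `fake_resolver`, the `demo-crawler` user agent and the robots rules are all made up for the example; a real crawler would plug in an actual resolver such as `socket.gethostbyname` and fetch each site's robots.txt.

```python
from urllib.robotparser import RobotFileParser


class DnsCache:
    """Remember resolved hosts so each host costs only one lookup per crawler."""

    def __init__(self, resolver):
        self._resolver = resolver  # callable: host name -> IP address
        self._cache = {}
        self.lookups = 0           # how many real lookups were needed

    def resolve(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


def fake_resolver(host):
    # Stand-in for a real DNS lookup so the sketch runs offline.
    return "192.0.2.1"


cache = DnsCache(fake_resolver)
for _ in range(3):
    address = cache.resolve("www.example.com")

# Robots Exclusion Protocol: parse the rules once, then check each URL before fetching.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("demo-crawler", "http://www.example.com/index.html")
blocked = robots.can_fetch("demo-crawler", "http://www.example.com/private/page.html")
```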

Major Search Engine Operations: Indexing
- Parsing: designed to run on the entire web, so the parser must handle a huge array of errors
- Indexing documents into barrels: after parsing, documents are encoded into a number of barrels
- Every word is converted into a wordID using an in-memory hash table, the lexicon
- Sorting: generates the inverted index; each forward barrel is sorted by wordID

Quality Search: Searching
Google Query Evaluation Process
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
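The steps above can be sketched in a few lines. This is a simplification under stated assumptions: doclists are plain sorted lists of docIDs, the fall-back to the full barrels happens once the short barrels yield fewer than k matches, and `rank` is a hypothetical scoring callback supplied by the caller.

```python
def intersect(doclists):
    """Scan sorted docID lists in parallel and keep documents present in all of them."""
    if not doclists:
        return []
    cursors = [0] * len(doclists)
    matches = []
    while all(c < len(dl) for c, dl in zip(cursors, doclists)):
        current = [dl[c] for c, dl in zip(cursors, doclists)]
        highest = max(current)
        if all(doc == highest for doc in current):
            matches.append(highest)
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the lists that are behind the highest docID seen.
            cursors = [c + 1 if dl[c] < highest else c for c, dl in zip(cursors, doclists)]
    return matches


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                      # step 1: parse the query
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]             # step 2: words -> wordIDs
    matches = intersect([short_barrel.get(w, []) for w in word_ids])  # steps 3-4
    if len(matches) < k:                               # step 6, simplified
        seen = set(matches)
        full = intersect([full_barrel.get(w, []) for w in word_ids])
        matches += [doc for doc in full if doc not in seen]
    return sorted(matches, key=rank, reverse=True)[:k]  # steps 5 and 8


lexicon = {"web": 0, "search": 1}
short_barrel = {0: [1, 4], 1: [4, 7]}
full_barrel = {0: [1, 2, 4, 9], 1: [2, 4, 7, 9]}
scores = {2: 0.3, 4: 0.9, 9: 0.5}  # made-up ranks per docID
results = search("web search", lexicon, short_barrel, full_barrel,
                 rank=lambda doc: scores.get(doc, 0.0), k=2)
```

Here the short barrels only produce document 4, so the full barrels are consulted and documents 2 and 9 join the candidates before the final sort.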

Quality Search: Ranking System in Google
Single-word query
- The word is looked up in the document's hit list
- Each hit has its own type-weight depending on its type: title, font, URL, anchor
- The number of hits of each type is counted in the hit list, and every count is converted into a count-weight
- The dot product of the count-weights with the type-weights gives the IR score
- The IR score is combined with PageRank to give the final rank of the document
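A toy version of the single-word ranking described above. All numbers here are invented: the `TYPE_WEIGHTS` values, the capped linear `count_weight`, and the weighted sum that mixes IR score with PageRank are assumptions, since the slides do not give the actual weights or the combining function.

```python
from collections import Counter

# Made-up type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "url": 3.0, "large_font": 2.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow linearly with the count, then stop rewarding further hits (assumed shape)."""
    return min(count, cap) / cap


def ir_score(hit_types):
    """Dot product of count-weights and type-weights over the hit types seen."""
    counts = Counter(hit_types)
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(n) for hit_type, n in counts.items())


def final_rank(hit_types, pagerank, alpha=0.5):
    # Assumption: a simple weighted sum stands in for the unspecified combination.
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank


score = ir_score(["title", "plain", "plain"])
rank = final_rank(["title", "plain", "plain"], pagerank=2.0)
```

Capping the count-weight means repeating a word hundreds of times on a page earns no more than repeating it a handful of times.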

Quality Search: Ranking System in Google
Multiple-word query (a more complicated process)
- Hit lists are scanned together so that hits occurring close together in the document are weighted higher
- For each matched set of hits, a proximity is computed from the distance between the hits
- Counts are computed for every type and proximity and converted into count-weights
- Every type and proximity pair has a type-prox-weight
- The dot product of the count-weights and the type-prox-weights gives the IR score, which in turn gives the final rank
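The proximity part of the multi-word ranking can be sketched for a two-word query. To stay short, this example leaves out hit types and scores proximity only; the bin boundaries in `PROX_BINS`, the `prox_weight` formula and the capped `count_weight` are all invented for illustration.

```python
from collections import Counter

PROX_BINS = (1, 2, 4, 8, 16)  # assumed upper bounds on distance for each proximity bin


def proximity_bin(distance):
    """Map a distance between two hits to a bin; bin 0 means adjacent words."""
    for index, bound in enumerate(PROX_BINS):
        if distance <= bound:
            return index
    return len(PROX_BINS)  # "not even close"


def prox_weight(bin_index):
    return 1.0 / (1 + bin_index)  # made-up weight: closer bins count more


def count_weight(count, cap=8):
    return min(count, cap) / cap


def proximity_score(positions_a, positions_b):
    """Count hit pairs per proximity bin, then take the weighted sum."""
    bins = Counter(
        proximity_bin(abs(a - b)) for a in positions_a for b in positions_b
    )
    return sum(prox_weight(b) * count_weight(n) for b, n in bins.items())


near = proximity_score([3], [4])    # the two words are adjacent
far = proximity_score([3], [40])    # the two words are far apart
```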

Performance Metrics: Performance & Results
- Returned pages have high PageRank and are therefore high-quality pages, without any broken links
- No junk results, thanks to the importance placed on the proximity of word occurrences
- Testing the performance of a search engine is not an easy task; it involves an extensive user study

Performance Metrics: Storage Space

Performance Metrics: System Performance
- Major operations of Google: crawling, indexing and sorting
- It took roughly 9 days to download 26 million pages
- The indexer was optimized to avoid being a bottleneck; it runs at roughly 54 pages/sec
- The indexer and the crawler were run simultaneously
- The sorter runs in parallel on 4 machines; the whole sorting process took about 24 hours

Performance Metrics: Search Performance
- Most queries are answered in between 1 and 10 seconds

Interactive Close: Conclusion
- High-quality search
- Efficient in both storage space and time
- Employs a number of techniques to improve performance
- Overcomes bottlenecks