1
Anatomy of a Large-Scale Hypertextual Web Search Engine
ECE 7995: Term Paper
November 05, 2001
Gowri V Pai, Graduate Student
Computer Engineering Department, Wayne State University
2
Google Search Engine
Authors and Founders
Larry Page & Sergey Brin
3
Google Search Engine
Introduction
Need for Search Engine Technology
– More than 50 billion pages on the World Wide Web
– Simplicity and convenience
– Quick data retrieval
Popular Search Engines
– Google
– Yahoo Search
– MSN Search
– AltaVista
4
Google Search Engine
Objectives
– Search engine design challenges
– Features of a quality search engine
– Search engine system anatomy
– Search engine applications
– Performance metrics
– Interactive close
5
Google Search Engine
Search Engine Design Challenges
Obstacles for Information Retrieval
– Rapid growth in the number of web users
– Rapid growth in the amount of information on the web
  1994: World Wide Web Worm had an index of 110,000 web pages
  1997: Top search engines claimed to index from 2 million (WebCrawler) to 100 million web documents
  2001: Data available (expected to multiply severalfold)
Low-quality matches returned by keyword search
– Advertiser gimmicks to mislead users
6
Google Search Engine
Search Engine Design Challenges
Technical Challenges
– Need for a fast crawling technology to gather web documents and keep them up to date
– Efficient use of space to store the indices and the documents themselves
– Need for an efficient indexing system to handle gigabytes of data
– Quick query handling capabilities
– Improved search quality
7
Google Search Engine
Introduction to Google Search Engine
The name Google is derived from "googol", the number 1 followed by 100 zeros (10^100)
Features
– Stores every document it crawls in compressed form
– Embraces the concept of "PageRank"
– Anchor propagation
– Location information for all hits
– Visual presentation details, such as font size
– HTML of pages available in the repository
8
Google Search Engine
Basic Terminology
PageRank
– If pages B and C link to page A, then B and C are backlinks of A
– The PageRank of a page models how often a random surfer visits it, that is, the probability that a random surfer is on that page
9
Google Search Engine
Basic Terminology
PageRank Computation Example
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
– T1 ... Tn: pages pointing to page A
– C(Ti): number of links going out of page Ti
– d: damping factor, usually set to 0.85
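The formula above can be evaluated by simple iteration. The following is a minimal sketch, not the engine's actual implementation: the function name `pagerank`, the starting value of 1.0 per page, the fixed iteration count and the three-page example web are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to (no duplicates).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical web: B and C are backlinks of A, and A links back to B.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(links)
```

In this example C has no backlinks, so its rank settles at 1 - d, while A, with two backlinks, ends up ranked highest.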
10
Google Search Engine
Basic Terminology
Anchor Propagation in Google
– Anchor text is associated not only with the page the link is on, but also with the page the link refers to
– Advantages
  Web pages that have not actually been crawled can be returned
  Ex: images, programs, databases
  Better quality results
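As a rough illustration of anchor propagation, the sketch below files each anchor text under the page the link points to. The function name `propagate_anchors`, the input layout and the example URLs are hypothetical.

```python
from collections import defaultdict


def propagate_anchors(crawled_pages):
    """Associate each anchor text with the page the link refers to.

    crawled_pages maps a source URL to a list of (target URL, anchor text) pairs.
    """
    anchor_index = defaultdict(list)
    for source_url, anchors in crawled_pages.items():
        for target_url, text in anchors:
            anchor_index[target_url].append(text)
    return dict(anchor_index)


# Hypothetical crawl: only one page was fetched, but it links to an image.
crawled = {
    "http://example.com/": [
        ("http://example.com/logo.png", "company logo"),
        ("http://example.com/db", "product database"),
    ],
}
anchor_index = propagate_anchors(crawled)
```

The image and the database were never crawled, yet both can now be matched through their anchor text.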
11
Google Search Engine
Search Engine System Anatomy
[Figure: system architecture diagram. Components: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Barrels, Lexicon, Sorter, Searcher, PageRank, Doc Index, Links]
12
Google Search Engine
Search Engine System Anatomy
Terminology:
Repository
– Contains the full HTML of every page in compressed form
– Documents are stored one after another, each prefixed by docID, length and URL
Document Index
– Keeps information about each document
– Includes the current document status, a pointer into the repository and various statistics
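A repository record of this kind might be sketched as follows. The header field widths, the field order and the helper names `pack_record` and `unpack_record` are assumptions for illustration; the slide only states that each compressed document is prefixed by docID, length and URL.

```python
import struct
import zlib

# Assumed header: docID (4 bytes), URL length (2 bytes), compressed page length (4 bytes).
HEADER = struct.Struct("<IHI")


def pack_record(doc_id, url, html):
    """Build one repository record: header, URL, zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(buf, offset=0):
    """Read the record at offset; return (docID, URL, HTML, offset of next record)."""
    doc_id, url_len, body_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    body = buf[start + url_len:start + url_len + body_len]
    return doc_id, url, zlib.decompress(body).decode("utf-8"), start + url_len + body_len


# Two hypothetical documents stored back to back.
repository = (
    pack_record(1, "http://a.example/", "<html>one</html>")
    + pack_record(2, "http://b.example/", "<html>two</html>")
)
first = unpack_record(repository)
second = unpack_record(repository, first[3])
```

Because every record carries its own lengths, the repository can be scanned sequentially without any other index.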
13
Google Search Engine
Search Engine System Anatomy
Terminology:
URL Resolver
– Converts URLs into docIDs
– A URL checksum is computed and a binary search is performed on the checksum file
Lexicon
– Like a dictionary, with 14 million words
– Two parts: 1) a list of words 2) a hash table of pointers
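The URL-to-docID lookup can be sketched as a binary search over a list sorted by checksum. This is only an illustration: CRC-32 stands in for whatever checksum the real system used, an in-memory list stands in for the checksum file, and the function names are hypothetical.

```python
import bisect
import zlib


def url_checksum(url):
    """Stand-in checksum (CRC-32); the checksum actually used is not given on the slide."""
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_file(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum, like a sorted checksum file."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def resolve(checksum_file, url):
    """Binary-search the sorted pairs for the URL's checksum; return its docID or None."""
    target = url_checksum(url)
    # (target,) sorts before every (target, docID) pair, so this finds the first match.
    i = bisect.bisect_left(checksum_file, (target,))
    if i < len(checksum_file) and checksum_file[i][0] == target:
        return checksum_file[i][1]
    return None


checksum_file = build_checksum_file({
    "http://example.com/": 1,
    "http://example.com/about": 2,
    "http://example.org/": 3,
})
```

Usage: `resolve(checksum_file, "http://example.com/about")` returns the docID stored for that URL, and an unknown URL yields `None`.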
14
Google Search Engine
Search Engine System Anatomy
Terminology:
Forward Index
– Partially sorted index; the first step toward creating the inverted index
– Stored in a number of barrels, each holding a range of wordIDs
15
Google Search Engine
Search Engine System Anatomy
Terminology:
Inverted Index
– Consists of the same barrels as the forward index, after they have been processed by the sorter
– For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into
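The relationship between the two indices can be sketched for a single barrel. The data layout and the function name `invert` are assumptions; the real barrels are on-disk structures rather than Python lists.

```python
from collections import defaultdict


def invert(forward_barrel):
    """Turn a forward barrel into an inverted barrel.

    forward_barrel: list of (docID, [(wordID, hit list), ...]) entries.
    Returns {wordID: [(docID, hit list), ...]} with doclists ordered by docID.
    """
    inverted = defaultdict(list)
    for doc_id, entries in sorted(forward_barrel, key=lambda item: item[0]):
        for word_id, hits in entries:
            inverted[word_id].append((doc_id, hits))
    return dict(sorted(inverted.items()))


# Hypothetical barrel covering wordIDs 10 and 11; hit lists are word positions.
forward_barrel = [
    (2, [(10, [1, 5]), (11, [3])]),
    (1, [(10, [7])]),
]
inverted_barrel = invert(forward_barrel)
```

The forward barrel answers "which words occur in this document", while the inverted barrel answers "which documents contain this word".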
16
Google Search Engine
Search Engine System Anatomy
Terminology:
Hit List
– A list of the occurrences of a particular word in a particular document, including position, font and capitalization information
– Accounts for most of the space in both the forward and inverted indices
– Uses a compact encoding, which requires less space than a simple encoding and less bit manipulation than Huffman coding
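A compact hit encoding can be sketched by packing the three fields into two bytes. The bit widths used here (1 capitalization bit, 3 font-size bits, 12 position bits) and the clamping of large positions are assumptions for illustration, as are the function names.

```python
POSITION_BITS = 12
FONT_BITS = 3
MAX_POSITION = (1 << POSITION_BITS) - 1  # 4095


def encode_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: [capitalization | font size | position]."""
    if not 0 <= font_size < (1 << FONT_BITS):
        raise ValueError("font_size must fit in 3 bits")
    position = min(position, MAX_POSITION)  # assumed: clamp positions that do not fit
    return (int(capitalized) << 15) | (font_size << POSITION_BITS) | position


def decode_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> POSITION_BITS) & 0b111, hit & MAX_POSITION


packed = encode_hit(True, 3, 42)
unpacked = decode_hit(packed)
```

Each hit costs only two bytes, and decoding needs just a few shifts and masks.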
17
Google Search Engine
Major Search Engine Operation
Search engine applications
– Crawling
– Indexing
– Searching
18
Google Search Engine
Major Search Engine Operation
Crawling
– Involves interacting with hundreds of thousands of web servers
– Google has a fast distributed crawling system; each crawler keeps roughly 300 connections open at once
– At peak speeds, the system can crawl over 100 web pages/sec using 4 crawlers
– Each crawler maintains its own DNS cache, so it avoids a DNS lookup before each document is crawled
– URLservers and crawlers are implemented in Python
– Crawlers respect the Robots Exclusion Protocol
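Two of the crawler points above, the per-crawler DNS cache and the Robots Exclusion Protocol check, can be sketched with the Python standard library. The `Crawler` class, its method names, the user agent "sketchbot" and the fake resolver are hypothetical; no pages are actually fetched.

```python
import socket
from urllib import robotparser


class Crawler:
    """Toy crawler state: a private DNS cache plus per-host robots.txt rules."""

    def __init__(self, user_agent, resolver=socket.gethostbyname):
        self.user_agent = user_agent
        self.resolver = resolver
        self.dns_cache = {}
        self.robots = {}

    def lookup(self, host):
        """Resolve a host once and reuse the answer for later documents."""
        if host not in self.dns_cache:
            self.dns_cache[host] = self.resolver(host)
        return self.dns_cache[host]

    def load_robots(self, host, robots_txt):
        """Parse the text of a host's robots.txt file."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def allowed(self, host, url):
        """Return True if the loaded rules permit fetching url."""
        parser = self.robots.get(host)
        return parser is None or parser.can_fetch(self.user_agent, url)


lookups = []


def fake_resolver(host):
    """Stand-in for a real DNS query so the sketch runs offline."""
    lookups.append(host)
    return "192.0.2.1"


crawler = Crawler("sketchbot", resolver=fake_resolver)
crawler.load_robots("example.com", "User-agent: *\nDisallow: /private/\n")
crawler.lookup("example.com")
crawler.lookup("example.com")  # served from the cache, no second DNS query
```

With roughly 300 connections open per crawler, skipping repeated lookups like this removes one network round trip per document.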
19
Google Search Engine
Major Search Engine Operation
Indexing
– Parsing: designed to run on the entire web, so the parser must handle a huge array of errors
– Indexing documents into barrels: after parsing, each document is encoded into a number of barrels; every word is converted into a wordID using the lexicon's hash table
– Sorting: generates the inverted index by sorting each forward barrel by wordID
20
Google Search Engine
Quality Search
Searching
Google Query Evaluation Process
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
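The evaluation steps above can be sketched as follows. This is a simplification under stated assumptions: the short and full barrels of step 6 are collapsed into one in-memory inverted index, doclists are plain sorted lists of docIDs, and the names `matching_docs`, `evaluate_query` and the toy `rank` function are hypothetical.

```python
import heapq


def matching_docs(doclists):
    """Scan sorted doclists in step and yield docIDs present in all of them (steps 4 and 7)."""
    positions = [0] * len(doclists)
    while all(p < len(dl) for p, dl in zip(positions, doclists)):
        current = [dl[p] for p, dl in zip(positions, doclists)]
        highest = max(current)
        if all(doc == highest for doc in current):
            yield highest
            positions = [p + 1 for p in positions]
        else:
            # Advance only the lists that are still behind the highest docID.
            positions = [p + (dl[p] < highest) for p, dl in zip(positions, doclists)]


def evaluate_query(query, lexicon, inverted_index, rank, k=10):
    words = query.lower().split()                                  # step 1
    if not words or any(word not in lexicon for word in words):
        return []
    word_ids = [lexicon[word] for word in words]                   # step 2
    doclists = [inverted_index.get(wid, []) for wid in word_ids]   # step 3
    scored = [(rank(doc, word_ids), doc)                           # steps 4, 5 and 7
              for doc in matching_docs(doclists)]
    return [doc for _, doc in heapq.nlargest(k, scored)]           # step 8


# Hypothetical data: two words, and a fixed score per matching document.
lexicon = {"web": 1, "search": 2}
inverted_index = {1: [1, 3, 5, 8], 2: [3, 4, 8, 9]}
scores = {3: 0.2, 8: 0.9}
results = evaluate_query("web search", lexicon, inverted_index,
                         rank=lambda doc, word_ids: scores[doc])
```

Only documents 3 and 8 contain both words, so they are the only ones ranked, and document 8 comes first because of its higher score.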
21
Google Search Engine
Quality Search
Ranking System in Google
Single-word query:
– Look up the word in the document's hit list
– Each hit has a type-weight depending on its type (title, font, URL, anchor)
– The number of hits of each type is counted, and every count is converted into a count-weight
– The dot product of the count-weights with the type-weights gives the IR score
– The IR score is combined with PageRank to give the final rank of the document
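The single-word ranking steps can be sketched numerically. Every number here is made up: the type-weights, the capped count-weight function and the linear blend with PageRank are assumptions, since the slide does not give the actual values or the way the two scores are combined.

```python
from collections import Counter

# Hypothetical type-weights, one per hit type.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "url": 3.0, "large_font": 2.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Assumed count-weight: grows with the count, then stops growing at the cap."""
    return min(count, cap)


def ir_score(hit_types):
    """Dot product of the count-weights with the type-weights."""
    counts = Counter(hit_types)
    return sum(count_weight(counts[t]) * weight for t, weight in TYPE_WEIGHTS.items())


def final_rank(hit_types, pagerank, alpha=0.5):
    """Assumed combination: a linear blend of IR score and PageRank."""
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank


# One title hit and two plain-text hits for the query word.
score = ir_score(["title", "plain", "plain"])
rank = final_rank(["title"], pagerank=2.0)
```

Capping the count-weight means a page that repeats a word many times gains little beyond the cap.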
22
Google Search Engine
Quality Search
Ranking System in Google
Multiple-word query: [ complicated process ]
– Hit lists are scanned together so that hits occurring close together in the document are weighted higher
– For every matched set of hits, a proximity is computed from the distance between the hits
– Counts are computed for every type and proximity, and converted into count-weights
– Every type and proximity pair has a type-prox-weight
– The dot product of the count-weights and the type-prox-weights gives the IR score, which in turn determines the final rank
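The proximity step can be sketched for two query words. The bin count of 10, the rule mapping distance to a bin and the function names are assumptions for illustration; the slide only says that proximity depends on the distance between hits.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap, in word positions, between any hit of word A and any hit of word B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def proximity_bin(distance, bins=10):
    """Map a distance to a bin: 0 means adjacent words, bins - 1 means far apart."""
    return max(0, min(distance - 1, bins - 1))


# Hypothetical hit positions of two query words in one document.
distance = min_distance([3, 40], [4, 90])
nearest_bin = proximity_bin(distance)
```

A ranking function would then look up a type-prox-weight for each (hit type, proximity bin) pair, so that adjacent words count for more than words far apart.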
23
Google Search Engine
Performance Metrics
Performance & Results
– All returned pages have high PageRank and so tend to be high-quality pages, without broken links
– No junk results, owing to the weight placed on the proximity of word occurrences
– Testing the performance of a search engine is not an easy task; it involves an extensive user study
24
Google Search Engine Performance Metrics Storage Space
25
Google Search Engine
Performance Metrics
System Performance
Experimental Improvement:
– Major operations of Google: crawling, indexing and sorting
– 9 days to download 26 million pages
– Indexer was optimised to avoid becoming a bottleneck; it runs at roughly 54 pages/sec
– Indexer and crawler were run simultaneously to check the performance
– Sorter runs in parallel [ 4 machines ]; the sorting process took 24 hrs
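The figures above can be put on a common scale with a little arithmetic. The second calculation assumes the indexer processed the same 26 million pages at a steady 54 pages/sec, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

pages = 26_000_000
crawl_days = 9
indexer_rate = 54  # pages/sec

# Average crawl rate implied by 26 million pages in 9 days.
crawl_rate = pages / (crawl_days * SECONDS_PER_DAY)

# Assumed scenario: time for the indexer to process the same pages at a steady rate.
index_days = pages / indexer_rate / SECONDS_PER_DAY

print(f"average crawl rate: {crawl_rate:.1f} pages/sec")
print(f"indexing time at {indexer_rate} pages/sec: {index_days:.1f} days")
```

Under these assumptions the average crawl rate is about 33 pages/sec, below the indexer's 54 pages/sec, which fits the claim that the indexer is not the bottleneck.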
26
Google Search Engine
Performance Metrics
Search Performance
– Most queries are answered in 1 to 10 sec
27
Google Search Engine
Interactive Close
Conclusion
– High quality search
– Efficient in both storage space and time
– Employs a number of techniques to improve performance
– Overcomes bottlenecks