Presented by: Saumeet Mohapatra Electronics &Telecommunication Engineering Regn. No: 06005104760 Roll no:604211 KIIT.UNIVERSITY.


1 Presented by: Saumeet Mohapatra Electronics &Telecommunication Engineering Regn. No: 06005104760 Roll no:604211 KIIT.UNIVERSITY

2
• INTRODUCTION
• GOOGLE ARCHITECTURE OVERVIEW
• HOW GOOGLE WORKS?
• GOOGLE'S QUERY PROCESSOR
• HOW GOOGLE PROCESSES A QUERY?
• HARDWARE
• THE PAGE RANK SYSTEM
• FEEDBACK
• RESULTS AND PERFORMANCES
• GOOGLE WEB SEARCH FEATURES
• CONCLUSION

3
• Google, the secretive, extraordinarily successful $6.1 billion global search engine company, is one of the most recognized brands in the world. Yet it discusses only selectively its innovative information management infrastructure, which is based on one of the largest distributed computing/grid systems in the world.
• Google runs on a unique combination of advanced hardware and software. The speed you experience can be attributed partly to the efficiency of our search algorithm and partly to the thousands of low-cost PCs we've networked together to create a superfast search engine.
• The heart of our software is PageRank™, a system for ranking web pages developed by founders Lawrence Page and Sergey Brin at Stanford University. And while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to provide the basis for all of our web search tools.

4 Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan, Ann Arbor, in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human-computer interaction, search engines, scalability of information access interfaces, and personal data mining.
Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University, where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.

5 High Level Google Architecture

6
• Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux. In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages in a repository. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
• The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
• The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index.
• A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
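The step from forward index to inverted index can be illustrated with a minimal Python sketch. It assumes a tiny in-memory document set; the function names `build_forward_index` and `invert` are hypothetical, words stand in for wordIDs, and the hits keep only word and position (no font size or capitalization). Unlike the sorter described above, it does not work in place.

```python
from collections import defaultdict

def build_forward_index(documents):
    """Map docID -> list of (word, position) hits, roughly what the indexer emits."""
    forward = {}
    for doc_id, text in documents.items():
        forward[doc_id] = [(word.lower(), pos) for pos, word in enumerate(text.split())]
    return forward

def invert(forward):
    """Regroup hits by word instead of by docID, roughly what the sorter does."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word, pos in forward[doc_id]:
            inverted[word].append((doc_id, pos))
    return dict(inverted)

# Toy documents keyed by docID (illustrative data only).
documents = {1: "Google ranks web pages", 2: "Web crawlers fetch pages"}
forward_index = build_forward_index(documents)
inverted_index = invert(forward_index)
```

Looking up `inverted_index["web"]` then gives every (docID, position) pair for that word, which is the access pattern a searcher needs.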

7
• Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:
• Googlebot, a web crawler that finds and fetches web pages. It moves from site to site on the internet, downloading copies of web pages and saving them in the Google index (also known as the cache) for future reference.
• The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
• The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.
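The crawling part can be sketched as a breadth-first walk over links. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) rather than real HTTP fetches, and the function `crawl` is not Googlebot's actual logic.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("home page", ["b.example", "c.example"]),
    "b.example": ("about page", ["a.example"]),
    "c.example": ("news page", ["b.example", "d.example"]),
    "d.example": ("contact page", []),
}

def crawl(seed, web):
    """Breadth-first crawl from seed; returns {url: text} in fetch order."""
    cache = {}
    frontier = deque([seed])
    seen = {seed}
    while frontier:
        url = frontier.popleft()
        text, links = web[url]          # stands in for downloading the page
        cache[url] = text               # saved copy for later indexing
        for link in links:
            if link not in seen and link in web:
                seen.add(link)
                frontier.append(link)
    return cache

cache = crawl("a.example", TOY_WEB)
```

The `seen` set is what keeps the crawler from fetching the same page twice when sites link back to each other.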

8
• The query processor has several parts, including the user interface (search box), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.
• PageRank is Google's system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.
• Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org's report for an interpretation of the concepts and the practical applications contained in Google's patent application.
• Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they're tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.
• Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by the Advanced-Search page and search operators.
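The idea of favoring terms that appear near each other and in query order can be made concrete with a small sketch. Google's real formulas are not public, so the `1 / gap` score below is purely an assumption chosen for illustration, as are the function names.

```python
def min_ordered_gap(tokens, first, second):
    """Smallest distance from an occurrence of `first` to a later `second`, or None."""
    best = None
    last_first = None
    for pos, tok in enumerate(tokens):
        if tok == first:
            last_first = pos
        elif tok == second and last_first is not None:
            gap = pos - last_first
            if best is None or gap < best:
                best = gap
    return best

def proximity_score(text, first, second):
    """Illustrative score: 1.0 for adjacent in-order terms, lower as they drift apart."""
    gap = min_ordered_gap(text.lower().split(), first, second)
    return 0.0 if gap is None else 1.0 / gap

adjacent = proximity_score("fast search engine", "search", "engine")
distant = proximity_score("search for a new engine", "search", "engine")
reversed_order = proximity_score("engine search", "search", "engine")
```

A page where the terms are adjacent and in order scores highest, a page where they are far apart scores lower, and a page with the terms only in reverse order scores zero in this toy scheme.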

9
1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book: it tells which pages contain the words that match any particular query term.
2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.
3. The search results are returned to the user in a fraction of a second.
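The three steps above can be sketched as follows. This is a toy model: `INDEX` and `DOCS` are hypothetical in-memory stand-ins for the index servers and doc servers, and requiring every query term to match (AND semantics) is an assumption of this sketch.

```python
# Hypothetical index-server content: word -> docIDs containing it.
INDEX = {"google": [1, 2], "pagerank": [2, 3]}

# Hypothetical doc-server content: docID -> stored document text.
DOCS = {
    1: "Google runs on low-cost PCs",
    2: "Google uses PageRank to rank pages",
    3: "PageRank was developed at Stanford",
}

def index_lookup(query):
    """Step 1: return docIDs that contain every query term."""
    postings = [set(INDEX.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

def snippet(doc_id, query, width=40):
    """Step 2: cut a short excerpt starting at the first query term, if present."""
    text = DOCS[doc_id]
    first_term = query.lower().split()[0]
    start = max(text.lower().find(first_term), 0)
    return text[start:start + width]

def search(query):
    """Step 3: return (docID, snippet) pairs to the user."""
    return [(doc_id, snippet(doc_id, query)) for doc_id in index_lookup(query)]

results = search("google pagerank")
```

Splitting lookup from retrieval mirrors the description: the index only says which documents match, and the stored text is touched only for the documents that will actually be shown.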

10 To provide sufficient service capacity, Google’s physical structure consists of clusters of computers situated around the world, known as server farms. These server farms consist of a large number of commodity-level computers running Linux-based systems that operate with GFS, the Google File System. The largest of these farms have over 1000 storage nodes and over 300 TB of disk storage (Ghemawat, S., Gobioff, H., and Leung, S. T., 2003, p. 2). It has been speculated that Google has the world’s largest computer. The estimate states Google as having up to:
• 899 racks
• 79,112 machines
• 158,224 CPUs
• 316,448 GHz of processing power
• 158,224 GB of RAM
• 6,180 TB of hard drive space
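Dividing the estimate's totals shows what it implies per rack and per machine. The short calculation below uses only the figures listed above; treating 1 TB as 1000 GB is an assumption of the sketch.

```python
# Totals taken from the estimate above.
racks = 899
machines = 79_112
cpus = 158_224
ghz = 316_448
ram_gb = 158_224
disk_tb = 6_180

per_unit = {
    "machines_per_rack": machines / racks,
    "cpus_per_machine": cpus / machines,
    "ghz_per_cpu": ghz / cpus,
    "ram_gb_per_machine": ram_gb / machines,
    # Assumes decimal units: 1 TB = 1000 GB.
    "disk_gb_per_machine": disk_tb * 1000 / machines,
}

for name, value in per_unit.items():
    print(f"{name}: {value:.1f}")
```

In other words, the estimate describes 88 machines per rack, each with two 2 GHz CPUs, 2 GB of RAM, and roughly 78 GB of disk, which fits the description of low-cost commodity hardware.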

11
• The heart of Google is Page Rank™, a system for ranking web pages.
• Page Rank, named after Larry Page, who came up with it, is one of the ways in which Google determines the importance of a page, which in turn decides where the page will turn up in the results list. The exact Page Rank algorithm is as follows:
• We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also, C(A) is defined as the number of links going out of page A. The Page Rank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
• Note that the Page Ranks form a probability distribution over web pages, so the sum of all web pages' Page Ranks will be one.
• It works on the probability that a web page will be randomly accessed by a web surfer. It also takes into consideration the pages that link to the page. Using Yahoo as an example, the justification is that if a page like Yahoo were to link directly to another page, it is very likely that the page is of high quality.
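A minimal iterative sketch of the formula, assuming a hypothetical three-page link graph and a fixed number of iterations (the function name `pagerank` and the graph are illustrative, not Google's implementation). One caveat worth stating: with the formula exactly as written, and every page having at least one outgoing link, the values sum to the number of pages rather than to one; dividing the (1-d) term by the number of pages yields a distribution that sums to one.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[t] / len(links[t]) for t in pages if page in links[t]
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Here page C ends up ranked highest because both other pages link to it, and A comes second because it receives all of C's rank.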

12
• The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine.
• A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.
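One way to picture this feedback loop is to replay saved judgments against an old and a new setting of the weights. Everything in the sketch below is hypothetical: the saved feedback, the per-document features, the two weights standing in for type-weights, and the choice of precision over the top two results as the measure.

```python
# Hypothetical saved feedback: query -> docIDs a trusted user marked as good.
SAVED_FEEDBACK = {"page rank": {2, 3}, "web crawler": {1}}

# Hypothetical features per query: docID -> (title hits, body hits).
FEATURES = {
    "page rank": {1: (0, 3), 2: (1, 1), 3: (1, 0)},
    "web crawler": {1: (1, 2), 2: (0, 4), 3: (0, 0)},
}

def rank(query, title_weight, body_weight):
    """Order docIDs by a weighted sum of hit counts, highest score first."""
    scores = {
        doc: title_weight * title + body_weight * body
        for doc, (title, body) in FEATURES[query].items()
    }
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

def precision_at_k(title_weight, body_weight, k=2):
    """Average share of the top-k results that the saved feedback marked good."""
    total = 0.0
    for query, good in SAVED_FEEDBACK.items():
        top = rank(query, title_weight, body_weight)[:k]
        total += len(good.intersection(top)) / k
    return total / len(SAVED_FEEDBACK)

before = precision_at_k(title_weight=1.0, body_weight=1.0)
after = precision_at_k(title_weight=5.0, body_weight=1.0)
```

Comparing `before` and `after` on the same saved searches gives a rough signal of whether a weight change helped, without asking users to judge the results again.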

13
• The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. The results fall under three categories:
• Storage Requirements
• System Performance
• Search Performance

14
• Storage Requirements: Aside from search quality, Google is designed to scale cost-effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently.
• System Performance: It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting.
• Search Performance: Improving the performance of search was not the major focus of the research up to this point. The current version of Google answers most queries in between 1 and 10 seconds.

15 In addition to providing easy access to billions of web pages, Google has many special features to help you to find exactly what you're looking for. Click the title of a specific feature to learn more about it.
• Book Search: Use Google to search the full text of books.
• Cached Links: View a snapshot of each page as it looked when we indexed it.
• Calculator: Use Google to evaluate mathematical expressions.
• Currency Conversion: Easily perform any currency conversion.
• Definitions: Use Google to get glossary definitions gathered from various online sources.
• File Types: Search for non-HTML file formats including PDF documents and others.
• Froogle: To find a product for sale online, use Froogle, Google's product search service.
• Groups: See relevant postings from Google Groups in your regular web search results.
• I'm Feeling Lucky: Bypass our results and go to the first web page returned for your query.
• Images: See relevant images in your regular web search results.
• Local Search: Search for local businesses and services in the U.S., the U.K., and Canada.
• Movies: Use Google to find reviews and showtimes for movies playing near you.

16
• Music Search: Use Google to get quick access to a wide range of music information.
• News Headlines: Enhances your search results with the latest related news stories.
• PhoneBook: Look up U.S. street address and phone number information.
• Q&A: Use Google to get quick answers to straightforward questions.
• Refine Your Search (new): Add instant info and topic-specific links to your search in order to focus and improve your results.
• Results Prefetching: Makes searching in Firefox faster.
• Search By Number: Use Google to access package tracking information, US patents, and a variety of online databases.
• Similar Pages: Display pages that are related to a particular result.
• Site Search: Restrict your search to a specific site.
• Spell Checker: Offers alternative spelling for queries.
• Street Maps: Use Google to find U.S. street maps.
• Travel Information: Check the status of an airline flight in the U.S. or view airport delays and weather conditions.
• Weather: Check the current weather conditions and forecast for any location in the U.S.
• Web Page Translation: Provides you access to web pages in other languages.
• Who Links To You?: Find pages that point to a specific URL.

17
• Google is designed to be a scalable search engine. The primary goal is to provide high-quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including page rank, anchor text, and proximity information.
• Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.


