How Does Google Work? The Technology behind Google's Great Results

Presentation transcript:

1 How Does Google Work? The Technology behind Google's Great Results Emre Altug Yavuz Ph.D. candidate Data Communications Lab. Electrical & Computer Engineering University of British Columbia (UBC) Vancouver, BC, CANADA 2004 © Emre A. Yavuz. EECE, UBC

2 What is Google? A fully automated search engine that employs robots known as “spiders” to crawl the web frequently and find sites for inclusion in the Google database, or index.

3 Some Google Factoids Named for the mathematical term “googol”, or 10^100, the number represented by the numeral 1 followed by 100 zeros. Global unique users per month: 81.9 million. Selected by Yahoo (2000) and AOL (2002) as search engine partner. Indexes the largest number of Internet-accessible documents. Designed to scale well to extremely large data sets. Uses storage space efficiently to store the index. Optimized data structures for fast and efficient access.

4 Who invented it, when and why? In the early 1990s, search engines started springing out of academic projects. Low-quality results and poorly designed search engines set the stage for the birth of Google. It was designed and created by Sergey Brin and Larry Page. On September 7, 1998, Google Inc. opened its doors in a garage in Menlo Park, California.

5 How does Google Work? When you perform a Google search, you are not actually searching the web, but rather an index of the copy of the web stored on Google’s servers. The index is compiled from all the pages returned by a multitude of spiders, called Googlebot, that crawl the web. When a user types in a query, the search terms are looked up in the index and the results are then returned from a separate set of document servers, along with advertisements. All of these pieces are assembled, with the help of Google's PageRank technology, into the page of search results.
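The idea of searching an index rather than the web itself can be sketched in a few lines. This is only an illustration: the function names `build_index` and `lookup`, the sample pages, and the rule that every query term must match are assumptions, not a description of Google's actual index.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it.

    `pages` is a dict of URL -> page text, standing in for the crawled copy
    of the web (hypothetical data).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, query):
    """Return the URLs that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


pages = {
    "http://example.com/a": "Google search engine",
    "http://example.com/b": "search tips",
}
index = build_index(pages)
print(lookup(index, "search engine"))
```

The query never touches the pages themselves; it only reads the precomputed index, which is why lookups stay fast as the collection grows.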

6 What is PageRank? A method of measuring a page’s “importance”. It applies the idea of academic citation analysis to the web. It extends the idea of counting citations, or backlinks, to a given page by not counting links from all pages equally and by normalizing by the number of links on a page. Assuming pages t1 to tn point to page A, the PageRank of page A is given by: PR(A) = (1 - d) + d (PR(t1)/C(t1) + … + PR(tn)/C(tn)), where C(t) is the number of links going out of page t and d is a damping factor between 0 and 1.
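The formula can be evaluated by repeated substitution. The sketch below is a minimal illustration under stated assumptions: the function name `pagerank`, the toy link graph, the fixed iteration count, and the starting value of 1.0 per page are all illustrative choices. The damping factor d = 0.85 is the value the Brin and Page paper says is usually used. Every page in the toy graph has at least one outgoing link, so the division by C(t) is always defined.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to A.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_count = {p: len(links.get(p, [])) for p in pages}

    # Build the backlink lists: for every page, which pages point to it.
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    pr = {p: 1.0 for p in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            p: (1 - d) + d * sum(pr[t] / out_count[t] for t in inbound[p])
            for p in pages
        }
    return pr


# Toy graph (hypothetical): A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page, score in sorted(ranks.items()):
    print(page, round(score, 3))
```

Page C ends up ahead of page B because it has two backlinks, and a link from B (which has one outgoing link) passes on more than a link from A (which splits its vote across two).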

7 How to tell what the PageRank of a page is Download the Google Toolbar. Once installed, there will be a bar graph at the top of the browser showing a version of PageRank for the page being browsed. Hold the mouse over the bar to see a number from 0 to 10. The number gives only a rough idea and is not very accurate; if the page is not in the index but a closely related page is, the toolbar sometimes guesses. It is just a representation of the actual PageRank: whilst PageRank is linear, Google uses a non-linear graph to portray it.

8 How significant is PageRank? The significance of any factor in search engine algorithms depends on the quality of the information it supplies. A factor’s importance is known as its weight. Originally, when the Meta keyword tag was new, it could be used as an indicator of what the page was about. However, its weight quickly approached zero because webmasters abused it so heavily. PageRank is harder to manipulate, but manipulating it is not impossible.

9 Is PageRank enough to determine the quality of a page? (1) “People only link to pages they think are good.” However, there may be other reasons, such as:
• Reciprocal links: “Link to me and I’ll link to you.”
• Link requirements: “Using our script requires you to put a link to our website.” or “We’ll give you an award in return for a link to our website.”
• Friends and family: “This is my friend Pete’s site.”
• Free page add-ons: “This counter is provided by …”

10 Is PageRank enough to determine the quality of a page? (2) If a webmaster picks outbound links by searching on Google, then PageRank itself influences the number of links to a page, in a circular way. The links are then no longer based solely on human judgement, and a page gains links not solely because it is a good page but because its PageRank is already high. Therefore, PageRank alone is not enough to produce high-precision results.

11 Other System Features Title tag: the most important factor, since most engines and directories place a high level of importance on it. Proximity of search terms: how often do they appear? How close together are they? Text characteristics: font size and type; search terms in a larger or bolder font are weighted higher than others. Anchor text: anchors often provide more accurate descriptions of web pages than the pages themselves. They may also exist for documents that cannot be indexed by a text-based search engine, such as images, programs and databases.

12 The difference between PageRank and other factors
Title tag: Can only be listed once.
Keywords in body text: Each successive repetition is less important. Proximity is important.
Anchor text: Highly weighted, but like keywords in body text, there is a cutoff point where further anchor text is no longer worthwhile.
PageRank: Potentially infinite. You are always capable of increasing your PageRank significantly, but it takes work.

13 How does Google rank pages? Find all pages matching the keywords of the search. Rank them using “on the page” factors, such as whether the keywords are bolded or set in a relatively larger font. Score the anchor text of inbound links. Adjust the results by PageRank scores.
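The four steps above can be sketched as a small scoring function. Everything specific here is an assumption made for illustration: the function name `rank_pages`, the page record layout (`body`, `bold`, `anchors`), the weights, and the choice to "adjust" by multiplying with the PageRank score. The slides do not say how Google actually combines these factors.

```python
def rank_pages(query, pages, pagerank, bold_weight=2.0, anchor_weight=3.0):
    """Score pages in four steps: match, on-page factors, anchor text, PageRank.

    `pages` maps URL -> {"body": str, "bold": str, "anchors": [str]}.
    `pagerank` maps URL -> PageRank score. All weights are hypothetical.
    """
    terms = query.lower().split()
    scored = []
    for url, page in pages.items():
        body = page["body"].lower().split()
        # Step 1: keep only pages that contain every query term.
        if not terms or not all(term in body for term in terms):
            continue
        # Step 2: on-page factors; bolded occurrences count extra.
        bold = page.get("bold", "").lower().split()
        on_page = sum(body.count(t) + bold_weight * bold.count(t) for t in terms)
        # Step 3: query terms found in the anchor text of inbound links.
        anchors = " ".join(page.get("anchors", [])).lower().split()
        anchor_score = anchor_weight * sum(anchors.count(t) for t in terms)
        # Step 4: adjust by PageRank (multiplication is an assumption).
        scored.append((url, (on_page + anchor_score) * pagerank.get(url, 1.0)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


pages = {
    "a": {"body": "google search engine", "bold": "google", "anchors": ["search engine"]},
    "b": {"body": "google search tips", "bold": "", "anchors": []},
    "c": {"body": "cooking recipes", "bold": "", "anchors": []},
}
print(rank_pages("google search", pages, {"a": 1.0, "b": 1.0, "c": 5.0}))
```

Page "c" never appears in the output despite its high PageRank, because it fails the keyword match in the first step.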

14 System Anatomy (1) Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux. The URLserver sends lists of URLs to be fetched to the crawlers. The fetched web pages are sent to the storeserver, which compresses them and stores them in a repository. Every web page has an associated ID number called a docID. The indexer reads the repository, uncompresses the documents and parses them into a set of word occurrences called hits.

15 High Level Google Architecture

16 System Anatomy (2) The hits record the word, its position, font size and capitalization. The indexer distributes these hits into a set of barrels. It also parses out all the links in every web page and stores important information about them in an anchors file. The URLresolver reads the anchors file and converts relative URLs into absolute URLs and then into docIDs. The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to produce the inverted index. It also produces a list of wordIDs and offsets into the inverted index.
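Two of these steps lend themselves to a short sketch: resolving a relative link against the page it appeared on, and re-sorting hits from docID order into wordID order. The function names `resolve_link` and `invert_barrel`, the tuple layout of a hit, and the sample data are assumptions for illustration; the real barrels use compact binary encodings that this does not attempt to reproduce. `urljoin` is the standard-library function for resolving relative URLs.

```python
from collections import defaultdict
from urllib.parse import urljoin


def resolve_link(page_url, href):
    """Turn a relative link found on `page_url` into an absolute URL."""
    return urljoin(page_url, href)


def invert_barrel(forward_hits):
    """Re-sort hits from docID order into wordID order.

    `forward_hits` is a list of (doc_id, word_id, position) tuples, a
    simplified stand-in for a forward barrel. The result maps each word_id
    to its list of (doc_id, position) postings.
    """
    inverted = defaultdict(list)
    for doc_id, word_id, position in sorted(
        forward_hits, key=lambda hit: (hit[1], hit[0], hit[2])
    ):
        inverted[word_id].append((doc_id, position))
    return dict(inverted)


print(resolve_link("http://example.com/a/b.html", "../c.html"))

hits = [(1, 7, 0), (1, 3, 1), (2, 7, 0)]  # sorted by docID, as the indexer emits them
print(invert_barrel(hits))
```

After inversion, answering "which documents contain word 7" is a single dictionary lookup instead of a scan over every document.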

17 System Anatomy (3) A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon. The searcher is run by a webserver and uses the lexicon together with the inverted index and the PageRank to answer queries.

18 How does Google make money? Initially, it sold targeted banner advertisements and provided search services to other websites, including Yahoo. Later, it launched AdWords, a system for automatically selling and displaying advertisements alongside search results. The ads are also ranked according to their popularity. Building on the base created by AdWords, it launched AdSense, a context-targeted advertising system. Google's “next generation corporate software”, query and document update software, was released on June 2, 2004.

19 How do you maximize your place on Google? (1) Make sure that all your pages are indexed in the first place. Pay a great deal of attention to your web page titles. Have keywords well represented in the body of the web page. Add content to your pages and to your website; Google likes sites with lots of content. Use keywords as hyperlink names.

20 How do you maximize your place on Google? (2) Have a good system of navigation between your web pages; PageRank gets passed among the internal links of a website. Get external links to as many pages on your site as you can. If you have good site navigation, each external link adds to the PageRank not only of the page that is linked, but also of every web page on your site. Do not submit a redirection web page; most search engines will skip your website completely in that case. Try to avoid using frames in your website.

21 References “The Anatomy of a Large Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page. “PageRank Uncovered”, Chris Ridings and Mike Shishigin. “Google! Everything you always wanted to know, but didn’t have time to find out”, Judy Broom, Betsy Chessler and Katherine Foster. And not surprisingly

22 Thanks. Questions?

23 Some Features of Google (1) daterange: limits your search to a particular date or range of dates on which a page was indexed by Google. It only works with Julian dates, so you’ll need to find a Julian date converter online. The Julian date must be an integer (no decimals). Usage: daterange:start-stop, e.g. stjohns daterange:
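Instead of an online converter, the integer Julian day number can be computed directly. The sketch below assumes that the integer Julian day number (the Julian date at noon) is what the operator expects; the function names `julian_day` and `daterange_query` and the sample dates are illustrative. The constant 1721425 is the offset between Python's proleptic Gregorian ordinal (day 1 is 0001-01-01) and the Julian day number.

```python
from datetime import date

# Offset between date.toordinal() and the Julian day number.
ORDINAL_TO_JULIAN_DAY = 1721425


def julian_day(day):
    """Return the integer Julian day number for a calendar date."""
    return day.toordinal() + ORDINAL_TO_JULIAN_DAY


def daterange_query(terms, start, stop):
    """Build a query string of the form 'terms daterange:start-stop'."""
    return f"{terms} daterange:{julian_day(start)}-{julian_day(stop)}"


print(julian_day(date(2000, 1, 1)))  # 2451545
print(daterange_query("stjohns", date(2004, 6, 1), date(2004, 6, 30)))
```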

24 Some Features of Google (2) filetype: restricts your results to files ending in ".doc" (or .xls, .ppt, etc.), and shows you only files created with the corresponding program. The “dot” in the file extension (.doc) is optional. Usage: filetype:extension, e.g. stjohns -filetype:pdf

25 Some Features of Google (3) inanchor: restricts the results to text in a page’s link anchors. Usage: inanchor:terms, e.g. stjohns -inanchor:"ubc". intext: ignores link text, URLs, and titles, and searches only body text, which helps you find query words that are too common in URLs and links. Usage: intext:terms, e.g. stjohns -intext:"ubc.ca"

26 Some Features of Google (4) intitle: restricts the results to documents containing a particular word in their titles. inurl: restricts the results to documents containing a particular word in their URLs. site: restricts the results to websites in a given domain. cache: shows the version of a web page that Google has in its cache.

27 Some Features of Google (5) link: restricts the results to those web pages that have links to the specified URL. related: lists web pages that are "similar" to a specified web page. info: presents some information that Google has about a particular web page.
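The operators on slides 24 to 27 all share the shape operator:value, so a query string can be assembled mechanically. This is a hypothetical helper: the name `build_query`, the quoting of multi-word values, and the sample values are illustrative assumptions, and the sketch only builds the text you would type into the search box.

```python
def build_query(terms, **operators):
    """Join plain search terms with operator:value pairs.

    Operator names are passed as keyword arguments, e.g. site="ubc.ca".
    Values containing spaces are wrapped in double quotes.
    """
    parts = [terms] if terms else []
    for name, value in operators.items():
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)


print(build_query("pagerank", site="ubc.ca", intitle="google"))
print(build_query("", related="www.ubc.ca"))
```

Keyword arguments keep their order in Python 3.7 and later, so the operators appear in the query in the order they were passed.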

28 Some Features of Google (6) There are actually three different Google phonebook operators. Using phonebook: searches the entire Google phonebook. Using rphonebook: searches residential listings only. Using bphonebook: searches business listings only.

29 Some Features of Google (7) If you begin a query with stocks:, Google will treat the rest of the query terms as stock ticker symbols and will link to a Yahoo Finance page showing stock information for those symbols. If you begin a query with define:, Google will display definitions for the word or phrase that follows, if definitions are available.