The Anatomy of a Large-Scale Hypertextual Web Search Engine

Slides:



Advertisements
Similar presentations
The Inside Story Christine Reilly CSCI 6175 September 27, 2011.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
GOOGLE SEARCH ENGINE Presented By Richa Manchanda.
1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
By Sergey Brin and Lawrence PageSergey BrinLawrence Page developers of Google (1997) The Anatomy of a Large-Scale Hypertextual Web Search Engine.
Natural Language Processing WEB SEARCH ENGINES August, 2002.
The Search Engine Architecture CSCI 572: Information Retrieval and Search Engines Summer 2010.
Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
Presented by: Vanshika Sharma
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Architecture of the 1st Google Search Engine SEARCHER URL SERVER CRAWLERS STORE SERVER REPOSITORY INDEXER D UMP L EXICON SORTERS ANCHORS URL RESOLVER (CF.
Presentation of Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page (1997) Presenter: Scott White.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
Web Search – Summer Term 2006 III. Web Search - Introduction (Cont.) - Jeff Dean, Google's Systems Lab:
© nCode 2000 Title of Presentation goes here - go to Master Slide to edit - Slide 1 Anatomy of a Large-Scale Hypertextual Web Search Engine ECE 7995: Term.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page Distributed Systems - Presentation 6/3/2002 Nancy Alexopoulou.
Google and Scalable Query Services
1 The anatomy of a Large Scale Search Engine Sergey Brin,Lawrence Page Dept. CS of Stanford University.
PRESENTED BY ASHISH CHAWLA AND VINIT ASHER The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page and Sergey Brin, Stanford University.
Presented By: - Chandrika B N
The Anatomy of a Large- Scale Hypertextual Web Search Engine Sergey Brin, Lawrence Page CS Department Stanford University Presented by Md. Abdus Salam.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
The Anatomy of a Large-Scale Hypertextual Web Search Engine By Sergey Brin and Lawrence Page Presented by Joshua Haley Zeyad Zainal Michael Lopez Michael.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Anatomy of a search engine Design criteria of a search engine Architecture Data structures.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Search Xin Liu. 2 Searching the Web for Information How a Search Engine Works –Basic parts: 1.Crawler: Visits sites on the Internet, discovering Web pages.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
Gregor Gisler-Merz How to hit in google The anatomy of a modern web search engine.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Kevin Mauricio Apaza Huaranca San Pablo Catholic University.
Web Search Algorithms By Matt Richard and Kyle Krueger.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
Intelligent Web Topics Search Using Early Detection and Data Analysis by Yixin Yang Presented by Yixin Yang (Advisor Dr. C.C. Lee) Presented by Yixin Yang.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Search Xin Liu.
Our MP3 Search Engine Crawler –Searching for Artist Name –Searching for Song Title Website Difficulties Looking Back.
The anatomy of a Large-Scale Hypertextual Web Search Engine.
The Nuts & Bolts of Hypertext retrieval Crawling; Indexing; Retrieval.
1 Google: Case Study cs430 lecture 15 03/13/01 Kamen Yotov.
1 CS 430: Information Discovery Lecture 20 Web Search Engines.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
The Anatomy of a Large-Scale Hypertextual Web Search Engine (The creation of Google)
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
1 Web Search Engines. 2 Search Engine Characteristics  Unedited – anyone can enter content Quality issues; Spam  Varied information types Phone book,
The Anatomy of a Large-Scale Hyper-textual Web Search Engine 전자전기컴퓨터공학과 G 김영제 Database Lab.
Presented By: Carlton Northern and Jeffrey Shipman The Anatomy of a Large-Scale Hyper-Textural Web Search Engine By Lawrence Page and Sergey Brin (1998)
1 Efficient Crawling Through URL Ordering Junghoo Cho Hector Garcia-Molina Lawrence Page Stanford InfoLab.
Chapter 2: How Search Engines Work. Chapter Objectives Describe the PageRank formula for calculating a webpage’s popularity. Determine how a search engine.
Information Retrieval in Practice
Lecture #11 PageRank (II)
The Anatomy Of A Large Scale Search Engine
Google and Scalable Query Services
Search Search Engines Search Engine Optimization Search Interfaces
Thanks to Ray Mooney & Scott White
Instructor: P.Krishna Reddy
Anatomy of a search engine
Data Mining Chapter 6 Search Engines
Sergey Brin, lawrence Page, The anatomy of a large scale hypertextual web search Engine Rogier Brussee ICI
Web Search Engines.
The Search Engine Architecture
Presentation transcript:

The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin, Lawrence Page CS Department Stanford University Presented by Li An, CIS, TU

Introduction

Introduction

Introduction Google’s Mission To organize the world’s information and make it universally accessible and useful Scaling with the web Improved Search Quality Academic Search Engine Research

System Features It makes use of the link structure of the Web to calculate a quality ranking for each web page, called PageRank PageRank is a trademark of Google. The PageRank process has been patented. Google utilizes link to improve search results

PageRank PageRank is a link analysis algorithm which assigns a numerical weighting to each Web page, with the purpose of "measuring" relative importance. Based on the hyperlinks map An excellent way to prioritize the results of web keyword searches

Simplified PageRank algorithm Assume four web pages: A, B,C and D. Let each page would begin with an estimated PageRank of 0.25. L(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: C A D B C A D B

PageRank algorithm including damping factor Assume page A has pages B, C, D ..., which point to it. The parameter d is a damping factor which can be set between 0 and 1. Usually set d to 0.85. The PageRank of a page A is given as follows:

Intuitive Justification A "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back“, but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. The d damping factor is the probability at each page the "random surfer" will get bored and request another random page. A page can have a high PageRank If there are many pages that point to it Or if there are some pages that point to it, and have a high PageRank.

Anchor Text Besides the text of a hyperlink (anchor text) is <A href="http://www.yahoo.com/">Yahoo!</A> Besides the text of a hyperlink (anchor text) is associated with the page that the link is on, it is also associated with the page the link points to. anchors often provide more accurate descriptions of web pages than the pages themselves. anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.

Other Features It has location information for all hits. Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Full raw HTML of pages is available in a repository

Architecture Overview

Major Data Structures BigFiles Repository Document Index Lexicon virtual files spanning multiple file systems and are addressable by 64 bit integers. Repository contains the full HTML of every web page. Document Index keeps information about each document. Lexicon two parts – a list of the words and a hash table of pointers. Hit Lists a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Forward Index stored in a number of barrels Inverted Index consists of the same barrels as the forward index, except that they have been processed by the sorter.

Crawling the Web Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers. Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. Each crawler maintains a its own DNS cache so it does not need to do a DNS lookup before crawling each document.

Indexing the Web Parsing Indexing Documents into Barrels Sorting Any parser which is designed to run on the entire Web must handle a huge array of possible errors. Indexing Documents into Barrels After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. Once the words are converted into wordID's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. Sorting the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel.

Searching

Results and Performance The current version of Google answers most queries in between 1 and 10 seconds. The table shows some samples search time from the current version of Google. They are repeated to show the speedups resulting from cached IO.

Conclusion Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information. Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Google bomb Because of the PageRank, a page will be ranked higher if the sites that link to that page use consistent anchor text. A Google bomb is created if a large number of sites link to the page in this manner. search term "more evil than Satan himself"  the Microsoft homepage as the top result.

Problems High Quality Search Scalable Architecture The biggest problem facing users of web search engines today is the quality of the results they get back. Scalable Architecture Google is designed to scale. It must be efficient in both space and time

The Future “The ultimate search engine would understand exactly what you mean and give back exactly what you want.” - Larry Page

Thanks!