“The Anatomy of a Large-Scale Hypertextual Web Search Engine” Presented by Ahmed Khaled Al-Shantout, ICS 542 - 072.



Outline
– Paper Objective
– Introduction & History
– Related Work
– Design Goals
– Google Search Engine Features
– Google Architecture
– Results & Performance
– Conclusion
– References

Paper Objective
– To describe the anatomy of a large-scale web search engine: Google

Introduction & History
– Why the name Google? From googol = 10^100, i.e. very large scale.
– Web search is different from traditional information retrieval (IR):
  – The Web is vast and is growing exponentially.
  – The Web is heterogeneous: images, HTML, files, etc.
  – IR on small, well-controlled, homogeneous collections is much easier.
– Human-maintained lists can't keep up:
  – Yahoo! is a human-maintained list, and so it is subjective, slow to improve, and expensive to build and maintain.
– Google makes use of hypertext information to get better results.

Related Work
– WWWW (World Wide Web Worm), 1994: one of the first web search engines. It indexed about 110,000 web pages and handled about 1,500 queries/day.
– In 1997, the top search engines claimed to index from 2 million to 100 million web documents, and AltaVista claimed to handle about 20 million queries/day.
– By 2000, a comprehensive index was expected to contain more than a billion documents, with hundreds of millions of queries/day.
– Problems with current search engines:
  – Human-maintained ones are subjective
  – Automated ones return low-quality results
  – Advertisers can mislead automated search engines

Solution = Google
– Problem statement: the engine must handle all of the following very efficiently:
  – Storage requirements
  – Efficient processing by the indexing system
  – A huge number of queries per second
  – High-quality results

Design Goals
– Deliver results that have very high precision, even at the expense of recall
  – Using hypertextual information such as font size, links, titles, and anchor text can improve search quality
  – In 1997, only one of the top four commercial search engines returned itself in the top ten results when searched for by name!
– Make search engine technology transparent, i.e. advertising shouldn't bias results
– Bring search engine technology into the academic environment in order to support novel research activities on large web data sets
– Make the system easy to use for most people, e.g. users shouldn't have to specify more than a couple of words

Google Search Engine Features
– Two main features increase result precision:
  – Uses the link structure of the web (PageRank)
  – Uses the text of hyperlinks (anchor text) to retrieve documents more accurately
– Other features include:
  – Takes into account word proximity in documents
  – Uses font size, word position, etc. to weight words
  – Stores the full raw HTML of pages

PageRank
– What does it mean for a web page to have a high rank?
  – Many pages point to it, so it is an important one
  – Some important pages, such as Yahoo!, point to it
– PR(A) = (1 - d) + d [PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn)]
  – d is the damping factor, a value between 0 and 1 (the paper usually sets it to 0.85). It also makes it harder to mislead the system into giving a page a higher ranking.
  – T1 … Tn are the pages that point to page A.
  – C(T1) is the number of links going out of page T1.
– Not all links are treated the same: a link from a page with a high rank and few outgoing links counts for more.
– PageRank is calculated using a simple iterative algorithm.
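The iterative calculation can be sketched in a few lines of Python. This is a minimal illustration of the formula on the slide, not the authors' implementation: the function name, the dictionary-based graph, the starting value of 1.0, and the fixed iteration count are all assumptions made for the example.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively evaluate PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # Start every page at rank 1.0 (an arbitrary starting point).
    ranks = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_ranks = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without out-links passes on no rank here
            # Each out-link of `source` carries an equal share of its rank.
            share = d * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


if __name__ == "__main__":
    # Toy graph: A links to B and C, B links to C, C links to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(graph).items()):
        print(page, round(rank, 3))
```

In the toy graph, C ends up ranked highest because it receives links from both A and B, while B receives only half of A's rank.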

Anchor Text
– The text of the link, e.g. "Chapter1" in a link that points to a file for Chapter1.
– Objective: to return non-textual objects such as files, databases, images, etc., which cannot be indexed by a text-based search engine.
– It also makes it possible to return pages that have not been crawled.
– Example:
  Target Page | Link Text | Page
  PDF file www … | Chapter1 | Dr. Wasfi's Page
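The idea of crediting link text to the page the link points to can be sketched as follows. This is an illustrative assumption about how such an index might look, not Google's actual data structure; the function name and the URLs are hypothetical.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Map each word of a link's text to the URL the link points to.

    links is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(set)
    for _source, target, text in links:
        for word in text.lower().split():
            # The word is credited to the target, not to the linking page.
            index[word].add(target)
    return index


if __name__ == "__main__":
    # Hypothetical URLs; the PDF itself is never fetched or parsed.
    links = [("http://example.org/home", "http://example.org/ch1.pdf", "Chapter1")]
    print(index_anchor_text(links)["chapter1"])
```

A query for "chapter1" can then return the PDF even though its contents were never indexed.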

Google Architecture
– Most of Google was built using C and C++ for efficiency.
– It runs on Solaris and Linux.

[Figure: Google architecture diagram. Components: URL Server, Crawler, Store Server, Repository, Indexer, Anchors, URL Resolver, Links, Doc Index, Barrels, Lexicon, Sorter, PageRank, Searcher]

Crawling the Web
– Fetches URLs and gathers web pages into the store.
– It is a challenging task:
  – Needs to be fast to keep information up to date
  – Has to interact with the outside world: web servers, name servers, etc.
– To scale well → distributed crawling
– Each crawler keeps roughly 300 connections open at once
– Over 100 web pages per second at peak speed, using 4 crawlers
– Each crawler keeps a DNS cache to improve performance
– A connection can be in one of these states: looking up DNS, connecting to host, sending request, receiving response
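The four connection states and the DNS cache can be sketched without any real networking. Everything named here (ConnState, advance, DnsCache, the injected resolver, the extra DONE state) is a hypothetical illustration, not the crawler's actual code.

```python
from enum import Enum, auto


class ConnState(Enum):
    DNS_LOOKUP = auto()
    CONNECTING = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()
    DONE = auto()  # added for the sketch so the sequence can terminate


# Each connection moves through the states in this fixed order.
_NEXT = {
    ConnState.DNS_LOOKUP: ConnState.CONNECTING,
    ConnState.CONNECTING: ConnState.SENDING_REQUEST,
    ConnState.SENDING_REQUEST: ConnState.RECEIVING_RESPONSE,
    ConnState.RECEIVING_RESPONSE: ConnState.DONE,
}


def advance(state):
    """Return the state that follows `state`; DONE stays DONE."""
    return _NEXT.get(state, ConnState.DONE)


class DnsCache:
    """Remember resolved hosts so repeated URLs skip the name server."""

    def __init__(self, resolver):
        self._resolver = resolver  # any callable: host -> address
        self._cache = {}
        self.misses = 0

    def lookup(self, host):
        if host not in self._cache:
            self.misses += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


if __name__ == "__main__":
    cache = DnsCache(lambda host: "192.0.2.1")  # stand-in resolver
    cache.lookup("example.org")
    cache.lookup("example.org")
    print(cache.misses)
```

A crawler holding hundreds of connections would keep one such state per connection and advance whichever ones have I/O ready.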

Major Data Structures
– Repository: contains the full HTML of every page, compressed using zlib.
– Document Index: keeps information about each document, ordered by docID.
– Hit Lists: a hit list is the list of occurrences of a particular word in a particular document, including position, font, and capitalization information.
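As an illustration of how compact a hit can be, the paper describes a plain hit as two bytes: a capitalization bit, three bits of relative font size, and twelve bits of word position. The sketch below packs those fields into 16 bits. The bit order, the function names, and the choice to clamp large positions to 4095 are assumptions made for the example.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: 1 bit caps, 3 bits font, 12 bits position."""
    if not 0 <= font_size <= 6:
        # Seven font values are usable; 7 (binary 111) is reserved in the paper.
        raise ValueError("font_size must be in 0..6")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumption: clamp to the 12-bit maximum
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Return (capitalized, font_size, position) from a packed hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


if __name__ == "__main__":
    hit = pack_hit(True, 3, 42)
    print(hex(hit), unpack_hit(hit))
```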

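Forward Index
– Stored in the barrels; each barrel holds a range of wordIDs.
– If a document contains words that fall into a barrel, its docID is recorded in that barrel, followed by those wordIDs and their hit lists.
– Record layout from the slide's figure: docID, then (wordID = 1, # hits = 2, hit list), (wordID = 2, # hits = 3, hit list)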

Inverted Index
– Same barrels as the forward index, except that they have been processed by the sorter.
– Layout from the slide's figure: each Lexicon entry (wordID, # docs) points to a list of (docID, # hits, hit list) entries.
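The relationship between the two indexes can be sketched as a re-sort. The code below is a simplified illustration that keeps everything in memory and ignores barrels; the function names and the tuple layout are assumptions, not the system's on-disk format.

```python
from collections import defaultdict


def build_forward_index(documents):
    """documents maps doc_id -> list of word_ids in document order.

    Returns a list of (doc_id, word_id, hit_positions) records.
    """
    forward = []
    for doc_id, word_ids in documents.items():
        positions = defaultdict(list)
        for position, word_id in enumerate(word_ids):
            positions[word_id].append(position)
        for word_id, hits in positions.items():
            forward.append((doc_id, word_id, hits))
    return forward


def invert(forward):
    """Re-sort forward records by wordID and group them into doclists."""
    inverted = defaultdict(list)
    for doc_id, word_id, hits in sorted(forward, key=lambda r: (r[1], r[0])):
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


if __name__ == "__main__":
    docs = {1: [10, 20, 10], 2: [20]}
    print(invert(build_forward_index(docs)))
```

Sorting by wordID is what turns "which words does this document contain" into "which documents contain this word", the lookup a query needs.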

Results & Performance
– Quality of the results is the most important metric for a search engine
– The authors claim that Google outperforms the major commercial search engines
– Example

Storage Requirements
– Scale well
– Utilize storage efficiently
– Use compression

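System Performance
– Major operations are crawling, indexing, and sorting
– Crawling: roughly 9 days to download 26 million pages. Once the system was running smoothly, the last 11 million pages took 63 hours, an average of 48.5 pages/second
– Indexer: roughly 54 pages/second
– The whole sorting operation takes about 24 hours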
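The crawl rates are easy to check with a little arithmetic. The 11 million pages in 63 hours figure comes from the paper rather than the slide; the function name is just for this example.

```python
def pages_per_second(pages, hours):
    """Average download rate over a crawl of the given length."""
    return pages / (hours * 3600)


# Whole crawl: 26 million pages in roughly 9 days.
overall = pages_per_second(26_000_000, 9 * 24)
# Steady state: the last 11 million pages in 63 hours.
steady = pages_per_second(11_000_000, 63)

if __name__ == "__main__":
    print(round(overall, 1), round(steady, 1))
```

The 48.5 pages/second figure therefore describes the steady-state part of the crawl; averaged over all 9 days the rate is closer to 33 pages/second.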

Search Performance
– Was not the most important issue in the design at that time
– Response time for most queries was between 1 and 10 seconds, dominated by disk I/O
– There were no optimizations yet, such as query caching or subindices on common terms


Conclusion & Future Work
– The main concern of Google's design is to be a scalable web search engine
– It also aims to provide high-quality results, using:
  – PageRank
  – Anchor text

Future Work
– Improve search efficiency:
  – Query caching
  – Smart disk allocation
  – Subindices
– Updates: handling old and new pages
– Add Boolean operators, negation, and stemming
– Relevance feedback and clustering
– User context
– Result summarization
– PageRank can be personalized by increasing the weight of a user's home page or bookmarks
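Query caching, the first efficiency item, can be sketched as a small least-recently-used cache in front of the search function. This is one possible design under stated assumptions, not something the paper specifies: the class name, the capacity, and the whitespace and case normalization of queries are all choices made for the example.

```python
from collections import OrderedDict


class QueryCache:
    """Keep the results of the most recently used queries in memory."""

    def __init__(self, search, capacity=1000):
        self._search = search      # any callable: query string -> results
        self._capacity = capacity
        self._entries = OrderedDict()

    def query(self, text):
        # Normalize so that "Web  Search" and "web search" share one entry.
        key = " ".join(text.lower().split())
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        results = self._search(key)
        self._entries[key] = results
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return results


if __name__ == "__main__":
    cache = QueryCache(lambda q: ["result for " + q], capacity=2)
    print(cache.query("Web  Search"))
    print(cache.query("web search"))  # served from the cache
```

Repeated queries then skip the disk reads that dominate response time, at the cost of possibly stale results until an entry is evicted.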

References
– S. Brin, L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7 / Computer Networks 30(1-7): (1998)

Q & A Thanks for your Attention