The Anatomy of a Large-Scale Hypertextual Web Search Engine, by S. Brin and L. Page. Presenter: Abhishek Taneja.


Why Google?
Why was Google introduced or required? Because there were problems with existing search engines:
Human-maintained lists and indices
-- subjective
-- expensive to build and maintain
-- slow to improve
-- cannot cover all esoteric topics
Automated search engines
-- rely on keyword matching
-- easy to mislead
June-2010 CAM-2

Facts about Google
Why is Google called Google?
-- It is a common spelling of googol, or 10^100, which fits the goal of building a very large-scale search engine.
This presentation is about the Google of 1997. Much of its design at that time was described openly, so we know a lot about the Google of 1997.
We do not know much about the Google of 2010, because most of its modules are proprietary.

Goals
Scalability
-- Number of pages indexed.
-- Number of queries handled.
Quality
-- Provide high-quality search results.
Eliminating junk results
-- Use link structure and anchor text for quality filtering.
Push more development and understanding into the academic realm.
Increase usability.
Set up a Spacelab-like environment where researchers, or even students, can propose and run interesting experiments on Google's large-scale web data.

Features
Google uses the link structure of the web to calculate a quality ranking for each web page, called PageRank.
The probability that a random surfer visits a page is its PageRank. It gives an approximation of the page's importance and quality.
PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
-- PR(A) is the PageRank of page A.
-- T1…Tn is the set of pages that link to page A.
-- PR(T1) is the PageRank of page T1, one of the pages pointing to page A.
-- C(T1) is the number of links going out of page T1.
-- PR(Tn)/C(Tn) means the same share is computed for each page pointing to page A.
-- d is a damping factor, nominally set to 0.85. At each page the random surfer follows a link with probability d, and with probability 1 - d gets bored and requests another random page.
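The formula above can be evaluated by simple iteration. The following is a minimal sketch, assuming a toy three-page web; the function name `pagerank`, the `toy_web` graph, the starting value of 1.0 per page, and the fixed iteration count are illustrative assumptions, not details from the paper.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # in this sketch, pages without out-links pass on no rank
            share = d * rank[source] / len(targets)  # d * PR(T) / C(T)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C collects rank from both A and B, so it ends up ranked highest, and B, which only receives half of A's share, ends up lowest.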

Features (cont.)
Anchor text
-- Google uses the text of each link (the anchor text) and associates it with the page the link points to. For example, anchors may exist for documents that cannot be indexed by a text-based search engine, such as images, programs and databases.
Google has location information for all hits, so it makes extensive use of proximity in search.
Google keeps track of some visual presentation details, such as the font size of words. Words in a larger or bolder font are weighted higher than other words.
The full raw HTML of pages is available in the repository.
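The anchor text idea can be illustrated with a small sketch: the words of a link are credited to the page the link points to, not to the page that contains the link. The function name `index_anchor_text`, the tuple layout, and the example URLs are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Map each target URL to the words of all anchors pointing at it.

    links is a list of (source_url, target_url, anchor_text) tuples (assumed layout).
    """
    anchor_index = defaultdict(list)
    for _source_url, target_url, anchor_text in links:
        anchor_index[target_url].extend(anchor_text.lower().split())
    return dict(anchor_index)


# An image has no text of its own, but it still becomes searchable
# through the anchors that point to it.
sample_links = [
    ("http://a.example/", "http://img.example/logo.png", "Stanford logo"),
    ("http://b.example/", "http://img.example/logo.png", "university logo image"),
]
anchor_words = index_anchor_text(sample_links)
```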

Google Architecture

Google Architecture (cont.)
URL Server: sends lists of URLs to be fetched to the crawlers.
Store Server: compresses and stores web pages in the repository.
Crawlers: multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and about 300 connections open at once.
Indexer: reads the repository, uncompresses the documents and parses them. It stores link information in the anchors file and builds hit lists.
Anchors file: the indexer parses out all links in a web page and stores important information about them in this file.
URL Resolver: converts relative URLs into absolute URLs, and those into doc IDs.
Repository: contains the entire HTML of every web page. Each document is prefixed by doc ID, length, and URL.
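The repository layout described above (a compressed page prefixed by doc ID, length, and URL) could look roughly like the following sketch. The header field widths, the byte order, and the function names `pack_record` and `unpack_record` are assumptions made for illustration; the slides do not give the exact on-disk format.

```python
import struct
import zlib

# Assumed header: 4-byte doc ID, 2-byte URL length, 4-byte compressed page length.
HEADER = struct.Struct("<IHI")


def pack_record(doc_id, url, html):
    """Build one repository record: header, then URL, then the zlib-compressed page."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(record):
    """Read back (doc_id, url, html) from a record produced by pack_record."""
    doc_id, url_len, body_len = HEADER.unpack_from(record, 0)
    start = HEADER.size
    url = record[start:start + url_len].decode("utf-8")
    body = record[start + url_len:start + url_len + body_len]
    return doc_id, url, zlib.decompress(body).decode("utf-8")


record = pack_record(42, "http://example.com/", "<html><body>hello</body></html>")
restored = unpack_record(record)
```

Because the lengths sit in the header, a reader can skip from one record to the next without decompressing any page.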

Google Architecture (cont.)
URL Resolver: maps absolute URLs into doc IDs stored in the doc index. Stores anchor text in "barrels". Generates a database of links (pairs of doc IDs).
Indexer: parses hit lists and distributes them into "barrels".
Sorter: creates the inverted index, from which the document list (doc IDs and hit lists) can be retrieved for a given word ID.
Lexicon: in-memory hash table that maps words to word IDs. Each entry contains a pointer to the doc list in the barrel that the word ID falls into.
Barrels: partially sorted forward indexes, sorted by doc ID. Each barrel stores hit lists for a given range of word IDs.
Doc Index: index keyed by doc ID, where each entry includes information such as a pointer to the document in the repository, checksum, statistics, status, etc. It also contains URL information if the document has been crawled. If not, it contains just the URL.
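The relationship between the lexicon, the forward index, and the inverted index can be sketched in a few lines. This is a simplified in-memory illustration, not the barrel format itself; the function names, the whitespace tokenizer, and the use of word positions as "hits" are assumptions.

```python
from collections import defaultdict


def build_forward_index(docs, lexicon):
    """Return {doc_id: {word_id: [positions]}} and grow the lexicon as new words appear."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for position, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))  # word -> word ID
            hits[word_id].append(position)
        forward[doc_id] = dict(hits)
    return forward


def invert(forward):
    """Re-sort the forward index by word ID: {word_id: [(doc_id, positions), ...]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, positions in forward[doc_id].items():
            inverted[word_id].append((doc_id, positions))
    return dict(inverted)


lexicon = {}
docs = {1: "web search engine", 2: "search the web web"}
forward = build_forward_index(docs, lexicon)
inverted = invert(forward)
```

Looking up `inverted[lexicon["web"]]` then yields every document containing "web" together with its hit positions, which is the lookup the searcher needs.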

Google Architecture (cont.)
DumpLexicon: the list of word IDs produced by the sorter and the lexicon created by the indexer are used to create a new lexicon, which is used by the searcher.
The lexicon stores about 14 million words.
There are 2 kinds of barrels:
-- Short barrels hold hit lists that include title or anchor hits.
-- Long (full) barrels hold all hit lists.

Results and Performance
The performance of a search engine depends on the quality of its search results, and that quality is judged by its users.
-- Feedback collected from users and researchers indicated that the results were of good quality. For example, Google at that time was able to produce top search results with no broken links.
Google also placed heavy importance on the proximity of word occurrences. For example, a search for "Bill Clinton" does not return independent results for "Bill" and "Clinton".
Storage efficiency was achieved through compression. The paper weighed zlib against bzip and chose zlib for its speed.
System performance was increased by optimizing the indexer, running the sorters in parallel, and optimizing the data structures used to store the information.
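The proximity idea can be illustrated by measuring how close two query words appear in a document. This is a minimal sketch, assuming whitespace tokenization and a plain minimum-distance measure; the function names are hypothetical and the actual proximity weighting used by Google is not described in these slides.

```python
def word_positions(text, word):
    """Return the sorted positions at which word occurs in text."""
    return [i for i, token in enumerate(text.lower().split()) if token == word]


def min_distance(positions_a, positions_b):
    """Smallest gap between a position in one sorted list and a position in the other."""
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list is behind, so the gap can only shrink or be re-tested.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


adjacent = "bill clinton spoke today"
scattered = "bill paid the invoice that clinton sent"
gap_adjacent = min_distance(word_positions(adjacent, "bill"), word_positions(adjacent, "clinton"))
gap_scattered = min_distance(word_positions(scattered, "bill"), word_positions(scattered, "clinton"))
```

A ranking function could then favor the first document, where the two words are adjacent, over the second, where they merely co-occur.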

Conclusions
Google is designed to be a scalable search engine.
The primary goal is to provide high-quality search results over a rapidly growing World Wide Web.
Google employs a number of techniques to improve search quality, including PageRank, anchor text and proximity information.
Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Pros of the paper
A landmark paper that gives insight into the search engine architecture of Google.
First known public description of PageRank.
Proposes new ways of ranking based on link structure, which come very close to the notion of "relevant" documents.

Cons of the paper
The paper describes the Google of 1997, and a number of the goals it proposed were never implemented. For example, making Google a part of the academic realm.
Judging the quality of a web page only by PageRank and the data in anchor text is not sufficient.