The Search Engine Architecture


The Search Engine Architecture CSCI 572: Information Retrieval and Search Engines Summer 2011

Outline
- Introduction
- Google Summary
- The PageRank algorithm
- The Google Architecture
  - Architectural components
  - Architectural interconnections
  - Architectural data structures
- Evaluation of Google
- Summary

Problems with search engines circa the last decade
- Human maintenance
  - Subjective
  - Example: ranking hits based on $$$ (paid placement)
- Automated search engines
  - Quality of results: they neglect to take the user's context into account
- Searching process
  - High-quality results aren't always at the top of the list

The Typical Search Engine Process
In which stages is the most time spent?

How to scale to modern times?
- Currently
  - Efficient indexing
  - Petabyte-scale storage space
  - Efficient crawling
  - Cost-effectiveness of hardware
- Future
  - Qualitative context
  - Maintaining localization data
  - Perhaps send indexing to clients: could client computers help gather Google's index in a distributed, decentralized fashion?

Google
The whole idea is to keep up with the growth of the web. Design goals:
- Remove junk results
- Scalable document indices
- Use of link structure to improve quality filtering
- Use as an academic digital library
- Provide search engine datasets
- Study search engine infrastructure and its evolution

Google
- Archival of information
- Leverage of usage data
- Use of compression
- Efficient data structures
- Proprietary file system
- PageRank algorithm
  - Sort of a "lineage" of a source of information
  - Citation graph

PageRank Algorithm
- A numerical method to calculate a page's importance, modeled on academic citation counting: the approach people doing research might well follow
- PageRank of a page A, with damping factor d:

      PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  where:
  - T1...Tn = the set of pages with incoming links to page A
  - PR(x) = the PageRank of page x
  - C(x) = the number of outgoing links from page x
- It's actually a bit more complicated than it first looks: the definition is recursive. For instance, what are PR(T1), PR(T2), and so on?

PageRank Algorithm
- An excellent explanation: http://www.iprcom.com/papers/pagerank/
- Because of the balance between the (1 - d) term and the d * (PR...) term, the PR(A) equation behaves like a probability distribution over web pages
- Strictly, with the formula as written the PageRanks of all pages sum to N, the number of pages (so the average PageRank is 1, as the example below shows); dividing the (1 - d) term by N instead yields ranks that sum to 1

PageRank: Example
- So, where do you start? It turns out that you can effectively "guess" the initial PageRanks
- In our example, guess 0 for all of the pages
- Then run the PR function to recalculate PR for all the web pages, iteratively
- You do this until the PageRanks stop changing between iterations, i.e. they "settle down" (converge)

PageRank: Example
Below is the iterative calculation that we would run. (Reading the graph off the equations: page a links to b and c, b links to c, c links to a, and d links to c, so d itself receives no links.)

    PR(a) = 1 - $damp + $damp * PR(c);
    PR(b) = 1 - $damp + $damp * (PR(a)/2);
    PR(c) = 1 - $damp + $damp * (PR(a)/2 + PR(b) + PR(d));
    PR(d) = 1 - $damp;
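The same iteration as a minimal, runnable Python sketch. The damping factor of 0.85 and the in-place update order are assumptions, but together they reproduce the numbers on the next three slides:

    damp = 0.85  # assumed damping factor

    # Initial "guess": every page starts at 0.
    pr = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 0.0}

    for i in range(40):
        # Update in place, so each equation sees the freshest values.
        pr["a"] = 1 - damp + damp * pr["c"]
        pr["b"] = 1 - damp + damp * (pr["a"] / 2)
        pr["c"] = 1 - damp + damp * (pr["a"] / 2 + pr["b"] + pr["d"])
        pr["d"] = 1 - damp
        print("  ".join(f"{p}: {pr[p]:.5f}" for p in "abcd"))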

PageRank Algorithm: First 18 iterations

    a: 0.00000 b: 0.00000 c: 0.00000 d: 0.00000
    a: 0.15000 b: 0.21375 c: 0.39544 d: 0.15000
    a: 0.48612 b: 0.35660 c: 0.78721 d: 0.15000
    a: 0.81913 b: 0.49813 c: 1.04904 d: 0.15000
    a: 1.04169 b: 0.59272 c: 1.22403 d: 0.15000
    a: 1.19042 b: 0.65593 c: 1.34097 d: 0.15000
    a: 1.28982 b: 0.69818 c: 1.41912 d: 0.15000
    a: 1.35626 b: 0.72641 c: 1.47136 d: 0.15000
    a: 1.40065 b: 0.74528 c: 1.50626 d: 0.15000
    a: 1.43032 b: 0.75789 c: 1.52959 d: 0.15000
    a: 1.45015 b: 0.76632 c: 1.54518 d: 0.15000
    a: 1.46341 b: 0.77195 c: 1.55560 d: 0.15000
    a: 1.47226 b: 0.77571 c: 1.56257 d: 0.15000
    a: 1.47818 b: 0.77823 c: 1.56722 d: 0.15000
    a: 1.48214 b: 0.77991 c: 1.57033 d: 0.15000
    a: 1.48478 b: 0.78103 c: 1.57241 d: 0.15000
    a: 1.48655 b: 0.78178 c: 1.57380 d: 0.15000
    a: 1.48773 b: 0.78228 c: 1.57473 d: 0.15000

Still changing too much.

PageRank: Next 13 iterations

    a: 1.48852 b: 0.78262 c: 1.57535 d: 0.15000
    a: 1.48904 b: 0.78284 c: 1.57576 d: 0.15000
    a: 1.48940 b: 0.78299 c: 1.57604 d: 0.15000
    a: 1.48963 b: 0.78309 c: 1.57622 d: 0.15000
    a: 1.48979 b: 0.78316 c: 1.57635 d: 0.15000
    a: 1.48990 b: 0.78321 c: 1.57643 d: 0.15000
    a: 1.48997 b: 0.78324 c: 1.57649 d: 0.15000
    a: 1.49001 b: 0.78326 c: 1.57652 d: 0.15000
    a: 1.49004 b: 0.78327 c: 1.57655 d: 0.15000
    a: 1.49007 b: 0.78328 c: 1.57656 d: 0.15000
    a: 1.49008 b: 0.78328 c: 1.57657 d: 0.15000
    a: 1.49009 b: 0.78329 c: 1.57658 d: 0.15000
    a: 1.49009 b: 0.78329 c: 1.57659 d: 0.15000

Starting to stabilize.

PageRank: Last 9 iterations

    a: 1.49010 b: 0.78329 c: 1.57659 d: 0.15000
    a: 1.49011 b: 0.78329 c: 1.57660 d: 0.15000
    a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000
    a: 1.49011 b: 0.78330 c: 1.57660 d: 0.15000

Average PageRank = 1.0000
Stabilized.

Google Architecture
- Key components
- Interconnections
- Data structures
- A reference architecture for search engines?

Google Data Components
- BigFiles
- Repository: compressed with zlib
- Lexicon: the word base
- Hit lists: word -> document ID mappings
- Document indexing
  - Forward index
  - Inverted index

Google File System (GFS)
- BigFiles, a.k.a. Google's proprietary filesystem (BigFiles, from the 1998 paper, is the precursor to GFS proper)
- 64-bit addressable
- Supports compression
- Conventional operating systems don't suffice, though the paper gives no explanation of why
- GFS: http://labs.google.com/papers/gfs.html
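The paper does not describe BigFiles' internals, so as one illustrative guess, here is what 64-bit addressing across a set of ordinary fixed-size files could look like (the chunk size and layout are pure assumptions):

    CHUNK = 1 << 30  # assumed size of each underlying OS file: 1 GiB

    def locate(virtual_offset):
        # Map a 64-bit virtual offset to (file index, offset within that file).
        return virtual_offset // CHUNK, virtual_offset % CHUNK

    print(locate(5_000_000_000))  # -> (4, 705032704): 5 GB lands in the 5th file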

Google Key Data Components
- Repository
  - Stores the full text of web pages
  - Compressed with zlib
  - zlib compresses less tightly than bzip: a tradeoff of time complexity versus space efficiency, since bzip is more space-efficient but slower
- Why is it important to compress the pages?
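The tradeoff is easy to see with Python's standard zlib and bz2 modules on a toy page (the page contents here are made up):

    import bz2
    import zlib

    # A toy "page": repetitive HTML, which both codecs compress well.
    page = b"<html><body>" + b"<p>search engine architecture</p>" * 500 + b"</body></html>"

    z = zlib.compress(page)  # fast, moderate ratio (the repository's choice)
    b = bz2.compress(page)   # tighter output, but slower on large inputs
    print(f"raw={len(page)}  zlib={len(z)}  bzip2={len(b)}")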

Google Lexicon
- Contains 14 million words
- Implemented as a hash table of pointers to words (a full explanation is beyond the scope of this discussion)
- Why is it important to have a lexicon?
  - Tokenization
  - Analysis
  - Language identification
  - Spam detection
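A minimal sketch of the idea, assuming nothing more than a hash table that hands out wordIDs on first sight (the real structure is far more compact):

    import re

    lexicon = {}  # word -> wordID; the real lexicon holds ~14 million entries

    def word_id(word):
        # Assign the next free ID the first time a word is seen.
        return lexicon.setdefault(word, len(lexicon))

    tokens = re.findall(r"[a-z0-9]+", "the anatomy of a large-scale search engine")
    print([word_id(t) for t in tokens])  # -> [0, 1, 2, 3, 4, 5, 6, 7]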

Mapping Queries to Hits
- Hit lists: a wordID -> (docID, position, font, capitalization) mapping
- They take up most of the space in both the forward and inverted indices
- Three types of hits: fancy, plain, and anchor
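The paper packs each hit by hand into two bytes; below is a sketch of the plain-hit layout it describes, with one capitalization bit, three bits of relative font size, and twelve bits of word position (treat the exact field widths as illustrative):

    def pack_plain_hit(capitalized, font_size, position):
        # 1 bit capitalization | 3 bits font size | 12 bits word position
        assert 0 <= font_size <= 6   # font size 7 flags a "fancy" hit instead
        assert 0 <= position < 4096  # larger positions get clamped
        return (int(capitalized) << 15) | (font_size << 12) | position

    print(hex(pack_plain_hit(True, 3, 42)))  # -> 0xb02a: one two-byte hit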

Document Indexing
- Forward index
  - docIDs -> wordIDs
  - Partially sorted
  - Duplicated docIDs make the final indexing and coding easier
- Inverted index
  - wordIDs -> docIDs
  - 2 sets of inverted barrels: one for title and anchor hits, one for the full hit lists
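In miniature, inverting a forward index is one pass over it; barrels, sorting, and the hit lists themselves are omitted from this sketch:

    from collections import defaultdict

    # Toy forward index: docID -> wordIDs appearing in that document.
    forward = {1: [10, 11, 10], 2: [11, 12]}

    inverted = defaultdict(list)  # wordID -> docIDs containing that word
    for doc_id, word_ids in sorted(forward.items()):
        for wid in sorted(set(word_ids)):
            inverted[wid].append(doc_id)

    print(dict(inverted))  # -> {10: [1], 11: [1, 2], 12: [2]}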

Crawling and Indexing
- Crawling
  - Distributed and parallel
  - Social issues: bringing down web servers (politeness, sketched below) and copyright
  - Text versus code
- Indexing
  - Developed their own web page parser
  - Barrels
  - Distribution of compressed documents
  - Sorting
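A toy illustration of per-host politeness, the property that keeps a crawler from hammering any one server. The delay value and URLs are made up, and a real crawler would also honor robots.txt:

    import time
    import urllib.parse
    from collections import deque

    DELAY = 2.0    # toy value; production crawlers often wait far longer per host
    last_hit = {}  # host -> time of the most recent request to it
    frontier = deque(["http://example.com/", "http://example.com/a",
                      "http://example.org/"])

    while frontier:
        url = frontier.popleft()
        host = urllib.parse.urlsplit(url).netloc
        if time.time() - last_hit.get(host, 0.0) < DELAY:
            frontier.append(url)  # this host was hit too recently; retry later
            time.sleep(0.1)
            continue
        last_hit[host] = time.time()
        print("fetch", url)  # a real crawler would download and parse here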

Google’s Query Evaluation 1: Parse the query 2: Convert words into WordIDs Using Lexicon 3: Select the barrels that contain documents which match the WordIDs 4: Search through documents in the selected barrels until one is discovered that matches all the search terms 5: Compute that document’s rank (using PageRank as one of the components) 6: Repeat step 4 until no documents are found and we’ve went through all the barrels 7: Sort the set of returned documents by document rank and return the top k documents

Google Evaluation
- Performed by generating numerical results
- Query satisfaction: the "Bill Clinton" example
- Storage requirements: 55 GB total
- System performance (at the time)
  - 9 days to download 26 million pages
  - 63 hours to fetch the final 11 million
- Search performance (at the time): between 1 and 10 seconds for most queries

Wrapup
Loads of future work. Even at that time, there were issues of:
- Information extraction from semi-structured sources (such as web pages), still an active area of research
- Search engines as digital libraries: what services, APIs, and toolkits should a search engine provide? What storage methods are the most efficient?
- From 2005 to 2010 to ???
- Enhancing metadata: automatic markup and generation; what are the appropriate fields?
- Automatic concept extraction: present the searcher with a context
- Searching languages: beyond context-free queries
- Other types of search: facet, GIS, etc.

The Future?
- The user poses a keyword query and a "Google-like" result page comes back
- Along with each link returned, there is a "concept map" outlining, via extraction methods, what the "real" content of the document is
- This basically allows you to "see" the page's rank visually and to discover information visually
- Existing evidence that this works well: http://vivisimo.com/ and Carrot2/3 clustering

Software Architecture Concept Map
(Figure: a concept map of Chris's homepage, http://sunset.usc.edu/~mattmann, with nodes for Data, Publications, Software, Data Grid, Science Data Systems, and Software Architecture.)