Google and Scalable Query Services

Google and Scalable Query Services Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems April 6, 2005

Administrivia
- Please send me an email updating your project status
- Next readings and summaries:
  - Monday: Berners-Lee paper (very short, fluffy)
  - Wednesday: first two sections of the Piazza paper
  - For both: summarize the goals, key ideas, and challenges
- Reduced reading so you can work on the project!

Today’s Trivia Question

Google Architecture [Brin/Page 98]
- Focus was on scalability to the size of the Web
- First to really exploit link analysis
- Started as an academic project at Stanford; became a startup
- Our discussion will be on early Google; today they keep things secret!

Google’s Focus
- Commodity, cheap hardware
  - Unreliable
  - Not very powerful
  - A fair amount of memory, reasonable hard disks
- Lots of racks
  - Special air conditioning, power systems, big net pipes
- Special queries
- Partitioning of service between “two” versions:
  - The version being crawled and fleshed out
  - The version being searched
  - (Really, different pieces can be crawled & updated at different times)

What Does Google Need to Do?
- Scalable crawling of documents
- Archival of documents (“cache”)
- Inverted indexing
- Duplicate removal
- Ranking, which requires iteration over link structure
  - PageRank
  - TF/IDF
  - Heuristics
- Do the new Google services change any of that?
  - Some may not need the crawler, e.g., maps, perhaps Froogle

The Heart of Google Storage
- The main database: Repository
  - Basically, a warehouse of every HTML page (this is the cached page entry), compressed with zlib
  - Useful for doing additional processing and any necessary rebuilds
- Repository entry format: [DocID][ECode][UrlLen][PageLen][Url][Page]
- The repository is indexed (not inverted here)
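The entry format above can be sketched in a few lines of Python. This is only an illustration: the field widths in `HEADER` and the helper names `pack_entry` / `unpack_entry` are assumptions, since the slide gives the field order but not the sizes.

```python
import struct
import zlib

# Hypothetical fixed-width header: DocID, ECode, UrlLen, PageLen.
# The widths (8, 1, 2 and 4 bytes) are assumed for illustration only.
HEADER = struct.Struct("<QBHI")


def pack_entry(doc_id: int, ecode: int, url: str, page: bytes) -> bytes:
    """Serialize one entry as [DocID][ECode][UrlLen][PageLen][Url][Page]."""
    url_bytes = url.encode("utf-8")
    compressed = zlib.compress(page)  # the page body is stored compressed
    header = HEADER.pack(doc_id, ecode, len(url_bytes), len(compressed))
    return header + url_bytes + compressed


def unpack_entry(buf: bytes, offset: int = 0):
    """Read one entry at `offset`; also return the offset of the next entry."""
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    page_start = start + url_len
    page = zlib.decompress(buf[page_start:page_start + page_len])
    return doc_id, ecode, url, page, page_start + page_len
```

Because each entry carries its own lengths, a scan can walk the repository front to back, which is what makes a full rebuild of the other structures possible.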

Repository Index
- One index for looking up documents by DocID
  - Done in ISAM (think of this as a B+ Tree without smart re-balancing)
  - Index points to repository entries (or to URL entry if not crawled)
- One index for mapping URL to DocID
  - Sorted by checksum of URL
  - Compute checksum of URL, then binary-search by checksum
  - Allows update by merge with another similar file
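A minimal sketch of the URL-to-DocID index follows. It assumes CRC32 as the checksum (the slides do not say which checksum was used), keeps the index as an in-memory sorted list rather than a file, and ignores checksum collisions; all function names are hypothetical.

```python
import bisect
import heapq
import zlib


def url_checksum(url: str) -> int:
    # CRC32 is a stand-in; the actual checksum function is not specified.
    return zlib.crc32(url.encode("utf-8"))


def build_url_index(url_to_docid):
    """Return (checksum, DocID) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())


def lookup_docid(index, url):
    """Compute the checksum, then binary-search the sorted index."""
    key = url_checksum(url)
    i = bisect.bisect_left(index, (key, -1))  # DocIDs are assumed >= 0
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None


def merge_indexes(a, b):
    """Update by merging two sorted indexes in one sequential pass."""
    return list(heapq.merge(a, b))
```

The merge step is the reason for sorting by checksum: a batch of newly crawled URLs can be sorted separately and folded in sequentially, without random writes.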

Lexicon
- The list of searchable words
  - (Presumably, today it’s used to suggest alternative words as well)
- The “root” of the inverted index
- As of 1998, 14 million “words”
  - Kept in memory (was 256MB)
- Two parts:
  - Hash table of pointers to words and the “barrels” (partitions) they fall into
  - List of words (null-separated)
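The two-part layout can be illustrated roughly as follows. This is a sketch under assumptions: WordIDs are assigned in sorted order, barrels are equal WordID ranges, and a Python dict stands in for the hash table; `build_lexicon` and `word_at` are hypothetical names.

```python
def build_lexicon(words, num_barrels):
    """Return (blob, table): a null-separated word list plus a hash table
    mapping each word to (WordID, offset into blob, barrel number)."""
    vocab = sorted(set(words))
    blob = bytearray()
    table = {}
    for word_id, word in enumerate(vocab):
        barrel = word_id * num_barrels // len(vocab)  # partition by WordID range
        table[word] = (word_id, len(blob), barrel)
        blob += word.encode("utf-8") + b"\0"
    return bytes(blob), table


def word_at(blob, offset):
    """Follow a pointer into the word list and read up to the next null."""
    end = blob.index(b"\0", offset)
    return blob[offset:end].decode("utf-8")
```

Storing the words once in a flat byte string and keeping only offsets in the table is one way a lexicon of millions of words could fit in a few hundred megabytes of memory.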

Indices – Inverted and “Forward”
- Inverted index divided into “barrels” (partitions by range)
  - Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
- Forward index uses the same barrels
  - Used to find multi-word queries with words in same barrel
  - Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
- Two barrels: short (anchor and title); full (all text)
(original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm)
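The relationship between the two indices can be sketched with nested dictionaries. This is a toy model, not the on-disk layout: hit lists are reduced to word positions, barrels are assumed to be fixed-size WordID ranges, and the names `build_forward` and `invert` are hypothetical.

```python
from collections import defaultdict


def barrel_of(word_id, barrel_size):
    # Assumed partitioning rule: each barrel covers a fixed WordID range.
    return word_id // barrel_size


def build_forward(docs, barrel_size):
    """docs maps DocID -> list of WordIDs in document order.
    Result: forward[barrel][doc_id][word_id] -> hit list (positions)."""
    forward = defaultdict(lambda: defaultdict(dict))
    for doc_id, word_ids in docs.items():
        for position, word_id in enumerate(word_ids):
            by_word = forward[barrel_of(word_id, barrel_size)][doc_id]
            by_word.setdefault(word_id, []).append(position)
    return forward


def invert(forward):
    """Regroup each barrel by WordID instead of DocID.
    Result: inverted[barrel][word_id][doc_id] -> hit list (positions)."""
    inverted = defaultdict(lambda: defaultdict(dict))
    for barrel, by_doc in forward.items():
        for doc_id, by_word in by_doc.items():
            for word_id, hits in by_word.items():
                inverted[barrel][word_id][doc_id] = hits
    return inverted
```

Since inversion works one barrel at a time, each barrel can be processed independently, which fits the partitioned design described above.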

Hit Lists (Not Mafia-Related)
- Used in inverted and forward indices
- Goal was to minimize the size: the bulk of data is in hit entries
- For 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then):

  Plain:   cap: 1 | font: 3 | position: 12
  Fancy:   cap: 1 | font = 7 | type: 4 | position: 8
  Anchor:  cap: 1 | font = 7 | type: 4 | hash: 4 | pos: 4   (special case of fancy)

  Numbers are field widths in bits, except “font = 7”, which is the reserved value of the 3-bit font field that marks a hit as fancy.
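The 2-byte encoding can be sketched with plain bit shifts. In the Brin/Page paper the font field is 3 bits wide and the value 7 is reserved to flag a fancy hit; the bit order used here (cap in the top bit) and the `anchor` flag passed to `unpack` are assumptions made for the example.

```python
FANCY_FLAG = 7  # reserved value of the 3-bit font field


def _fit(name, value, bits):
    """Return value if it fits in the given number of bits, else raise."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return value


def pack_plain(cap, font, position):
    if font == FANCY_FLAG:
        raise ValueError("font value 7 is reserved for fancy hits")
    return (_fit("cap", cap, 1) << 15) | (_fit("font", font, 3) << 12) \
        | _fit("position", position, 12)


def pack_fancy(cap, hit_type, position):
    return (_fit("cap", cap, 1) << 15) | (FANCY_FLAG << 12) \
        | (_fit("type", hit_type, 4) << 8) | _fit("position", position, 8)


def pack_anchor(cap, hit_type, anchor_hash, position):
    return (_fit("cap", cap, 1) << 15) | (FANCY_FLAG << 12) \
        | (_fit("type", hit_type, 4) << 8) \
        | (_fit("hash", anchor_hash, 4) << 4) | _fit("position", position, 4)


def unpack(hit, anchor=False):
    """Decode a 16-bit hit. The caller says whether a fancy hit is an anchor
    hit; in a real system the type field would decide this."""
    cap, font = hit >> 15, (hit >> 12) & 0x7
    if font != FANCY_FLAG:
        return {"kind": "plain", "cap": cap, "font": font, "position": hit & 0xFFF}
    hit_type = (hit >> 8) & 0xF
    if anchor:
        return {"kind": "anchor", "cap": cap, "type": hit_type,
                "hash": (hit >> 4) & 0xF, "position": hit & 0xF}
    return {"kind": "fancy", "cap": cap, "type": hit_type, "position": hit & 0xFF}
```

Reserving one font value as a flag is what lets all three layouts share the same 16 bits without an extra tag field.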

Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, goto step 4
8. Sort the documents by rank; return the top K
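The steps above can be sketched as follows. This is a simplified model: each barrel is a dict from WordID to a sorted list of DocIDs, `rank` is a caller-supplied scoring function, and the `enough` threshold and all names are assumptions rather than details from the slides.

```python
import heapq


def matching_docs(doclists):
    """Steps 3, 4 and 7: advance one cursor per sorted doclist and yield
    every DocID that appears in all of them."""
    if not doclists or any(not dl for dl in doclists):
        return
    cursors = [0] * len(doclists)
    while all(c < len(dl) for dl, c in zip(doclists, cursors)):
        current = [dl[c] for dl, c in zip(doclists, cursors)]
        target = max(current)
        if all(doc_id == target for doc_id in current):
            yield target
            cursors = [c + 1 for c in cursors]
        else:
            # Move forward only the cursors that are behind the largest DocID.
            cursors = [c + 1 if dl[c] < target else c
                       for dl, c in zip(doclists, cursors)]


def search(query, lexicon, short_barrel, full_barrel, rank, k=10, enough=40):
    words = query.lower().split()                       # step 1
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]              # step 2
    scored = {}
    for barrel in (short_barrel, full_barrel):          # step 6: short, then full
        doclists = [barrel.get(w, []) for w in word_ids]
        for doc_id in matching_docs(doclists):
            scored.setdefault(doc_id, rank(doc_id, word_ids))  # step 5
        if len(scored) >= enough:
            break
    return heapq.nlargest(k, scored, key=scored.get)    # step 8
```

Trying the short barrel first means that documents matching in titles or anchor text are found before any full-text doclist is read.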

Ranking in Google
- Considers many types of information:
  - Position, font size, capitalization
  - Anchor text
  - PageRank
    - Done offline, in a non-query-sensitive way
  - Count of occurrences (basically, TF) in a way that tapers off
- Multi-word queries consider proximity also
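Because PageRank depends only on the link graph, it can be computed ahead of time by repeated passes over the links. The sketch below is a textbook power iteration, not Google's production code; the damping factor of 0.85 is the value suggested in the Brin/Page paper, while the fixed iteration count and the even spreading of rank from pages without out-links are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Returns a dict of page -> rank, with ranks summing to 1."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        new_rank = dict.fromkeys(nodes, (1.0 - damping) / n)
        for src in nodes:
            # Assumption: a page with no out-links shares its rank with everyone.
            targets = links.get(src) or nodes
            share = damping * rank[src] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank
```

At query time the precomputed value is just one more input to the ranking function, alongside the hit-based signals listed above.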

Why Isn’t Google Based on a DBMS?
- Transactional locking is not necessary
  - Helps with partitioning and replication
- Main memory indexing on lexicon
- Unusual query model – what’s special here?
- Weird consistency model!
  - OK if different users see different views
  - As long as we route same user to same machine(s), we’re OK
  - Updates are happening in a separate “instance”
    - Slipstream it in place
  - Can even extend this to change versions of software on the machines – as long as interfaces stay the same
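One simple way to get the "same user, same machine" property is to hash a stable user key onto the list of replicas. The snippet below only illustrates that idea; the choice of SHA-256, the `route` function, and the replica names are assumptions, and nothing here describes how Google actually routes requests.

```python
import hashlib


def route(user_key: str, replicas: list) -> str:
    """Deterministically map a user key (e.g. a cookie or client address)
    to one replica, so repeated requests see the same index version."""
    digest = hashlib.sha256(user_key.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]
```

With this kind of routing, replicas can be updated at different times: a given user keeps seeing one consistent view even though two users may see different ones.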

Could We Change a DBMS?
- What would a DBMS for Google-like environments look like?
- What would it be useful for, other than Google?

Beyond Google
What if we wanted to:
- Add on-the-fly query capabilities to Google?
  - e.g., query over up-to-the-second stock market results
- Use WordNet or some thesaurus to supplement Google?
- Do PageRank in a topic-specific way?
- Supplement Google with “ontology” info?
- Do some sort of XML path matching along with keywords?
- Allow for OLAP-style analysis?
- Do a cooperative, e.g., P2P, Google?
  - Benefits of this?