Google and Scalable Query Services

Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
April 6, 2005

Administrivia
Please send me an email updating your project status
Next readings and summaries:
  Monday – Berners-Lee paper (very short, fluffy)
  Wednesday – First two sections of the Piazza paper
  For both – summarize the goals, key ideas, and challenges
Reduced reading so you can work on the project!

Today’s Trivia Question

Google Architecture [Brin/Page 98]
Focus was on scalability to the size of the Web
First to really exploit Link Analysis
Started as an academic project at Stanford; became a startup
Our discussion will be on early Google – today they keep things secret!

Google’s Focus
Commodity, cheap hardware
  Unreliable
  Not very powerful
  A fair amount of memory, reasonable hard disks
Lots of racks
  Special air conditioning, power systems, big net pipes
Special queries
  Partitioning of service between “two” versions:
    The version being crawled and fleshed out
    The version being searched
  (Really, different pieces can be crawled & updated at different times)

What Does Google Need to Do?
Scalable crawling of documents
Archival of documents (“cache”)
Inverted indexing
Duplicate removal
Ranking – requires iteration over link structure
  PageRank
  TF/IDF
  Heuristics
Do the new Google services change any of that?
  Some may not need the crawler, e.g., maps, perhaps Froogle

The Heart of Google Storage
The main database: Repository
  Basically, a warehouse of every HTML page (this is the cached page entry), compressed with zlib
  Useful for doing additional processing, any necessary rebuilds
  Repository entry format: [DocID][ECode][UrlLen][PageLen][Url][Page]
  The repository is indexed (not inverted here)
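The entry layout above can be sketched in a few lines. This is only an illustration: the slide gives the field order, but the field widths chosen here (8-byte DocID, 1-byte ECode, 2-byte UrlLen, 4-byte PageLen) and the names pack_entry and unpack_entry are assumptions.

```python
import struct
import zlib

# Assumed header widths: DocID (8 bytes), ECode (1), UrlLen (2), PageLen (4).
HEADER = struct.Struct("<QBHI")


def pack_entry(doc_id, ecode, url, html):
    """Serialize one page as [DocID][ECode][UrlLen][PageLen][Url][Page]."""
    url_bytes = url.encode("utf-8")
    page = zlib.compress(html.encode("utf-8"))  # the page is stored compressed
    header = HEADER.pack(doc_id, ecode, len(url_bytes), len(page))
    return header + url_bytes + page


def unpack_entry(buf, offset=0):
    """Read one entry starting at offset; also return the next entry's offset."""
    doc_id, ecode, url_len, page_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    page_start = start + url_len
    html = zlib.decompress(buf[page_start:page_start + page_len]).decode("utf-8")
    return doc_id, ecode, url, html, page_start + page_len
```

Because each entry carries its own lengths, the repository can be scanned sequentially for rebuilds without consulting any index.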

Repository Index
One index for looking up documents by DocID
  Done in ISAM (think of this as a B+ Tree without smart re-balancing)
  Index points to repository entries (or to URL entry if not crawled)
One index for mapping URL to DocID
  Sorted by checksum of URL
  Compute checksum of URL, then binary search by checksum
  Allows update by merge with another similar file
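A minimal sketch of the URL-to-DocID index: a list sorted by checksum, probed by binary search, and updated by merging with another sorted list. The slide does not say which checksum was used, so CRC-32 stands in here as an assumption, and collisions are ignored for brevity. All function names are hypothetical.

```python
import bisect
import heapq
import zlib


def url_checksum(url):
    # Assumption: CRC-32 stands in for the real checksum function.
    return zlib.crc32(url.encode("utf-8"))


def build_url_index(url_to_docid):
    """Return a list of (checksum, doc_id) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())


def lookup_docid(index, url):
    """Binary search by checksum; returns None if the URL is unknown."""
    key = url_checksum(url)
    i = bisect.bisect_left(index, (key,))  # (key,) sorts before any (key, doc_id)
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None


def merge_indexes(a, b):
    """Update by merge: combine two sorted index files in one linear pass."""
    return list(heapq.merge(a, b))
```

The merge step is what makes batch updates cheap: newly crawled URLs are sorted separately and folded in sequentially rather than inserted one at a time.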

Lexicon
The list of searchable words
  (Presumably, today it’s used to suggest alternative words as well)
The “root” of the inverted index
As of 1998, 14 million “words”
  Kept in memory (was 256MB)
Two parts:
  Hash table of pointers to words and the “barrels” (partitions) they fall into
  List of words (null-separated)
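The two-part layout can be illustrated with a small in-memory structure: one buffer of null-separated words, plus a hash table pointing into it. The class name, the sequential WordID assignment, and the fixed barrel_size used to derive a barrel number are assumptions made for this sketch.

```python
class Lexicon:
    """Null-separated word list plus a hash table of pointers into it."""

    def __init__(self, barrel_size=1000):
        self.words = bytearray()      # part 2: list of words, null-separated
        self.table = {}               # part 1: word -> (word_id, offset, barrel)
        self.barrel_size = barrel_size  # assumption: barrels cover WordID ranges

    def add(self, word):
        """Return the WordID for word, appending it if it is new."""
        if word in self.table:
            return self.table[word][0]
        word_id = len(self.table)
        barrel = word_id // self.barrel_size
        self.table[word] = (word_id, len(self.words), barrel)
        self.words += word.encode("utf-8") + b"\0"
        return word_id

    def word_at(self, offset):
        """Follow a pointer into the word list and read up to the next null."""
        end = self.words.index(b"\0", offset)
        return self.words[offset:end].decode("utf-8")
```

Storing the words in one contiguous buffer keeps per-word overhead low, which matters when the whole lexicon has to fit in main memory.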

Indices – Inverted and “Forward”
Inverted index divided into “barrels” (partitions by range)
  Indexed by the lexicon; for each DocID, consists of a Hit List of entries in the document
Forward index uses the same barrels
  Used to find multi-word queries with words in same barrel
  Indexed by DocID, then a list of WordIDs in this barrel and this document, then Hit Lists corresponding to the WordIDs
Two barrels: short (anchor and title); full (all text)
original tables from
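To make the relationship between the two indices concrete, the sketch below re-sorts one barrel of a forward index (DocID, then WordID, then hits) into inverted form (WordID, then DocID, then hits). The nested-dictionary representation and the function names are assumptions for illustration; the real structures were packed on-disk files.

```python
from collections import defaultdict


def barrel_of(word_id, barrel_size):
    """Assumed range partitioning: consecutive WordIDs share a barrel."""
    return word_id // barrel_size


def invert(forward):
    """Turn {doc_id: {word_id: hits}} into {word_id: [(doc_id, hits), ...]}.

    Documents are visited in DocID order, so every doclist comes out
    sorted by DocID, which is what makes merging doclists possible.
    """
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, hits in forward[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)
```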

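Hit Lists (Not Mafia-Related)
Used in inverted and forward indices
Goal was to minimize the size – the bulk of data is in hit entries
For 1998 version, made it down to 2 bytes per hit (though that’s likely climbed since then)
Bit layouts, as given in [Brin/Page 98] (each totals 16 bits; in fancy and anchor hits the 3-bit font field holds the value 7):

  Plain:   cap: 1 | font: 3   | position: 12
  Fancy:   cap: 1 | font = 7  | type: 4 | position: 8
  Anchor:  cap: 1 | font = 7  | type: 4 | hash: 4 | pos: 4

Anchor hits are a special case of fancy hits.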
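A sketch of squeezing a plain hit into 2 bytes, assuming the 1/3/12 bit split with the capitalization bit in the high position. The bit ordering, the clamping of large positions to 4095, and the function names are assumptions; only the field widths come from the slide.

```python
def pack_plain_hit(cap, font, position):
    """Pack a plain hit into 16 bits: cap (1) | font (3) | position (12)."""
    if not 0 <= font <= 6:
        raise ValueError("font 7 is reserved to mark fancy hits")
    position = min(position, 4095)  # assumption: clamp positions that overflow
    return (int(bool(cap)) << 15) | (font << 12) | position


def unpack_plain_hit(hit):
    """Inverse of pack_plain_hit: returns (cap, font, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF
```

Reserving font value 7 as an escape code is what lets plain and fancy hits share the same 16-bit slot without an extra tag bit.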

Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, goto step 4
8. Sort the documents by rank; return the top K
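The loop above can be sketched as a multi-way intersection of sorted doclists, run first over the short barrel and then over the full barrel. Everything here is a simplification under stated assumptions: the lexicon and barrels are plain dictionaries, rank is a caller-supplied function, and the enough threshold of 40 is an arbitrary placeholder for "unless we have enough".

```python
import heapq


def intersect(doclists):
    """Yield DocIDs present in every sorted doclist (step 4)."""
    if not doclists:
        return
    pos = [0] * len(doclists)
    while all(p < len(d) for p, d in zip(pos, doclists)):
        heads = [d[p] for p, d in zip(pos, doclists)]
        target = max(heads)
        if all(h == target for h in heads):
            yield target
            pos = [p + 1 for p in pos]
        else:
            # Advance only the lists that are behind the largest head.
            pos = [p + (d[p] < target) for p, d in zip(pos, doclists)]


def search(query_words, lexicon, short_barrel, full_barrel, rank, k=10, enough=40):
    """Return the top-k DocIDs matching all query words."""
    if any(w not in lexicon for w in query_words):
        return []                                     # a term matches nothing
    word_ids = [lexicon[w] for w in query_words]      # step 2
    scores = {}
    for barrel in (short_barrel, full_barrel):        # steps 3 and 6
        doclists = [barrel.get(w, []) for w in word_ids]
        for doc_id in intersect(doclists):            # steps 4 and 7
            if doc_id not in scores:
                scores[doc_id] = rank(doc_id)         # step 5
        if len(scores) >= enough:
            break
    return heapq.nlargest(k, scores, key=scores.get)  # step 8
```

Checking the short barrel first means that documents matching in titles and anchor text are found before the much longer full-text doclists are touched.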

Ranking in Google
Considers many types of information:
  Position, font size, capitalization
  Anchor text
  PageRank
    Done offline, in a non-query-sensitive way
  Count of occurrences (basically, TF) in a way that tapers off
  Multi-word queries consider proximity also
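Since PageRank is computed offline and independently of any query, it can be illustrated as a plain power iteration over the link graph. This is a textbook-style sketch, not Google's implementation: the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without out-links are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank
```

Because the result depends only on link structure, it can be precomputed once per crawl and combined at query time with the per-hit signals listed above.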

Why Isn’t Google Based on a DBMS?
Transactional locking is not necessary
  Helps with partitioning and replication
Main memory indexing on lexicon
Unusual query model – what’s special here?
Weird consistency model!
  OK if different users see different views
  As long as we route same user to same machine(s), we’re OK
  Updates are happening in a separate “instance”
    Slipstream it in place
    Can even extend this to change versions of software on the machines – as long as interfaces stay the same
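One simple way to route the same user to the same machine is to hash a stable user identifier onto the list of replicas. The sketch below assumes such an identifier exists (for example a cookie value); the function name and the choice of SHA-256 are illustrative, not something the slides specify.

```python
import hashlib


def route(user_id, replicas):
    """Deterministically pick a replica for this user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]
```

With this scheme a given user keeps seeing one consistent view of the index, even though different replicas may be serving different versions at the same moment. Note that plain modulo hashing reshuffles most users when the replica list changes.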

Could We Change a DBMS?
What would a DBMS for Google-like environments look like?
What would it be useful for, other than Google?

Beyond Google
What if we wanted to:
  Add on-the-fly query capabilities to Google?
    e.g., query over up-to-the-second stock market results
  Use WordNet or some thesaurus to supplement Google?
  Do PageRank in a topic-specific way?
  Supplement Google with “ontology” info?
  Do some sort of XML path matching along with keywords?
  Allow for OLAP-style analysis?
  Do a cooperative, e.g., P2P, Google?
    Benefits of this?
