OVERVIEW OF GOOGLE SEARCH ENGINE
Xiannong Meng
Computer Science Department
Bucknell University
Lewisburg, PA 17837, U.S.A.


Hardware Architecture
Very little is published on the internals of specific search engines.
Google clusters (early 2003 data, Barroso et al.):
– 15,000 off-the-shelf PCs
– Ranging from single-processor 533-MHz Celerons to dual-processor 1.4-GHz Pentium IIIs
– One or more 80-GB IDE drives on each PC
– Gigabit Ethernet switches
– 2-3 year hardware life cycle
– Fault tolerance provided in software

Some Highlights
On average, a single query reads hundreds of megabytes of data and consumes billions of CPU cycles.
Thousands of queries are served per second at peak time.
The index structure is partitioned, so different queries can run on different processors.

Design Considerations
Use software to provide functionality and reliability, rather than relying on hardware.
Components-off-the-shelf (COTS) approach: commodity PCs connected by a fault-tolerant backbone network.
Reliability is achieved by replicating services across many different machines.
Price/performance considerations beat peak performance.

Serving a Query
When a query is received, it is mapped to a Google cluster that is geographically close to the user.
Google has clusters distributed around the world; each cluster has a few thousand machines.
A DNS-based load-balancing scheme selects the cluster; a hardware load balancer within the cluster then distributes the query among the machines.

Serving a Query (cont)
Query execution consists of two major phases.
Phase one:
– The index servers consult an inverted index that maps each query word to a hit list
– The index servers then determine a set of relevant documents by intersecting the hit lists of the query words
– A relevance score is computed for each document
– The result of this phase is an ordered list of docIds
– The entire index is randomly divided into pieces (index shards), so this work can be spread across many index servers
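
A minimal sketch of phase one, assuming a toy in-memory inverted index: intersect the posting lists of the query words and rank the surviving docIds with a placeholder score() function. Names here (INVERTED_INDEX, score) are illustrative, not Google's actual code or data structures.

    # Sketch of index-server phase one: intersect hit lists, then rank.
    from functools import reduce

    # Toy inverted index: word -> set of docIds containing the word
    INVERTED_INDEX = {
        "google": {1, 2, 5, 9},
        "cluster": {2, 5, 7},
    }

    def score(doc_id, words):
        """Placeholder relevance score; the real system combines
        PageRank, hit types, proximity, and more."""
        return -doc_id  # deterministic stand-in

    def phase_one(query_words):
        hit_lists = [INVERTED_INDEX.get(w, set()) for w in query_words]
        if not hit_lists:
            return []
        candidates = reduce(set.intersection, hit_lists)   # docs containing every query word
        return sorted(candidates, key=lambda d: score(d, query_words), reverse=True)

    print(phase_one(["google", "cluster"]))   # -> [2, 5]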

Serving a Query (cont)
Phase two:
– The document servers take the docIds and compute the actual title, URL, and a query-specific summary for each document
– Documents are randomly distributed into smaller shards
– Multiple server replicas are responsible for handling each shard
– Requests are routed through a load balancer

Document Server Clusters
Each document server cluster must have access to an online, low-latency copy of the entire web.
Google stores dozens of copies of the web across its clusters.
Supporting services coordinated by a Google web server (GWS), besides the document servers and index servers:
– Spell checking
– Ad serving (if any)

Commodity Parts
Google's racks consist of 40 to 80 x86-based servers mounted on either side of a custom-made rack.
Each side of the rack contains 20 2u or 40 1u servers.
These servers are mid-range machines.
Several generations of CPUs are in active use, ranging from single-processor 533-MHz Celerons to dual-processor 1.4-GHz Pentium III servers.

Commodity Parts (cont)
Each server contains one or more 80-GB IDE disk drives.
The servers on each side of a rack interconnect via a 100-Mbps Ethernet switch.
Each switch has one or two gigabit uplinks to a core gigabit switch that connects all racks together.

Cost Estimate of an Example Rack
In late 2002, a rack of 88 dual-CPU 2-GHz Xeon servers, each with 2 GB of RAM and 80 GB of disk, cost around $278,000.
This rack contains 176 2-GHz Xeon CPUs, 176 GB of RAM, and about 7 TB of disk.
By comparison, a typical multiprocessor server with 8 2-GHz Xeon CPUs, 64 GB of RAM, and 8 TB of disk costs about $758,000.
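
The rack totals follow directly from the per-server figures above; a small sketch of the arithmetic (prices are the rough late-2002 numbers quoted on this slide, not a cost model):

    # Rack totals implied by the per-server figures above (late-2002 snapshot).
    servers_per_rack = 88
    cpus_per_server = 2            # dual 2-GHz Xeons
    ram_gb_per_server = 2
    disk_gb_per_server = 80

    total_cpus = servers_per_rack * cpus_per_server                   # 176 CPUs
    total_ram_gb = servers_per_rack * ram_gb_per_server               # 176 GB
    total_disk_tb = servers_per_rack * disk_gb_per_server / 1000.0    # ~7 TB

    rack_cost = 278_000
    server_cost = 758_000          # 8-way multiprocessor server, for comparison
    print(total_cpus, total_ram_gb, round(total_disk_tb, 1))   # 176 176 7.0
    print(f"price ratio: {server_cost / rack_cost:.1f}x")      # ~2.7x the price for far fewer CPUs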

Google File System
Google File System (GFS, Ghemawat et al. 2003):
– 64-bit distributed file system
– Single master, multiple chunkservers
– Chunk size of 64 MB
– The largest GFS cluster has over 1,000 storage nodes and over 300 TB of disk storage

Issues with File Systems
Component failures are the norm rather than the exception.
Files are huge by traditional standards.
Most files are mutated by appending new data rather than overwriting existing data.
Co-designing the applications and the file system (GFS) makes the overall system more efficient.

Assumptions
The system is built from many inexpensive commodity components that often fail.
It stores a modest number of large files: on the order of a few million files, each 100 MB or larger.
Workload: large streaming reads and small random reads; streaming reads typically involve 100 KB to 1 MB per read.

Assumptions (cont)
Workloads also include many large, sequential writes that append data to files.
Once written, files are seldom modified again.
Multiple clients may access (append to) the same file concurrently.
High sustained bandwidth is more important than low latency.

Architecture
A GFS cluster consists of a single master and multiple chunkservers, accessed by multiple clients.
Files are divided into fixed-size chunks (64 MB).
Each chunk is identified by a globally unique 64-bit chunk handle.
Chunkservers store chunks on local disks as Linux files.
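
A minimal sketch of how a client can translate a byte offset in a file into a chunk index before asking the master for the corresponding chunk handle and replica locations. The chunk size is the 64 MB stated above; the function names are illustrative.

    # Translate a byte offset into a chunk index, as a GFS client would do
    # before contacting the master for the chunk handle and replica list.
    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB fixed-size chunks

    def chunk_index(offset: int) -> int:
        """Index of the chunk containing this byte offset."""
        return offset // CHUNK_SIZE

    def chunk_range(index: int) -> tuple[int, int]:
        """Byte range [start, end) covered by a chunk."""
        return index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE

    # Example: a read at offset 200 MB falls in chunk 3 of the file.
    print(chunk_index(200 * 1024 * 1024))   # -> 3
    print(chunk_range(3))                   # -> (201326592, 268435456)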

Metadata
The master stores three major types of metadata:
– File and chunk namespaces
– The mapping from files to chunks
– The locations of each chunk's replicas
All metadata are kept in the master's memory.
Namespaces and the file-to-chunk mapping are also kept in persistent storage (on local disk and on remote replicas).
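
A toy sketch of the three in-memory metadata maps listed above; the field names, types, and lookup method are illustrative assumptions, not GFS's actual data structures.

    # In-memory metadata a GFS-like master might keep (illustrative only).
    from dataclasses import dataclass, field

    @dataclass
    class MasterMetadata:
        # 1. file and chunk namespaces: pathname -> file attributes
        namespace: dict[str, dict] = field(default_factory=dict)
        # 2. file -> ordered list of 64-bit chunk handles
        file_chunks: dict[str, list[int]] = field(default_factory=dict)
        # 3. chunk handle -> chunkserver addresses holding a replica
        chunk_locations: dict[int, list[str]] = field(default_factory=dict)

        def lookup(self, path: str, chunk_index: int) -> tuple[int, list[str]]:
            """What a client asks the master: (chunk handle, replica locations)."""
            handle = self.file_chunks[path][chunk_index]
            return handle, self.chunk_locations[handle]

    md = MasterMetadata()
    md.namespace["/logs/web.0"] = {"owner": "crawler"}
    md.file_chunks["/logs/web.0"] = [0xA1, 0xA2]
    md.chunk_locations[0xA2] = ["cs17:7000", "cs42:7000", "cs88:7000"]
    print(md.lookup("/logs/web.0", 1))   # -> (162, ['cs17:7000', 'cs42:7000', 'cs88:7000'])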

Consistency Model
What is a consistency model? It specifies how the system (Google, or anyone) guarantees the integrity of the data that clients see.
Guarantees made by GFS:
– File namespace mutations are atomic
– The state of a file region after a data mutation depends on the type of operation, whether it succeeds or fails, and whether concurrent mutations occurred

Consistency Model (cont)
A file region is consistent if all clients always see the same data, regardless of which replica they read from.
GFS guarantees consistency by:
– Applying mutations to a chunk in the same order on all its replicas
– Using chunk version numbers to detect any replica that has become stale because it missed mutations while its chunkserver was down

System Interactions --- Lease
Lease: for each mutation, the master grants a chunk lease to one of the replicas, called the primary.
The primary picks a serial order for all mutations to the chunk.
A lease has an initial timeout of 60 seconds.
As long as the chunk is being mutated, the primary can request and will typically receive extensions.
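
A minimal sketch of the lease bookkeeping described above (60-second initial timeout, renewable while mutations continue). This is an assumption-laden illustration of the mechanism, not GFS code.

    # Illustrative chunk-lease bookkeeping on the master (not actual GFS code).
    import time

    LEASE_TIMEOUT = 60.0   # seconds, the initial lease duration

    class ChunkLease:
        def __init__(self, chunk_handle: int, primary: str):
            self.chunk_handle = chunk_handle
            self.primary = primary            # replica chosen as the primary
            self.expires = time.time() + LEASE_TIMEOUT

        def valid(self) -> bool:
            return time.time() < self.expires

        def extend(self) -> None:
            """Primary asks for an extension while mutations are ongoing."""
            self.expires = time.time() + LEASE_TIMEOUT

    lease = ChunkLease(0xA2, primary="cs17:7000")
    assert lease.valid()
    lease.extend()   # in the real system, extensions piggyback on heartbeat messages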

System Interactions --- Data Flow
Data is pushed linearly along a chain of chunkservers.
Each machine's full outbound bandwidth is used to transfer the data.
Each machine forwards the data to the "closest" machine that has not yet received it.
Data transfer is pipelined over TCP connections: as soon as a chunkserver starts receiving data, it forwards the data immediately.
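
A toy sketch of the forwarding rule: at each hop, pick the closest replica that has not yet received the data. The distance table here is made up; the real system estimates closeness from network topology (e.g., IP addresses).

    # Toy sketch: build the chain of chunkservers along which data is pushed.
    def forwarding_chain(client: str, replicas: set[str], dist) -> list[str]:
        chain, remaining, current = [], set(replicas), client
        while remaining:
            nxt = min(remaining, key=lambda r: dist(current, r))  # closest unserved replica
            chain.append(nxt)
            remaining.discard(nxt)
            current = nxt          # that replica forwards to the next one, pipelined
        return chain

    # Example distances (symmetric, made up): same rack is cheap, cross-rack costs more.
    table = {("client", "cs1"): 1, ("client", "cs2"): 3, ("client", "cs3"): 4,
             ("cs1", "cs2"): 1, ("cs1", "cs3"): 5, ("cs2", "cs3"): 1}
    def dist(a, b):
        return table.get((a, b)) or table.get((b, a))

    print(forwarding_chain("client", {"cs1", "cs2", "cs3"}, dist))   # -> ['cs1', 'cs2', 'cs3']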

System Interactions --- Atomic Record Appends
Clients specify only the data, not the offset.
GFS appends the data to the file at least once atomically and returns the resulting location to the client.
If the append would cause the chunk to exceed its size limit (64 MB), the current chunk is padded and the append is retried on a new chunk.
If an append fails at any replica, the client retries the operation.
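
A sketch of the padding decision described above, made by the primary for a record append. The function name and return convention are illustrative assumptions.

    # Illustrative record-append decision at the primary replica.
    CHUNK_SIZE = 64 * 1024 * 1024

    def record_append(chunk_used: int, record: bytes):
        """Return ('append', offset) if the record fits in the current chunk,
        or ('pad_and_retry', None) if the chunk must be padded and the append
        retried on a new chunk."""
        if chunk_used + len(record) <= CHUNK_SIZE:
            return "append", chunk_used          # record written at this offset
        return "pad_and_retry", None             # pad current chunk, retry on next chunk

    print(record_append(10 * 1024 * 1024, b"x" * 1024))   # fits in the current chunk
    print(record_append(CHUNK_SIZE - 100, b"x" * 1024))   # forces padding and a new chunk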

System Interactions --- Snapshot
The snapshot operation makes a copy of a file or a directory tree.
Copy-on-write is used to implement snapshots.
When a snapshot request is received, the master revokes any outstanding leases on the affected chunks, ensuring data integrity.
Any subsequent write then has to interact with the master again, which gives the master a chance to create a new copy of the chunk first.
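
A toy copy-on-write sketch: a snapshot only duplicates chunk references, and a chunk is physically copied the first time it is written after the snapshot. This is entirely illustrative and greatly simplified relative to GFS.

    # Toy copy-on-write snapshot over chunk reference counts (illustrative).
    class CowFile:
        def __init__(self, chunks: dict[int, bytes]):
            self.chunks = chunks                 # chunk handle -> data
            self.refcount = {h: 1 for h in chunks}

        def snapshot(self) -> list[int]:
            """A snapshot is just the current list of handles; bump refcounts."""
            for h in self.chunks:
                self.refcount[h] += 1
            return list(self.chunks)

        def write(self, handle: int, data: bytes, new_handle: int) -> int:
            if self.refcount[handle] > 1:        # shared with a snapshot: copy first
                self.refcount[handle] -= 1
                self.chunks[new_handle] = self.chunks[handle]
                self.refcount[new_handle] = 1
                handle = new_handle
            self.chunks[handle] = data
            return handle                        # writes now go to the private copy

    f = CowFile({0xA1: b"old"})
    snap = f.snapshot()
    h = f.write(0xA1, b"new", new_handle=0xB1)
    print(snap, hex(h), f.chunks[0xA1], f.chunks[0xB1])   # the snapshot still sees b"old"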

Fault Tolerance
High availability:
– Both the master and the chunkservers restart in seconds
– Each chunk is replicated on multiple chunkservers on different racks
– A read-only "shadow" master replica can serve reads; it may lag the primary slightly

Fault Tolerance (cont)
Data integrity:
– Each chunk is broken into 64-KB blocks, each of which has a 32-bit checksum
– On a read request, if the checksum does not match, the reader reads from another chunk replica instead
– Checksum computation is heavily optimized for append operations
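
A minimal sketch of block-level checksum verification on read, using CRC-32 from Python's standard library as a stand-in for GFS's actual checksum algorithm.

    # Verify 64-KB block checksums on read; fall back to another replica on mismatch.
    import zlib

    BLOCK = 64 * 1024   # 64-KB blocks, each with a 32-bit checksum

    def block_checksums(chunk: bytes) -> list[int]:
        return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

    def read_block(chunk: bytes, checksums: list[int], block_no: int) -> bytes:
        data = chunk[block_no * BLOCK:(block_no + 1) * BLOCK]
        if zlib.crc32(data) != checksums[block_no]:
            raise IOError("checksum mismatch: read another replica, report to master")
        return data

    chunk = b"a" * (3 * BLOCK)
    sums = block_checksums(chunk)
    assert read_block(chunk, sums, 2) == b"a" * BLOCK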

Fault Tolerance (cont)
Diagnostic tools:
– Extensive and detailed logs are kept to help with problem isolation, debugging, and performance analysis

Overall System Architecture
Google's overall architecture is based on the 1998 paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page, presented at the 1998 WWW conference, Brisbane, Australia, April 1998.
http://citeseer.ist.psu.edu/brin98anatomy.html

Some Basics
Google: a common spelling of "googol", i.e. 10^100.
Issues to address: accurate, high-quality search results.
Quick history:
– In 1994, the WWW Worm had an index of 110,000 web pages
– By 1998, AltaVista had about 140 million pages indexed
– AltaVista received about 20 million queries per day in November 1997

Google Features
PageRank:
– Uses the web's citation (link) structure to measure importance: how many pages cite a page, and how important those citing pages are
Anchor text:
– Often provides a more accurate description of the target page than the page itself
– Exists even for documents that cannot be indexed by a text-based search engine (e.g., images, databases)
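
In the Brin & Page paper, the PageRank of a page A with incoming links from pages T1..Tn, where C(T) is the number of outgoing links on page T, is PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), with damping factor d = 0.85. A small iterative sketch of that definition on a toy link graph:

    # Iterative PageRank as defined in Brin & Page (1998):
    # PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.
    def pagerank(links: dict[str, list[str]], d: float = 0.85, iters: int = 50):
        pages = set(links) | {t for outs in links.values() for t in outs}
        pr = {p: 1.0 for p in pages}
        for _ in range(iters):
            new = {}
            for p in pages:
                incoming = (pr[q] / len(links[q])
                            for q, outs in links.items() if p in outs)
                new[p] = (1 - d) + d * sum(incoming)
            pr = new
        return pr

    # Tiny example: B and C both point to A; A points back to B.
    print(pagerank({"A": ["B"], "B": ["A"], "C": ["A"]}))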

Google Architecture
(Figure: the overall Google architecture diagram from the Brin & Page paper; not reproduced in this transcript.)

Components of Google
Distributed web crawling: the URLserver sends lists of URLs to be fetched to the crawlers.
The fetched web pages are sent to the storeserver, which compresses and stores them.
A unique docID is assigned to each page.
The indexing function is performed by the indexer and the sorter.

Components of Google (cont)
Indexer:
– Each document is converted into a set of word occurrences called hits
– Each hit records the word, its position in the document, its font size, and its capitalization
– The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index
– The indexer also parses out all the links in every web page and stores important information about them (where each link points from and to, and the text of the link) in an anchors file
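
A toy sketch of building hits and distributing them into barrels. The hit fields follow the description above, but the tuple layout, the barrel count, and the hash-style barrel assignment are illustrative assumptions rather than the paper's exact encoding.

    # Toy forward-index build: record hits (word, position, font size, caps)
    # and distribute them into barrels keyed by wordID.
    from collections import defaultdict, namedtuple

    Hit = namedtuple("Hit", "word_id position font_size capitalized")
    NUM_BARRELS = 4   # small for the example; the paper describes 64 barrels

    def index_document(doc_id: int, words, lexicon: dict[str, int], barrels):
        for pos, (word, font, caps) in enumerate(words):
            wid = lexicon.setdefault(word, len(lexicon))       # assign wordIDs lazily
            barrel = barrels[wid % NUM_BARRELS]
            barrel[doc_id].append(Hit(wid, pos, font, caps))   # grouped by docID

    lexicon: dict[str, int] = {}
    barrels = [defaultdict(list) for _ in range(NUM_BARRELS)]
    index_document(7, [("Google", 3, True), ("search", 2, False)], lexicon, barrels)
    print(lexicon, barrels[0][7])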

Components of Google (cont)
The URLresolver reads the anchors file, converts relative URLs into absolute URLs, and then into docIDs.
– It puts the anchor text into the forward index, associated with the docID the anchor points to
– It generates a database of links, which are pairs of docIDs (the page containing the link and the page it points to)
– The links database is used to compute PageRanks for all documents

Major Data Structures
BigFiles: virtual files spanning multiple file systems, addressed by 64-bit integers.
Repository: the full HTML of every web page; its size at the time was about 150 GB.
Document index: information about each document; a fixed-width ISAM (Indexed Sequential Access Method) index, ordered by docID.
Hit lists: the occurrences of a particular word in a particular document, including position, font, and capitalization.
Forward index: from documents to words, stored in a number of barrels.
Inverted index: from words to documents.

Crawling the Web
Google runs multiple crawlers (typically three).
Each crawler keeps roughly 300 connections open at once.
At peak speed, the system crawls over 100 web pages per second using four crawlers (about 600 KB of data per second).
A major performance stress is DNS lookup, so each crawler keeps its own DNS cache to improve performance.
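
A minimal sketch of a per-crawler DNS cache, assuming the standard library resolver; a real crawler would also handle TTL expiry and negative caching.

    # Per-crawler DNS cache: resolve each hostname at most once per crawler process.
    import socket

    class DnsCache:
        def __init__(self):
            self._cache: dict[str, str] = {}

        def resolve(self, host: str) -> str:
            if host not in self._cache:                    # only the first lookup hits DNS
                self._cache[host] = socket.gethostbyname(host)
            return self._cache[host]

    cache = DnsCache()
    # ip = cache.resolve("www.bucknell.edu")   # later calls for the same host are served from memory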

References
Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro, March-April 2003.
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", Proceedings of SOSP '03, 2003.
Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Proceedings of the Seventh World Wide Web Conference, 1998.