The Google Cluster Architecture


1 The Google Cluster Architecture
Presented by Fatma Canan Pembe

2 PURPOSE
- To give an overview of the computer architecture of Google
  - one of the best-known and most widely used search engines today
- To show how it achieves such processing power under such a heavy workload

3 OUTLINE
- Introduction
- Cluster architectures
- Google architecture overview
- Serving a Google query
- Design principles of Google clusters
- Leveraging commodity parts
- The power problem
- Hardware-level characteristics
- Memory system
- Summary

4 INTRODUCTION
- Search engines require large amounts of computation per request
- A single query on Google (on average)
  - reads hundreds of megabytes of data
  - consumes tens of billions of CPU cycles
- A peak request stream on Google
  - thousands of queries per second
  - requires an infrastructure comparable in size to the largest supercomputer installations

5 INTRODUCTION (Cont.)
- Google combines more than 15,000 commodity-class PCs
  - instead of a smaller number of high-end servers
- Most important factors that influenced the design
  - energy efficiency
  - price-performance ratio
- The Google application affords easy parallelization
  - different queries can run on different processors
  - a single query can use multiple processors, because the overall index is partitioned

6 CLUSTER ARCHITECTURES
- A cluster: a collection of independent computers using a switched network to provide a common service
- Many mainframe applications (databases, file servers, Web servers, simulations, etc.) run on "loosely coupled" machines rather than shared-memory machines
- Clusters often need to be highly available, requiring fault tolerance and repairability
- Clusters often need to scale

7 DISADVANTAGES OF CLUSTERS
- Cost of administration
  - administering a cluster of N machines is like administering N independent machines
  - administering a shared-address-space multiprocessor with N processors is like administering 1 big machine
- Clusters are usually connected via the I/O bus, whereas multiprocessors are usually connected via the memory bus
- A cluster of N machines has N independent memories and N copies of the OS
  - a shared-address-space multiprocessor lets 1 program use almost all of the memory

8 ADVANTAGES OF CLUSTERS
- Error isolation: separate address spaces limit the contamination an error can cause
- Repair: easier to replace a machine without bringing down the system than in a shared-memory multiprocessor
- Scale: easier to expand the system without bringing down the application that runs on top of the cluster
- Cost: large-scale machines ship in low volume, so few machines spread the development costs
  - clusters instead leverage high-volume, off-the-shelf switches and computers
- Amazon, AOL, Google, Hotmail, and Yahoo rely on clusters of PCs to provide services used by millions of people every day

9 GOOGLE ARCHITECTURE OVERVIEW
- Reliability is provided at the software level rather than through server-class hardware
  - so commodity PCs can be used to build the cluster at low cost
- Design for best aggregate throughput rather than peak single-server response time
- The result: a reliable computing infrastructure built from clusters of unreliable commodity PCs

10 SERVING A GOOGLE QUERY
- When the user enters a query
  - the user's browser first performs a Domain Name System (DNS) lookup to map www.google.com to a particular IP address
- Multiple Google clusters are distributed worldwide
  - each cluster has a few thousand machines to handle the query traffic

11 SERVING A GOOGLE QUERY (Cont.)
- The geographically distributed setup protects against catastrophic failures
- A DNS-based load-balancing system selects a cluster according to
  - the user's geographic proximity
  - the available capacity at the various clusters
- The user's browser sends an HTTP request to one of the clusters
  - thereafter, processing is entirely local to that cluster
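
The DNS-level cluster selection described above can be sketched in a few lines. This is an illustrative sketch only (Python is not part of the deck; the cluster names, distances, and load figures are invented):

```python
# Hypothetical sketch of DNS-based cluster selection: prefer the nearest
# cluster that still has spare serving capacity. All data is invented.

CLUSTERS = [
    # (name, distance from user as RTT in ms, fraction of capacity in use)
    ("us-east", 20, 0.95),
    ("us-west", 70, 0.60),
    ("europe", 110, 0.40),
]

def pick_cluster(clusters, max_load=0.90):
    """Return the closest cluster whose load is below max_load."""
    for name, rtt, load in sorted(clusters, key=lambda c: c[1]):
        if load < max_load:
            return name
    # All clusters overloaded: fall back to the least-loaded one.
    return min(clusters, key=lambda c: c[2])[0]

print(pick_cluster(CLUSTERS))  # -> us-west (us-east is closer but at 95% load)
```

The real system would additionally refresh load figures continuously and answer through DNS records with short TTLs; those details are omitted here.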

12 SERVING A GOOGLE QUERY (Cont.)
- A hardware-based load balancer in each cluster
  - monitors the available Google Web Servers (GWSs)
  - performs local load balancing of requests across them
- A GWS machine coordinates the query execution
  - and returns the results as an HTML response

13 SERVING A GOOGLE QUERY (Cont.)

14 SERVING A GOOGLE QUERY (Cont.)
Query execution phases:
1. The index servers determine the relevant documents by consulting an inverted index
   - challenging due to the large amount of data
     - raw documents -> several tens of terabytes of data
     - inverted index -> many terabytes of data
   - fortunately, the search is highly parallelizable by dividing the index into pieces (index shards)
   - each shard is served by a pool of machines, improving reliability
   - a load balancer is employed for each pool
2. The document servers determine the actual URLs and query-specific summaries of the documents found
   - again, the documents are divided into shards
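
The shard-based search of phase 1 can be sketched as a toy model: split the postings across shards, query every shard, and merge the hits by score. This is a hypothetical illustration, not Google's implementation; the shard count, scoring, and data are invented:

```python
# Toy sharded inverted index: each shard maps term -> list of (doc_id, score),
# documents are assigned to shards by doc ID, and a query fans out to all
# shards before the hits are merged into one ranked list.

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def add_posting(term, doc_id, score):
    shards[doc_id % NUM_SHARDS].setdefault(term, []).append((doc_id, score))

def search(term, k=3):
    """Fan the query out to every shard, then merge by descending score."""
    hits = []
    for shard in shards:               # in production: parallel RPCs per pool
        hits.extend(shard.get(term, []))
    return sorted(hits, key=lambda h: -h[1])[:k]

add_posting("cluster", 1, 0.9)
add_posting("cluster", 2, 0.5)
add_posting("cluster", 7, 0.8)
print(search("cluster"))  # -> [(1, 0.9), (7, 0.8), (2, 0.5)]
```

Because each shard is searched independently, adding shards (and machines per shard pool) scales the search almost trivially, which is the point the slide makes.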

15 DESIGN PRINCIPLES OF GOOGLE CLUSTERS
- Software-level reliability
  - no fault-tolerant hardware features, e.g. redundant power supplies or a redundant array of inexpensive disks (RAID)
  - instead, failures are tolerated in software
  - replication is used for better request throughput and availability
- Price/performance beats peak performance
  - buy the CPUs giving the best performance per unit price
  - not the CPUs with the best absolute performance
- Using commodity PCs reduces the cost of computation
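
The "tolerate failures in software via replication" principle can be sketched as a replica pool with failover. All names and the (random) replica-selection policy here are invented for illustration:

```python
import random

# Sketch of software-level fault tolerance: a shard is served by a pool of
# replicas; a failed replica is simply skipped, and picking a random live
# replica also spreads load, giving the throughput benefit the slide notes.

class ReplicaPool:
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.down = set()

    def mark_down(self, replica):
        self.down.add(replica)

    def serve(self, query):
        live = [r for r in self.replicas if r not in self.down]
        if not live:
            raise RuntimeError("all replicas failed")
        replica = random.choice(live)
        return f"{replica} answered {query!r}"

pool = ReplicaPool(["shard0-a", "shard0-b", "shard0-c"])
pool.mark_down("shard0-a")          # the failure is absorbed in software
print(pool.serve("google cluster"))
```

No RAID or redundant power supply is involved: losing a machine just shrinks the pool until the machine is repaired, exactly the trade the slide describes.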

16 FIRST GOOGLE SERVER RACK
- In the Computer History Museum (from 1999)
- Each tray contains eight 22-GB hard drives and one power supply

17 LEVERAGING COMMODITY PARTS
- Google's racks consist of 40 to 80 x86-based servers
- Server components are similar to those of a mid-range desktop PC, except for the larger disk drives
- Servers range from single-processor 533-MHz Intel Celeron machines to dual 1.4-GHz Intel Pentium III machines
- Servers on each rack are interconnected via 100-Mbps Ethernet
- All racks are interconnected via a gigabit switch

18 LEVERAGING COMMODITY PARTS (Cont.)
- Selection criterion: cost per query
  - [capital expense (with depreciation) + operating costs (hosting, system administration, repairs)] / performance
- Inexpensive PC-based clusters vs. high-end multiprocessor servers
  - rack -> 176 2-GHz Xeon CPUs + 176 Gbytes of RAM + 7 Tbytes of disk space = $278,000
  - server -> 8 2-GHz Xeon CPUs + 64 Gbytes of RAM + 8 Tbytes of disk space = $758,000
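
A quick check of the ratios behind these prices (the 176-CPU rack figure is derived from the deck's own "22 times fewer CPUs" comparison against the 8-CPU server; performance and operating-cost terms of the cost-per-query formula are not quantified on the slide, so only the capital-cost side is computed):

```python
# Back-of-the-envelope comparison of the commodity rack vs. the high-end
# multiprocessor server, using only numbers stated or derivable in the deck.

rack_cost, server_cost = 278_000, 758_000
rack_cpus, server_cpus = 176, 8     # 22x fewer CPUs on the server side
rack_ram_gb, server_ram_gb = 176, 64

price_ratio = server_cost / rack_cost          # about 2.7, i.e. "about 3x"
cpu_ratio = rack_cpus // server_cpus           # 22
ram_ratio = rack_ram_gb / server_ram_gb        # 2.75, i.e. "about 3x"

print(f"server is {price_ratio:.1f}x more expensive")
print(f"rack has {cpu_ratio}x the CPUs and {ram_ratio:.2f}x the RAM")
```

The point of the slide follows directly: per dollar, the rack buys roughly an order of magnitude more CPU capacity.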

19 LEVERAGING COMMODITY PARTS (Cont.)
- The multiprocessor server is about 3 times more expensive, yet has
  - 22 times fewer CPUs
  - 3 times less RAM
- The cost premium of the high-end server buys
  - higher interconnect bandwidth
  - hardware reliability
  - neither of which is necessary in Google's highly redundant architecture

20 THE POWER PROBLEM
- A mid-range server with dual 1.4-GHz Pentium III processors draws 90 W of DC power
  - 55 W for the two CPUs
  - 10 W for a disk drive
  - 25 W for DRAM and the motherboard
- At the typical efficiency of an ATX power supply (75%), that means 120 W of AC power per server
  - and roughly 10 kW per rack

21 THE POWER PROBLEM (Cont.)
- A rack fits in about 25 ft2 of space
- Corresponding power density: 400 W/ft2
  - with higher-end processors: 700 W/ft2
- Typical power density for commercial data centers: between 70 and 150 W/ft2
  - much lower than what a PC cluster requires
- Special cooling or additional space is required to bring the power density down to a tolerable level
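
The power arithmetic on these two slides can be spelled out end to end (assuming 80 servers per rack, the high end of the "40 to 80" range quoted earlier in the deck):

```python
# Power chain from the slides: DC draw per server, AC draw after power-supply
# losses, rack total, and power density. The 80 servers/rack figure is an
# assumption taken from the deck's "40 to 80 servers" range.

dc_watts = 55 + 10 + 25              # CPUs + disk + DRAM/motherboard = 90 W DC
psu_efficiency = 0.75                # typical ATX power supply
ac_watts = dc_watts / psu_efficiency         # 120 W of AC power per server

servers_per_rack = 80
rack_watts = ac_watts * servers_per_rack     # 9600 W, i.e. roughly 10 kW

rack_area_ft2 = 25
density = rack_watts / rack_area_ft2         # 384 W/ft^2, i.e. roughly 400
print(ac_watts, rack_watts, density)
```

At 70 to 150 W/ft2 of cooling capacity, a data center can absorb only a fraction of this density, which is why the slides call for special cooling or extra floor space.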

22 THE POWER PROBLEM (Cont.)
- Reduced-power servers could be used, but they
  - must come without a performance penalty
  - must not be considerably more expensive

23 HARDWARE-LEVEL CHARACTERISTICS
- Architectural characteristics of the Google query-serving application were examined to determine which hardware platforms give the best price/performance
- The index server most heavily impacts the overall price/performance

24 INSTRUCTION LEVEL MEASUREMENTS ON THE INDEX SERVER
(On a 1-GHz dual-processor Pentium III system)

  Characteristic                  Value
  Cycles per instruction          1.1
  Ratios (percentage):
    Branch mispredict             5.0
    Level 1 instruction miss*     0.4
    Level 1 data miss*            0.7
    Level 2 miss*                 0.3
    Instruction TLB miss*         0.04
    Data TLB miss*                -
  * Cache and TLB ratios are per instruction retired

25 HARDWARE-LEVEL CHARACTERISTICS
- The CPI is moderately high, considering that the Pentium III can issue 3 instructions per cycle
- Reason: a significant number of difficult-to-predict branches
  - traversal of dynamic data structures
  - data-dependent control flow
- On the newer Pentium 4 processor, the same workload's CPI is nearly twice as high, with approximately the same branch prediction performance
  - even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic
- The Google workload does not contain much exploitable instruction-level parallelism (ILP)
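
The "difficult-to-predict branches" point can be illustrated with a classic index-server inner loop, posting-list intersection: which branch is taken at each step depends entirely on the document IDs being compared, so a hardware branch predictor has little pattern to learn. A toy sketch, not Google's code:

```python
# Merge-style intersection of two sorted posting lists. Every iteration takes
# one of three branches, chosen purely by the data read, which is exactly the
# data-dependent control flow the slide blames for branch mispredicts.

def intersect(postings_a, postings_b):
    out, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:      # data-dependent branch
            out.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:     # data-dependent branch
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 4, 7, 9], [2, 4, 9, 11]))  # -> [4, 9]
```

With essentially random doc IDs, each comparison is close to a coin flip for the predictor, which is consistent with the 5% mispredict ratio in the table.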

26 HARDWARE-LEVEL CHARACTERISTICS (Cont.)
- Ways to exploit the available parallelism
  - the trivially parallelizable computation in query processing requires little communication
    - already exploited at the cluster level, using a large number of inexpensive nodes
  - thread-level parallelism at the microarchitecture level
    - simultaneous multithreading (SMT) systems
    - chip multiprocessor (CMP) systems

27 HARDWARE-LEVEL CHARACTERISTICS (Cont.)
- Simultaneous multithreading (SMT)
  - experiments with a dual-context (SMT) Intel Xeon processor show more than a 30% performance improvement over a single-context setup
  - this is at the upper bound of the improvements Intel reports for its SMT implementation

28 HARDWARE-LEVEL CHARACTERISTICS (Cont.)
- Chip multiprocessor (CMP) architectures, such as Hydra and Piranha
  - multiple (four to eight) simpler, in-order, short-pipeline cores replace a single complex high-performance core
  - the penalties of in-order execution are minor, because the Google application has little ILP anyway
  - shorter pipelines reduce or eliminate branch mispredict penalties
  - the available thread-level parallelism allows near-linear speedup with the number of cores
  - a shared L2 cache of reasonable size speeds up inter-processor communication
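
The thread-level parallelism that SMT and CMP designs exploit can be sketched at the software level: each shard search is an independent task, so throughput scales with the number of hardware threads or cores. The shard data and worker count here are invented:

```python
from concurrent.futures import ThreadPoolExecutor

# Independent per-shard searches mapped onto a pool of worker threads.
# On an SMT or CMP processor, these threads run on separate hardware
# contexts/cores, which is the parallelism the slide describes.

shards = [
    {"google": [1, 2]},
    {"google": [5]},
    {"cluster": [7]},
    {"google": [9, 10]},
]

def search_shard(shard, term):
    return shard.get(term, [])

def parallel_search(term, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(search_shard, shards, [term] * len(shards)))
    return sorted(doc for hits in results for doc in hits)

print(parallel_search("google"))  # -> [1, 2, 5, 9, 10]
```

Because the tasks share no mutable state, adding cores gives near-linear speedup, matching the slide's claim for CMP designs.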

29 MEMORY SYSTEM
- (Table: main memory system performance parameters)
- Good performance for the instruction cache and instruction translation look-aside buffer, thanks to the relatively small inner-loop code size
- Index data blocks
  - show no temporal locality, due to the sheer size of the index data and the unpredictability of access patterns
  - but do benefit from spatial locality, so hardware prefetching or larger cache lines can help
- Good overall cache hit ratios, even for relatively modest cache sizes

30 INSTRUCTION LEVEL MEASUREMENTS ON THE INDEX SERVER
(Table repeated from slide 24)

31 MEMORY SYSTEM (Cont.)
- Memory bandwidth does not appear to be a bottleneck
- A suitable memory system for this workload has
  - a relatively modest-sized L2 cache
  - short L2 cache and memory latencies
  - longer (perhaps 128-byte) cache lines

32 SUMMARY
- Google's infrastructure: a massively large cluster of inexpensive machines
  - vs. a smaller number of large-scale shared-memory machines
- Shared-memory machines are useful when
  - the computation-to-communication ratio is low
  - communication patterns or data partitioning are dynamic or hard to predict
  - the total cost of ownership is much greater than the hardware costs (due to management overhead and software licensing prices)
  - in these cases, they justify their high prices
- None of these requirements applies at Google

33 SUMMARY (Cont.)
- Google partitions index data and computation
  - to minimize communication
  - to evenly balance the load across servers
- Produces its software in-house
- Minimizes system management overhead through extensive automation and monitoring
  - so hardware costs become the important factor
- Deploys many small multiprocessors
  - so faults affect smaller pieces of the system
  - vs. large-scale shared-memory machines, which do not handle individual hardware component or software failures well
    - most fault types there cause a full system crash

34 SUMMARY (Cont.)
- There appear to be few applications like Google that require many thousands of servers and petabytes of storage
- However, many applications share its key characteristics
  - a focus on price/performance
  - the ability to run on servers without private state (so servers can be replicated), allowing a PC-based cluster architecture
  - e.g. high-volume Web servers, or application servers that are computationally intensive but essentially stateless

35 SUMMARY (Cont.)
- At Google's scale, some limits of massive server parallelism become apparent, e.g.:
  - the limited cooling capacity of commercial data centers
  - the less-than-optimal fit of current CPUs for throughput-oriented applications
- Nevertheless, using inexpensive PCs has increased the amount of computation that can be afforded per query
  - thus helping to improve users' search experience

36 THANK YOU

