Key Components: Datacenter Architecture and Cloud Management SW
- Cluster: compute, storage
- Network for the DC
- Storage
- Virtualization
- Cloud management SW
- Resource usage metering
- Automated systems management
- Privacy and security measures
- Cloud programming environment
What's a Cluster?
Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:
- To improve performance (speed)
- To improve throughput
- To improve service availability (high-availability clusters)
Built from commodity off-the-shelf components, a cluster is often more cost-effective than a single machine of comparable speed or availability.
Highly Scalable Clusters
- High-performance cluster (aka compute cluster)
  - A form of parallel computer, which aims to solve problems faster by using multiple compute nodes
  - For parallel efficiency, the nodes are often closely coupled in a high-throughput, low-latency network
- Server cluster and datacenter
  - Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes
Beowulf Cluster
- A cluster of inexpensive PCs for low-cost personal supercomputing
- Based on commodity off-the-shelf components:
  - PCs running a Unix-like OS (BSD, Linux, or OpenSolaris)
  - Interconnected by an Ethernet LAN
- A head node plus a group of compute nodes
  - The head node controls the cluster and serves files to the compute nodes
- Standard, free and open-source software
- Programming in MPI or MapReduce (see the MPI sketch below)
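A minimal sketch of the message-passing style used on such clusters, assuming the mpi4py Python bindings are installed; the file name and launch command are illustrative:

    # hello_mpi.py -- run with e.g. `mpiexec -n 4 python hello_mpi.py`
    from mpi4py import MPI

    comm = MPI.COMM_WORLD          # communicator spanning all launched processes
    rank = comm.Get_rank()         # this process's id within the job
    size = comm.Get_size()         # total number of processes in the job

    if rank == 0:
        # the rank-0 process (e.g. on the head node) gathers greetings
        for src in range(1, size):
            print(comm.recv(source=src))
    else:
        comm.send(f"hello from rank {rank} of {size}", dest=0)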
Why Clustering Today
- Powerful nodes (CPU, memory, storage)
  - Today's PC is yesterday's supercomputer
  - Multi-core processors
- High-speed networks
  - Gigabit Ethernet (56% of the Top500 as of Nov 2008)
  - InfiniBand System Area Network (SAN) (24.6%)
- Standard tools for parallel/distributed computing, and their growing popularity
  - MPI, PBS, etc.
  - MapReduce for data-intensive computing
Major Issues in Cluster Design
- Programmability
  - Sequential vs parallel programming
  - MPI, DSM, DSA: hybrids of multithreading and MPI
  - MapReduce
- Cluster-aware resource management
  - Job scheduling (e.g. PBS)
  - Load balancing, data locality, communication optimization, etc.
- System management
  - Remote installation, monitoring, diagnosis
  - Failure management, power management, etc.
Multicore Architecture
- Combines two or more independent cores (normally CPUs) into a single package
- Supports multitasking and multithreading in a single physical package
Multicore is Everywhere
- Dual-core commonplace in laptops; quad-core in desktops; dual quad-core in servers
- All major chip manufacturers produce multicore CPUs
  - Sun Niagara (8 cores, 64 concurrent threads)
  - Intel Xeon (6 cores)
  - AMD Opteron (4 cores)
Multithreading on Multicore
(Figure: David Geer, IEEE Computer, 2007)
Interaction with the OS
- The OS perceives each core as a separate processor
- The OS scheduler maps threads/processes to different cores
- Most major OSes support multicore today: Windows, Linux, Mac OS X, ...
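A small sketch of how this looks from user space, using Linux-specific Python calls (illustrative only; this inspects and constrains the scheduler's core mapping, it is not how the scheduler itself is implemented):

    import os

    print(os.cpu_count())            # number of logical cores the OS exposes
    print(os.sched_getaffinity(0))   # set of cores this process may run on

    # Pin the current process to cores 0 and 1; the OS scheduler will then
    # map its threads onto those two cores only.
    os.sched_setaffinity(0, {0, 1})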
Cluster Interconnect
- The network fabric connecting the compute nodes
- The objective is to strike a balance between:
  - Processing power of the compute nodes
  - Communication ability of the interconnect
- A more specialized LAN, providing many opportunities for performance optimization
  - Switch at the core
  - Latency vs bandwidth
Ethernet Switch: Allows Multiple Simultaneous Transmissions
- Hosts have a dedicated, direct connection to the switch
- Switches buffer packets
- The Ethernet protocol is used on each incoming link, but with no collisions; full duplex
  - Each link is its own collision domain
- Switching: A-to-A' and B-to-B' can proceed simultaneously, without collisions
  - Not possible with a dumb hub
(Figure: a switch with six interfaces, 1-6, connecting hosts A, B, C, A', B', C')
Switch Table
- Q: How does the switch know that A' is reachable via interface 4, and B' via interface 5?
- A: Each switch has a switch table; each entry holds:
  (MAC address of host, interface to reach host, time stamp)
- Looks like a routing table!
- Q: How are entries created and maintained in the switch table?
  - Something like a routing protocol?
Switch: Self-Learning
- The switch learns which hosts can be reached through which interfaces
  - When a frame is received, the switch "learns" the location of the sender: the incoming LAN segment
  - It records the sender/location pair in the switch table
(Figure: a frame with Source A, Dest A' arrives; the switch table, initially empty, gains the entry: MAC addr A, interface 1, TTL 60)
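A toy Python sketch of this learn-then-forward behavior (the LearningSwitch class and its frame fields are hypothetical, for illustration):

    import time

    class LearningSwitch:
        """Toy model of a self-learning Ethernet switch."""
        def __init__(self, num_ports, ttl=60):
            self.table = {}              # MAC address -> (interface, timestamp)
            self.num_ports = num_ports
            self.ttl = ttl               # entry lifetime in seconds

        def receive(self, src_mac, dst_mac, in_port):
            # learn: the sender is reachable via the interface the frame arrived on
            self.table[src_mac] = (in_port, time.time())
            entry = self.table.get(dst_mac)
            if entry is not None and time.time() - entry[1] < self.ttl:
                return [entry[0]]        # destination known: forward on one port
            # destination unknown or entry expired: flood all other ports
            return [p for p in range(self.num_ports) if p != in_port]

    sw = LearningSwitch(num_ports=6)
    print(sw.receive("A", "A'", 0))      # flood: A' not yet learned
    print(sw.receive("A'", "A", 3))      # returns [0]: A was learned above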
Interconnecting Switches
- Switches can be connected together
- Q: When sending from A to G, how does S1 know to forward the frame destined to G via S4 and S3?
- A: Self-learning! (works exactly the same as in the single-switch case)
- Q: What about latency and bandwidth for a large-scale network?
(Figure: hosts A-I attached to switches S1-S4)
What Characterizes a Network?
- Topology (what)
  - The physical interconnection structure of the network graph
  - Regular vs irregular
- Routing algorithm (which)
  - Restricts the set of paths that messages may follow
  - Table-driven, or routing-algorithm based
- Switching strategy (how)
  - How the data in a message traverses a route
  - Store-and-forward vs cut-through
- Flow control mechanism (when)
  - When a message, or portions of it, traverse a route
  - What happens when traffic is encountered?
The interplay of all of these determines performance.
Tree: An Example
- Diameter and average distance are logarithmic
  - Fixed-degree k-ary tree, height d = log_k N
  - An address is a d-vector of radix-k coordinates describing the path down from the root
- Routing: up to the common ancestor, then down
  - R = B xor A
  - Let i be the position of the most significant 1 in R; route up i+1 levels
  - Then route down in the direction given by the low i+1 bits of B (see the sketch below)
- Bandwidth and bisection bandwidth?
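A sketch of the XOR route computation for the binary case (k = 2), where leaf addresses are bit strings read as the path down from the root; function and variable names are made up for illustration:

    def tree_route(a: int, b: int) -> list:
        """Route from leaf a to leaf b in a binary tree."""
        r = a ^ b
        if r == 0:
            return []                        # already at the destination
        i = r.bit_length() - 1               # position of the most significant 1
        hops = [("up", i + 1)]               # climb to the common ancestor
        for level in range(i, -1, -1):       # descend, steered by b's low i+1 bits
            hops.append(("down", (b >> level) & 1))
        return hops

    print(tree_route(0b0010, 0b0111))        # up 3 levels, then down 1, 1, 1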
Bandwidth
- For a switch with N ports: point-to-point bandwidth per port
- Bisection bandwidth of an interconnect fabric: the rate at which data can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes
  - For a non-blocking switch with N ports, the bisection bandwidth = N x the point-to-point bandwidth
- An oversubscribed switch delivers less bisection bandwidth than a non-blocking one, but is cost-effective
  - It scales the bandwidth per node up to a point, after which adding nodes decreases the available bandwidth per node
  - Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth
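These two definitions as executable functions (names and example values are illustrative):

    def bisection_bw_nonblocking(ports, ptp_gbps):
        # non-blocking N-port switch: bisection bw = N x point-to-point bw
        return ports * ptp_gbps

    def oversubscription_ratio(aggregate_host_gbps, bisection_gbps):
        # worst-case achievable aggregate host bw over total bisection bw
        return aggregate_host_gbps / bisection_gbps

    print(bisection_bw_nonblocking(48, 1))   # 48 Gbps for a 48-port GigE switch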
How to Maintain Constant BW per Node?
- A single switch has a limited number of ports
- Multiple switches
  - The link between a pair of switches can become a bottleneck
  - Fast uplinks help
- How to organize multiple switches?
  - Irregular topologies
  - Regular topologies: ease of management
Multidimensional Meshes and Tori
(Figures: 2D mesh, 2D torus, 3D cube)
- d-dimensional array: n = k_{d-1} x ... x k_0 nodes
  - Each node is described by a d-vector of coordinates (i_{d-1}, ..., i_0)
- d-dimensional k-ary mesh: N = k^d, so k = N^(1/d)
  - Each node is described by a d-vector of radix-k coordinates
- d-dimensional k-ary torus (or k-ary d-cube)?
Packet Switching Strategies
- Store-and-forward (SF)
  - Move the entire packet one hop toward the destination
  - Buffer it until the next hop is permitted
- Virtual cut-through and wormhole
  - Pipeline the hops: the switch examines the header, decides where to send the message, and starts forwarding it immediately
  - Virtual cut-through: buffer the message on blockage
  - Wormhole: leave the message spread through the network on blockage
SF vs WH (VCT) Switching
- Unloaded latency: h(n/b + D) vs n/b + hD, where
  - h: distance (number of hops)
  - n: size of the message
  - b: bandwidth
  - D: additional routing delay per hop
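The two formulas as code, with made-up example numbers to show the pipelining win:

    def latency_store_and_forward(h, n, b, D):
        # every hop buffers the full message: pay n/b at each of the h hops
        return h * (n / b + D)

    def latency_cut_through(h, n, b, D):
        # hops are pipelined: pay the serialization delay n/b only once
        return n / b + h * D

    # e.g. h=3 hops, n=1500 bytes, b=125e6 bytes/s (1 Gbps), D=1e-6 s per hop
    print(latency_store_and_forward(3, 1500, 125e6, 1e-6))  # 3.9e-05 s
    print(latency_cut_through(3, 1500, 125e6, 1e-6))        # 1.5e-05 s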
Problems with the Cluster Architecture
- Resource fragmentation
  - If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources
- Poor server-to-server connectivity
  - Servers in different layer-2 domains must communicate through the layer-3 portion of the network
- See the papers in the reading list on datacenter network design for proposed approaches
Datacenter as a Computer
- Overview
- Workloads and SW infrastructure
- HW building blocks
- Datacenter basics
- Energy and power efficiency
- Dealing with failures and repairs
Datacenter
- Datacenters are buildings where servers and communication gear are co-located because of their common environmental requirements and physical security needs, and for ease of maintenance
- Traditional DCs typically host a large number of relatively small- or medium-sized applications, each running on dedicated HW infrastructure that is decoupled and protected from other systems in the same facility
Advances in DC Deployment
- Conquering complexity
  - Building racks of servers and complex cooling systems all separately is not efficient
  - Package and deploy them in bigger units: e.g. Microsoft's Generation 4 data centers
Storage
- Global distributed file systems (e.g. Google's GFS)
  - Hard to implement at the cluster level, but lower HW costs and better networking-fabric utilization
  - GFS implements replication across different machines (for fault tolerance); Google deploys desktop-class disk drives instead of enterprise-grade disks
- Network-attached storage (NAS), directly connected to the cluster-level switching fabric
  - Simple to deploy, because it pushes the responsibility for data management and integrity to the NAS appliance
Amazon's Simple Storage Service (S3)
- Write, read, and delete objects containing from 1 byte to 5 terabytes of data each; the number of objects you can store is unlimited
- Each object is stored in a bucket and retrieved via a unique, developer-assigned key
- A bucket can be stored in one of several Regions: US Standard, EU (Ireland), US West (Northern California), and Asia Pacific (Singapore)
- Built to be flexible so that protocol or functional layers can easily be added; the default download protocol is HTTP, and a BitTorrent protocol interface is provided to lower costs for high-scale distribution
- Standard storage is designed to provide 99.999999999% durability and 99.99% availability of objects over a given year, and to sustain the concurrent loss of data in two facilities
- Reduced-redundancy storage is designed to provide 99.99% durability and 99.99% availability of objects over a given year (an average annual expected loss of 0.01% of objects), and to sustain the loss of data in a single facility
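The bucket/key object model above, sketched with the boto3 AWS SDK (a later-generation SDK than these slides assume; the bucket and key names are made up, and credentials/region configuration is omitted):

    import boto3

    s3 = boto3.client("s3")
    s3.put_object(Bucket="example-bucket", Key="docs/report.txt",
                  Body=b"object payload, from 1 byte up to 5 TB")
    obj = s3.get_object(Bucket="example-bucket", Key="docs/report.txt")
    print(obj["Body"].read())                 # read the object back
    s3.delete_object(Bucket="example-bucket", Key="docs/report.txt")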
Amazon's Elastic Block Storage (EBS)
- EBS provides block-level storage volumes for use with EC2 instances; you can create storage volumes from 1 GB to 1 TB that can be mounted as devices by Amazon EC2 instances
- Each storage volume is automatically replicated within the same Availability Zone, preventing data loss due to the failure of any single hardware component
- The latency and throughput of Amazon EBS volumes are designed to be significantly better than those of the Amazon EC2 instance stores in nearly all cases
Network Fabric
- Tradeoff between speed, scale, and cost
  - A switch with 10 times the bisection bandwidth costs about 100 times as much
- Two-level hierarchy (rack and cluster levels)
  - A rack with 40 servers, each with a 1-Gbps port, may have between 4 and 8 1-Gbps uplinks to the cluster switch: an oversubscription factor between 5 and 10 for communication across racks (see the arithmetic below)
- "Fat-tree" networks built of lower-cost commodity Ethernet switches
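The rack example as arithmetic, using the numbers from the slide:

    servers_per_rack = 40
    server_port_gbps = 1
    for uplinks in (4, 8):                    # 4 to 8 uplinks of 1 Gbps each
        host_bw   = servers_per_rack * server_port_gbps   # 40 Gbps of host demand
        uplink_bw = uplinks * 1                            # uplink capacity
        print(uplinks, host_bw / uplink_bw)   # oversubscription factor: 10 and 5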
Next Generation of Network Fabric
- Monsoon
  - Work by Albert Greenberg, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta
  - Designed to scale to 100K+ servers per data center
  - Flat server address space instead of dozens of VLANs
  - Valiant load balancing
  - Allows a mix of apps and dynamic scaling
  - Strong fault-tolerance characteristics
Latency, BW, and Capacity
- Assume 2000 servers (8 GB memory and 1 TB disk each), 40 per rack, connected by 48-port 1-Gbps switches (8 uplinks each)
- Architecture: bridge the gap in a cost-efficient manner
- SW: hide the complexity, exploit data locality
DC vs Supercomputers
- Scale
  - Blue Waters = 40K 8-core "servers"
  - Roadrunner = 13K Cell + 6K AMD servers
  - MS Chicago data center = 50 containers = 100K 8-core servers
- Network architecture
  - Supercomputers: Clos "fat tree" InfiniBand; low-latency, high-bandwidth protocols
  - Data centers: IP-based network, optimized for Internet access
- Data storage
  - Supers: separate data farm; GPFS or other parallel file system
  - DCs: use disk on node + memcache
(Figures: fat-tree network vs standard data center network)
Power Usage
(Figure: distribution of peak power usage in a Google DC, circa 2007)
Workload and SW Infrastructure
- Platform-level SW: present in all individual servers, providing basic server-level services
- Cluster-level infrastructure: the collection of distributed-systems SW that manages resources and provides services at the cluster level
  - MapReduce, Hadoop, etc. (see the word-count sketch below)
- Application-level SW: implements a specific service
  - Online services like web search and Gmail
  - Offline computations, e.g. data analysis, or generating data used by online services, such as building an index
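A minimal single-process word count in the MapReduce style, sketching the programming model that frameworks like Hadoop execute across a whole cluster (this is the model only, not Hadoop's actual API):

    from itertools import groupby
    from operator import itemgetter

    def map_phase(doc_id, text):
        for word in text.split():
            yield (word, 1)                      # emit (key, value) pairs

    def reduce_phase(word, counts):
        yield (word, sum(counts))                # combine all values per key

    def run(docs):
        pairs = [kv for d, t in docs.items() for kv in map_phase(d, t)]
        pairs.sort(key=itemgetter(0))            # the "shuffle" step
        out = {}
        for word, group in groupby(pairs, key=itemgetter(0)):
            for k, v in reduce_phase(word, (c for _, c in group)):
                out[k] = v
        return out

    print(run({"d1": "the cat", "d2": "the dog"}))   # {'cat': 1, 'dog': 1, 'the': 2}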
Examples of Application SW
- Web 2.0 applications
  - Provide a rich user experience, including real-time global collaboration
  - Enable rapid software development
  - Software to scan voluminous Wikipedia edits to identify spam
  - Organize global news articles by geographic location
- Data-intensive workloads based on scalable architectures, such as Google's MapReduce framework
  - Financial modeling, real-time speech translation, web search
- Next-generation rich media, such as virtual worlds, streaming video, web conferencing, etc.
Characteristics
- Ample parallelism at both the data and request levels; the key is not to find parallelism, but to manage and efficiently harness the explicit parallelism
  - Data parallelism arises from large data sets of relatively independent records to be processed
  - Request-level parallelism comes from the hundreds or thousands of requests per second to be served; the requests rarely involve read-after-write sharing of data or synchronization
Characteristics (cont'd)
- Workload churn: isolation from the users of Internet services makes it easy to deploy new SW quickly
  - Google's front-end web server binaries are released on a weekly cycle
  - The core of its search services is reimplemented from scratch every 2-3 years!
- New products and services frequently emerge, and their success with users directly affects the resulting workload mix in the DC
  - Hard to develop a meaningful benchmark
- Not too much for HW architects to do?
  - Count on SW rewrites to take advantage of new HW capabilities?!
- Fault-free operation is challenging, but possible
Basic Programming Concepts
- Data replication, for both performance and availability
- Data partitioning (sharding), for both performance and availability (see the sharding sketch below)
- Load balancing: sharded vs replicated services
- Health checking and watchdog timers, for availability
  - No operation should rely on a given server responding in order to make forward progress
- Integrity checks, for availability
- Application-specific compression
- Eventual consistency w.r.t. replicated data
  - When no updates occur for a long period of time, eventually all updates will propagate through the system and all replicas will become consistent
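A sketch combining two of these ideas, hash-based sharding plus replication (the helper names are made up, and consecutive-shard replica placement is just one simple policy):

    import hashlib

    def shard_of(key: str, num_shards: int) -> int:
        # stable hash so the same key always maps to the same shard
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % num_shards

    def replicas_of(key: str, num_shards: int, num_replicas: int):
        # place copies on consecutive shards so one failure loses no data
        first = shard_of(key, num_shards)
        return [(first + i) % num_shards for i in range(num_replicas)]

    print(replicas_of("user:42", num_shards=8, num_replicas=3))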
Cluster-level SW
- Resource management
  - Map user tasks to HW resources, enforce priorities and quotas, provide basic task-management services
  - Simple allocation, or automated allocation of resources; fair sharing of resources at a finer level of granularity; power/energy considerations
  - Related work at Wayne State:
    - Wei, et al., Resource management for end-to-end quality of service assurance [Wei's PhD dissertation '06]
    - Zhou, et al., Proportional resource allocation in web servers, streaming servers, and e-commerce servers; see cic.eng.wayne.edu for related publications (2002-05)
- HW abstraction and basic services
  - E.g. reliable distributed storage, message passing, cluster-level synchronization (GFS, Dynamo)
Cluster-level SW (cont'd)
- Deployment and maintenance
  - SW image distribution, configuration management, monitoring of service performance and quality, alarm triggers for operators in emergency situations, etc.
  - E.g. Microsoft's Autopilot, Google's health-monitoring infrastructure
  - Related work at Wayne State:
    - Jia, et al., Measuring machine capacity, ICDCS'08
    - Jia, et al., Autonomic VM configuration, ICAC'09
    - Bu, et al., Autonomic Web apps configuration, ICDCS'09
- Programming frameworks
  - Tools like MapReduce improve programmer productivity by automatically handling data partitioning, distribution, and fault tolerance
Monitoring Infrastructure: An Example
- Service-level dashboards
  - Keep track of service quality (w.r.t. the target level)
  - Information must be fresh so that corrective actions can be taken and significant disruption avoided: within seconds, not minutes
  - E.g. how do you measure user-perceived pageview response time, when a page consists of multiple objects (18 in the sample page) and must be measured end-to-end?
Client-Experienced QoS
(Figure: timeline of a pageview: the client sets up a connection and requests the base page, object 1, object 2, ...; after the last object the server waits for new requests, then closes the connection. Request-based QoS vs client-perceived pageview QoS; mirrored HTTP/S traffic from the Internet is fed through a packet-capture performance analyzer.)
Wei and Xu, sMonitor for Measurement of User-Perceived Latency, USENIX'06
Performance Debugging Tools
- Help operators and service designers develop an understanding of the complex interactions among programs, often running on hundreds of servers, so as to determine the root cause of performance anomalies and identify bottlenecks
- Black-box monitoring: observe network traffic among system components and infer causal relationships through statistical inference methods, assuming no knowledge of, or assistance from, applications or SW
  - But the inferred information may not be accurate
- Application/middleware instrumentation systems, like Google's Dapper, require modifying applications or middleware libraries to pass tracing information across machines and across module boundaries; the annotated modules log tracing information to local disks for subsequent analysis
HW Building Blocks
- Cost effectiveness of low-end servers
Performance of Parallel Apps
- Under a model of fixed local computation time, plus a latency penalty for access to global data structures
Parallel Apps Performance
- Performance advantage of a cluster of high-end nodes (128 cores) over a cluster with the same number of cores built from low-end servers (4 cores each) [a 4x to 20x difference in price]
How Small Should a Cluster Node Be?
- Other factors need to be considered:
  - Amount of parallelism
  - Network requirements
  - Smaller servers may lead to lower utilization
  - etc.
Datacenter as a Computer
- Overview
- Workloads and SW infrastructure
- HW building blocks
- Datacenter basics
- Energy and power efficiency
- Dealing with failures and repairs
UPS Systems
- A transfer switch chooses the active power input: utility power or generator power
  - Typically, a generator takes 10-15 seconds to start and assume the full rated load
- Batteries bridge the gap in time when utility power blacks out
  - AC-DC-AC double conversion: when utility power fails, the UPS loses input AC power but retains internal DC power, and thus the AC output power
- Removes voltage spikes and harmonic distortions in the AC feed
- Sized from hundreds of kW up to 2 MW
Power Distribution Units
- A PDU takes the UPS output (typically 200-480 V) and breaks it up into the many 110- or 220-V circuits that feed the actual servers on the floor
  - Each circuit is protected by its own breaker
- A typical PDU handles 75-225 kW of load, whereas a typical circuit handles 20 or 30 A at 110-220 V (a maximum of about 6 kW)
- PDUs provide additional redundancy (at the circuit level)
Datacenter Cooling Systems
- CRAC units (computer-room air conditioning)
- Water-based free cooling
Energy Efficiency
- DCPE (datacenter performance efficiency): the ratio of the amount of computational work to the total energy consumed
- Total energy consumed:
  - Power usage effectiveness (PUE): the ratio of total building power to IT power (currently 1.5 to 2.0)
Energy Efficiency (cont'd)
- Power usage effectiveness (PUE)
- Server PUE (SPUE): the ratio of total server input power to its useful power (consumed by components like the motherboard, disks, CPUs, DRAM, I/O boards, etc.)
  - Useful power excludes losses in power supplies, fans, etc.
  - Currently, SPUE is between 1.6 and 1.8
  - With better VRMs (voltage regulator modules), SPUE can be reduced to 1.2
- See the PUE/SPUE arithmetic below
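A quick composition of the two ratios, with illustrative wattages (the 1.5 and 1.6 values are from the ranges cited above):

    building_power = 1_500_000   # W drawn by the whole facility (illustrative)
    it_power       = 1_000_000   # W delivered to the IT equipment
    pue  = building_power / it_power     # 1.5, within the 1.5-2.0 range cited
    spue = 1.6                           # server-level overhead (1.6-1.8 range)
    useful_fraction = 1 / (pue * spue)   # power reaching the electronics
    print(f"{useful_fraction:.0%} of utility power reaches components")  # ~42%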
Measuring Power Efficiency: Benchmarks
- Green500, in high-performance computing
- JouleSort: measures the total system energy to perform an out-of-core sort
- SPECpower_ssj2008: computes the performance-to-power ratio of a system running a typical business application on an enterprise Java platform
Power Efficiency: An Example
- SPECpower_ssj2008 on a server with a single-chip 2.83-GHz quad-core Intel Xeon, 4 GB of memory, and one 7.2k-RPM 3.5'' SATA disk
Activity Profile of Google Servers
- A sample of 5000 servers over a period of 6 months
- Most of the time, servers run at 10-50% utilization
Energy-Proportional Computing
- Humans at rest consume as little as 70 W, while being able to sustain peaks of 1 kW+ for tens of minutes
(Figure: energy consumption profile for an adult male)
Causes of Poor Energy Proportionality
- The CPU used to be the dominant power consumer (60%); currently it is slightly lower than 50% at peak, and drops to 30% at low activity levels
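A toy linear power model showing why non-proportional servers are wasteful exactly in the 10-50% utilization band where servers spend most of their time (the idle/peak wattages and the linear shape are illustrative assumptions):

    def power(u, idle_w=175, peak_w=350):
        # watts at utilization u in [0, 1]: half of peak is burned even when idle
        return idle_w + (peak_w - idle_w) * u

    for u in (0.1, 0.3, 0.5, 1.0):
        eff = u * power(1.0) / power(u)      # work per watt, relative to peak
        print(f"u={u:.0%}: {power(u):.0f} W, {eff:.0%} of peak efficiency")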
Energy-Proportional Computing: How
- Hardware components, for example:
  - CPU: dynamic voltage scaling
  - High-speed disk drives spend 70% of their total power simply keeping the platters spinning
    - Need smaller rotational speeds, smaller platters, ...
- Power distribution and cooling
SW Role: Power Management
- Smart use of the power-management features in existing HW, low-overhead inactive or active low-power modes, and power-friendly task scheduling, to enhance the energy proportionality of HW systems
- Two challenges:
  - Encapsulation in lower-level modules, to hide the additional infrastructure complexity
  - Performance robustness: minimizing the performance variability caused by power-management tools
- Related work at Wayne State
  - Zhong, et al., "System-wide energy minimization for hard real-time tasks," TECS'08
  - Zhong, et al., "Energy-aware modeling and scheduling for dynamic voltage scaling with statistical real-time guarantee," TC'07; Zhong's PhD dissertation, 2007
  - Gong, Power/performance optimization (ongoing)
Datacenter as a Computer
- Overview
- Workloads and SW infrastructure
- HW building blocks
- Datacenter basics
- Energy and power efficiency
- Dealing with failures and repairs
Basic Concepts
- Failure: a system failure occurs when the delivered service deviates from the specified service, where the service specification is an agreed description of the expected service [Avizienis & Laprie 1986]
- Fault: the root cause of a failure, defined as a defective state in materials, design, or implementation; faults may remain undetected for some time
  - Faults are unobserved defective states
  - Once a fault becomes visible, it is called an error: an error is the "manifestation" of a fault
(Source: Salfner '08)
Challenges of High Service Availability
- High service-availability expectations translate into a high-reliability requirement for the DC
- Faults in HW, SW, and operations are inevitable
  - At Google, about 45% of servers need to reboot at least once over a 6-month window; 95%+ reboot less often than once a month, but the tail is relatively long
  - The average downtime is ~3 hours, implying 99.85% availability
- Determining the appropriate level of reliability is fundamentally a trade-off between the cost of failures and the cost of preventing them
Availability and Reliability
- Availability: a measure of the time that a system was actually usable, as a fraction of the time that it was intended to be usable (the "x nines" measure)
- Yield: the ratio of requests satisfied by the service to the total number of requests
- Reliability metrics (see the short computation below):
  - Time to failure (TTF)
  - Time to repair (TTR)
  - Mean time to failure (MTTF)
  - Mean time to repair (MTTR)
  - Mean time between failures (MTBF)
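The standard relationship among these metrics, as a small computation; the 2000-hour MTTF is an illustrative value chosen so the result matches the ~3-hour-repair, 99.85%-availability figure two slides back:

    def availability(mttf_hours, mttr_hours):
        # fraction of time the system is usable; note MTBF = MTTF + MTTR
        return mttf_hours / (mttf_hours + mttr_hours)

    # failing on average every 2000 h and taking ~3 h to repair:
    print(f"{availability(2000, 3):.4%}")   # ~99.85%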
Interplay of HW, SW, and Operations
- Faults in HW and SW are inevitable, but the endeavor to mask them never halts: e.g. RAID disk drives, ECC memory
- A fault-tolerant SW infrastructure can hide much of the failure complexity from application-level SW, as long as HW faults can be detected and reported to SW in a timely manner
  - E.g. a SW-based RAID system across disk drives residing in multiple machines (as in GFS); MapReduce
  - Flexibility in choosing the level of HW reliability that maximizes overall system cost efficiency (e.g. inexpensive PC-class HW)
  - Simplification of common operational procedures (e.g. HW/SW upgrades)
- With fault-tolerant SW, it is not necessary to keep a server running at all costs; this changes every aspect of the system, from design to operation, and opens opportunities for optimization
Fault Characterization
- Fault-tolerant SW should be based on the fault sources, their statistical characteristics, and the corresponding recovery behavior
- Fault severity:
  - Corrupted: committed data are lost or corrupted
  - Unreachable: the service is down
    - A Google service is no better than 99.9% available when its servers are one of the end points
    - Amazon's Service Level Agreement is 99.95%
  - Degraded: the service is available but in a degraded mode
  - Masked: faults occur but are masked from users by fault-tolerant HW/SW mechanisms
Causes of Service-Level Failures
- Field-data study I, on Internet services: operator-caused or misconfiguration errors are the largest contributors; HW-related faults (server or networking) account for about 10-25% [Oppenheimer '03]
- Field-data study II, on early Tandem systems: HW faults (<10%), SW faults (~60%), operations/maintenance (~20%) [Gray '90]
- Google's observations over a period of 6 weeks
Observations
- SW- and HW-based fault-tolerance techniques do well for independent faults
- SW-, operator-, and maintenance-induced faults have a higher impact on outages, possibly because they are likely to affect multiple systems at once, creating a correlated failure scenario that is hard to overcome
Proactive Failure Management?
- Failure prediction?
  - Predict future machine failures, with low false-positive rates, over a short time horizon
  - Develop models with a good trade-off between accuracy (both in false-positive rate and time horizon) and the penalties involved in failure occurrence and recovery
  - In a DC, the penalty of a failure is low, so a prediction model must be highly accurate to be economically competitive
  - In systems where a crash is disruptive to operations, less accurate prediction models can still be beneficial
- Related work at Wayne State
  - Fu and Xu, Exploring spatial/temporal event correlation for failure prediction, SC'07; Fu's PhD dissertation, 2008
In Summary
- Hardware
  - Building blocks are commodity server-class machines, consumer- or enterprise-grade disk drives, and Ethernet-based networking fabrics
  - Performance of the network fabric and storage subsystems may be more relevant than that of CPUs and memory
- Software
  - Fault-tolerant SW for high service availability (99.99%)
  - Programmability, parallel efficiency, manageability
- Economics: cost effectiveness
  - Power and energy factors
  - Utilization characteristics require systems and components to be energy-efficient across a wide load spectrum, particularly at low utilization levels
In Summary: Key Challenges
- Rapidly changing workloads
  - New applications with a large variety of computational characteristics emerge at a fast pace
  - Need creative solutions from both HW and SW, but few benchmarks are available
- Building balanced systems from imbalanced components
  - Processors have outpaced memory and magnetic storage in performance and power efficiency; more research should shift to the non-CPU subsystems
- Curbing energy usage
  - Power becomes a first-order resource, like speed
  - Performance under a power/energy budget
Key Challenges (cont'd)
- Amdahl's cruel law
  - Speedup = 1 / (f_seq + f_par/n) on an n-node parallel system; the sequential part f_seq limits parallel efficiency, no matter how large n is (see the computation below)
  - Future performance gains will continue to be delivered mostly by more cores or threads, not so much by faster CPUs
  - Is data-level or request-level parallelism enough? Parallel computing beyond MapReduce!
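The law as code, with a 5% sequential fraction chosen for illustration:

    def speedup(f_seq, n):
        # Amdahl's law: f_par = 1 - f_seq is the parallelizable fraction
        return 1.0 / (f_seq + (1.0 - f_seq) / n)

    for n in (10, 100, 1000, 10_000):
        print(n, round(speedup(0.05, n), 1))   # caps near 1/0.05 = 20x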
Reading List
- Barroso and Hölzle, "The Datacenter as a Computer," Morgan & Claypool, 2009.