High-Performance Clusters, part 1: Performance. David E. Culler, Computer Science Division, U.C. Berkeley. PODC/SPAA Tutorial, Sunday, June 28, 1998.


1 High-Performance Clusters, part 1: Performance. David E. Culler, Computer Science Division, U.C. Berkeley. PODC/SPAA Tutorial, Sunday, June 28, 1998

2 Clusters have Arrived … the SPAA / PODC testbed going forward

3 Berkeley NOW http://now.cs.berkeley.edu/

4 NOW's Commercial Version: 240 processors, Active Messages, Myrinet, ...

5 Berkeley Massive Storage Cluster serving Fine Art at www.thinker.org/imagebase/

6 Commercial Scene

7 What's a Cluster? A collection of independent computer systems working together as if they were a single system, coupled through a scalable, high-bandwidth, low-latency interconnect.

8 Outline for Part 1: Why Clusters NOW? What is the Key Challenge? How is it overcome? How much performance? Where is it going?

9 Why Clusters? Capacity, Availability, Scalability, Cost-effectiveness

10 Traditional Availability Clusters: VAX Clusters => IBM Sysplex => Wolf Pack. (Figure: clients connect via an interconnect to Server A and Server B, which share disk arrays A and B.)

11 Why HP Clusters NOW? Time to market => performance. (Chart labels: technology, internet services, engineering lag time, node performance in large systems.)

12 Technology Breakthrough: Killer micro => killer switch; a single-chip building block for scalable networks; high bandwidth, low latency, very reliable.

13 Opportunity: Rethink System Design. Remote memory and processors are closer than local disks! Networking stacks? Virtual memory? File system design? It all looks like parallel programming. Huge demand for scalable, available, dedicated internet servers: big I/O, big compute.

14 Example: Traditional File System. (Figure: clients, each with a small local private file cache, connect over a fast channel (HIPPI) to a server holding a large global shared file cache backed by RAID disk storage.) Expensive, complex, non-scalable, a single point of failure; the server is the bottleneck: server resources are at a premium while client resources are poorly utilized.

15 Truly Distributed File System: VM pages to remote memory; network RAID striping; local cache plus cluster caching; G = Node Comm BW / Disk BW. (Figure: each node's processor and file cache sit on a scalable, low-latency communication network.)
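
The G ratio on this slide can be made concrete with a quick back-of-the-envelope calculation. The C sketch below does so; the 160 MB/s link and 10 MB/s disk figures are assumed, era-typical numbers for illustration, not values from the talk.

/* Hedged back-of-the-envelope for the slide's G ratio; the bandwidth
 * numbers are assumed, era-typical values, not figures from the talk. */
#include <stdio.h>

int main(void) {
    double node_comm_bw = 160.0;  /* MB/s per node over the SAN (assumed)  */
    double disk_bw      = 10.0;   /* MB/s sustained per disk (assumed)     */

    /* G says roughly how many striped remote disks one node's link can feed. */
    double G = node_comm_bw / disk_bw;
    printf("G = %.0f: one node can absorb data striped across ~%.0f disks\n", G, G);
    return 0;
}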

16 Fast Communication Challenge: fast processors and fast networks, but the time is spent in crossing between them. (Figure: killer platform nodes, each with network interface hardware and communication software, attached to the killer switch; time scales span ns, µs, and ms.)

17 Opening: Intelligent Network Interfaces. Dedicated processing power and storage embedded in the network interface: an I/O card today, on chip tomorrow? (Figure: Sun Ultra 170 host (processor, cache, memory) with a Myricom NIC on the 50 MB/s S-Bus I/O bus; the Myricom network itself runs at 160 MB/s.)

18 Our Attack: Active Messages. Request/reply with small active messages (RPC-style); bulk transfer (store & get); a highly optimized communication layer on a range of hardware. (Figure: a request invokes a handler at the receiver, which sends the reply.)
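
To make the request/reply pattern concrete, here is a minimal C sketch of the active-message idea: a small message names a handler that runs on arrival and issues the reply. It is an illustration only; the handler names, the message layout, and the deliver() stand-in for the network are invented for this example and are not the Berkeley Active Messages API.

/* Hedged sketch of the request/reply active-message pattern, not the real
 * AM API: "sending" here just dispatches to the named handler on a
 * simulated remote node. */
#include <stdio.h>

typedef struct { int handler; int arg; int src; } am_msg_t;

static void deliver(int dst, am_msg_t m);      /* stand-in for the network */

/* Request handler runs on the destination node and issues the reply. */
static void incr_request_handler(int node, am_msg_t m) {
    am_msg_t reply = { 1 /* reply handler id */, m.arg + 1, node };
    deliver(m.src, reply);
}

/* Reply handler runs back on the requesting node. */
static void incr_reply_handler(int node, am_msg_t m) {
    printf("node %d: got reply %d from node %d\n", node, m.arg, m.src);
}

static void deliver(int dst, am_msg_t m) {
    switch (m.handler) {                       /* handler table in miniature */
    case 0: incr_request_handler(dst, m); break;
    case 1: incr_reply_handler(dst, m);   break;
    }
}

int main(void) {
    /* Node 0 issues a small request (RPC-style) to node 3. */
    am_msg_t req = { 0 /* request handler id */, 41, 0 /* source node */ };
    deliver(3, req);
    return 0;
}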

19 NOW System Architecture. (Figure: UNIX workstations, each with network interface hardware and communication software, connected by a fast commercial switch (Myrinet); Global Layer UNIX provides resource management, Network RAM, distributed files, and process migration; large sequential apps and parallel apps run on top via Sockets, Split-C, MPI, HPF, vSM.)

20 Cluster Communication Performance

21 LogP. (Figure: processor/memory modules connected by an interconnection network; limited volume: at most L/g messages in flight per processor.) L (latency): time to send a small message between modules. o (overhead): processor time spent sending or receiving a message. g (gap): minimum interval between successive sends or receives (1/rate). P: number of processor/memory modules. Round-trip time: 2 × (2o + L).
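
The round-trip formula and the L/g capacity limit on this slide are easy to plug numbers into. The small C sketch below does exactly that; the parameter values are placeholders chosen for illustration, not measurements from the tutorial.

/* Hedged sketch: a tiny LogP "calculator" for the quantities named on the
 * slide. The parameter values below are assumed example values. */
#include <stdio.h>

int main(void) {
    /* LogP parameters (times in microseconds) -- assumed example values */
    double L = 5.0;   /* latency: small-message transit time between modules    */
    double o = 3.0;   /* overhead: processor time to send or receive a message  */
    double g = 6.0;   /* gap: minimum interval between successive sends (1/rate) */
    int    P = 32;    /* number of processor/memory modules                      */

    /* Round trip = send overhead + latency + receive overhead, both ways. */
    double rtt = 2.0 * (2.0 * o + L);

    /* Peak per-processor message rate is limited by the gap. */
    double msgs_per_sec = 1e6 / g;

    /* Limited volume: at most about L/g messages in flight per processor. */
    double in_flight = L / g;

    printf("P = %d processors\n", P);
    printf("round-trip time = %.1f us\n", rtt);
    printf("peak send rate  = %.0f msgs/s per processor\n", msgs_per_sec);
    printf("limited volume  ~ %.1f msgs in flight per processor\n", in_flight);
    return 0;
}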

22 LogP Comparison: direct, user-level network access; Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), … (Chart axes: latency and 1/BW.)

23 MPI over AM: ping-pong bandwidth

24 MPI over AM: start-up

25 Cluster Application Performance: NAS Parallel Benchmarks

26 NPB2: NOW vs SP2

27 NPB2: NOW vs SGI Origin

28 Where the Time Goes: LU

29 Where the Time Goes: SP

30 LU Working Set: 4 processors; traditional curve for small caches; sharp knee above 256 KB per processor (1 MB total).

31 LU Working Set (CPS scaling): knee at a global cache > 1 MB; the machine experiences the drop in miss rate at a specific aggregate cache size.

32 Application Sensitivity to Communication Performance

33 Adjusting L, o, and g (and G) in situ (Martin et al., ISCA '97). (Figure: each Myrinet workstation runs the AM library between the host Ultra and the LANai network processor.) Δo: stall the Ultra on message write or read. Δg: delay the LANai after message injection (after each fragment for bulk transfers). ΔL: defer marking a message as valid until Rx + ΔL.

34 Calibration

35 Split-C Applications

Program        Description               Input                        P=16    P=32    Interval (us)   Msg type
Radix          Integer radix sort        16M 32-bit keys              13.7    7.8     6.1             msg
EM3D (write)   Electro-magnetic          80K nodes, 40% remote        88.6    38.0    8.0             write
EM3D (read)    Electro-magnetic          80K nodes, 40% remote        230.0   114.0   13.8            read
Sample         Integer sample sort       32M 32-bit keys              24.7    13.2    13.0            msg
Barnes         Hierarchical N-body       1 million bodies             77.9    43.2    52.8            cached read
P-Ray          Ray tracer                1 million pixel image        23.5    17.9    156.2           cached read
Murphi         Protocol verification     SCI protocol, 2 proc         67.7    35.3    183.5           bulk
Connect        Connected components      4M nodes, 2-D mesh, 30%      2.3     1.2     212.6           BSP
NOW-sort       Disk-to-disk sort         32M 100-byte records         127.2   56.9    817.4           I/O
Radb           Bulk-version radix sort   16M 32-bit keys              7.0     3.7     852.7           bulk

36 Sensitivity to Overhead

37 Comparative Impact

38 Sensitivity to bulk BW (1/G)

39 Cluster Communication Performance: Overhead, overhead, overhead (hypersensitive due to increased serialization). Sensitivity to gap reflects bursty communication. Surprisingly latency tolerant. Plenty of room for overhead improvement. How sensitive are distributed systems?
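
A rough way to see why overhead dominates: every small message pays the send and receive overhead on the critical path, while much of any added latency can be overlapped with independent work. The C sketch below encodes that first-order model only as an illustration; the base run time, message count, and overlap fraction are invented, and the real sensitivity curves come from the measurements in Martin et al.

/* Hedged first-order model of the sensitivity results: added overhead is
 * paid on every send and receive and cannot be overlapped, while added
 * latency can often be hidden. All inputs are assumed for illustration. */
#include <stdio.h>

int main(void) {
    double base_time_s   = 10.0;   /* per-processor run time at baseline (assumed)        */
    double msgs_per_proc = 1.0e6;  /* small messages sent per processor (assumed)         */
    double overlap       = 0.8;    /* fraction of extra latency hidden by work (assumed)  */

    for (double delta_us = 0.0; delta_us <= 20.0; delta_us += 5.0) {
        double d = delta_us * 1e-6;
        /* Overhead is charged twice per message (send + receive), never hidden. */
        double t_overhead = base_time_s + msgs_per_proc * 2.0 * d;
        /* Only the un-overlapped part of extra latency shows up in run time. */
        double t_latency  = base_time_s + msgs_per_proc * (1.0 - overlap) * d;
        printf("delta=%4.0f us  slowdown(o)=%.2fx  slowdown(L)=%.2fx\n",
               delta_us, t_overhead / base_time_s, t_latency / base_time_s);
    }
    return 0;
}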

40 Extrapolating to Low Overhead

41 Direct Memory Messaging: a send region and a receive region for each end of the communication channel; writes through the send region land in the remote receive region.
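
As an illustration of the send-region/receive-region idea, the C sketch below has a sender fill its send region, "write through" to the remote receive region (a plain memcpy stands in for the interconnect), and set a valid flag last so the receiver's poll only succeeds once the payload has landed. The data structures and flag protocol are invented for this example, not a description of any particular memory-channel hardware.

/* Hedged sketch of the send-region / receive-region idea: the sender writes
 * the payload, then marks the message valid; the receiver polls the flag.
 * memcpy stands in for the interconnect's write-through. */
#include <stdio.h>
#include <string.h>

#define MSG_BYTES 64

typedef struct {
    volatile int valid;           /* set last, after the payload is written */
    char         data[MSG_BYTES];
} region_t;

static region_t send_region;      /* mapped on the sending node   */
static region_t recv_region;      /* mapped on the receiving node */

static void send_msg(const char *text) {
    strncpy(send_region.data, text, MSG_BYTES - 1);
    /* "Write through" the channel: the NIC would forward these stores. */
    memcpy(recv_region.data, send_region.data, MSG_BYTES);
    recv_region.valid = 1;        /* mark valid only after the data lands */
}

int main(void) {
    send_msg("hello over the memory channel");
    while (!recv_region.valid) { /* receiver polls the valid flag */ }
    printf("received: %s\n", recv_region.data);
    return 0;
}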

42 Direct Memory Interconnects: DEC Memory Channel (3 µs end-to-end, ~1 µs o and L); SCI; SGI; Shrimp (Princeton); 100 MB/s.

43 Scalability, Availability, and Performance: scale disk, memory, and processors independently; a random node serves each query, all nodes search; on a (HW or SW) failure, lose random columns of the index; on overload, lose random rows. (Figure: Inktomi: processor nodes on Myrinet with a Fast Ethernet front end serving a 100-million-document index.)

44 Summary: Performance => generality (see Part 2). From technology "shift" to technology "trend". Cluster communication becoming cheap: Gigabit Ethernet. System area networks becoming commodity: Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun. Improvements in interconnect BW: a gigabyte per second and beyond. Bus connections improving: PCI, ePCI, Pentium II cluster slot, … Operating system out of the way: VIA.

45 Advice: Clusters are cheap, easy to build, flexible, powerful, general purpose, and fun. Everybody doing SPAA or PODC should have one to try out their ideas. You can use the Berkeley NOW through NPACI: www.npaci.edu

