
Slide 1: High-Performance Clusters, Part 2: Generality
David E. Culler, Computer Science Division, U.C. Berkeley
PODC/SPAA Tutorial, Sunday, June 28, 1998

Slide 2: What's Different about Clusters?
- Commodity parts?
- Communications packaging?
- Incremental scalability?
- Independent failure?
- Intelligent network interfaces?
- Fast, scalable communication?
- => a complete system on every node:
  - virtual memory
  - scheduler
  - file system
  - ...

Slide 3: Topics: Part 2
- Virtual networks: communication meets virtual memory
- Scheduling
- Parallel I/O
- Clusters of SMPs
- VIA

Slide 4: General-Purpose Requirements
- Many timeshared processes, each with direct, protected access
- User and system
- Client/server, parallel clients, parallel servers: they grow, shrink, and handle node failures
- Multiple packages in a process, each possibly with its own internal communication layer
- Use communication as easily as memory

Slide 5: Virtual Networks
- An endpoint abstracts the notion of being "attached to the network".
- A virtual network is a collection of endpoints that can name each other.
- Many processes on a node can each have many endpoints, each with its own protection domain.
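
To make the abstraction concrete, here is a minimal sketch of what an endpoint and a virtual network might look like as data structures; the field names and sizes are illustrative assumptions, not the Berkeley AM-II structures.

```c
/* Hypothetical sketch of the endpoint abstraction; fields and sizes
 * are illustrative, not the actual NOW implementation. */
#include <stdint.h>

#define QUEUE_DEPTH 64

typedef struct {
    uint32_t vnet_id;          /* which virtual network this endpoint is in */
    uint32_t ep_index;         /* its name within that virtual network      */
    uint32_t protection_key;   /* checked by the NIC on every message       */
    /* send/receive queues live in memory the NIC can reach                 */
    uint8_t  send_q[QUEUE_DEPTH][256];
    uint8_t  recv_q[QUEUE_DEPTH][256];
    uint32_t send_head, send_tail;
    uint32_t recv_head, recv_tail;
} endpoint_t;

/* A virtual network: a set of endpoints that name each other by index. */
typedef struct {
    uint32_t     vnet_id;
    uint32_t     n_endpoints;
    endpoint_t **endpoints;    /* one per member process/thread            */
} virtual_network_t;
```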

Slide 6: How Are Endpoints Managed?
- How do you get direct hardware access for performance with a large space of logical resources?
- Just like virtual memory: the active portion of the large logical space is bound to physical resources.
[Figure: processes 1 through n on the host, each with endpoints in host memory; the NIC, with its own memory, connects the processor to the network.]

Slide 7: Endpoint Transition Diagram
- COLD: paged host memory
- WARM: read-only, paged host memory
- HOT: read/write, NIC memory
- Reads, writes, and message arrivals warm an endpoint up toward HOT; Evict and Swap cool it back toward COLD (a state-machine reading follows).
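
A hedged reading of the diagram as a state machine; the states and event names are from the slide, but the exact transition edges are inferred from the COLD/WARM/HOT ordering and are an assumption.

```c
/* Endpoint states and events from the slide; the transition table is
 * an inferred reconstruction, not taken verbatim from the tutorial. */
typedef enum { EP_COLD, EP_WARM, EP_HOT } ep_state_t;
typedef enum { EV_READ, EV_WRITE, EV_MSG_ARRIVAL, EV_EVICT, EV_SWAP } ep_event_t;

ep_state_t ep_transition(ep_state_t s, ep_event_t e) {
    switch (e) {
    case EV_READ:        return s == EP_COLD ? EP_WARM : s; /* page in, R/O   */
    case EV_WRITE:                                          /* needs R/W on   */
    case EV_MSG_ARRIVAL: return EP_HOT;                     /* the NIC        */
    case EV_EVICT:       return s == EP_HOT ? EP_WARM : s;  /* off the NIC    */
    case EV_SWAP:        return EP_COLD;                    /* page out       */
    }
    return s;
}
```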

Slide 8: Network Interface Support
- The NIC holds endpoint frames (Frame 0 through Frame 7 in the figure)
- Services active endpoints
- Signals misses to the driver, using a system endpoint
[Figure: transmit and receive paths through the endpoint frames; a reference to an unloaded endpoint raises an "EndPoint Miss".]

Slide 9: Solaris System Abstractions
- Segment driver: manages portions of an address space
- Device driver: manages an I/O device
- Virtual network driver

Slide 10: LogP Performance
- Competitive latency
- Increased NIC processing
- The difference is mostly: ack processing, protection check, data structures, code quality
- Virtualization is cheap
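
For orientation, the standard LogP cost model behind these measurements prices a small-message round trip as follows; this is the textbook model, not a number taken from the slides:

$$T_{\mathrm{rtt}} = 2L + 4o,$$

where $L$ is the network latency, $o$ is the per-message processor overhead (paid once to send and once to receive in each direction), and successive messages can be injected no faster than the gap $g$ on each of the $P$ processors.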

Slide 11: Bursty Communication Among Many
[Figure: many clients send a burst of messages to a server, which then performs the work.]

Slide 12: Multiple VNs, Single-Threaded Server

Slide 13: Multiple VNs, Multithreaded Server

Slide 14: Perspective on Virtual Networks
- Networking abstractions are vertical stacks: a new function means a new layer, and you poke through layers for performance.
- Virtual networks provide a horizontal abstraction: a basis for building new, fast services.
- Open questions: What is the communication "working set"? What placement, replacement, ...?

Slide 15: Beyond the Personal Supercomputer
- Able to timeshare parallel programs, with fast, protected communication
- Mix with sequential and interactive jobs
- Use fast communication in OS subsystems: parallel file system, network virtual memory, ...
- Nodes have a powerful, local OS scheduler
- Problem: local schedulers do not know to run parallel jobs in parallel

Slide 16: Local Scheduling
- Schedulers act independently, without global control
- A program waits while trying to communicate with peers that are not running
- 10-100x slowdowns for fine-grain programs!
- => need coordinated scheduling

Slide 17: Explicit Coscheduling
- Global context switch according to a precomputed schedule
- How do you build it? Does it work?

Slide 18: Typical Cluster Subsystem Structures
[Figure: two organizations. Master-slave: applications (A) and local services (LS) on each node, coordinated by a single master over the communication layer. Peer-to-peer: each node also runs a global service (GS) component, and the GS components coordinate among themselves.]

Slide 19: Ideal Cluster Subsystem Structure
- Obtain coordination without explicit subsystem interaction, using only the events in the program:
  - very easy to build
  - potentially very robust to component failures
  - inherently "service on demand"
  - scalable
- The local service component can evolve.
[Figure: per-node application, local service, and global service components, as in the peer-to-peer structure.]

Slide 20: Three Approaches Examined in NOW
- GLUNIX: explicit master-slave (user level); a matrix algorithm picks the parallel program (PP) to run, and stops and signals try to force the desired PP onto the processors
- Explicit peer-to-peer scheduling, assisted by virtual networks: coscheduling daemons decide on the PP and kick the Solaris scheduler
- Implicit: modify the parallel run-time library so a program gets itself coscheduled under the standard scheduler

Slide 21: Problems with Explicit Coscheduling
- Implementation complexity
- Need to identify parallel programs in advance
- Interacts poorly with interactive use and load imbalance
- Introduces new potential faults
- Scalability

Slide 22: Why Implicit Coscheduling Might Work
- Active-message request-reply model
- Infer non-local state from local observations; react to maintain coordination:

  observation       | implication            | action
  ------------------|------------------------|-------
  fast response     | partner scheduled      | spin
  delayed response  | partner not scheduled  | block

[Figure: four workstations timesharing jobs A and B; a requester spins while a response comes back quickly and sleeps when it is delayed.]

Slide 23: Obvious Questions
- Does it work?
- How long do you spin?
- What are the requirements on the local scheduler?

Slide 24: How Long to Spin?
- Answer: round-trip time + context switch + message-processing time
  - spin a round trip to stay scheduled together
  - plus a wake-up to get scheduled together
  - keep spinning if serving messages (within an interval of about 3x the wake-up time)
- A sketch of this policy follows.
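
A minimal sketch of the two-phase spin-then-block wait this slide describes; every function and constant below is a hypothetical stand-in, not the NOW run-time code.

```c
/* Hypothetical sketch of the implicit-coscheduling wait policy. */
#include <stdbool.h>
#include <stdint.h>

extern uint64_t now_usec(void);         /* assumed time source            */
extern bool reply_arrived(int req_id);  /* assumed AM-layer query         */
extern int  poll_network(void);         /* serves messages, returns count */
extern void block_until_message(void);  /* sleep in the local scheduler   */

/* Baseline spin budget: round trip + context switch + msg processing.
 * These numbers are illustrative, not measured values from the talk.  */
#define RTT_USEC        20
#define CTX_SWITCH_USEC 50   /* also stands in for the wake-up time     */
#define MSG_PROC_USEC   10
#define BASE_SPIN_USEC  (RTT_USEC + CTX_SWITCH_USEC + MSG_PROC_USEC)

void wait_for_reply(int req_id) {
    uint64_t deadline = now_usec() + BASE_SPIN_USEC;
    while (!reply_arrived(req_id)) {
        if (poll_network() > 0) {
            /* Serving requests implies peers are scheduled: keep
             * spinning, within ~3x the wake-up interval (per slide). */
            deadline = now_usec() + 3 * CTX_SWITCH_USEC;
        }
        if (now_usec() > deadline) {
            block_until_message();   /* partner likely descheduled */
            deadline = now_usec() + BASE_SPIN_USEC;
        }
    }
}
```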

Slide 25: Does it work?

Slide 26: Synthetic Bulk-Synchronous Apps
- Range of granularity and load imbalance
- Pure spin-waiting: 10x slowdown

Slide 27: With a Mixture of Reads
- Blocking immediately: 4x slowdown

Slide 28: Timesharing Split-C Programs

Slide 29: Many Questions
- What about: a mix of jobs? sequential jobs? unbalanced placement? fairness? scalability?
- How broadly can implicit coordination be applied in the design of cluster subsystems?
- Can resource management be completely decentralized? (computational economies, ecologies)

Slide 30: A Look at Serious File I/O
- Traditional I/O system vs. the NOW I/O system
- Benchmark problem: sort a large number of 100-byte records with 10-byte keys
  - start on disk, end on disk
  - accessible as files (use the file system)
  - Datamation sort: 1 million records
  - MinuteSort: as many records as possible in one minute
[Figure: processor-memory (P-M) nodes with attached disks.]

Slide 31: NOW-Sort Algorithm (the sketch after this list walks through the phases)
- Read: N/P records from disk into memory
- Distribute: scatter keys to the processors holding the result buckets; gather keys from all processors
- Sort: partial radix sort on each bucket
- Write: write records to disk (two-pass variant: gather data runs onto disk, then run a local external merge sort)
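
A hedged skeleton of the one-pass phases on one of the P nodes; every helper below is a hypothetical stand-in for the tuned NOW-Sort code, kept only to show the read/distribute/sort/write structure.

```c
/* Skeleton of the NOW-Sort one-pass pipeline; helpers are hypothetical. */
#include <stddef.h>

#define REC_SIZE 100                 /* Datamation record: 100 bytes    */

typedef struct { unsigned char bytes[REC_SIZE]; } record_t;

extern size_t read_local_records(record_t *buf);        /* N/P records  */
extern int    bucket_of(const record_t *r);             /* key -> node  */
extern void   am_scatter(int node, const record_t *r);  /* active msg   */
extern size_t gather_bucket(record_t *bucket);          /* from peers   */
extern void   partial_radix_sort(record_t *b, size_t n);
extern void   write_local_records(const record_t *b, size_t n);

void now_sort_one_pass(record_t *in, record_t *bucket) {
    size_t n = read_local_records(in);           /* Read             */
    for (size_t i = 0; i < n; i++)               /* Distribute:      */
        am_scatter(bucket_of(&in[i]), &in[i]);   /*   scatter keys   */
    size_t m = gather_bucket(bucket);            /*   gather bucket  */
    partial_radix_sort(bucket, m);               /* Sort             */
    write_local_records(bucket, m);              /* Write            */
}
```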

Slide 32: Key Implementation Techniques
- Performance isolation: a highly tuned local disk-to-disk sort
  - manage local memory
  - manage disk striping
  - memory-mapped I/O with madvise, plus buffering (see the example below)
  - manage overlap with threads
- Efficient communication: completely hidden under disk I/O; competes for I/O-bus bandwidth
- Self-tuning software: probe available memory, disk bandwidth, and trade-offs
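
A minimal, self-contained example of the memory-mapped I/O + madvise technique the slide mentions; it is not the NOW-Sort code, just a sequential scan with an access-pattern hint.

```c
/* mmap a file and advise the VM system of sequential access. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* The hint: prefetch ahead of the scan, discard pages behind it. */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += (unsigned char)p[i];
    printf("checksum: %lu\n", sum);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```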

Slide 33: World-Record Disk-to-Disk Sort
- Sustains 500 MB/s of disk bandwidth and 1,000 MB/s of network bandwidth
- ... but only in the wee hours of the morning

Slide 34: Towards a Cluster File System
- A remote disk system built on a virtual network
[Figure: a client linked with RDlib talks to an RD server over active messages.]

Slide 35: Streaming Transfer Experiment

Slide 36: Results
- Data distribution affects resource utilization, but not delivered bandwidth.

Slide 37: I/O Bus Crossings

Slide 38: Opportunity: PDISK
- Producers dump data into an I/O "river"; consumers pull it out
- Hash data records across disks (sketched below)
- Match producers to consumers
- Integrated with work scheduling
[Figure: processors connected by fast communication (remote queues) to fast I/O (streaming disk queues).]
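
An illustrative sketch of the river idea: producers hash each record to a disk queue, and consumers drain whichever queues have data. The queue primitives and the hash choice are assumptions, not the PDISK design.

```c
/* PDISK-style river sketch; queue primitives are hypothetical. */
#include <stddef.h>
#include <stdint.h>

#define NDISKS 8

typedef struct { uint8_t key[10]; uint8_t payload[90]; } record_t;

extern void queue_push(int disk, const record_t *r);  /* streaming disk queue */
extern int  queue_pop_any(record_t *out);             /* any non-empty queue  */

/* A simple FNV-1a hash over the key spreads records across disks. */
static int disk_of(const record_t *r) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < sizeof r->key; i++) {
        h ^= r->key[i];
        h *= 1099511628211ULL;
    }
    return (int)(h % NDISKS);
}

void produce(const record_t *r) { queue_push(disk_of(r), r); }
int  consume(record_t *out)     { return queue_pop_any(out); }
```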

Slide 39: What Will Be the Building Block?
[Figure: design space of processors per node vs. number of nodes, spanning SMPs, NOWs, and clusters of SMPs; SMP nodes, each with memory and network cards on a memory interconnect, joined through a network cloud.]

Slide 40: Multi-Protocol Communication
- A uniform programming model is the key
- Multiprotocol messaging (sketched below):
  - careful layout of message queues
  - concurrent objects
  - polling the network hurts memory
- Shared virtual memory relies on the underlying messages
- Polling vs. contention
[Figure: Send/Write and Rcv/Read paths through shared memory and the network communication layer.]
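
A hedged sketch of the multiprotocol dispatch the slide implies: messages within an SMP take the shared-memory path, messages across nodes take the NIC path, behind one uniform send call. All names are illustrative.

```c
/* Multiprotocol send: same interface, two transports. Hypothetical names. */
#include <stddef.h>

extern int  my_node(void);
extern int  node_of(int proc);
extern void shm_queue_enqueue(int proc, const void *msg, size_t len);
extern void am_network_send(int proc, const void *msg, size_t len);

void mp_send(int dest_proc, const void *msg, size_t len) {
    if (node_of(dest_proc) == my_node())
        shm_queue_enqueue(dest_proc, msg, len);  /* same SMP: memory path */
    else
        am_network_send(dest_proc, msg, len);    /* cross-node: NIC path  */
}
```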

Slide 41: LogP Analysis of Shared-Memory AM

Slide 42: Virtual Interface Architecture
- Applications (sockets, MPI, legacy layers, etc.) sit on a VI User Agent ("libvia")
- Setup operations (open, connect, map memory) go through the VIA kernel driver: the slow path
- Data transfer (descriptor read/write) goes from user level straight to a VI-capable NIC via doorbells: the fast path (sketched below)
[Figure: host-side send (S), receive (R), and completion (COMP) queues per VI; requests are posted to and completed by the NIC through doorbells.]
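
A sketch of the two paths as code, following the slide's flow; the function names below are hypothetical stand-ins, not actual VIPL calls.

```c
/* VIA send sketch: slow-path setup once, fast-path transfer per message. */
#include <stddef.h>

typedef struct vi   vi_t;     /* a Virtual Interface       */
typedef struct desc desc_t;   /* a send/receive descriptor */

/* Slow path: the kernel driver is involved only at setup time.       */
extern vi_t *via_open_and_connect(const char *peer);
extern void *via_register_memory(vi_t *vi, void *buf, size_t len);

/* Fast path: pure user level, no kernel crossing per message.        */
extern desc_t *via_build_descriptor(vi_t *vi, void *buf, size_t len);
extern void    via_ring_doorbell(vi_t *vi, desc_t *d);  /* MMIO write */
extern int     via_poll_completion(vi_t *vi);

void via_send(const char *peer, void *buf, size_t len) {
    vi_t *vi = via_open_and_connect(peer);   /* kernel: open/connect  */
    via_register_memory(vi, buf, len);       /* kernel: pin + map     */

    desc_t *d = via_build_descriptor(vi, buf, len); /* user level     */
    via_ring_doorbell(vi, d);   /* NIC DMA-reads the descriptor, then */
                                /* DMA-reads the data buffer itself   */
    while (!via_poll_completion(vi))
        ;                       /* wait for the completion entry      */
}
```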

Slide 43: VIA Implementation Overview
[Figure: doorbell pages and descriptor queues in kernel memory, mapped into the application; transmit/receive buffers in host memory. Send sequence: (1) the host writes a doorbell; (2, 3) the NIC issues a DMA request and reads the descriptor; (4, 5) it issues a DMA request and reads the data buffer; (7) the receiving side DMA-writes the data into host memory.]

Slide 44: Current VIA Performance

Slide 45: VIA Ahead
- You will be able to buy decent clusters
- Virtualization in host memory is easy
  - will it go beyond pinned regions?
  - still need to manage active endpoints (doorbells)
- Complex descriptor queues will hinder low-latency short messages
  - NICs will chew on them, but that costs many instructions on the host
- Need to re-examine where error handling, flow control, and retry are performed
- Interactions with scheduling, I/O, locking, etc. will dominate application speed-up
  - and will demand new development methodologies

Slide 46: Conclusions
- A complete system on every node makes clusters a very powerful architecture; we can finally get serious about I/O
- Extend the system globally: virtual memory systems, schedulers, file systems, ...
- Efficient communication enables new solutions to classic systems challenges
- Opens a rich set of issues for parallel processing beyond the personal supercomputer or the LAN, where SPAA and PODC meet

