
Slide 1: The Computational Plant
9th ORAP Forum, Paris (CNRS)
Rolf Riesen, Sandia National Laboratories, Scalable Computing Systems Department
March 21, 2000

Slide 2: Distributed and Parallel Systems
A spectrum from distributed, heterogeneous systems to massively parallel, homogeneous systems: Internet, SETI@home, Legion/Globus, Beowulf, Berkeley NOW, Cplant, ASCI Red Tflops.
Distributed systems (heterogeneous):
–Gather (unused) resources
–Steal cycles
–System SW manages resources
–System SW adds value
–10%-20% overhead is OK
–Resources drive applications
–Time to completion is not critical
–Time-shared
Massively parallel systems (homogeneous):
–Bounded set of resources
–Apps grow to consume all cycles
–Application manages resources
–System SW gets in the way
–5% overhead is maximum
–Apps drive purchase of equipment
–Real-time constraints
–Space-shared
Tech Report SAND98-2221

Slide 3: Massively Parallel Processors
Intel Paragon:
–1,890 compute nodes; 3,680 i860 processors
–143/184 GFLOPS
–175 MB/sec network
–SUNMOS lightweight kernel
Intel TeraFLOPS:
–4,576 compute nodes; 9,472 Pentium II processors
–2.38/3.21 TFLOPS
–400 MB/sec network
–Puma/Cougar lightweight kernel

Slide 4: Cplant Goals
Production system, multiple users
Scalable (easy-to-use buzzword)
Large scale (proof of the above)
General purpose for scientific applications (not a Beowulf dedicated to a single user)
1st step: Tflops look and feel for users

Slide 5: Cplant Strategy
Hybrid approach combining commodity cluster technology with MPP technology
Build on the design of the TFLOPS:
–large systems should be built from independent building blocks
–large systems should be partitioned to provide specialized functionality
–large systems should have significant resources dedicated to system maintenance

Slide 6: Why Cplant?
Modeling and simulation, essential to stockpile stewardship, require significant computing power
Commercial supercomputers are a dying breed
Pooling of SMPs is expensive and more complex
The commodity PC market is closing the performance gap
Web services and e-commerce are driving high-performance interconnect technology

Slide 7: Cplant Approach
Emulate the ASCI Red environment:
–Partition model (functional decomposition)
–Space sharing (reduce turnaround time)
–Scalable services (allocator, loader, launcher)
–Ephemeral user environment
–Complete resource dedication
Use existing software when possible:
–Red Hat distribution, Linux/Alpha
–Software developed for ASCI Red

Slide 8: Conceptual Partition View
[Diagram: the machine is divided into partitions – Service (user access), Compute, File I/O (/home), Net I/O, and System Support (sys admin).]

Slide 9: User View
[Diagram: users rlogin to "alaska"; a load-balancing daemon spreads logins across the service partition nodes alaska0 through alaska4, which share /home and reach the Net I/O and File I/O partitions.]

Slide 10: System Support Hierarchy
[Diagram: an sss1 node (admin access) holds the master copy of the system software; each scalable unit has an sss0 node holding the in-use copy, and the unit's nodes NFS-mount root from their sss0.]

Slide 11: Scalable Unit
[Diagram: a scalable unit contains compute and service nodes attached by 8 Myrinet LAN cables to 16-port Myrinet switches, an sss0 node, 100BaseT hubs, terminal servers, and power controllers; power, serial, Ethernet, and Myrinet links run to the system support network.]

Slide 12: "Virtual Machines"
[Diagram: sss1 uses rdist to push system software down to the sss0 node of each scalable unit; the units NFS-mount root from their sss0. An SU configuration database assigns scalable units to the Production, Alpha, and Beta virtual machines.]

Slide 13: Runtime Environment
yod – service node parallel job launcher
bebopd – compute node allocator
PCT – process control thread, compute node daemon
pingd – compute node status tool
fyod – independent parallel I/O

Slide 14: Phase I – Prototype (Hawaii)
128 Digital PWS 433a (Miata): 433 MHz Alpha 21164 CPU, 2 MB L3 cache, 128 MB ECC SDRAM
24 Myrinet dual 8-port SAN switches; 32-bit, 33 MHz LANai-4 NIC
Two 8-port serial cards per SSS-0 for console access
I/O: six 9 GB disks
Compile server: 1 DEC PWS 433a
Integrated by SNL

Slide 15: Phase II – Production (Alaska)
400 Digital PWS 500a (Miata): 500 MHz Alpha 21164 CPU, 2 MB L3 cache, 192 MB RAM
16-port Myrinet switches; 32-bit, 33 MHz LANai-4 NIC
6 DEC AS1200 with 12 RAIDs (0.75 TB) as parallel file server
1 DEC AS4100 compile & user file server
Integrated by Compaq
125.2 GFLOPS on MPLINPACK (350 nodes) – would place 53rd on the June 1999 Top 500

Slide 16: Phase III – Production (Siberia)
624 Compaq XP1000 (Monet): 500 MHz Alpha 21264 CPU, 4 MB L3 cache, 256 MB ECC SDRAM
16-port Myrinet switches; 64-bit, 33 MHz LANai-7 NIC
1.73 TB disk I/O
Integrated by Compaq and Abba Technologies
247.6 GFLOPS on MPLINPACK (572 nodes) – would place 40th on the Nov 1999 Top 500

Slide 17: Phase IV (Antarctica?)
~1350 DS10 Slates (NM + CA): 466 MHz EV6, 256 MB RAM
Myrinet: 33 MHz, 64-bit LANai 7.x
Will be combined with Siberia for a ~1600-node system
Red, black, green switchable

Slide 18: Myrinet Switch
Based on a 64-port Clos switch: 8x2 16-port switches in a 12U rack-mount case
64 LAN cables to nodes
64 SAN cables (96 links) to the mesh
[Diagram: 4 nodes per 16-port switch]

Slide 19: One Switch Rack = One Plane
4 Clos switches in one rack
256 nodes per plane (8 racks)
Wrap-around in the x and y directions
128 + 128 links in the z direction
[Diagram: the x/y/z arrangement of racks; the 4 nodes per switch are not shown.]

Slide 20: Cplant 2000: "Antarctica"
[Diagram: sections of the machine connected to the open, unclassified, and classified networks; compute nodes swing between red, black, or green. Wrap-around links, z links, and nodes are not shown.]

Slide 21: Cplant 2000: "Antarctica" (cont.)
1056 + 256 + 256 nodes → ~1600 nodes → ~1.5 TFLOPS
320 "64-port" switches + 144 16-port switches from Siberia
40 + 16 system support stations

Slide 22: Portals
Data movement layer from SUNMOS and PUMA
Flexible building blocks for supporting many protocols
Elementary constructs that support MPI semantics well
Low-level message-passing layer (not a wire protocol)
API intended for library writers, not application programmers
Tech report SAND99-2959

Slide 23: Interface Concepts
One-sided operations – Put and Get
Zero-copy message passing – increased bandwidth
OS bypass – reduced latency
Application bypass – no polling, no threads; reduced processor utilization; reduced software complexity
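To make "application bypass" concrete, here is a minimal, hypothetical sketch in plain C (invented names, not the Portals API): the network layer deposits completion events into a ring buffer that lives in application memory, and the application learns about completions simply by reading that buffer whenever it chooses, with no polling loop on the network, no progress thread, and no system call on the fast path.

#include <stdint.h>

/* Hypothetical event record: what the network layer writes when a message
 * arrives or an operation completes.  All names are invented. */
typedef struct {
    uint32_t type;        /* e.g. put arrived, get replied, send completed */
    uint32_t initiator;   /* node/process id of the other side */
    uint64_t match_bits;  /* tag information carried by the message */
    uint64_t length;      /* bytes actually deposited */
    void    *start;       /* where in user memory the data landed */
} ev_t;

/* Single-producer (network layer), single-consumer (application) ring
 * buffer living in memory both can see.  Memory barriers are omitted. */
typedef struct ev_queue {
    ev_t              *slots;
    uint32_t           size;      /* power of two */
    volatile uint32_t  producer;  /* advanced by the network layer */
    uint32_t           consumer;  /* advanced by the application */
} ev_queue_t;

/* Application bypass: completion is observed by reading memory, not by
 * polling the network or running a progress thread. */
static int ev_poll(ev_queue_t *q, ev_t *out)
{
    if (q->consumer == q->producer)
        return 0;                             /* nothing has happened yet */
    *out = q->slots[q->consumer & (q->size - 1)];
    q->consumer++;
    return 1;                                 /* one event consumed */
}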

Slide 24: MPP Network: Paragon and Tflops
[Diagram: processor, memory, and network interface all sit on the memory bus; a message-passing or computational co-processor can drive the network.]

Slide 25: Commodity: Myrinet
[Diagram: processor and memory on the memory bus, a bridge to the PCI bus, and the NIC on the PCI bus; the network is far from the memory, motivating OS bypass on the NIC.]

Slide 26: "Must" Requirements
Common protocols (MPI, system protocols)
Portability
Scalability to 1000's of nodes
High performance
Multiple process access
Heterogeneous processes (binaries)
Runtime independence
Memory protection
Reliable message delivery
Pairwise message ordering

Slide 27: "Will" Requirements
Operational API
Zero-copy MPI
Myrinet
Sockets implementation
Unrestricted message size
OS bypass, application bypass
Put/Get

Slide 28: "Will" Requirements (cont.)
Packetized implementations
Receive uses start and length
Receiver managed
Sender managed
Gateways
Asynchronous operations
Threads

Slide 29: "Should" Requirements
No message alignment restrictions
Striping over multiple channels
Socket API
Implement on ST
Implement on VIA
No consistency/coherency
Ease of use
Topology information

Slide 30: Portal Addressing
[Diagram: an incoming message indexes the portal table; each table entry points to a match list whose entries reference memory descriptors describing regions of application memory and, optionally, an event queue. The operational boundary separates Portal API space from application space.]
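The objects on this slide can be pictured as plain C structures. The sketch below is a hypothetical rendering for illustration only (all type and field names are invented; the real API is specified in SAND99-2959): a per-process portal table indexed by the incoming message, match lists hanging off each table entry, memory descriptors describing regions of application memory, and an optional event queue per descriptor.

#include <stdint.h>

/* Hypothetical rendering of the objects named on the slide; every type and
 * field name is invented for illustration. */

typedef struct ev_queue ev_queue_t;   /* event queue, as in the earlier sketch */

/* Memory descriptor: a region of application memory plus the rules for
 * accepting an operation into it. */
typedef struct mem_desc {
    void       *start;          /* base of the memory region */
    uint64_t    length;         /* size of the region */
    uint64_t    offset;         /* next free byte, for locally managed buffers */
    unsigned    truncate : 1;   /* accept messages that do not fit, cut them */
    unsigned    unlink   : 1;   /* remove this MD once it has been consumed */
    unsigned    no_ack   : 1;   /* suppress the acknowledgement to the sender */
    ev_queue_t *eq;             /* where to record events, or NULL */
} mem_desc_t;

/* Match entry: selection criteria plus the memory descriptor it guards. */
typedef struct match_entry {
    uint32_t            match_id;     /* which initiator may match, or wildcard */
    uint64_t            match_bits;   /* bits the incoming header must carry */
    uint64_t            ignore_bits;  /* bits excluded from the comparison */
    unsigned            unlink : 1;   /* remove this ME when its MD goes away */
    mem_desc_t         *md;           /* NULL once the MD has been unlinked */
    struct match_entry *next;         /* next entry in the match list */
} match_entry_t;

/* Portal table: the incoming message names a slot, and the match list at
 * that slot does the rest.  The operational boundary on the slide falls
 * here: the table, MEs, MDs, and EQs belong to the Portals layer, while the
 * memory regions and the event queue contents belong to the application. */
#define PT_TABLE_SIZE 64
typedef struct portal_table {
    match_entry_t *match_list[PT_TABLE_SIZE];
    uint64_t       drop_count;        /* messages nobody was prepared for */
} portal_table_t;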

Slide 31: Portal Address Translation
[Flowchart: for an incoming message, walk the match list. If there are no more match entries, increment the drop count and discard the message. If an entry does not match, get the next match entry. When an entry matches, the first memory descriptor either accepts the message or the message is dropped; on acceptance, perform the operation, unlink the MD if requested, unlink the ME if it is empty and marked for unlinking, record an event if an event queue is attached, and exit.]
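The flowchart translates almost line for line into a walk over those structures. The function below is a hypothetical sketch built on the types from the previous sketch (assumed here to live in a header called portals_sketch.h); the helper functions are declared but not defined, and all names are invented.

#include <stdint.h>
#include "portals_sketch.h"   /* hypothetical header holding the structures
                                 from the previous sketches */

/* Minimal view of an incoming message header; names invented. */
typedef struct {
    uint32_t initiator;
    uint32_t portal_index;
    uint64_t match_bits;
    uint64_t length;
} msg_hdr_t;

/* Hypothetical helpers assumed to exist elsewhere in this sketch. */
void perform_operation(mem_desc_t *md, const msg_hdr_t *h, const void *payload);
void unlink_md(match_entry_t *me);                       /* sets me->md = NULL */
void unlink_me(portal_table_t *pt, uint32_t index, match_entry_t *me);
void record_event(ev_queue_t *eq, const msg_hdr_t *h);

static int me_matches(const match_entry_t *me, const msg_hdr_t *h)
{
    /* Compare match bits, masking out the ignored ones.  A fuller version
     * would also check the initiator against match_id. */
    return ((me->match_bits ^ h->match_bits) & ~me->ignore_bits) == 0;
}

static int md_accepts(const mem_desc_t *md, const msg_hdr_t *h)
{
    /* The first matching MD either takes the message or the message is
     * dropped: refuse only if it does not fit and truncation is off. */
    return md->truncate || h->length <= md->length - md->offset;
}

/* One incoming message, following the slide's flowchart. */
void deliver(portal_table_t *pt, const msg_hdr_t *h, const void *payload)
{
    match_entry_t *me = pt->match_list[h->portal_index];

    for (; me != NULL; me = me->next) {            /* "more match entries?" */
        if (me->md == NULL || !me_matches(me, h))  /* "match?"              */
            continue;                              /* get next match entry  */

        if (!md_accepts(me->md, h))                /* "first MD accepts?"   */
            break;                                 /* no: drop the message  */

        ev_queue_t *eq = me->md->eq;               /* remember before unlink */
        perform_operation(me->md, h, payload);     /* "perform operation"   */

        if (me->md->unlink) {                      /* "unlink MD?"          */
            unlink_md(me);
            if (me->unlink)                        /* "empty & unlink ME?"  */
                unlink_me(pt, h->portal_index, me);
        }

        if (eq != NULL)                            /* "event queue?"        */
            record_event(eq, h);                   /* "record event"        */
        return;                                    /* exit                  */
    }

    pt->drop_count++;                              /* increment drop count  */
    /* discard message */
}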

Slide 32: Implementing MPI
Short message protocol:
–Send message (expect receipt)
–Unexpected messages
Long message protocol:
–Post receive
–Send message
–On ACK or Get, release message
Event includes the memory descriptor
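Seen from the sender, the two protocols might look like the following hypothetical sketch (the threshold, the helpers, and their arguments are all invented, not the production MPICH-over-Portals code): short messages go out eagerly, while long messages expose the send buffer for a Get and hold it until the ACK or the receiver's Get shows the data has been delivered, matching "on ACK or Get, release message".

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers, invented for this sketch (not the real API). */
typedef struct { uint32_t nid, pid; } peer_t;
void portal_put(peer_t dst, uint64_t match_bits,
                const void *buf, size_t len, int want_ack);
void expose_for_get(const void *buf, size_t len, uint64_t match_bits);
void wait_ack_or_get(uint64_t match_bits);   /* blocks on the event queue */

enum { EAGER_LIMIT = 8 * 1024 };   /* assumed threshold, purely illustrative */

/* Sender side of the two protocols on the slide. */
void mpi_send_sketch(peer_t dst, uint64_t tag_bits,
                     const void *buf, size_t len)
{
    if (len <= EAGER_LIMIT) {
        /* Short protocol: send eagerly, expecting the receiver to have a
         * matching pre-posted receive or an unexpected-message buffer. */
        portal_put(dst, tag_bits, buf, len, /*want_ack=*/0);
        return;
    }

    /* Long protocol: make the send buffer visible so the receiver can Get
     * it if the receive was not yet posted, then send.  The buffer cannot
     * be released until either the ACK arrives (the data went straight
     * into a posted receive) or the receiver's Get has pulled it. */
    expose_for_get(buf, len, tag_bits);
    portal_put(dst, tag_bits, buf, len, /*want_ack=*/1);
    wait_ack_or_get(tag_bits);
}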

Slide 33: Implementing MPI (cont.)
[Diagram: the receive-side match list and its event queue. Pre-posted receives are inserted ahead of a mark entry that matches nothing; behind the mark, match-any entries with short buffers (marked unlink) catch unexpected short messages, and a final match-any entry with a zero-length, truncating, no-ACK memory descriptor records long unexpected messages without storing their data.]
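One plausible way to build that receive-side match list with the hypothetical structures from the earlier sketches is shown below. The layout is an assumption drawn from the slide's labels, not the production code, and every name is invented.

#include <stdint.h>
#include <stdlib.h>
#include "portals_sketch.h"   /* hypothetical header with the earlier structures */

/* Append one match entry (its MD may be NULL) at the tail of a portal's
 * match list.  Error checking is omitted in this sketch. */
static match_entry_t *append_me(portal_table_t *pt, uint32_t index,
                                uint64_t match, uint64_t ignore,
                                mem_desc_t *md)
{
    match_entry_t *me = calloc(1, sizeof *me);
    me->match_bits  = match;
    me->ignore_bits = ignore;
    me->md          = md;

    match_entry_t **p = &pt->match_list[index];
    while (*p != NULL)
        p = &(*p)->next;
    *p = me;
    return me;
}

/* Receive-side setup, as suggested by the slide:
 *   [pre-posted receives] [mark] [short unexpected buffers] [long catch-all] */
void mpi_recv_side_init(portal_table_t *pt, uint32_t index,
                        ev_queue_t *eq, int nbufs, size_t bufsize)
{
    /* Mark: an entry with no MD.  It never accepts a message; it only marks
     * the spot so pre-posted receives can be inserted ahead of it while the
     * unexpected-message entries stay behind it. */
    append_me(pt, index, 0, 0, NULL);

    /* Short unexpected messages land in these match-any buffers.  Each MD
     * is marked unlink so it is removed once consumed and a fresh buffer
     * can be appended later. */
    for (int i = 0; i < nbufs; i++) {
        mem_desc_t *md = calloc(1, sizeof *md);
        md->start  = malloc(bufsize);
        md->length = bufsize;
        md->unlink = 1;
        md->eq     = eq;
        append_me(pt, index, 0, ~0ULL, md);   /* ignore all bits: match any */
    }

    /* Long unexpected messages: a zero-length, truncating, no-ACK MD that
     * records only the event; the data is retrieved later with a Get. */
    mem_desc_t *md = calloc(1, sizeof *md);
    md->length   = 0;
    md->truncate = 1;
    md->no_ack   = 1;
    md->eq       = eq;
    append_me(pt, index, 0, ~0ULL, md);
}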

Slide 34: Flow Control
Basic flow control:
–Drop messages that the receiver is not prepared for
–Long messages might waste network resources
–Good performance for well-behaved MPI apps
Managing the network – packets:
–Packet size big enough to hold a short message
–First packet is an implicit RTS
–Flow control ACK can indicate that the message will be dropped
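A hypothetical sketch of the packet-level decision (invented names and types): the first packet of every message acts as an implicit RTS, and the flow-control ACK either asks for the rest of the message or tells the sender the message will be dropped so no further packets are wasted.

#include <stdint.h>
#include "portals_sketch.h"   /* hypothetical header with the earlier structures */

/* Would the match list at this portal index accept a message of this size?
 * A stand-in for the traversal sketched earlier; assumed to exist. */
int message_fits(portal_table_t *pt, uint32_t portal_index,
                 uint64_t match_bits, uint64_t total_length);

typedef enum { FC_SEND_REST, FC_WILL_DROP } fc_ack_t;

/* First packet of a message: big enough to hold a whole short message and
 * doubling as an implicit RTS.  Field names are invented. */
typedef struct {
    uint32_t portal_index;
    uint64_t match_bits;
    uint64_t total_length;    /* size of the whole message, not this packet */
    /* ...up to one short message of payload follows... */
} first_packet_t;

/* Receiver side: choose the flow-control ACK for an arriving first packet. */
fc_ack_t on_first_packet(portal_table_t *pt, const first_packet_t *pkt)
{
    if (message_fits(pt, pkt->portal_index, pkt->match_bits,
                     pkt->total_length)) {
        /* A short message is already complete at this point; for a long
         * message, ask the sender to stream the remaining packets. */
        return FC_SEND_REST;
    }

    /* Nobody is prepared for this message, so it will be dropped; saying
     * so in the ACK keeps the sender from wasting the network on the rest
     * of a long message that would be thrown away anyway. */
    pt->drop_count++;
    return FC_WILL_DROP;
}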

Slide 35: Portals 3.0 Status
Currently testing Cplant Release 0.5:
–Portals 3.0 kernel module using the RTS/CTS module over Myrinet
–Port of MPICH 1.2.0 over Portals 3.0
TCP/IP reference implementation ready
Port to LANai begun

Slide 36: http://www.cs.sandia.gov/cplant

