The Computational Plant
9th ORAP Forum, Paris (CNRS)
Rolf Riesen, Sandia National Laboratories, Scalable Computing Systems Department
March 21, 2000
Distributed and Parallel Systems (diagram; Tech Report SAND98-2221)
Example systems along the spectrum from distributed (heterogeneous) to massively parallel (homogeneous): Legion/Globus, Beowulf, Berkeley NOW, Cplant, the Internet (SETI@home), ASCI Red Tflops.
Distributed systems:
– Gather (unused) resources; steal cycles
– System SW manages resources; system SW adds value
– 10%-20% overhead is OK
– Resources drive applications; time to completion is not critical
– Time-shared; apps grow to consume all cycles
Massively parallel systems:
– Bounded set of resources
– Application manages resources; system SW gets in the way
– 5% overhead is the maximum
– Apps drive purchase of equipment
– Real-time constraints
– Space-shared
Massively Parallel Processors
Intel Paragon:
– 1,890 compute nodes; 3,680 i860 processors
– 143/184 GFLOPS
– 175 MB/sec network
– SUNMOS lightweight kernel
Intel TeraFLOPS:
– 4,576 compute nodes; 9,472 Pentium II processors
– 2.38/3.21 TFLOPS
– 400 MB/sec network
– Puma/Cougar lightweight kernel
Cplant Goals
– Production system
– Multiple users
– Scalable (an easy-to-use buzzword)
– Large scale (proof of the above)
– General purpose for scientific applications (not a Beowulf dedicated to a single user)
– 1st step: Tflops look and feel for users
Cplant Strategy
Hybrid approach combining commodity cluster technology with MPP technology.
Build on the design of the TFLOPS:
– large systems should be built from independent building blocks
– large systems should be partitioned to provide specialized functionality
– large systems should have significant resources dedicated to system maintenance
Why Cplant?
– Modeling and simulation, essential to stockpile stewardship, require significant computing power
– Commercial supercomputers are a dying breed
– Pooling of SMPs is expensive and more complex
– The commodity PC market is closing the performance gap
– Web services and e-commerce are driving high-performance interconnect technology
Cplant Approach
Emulate the ASCI Red environment:
– Partition model (functional decomposition)
– Space sharing (reduce turnaround time)
– Scalable services (allocator, loader, launcher)
– Ephemeral user environment
– Complete resource dedication
Use existing software when possible:
– Red Hat distribution, Linux/Alpha
– Software developed for ASCI Red
Conceptual Partition View (diagram)
– Partitions: System Support (Sys Admin), Service (Users), Compute, File I/O (/home), Net I/O
User View (diagram)
– Users `rlogin alaska`; a load-balancing daemon directs each login to one of the service-partition nodes alaska0 through alaska4
– The service partition connects to /home, Net I/O, and File I/O
System Support Hierarchy (diagram)
– sss1: admin access; holds the master copy of the system software
– sss0 (one per Scalable Unit): holds the in-use copy of the system software
– Nodes in each Scalable Unit NFS-mount their root from sss0
Scalable Unit (diagram)
– Compute and service nodes connected by power, serial, Ethernet, and Myrinet links
– Per unit: sss0 node, 100BaseT hub, 16-port Myrinet switch, terminal server, power controller
– 8 Myrinet LAN cables; connected to the system support network
"Virtual Machines" (diagram)
– sss1 uses rdist to push system software down to each Scalable Unit's sss0 node (nodes NFS-mount root from sss0)
– An SU configuration database partitions the units into Alpha, Beta, and Production "virtual machines"
Runtime Environment
– yod: service-node parallel job launcher
– bebopd: compute-node allocator
– PCT: process control thread, compute-node daemon
– pingd: compute-node status tool
– fyod: independent parallel I/O
Phase I – Prototype (Hawaii)
– 128 Digital PWS 433a (Miata): 433 MHz Alpha 21164 CPU, 2 MB L3 cache, 128 MB ECC SDRAM
– 24 Myrinet dual 8-port SAN switches; 32-bit, 33 MHz LANai-4 NIC
– Two 8-port serial cards per SSS-0 for console access
– I/O: six 9 GB disks
– Compile server: one DEC PWS 433a
– Integrated by SNL
Phase II – Production (Alaska)
– 400 Digital PWS 500a (Miata): 500 MHz Alpha 21164 CPU, 2 MB L3 cache, 192 MB RAM
– 16-port Myrinet switches; 32-bit, 33 MHz LANai-4 NIC
– 6 DEC AS1200 parallel file servers, 12 RAIDs (0.75 TB)
– 1 DEC AS4100 compile and user file server
– Integrated by Compaq
– 125.2 GFLOPS on MPLINPACK (350 nodes); would place 53rd on the June 1999 Top 500
Phase III – Production (Siberia)
– 624 Compaq XP1000 (Monet): 500 MHz Alpha 21264 CPU, 4 MB L3 cache, 256 MB ECC SDRAM
– 16-port Myrinet switches; 64-bit, 33 MHz LANai-7 NIC
– 1.73 TB disk I/O
– Integrated by Compaq and Abba Technologies
– 247.6 GFLOPS on MPLINPACK (572 nodes); would place 40th on the Nov 1999 Top 500
Phase IV (Antarctica?)
– ~1,350 DS10 Slates (NM + CA): 466 MHz EV6, 256 MB RAM
– Myrinet: 33 MHz, 64-bit LANai 7.x
– Will be combined with Siberia for a ~1,600-node system
– Red, black, green switchable
Myrinet Switch (diagram)
– Based on a 64-port Clos switch: 8×2 16-port switches in a 12U rack-mount case
– 64 LAN cables to nodes (4 nodes per 16-port switch)
– 64 SAN cables (96 links) to the mesh
One Switch Rack = One Plane (diagram)
– 4 Clos switches in one rack
– 256 nodes per plane (8 racks)
– Wrap-around in x and y directions
– 128+128 links in z direction
– (4 nodes per switch not shown)
Cplant 2000: "Antarctica" (diagram)
– Wrap-around and z links and nodes not shown
– Connected to open, classified, and unclassified networks
– Compute nodes swing between red, black, or green
Cplant 2000: "Antarctica" (cont.)
– 1056 + 256 + 256 nodes: ~1,600 nodes, 1.5 TFLOPS
– 320 "64-port" switches, plus 144 16-port switches from Siberia
– 40 + 16 system support stations
Portals
– Data-movement layer from SUNMOS and PUMA
– Flexible building blocks for supporting many protocols
– Elementary constructs that support MPI semantics well
– Low-level message-passing layer (not a wire protocol)
– API intended for library writers, not application programmers
– Tech report SAND99-2959
Interface Concepts
– One-sided operations: Put and Get
– Zero-copy message passing: increased bandwidth
– OS bypass: reduced latency
– Application bypass: no polling, no threads; reduced processor utilization; reduced software complexity
MPP Network: Paragon and Tflops (diagram)
– The network interface sits directly on the memory bus
– A second processor acts as a message-passing or computational co-processor
Commodity: Myrinet (diagram)
– The NIC sits on the PCI bus, behind a bridge from the memory bus: the network is far from the memory
– OS bypass
"Must" Requirements
– Common protocols (MPI, system protocols)
– Portability
– Scalability to 1000s of nodes
– High performance
– Multiple process access
– Heterogeneous processes (binaries)
– Runtime independence
– Memory protection
– Reliable message delivery
– Pairwise message ordering
"Will" Requirements
– Operational API
– Zero-copy MPI
– Myrinet
– Sockets implementation
– Unrestricted message size
– OS bypass, application bypass
– Put/Get
"Will" Requirements (cont.)
– Packetized implementations
– Receive uses start and length
– Receiver managed
– Sender managed
– Gateways
– Asynchronous operations
– Threads
"Should" Requirements
– No message alignment restrictions
– Striping over multiple channels
– Socket API
– Implement on ST
– Implement on VIA
– No consistency/coherency
– Ease of use
– Topology information
Portal Addressing (diagram)
– A portal-table entry selects a match list; each match entry points to memory descriptors, which may have an event queue attached
– The memory descriptors describe memory regions in application space; the portal structures themselves live in Portal API space, across the operational boundary
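The addressing chain on the slide can be sketched as plain C data structures. The type and field names below are illustrative assumptions, not the Portals 3.0 headers: a portal-table slot selects a match list, a matching entry selects a memory descriptor, and the descriptor maps a region of application memory with an optional event queue.

```c
/* Sketch of the portal addressing chain (illustrative names only):
 * portal table -> match list -> memory descriptor -> event queue. */
#include <assert.h>
#include <stddef.h>

typedef struct event_queue {
    int count;                          /* events recorded so far */
} event_queue_t;

typedef struct mem_desc_node {
    void          *start;               /* region in application space */
    size_t         length;
    event_queue_t *eq;                  /* optional event queue */
} mem_desc_node_t;

typedef struct match_entry {
    unsigned            match_bits;
    unsigned            ignore_bits;    /* bits excluded from matching */
    mem_desc_node_t    *md;
    struct match_entry *next;           /* next entry in the match list */
} match_entry_t;

#define PTL_TABLE_SIZE 8
typedef struct {
    match_entry_t *table[PTL_TABLE_SIZE];   /* portal table */
} portal_space_t;

/* Follow the chain for one incoming message: the portal index selects
 * the match list; bit matching selects the memory descriptor. */
mem_desc_node_t *portal_lookup(portal_space_t *ps, unsigned portal,
                               unsigned msg_bits)
{
    if (portal >= PTL_TABLE_SIZE)
        return NULL;
    for (match_entry_t *me = ps->table[portal]; me; me = me->next)
        if (((msg_bits ^ me->match_bits) & ~me->ignore_bits) == 0)
            return me->md;
    return NULL;
}
```

The ignore-bits mask is what lets one entry stand in for "match any" or a whole family of tags, which the MPI implementation slides rely on.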
Portal Address Translation (flowchart)
– Enter: while more match entries remain, get the next match entry; if none remain, increment the drop count and discard the message
– If the entry matches and its first MD accepts the message, perform the operation; otherwise continue with the next entry
– After the operation, unlink the MD if requested; if the match entry is then empty and marked for unlinking, unlink it as well
– If an event queue is attached, record an event; exit
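The flowchart's logic can be written out as a short, self-contained C model. Everything here (names, the single-MD-per-entry and always-attached-event-queue simplifications) is an assumption for illustration, not the Portals 3.0 implementation:

```c
/* Model of the address-translation walk: try each match entry; on a
 * match whose first MD accepts, perform the operation, honor the
 * unlink flags, and record an event; otherwise count a drop. */
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t space;        /* bytes the descriptor can still accept */
    int    unlink;       /* unlink the MD after use? */
} md_t;

typedef struct me {
    unsigned   match_bits, ignore_bits;
    md_t      *first_md;            /* first memory descriptor */
    int        unlink_when_empty;   /* unlink ME once its MD is gone? */
    struct me *next;
} me_t;

typedef struct {
    me_t *head;          /* match list for one portal-table entry */
    int   drop_count;    /* messages discarded with no match */
    int   events;        /* events recorded (EQ assumed attached) */
} portal_t;

/* Returns 1 if the message was delivered, 0 if discarded. */
int portal_deliver(portal_t *p, unsigned msg_bits, size_t len)
{
    for (me_t **mep = &p->head; *mep; ) {
        me_t *me = *mep;
        int matches = ((msg_bits ^ me->match_bits) & ~me->ignore_bits) == 0;
        if (matches && me->first_md && me->first_md->space >= len) {
            me->first_md->space -= len;          /* perform operation */
            if (me->first_md->unlink) {
                me->first_md = NULL;             /* unlink MD */
                if (me->unlink_when_empty)
                    *mep = me->next;             /* unlink ME too */
            }
            p->events++;                         /* record event */
            return 1;
        }
        mep = &(*mep)->next;                     /* next match entry */
    }
    p->drop_count++;                 /* no more match entries: drop */
    return 0;
}
```

Note how a non-accepting first MD simply falls through to the next match entry, and how an exhausted list ends in the increment-drop-count/discard path of the flowchart.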
Implementing MPI
Short-message protocol:
– Send the message (expect receipt)
– Unmatched arrivals are handled as unexpected messages
Long-message protocol:
– Post receive
– Send message
– On ACK or Get, release the message
The event includes the memory descriptor.
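The receiver side of the short-message protocol can be sketched as two queues: pre-posted receives and unexpected messages. The names, fixed sizes, and tag-only matching below are illustrative assumptions, not MPICH or Portals code.

```c
/* Model of short-protocol receive matching: an arriving message that
 * matches a pre-posted receive is delivered in place; otherwise it is
 * parked in the unexpected queue for a later receive to drain. */
#include <assert.h>
#include <string.h>

#define MAXQ  8
#define MSGSZ 64

typedef struct { int used; int tag; char *dest; } posted_t;
typedef struct { int used; int tag; char data[MSGSZ]; } unexpected_t;

static posted_t     posted[MAXQ];
static unexpected_t unexpected[MAXQ];

/* Arrival: 1 = delivered into a posted receive, 0 = queued as
 * unexpected, -1 = queue full (dropped). */
int msg_arrive(int tag, const char *data)
{
    for (int i = 0; i < MAXQ; i++)
        if (posted[i].used && posted[i].tag == tag) {
            strcpy(posted[i].dest, data);
            posted[i].used = 0;
            return 1;
        }
    for (int i = 0; i < MAXQ; i++)
        if (!unexpected[i].used) {
            unexpected[i].used = 1;
            unexpected[i].tag  = tag;
            strcpy(unexpected[i].data, data);
            return 0;
        }
    return -1;
}

/* Receive: drain the unexpected queue first (1 = satisfied
 * immediately), else post the receive for a later arrival (0). */
int post_recv(int tag, char *buf)
{
    for (int i = 0; i < MAXQ; i++)
        if (unexpected[i].used && unexpected[i].tag == tag) {
            strcpy(buf, unexpected[i].data);
            unexpected[i].used = 0;
            return 1;
        }
    for (int i = 0; i < MAXQ; i++)
        if (!posted[i].used) {
            posted[i].used = 1;
            posted[i].tag  = tag;
            posted[i].dest = buf;
            return 0;
        }
    return -1;
}
```

The long protocol avoids copying through the unexpected queue: only a header travels eagerly, and the payload is released on ACK or pulled with a Get once the receive is posted.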
Implementing MPI (diagram)
– Match list layout: pre-posted receive entries, then a match-any entry with short, unlinkable buffers for unexpected messages (offset 0, truncate, no ACK), and a match-none "mark" entry
– Completions are recorded in the event queue
Flow Control
Basic flow control:
– Drop messages that the receiver is not prepared for
– Long messages might waste network resources
– Good performance for well-behaved MPI apps
Managing the network with packets:
– Packet size big enough to hold a short message
– The first packet is an implicit RTS
– A flow-control ACK can indicate that the message will be dropped
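The packet-level scheme above can be sketched in a few lines of C. The packet size, names, and accept/drop rule are illustrative assumptions: the point is only that the first packet doubles as an RTS, and a drop-indicating ACK spares the network the rest of a long message.

```c
/* Model of packet-level flow control: the first packet of a message
 * is an implicit request-to-send; the receiver's ACK says whether
 * the remaining packets should be sent at all. */
#include <assert.h>
#include <stddef.h>

#define PKT_SIZE 2048   /* hypothetical: holds a whole short message */

typedef enum { ACK_ACCEPT, ACK_DROP } ack_t;

/* Receiver's reply to the implicit RTS: accept only if a matching
 * receive can take the whole message. */
ack_t on_first_packet(size_t msg_len, size_t recv_space)
{
    return msg_len <= recv_space ? ACK_ACCEPT : ACK_DROP;
}

/* Sender: packets still to send after the first, or 0 if the
 * receiver indicated the message will be dropped. */
size_t packets_remaining(size_t msg_len, ack_t ack)
{
    if (ack == ACK_DROP)
        return 0;                               /* stop early */
    size_t total = (msg_len + PKT_SIZE - 1) / PKT_SIZE;  /* ceil */
    return total > 1 ? total - 1 : 0;
}
```

A short message fits in the RTS packet itself, so a drop costs at most one packet; that is why the slide calls the basic drop policy good for well-behaved MPI applications.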
Portals 3.0 Status
Currently testing Cplant Release 0.5:
– Portals 3.0 kernel module using the RTS/CTS module over Myrinet
– Port of MPICH 1.2.0 over Portals 3.0
TCP/IP reference implementation ready.
Port to LANai begun.
http://www.cs.sandia.gov/cplant