
1 “Big” and “Not so Big” Iron at SNL
Douglas Doerfler, Sandia National Labs
SOS8, Charleston, SC, April 13th, 2004

2 SNL CS R&D Accomplishment: Pathfinder for MPP Supercomputing
[Slide graphics: nCUBE, Paragon, ASCI Red, Cplant, ICC, Red Storm]
Sandia successfully led the DOE/DP revolution into MPP supercomputing through CS R&D:
–nCUBE-10
–nCUBE-2
–iPSC/860
–Intel Paragon
–ASCI Red
–Cplant
… and gave DOE a strong, scalable parallel platforms effort
Computing at SNL is an applications success (i.e., uniquely high scalability & reliability among FFRDCs) because CS R&D paved the way
Note: There was considerable skepticism in the community that MPP computing would be a success

3 Our Approach
Large systems with a few processors per node
Message passing paradigm
Balanced architecture
Efficient systems software
Critical advances in parallel algorithms
Real engineering applications
Vertically integrated technology base
Emphasis on scalability & reliability in all aspects

4 A Scalable Computing Architecture

5 ASCI Red
4,576 compute nodes
–9,472 Pentium II processors
800 MB/sec bi-directional interconnect
3.21 Peak TFlops
–2.34 TFlops on Linpack
–74% of peak
–9,632 processors
TOS on service nodes
Cougar LWK on compute nodes
1.0 GB/sec parallel file system

6 Computational Plant
Antarctica - 2,376 nodes
Antarctica has 4 “heads” with a switchable center section
–Unclassified Restricted Network
–Unclassified Open Network
–Classified Network
Compaq (HP) DS10L “Slates”
–466 MHz EV6, 1 GB RAM
–600 MHz EV67, 1 GB RAM
Re-deployed Siberia XP1000 nodes
–500 MHz EV6, 256 MB RAM
Myrinet
–3D mesh topology
–33 MHz, 64-bit
–A mix of 1,280 and 2,000 Mbit/sec technology
–LANai 7.x and 9.x
Runtime software
–Yod - application loader
–Pct - compute node process control
–Bebopd - allocation
–OpenPBS - batch scheduling
–Portals message passing API
Red Hat Linux 7.2 with 2.4.x kernel
Compaq (HP) Fortran, C, C++
MPICH over Portals
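For context, applications on Cplant were written to the message-passing model and run through MPICH layered over the Portals API, launched with yod. The sketch below is a minimal, generic MPI program in C illustrating that model; it is my illustration rather than code from the deck, and the build/launch notes in the comments (vendor compiler, yod) are assumptions about typical usage, not exact Cplant command lines.

```c
/* Minimal MPI point-to-point example in C.
 * Illustrative only: on a Cplant-style system this would be built with the
 * vendor C compiler against MPICH-over-Portals and launched with yod
 * (the exact compiler flags and yod arguments are assumptions, not taken
 * from the presentation). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;
        /* Rank 0 passes a token to rank 1, if the job has more than one rank. */
        if (size > 1)
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 of %d received token %d\n", size, token);
    }

    MPI_Finalize();
    return 0;
}
```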

7 Institutional Computing Clusters
Two (classified/unclassified) 256-node clusters in NM
–236 compute nodes: dual 3.06 GHz Xeon processors, 2 GB memory, Myricom Myrinet PCI NIC (XP, Rev D, 2 MB)
–2 admin nodes
–4 login nodes
–2 MetaData Server (MDS) nodes
–12 Object Store Target (OST) nodes
–256-port Myrinet switch
A 128-node (unclassified) and a 64-node (classified) cluster in CA
Compute nodes
–RedHat Linux 7.3
–Application directory: MKL math library, TotalView client, VampirTrace client, MPICH-GM
–OpenPBS client
–PVFS client
–Myrinet GM
Login nodes
–RedHat Linux 7.3
–Kerberos
–Intel compilers: C, C++, Fortran
–Open source compilers: gcc, Java
–TotalView
–VampirTrace
–Myrinet GM
Administrative nodes
–Red Hat Linux 7.3
–OpenPBS
–Myrinet GM with Mapper
–SystemImager
–Ganglia, Mon, CAP, Tripwire
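As a quick consistency check (my arithmetic, not part of the slide), the per-cluster node counts in NM sum to the stated 256 nodes, matching the 256-port Myrinet switch:

\[
236_{\text{compute}} + 2_{\text{admin}} + 4_{\text{login}} + 2_{\text{MDS}} + 12_{\text{OST}} = 256 \ \text{nodes}.
\]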

8 Usage

9 Red Squall Development Cluster
Hewlett-Packard collaboration
–Integration, testing, system SW support
–Lustre and Quadrics expertise
RackSaver BladeRack nodes
–High-density compute server architecture: 66 nodes (132 processors) per rack
–2.0 GHz AMD Opteron, same as Red Storm but with commercial Tyan motherboards
–2 GBytes of main memory per node (same as Red Storm)
Quadrics QsNetII (Elan4) interconnect
–Best-in-class (commercial cluster interconnect) performance
I/O subsystem uses DDN S2A8500 couplets with Fibre Channel disk drives (same as Red Storm)
–Best-in-class performance
Located in the new JCEL facility

10 Douglas Doerfler, Sandia National Labs, April 13th, 2004, SOS8, Charleston, SC

11 Red Storm Goals
Balanced System Performance - CPU, memory, interconnect, and I/O
Usability - functionality of hardware and software meets the needs of users for massively parallel computing
Scalability - system hardware and software scale from a single-cabinet system to a ~20,000 processor system
Reliability - machine stays up long enough between interrupts to make real progress on completing an application run (at least 50 hours MTBI); requires full system RAS capability
Upgradability - system can be upgraded with a processor swap and additional cabinets to 100 TFlops or greater
Red/Black Switching - capability to switch major portions of the machine between classified and unclassified computing environments
Space, Power, Cooling - high density, low power system
Price/Performance - excellent performance per dollar; use high volume commodity parts where feasible

12 Red Storm Architecture
True MPP, designed to be a single system
Distributed memory MIMD parallel supercomputer
Fully connected 3-D mesh interconnect; each compute node and service and I/O node processor has a high bandwidth, bi-directional connection to the primary communication network
108 compute node cabinets and 10,368 compute node processors (AMD Opteron @ 2.0 GHz)
~10 TB of DDR memory @ 333 MHz
Red/Black switching - ~1/4, ~1/2, ~1/4
8 service and I/O cabinets on each end (256 processors for each color)
240 TB of disk storage (120 TB per color)
Functional hardware partitioning - service and I/O nodes, compute nodes, and RAS nodes
Partitioned Operating System (OS) - LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped down LINUX on RAS nodes
Separate RAS and system management network (Ethernet)
Router table based routing in the interconnect
Less than 2 MW total power and cooling
Less than 3,000 square feet of floor space

13 Red Storm Layout
Less than 2 MW total power and cooling
Less than 3,000 square feet of floor space
Separate RAS and system management network (Ethernet)
3D Mesh: 27 x 16 x 24 (x, y, z)
Red/Black split: 2,688 : 4,992 : 2,688
Service & I/O: 2 x 8 x 16
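As a quick arithmetic check against the architecture numbers on the previous slide (my addition, not part of the deck), the mesh dimensions, the Red/Black split, and the service and I/O partition are all mutually consistent:

\begin{align*}
27 \times 16 \times 24 &= 10{,}368 \ \text{compute node processors},\\
2{,}688 + 4{,}992 + 2{,}688 &= 10{,}368,\\
2 \times 8 \times 16 &= 256 \ \text{service and I/O processors per color}.
\end{align*}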

14 Red Storm Cabinet Layout
Compute Node Cabinet
–3 Card Cages per Cabinet
–8 Boards per Card Cage
–4 Processors per Board
–4 NIC/Router Chips per Board
–N+1 Power Supplies
–Passive Backplane
Service and I/O Node Cabinet
–2 Card Cages per Cabinet
–8 Boards per Card Cage
–2 Processors per Board
–2 NIC/Router Chips per Board
–PCI-X for each Processor
–N+1 Power Supplies
–Passive Backplane
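Multiplying out the compute cabinet layout (my arithmetic, not from the deck) reproduces the processor count given on the architecture slide:

\begin{align*}
3 \ \text{card cages} \times 8 \ \text{boards} \times 4 \ \text{processors} &= 96 \ \text{processors per compute cabinet},\\
96 \times 108 \ \text{cabinets} &= 10{,}368 \ \text{compute node processors}.
\end{align*}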

15 Red Storm Software
Operating Systems
–LINUX on service and I/O nodes
–LWK (Catamount) on compute nodes
–LINUX on RAS nodes
File Systems
–Parallel file system - Lustre (PVFS)
–Unix file system - Lustre (NFS)
Run-Time System
–Logarithmic loader
–Node allocator
–Batch system - PBS
–Libraries - MPI, I/O, Math
Programming Model
–Message passing
–Support for heterogeneous applications
Tools
–ANSI standard compilers - Fortran, C, C++
–Debugger - TotalView
–Performance monitor
System Management and Administration
–Accounting
–RAS GUI interface
–Single system view

16 Red Storm Performance
Based on application code testing on production AMD Opteron processors, we now expect Red Storm to deliver around a 10x performance improvement over ASCI Red on Sandia’s suite of application codes
Expected MP-Linpack performance: ~30 TF
Processors
–2.0 GHz AMD Opteron (Sledgehammer)
–Integrated dual DDR memory controllers @ 333 MHz; page miss latency to local processor memory is ~80 nanoseconds; peak bandwidth of ~5.3 GB/s for each processor
–3 integrated HyperTransport interfaces @ 3.2 GB/s each direction
Interconnect performance
–Latency: <2 µs (neighbor), <5 µs (full machine)
–Peak link bandwidth: ~3.84 GB/s each direction
–Bi-section bandwidth: ~2.95 TB/s Y-Z, ~4.98 TB/s X-Z, ~6.64 TB/s X-Y
I/O system performance
–Sustained file system bandwidth of 50 GB/s for each color
–Sustained external network bandwidth of 25 GB/s for each color
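Several of the quoted figures can be roughly reproduced from the per-channel and per-link numbers; the arithmetic below is my own sanity check, not part of the deck, and assumes each link contributes its per-direction bandwidth in both directions across a bi-section cut:

\begin{align*}
\text{Memory: } 2 \ \text{channels} \times 333 \ \text{MT/s} \times 8 \ \text{B} &\approx 5.3 \ \text{GB/s per processor},\\
\text{Y-Z bi-section: } (16 \times 24) \times 3.84 \ \text{GB/s} \times 2 &\approx 2.95 \ \text{TB/s},\\
\text{X-Z bi-section: } (27 \times 24) \times 3.84 \ \text{GB/s} \times 2 &\approx 4.98 \ \text{TB/s}.
\end{align*}

The same estimate for the X-Y cut, (27 x 16) x 3.84 GB/s x 2 ≈ 3.32 TB/s, is half the quoted ~6.64 TB/s; the larger figure would be consistent with twice as many links crossing the Z-direction cut (for example, wraparound links), though the deck itself does not say so.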

17 HPC R&D Efforts at SNL
Advanced Architectures
–Next-generation processor & interconnect technologies
–Simulation and modeling of algorithm performance
Message Passing
–Portals
–Application characterization of message passing patterns
Light Weight Kernels
–Project to design a next-generation lightweight kernel (LWK) for compute nodes of a distributed memory massively parallel system
–Assess the performance, scalability, and reliability of a lightweight kernel versus a traditional monolithic kernel
–Investigate efficient methods of supporting dynamic operating system services
Light Weight File System
–Only critical I/O functionality (storage, metadata management, security)
–Special functionality implemented in I/O libraries (above LWFS)
Light Weight OS
–Linux configuration to eliminate the need for a remote /root
–“Trimming” the kernel to eliminate unwanted and unnecessary daemons
Cluster Management Tools
–Diskless cluster strategies and techniques
–Operating system distribution and initialization
–Log analysis to improve robustness, reliability, and maintainability

18 More Information
Computation, Computers, Information and Mathematics Center
http://www.cs.sandia.gov

