Presentation on theme: "Application of General Purpose HPC Systems in HPEC" — Presentation transcript:

1 Application of General Purpose HPC Systems in HPEC
David Alexander, Director of Engineering, SGI, Inc.

2 SGI HPC System Architecture

3 SGI Scalable ccNUMA Architecture: Basic Node Structure and Interconnect
[Node diagram] Each node pairs a CPU and cache with local physical memory (2-16GB) through an interface chip; the processor interface runs at >6GB/sec and the memory interface at >8GB/sec. Interface chips link node to node over the NUMAlink interconnect at >6GB/sec.

4 SGI Scalable ccNUMA Architecture: Scaling to Large Node Counts
[Diagram: multiple CPU/memory/interface-chip nodes joined by the system interconnect fabric]
Scaling to 100's of processors
RT memory latency < 600ns worst case (64p config)
Bi-section bandwidth >25GB/sec (64p config)
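On a ccNUMA system the application sees one address space, but latency depends on which node's memory a page lands in, so placement matters. As a minimal sketch (not from the presentation, assuming a NUMA-enabled Linux system with the libnuma development package, not any SGI-specific interface), the following C program places a buffer's pages on one node so CPUs there work from local memory; the node number and buffer size are arbitrary illustration values.

    /* NUMA-aware allocation sketch (illustrative; Linux libnuma).
     * Build with: cc numa_place.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        size_t bytes = 64UL * 1024 * 1024;   /* 64MB working buffer */
        int node = 0;                        /* place pages on node 0 */

        /* Ask for pages backed by physical memory on the chosen node, so
         * threads scheduled there see local rather than remote latency. */
        double *buf = numa_alloc_onnode(bytes, node);
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        memset(buf, 0, bytes);               /* touch pages to commit them */
        printf("placed %zu bytes on node %d (highest node: %d)\n",
               bytes, node, numa_max_node());

        numa_free(buf, bytes);
        return 0;
    }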

5 SGI Scalable ccNUMA Architecture: SGI® NUMAflex™ Modular Design
[Diagram labels: System Interconnect; CPU & Memory; System I/O; Standard I/O Expansion; High BW I/O Expansion; Graphics Expansion; Storage Expansion; modules pictured are from the SGI® Onyx® 3000 series]
Scalability: No central bus or switch; just modules and NUMAlink™ cables
Investment protection: Add new technologies as they evolve
Flexibility: Tailored configurations for different dimensions of scalability
Performance: High-bandwidth interconnect with very low latency
Speaker notes: This slide is the introduction to modular computing. Regardless of the actual system (the SGI Onyx 3000 modules are pictured), individual, specialized modules (or "bricks") are combined in various ratios to suit a customer's specific application environment. The main benefits of NUMAflex modular computing are listed on the slide. Performance is key among these. Each corner of our hypercube (the R-brick) offers 3.2 GB/sec of bandwidth. Our round-trip local memory latency (within the C-brick) is only 180 nanoseconds, which is over 30% faster than IBM Regatta. Remote memory latency over the R-bricks is only 50 nanoseconds per hop, compared to about 10 microseconds (10,000 nanoseconds) on a current-generation IBM SP switch.

6 SGI Scalable ccNUMA Architecture: The Benefits of Shared Memory
[Diagram: traditional clusters, where each node + OS has its own memory on a commodity interconnect, versus SGI® Altix™ 3000, where nodes share global memory over the fast NUMAflex™ interconnect]
What is shared memory? All nodes operate on one large shared memory space, instead of each node having its own small memory space.
Shared memory is high-performance: all nodes can access one large memory space efficiently, so complex communication and data passing between nodes aren't needed, and big data sets fit entirely in memory, so less disk I/O is needed.
Shared memory is cost-effective and easy to deploy: the SGI Altix family supports all major parallel programming models, requires less memory per node because large problems can be solved in the big shared memory, and simpler programming means lower tuning and maintenance costs.
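As an illustration of the shared-memory programming model the slide refers to (a minimal sketch, not taken from the presentation, written in standard OpenMP C rather than any SGI-specific interface), every thread reads the same array in one address space; there is no explicit message passing or data distribution as there would be on a traditional cluster.

    /* Shared-memory parallel sum sketch (illustrative; standard OpenMP).
     * Build with: cc -fopenmp shared_sum.c */
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        long n = 1L << 24;                       /* ~16M doubles in one shared array */
        double *data = malloc(n * sizeof *data);
        if (!data) return 1;

        for (long i = 0; i < n; i++)
            data[i] = 1.0;

        double sum = 0.0;

        /* Every thread reads the same shared array directly; the reduction
         * clause combines partial sums -- no explicit communication needed. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += data[i];

        printf("threads=%d sum=%.0f\n", omp_get_max_threads(), sum);
        free(data);
        return 0;
    }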

7 SGI® Altix™ 3000 HPC Product

8 SGI® Altix™ Overview
First Linux® node with 64 CPUs in a single-OS image
First clusters with global shared memory across multiple nodes
First Linux solution with HPC system- and data-management tools
World-record performance for floating-point calculations, memory performance, I/O bandwidth, and real technical applications

9 SGI® Altix™ 3000: Fusion of Powerhouse Technologies
SGI® supercomputing technology, Intel's most advanced processors, and open-source computing:
Global shared memory (64-bit shared memory across cluster nodes)
SGI ProPack™ and supercomputing enhancements
SGI® NUMAlink™ (built-in high-bandwidth, low-latency interconnect)
SGI® NUMAflex™ (third-generation modular supercomputing architecture)
Intel® Itanium® 2 processor family
Industry-standard Linux®

10 SGI® Altix™ 3000 Family: Extreme Power, Extreme Potential
Model 3700 Superclusters (scalable performance offering):
4–64 Itanium® 2 processors in a single node, at 1.30 GHz/3MB L3 cache or 1.50 GHz/6MB L3 cache
Shared memory across nodes; scalable to 2,048 processors and 16TB memory
Nodes up to 64P, 4TB memory
Model 3300 Servers (single-node entry offering):
4– GHz Itanium® 2 processors, 3MB L3 cache
Up to 96GB memory

11 Multi-Paradigm Architecture

12 Multi-Paradigm Architecture Overview
[Diagram: C, IO, MSE, and V elements connected through R routers on the NUMAlink system interconnect]
General-purpose compute nodes
Peer-attached general-purpose I/O
Mission-specific acceleration and/or I/O (MSE)
Integrated graphics/visualization

13 Multi-Paradigm Architecture: Mission Specific Element (MSE)
[Diagram: MSE blade with memory, TIO ASIC, and an FPGA, DSP, or ASIC; >5GB/s external I/O; two NL4 ports (>5GB/s x 2) into the system interconnect fabric]
Implementation variants:
Customer supplied/loaded FPGA algorithms and/or ASIC/DSP
Subroutine or standard library acceleration
Specific use "appliance"
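One way to picture the "subroutine or standard library acceleration" variant: the application keeps calling a single routine, and a thin wrapper routes the work to the MSE when one is attached, falling back to the host CPUs otherwise. The C sketch below is purely illustrative; mse_present and mse_vector_mul are invented stub names standing in for whatever interface a real MSE would expose, not an SGI API.

    /* Dispatch-wrapper sketch for subroutine acceleration (illustrative).
     * The mse_* functions are invented stubs, not a real API. */
    #include <stddef.h>
    #include <stdio.h>

    /* Stub: a real build would probe for an attached FPGA/DSP/ASIC blade. */
    static int mse_present(void) { return 0; }

    /* Stub standing in for the offloaded kernel (e.g., move buffers to the
     * blade and run the FPGA implementation there). */
    static void mse_vector_mul(const float *a, const float *b, float *out, size_t n)
    {
        (void)a; (void)b; (void)out; (void)n;
    }

    /* CPU fallback with the same contract as the accelerated path. */
    static void cpu_vector_mul(const float *a, const float *b, float *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = a[i] * b[i];
    }

    /* Single entry point seen by the application; where the work runs is
     * decided here, not in the application code. */
    static void vector_mul(const float *a, const float *b, float *out, size_t n)
    {
        if (mse_present())
            mse_vector_mul(a, b, out, n);
        else
            cpu_vector_mul(a, b, out, n);
    }

    int main(void)
    {
        float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, out[4];
        vector_mul(a, b, out, 4);
        printf("%.0f %.0f %.0f %.0f\n", out[0], out[1], out[2], out[3]);
        return 0;
    }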

14 Embedded High Performance Computing (EHPC)

15 Embedded High Performance Computing: Seamless HPC Application Unified Development/Deployment Environment
EHPC Today: Different development and deployment environments
Development environment: HPC methods, tools, and hardware; the architecture provides freedom to develop new approaches
Platform port: months of work, millions of dollars, significant schedule risk, significant architectural/performance risk
Deployment: proprietary hardware and software; "islands of memory" narrow the algorithm options
EHPC Vision: Unified development and deployment
Enhanced software development productivity
Superior performance and HW utilization
Demonstration to deployment in days
Benefits from mainstream HPC advances
SGI EHPC System: scalable ccNUMA system architecture, embedded systems packaging, seamless HPC application development model, mission-specific processing

16 Embedded High Performance Computing: EHPC Platform
SGI EHPC System: scalable ccNUMA system architecture, embedded systems packaging, seamless HPC application development model, mission-specific processing
Unified software environment for development, prototyping, and deployment
Full multi-paradigm computing support
Field-upgradeable mission-specific processing acceleration
Highly optimized performance per watt and performance per slot
Designed to fit established environmental requirements, form factors, and interfaces:
6U Eurocard form factor
Passive, slot-configurable backplane
PMC module connectivity to standard interfaces
Able to address oceanic, ground, and airborne environments

17 Embedded High Performance Computing: Family of 6U Form Factor Modules
[Diagram: two 6U form factor blades]
Scalable general-purpose processor/memory blade: µP with cache, NUMAlink controller, DDR SDRAM memory, system control, diagnostic & L1 ports, NUMAlink4 ports
General-purpose I/O blade: PIC, NUMAlink controller, PMC module site, system control, diagnostic & L1 ports, NUMAlink4 ports

18 Embedded High Performance Computing: Family of 6U Form Factor Modules
[Diagram: two 6U form factor blades]
Mission-specific accelerator and/or I/O blades: TIO, FPGA (one variant pairs the FPGA with an ASIC), memory, system control, diagnostic & L1 ports, a 6.4GB/s NL4 NUMAlink4 port, and optional external I/O

19 ©2002 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, IRIX, Challenge, Origin, Onyx, and the SGI logo are registered trademarks and OpenVault, FailSafe, Altix, XFS, and CXFS are trademarks of Silicon Graphics, Inc., in the U.S. and/or other countries worldwide. MIPS is a registered trademark of MIPS Technologies, Inc., used under license by Silicon Graphics, Inc. UNIX is a registered trademark of The Open Group in the U.S. and other countries. Intel and Itanium are registered trademarks of Intel Corporation. Linux is a registered trademark of Linus Torvalds. Mac is a registered trademark of Apple Computer, Inc. Windows is a registered trademark or trademark of Microsoft Corporation in the United States and/or other countries. Red Hat is a registered trademark of Red Hat, Inc. All other trademarks are the property of their respective owners. (1/03)


Download ppt "Application of General Purpose HPC Systems in HPEC"

Similar presentations


Ads by Google