1
Considerations for Scalable CAE on the SGI ccNUMA Architecture
Stan Posey, Applications Market Development
Cheng Liao, Principal Scientist, FEA Applications
Christian Tanasescu, CAE Applications Manager
2
Topics of Discussion
Historical Trends of CAE
Current Status of Scalable CAE
Future Directions in Applications
3
Motivation for CAE Technology
Economics: physical prototyping costs continue increasing, and the engineer is more expensive than simulation tools.
[Chart: cost trends from 1960 to 2000, from the mainframe era to workstations and servers, comparing the cost of CAE simulation, physical prototyping, and the CAE engineer.]
MSC/NASTRAN simulation costs (source: General Motors): 1960: $30,000; 1999: $0.02.
CAE engineer vs. system costs (source: Detroit Big 3): engineer $36/hr; system $1.5/hr.
4
Recent Technology Achievements: Rapid CAE Advancement from 1996 to 1999
Computer hardware advances:
Processors: ability to "hide" system latency
Architecture: ccNUMA crossbar switch replaces the shared bus
5
Recent History of Parallel Computing
Late 1980s: Shared Memory Parallel
Hardware: bus-based shared memory parallel (SMP)
Parallel model: compiler-enabled loop level (SMP fine grain)
Characteristics: low scalability (2p to 6p) but easy to program
Limitations: expensive memory for vector architectures
Early 1990s: Distributed Memory Parallel
Hardware: MPP and cluster distributed memory parallel (DMP)
Parallel model: DMP coarse grain through explicit message passing
Characteristics: high scalability (> 64p) but difficult to program
Limitations: commercial CAE applications generally unavailable
Late 1990s: Distributed Shared Memory Parallel
Hardware: physically DMP but logically SMP (ccNUMA)
Parallel model: SMP fine grain; DMP and SMP coarse grain
Characteristics: high scalability and easy to program
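The three eras above correspond to programming models that a ccNUMA system can combine in one code. A minimal sketch of that combination, assuming nothing more than a generic array update (not any particular CAE kernel): coarse-grain DMP across MPI processes, fine-grain SMP via an OpenMP loop inside each process.

```c
/* Hybrid DMP + SMP sketch: MPI ranks own disjoint blocks of a vector   */
/* (coarse grain), and an OpenMP loop updates each block (fine grain).  */
/* Illustrative only; not taken from any CAE code.                      */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n_local = 1000000;              /* block owned by this rank */
    double *x = malloc(n_local * sizeof *x);

    /* Fine-grain SMP parallelism inside the rank */
    #pragma omp parallel for
    for (int i = 0; i < n_local; i++)
        x[i] = 2.0 * i + rank;                /* placeholder work */

    /* Coarse-grain DMP: combine a partial result across ranks */
    double local_sum = 0.0, global_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < n_local; i++)
        local_sum += x[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %e\n", global_sum);
    free(x);
    MPI_Finalize();
    return 0;
}
```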
6
Origin ccNUMA Architecture Basics
Features of the ccNUMA multi-purpose architecture.
[Diagram: detail of a two-node (with router) building block and a 32p topology, showing processors with caches, local switches, I/O, main memory with directory, and the global switch interconnect.]
7
Parallel Computing with ccNUMA: Origin2000/256
Features of the ccNUMA multi-purpose architecture:
Origin2000 ccNUMA available since 1996
Non-blocking crossbar switch as the interconnect fabric
High levels of scalability over shared-bus SMP
Physically DMP but logically SMP (cache-coherent memories)
2 to 512 MIPS R12000/400MHz processors with 8MB L2 cache
High memory bandwidth (1.6 GB/s) and scalable I/O
Distributed and shared memory (fine- and coarse-grain) parallel models
8
Recent Technology Achievements: Rapid CAE Advancement from 1996 to 1999
Computer hardware advances:
Processors: ability to "hide" system latency
Architecture: ccNUMA crossbar switch replaces the shared bus
Application software advances:
Implicit FEA: sparse solvers increase performance by 10-fold
Explicit FEA: domain parallelism increases performance by 10-fold
CFD: scalability increases performance by 100-fold
Meshing: automatic and robust "tetra" meshing
9
Characterization of CAE Applications
[Chart: degree of parallelism vs. compute intensity (flops per word of memory traffic, roughly 0.1 to 1000), running from memory-bandwidth-bound (low) to cache-friendly (high). CFD codes (FLUENT, STAR-CD, OVERFLOW) sit at high parallelism; explicit FEA (PAM-CRASH, LS-DYNA, RADIOSS) in the middle; implicit FEA statics (ABAQUS, ANSYS, ADINA, MARC, MSC.Nastran 101), modal frequency response (MSC.Nastran 103 and 111), and direct frequency response (MSC.Nastran 108) toward lower parallelism.]
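Compute intensity here means floating-point operations per word of memory traffic. A hedged illustration of how that axis is read, using two textbook kernels rather than any of the applications plotted:

```c
/* Illustrative compute-intensity (flops per word) estimates for two
 * generic kernels; these are not measurements of the CAE codes on the
 * chart.                                                               */

/* DAXPY: y[i] = a*x[i] + y[i]
 *   flops per element : 2   (one multiply, one add)
 *   words per element : 3   (load x[i], load y[i], store y[i])
 *   intensity         : 2/3, about 0.7 flops/word -> bandwidth bound   */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Dense matrix-vector product: y = A*x, A is n x n
 *   flops : 2*n*n
 *   words : n*n (A) + 2*n (x, y), about n*n for large n
 *   intensity : about 2 flops/word -> still bandwidth limited; blocked
 *   matrix-matrix kernels reach much higher intensity and land in the
 *   cache-friendly region on the right of the chart.                   */
void dgemv(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j] * x[j];
        y[i] = s;
    }
}
```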
10
Characterization of CAE Applications (continued)
[Same chart as the previous slide, with the MP, scalar, and vector architecture regimes overlaid on the parallelism / compute-intensity plane.]
11
Topics of Discussion Historical Trends of CAE Current Status of Scalable CAE Future Directions in Applications
12
Scalable CAE: Domain Decomposition Parallel
Parallel scalability emerging for all CAE:
Implicit FEA: ABAQUS, ANSYS, MSC.Marc, MSC.Nastran
Explicit FEA: LS-DYNA, PAM-CRASH, RADIOSS
General CFD: CFX, FLUENT, STAR-CD
Domain parallel example: compressible 2D flow over a wedge, partitioned into 4 domains for parallel execution on 4 CPUs of one system (a sketch of the pattern follows).
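A minimal sketch of the domain-parallel idea on a 1-D decomposition, assuming a generic explicit update with one layer of halo cells exchanged between neighboring partitions each step. This is illustrative only; the actual partitions in the codes listed are 3-D and unstructured.

```c
/* 1-D domain decomposition sketch: each MPI rank owns a strip of cells
 * plus one halo cell on each side, exchanged with its neighbors every
 * step. Production CAE codes use 3-D unstructured partitions, e.g.
 * produced by a graph partitioner.                                     */
#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 1024          /* interior cells per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nranks - 1) ? MPI_PROC_NULL : rank + 1;

    /* u[0] and u[NLOCAL+1] are halo cells */
    double *u     = calloc(NLOCAL + 2, sizeof *u);
    double *u_new = calloc(NLOCAL + 2, sizeof *u_new);

    for (int step = 0; step < 100; step++) {
        /* Exchange halo cells with both neighbors */
        MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  0,
                     &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL],     1, MPI_DOUBLE, right, 1,
                     &u[0],          1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Explicit update on interior cells (placeholder stencil) */
        for (int i = 1; i <= NLOCAL; i++)
            u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);
        double *tmp = u; u = u_new; u_new = tmp;
    }

    free(u); free(u_new);
    MPI_Finalize();
    return 0;
}
```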
13
Parallel Scalability in CAE
[Chart: usable vs. peak parallelism on a scale of 1 to 512 CPUs for MSC.Nastran (SOL 101, 103, 108; SMP and DMP; V70.5 and V70.7), crash codes, and CFD codes.]
14
Considerations for Scalable CAE: Sources that Inhibit Efficient Parallelism
Source: computational load imbalance. Solution: nearly equal-sized partitions.
Source: communication overhead between neighboring partitions. Solution: minimize communication between adjacent cells on different CPUs.
Source: data and process placement. Solution: enforce memory-process affinity.
Source: message-passing performance (MPICH latency ~31 µs, scaling to 16p only). Solution: latency and bandwidth awareness (SGI MPI 3.1 latency ~12 µs, scaling to 64p).
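Message-passing latency of the kind quoted above is usually measured with a simple ping-pong between two ranks. A minimal sketch (the repetition count and one-byte message size are arbitrary choices, not from the slide's methodology):

```c
/* Ping-pong latency microbenchmark between rank 0 and rank 1.
 * Half the average round-trip time of a small message approximates the
 * one-way MPI latency figures quoted on the slide.                     */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 10000;
    char byte = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("approx. one-way latency: %.2f us\n",
               (t1 - t0) / reps / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```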
15
Considerations for Scalable CAE: Processor-Memory Affinity (Data Placement)
Theory: the system will place data and execution threads together properly, and will migrate that data to follow the executing process.
Real life (32p Origin 2000): the process migrates, but its data stays behind.
[Diagram: router/node topology showing a process migrating away from its data versus process and data kept together.]
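On a ccNUMA system such as Origin, memory pages are typically placed on the node of the processor that first touches them, so initializing data with the same thread layout that later uses it keeps the data local. A minimal sketch of that first-touch idiom with OpenMP, assuming default first-touch placement and threads that stay bound to their processors (the array and kernel are placeholders, not taken from the presentation):

```c
/* First-touch placement sketch: the same OpenMP thread that will later
 * compute on a block of the array also initializes it, so its pages end
 * up in that thread's local memory on a ccNUMA node. Assumes first-touch
 * page placement and threads bound to their processors.                 */
#include <omp.h>
#include <stdlib.h>

#define N 8000000

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);

    /* Parallel initialization: pages are touched (and therefore placed)
     * by the owning thread, not all by the master thread.               */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    /* Later compute phase with the same static schedule: each thread
     * works mostly on locally placed pages.                             */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] += 2.0 * b[i];

    free(a); free(b);
    return 0;
}
```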
16
FLUENT Scalability on ccNUMA: Scalability Study of SSI vs. Cluster
Software: FLUENT 5.1.1
CFD model: external aerodynamics, 3D, segregated, incompressible, isothermal, 29M cells
Time per iteration in seconds (speedup relative to 10 CPUs):

CPUs | SSI         | 4 x 64 cluster
10   | 381 (1.0)   | 424 (1.0)
30   | 99  (3.9)   | 139 (3.0)
60   | 67  (5.7)   | 72  (5.9)
120  | 29  (13.1)  | 39  (10.9)
240  | 18  (21.2)  | 49  (8.7)

The largest FLUENT automotive case achieved near-ideal scaling on an SGI 2800/256.
17
SSI Advantage for CFD with MPI
Single system image (SSI) latency vs. system size:

CPUs | Shared memory latency (ns) | MPI latency (ns)
8    | 528                        | 19,000
16   | 641                        | 23,000
32   | 710                        | 26,000
64   | 796                        | 29,000
128  | 903                        | 34,000
256  | 1,200                      | 44,000

Cluster configuration (4 x 64 cluster vs. 256-CPU SSI): HIPPI OS-bypass latency 139,000 ns.
18
Grand Scale HPC: NASA and Boeing
OVERFLOW complete Boeing 747 aerodynamics simulation (NASA Ames Research Center, Boeing Commercial Aircraft).
Problem: 35M points, 160 zones; the largest model in NASA history, achieving 60 GFLOP/s on an SGI 2800/512 with linear scaling.
[Chart: performance in GFLOP/s vs. number of CPUs (0 to 512), reaching 60 GFLOP/s in Oct 99, shown against the FY98 milestone and the C916/16 OVERFLOW limit.]
19
Computational Requirements for MSC.Nastran
[Chart: relative I/O activity, memory bandwidth demand, and CPU cycles for the sparse direct, Lanczos, and iterative solver compute tasks (values shown: 7%/93%, 60%/40%, 83%/17%, 100%/0%).]
20
MSC.Nastran Scalability on ccNUMA
MSC/NASTRAN MPI-based scalability for SOL 108 (direct frequency response):
Independent frequency steps, naturally parallel
File and memory space not shared
Near-linear parallel scalability
Improved accuracy over SOL 111 with increasing frequency
Released on SGI with V70.7 (Oct 99)
MSC/NASTRAN MPI-based scalability for SOL 103 and 111 (modal): typical scalability of 2x to 3x on 8p, less for SOL 111.
21
MSC.Nastran Scalability on ccNUMA: Parallel Schematics
Parallel schemes for an excitation frequency range of 0-200 Hz on a 4-CPU system (a sketch of the SOL 108 scheme follows):
SOL 111 (modal): the eigensolution is partitioned by frequency band from 0 to 400 Hz, with 150, 350, 300, and 200 modes assigned to CPUs 1 through 4.
SOL 108 (direct): the 200 excitation frequency steps are partitioned as 1-50, 51-100, 101-150, and 151-200 across CPUs 1 through 4.
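Because each excitation frequency in SOL 108 can be solved independently, the distribution across MPI processes is embarrassingly parallel. A hedged sketch of such a split into contiguous blocks of frequency steps per rank, in the spirit of the schematic above; solve_frequency_step is a hypothetical placeholder, not an MSC.Nastran routine.

```c
/* Block distribution of independent excitation frequency steps across
 * MPI ranks. The solver call is a hypothetical placeholder, not
 * MSC.Nastran's API.                                                    */
#include <mpi.h>
#include <stdio.h>

void solve_frequency_step(double freq_hz)   /* placeholder for real work */
{
    (void)freq_hz;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int    nsteps = 200;        /* e.g., 1 Hz steps up to 200 Hz */
    const double df     = 1.0;

    /* Contiguous block of steps for this rank (last rank takes the rest) */
    int per_rank = nsteps / nranks;
    int first    = rank * per_rank;
    int last     = (rank == nranks - 1) ? nsteps : first + per_rank;

    for (int k = first; k < last; k++)
        solve_frequency_step((k + 1) * df);  /* no data shared between ranks */

    printf("rank %d solved steps %d..%d\n", rank, first + 1, last);
    MPI_Finalize();
    return 0;
}
```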
22
MSC.Nastran Scalability on ccNUMA
SOL 108 comparison with conventional NVH (SOL 111 on a Cray T90).
Cray T90 baseline: SOL 111, 525K DOF, eigensolution of 2714 modes, 96 frequency steps, elapsed time 31,610 s.
SOL 108 parallel results:

CPUs | Elapsed parallel time (s) | Speedup
1    | 120,720                   | 1.0
2    | 61,680                    | 2.0
4    | 32,160                    | 3.8
8    | 17,387                    | 6.9
16   | 10,387                    | 11.6 (*)

(*) measured on populated nodes
23
The Future of Automotive NVH Modeling: MSC.Nastran Scalability on ccNUMA
MSC.Nastran parallel scalability for direct frequency response (SOL 108).
Model description: BIW, SOL 108, 536K DOF, 96 frequency steps.
Run statistics (per MPI process): memory 340 MB, FFIO cache 128 MB, disk space 3.6 GB, 2 processes per node.

CPUs | Elapsed parallel time (h) | Speedup
1    | 31.7                      | 1.0
8    | 4.1                       | 7.8
16   | 2.2                       | 14.2
32   | 1.4                       | 22.6
24
Future Automotive NVH Modeling
Higher excitation frequencies of interest will increase DOF and modal density beyond the practical limits of SOL 103/111.
[Chart: elapsed time vs. frequency for direct frequency response (SOL 108) and modal frequency response (SOL 103/111), comparing 199X and 200X models.]
25
Topics of Discussion Historical Trends of CAE Current Status of Scalable CAE Future Directions in Applications
26
Economics of HPC Rapidly Changing
SGI partnership with the HPC community on the technology roadmap.
[Roadmap chart: capability features and general availability, with functionality migrating from UNICOS/vector to IRIX/MIPS SSI and on to Linux/IA-64 clusters and SSI.]
27
HPC Architecture Roadmap at SGI
SN-MIPS: features of next-generation ccNUMA and next-generation IRIX improvements:
Bandwidth improvement of 2x over Origin2000
Latency decrease of 50% over Origin2000
Shared memory to 512 processors and beyond
System support for IRIX/MIPS or Linux/IA-64
Modular design allows subsystem upgrades without a forklift
RAS enhancements: resiliency and hot swap
Data center management: scheduling, accounting
HPC clustering: GSN, CXFS shared file system
28
Characterization of CAE Applications (SN-MIPS benefit)
[Same chart as shown earlier, with regions marked to indicate where SN-MIPS is expected to benefit.]
29
Characterization of CAE Applications (SN-MIPS and SN-IA benefit)
[Same chart as shown earlier, with regions marked to indicate where SN-MIPS and SN-IA are each expected to benefit.]
30
Architecture Mix for Automotive HPC (current as of Sep 1999)
1997: 1.1 TFlops installed in automotive OEMs worldwide
1999: 2.9 TFlops installed in automotive OEMs worldwide
31
Automotive Industry HPC Investments
GM and DaimlerChrysler each grew capacity by more than 2x over the past year.
32
Future Directions in CAE Applications: Meta-Computing with Explicit FEA
Non-deterministic methods for improved FEA simulation:
Los Alamos and DOE Applied Engineering Analysis: "Stochastic simulation of 18 CPU years completed in 3 days on ASCI Blue Mtn." US DOE-supported research achieved the first-ever full-scale ABAQUS/Explicit simulation of nuclear weapons impact response on the Origin/6144 ASCI system (Feb 00).
Ford Motor SRL and NASA Langley: optimization of a vehicle body for NVH and crash completed 9 CPU-months of RADIOSS and MSC.Nastran overnight with a response-surface technique (Apr 00).
BMW Body Engineering: 672 MIPS CPUs dedicated to stochastic crash simulation with PAM-CRASH (Jan 00).
33
Meta-Computing with Explicit FEA
Objective: manage design uncertainty arising from variability (scatter in materials, loading, and test conditions).
Approach: non-deterministic simulation of a vehicle "population" through meta-computing on an SSI system or large cluster (a sketch of the pattern follows).
Insight: improved design-space exploration, moving the design toward target parameters.
[Chart: distribution of responses from unlikely to most likely performance.]
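A minimal sketch of the meta-computing pattern described above: sample scattered inputs (material, loading) and farm the independent deterministic runs out across MPI ranks, then collect the population of responses. run_crash_simulation and the parameter ranges are hypothetical placeholders, not from any of the cited projects.

```c
/* Meta-computing sketch: each MPI rank runs independent crash
 * simulations with randomly scattered inputs; the collected responses
 * approximate the performance distribution. The solver call and the
 * parameter ranges are hypothetical placeholders.                      */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Placeholder for one deterministic crash run of the vehicle model */
double run_crash_simulation(double yield_stress, double impact_speed)
{
    return 0.001 * yield_stress / 1.0e6 + impact_speed;   /* dummy response */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int samples_per_rank = 4;
    double local[4];
    srand(12345u + (unsigned)rank);           /* independent sample stream */

    for (int s = 0; s < samples_per_rank; s++) {
        /* Scatter in material and loading: +/-10% around nominal values */
        double yield = 250.0e6 * (0.9 + 0.2 * rand() / (double)RAND_MAX);
        double speed = 15.6    * (0.9 + 0.2 * rand() / (double)RAND_MAX);
        local[s] = run_crash_simulation(yield, speed);
    }

    /* Gather the population of responses on rank 0 for statistics */
    double *all = (rank == 0)
                ? malloc((size_t)samples_per_rank * nranks * sizeof *all)
                : NULL;
    MPI_Gather(local, samples_per_rank, MPI_DOUBLE,
               all,   samples_per_rank, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("collected %d stochastic responses\n",
               samples_per_rank * nranks);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```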
34
Grand Scale HPC: NASA and Ford
NVH and crash optimization of a vehicle body overnight (Ford Motor Scientific Research Labs and NASA Langley Research Center):
Ford body-in-prime (BIP) model of 390K DOF
MSC.Nastran for NVH, 30 design variables
RADIOSS for crash, 20 design variables
10 design variables in common
Sensitivity-based Taylor approximation for NVH
Polynomial response surface for crash
Achieved overnight BIP optimization on an SGI 2800/256, equivalent to a yield of 9 months of CPU time.
35
Historical Growth of CAE Application (source: survey of major automotive developers)
[Chart: growth index from 1993 to 1999 for crash model size (to 450,000 elements), NVH model size (to 2 million DOF), CFD model size (to more than 10 million cells), installed capacity (top site at 564 GFlops), number of engineers, cost per CPU-hour, and turnaround time for crash (SMP) and crash/CFD (MPP), with growth factors in the range of roughly 5x to 90x.]
36
Future Directions of Scalable CAE
CAE to evolve into fully scalable, RISC-based technology
High-resolution models: CFD today; crash and FEA emerging
Deterministic CAE giving way to probabilistic techniques; deployment increases computational requirements 10-fold
Visual interaction with models beyond 3M cells/DOF; high-resolution modeling will strain visualization technology
Multi-discipline optimization (MDO) implementation in earnest: coupling of structures, fluids, acoustics, and electromagnetics
37
Conclusions
For small and medium-sized problems, a cluster can be a viable solution in the range of 8 to 16 CPUs.
For large and extremely large problems, the SSI architecture provides better parallel performance, owing to the superior characteristics of its in-box interconnect.
To increase single-CPU performance, developers should consider how the data structures and algorithms they exploit map onto the specific memory hierarchy.
A ccNUMA system allows various parallel programming paradigms to be coupled, which can benefit the performance of multiphysics applications.