Parallel CC & Petaflop Applications Ryan Olson Cray, Inc.
Did you know …
• Teraflop - Current
• Petaflop - Imminent
• What's next?
  • Exaflop
  • Zettaflop
  • YOTTAflop!
Outline
Sanibel Symposium
• Programming Models
• Parallel CC Implementations
• Benchmarks
• Petascale Applications
This Talk
• Distributed Data Interface
• GAMESS MP-CCSD(T)
• O vs. V
• Local & Many-Body Methods
Programming Models: The Distributed Data Interface (DDI)
• Programming Interface, not Programming Model
• Choose the key functionality from the best programming models and provide:
  • Common Interface
  • Simple and Portable
  • General Implementation
• Provide an interface to:
  • SPMD: TCGMSG, MPI
  • AMOs: SHMEM, GA
  • SMPs: OpenMP, pThreads
  • SIMD: GPUs, Vector directives, SSE, etc.
• Use the best models for the underlying hardware.
Overview
[Figure: software layers, top to bottom]
Application Level: GAMESS
High-Level API: Distributed Data Interface (DDI)
Implementation: SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP, System V IPC (divided on the slide into Native Implementations and Non-Native Implementations)
Hardware API: Elan, GM, etc.
Programming Models: The Distributed Data Interface
• Overview
• Virtual Shared-Memory Model (Native)
• Cluster Implementation (Non-Native)
• Shared Memory/SMP Awareness
• Clusters of SMP (DDI versions 2-3)
• Goal: Multilevel Parallelism
• Intra/Inter-node Parallelism
• Maximize Data Locality
• Minimize Latency / Maximize Bandwidth
The Early Days of Parallelism (where we've been … where we are going …)
• Competing Models
  • TCGMSG, MPI, SHMEM, Global Arrays, etc.
  • Scalar vs. Vector Machines
  • Distributed vs. Shared Memory
• Big Winners (SPMD): MPI and SHMEM
  • Two very different, yet compelling models.
• DDI/GAMESS - use the best models to match the underlying hardware.
Virtual Shared Memory Model
[Figure: a distributed matrix of NRows x NCols, created with DDI_Create(Handle,NRows,NCols), is divided among the distributed memory storage of CPU0, CPU1, CPU2 and CPU3; a subpatch of the matrix is highlighted.]
Key Point: The physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
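To make the distributed-matrix picture concrete, here is a minimal sketch of one way the columns of an NRows x NCols array could be split among processes. It assumes contiguous, nearly even column blocks; `column_ranges` and `owner_of` are hypothetical helper names for illustration and are not part of the DDI API, and the real library's layout may differ.

```python
def column_ranges(ncols, nproc):
    """Split ncols columns into nproc contiguous blocks, as evenly as possible."""
    base, extra = divmod(ncols, nproc)
    ranges = []
    start = 0
    for rank in range(nproc):
        # The first `extra` ranks take one additional column each.
        width = base + (1 if rank < extra else 0)
        ranges.append((start, start + width))
        start += width
    return ranges


def owner_of(col, ranges):
    """Return the rank whose half-open column range [lo, hi) contains col."""
    for rank, (lo, hi) in enumerate(ranges):
        if lo <= col < hi:
            return rank
    raise IndexError(col)


ranges = column_ranges(ncols=10, nproc=4)
print(ranges)
print(owner_of(7, ranges))
```

A request for a subpatch would then be routed to every rank whose column range overlaps the requested columns.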
Non-Native Implementations (and lost opportunities …)
[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3). Compute processes 0-3 issue GET, PUT and ACC (+=) operations to data servers 4-7. Distributed memory storage resides on the separate data servers.]
SystemV Shared Memory (Fast Model)
[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3), with compute processes 0-3, data servers 4-7 and shared memory segments. GET, PUT and ACC (+=) operations. Distributed memory storage resides in SysV shared memory segments.]
DDI v2 - Full SMP Awareness
[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3), with compute processes 0-3, data servers 4-7 and shared memory segments. GET, PUT and ACC (+=) operations. Distributed memory storage resides on separate System V shared memory segments.]
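The three one-sided operations named on these slides (GET, PUT, ACC) can be illustrated with a toy, single-process stand-in. This is only a sketch of the semantics: `ToyDistributedArray` is a hypothetical class, each "data server" is just a Python list, and no real communication or DDI call is involved.

```python
class ToyDistributedArray:
    """Hypothetical stand-in: one list segment per 'data server'."""

    def __init__(self, length, nservers):
        # Block size via ceiling division so every index maps to a server.
        self.block = -(-length // nservers)
        self.segments = [
            [0.0] * min(self.block, max(0, length - s * self.block))
            for s in range(nservers)
        ]

    def locate(self, index):
        """Map a global index to (server, offset within that server)."""
        return index // self.block, index % self.block

    def put(self, index, value):
        server, offset = self.locate(index)
        self.segments[server][offset] = value

    def get(self, index):
        server, offset = self.locate(index)
        return self.segments[server][offset]

    def acc(self, index, value):
        # ACC is an accumulate: the remote value is updated with +=.
        server, offset = self.locate(index)
        self.segments[server][offset] += value


array = ToyDistributedArray(length=8, nservers=4)
array.put(5, 1.5)
array.acc(5, 2.0)
print(array.locate(5), array.get(5))
```

In the non-native model each of these calls would be a message to a separate data-server process; in the shared-memory variants, calls that land on the local node can touch the segment directly.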
Proof of Principle - 2003
UMP2 Gradient Calculation - 380 BFs
Dual AMD MP2200 Cluster using SCI network (2003 Results)

              8       16      32      64      96
DDI v2        18283   12978   8024    5034    3718
DDI–Fast      27400   19534   14809   11424   9010
DDI v1        Limit   109839  95627   85972   N/A

Note: DDI v1 was especially problematic on the SCI network.
DDI v2
• The DDI Library is SMP aware.
• It offers new interfaces to make applications SMP aware.
• DDI programs inherit improvements in the library.
• DDI programs do not automatically become SMP aware unless they use the new interfaces.
Parallel CC and Threads (Shared Memory Parallelism)
• Bentz and Kendall
• Parallel BLAS3
• WOMPAT '05
• OpenMP
• Parallelized Remaining Terms
• Proof of Principle
Results
Au4 ==> GOOD
• CCSD = (T)
• No Disk I/O problems
• Both CCSD and (T) scale well
Au+(C3H6) ==> POOR/AVERAGE
• CCSD scales poorly due to I/O vs. FLOP balance
• (T) scales well, but is overshadowed by bad CCSD performance
Au8 ==> GOOD
• CCSD scales reasonably (greater FLOP count, about equal I/O).
• The N^7 (T) step dominates over the relatively small time for CCSD.
• (T) scales well, so the overall performance is good.
DDI v3
• Memory Hierarchy
  • Replicated, Shared and Distributed
• Program Models
  • Traditional DDI
  • Multilevel Model
  • DDI Groups (a different talk)
• Multilevel Models
  • Intra/Internode Parallelism
  • Superset of MPI/OpenMP and/or MPI/pThreads models
  • MPI lacks "true" one-sided messaging
Parallel Coupled Cluster (Topics)
• Data Distribution for CCSD(T)
  • Integrals Distributed
  • Amplitudes in Shared Memory once per node
  • Direct [vv|vv] term
• Parallelism based on Data Locality
• First Generation Algorithm
  • Ignore I/O
  • Focus on Data and FLOP parallelism
Important Array Sizes (in GB)
[Figure: sizes of the [vv|oo], [vo|vo], T2 and [vv|vo] arrays as a function of o and v.]
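As a rough guide to the magnitudes involved, the sketch below estimates the four array sizes from o (occupied orbitals) and v (virtual orbitals). It assumes 8-byte words, no permutational symmetry or packing, and 1 GB = 10^9 bytes; the function name `array_sizes_gb` and the sample values of o and v are illustrative only, not figures from the talk.

```python
def array_sizes_gb(o, v, bytes_per_word=8):
    """Estimate unpacked array sizes in GB (1 GB = 1e9 bytes)."""
    words = {
        "[vv|oo]": v * v * o * o,
        "[vo|vo]": v * o * v * o,
        "T2": o * o * v * v,
        "[vv|vo]": v * v * v * o,
    }
    return {name: count * bytes_per_word / 1e9 for name, count in words.items()}


# Illustrative values only: 50 occupied and 500 virtual orbitals.
for name, size in array_sizes_gb(o=50, v=500).items():
    print(f"{name:8s} {size:8.1f} GB")
```

Because [vv|vo] grows as v^3 o while the other three grow as v^2 o^2, it is larger by a factor of v/o, which is why it is the array that most needs to be distributed when v is much larger than o.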
MO Parallelization
[Figure: processes 0, 1, 2 and 3. Each node holds its share of [vo*|vo*], [vv|o*o*] and [vv|v*o*], plus T2 and the solution (Soln) array.]
Goal: Disjoint updates to the solution matrix. Avoid locking/critical sections whenever possible.
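The "disjoint updates" goal can be sketched with threads that each own a separate slice of a shared solution vector, so no lock is needed. This is a generic illustration under that assumption, not the GAMESS implementation; `update_block` and `disjoint_update` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def update_block(solution, lo, hi, contribution):
    """Add contributions into solution[lo:hi]; no other worker touches this range."""
    for i in range(lo, hi):
        solution[i] += contribution[i]


def disjoint_update(solution, contribution, nworkers):
    """Partition the index range into non-overlapping blocks, one per worker."""
    n = len(solution)
    bounds = [(w * n // nworkers, (w + 1) * n // nworkers) for w in range(nworkers)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        futures = [
            pool.submit(update_block, solution, lo, hi, contribution)
            for lo, hi in bounds
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return solution


solution = [0.0] * 10
contribution = [float(i) for i in range(10)]
print(disjoint_update(solution, contribution, nworkers=4))
```

Since the blocks never overlap, correctness does not depend on the order in which workers run, which is the property that lets the update avoid critical sections.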
Direct [VV|VV] Term
[Figure: processes 0, 1, …, P-1 PUT into a distributed array indexed by atomic orbital index pairs (11, 12, 13, …, N_bf^2) and by occupied index pairs (11, 21, 22, …, (N_o N_o)*).]
do = 1,nshell
do = 1,nshell
  do = 1,nshell
  do = 1,
    compute:
    transform:
  end do
  transform:
  contract:
  PUT and for i j
end do
synchronize
for each "local" ij column do
  GET
  reorder: shell --> AO order
  transform:
  STORE in "local" solution vector
  GET
end do
(T) Parallelism
• Trivial -- in theory
• [vv|vo] distributed
• v^3 work arrays
• at large v -- stored in shared memory
• disjoint updates where both quantities are shared
Improvements …
• Semi-Direct [vv|vv] term (IKCUT)
• Concurrent MO terms
• Generalized amplitudes storage
Semi-Direct [VV|VV] Term
[Direct algorithm box repeated from the Direct [VV|VV] Term slide, with its outer shell loops labeled ! I-SHELL and ! K-SHELL.]
if(iter.eq.1) then
  - open half transformed integral file
else
  - process half transformed integral file
end if
do 10 ish = 1,NSHELLS
do 10 ksh = 1,ish
c skip shell pair if it was saved and processed above
  if(iter.gt.1.and.len(ish)+len(ksh).gt.IKCUT) goto 10
  - dynamically load balance work based on ISH/KSH
  - calc half transformed integrals
c save shell pair if it meets the IKCUT criteria
  if(iter.eq.1.and.len(ish)+len(ksh).gt.IKCUT) then
    - save half-transformed integrals to disk
  end if
10 continue
Semi-Direct [VV|VV] Term
• Define IKCUT
• Store if: LEN(I)+LEN(K) > IKCUT
• Automatic contention avoidance
• Adjustable: Fully direct to fully conventional.
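The IKCUT rule can be sketched in a few lines: on the first iteration every shell pair is computed and the "large" pairs (LEN(I)+LEN(K) > IKCUT) are saved; on later iterations the saved pairs are reused and only the remaining pairs are recomputed. This is a minimal illustration, assuming an in-memory dictionary in place of the half-transformed integral file; `semi_direct_pass`, `compute_pair` and the shell lengths are hypothetical.

```python
def semi_direct_pass(iteration, shell_len, ikcut, saved, compute_pair):
    """Run one pass over shell pairs (ksh <= ish); return how many were computed."""
    computed = 0
    for ish in range(len(shell_len)):
        for ksh in range(ish + 1):
            is_large = shell_len[ish] + shell_len[ksh] > ikcut
            if iteration > 1 and is_large:
                # Saved on iteration 1: reuse the stored block, skip recomputation.
                _ = saved[(ish, ksh)]
                continue
            block = compute_pair(ish, ksh)
            computed += 1
            if iteration == 1 and is_large:
                saved[(ish, ksh)] = block
    return computed


def compute_pair(ish, ksh):
    """Stand-in for the half-transformed integral calculation."""
    return (ish, ksh)


shell_len = [1, 3, 6]  # illustrative shell lengths
saved = {}
print(semi_direct_pass(1, shell_len, ikcut=6, saved=saved, compute_pair=compute_pair))
print(semi_direct_pass(2, shell_len, ikcut=6, saved=saved, compute_pair=compute_pair))
```

Setting `ikcut` above the largest possible sum stores nothing (fully direct), while `ikcut=0` stores every pair (fully conventional), which is the adjustable range the slide describes.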
Semi-Direct [vv|vv] Timings
Water Tetramer / aug'-cc-pVTZ
Storage: Shared NFS mounted (bad example). Local disk or a higher quality parallel file system (Lustre, etc.) should perform better.
However: GPUs generate AOs much faster than they can be read off the disk.
Concurrency
• Everything N-ways parallel
  • NO
• Biggest mistake
  • Parallelizing every MO term over all cores.
• Fix
  • Concurrency
Concurrent MO terms
[Figure: nodes divided between the MO terms and the [vv|vv] term.]
MO Terms - Parallelized over the minimum number of nodes while still efficient & fast.
[vv|vv] - MO nodes join the [vv|vv] term already in progress … dynamic load balancing.
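The idea of MO nodes joining the [vv|vv] term once their own work is done can be sketched as a toy schedule. The model below assumes equal-cost [vv|vv] tasks handed to whichever node is free first, and a fixed time for the MO terms; `simulate` and all of its parameters are hypothetical and carry no timings from the talk.

```python
import heapq


def simulate(n_nodes, n_mo_nodes, mo_time, n_tasks, task_time):
    """Toy schedule: return (finish_time, [vv|vv] tasks done per node)."""
    # Heap entries: (time the node becomes free for [vv|vv] work, node id).
    # Nodes 0..n_mo_nodes-1 first spend mo_time on the MO terms.
    free_at = [
        (mo_time if node < n_mo_nodes else 0.0, node) for node in range(n_nodes)
    ]
    heapq.heapify(free_at)
    done = [0] * n_nodes
    finish = mo_time if n_mo_nodes > 0 else 0.0
    for _ in range(n_tasks):
        # Dynamic load balancing: the next task goes to the earliest free node.
        time_free, node = heapq.heappop(free_at)
        time_free += task_time
        done[node] += 1
        finish = max(finish, time_free)
        heapq.heappush(free_at, (time_free, node))
    return finish, done


finish, done = simulate(n_nodes=4, n_mo_nodes=1, mo_time=3.0, n_tasks=10, task_time=1.0)
print(finish, done)
```

In this toy run the single MO node picks up [vv|vv] work only after its MO terms finish, so it completes fewer [vv|vv] tasks than the other nodes but is never idle.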
Adaptive Computing
• Self Adjusting / Self Tuning
  • Concurrent MO terms
  • Value of IKCUT
• Use the iterations to improve the calculation:
  • Adjust initial node assignments
  • Increase IKCUT
• Monte Carlo approach to tuning parameters.
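One simple reading of "Monte Carlo approach to tuning parameters" is a random search that perturbs IKCUT between iterations and keeps a trial value only when the measured cost improves. The sketch below assumes that reading; `tune_ikcut`, the step sizes and the quadratic `toy_cost` function are all hypothetical stand-ins for real iteration timings.

```python
import random


def tune_ikcut(cost, start, lo, hi, steps, seed=0):
    """Random-search tuning: keep a trial IKCUT only if it lowers the cost."""
    rng = random.Random(seed)
    best, best_cost = start, cost(start)
    for _ in range(steps):
        # Propose a nearby value, clamped to the allowed range [lo, hi].
        trial = min(hi, max(lo, best + rng.choice([-2, -1, 1, 2])))
        trial_cost = cost(trial)
        if trial_cost < best_cost:
            best, best_cost = trial, trial_cost
    return best, best_cost


def toy_cost(ikcut):
    """Stand-in for a measured iteration time; not real data."""
    return (ikcut - 7) ** 2 + 10


print(tune_ikcut(toy_cost, start=2, lo=0, hi=20, steps=50))
```

In a real calculation each cost evaluation would be the wall time of one CCSD iteration, so the number of trials is bounded by the number of iterations available.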
Conclusions …
• Good First Start …
  • [vv|vv] scales perfectly with node count.
  • multilevel parallelism
  • adjustable I/O usage
• A lot to do …
  • improve intra-node memory bottlenecks
  • concurrent MO terms
  • generalized amplitude storage
  • adaptive computing
• Use the knowledge from these hand coded methods to refine the CS structure in automated methods.
Acknowledgements
People
• Mark Gordon
• Mike Schmidt
• Jonathan Bentz
• Ricky Kendall
• Alistair Rendell
Funding
• DoE SciDAC
• SCL (Ames Lab)
• APAC / ANU
• NSF
• MSI
Petaflop Applications (benchmarks, too)
• Petaflop = ~125,000 2.2 GHz AMD Opteron cores.
• O vs. V
  • small O, big V ==> CBS Limit
  • big O ==> see below
• Local and Many-Body Methods
  • FMO, EE-MB, etc. - use existing parallel methods
  • Sampling
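The core-count estimate can be checked with a one-line peak-rate calculation. The sketch assumes 4 double-precision floating-point operations per core per clock cycle, a figure that is not stated on the slide; `peak_flops` is a hypothetical helper.

```python
def peak_flops(cores, clock_hz, flops_per_cycle):
    """Theoretical peak rate: cores x clock frequency x flops per cycle."""
    return cores * clock_hz * flops_per_cycle


# flops_per_cycle=4 is an assumption, not a figure from the talk.
peak = peak_flops(cores=125_000, clock_hz=2.2e9, flops_per_cycle=4)
print(f"{peak / 1e15:.2f} PFLOP/s")
```

Under that assumption the peak comes to roughly 1.1 PFLOP/s, consistent with the "~125,000 cores" figure; sustained application performance would be lower than this theoretical peak.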