Parallel CC & Petaflop Applications Ryan Olson Cray, Inc.


Did you know …
- Teraflop: current
- Petaflop: imminent
- What's next? Exaflop, zettaflop, yottaflop!

Outline
Sanibel Symposium:
- Programming Models
- Parallel CC Implementations
- Benchmarks
- Petascale Applications
This Talk:
- Distributed Data Interface
- GAMESS MP-CCSD(T)
- O vs. V
- Local & Many-Body Methods

Programming Models: The Distributed Data Interface (DDI)
- A programming interface, not a programming model.
- Choose the key functionality from the best programming models and provide:
  - a common interface
  - simple and portable
  - a general implementation
- Provide an interface to:
  - SPMD: TCGMSG, MPI
  - AMOs: SHMEM, GA
  - SMPs: OpenMP, pthreads
  - SIMD: GPUs, vector directives, SSE, etc.
- Use the best models for the underlying hardware.

Overview (software stack)
- Application level: GAMESS
- High-level API: Distributed Data Interface (DDI)
- Implementation layer (native and non-native): SHMEM / GPSHMEM, MPI-2, MPI-1 + GA, MPI-1, TCP/IP, System V IPC
- Hardware API: Elan, GM, etc.

Programming Models: The Distributed Data Interface
- Overview:
  - Virtual shared-memory model (native)
  - Cluster implementation (non-native)
  - Shared-memory/SMP awareness
  - Clusters of SMPs (DDI versions 2-3)
- Goal: multilevel parallelism
  - Intra-/inter-node parallelism
  - Maximize data locality
  - Minimize latency / maximize bandwidth

The Early Days of Parallelism (where we've been … where we are going …)
- Competing models: TCGMSG, MPI, SHMEM, Global Arrays, etc.
- Scalar vs. vector machines
- Distributed vs. shared memory
- Big winners (SPMD): MPI and SHMEM, two very different yet compelling models
- DDI/GAMESS: use the best models to match the underlying hardware

Virtual Shared Memory Model
[Figure: a distributed matrix created with DDI_Create(Handle, NRows, NCols) is split column-wise across CPU0-CPU3 (NRows x NCols overall); any process can address a subpatch of the global matrix out of the distributed memory storage.]
Key point: the physical memory available to each CPU is divided into two parts: replicated storage and distributed storage.
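A minimal usage sketch in the deck's Fortran style, assuming DDI-style calls: DDI_CREATE with the argument order shown on this slide, and a patch-style (ilo, ihi, jlo, jhi) convention for DDI_GET/DDI_ACC. The exact DDI signatures may differ; this only illustrates the one-sided access pattern.

      SUBROUTINE DEMO_DISTRIB(NROWS, NCOLS, JCOL, COL)
      IMPLICIT NONE
      INTEGER NROWS, NCOLS, JCOL, IHANDLE
      DOUBLE PRECISION COL(NROWS)
C     create an NROWS x NCOLS matrix spread column-wise over the
C     aggregate distributed storage of all processes
      CALL DDI_CREATE(IHANDLE, NROWS, NCOLS)
C     one-sided read of one column (a "subpatch"), wherever it lives
      CALL DDI_GET(IHANDLE, 1, NROWS, JCOL, JCOL, COL)
C     one-sided accumulate (+=) of a local contribution into that column
      CALL DDI_ACC(IHANDLE, 1, NROWS, JCOL, JCOL, COL)
      CALL DDI_DESTROY(IHANDLE)
      RETURN
      END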

Non-Native Implementations (and lost opportunities …)
[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3) each run compute processes plus data servers; the distributed memory storage lives on the separate data servers, which service GET, PUT, and ACC (+=) requests.]
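To illustrate why the data servers exist: with no true one-sided operations in MPI-1, some process must sit in a receive loop to service remote GET/PUT/ACC requests. The sketch below is hypothetical, written with plain MPI-1 point-to-point calls; the request layout, tags, and column-granular storage are invented for illustration and are not DDI's actual protocol.

      SUBROUTINE DATA_SERVER(DD, BUF, NROWS, MYCOLS)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER NROWS, MYCOLS
      DOUBLE PRECISION DD(NROWS,MYCOLS), BUF(NROWS)
      INTEGER REQ(5), STATUS(MPI_STATUS_SIZE), IERR, ITYPE, JCOL, J
      INTEGER IGET, IPUT, IACC, IQUIT
      PARAMETER (IGET=1, IPUT=2, IACC=3, IQUIT=4)
   10 CONTINUE
C     wait for a request, REQ = (type, column, ...), from any compute process
      CALL MPI_RECV(REQ, 5, MPI_INTEGER, MPI_ANY_SOURCE, 0,
     &              MPI_COMM_WORLD, STATUS, IERR)
      ITYPE = REQ(1)
      JCOL  = REQ(2)
      IF (ITYPE .EQ. IGET) THEN
C        send the requested (server-local) column back to the caller
         CALL MPI_SEND(DD(1,JCOL), NROWS, MPI_DOUBLE_PRECISION,
     &                 STATUS(MPI_SOURCE), 1, MPI_COMM_WORLD, IERR)
      ELSE IF (ITYPE .EQ. IPUT) THEN
C        overwrite the column with data from the caller
         CALL MPI_RECV(DD(1,JCOL), NROWS, MPI_DOUBLE_PRECISION,
     &                 STATUS(MPI_SOURCE), 1, MPI_COMM_WORLD,
     &                 STATUS, IERR)
      ELSE IF (ITYPE .EQ. IACC) THEN
C        accumulate: receive into a buffer, then DD(:,JCOL) = DD(:,JCOL) + BUF
         CALL MPI_RECV(BUF, NROWS, MPI_DOUBLE_PRECISION,
     &                 STATUS(MPI_SOURCE), 1, MPI_COMM_WORLD,
     &                 STATUS, IERR)
         DO J = 1, NROWS
            DD(J,JCOL) = DD(J,JCOL) + BUF(J)
         END DO
      ELSE IF (ITYPE .EQ. IQUIT) THEN
         RETURN
      END IF
      GO TO 10
      END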

DDI till 2003 …

System V Shared Memory (Fast Model)
[Figure: Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3), each with compute processes and data servers; the distributed memory storage now lives in System V shared memory segments, on which GET, PUT, and ACC (+=) operate.]

DDI v2 - Full SMP Awareness
[Figure: distributed memory storage on separate System V shared memory segments; compute processes and data servers on Node 0 (CPU0 + CPU1) and Node 1 (CPU2 + CPU3) service GET, PUT, and ACC (+=).]

Proof of Principle
[Table: wall times for a UMP2 gradient calculation on a dual AMD MP2200 cluster using an SCI network (2003 results); columns compare DDI v1, DDI-Fast, and DDI v2.]
Note: DDI v1 was especially problematic on the SCI network.

DDI v2
- The DDI library is SMP aware and offers new interfaces to make applications SMP aware.
- DDI programs inherit improvements in the library.
- DDI programs do not automatically become SMP aware unless they use the new interfaces.

Parallel CC and Threads (Shared Memory Parallelism)
- Bentz and Kendall: parallel BLAS3 (WOMPAT '05)
- OpenMP: parallelized the remaining terms
- Proof of principle (a generic sketch of the pattern follows)
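A generic sketch of the shared-memory pattern, not the Bentz/Kendall code: the MO-basis terms are DGEMM-dominated, so one can link a threaded BLAS or, as here, give each OpenMP thread a disjoint block of outer columns and its own DGEMM. Array names and shapes are illustrative.

      SUBROUTINE CC_TERM_OMP(NO, NU, NBLK, A, T2, W)
      IMPLICIT NONE
      INTEGER NO, NU, NBLK, JBLK, JLO, JHI, NCOL, NCPB
      DOUBLE PRECISION A(NU*NU,NU*NU), T2(NU*NU,NO*NO), W(NU*NU,NO*NO)
      DOUBLE PRECISION ONE
      PARAMETER (ONE=1.0D+00)
C     split the NO*NO outer columns into NBLK blocks, one DGEMM per block
      NCPB = (NO*NO + NBLK - 1)/NBLK
C$OMP PARALLEL DO PRIVATE(JBLK,JLO,JHI,NCOL)
      DO JBLK = 1, NBLK
         JLO  = (JBLK-1)*NCPB + 1
         JHI  = MIN(NO*NO, JLO+NCPB-1)
         NCOL = JHI - JLO + 1
         IF (NCOL .GT. 0) THEN
C           W(:,JLO:JHI) = W(:,JLO:JHI) + A * T2(:,JLO:JHI);
C           the blocks are disjoint, so threads never write the same column
            CALL DGEMM('N','N',NU*NU,NCOL,NU*NU,ONE,A,NU*NU,
     &                 T2(1,JLO),NU*NU,ONE,W(1,JLO),NU*NU)
         END IF
      END DO
C$OMP END PARALLEL DO
      RETURN
      END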

Results
- Au4 ==> GOOD: CCSD cost ≈ (T) cost; no disk I/O problems; both CCSD and (T) scale well.
- Au+(C3H6) ==> POOR/AVERAGE: CCSD scales poorly due to the I/O vs. FLOP balance; (T) scales well but is overshadowed by the bad CCSD performance.
- Au8 ==> GOOD: CCSD scales reasonably (greater FLOP count, about equal I/O); the N^7 (T) step dominates the relatively small CCSD time, and (T) scales well, so the overall performance is good.

Detailed Speedups …

DDI v3: Shared Memory for ALL
[Figure: compute processes and data servers; the data servers' segments aggregate into the distributed storage.]
Memory hierarchy (typical sizes):
- Replicated storage: ~500 MB - 1 GB
- Shared memory: ~1 GB - 12 GB
- Distributed memory: ~10 - 1000 GB

DDI v3
- Memory hierarchy: replicated, shared, and distributed
- Program models:
  - Traditional DDI
  - Multilevel model
  - DDI groups (a different talk)
- Multilevel models:
  - Intra-/inter-node parallelism
  - A superset of the MPI/OpenMP and/or MPI/pthreads models
  - MPI lacks "true" one-sided messaging
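The multilevel idea in its generic form (a plain MPI/OpenMP hybrid sketch, not DDI v3 itself): MPI ranks handle inter-node work while OpenMP threads exploit the shared memory within a node.

      PROGRAM HYBRID
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER IERR, IPROVIDED, MYNODE, NNODES, ITHREAD
      INTEGER OMP_GET_THREAD_NUM
C     request thread support so OpenMP regions can coexist with MPI
      CALL MPI_INIT_THREAD(MPI_THREAD_FUNNELED, IPROVIDED, IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYNODE, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NNODES, IERR)
C$OMP PARALLEL PRIVATE(ITHREAD)
      ITHREAD = OMP_GET_THREAD_NUM()
C     intra-node work goes here: threads share the node's memory, so data
C     is reused without messages; inter-node data moves only through MPI
C$OMP END PARALLEL
      CALL MPI_FINALIZE(IERR)
      END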

Parallel Coupled Cluster (Topics)
- Data distribution for CCSD(T):
  - Integrals distributed
  - Amplitudes in shared memory, once per node
- Direct [vv|vv] term
- Parallelism based on data locality
- First-generation algorithm:
  - Ignore I/O
  - Focus on data and FLOP parallelism

Important Array Sizes (in GB)
[Table: sizes of the [vv|oo], [vo|vo], T2, and [vv|vo] arrays, in GB, for representative occupied (o) and virtual (v) orbital counts.]
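The scaling behind the table: [vv|oo], [vo|vo], and T2 each hold o^2 v^2 elements, while [vv|vo] holds o v^3. For illustrative values o = 50 and v = 500 (not figures from the talk), with 8-byte elements:

$$ o^2 v^2 = 50^2 \times 500^2 = 6.25\times 10^{8}\ \text{elements} \approx 5\ \text{GB} $$
$$ o\,v^3 = 50 \times 500^3 = 6.25\times 10^{9}\ \text{elements} \approx 50\ \text{GB} $$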

MO Based Terms

Some code …

      DO 123 I=1,NU
         IOFF=NO2U*(I-1)+1
         CALL RDVPP(I,NO,NU,TI)
         CALL DGEMM('N','N',NO2,NU,NU2,ONE,O2,NO2,TI,NU2,ONE,
     &              T2(IOFF),NO2)
  123 CONTINUE
      CALL TRMD(O2,TI,NU,NO,20)
      CALL TRMD(VR,TI,NU,NO,21)
      CALL VECMUL(O2,NO2U2,HALF)
      CALL ADT12(1,NO,NU,O1,O2,4)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,VR,NOU,O2,NOU,ONE,VL,NOU)
      CALL ADT12(2,NO,NU,O1,O2,4)
      CALL VECMUL(O2,NO2U2,TWO)
      CALL TRMD(O2,TI,NU,NO,27)
      CALL TRMD(T2,TI,NU,NO,28)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)
      CALL TRANMD(O2,NO,NU,NU,NO,23)
      CALL TRANMD(T2,NO,NU,NU,NO,23)
      CALL DGEMM('N','N',NOU,NOU,NOU,ONEM,O2,NOU,VL,NOU,ONE,T2,NOU)

MO Parallelization
[Figure: four processes; each computes its share of the [vo*|vo*], [vv|o*o*], and [vv|v*o*] terms and writes into its own block of the T2 solution matrix.]
Goal: disjoint updates to the solution matrix; avoid locking/critical sections whenever possible.
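A minimal sketch of the disjoint-update idea (the block arithmetic is illustrative, not the GAMESS code): each process owns a contiguous block of solution columns and is that block's only writer, so no locks or critical sections are needed.

      SUBROUTINE MY_COLUMN_RANGE(NCOLS, NPROC, ME, JLO, JHI)
      IMPLICIT NONE
      INTEGER NCOLS, NPROC, ME, JLO, JHI, NPER
C     even block distribution of NCOLS solution columns over NPROC ranks;
C     ranks beyond the last block get an empty range (JHI < JLO)
      NPER = (NCOLS + NPROC - 1) / NPROC
      JLO  = ME*NPER + 1
      JHI  = MIN(NCOLS, JLO + NPER - 1)
      RETURN
      END

Each process then accumulates only into its own columns JLO:JHI of the solution, and a single synchronization at the end of the term replaces fine-grained locking.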

Direct [VV|VV] Term
[Figure: the half-transformed results are PUT into a distributed array whose columns are atomic-orbital index pairs (N_bf^2 of them), spread over processes 0 … P-1; the back-transformed results are indexed by occupied pairs (N_o * N_o).]

  do (shell index) = 1, nshell
     do (shell index) = 1, nshell
        compute: …
        transform: …
     end do
     transform: …
     contract: …
     PUT … (for i >= j)
  end do
  synchronize
  for each "local" ij column do
     GET …
     reorder: shell --> AO order
     transform: …
     STORE in "local" solution vector
  end do

(T) Parallelism
- Trivial, in theory
- [vv|vo] distributed
- v^3 work arrays; at large v, stored in shared memory
- Disjoint updates where both quantities are shared
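A structural sketch of that decomposition; the ownership test, loop structure, and placeholder energy are illustrative, and the real code uses dynamic load balancing and the distributed [vv|vo] integrals rather than a static round-robin.

      SUBROUTINE TRIPLES_SKETCH(NO, ME, NPROC, ET)
      IMPLICIT NONE
      INTEGER NO, ME, NPROC, I, J, K, ITASK
      DOUBLE PRECISION ET, ETASK
      ET    = 0.0D+00
      ITASK = 0
      DO I = 1, NO
      DO J = 1, I
      DO K = 1, NO
C        each (i>=j, k) occupied triple is an independent task handled by
C        exactly one process
         IF (MOD(ITASK, NPROC) .EQ. ME) THEN
C           ... build the v**3 intermediates (node-shared at large v) and
C           compute this triple's energy contribution ...
            ETASK = 0.0D+00
            ET    = ET + ETASK
         END IF
         ITASK = ITASK + 1
      END DO
      END DO
      END DO
C     a global sum of ET over processes completes the (T) energy
      RETURN
      END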

Timings … (H2O)6 prism, aug'-cc-pVTZ
Fastest timing: < 6 hours on 8x8 Power5.

Improvements …
- Semi-direct [vv|vv] term (IKCUT)
- Concurrent MO terms
- Generalized amplitude storage

Semi-Direct [VV|VV] Term
Same I-shell / K-shell loop structure as the direct term, with the half-transformed integrals optionally cached on disk:

      if (iter .eq. 1) then
         - open the half-transformed integral file
      else
         - process the half-transformed integral file
      end if
      do 10 ish = 1, nshells
      do 10 ksh = 1, ish
c        skip the shell pair if it was saved and processed above
         if (iter.gt.1 .and. len(ish)+len(ksh).gt.IKCUT) goto 10
         - dynamically load balance the work based on ISH/KSH
         - calculate the half-transformed integrals
c        save the shell pair if it meets the IKCUT criterion
         if (iter.eq.1 .and. len(ish)+len(ksh).gt.IKCUT) then
            - save the half-transformed integrals to disk
         end if
   10 continue

Semi-Direct [VV|VV] Term
- Define IKCUT.
- Store a shell pair if LEN(I) + LEN(K) > IKCUT.
- Automatic contention avoidance.
- Adjustable from fully direct to fully conventional.

Semi-Direct [vv|vv] Timings
[Figure: water tetramer / aug'-cc-pVTZ timings. Storage: a shared NFS mount (a bad example); local disk or a higher-quality parallel file system (Lustre, etc.) should perform better.]
However: GPUs generate AO integrals much faster than they can be read off the disk.

Concurrency
- Everything N-ways parallel? NO.
- Biggest mistake: parallelizing every MO term over all cores.
- The fix: concurrency.

Concurrent MO Terms
[Figure: the nodes are split between the MO terms and the [vv|vv] term.]
- MO terms: parallelized over the minimum number of nodes that is still efficient and fast.
- The MO nodes then join the [vv|vv] term already in progress … dynamic load balancing.
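One way to express the split, sketched here with MPI_COMM_SPLIT as a stand-in for the DDI groups mentioned earlier; NMONODES, the size of the MO group, is an assumed tuning parameter.

      SUBROUTINE SPLIT_WORK(NMONODES)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER NMONODES, ME, NPROCS, ICOLOR, NEWCOMM, IERR
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, ME, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROCS, IERR)
C     color 0: MO-term group (first NMONODES ranks); color 1: [vv|vv] group
      ICOLOR = 1
      IF (ME .LT. NMONODES) ICOLOR = 0
      CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, ICOLOR, ME, NEWCOMM, IERR)
C     ... the MO group computes the MO terms on NEWCOMM, then joins the
C     [vv|vv] group's dynamically load-balanced work; both groups
C     synchronize before the next CC iteration ...
      CALL MPI_COMM_FREE(NEWCOMM, IERR)
      RETURN
      END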

Adaptive Computing
- Self-adjusting / self-tuning:
  - Concurrent MO terms
  - Value of IKCUT
- Use the iterations to improve the calculation:
  - Adjust the initial node assignments
  - Increase IKCUT
- Monte Carlo approach to tuning parameters.

Conclusions …
- A good first start …
  - [vv|vv] scales perfectly with node count
  - Multilevel parallelism
  - Adjustable I/O usage
- A lot to do …
  - Improve intra-node memory bottlenecks
  - Concurrent MO terms
  - Generalized amplitude storage
  - Adaptive computing
- Use the knowledge from these hand-coded methods to refine the CS structure in automated methods.

Acknowledgements
People: Mark Gordon, Mike Schmidt, Jonathan Bentz, Ricky Kendall, Alistair Rendell
Funding: DoE SciDAC, SCL (Ames Lab), APAC / ANU, NSF, MSI

Petaflop Applications (benchmarks, too)
- A petaflop is roughly 125,000 AMD Opteron cores.
- O vs. V:
  - Small O, big V ==> CBS limit
  - Big O ==> see below
- Local and many-body methods:
  - FMO, EE-MB, etc.: use existing parallel methods
  - Sampling
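For scale, the core count follows from assuming roughly 8 GFlop/s of peak per Opteron core (for example, 2 GHz at 4 floating-point operations per cycle; the per-core rate is an assumption, not a figure from the talk):

$$ N_{\mathrm{cores}} \approx \frac{10^{15}\ \mathrm{flop/s}}{8\times 10^{9}\ \mathrm{flop/s\ per\ core}} \approx 125{,}000 $$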