PGAS Language Update
Kathy Yelick

PGAS Languages: Why use 2 Programming Models when 1 will do?
Global address space: any thread may directly read/write remote data
Partitioned: data is designated as local or global
On distributed memory: remote reads/writes are one-sided communication; you never have to say "receive"
–As scalable as MPI (unlike cache-coherent shared memory)!
On shared memory these are just loads and stores
–Permits sharing, whereas MPI rules it out!
UPC, Titanium, and Co-Array Fortran are PGAS languages
–A version of Co-Array Fortran is now in the Fortran spec
–UPC has multiple compilers (Cray, Berkeley, gcc, HP, ...)
[Figure: the global address space across processors p0...pn; each thread has private variables (x, y) and local/global pointers (l, g), and a shared array A[0], A[1], ..., A[p-1] is spread across the threads]
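To make the model concrete, here is a minimal UPC sketch (illustrative only; the array name and sizes are invented, not taken from the slide): a shared array is distributed cyclically across threads, each thread updates the elements it owns with plain local stores, and any thread can read a remote element with an ordinary array expression that the compiler turns into a one-sided get.

    #include <upc.h>
    #include <stdio.h>

    /* Default (cyclic) layout: A[i] has affinity to thread i % THREADS. */
    shared int A[8*THREADS];

    int main(void) {
        int i;

        /* Each thread writes only the elements it owns: plain local stores. */
        upc_forall (i = 0; i < 8*THREADS; i++; &A[i])
            A[i] = MYTHREAD;

        upc_barrier;

        /* Thread 0 reads a (possibly remote) element with an ordinary array
           expression: a one-sided read, with no "receive" on the owning thread. */
        if (MYTHREAD == 0)
            printf("last element = %d\n", (int)A[8*THREADS - 1]);

        return 0;
    }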

Progress
Working on Titanium release
–Various improvements, e.g., new Java 4.0 libraries
–Focus on InfiniBand clusters, shared-memory machines, and if possible BG/P and Cray XT
UPC Land Evolution benchmark
Generalization of collectives to teams
–Previous UPC and Titanium languages are pure SPMD (no Ocean + Atmos + Land in parallel without this)
Work towards Titanium on XT and BG/P
–XT requires "registration" for remote memory access
–Executable size / lack of shared library support is a problem
Multicore using SPMD / PGAS style (not UPC)
–Autotuning of stencil operators (from the Green Flash project)

Landscape Evolution
Starting point:
–Serial code, recursive algorithm
–Series of tiles done separately; "seams" are visible in the output
–Parallel algorithm uses the global address space to remove seams
How to evolve landscapes over time?
–Erosion, rainfall, river incision
–Want to parallelize this for memory-footprint reasons

Optimizing PGAS Code (Case Study)
Original implementation was very inefficient
–Remote variables, upc_locks, non-local queues
The PPW performance tool from UFL was helpful
–UPC functions comprised > 60% of runtime
–Speedup inherently limited by topology
–Small test problem has limited theoretical parallelism (7.1); achieves 6.1 after optimizations, compared to 2.8 before
[Charts: runtime breakdown (get, wait, global heap, lock, other) before and after optimization]
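The slide does not show the offending code, but a hedged sketch of the general pattern it describes (frequent updates to a remote shared variable under a upc_lock) and the usual fix (accumulate privately, update the shared state once) might look like the following; all names and the surrounding structure are invented for illustration.

    #include <upc.h>

    shared long total;           /* shared scalar; has affinity to thread 0 */
    upc_lock_t *total_lock;      /* allocated collectively in main() */

    /* Inefficient pattern: every iteration takes a (remote) lock and does a
       remote read-modify-write of the shared counter. */
    void tally_slow(const long *work, int n) {
        for (int i = 0; i < n; i++) {
            upc_lock(total_lock);
            total += work[i];
            upc_unlock(total_lock);
        }
    }

    /* Better pattern: accumulate in a private variable and touch the shared
       counter once per thread. */
    void tally_fast(const long *work, int n) {
        long local_sum = 0;
        for (int i = 0; i < n; i++)
            local_sum += work[i];
        upc_lock(total_lock);
        total += local_sum;
        upc_unlock(total_lock);
    }

    int main(void) {
        total_lock = upc_all_lock_alloc();  /* collective: all threads get the same lock */
        /* ... each thread builds its private work array and calls tally_fast() ... */
        upc_barrier;
        if (MYTHREAD == 0)
            upc_lock_free(total_lock);
        return 0;
    }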

Progress on BG/P (Intrepid), IB (Ranger), Cray XT (Franklin et al.)
XT4 (Franklin): GASNet broadcast outperforms MPI
BG/P: 3D FFT performance
Aside: UPC, CAF, Titanium, and Chapel on all NSF and DOE machines are running on top of GASNet, including Cray's!

PGAS on Multicore
PGAS offers a single programming model for distributed and shared memory
–Partitioning for distributed memory (and multisocket NUMA nodes)
–Ability to reduce the memory footprint within a multicore or multisocket shared-memory node
New work in the UPC project
–Processes with shared-memory support
  Interoperability with MPI
  Sometimes performs better than threads
–Autotuned shared-memory collectives (reductions on Ranger shown)

What's a stencil?
Nearest-neighbor computations on structured grids (1D...ND arrays)
–Stencils from PDEs are often a weighted linear combination of neighboring values
–In some cases the weights vary in space/time
–A stencil can also result in a table lookup
–Stencils can be nonlinear operators
Caveat: we only examine implementations like Jacobi's method (i.e., separate read and write arrays)
[Figure: a 7-point 3D stencil at (i,j,k) touching the six neighbors (i-1,j,k), (i+1,j,k), (i,j-1,k), (i,j+1,k), (i,j,k-1), (i,j,k+1)]
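A short C sketch of the only case examined here, a Jacobi-style 7-point stencil with separate read and write arrays (grid sizes, ghost-cell handling, and the weights alpha/beta are illustrative, not taken from the slides):

    /* Jacobi-style 7-point Laplacian: read from 'in', write to 'out'.
       The grid is (nx+2) x (ny+2) x (nz+2) with a one-cell ghost layer. */
    #define IDX(i, j, k) ((i) + (size_t)(nx + 2) * ((j) + (size_t)(ny + 2) * (k)))

    void laplacian_7pt(const double *in, double *out,
                       int nx, int ny, int nz,
                       double alpha, double beta)
    {
        for (int k = 1; k <= nz; k++)
            for (int j = 1; j <= ny; j++)
                for (int i = 1; i <= nx; i++)
                    out[IDX(i, j, k)] =
                        alpha * in[IDX(i, j, k)]
                      + beta  * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)]
                               + in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)]
                               + in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
    }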

Strategy Engine: Auto-tuning Optimizations
The Strategy Engine explores a number of auto-tuning optimizations:
–loop unrolling / register blocking (sketched below)
–cache blocking
–constant propagation / common subexpression elimination
Future work:
–cache bypass (e.g., movntpd)
–software prefetching
–SIMD intrinsics
–data structure transformations
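As a hedged illustration of the first item (loop unrolling / register blocking), the unit-stride i-loop of the 7-point stencil above can be unrolled by two so the shared east/west neighbor loads stay in registers. This reuses the IDX macro from the previous sketch; the code the strategy engine actually generates is more elaborate, and this version assumes nx is even.

    /* Same 7-point stencil, unit-stride i-loop unrolled by 2.  The load
       in[IDX(i+1,j,k)] is reused between the two updates so it can stay in a
       register.  Assumes nx is even; otherwise a cleanup loop is needed. */
    void laplacian_7pt_unroll2(const double *in, double *out,
                               int nx, int ny, int nz,
                               double alpha, double beta)
    {
        for (int k = 1; k <= nz; k++)
            for (int j = 1; j <= ny; j++)
                for (int i = 1; i <= nx; i += 2) {
                    double c0 = in[IDX(i,     j, k)];
                    double c1 = in[IDX(i + 1, j, k)];
                    out[IDX(i, j, k)] =
                        alpha * c0
                      + beta  * (in[IDX(i - 1, j, k)] + c1
                               + in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)]
                               + in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
                    out[IDX(i + 1, j, k)] =
                        alpha * c1
                      + beta  * (c0 + in[IDX(i + 2, j, k)]
                               + in[IDX(i + 1, j - 1, k)] + in[IDX(i + 1, j + 1, k)]
                               + in[IDX(i + 1, j, k - 1)] + in[IDX(i + 1, j, k + 1)]);
                }
    }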

Laplacian Performance
On the memory-bound architecture (Barcelona), auto-parallelization doesn't make a difference; auto-tuning enables scalability.
Barcelona is bandwidth-proportionally faster than the XT4.
Nehalem is ~2.5x faster than Barcelona and 4x faster than the XT4.
Auto-parallelization plus tuning significantly outperforms OpenMP.
[Chart: performance of the serial reference, auto-parallelized, auto-tuned, and auto-NUMA versions vs. the OpenMP comparison]

Possible Paths Forward
Progress on PGAS languages will continue
Demonstration of PGAS for climate
–Continue UPC landscape work (independent)
–Resurrect Co-Array Fortran POP on XT4/5
–Autotune CCSM kernels: CRM (to be added)
More possibilities
–Halo updates in the CG solver in POP (see the sketch after this list)
  How much time is spent in this? IE at larger scale
–Scalability of Ice
–Do lookup tables in UPC (across a node)
  MB currently; might get larger
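The halo updates mentioned above map naturally onto one-sided PGAS communication. A hedged UPC sketch of a 1D ghost-cell exchange (the layout, sizes, and names are invented for illustration, not POP's actual data structures): each thread pulls its neighbors' boundary values with upc_memget, and the neighbors never post a receive.

    #include <upc.h>

    #define NLOC 256   /* interior cells per thread (illustrative size) */

    /* Blocked layout: row t of 'grid' lives entirely on thread t.
       Element 0 is the left ghost cell, 1..NLOC the interior, NLOC+1 the right ghost. */
    shared [NLOC + 2] double grid[THREADS][NLOC + 2];

    void halo_exchange(void)
    {
        int left  = (MYTHREAD + THREADS - 1) % THREADS;
        int right = (MYTHREAD + 1) % THREADS;

        upc_barrier;   /* neighbors must have finished updating their boundaries */

        /* One-sided gets: pull the left neighbor's last interior cell into my left
           ghost cell and the right neighbor's first interior cell into my right
           ghost cell.  The neighbors do nothing. */
        upc_memget((double *)&grid[MYTHREAD][0],        &grid[left][NLOC], sizeof(double));
        upc_memget((double *)&grid[MYTHREAD][NLOC + 1], &grid[right][1],   sizeof(double));

        upc_barrier;   /* ghost cells are now valid on every thread */
    }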

Priority List
POP halo exchange
–POP on > 4K processors did not work
–Sent a parameter fix for Cray Portals
–CG solver can benefit from overlap
–LANL released POP (tripole grid?)
CICE solver
–Could benefit current IE runs
–Also limited by halo updates
Lookup table and hybrid programming
–Pick up CSU student code
–One-sided MPI messaging
Improving OpenMP code performance
–Autotuning for CRM