Beyond Shared Memory Loop Parallelism in the Polyhedral Model
Tomofumi Yuki, Ph.D. Dissertation, October 30, 2012

The Problem (figure)

Parallel Processing
- A small niche in the past, a hot topic today
- The ultimate solution: automatic parallelization
  - An extremely difficult problem
  - After decades of research, limited success
- Other solutions: programming models
  - Libraries (MPI, OpenMP, CnC, TBB, etc.)
  - Parallel languages (UPC, Chapel, X10, etc.)
  - Domain-specific languages (stencils, etc.)

Contributions (overview diagram): the MPI code generator, Polyhedral X10, and the AlphaZ system (MDE), all built on the polyhedral model: 40+ years of research, linear algebra and ILP, and tools such as CLooG, ISL, Omega, and PLuTo.

Polyhedral State of the Art
- Tiling-based parallelization, extended to parameterized tile sizes
  - First step [Renganarayana2007]
  - Parallelization + imperfectly nested loops [Hartono2010, Kim2010]
- The PLuTo approach is now used by many people
  - Wave-front of tiles: a better strategy than maximum parallelism [Bondhugula2008]
- Many advances in the shared memory context

How Far Can Shared Memory Go?
- The memory wall is still there
- Does it make sense for 1000 cores to share memory? [Berkeley View, Shalf 07, Kumar 05]
  - Power
  - Coherency overhead
  - False sharing
  - Hierarchy?
  - Data volume (tera- to peta-bytes)

Distributed Memory Parallelization
- Problems implicitly handled by shared memory now need explicit treatment
- Communication
  - Which processors need to send/receive?
  - Which data to send/receive?
  - How to manage communication buffers?
- Data partitioning
  - How do you allocate memory across nodes?

MPI Code Generator
- Distributed memory parallelization
  - Tiling-based
  - Parameterized tile sizes
  - C+MPI implementation
- Uniform dependences as the key enabler
  - Many affine dependences can be uniformized
- Shared memory performance carried over to distributed memory
  - Scales as well as PLuTo, but to multiple nodes

Related Work (Polyhedral)
- Polyhedral approaches
  - Initial idea [Amarasinghe1993]
  - Analysis for fixed-size tiling [Claßen2006]
  - Further optimization [Bondhugula2011]
- Brute-force polyhedral analysis for handling communication
  - No hope of handling parametric tile sizes
  - Can handle arbitrary affine programs

Outline
- Introduction
- Uniform-ness of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work

Affine vs. Uniform
- Affine dependences: f = Ax + b
  - Examples: (i,j->j,i), (i,j->i,i), (i->0)
- Uniform dependences: f = Ix + b (A is the identity)
  - Examples: (i,j->i-1,j), (i->i-1)
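As a concrete illustration (mine, not from the slides), the two C loop nests below exhibit the two kinds of dependences; A, B, C, and N are placeholder names.

  /* Uniform dependence (i,j -> i-1,j): the producer is a constant offset away. */
  for (int i = 1; i < N; i++)
    for (int j = 0; j < N; j++)
      A[i][j] = A[i-1][j] + B[i][j];

  /* Affine, non-uniform access (i,j -> j,i): the offset varies with (i,j). */
  for (int i = 0; i < N; i++)
    for (int j = 0; j < i; j++)
      C[i][j] = C[j][i] * 2.0;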

Uniformization (figure): the affine dependence (i->0) is uniformized into (i->i-1).

Uniformization
- Uniformization is a classic technique: solved in the 1980s, but forgotten in the multi-core era
- Any affine dependence can be uniformized by adding a dimension [Roychowdhury1988]
- Nullspace pipelining
  - A simple technique for uniformization
  - Uniformizes many dependences
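A minimal sketch (an assumed example, not from the dissertation) of uniformizing the broadcast dependence (i->0) by pipelining; x, y, p, and N are placeholder names, with p a newly introduced propagation array.

  /* Before: every iteration reads x[0] directly -- the dependence (i -> 0)
     is affine but not uniform. */
  for (int i = 1; i < N; i++)
    y[i] = y[i-1] + x[0];

  /* After: x[0] is pipelined through p, so every dependence is the
     uniform (i -> i-1). */
  p[0] = x[0];
  for (int i = 1; i < N; i++) {
    p[i] = p[i-1];            /* propagate the broadcast value */
    y[i] = y[i-1] + p[i];     /* all reads are now at constant offsets */
  }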

Uniformization and Tiling (figure): uniformization does not influence tilability.

PolyBench [Pouchet2010]
- A collection of 30 polyhedral kernels
- Proposed by Pouchet as a benchmark for polyhedral compilation
- Goal: a benchmark small enough that individual results are reported, not averages
- Kernels from:
  - data mining
  - linear algebra (kernels and solvers)
  - dynamic programming
  - stencil computations

Uniform-ness of PolyBench
- 5 of the 30 kernels are incorrect and are excluded
- Embedding: match the dimensions of statements
- Phase detection: separate the program into phases, where the output of one phase is used as input to another

Stage:                  At Start   | After Embedding | After Pipelining | After Phase Detection
Fully uniform programs: 8/25 (32%) | 13/25 (52%)     | 21/25 (84%)      | 24/25 (96%)

Outline
- Introduction
- Uniform-ness of Affine Programs
  - Uniformization
  - Uniform-ness of PolyBench
- MPI Code Generation
  - Tiling
  - Uniform-ness simplifies everything
  - Comparison against PLuTo with PolyBench
- Conclusions and Future Work

Basic Strategy: Tiling (figure). We focus on tilable programs.

Dependences in Tilable Space (figure): all dependences point in the non-positive direction.

Wave-front Parallelization (figure): all tiles with the same color can run in parallel.
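A minimal shared-memory sketch (my illustration, assuming a 2D tile space with tile coordinates 0..T1 and 0..T2): tiles with the same wave-front time t = ti1 + ti2 are independent, so the loop over one anti-diagonal can run in parallel; compute_tile is a hypothetical helper.

  #define MAX(a, b) ((a) > (b) ? (a) : (b))
  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  for (int t = 0; t <= T1 + T2; t++) {          /* sequential wave-front time */
    #pragma omp parallel for                    /* tiles on one anti-diagonal */
    for (int ti1 = MAX(0, t - T2); ti1 <= MIN(t, T1); ti1++) {
      int ti2 = t - ti1;
      compute_tile(ti1, ti2);
    }
  }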

Assumptions
- The program is uniform in at least one dimension, and the uniform dimension is made outermost
- The tilable space is fully permutable
- One-dimensional processor allocation
- Tile sizes are large enough that dependences do not span multiple tiles
Under these assumptions, communication is extremely simplified.

Processor Allocation (figure): the outermost tile loop is distributed across processors P0 through P3.

Values to Be Communicated (figure): the faces of the tiles, which may be thicker than 1.

Naïve Placement of Send and Receive Code (figure): the receive is placed in the tile that consumes the values.

Problems in Naïve Placement (figure): the receiver is in the next wave-front time step (t = 0 through t = 3 shown).

Problems in Naïve Placement
- The receiver is in the next wave-front time step
- Number of in-flight communications = amount of parallelism
- MPI_Send will deadlock
  - It may not return control if the system buffer is full
- Asynchronous communication is required
  - You must manage your own buffers
  - Required number of buffers = amount of parallelism, i.e., the number of virtual processors
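The deadlock pattern, in a minimal illustrative form (not generated code): if every processor blocks in MPI_Send before posting its receive and the system buffer is full, no send can complete.

  /* out, in, n, next, prev, and comm are placeholder names. */
  MPI_Send(out, n, MPI_DOUBLE, next, 0, comm);  /* may never return */
  MPI_Recv(in, n, MPI_DOUBLE, prev, 0, comm, MPI_STATUS_IGNORE);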

Proposed Placement of Send and Receive Code (figure): the receive is placed one tile below the consumer.

Placement within a Tile
- Naïve placement: receive -> compute -> send
- Proposed placement:
  1. Issue an asynchronous receive (MPI_Irecv)
  2. Compute
  3. Issue an asynchronous send (MPI_Isend)
  4. Wait for the values to arrive
- Computation and communication overlap
- Only two buffers (one receive, one send) per physical processor
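A per-tile sketch of the proposed placement (my reconstruction from the slide; compute_tile, pack_faces, and unpack_faces are hypothetical helpers, and left/right are placeholder neighbor ranks):

  #include <mpi.h>

  void compute_tile(int tile);
  void pack_faces(int tile, double *buf);
  void unpack_faces(int tile, double *buf);

  void execute_tile(int tile, double *recv_buf, double *send_buf, int count,
                    int left, int right, MPI_Comm comm) {
    MPI_Request rreq, sreq;

    /* 1. Post the asynchronous receive; the incoming values are needed
          only by a later tile, so computation can proceed. */
    MPI_Irecv(recv_buf, count, MPI_DOUBLE, left, 0, comm, &rreq);

    /* 2. Compute while communication proceeds in the background. */
    compute_tile(tile);

    /* 3. Pack this tile's faces and issue the asynchronous send. */
    pack_faces(tile, send_buf);
    MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, comm, &sreq);

    /* 4. Wait for the values to arrive, then make them available. */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    unpack_faces(tile, recv_buf);
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);  /* send_buf is reusable afterwards */
  }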

Evaluation
- Compare performance with PLuTo, the shared memory version of the same strategy
- Cray machine: 24 cores per node, up to 96 cores
- Goal: scaling similar to PLuTo's
- Tile sizes are searched with educated guesses
- PolyBench: 7 kernels are too small, 3 cannot be tiled or have limited parallelism, and 9 cannot be used due to PLuTo/PolyBench issues

Performance Results (charts): linear extrapolation from the speedup at 24 cores; broadcast costs at most 2.5 seconds.

AlphaZ System
- A system for polyhedral design space exploration
- Key features not explored by other tools:
  - Memory allocation
  - Reductions
- Case studies illustrating the importance of the unexplored design space [LCPC2012]
- The polyhedral equational model [WOLFHPC2012]
- MDE applied to compilers [MODELS2011]

Polyhedral X10 [PPoPP2013]
- Work with Vijay Saraswat and Paul Feautrier
- Extends array dataflow analysis to X10
  - Supports finish/async, but not clocks
  - finish/async can express more than doall, the focus of the polyhedral model so far
- The dataflow result is used to detect races
  - With polyhedral precision, we can guarantee program regions to be race-free

Conclusions
- Polyhedral compilation has lots of potential
  - Memory and reductions are not yet explored
  - Successes in automatic parallelization
  - Race-free guarantees
- Handling arbitrary affine programs may be overkill
  - Uniformization makes a lot of sense
  - It makes distributed memory parallelization easy
  - It can handle most of PolyBench

Future Work
- Many direct extensions
  - Hybrid MPI+OpenMP with multi-level tiling
  - Partial uniformization to satisfy the pre-condition
  - Handling clocks in Polyhedral X10
- Broader applications of the polyhedral model
  - Approximations
  - Larger granularity: blocks of computation instead of statements
  - Abstract interpretation [Alias2010]

Acknowledgements
- Advisor: Sanjay Rajopadhye
- Committee members: Wim Böhm, Michelle Strout, Edwin Chong
- Unofficial co-advisor: Steven Derrien
- Members of Mélange, HPCM, and CAIRN
- Dave Wonnacott and the Haverford students

Backup Slides

Uniformization and Tiling (figure): tilability is preserved.

D-Tiling Review [Kim2011]
- Parametric tiling for shared memory
- Uses non-polyhedral skewing of tiles, required for wave-front execution of tiles
- The key equation: time = t1/ts1 + t2/ts2 + ... + td/tsd
  where d is the number of tiled dimensions, the ti are the tile origins, and the tsi are the tile sizes

D-Tiling Review (cont.)
- The equation enables skewing of tiles
- If the time or any one tile origin is unknown, it can be computed from the others
- Generated code (tix is the (d-1)-th tile origin):

  for (time = start; time <= end; time++)
    for (ti1 = ti1LB; ti1 <= ti1UB; ti1++)
      ...
        for (tix = tixLB; tix <= tixUB; tix++) {
          tid = f(time, ti1, ..., tix);
          // compute tile (ti1, ti2, ..., tix, tid)
        }
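A hypothetical instance of f for d = 2 tiled dimensions, consistent with the key equation time = t1/ts1 + t2/ts2 (tile origins are multiples of the tile sizes):

  /* Recover the innermost tile origin from the wave-front time and the
     first tile origin. */
  int f(int time, int t1, int ts1, int ts2) {
    return (time - t1 / ts1) * ts2;
  }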

Placement of Receive Code Using D-Tiling
- A slight modification to the use of the equation: visit tiles in the next wave-front time

  for (time = start; time <= end; time++)
    for (ti1 = ti1LB; ti1 <= ti1UB; ti1++)
      ...
        for (tix = tixLB; tix <= tixUB; tix++) {
          tidNext = f(time + 1, ti1, ..., tix);
          // receive and unpack the buffer for tile (ti1, ti2, ..., tix, tidNext)
        }

Proposed Placement of Send and Receive Code (figure, repeated from the main slides): the receive is placed one tile below the consumer.

Extensions to Schedule-Independent Mapping
- Schedule-independent mapping [Strout1998]
  - Universal Occupancy Vectors (UOVs): storage mappings legal for any legal execution
  - Uniform dependence programs only
- The universality of UOVs can be restricted, e.g., to tiled execution
- For tiled execution, the shortest UOV can be found without any search

LU Decomposition (performance chart)

seidel-2d (performance chart)

seidel-2d, excluding the 8x8x8 tile size (performance chart)

jacobi-2d-imper (performance chart)

Related Work (Non-Polyhedral)
- Global communications [Li1990]
  - Translation from shared memory programs
  - Pattern matching for global communications
- Paradigm [Banerjee1995]
  - No loop transformations
  - Finds parallel loops and inserts the necessary communications
- Tiling-based [Goumas2006]
  - Perfectly nested loops with uniform dependences

adi.c: Performance (chart). PLuTo does not scale because the outer loop is not tiled.

UNAfold: Performance (chart). The complexity reduction is empirically confirmed.

Contributions
- The AlphaZ System
  - A polyhedral compiler that gives full control to the user
  - An equational view of the polyhedral model
- MPI Code Generator
  - The first code generator with parametric tiling
  - Double buffering
- Polyhedral X10
  - An extension of the polyhedral model
  - Race-free guarantees for X10 programs