
Revisiting Loop Transformations with X10 Clocks
Tomofumi Yuki, Inria / LIP / ENS Lyon
X10 Workshop 2015

The Problem
The parallelism challenge:
- we cannot escape going parallel
- parallel programming is hard
- automatic parallelization is limited
- there will not be any silver bullet
X10 as a partial answer:
- a high-level language designed with parallelism in mind
- features to control parallelism and locality

Programming with X10
A small set of parallel constructs:
- finish / async
- clocks
- at (places), atomic, when
These can be composed freely, which is interesting for both programmers and compilers, but also challenging. This freedom, however, seems to be under-utilized.
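Since these constructs are the vocabulary for the rest of the talk, here is a minimal sketch (my example, not from the slides) of how they compose; the slides write advance for what the X10 library spells Clock.advanceAll():

    public class Compose {
        public static def main(args: Rail[String]) {
            clocked finish {            // creates a clock; this activity is registered on it
                clocked async {         // child activity, registered on the same clock
                    Console.OUT.println("A: phase 0");
                    Clock.advanceAll(); // wait for all activities registered on the clock
                    Console.OUT.println("A: phase 1");
                }
                clocked async {
                    Console.OUT.println("B: phase 0");
                    Clock.advanceAll();
                    Console.OUT.println("B: phase 1");
                }
            }                           // finish waits for both activities to terminate
        }
    }

Both activities print their "phase 0" line before either prints "phase 1": the advance acts as a phase boundary among the registered activities.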

This Paper
Exploring how to use X10 clocks.
[Figure: clocks as an alternative to the "usual" way of writing parallel loops, positioned along performance and expressivity axes]

Context: Loop Transformations
Key to exposing parallelism. Sometimes it is easy:

for i
  for j
    X[i] += ...

becomes

forall i
  for j
    X[i] += ...

but not always:

for i = 0..N
  for j = 1..M
    X[j] = foo(X[j-1], X[j+1]);

becomes

for i = 1..2N+M
  forall j = /*complex bounds*/
    X[j] = foo(X[2*j-i-1], X[2*j-i+1]);

Automatic Parallelization
Very sensitive to inputs. Given this input:

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
    x[i][j] = x[i-1][j] + x[i][j-1];
for (i=1; i<N-1; i++)
  for (j=1; j<M-1; j++)
    y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];

an automatic parallelizer generates:

for (t1=2; t1<=3; t1++) {
  lbp=1; ubp=t1-1;
  #pragma omp parallel for private(lbv,ubv,t3)
  for (t2=lbp; t2<=ubp; t2++) {
    S1((t1-t2),t2);
  }
}
for (t1=4; t1<=min(M,N); t1++) {
  S1((t1-1),1);
  lbp=2; ubp=t1-2;
  #pragma omp parallel for private(lbv,ubv,t3)
  for (t2=lbp; t2<=ubp; t2++) {
    S1((t1-t2),t2);
    S2((t1-t2-1),(t2-1));
  }
  S1(1,(t1-1));
}
for (t1=M+1; t1<=N; t1++) {
  S1((t1-1),1);
  lbp=2; ubp=M-1;
  #pragma omp parallel for private(lbv,ubv,t3)
  for (t2=lbp; t2<=ubp; t2++) {
    S1((t1-t2),t2);
    S2((t1-t2-1),(t2-1));
  }
}
for (t1=N+1; t1<=M; t1++) {
  lbp=t1-N+1; ubp=t1-2;
  #pragma omp parallel for private(lbv,ubv,t3)
  for (t2=lbp; t2<=ubp; t2++) {
    S1((t1-t2),t2);
    S2((t1-t2-1),(t2-1));
  }
  S1(1,(t1-1));
}
for (t1=max(M+1,N+1); t1<=N+M-2; t1++) {
  lbp=t1-N+1; ubp=M-1;
  #pragma omp parallel for private(lbv,ubv,t3)
  for (t2=lbp; t2<=ubp; t2++) {
    S1((t1-t2),t2);
    S2((t1-t2-1),(t2-1));
  }
}

This output is very difficult to understand: you either trust it or do not use it.

Expressing with Clocks
Goal: retain the original structure.

Original:

for (i=1; i<N; i++)
  for (j=1; j<M; j++)
    x[i][j] = x[i-1][j] + x[i][j-1];
for (i=1; i<N-1; i++)
  for (j=1; j<M-1; j++)
    y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];

With clocks:

for (i=1; i<N; i++) {
  async
  for (j=1; j<M; j++) {
    x[i][j] = x[i-1][j] + x[i][j-1];
    advance;
  }
  advance;
}
for (i=1; i<N-1; i++) {
  async
  for (j=1; j<M-1; j++) {
    y[i][j] = y[i-1][j] + y[i][j-1] + x[i+1][j+1];
    advance;
  }
  advance;
}

1. Make many iterations parallel: each inner loop becomes an async.
2. Order them by synchronizations: the advances.
The compound effect is parallelism similar to that obtained with loop transformations.
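Spelled out in full X10, the first loop nest becomes the following (a sketch, assuming advance abbreviates Clock.advanceAll(); the sizes and the Rail-of-Rails array are made up for illustration):

    public class Wavefront {
        public static def main(args: Rail[String]) {
            val N = 8; val M = 8;
            val x = new Rail[Rail[Double]](N, (i:Long) => new Rail[Double](M, 1.0));
            clocked finish {
                for (i in 1..(N-1)) {
                    clocked async {
                        for (j in 1..(M-1)) {
                            x(i)(j) = x(i-1)(j) + x(i)(j-1);
                            Clock.advanceAll();   // one wave-front step
                        }
                    }
                    Clock.advanceAll();           // stagger the next activity by one phase
                }
            }
        }
    }

Activity i starts in phase i-1 and computes x(i)(j) in phase i+j-2, so both of its dependences (from phase i+j-3) are already satisfied.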

Outline
- Introduction
- X10 Clocks
- Examples
- Discussion

Clocks vs Barriers
Barriers can easily deadlock; clocks are more dynamic.

With barriers, mismatched counts hang:

//P1          //P2
barrier;      barrier;
S0;           S1;
barrier;      // deadlock: P2 never reaches a second barrier

With clocks, the same pattern is fine, because an activity is deregistered from the clock when it terminates:

//P1          //P2
advance;      advance;
S0;           S1;
advance;      // OK: P2 has terminated and dropped the clock
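In real X10 syntax the clock version looks like this (my sketch, not from the slides); the key point is in the comments:

    public class NoDeadlock {
        public static def main(args: Rail[String]) {
            clocked finish {
                clocked async {             // P1
                    Clock.advanceAll();
                    Console.OUT.println("S0");
                    Clock.advanceAll();     // does not block forever: P2 is gone
                }
                clocked async {             // P2
                    Clock.advanceAll();
                    Console.OUT.println("S1");
                }                           // terminating deregisters P2 from the clock
            }
        }
    }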

Dynamicity of Clocks: Implicit Syntax

clocked finish
for (i=1:N)
  clocked async {
    for (j=i:N) advance;
    S0;
  }

- clocked finish creates a clock; the creating (primary) activity is also registered on it
- each clocked async registers a new activity on the clock
- an activity is deregistered when it terminates
- advance synchronizes the currently registered activities

Here the primary never advances, so each activity waits until all activities have started: the primary activity has to terminate (and deregister) first.
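The implicit syntax is essentially sugar for explicit clock objects; a hedged sketch of the same pattern written explicitly (the printed message stands in for the placeholder S0):

    public class ExplicitClock {
        public static def main(args: Rail[String]) {
            val N = 4;
            val c = Clock.make();          // the creating activity is registered on c
            finish {
                for (i in 1..N) {
                    async clocked(c) {     // register the new activity on c
                        for (j in i..N) Clock.advanceAll();
                        Console.OUT.println("S0 for i = " + i);
                    }
                }
                c.drop();                  // the primary deregisters itself from c
            }
        }
    }

Without the c.drop(), the primary would remain registered while blocked at the end of the finish, and every child's advance would wait on it forever.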

Dynamicity of Clocks: Implicit Syntax

clocked finish
for (i=1:N) {
  clocked async {
    for (j=i:N) advance;
    S0;
  }
  advance;   // advance by the primary activity
}

Now the primary activity calls advance each time around the loop, giving a different synchronization pattern.
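A small runnable experiment (my illustration) makes the difference concrete: with the primary advancing once per spawn, activity i starts in phase i-1 and performs N-i+1 advances, so every activity reaches S0 in the same phase N. Without the primary's advance (previous slide), all activities start in phase 0 and activity i would reach S0 in phase N-i+1 instead.

    public class Stagger {
        public static def main(args: Rail[String]) {
            val N = 4;
            clocked finish {
                for (i in 1..N) {
                    clocked async {
                        for (j in i..N) Clock.advanceAll();
                        Console.OUT.println("activity " + i + " reached S0");
                    }
                    Clock.advanceAll();   // the primary advances once per spawn
                }
            }
        }
    }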

Outline
- Introduction
- X10 Clocks
- Examples
- Discussion

Example: Skewing
Skewing the loops is not easy. Original:

for (i=1:N)
  for (j=1:N)
    h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]);

After skewing, both the loop bounds and the indexing change:

for (i=2:2N)
  forall (j=max(1,i-N):min(N,i-1))
    h[i-j][j] = foo(h[(i-j)-1][j], h[(i-j)-1][j-1], h[(i-j)][j-1]);
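For reference, the transformation behind this code is the classic skew (my derivation, matching the bounds above):

$(i,j) \mapsto (i',j) = (i+j,\ j)$, so $i = i' - j$, and the bounds $1 \le i \le N$, $1 \le j \le N$ become
$2 \le i' \le 2N$, $\max(1,\ i'-N) \le j \le \min(N,\ i'-1)$.

The dependence distances $(1,0)$, $(1,1)$, $(0,1)$ become $(1,0)$, $(2,1)$, $(1,1)$: all are strictly positive in $i'$, so for each fixed $i'$ the $j$ loop is parallel.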

Example: Skewing Equivalent parallelism without changing loops 28 clocked finish for (i=1:N) { clocked async for (j=1:N) { h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]); advance; } advance; }

Example: Skewing Equivalent parallelism without changing loops 29 clocked finish for (i=1:N) { clocked async for (j=1:N) { h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]); advance; } advance; } locally sequential the launch of the entire block is deferred

Example: Skewing
You can obtain the same skewing with the roles of the loops exchanged:

clocked finish
for (j=1:N) {
  clocked async
  for (i=1:N) {
    h[i][j] = foo(h[i-1][j], h[i-1][j-1], h[i][j-1]);
    advance;
  }
  advance;
}

Note: this interchanges the outer parallel loop, expressed with clocks.

Example: Loop Fission
A common use of barriers. With forall:

forall (i=1:N) {
  S1;
  S2;
}

becomes

forall (i=1:N) S1;
forall (i=1:N) S2;

With clocks, the code structure stays:

for (i=1:N) async {
  S1;
  S2;
}

becomes

for (i=1:N) async {
  S1;
  advance;
  S2;
}
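A concrete sketch of the clocked fission (my example; the arrays a and b and the statements are made up): the advance acts as the barrier, so every activity finishes its S1 before any executes its S2.

    public class Fission {
        public static def main(args: Rail[String]) {
            val N = 8;
            val a = new Rail[Long](N+1);
            val b = new Rail[Long](N+1);
            clocked finish {
                for (i in 1..N) {
                    clocked async {
                        a(i) = i;               // S1
                        Clock.advanceAll();     // fission point: a phase boundary
                        b(i) = a(N+1-i);        // S2 can safely read any a(...)
                    }
                }
            }
        }
    }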

Example: Loop Fusion
Fusion removes all the parallelism:

for (i=1:N) S1;
for (i=1:N) S2;

becomes

for (i=1:N) {
  S1;
  S2;
}

With clocks, the loops remain separate activities and advances provide the ordering:

async
for (i=1:N) {
  S1;
  advance;
}
advance;
advance;
async
for (i=1:N) {
  S2;
  advance;
}
advance;

Example: Loop Fusion
Sometimes fusion is not so simple:

for (i=1:N-1) S1(i);
for (i=2:N) S2(i);

becomes

S1(1);
for (i=2:N-1) {
  S1(i);
  S2(i);
}
S2(N);

With clocks, the code structure stays the same; the number of advances provides the control:

async
for (i=1:N-1) {
  S1;
  advance;
}
advance;
advance;
async
for (i=2:N) {
  S2;
  advance;
}
advance;

What can be expressed?
The limiting factor is parallelism:
- difficult to use for sequential loop nests
- works for wave-front parallelism
Intuition: clocks defer execution, and deferring the parent activity has a cumulative effect.
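One way to make the intuition precise (my framing, anticipating the next slide's claim about 1D affine schedules): the number of advances executed before an iteration is its logical execution time. For the skewing example, activity $i$ is spawned after $i-1$ primary advances and performs one advance per inner iteration, so iteration $(i,j)$ runs in phase

$\theta(i,j) = (i-1) + (j-1) = i + j - 2,$

which is exactly the one-dimensional affine schedule that loop skewing implements.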

Discussion
Learning curve: the behavior of clocks takes time to understand.
How much can you express?
- 1D affine schedules, for sure
- loop permutation is not possible
- what if we use multiple clocks?

Potential Applications
It might be easier for some people: it gives multiple ways to write the same code. A compiler could detect X10 fragments with this property and convert them to forall for performance.
