CS4961 Parallel Programming Lecture 12/13: Introduction to Locality
Mary Hall, October 1/3, 2009

Administrative
Programming assignment 2 is posted (after class)
-Due Thursday, October 8 before class
-Use the "handin" program on the CADE machines
-Use the following command: "handin cs4961 prog2 "
Mailing list set up:
Midterm Quiz on Oct. 8
-Covers material in Lectures
-Brief review on Tuesday
-Can take it early on Tuesday, Oct. 6 after class

Today's Lecture
Questions on assignment
Review for exam
More on data locality (and slides from last time)
-Begin with dependences
-Tests for legality of transformations
-A few more transformations

Review for Quiz
L1: Overview
-Technology drivers for multi-core paradigm shift
-Concerns in efficient parallelization
L2:
-Fundamental theorem of dependence (also in today's lecture)
-Reductions
L3:
-SIMD/MIMD, shared memory, distributed memory
-Candidate Type Architecture Model
L4:
-Task and data parallelism
-Example in Peril-L
L5:
-Task Parallelism and Task Graphs

Review for Quiz
L6:
-SIMD execution
-Challenges with SIMD multimedia extensions
L7:
-Solution to HW2
-Ghost cells and data partitioning
L8 & L9:
-OpenMP: constructs, idea, target architecture
L10:
-TBB (see tutorial and assignment)
L11:
-Reasoning about Performance

Format for Quiz
Definitions, 1 from each lecture
Problem solving (3 questions)
Essay question (pick one from 5)

Locality – What does it mean?
We talk about locality a lot
What are the ways to improve memory behavior in a parallel application? What are the key considerations?
Today's lecture:
-Mostly about managing caches (and registers)
-Targets of optimizations
-Abstractions to reason about locality
-Data dependences, reordering transformations

Locality and Parallelism (from Lecture 1)
Large memories are slow, fast memories are small
Cache hierarchies are intended to provide the illusion of a large, fast memory
Program should do most work on local data!
[Figure: conventional storage hierarchy -- each processor node has its own cache, L2 cache, L3 cache, and memory, with potential interconnects between nodes]

Lecture 3: Candidate Type Architecture (CTA Model)
A model with P standard processors, degree d, latency λ
Node == processor + memory + NIC
Key Property: a local memory reference costs 1, a global memory reference costs λ

Targets of Memory Hierarchy Optimizations
Reduce memory latency
-The latency of a memory access is the time (usually in cycles) between a memory request and its completion
Maximize memory bandwidth
-Bandwidth is the amount of useful data that can be retrieved over a time interval
Manage overhead
-Cost of performing the optimization (e.g., copying) should be less than the anticipated gain

Reuse and Locality
Consider how data is accessed
Data reuse:
-Same or nearby data used multiple times
-Intrinsic in computation
Data locality:
-Data is reused and is present in "fast memory"
-Same data or same data transfer
If a computation has reuse, what can we do to get locality?
-Appropriate data placement and layout
-Code reordering transformations

Cache basics: a quiz
Cache hit:
-in-cache memory access (cheap)
Cache miss:
-non-cached memory access (expensive)
-need to access the next, slower level of the hierarchy
Cache line size:
-# of bytes loaded together in one entry
-typically a few machine words per entry
Capacity:
-amount of data that can be simultaneously in cache
Associativity:
-direct-mapped: only 1 address (line) in a given range can be in cache
-n-way: n >= 2 lines with different addresses can be stored
Parameters to optimization
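The effect of line size and capacity can be observed directly from a loop's access pattern. Below is a minimal sketch (mine, not from the slides) that reads the same array once with stride 1 and once with a stride larger than a typical cache line; on most machines the strided version is markedly slower because each access pulls in a whole line but uses only one element of it. The array size N, the stride of 16, and the crude clock()-based timing are all illustrative choices.

  #include <stdio.h>
  #include <time.h>

  #define N (1 << 22)   /* illustrative size, larger than typical caches (32 MB of doubles) */

  int main(void) {
      static double a[N];   /* zero-initialized static storage */
      double sum = 0.0;
      clock_t t0, t1;

      /* Stride-1 traversal: consecutive elements share cache lines,
         so most accesses hit (spatial locality). */
      t0 = clock();
      for (int i = 0; i < N; i++) sum += a[i];
      t1 = clock();
      printf("stride 1:  %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

      /* Stride-16 traversal: 16 doubles = 128 bytes apart, so each access
         touches a different cache line and nearly every access misses,
         even though the same total number of elements is read. */
      t0 = clock();
      for (int s = 0; s < 16; s++)
          for (int i = s; i < N; i += 16) sum += a[i];
      t1 = clock();
      printf("stride 16: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

      return sum == 0.0 ? 0 : 1;   /* use sum so the loops are not optimized away */
  }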

Temporal Reuse in Sequential Code
Same data used in distinct iterations I and I'

  for (i=1; i<N; i++)
    for (j=1; j<N; j++)
      A[j] = A[j+1] + A[j-1];

A[j] has self-temporal reuse in loop i

Spatial Reuse (Ignore for now)
Same data transfer (usually a cache line) used in distinct iterations I and I'

  for (i=1; i<N; i++)
    for (j=1; j<N; j++)
      A[j] = A[j+1] + A[j-1];

A[j] has self-spatial reuse in loop j
Multi-dimensional array note: C arrays are stored in row-major order

Group Reuse
Same data used by distinct references

  for (i=1; i<N; i++)
    for (j=1; j<N; j++)
      A[j] = A[j+1] + A[j-1];

A[j], A[j+1] and A[j-1] have group reuse (spatial and temporal) in loop j

Can Use Reordering Transformations!
Analyze reuse in the computation
Apply loop reordering transformations to improve locality based on reuse
With any loop reordering transformation, always ask:
-Safety? (doesn't reverse dependences)
-Profitability? (improves locality)

Loop Permutation: An Example of a Reordering Transformation
Permute the order of the loops to modify the traversal order

  for (j=0; j<6; j++)
    for (i=0; i<3; i++)
      A[i][j+1] = A[i][j] + B[j];

  for (i=0; i<3; i++)
    for (j=0; j<6; j++)
      A[i][j+1] = A[i][j] + B[j];

[Figure: iteration-space diagrams over i and j showing the new traversal order after permutation]
Which one is better for row-major storage?
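As a concrete illustration of the question on the slide, the sketch below (mine, not from the lecture) scales the example up so the difference is measurable; the dimensions M and NCOL and the timing harness are my additions. On most machines the i-outer version is substantially faster, because with C's row-major layout the inner j loop then walks adjacent memory locations.

  #include <stdio.h>
  #include <time.h>

  #define M 2048
  #define NCOL 2048

  static double A[M][NCOL + 1];
  static double B[NCOL];

  int main(void) {
      clock_t t0, t1;

      /* j outer, i inner: consecutive iterations touch A[0][j], A[1][j], ...
         which are NCOL+1 doubles apart in memory -- poor spatial locality. */
      t0 = clock();
      for (int j = 0; j < NCOL; j++)
          for (int i = 0; i < M; i++)
              A[i][j + 1] = A[i][j] + B[j];
      t1 = clock();
      printf("j outer: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

      /* i outer, j inner: consecutive iterations touch A[i][j], A[i][j+1], ...
         which are adjacent in row-major storage -- good spatial locality. */
      t0 = clock();
      for (int i = 0; i < M; i++)
          for (int j = 0; j < NCOL; j++)
              A[i][j + 1] = A[i][j] + B[j];
      t1 = clock();
      printf("i outer: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

      return 0;
  }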

Permutation has many goals
Locality optimization
-Particularly for spatial locality (like in your SIMD assignment)
Rearrange the loop nest to move parallelism to the appropriate level of granularity
-Inward to exploit fine-grain parallelism (like in your SIMD assignment)
-Outward to exploit coarse-grain parallelism
Also, to enable other optimizations

Tiling (Blocking): Another Loop Reordering Transformation
Blocking reorders loop iterations to bring iterations that reuse data closer in time
Goal is to retain data in cache between reuses
[Figure: iteration space over I and J, before and after tiling]

Tiling is Fundamental!
Tiling is very commonly used to manage limited storage
-Registers
-Caches
-Software-managed buffers
-Small main memory
Can be applied hierarchically

Tiling Example

Original:
  for (j=1; j<M; j++)
    for (i=1; i<N; i++)
      D[i] = D[i] + B[j][i];

Strip mine:
  for (j=1; j<M; j++)
    for (i=1; i<N; i+=s)
      for (ii=i; ii<min(i+s,N); ii++)
        D[ii] = D[ii] + B[j][ii];

Permute:
  for (i=1; i<N; i+=s)
    for (j=1; j<M; j++)
      for (ii=i; ii<min(i+s,N); ii++)
        D[ii] = D[ii] + B[j][ii];
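For reference, a runnable version of the final (tiled) loop nest; the array sizes, the strip size s, and the min_int helper are illustrative choices of mine, not part of the slide. After strip-mining and permuting, each strip of D stays resident in cache while the entire j loop runs over it, so each element of the strip is reused for every j without being evicted.

  #include <stdio.h>

  #define M 1000
  #define N 1000

  static double D[N];
  static double B[M][N];

  static int min_int(int a, int b) { return a < b ? a : b; }

  int main(void) {
      int s = 64;   /* strip (tile) size, chosen so a strip of D fits in cache */

      /* Tiled form: for each strip [i, i+s) of D, run the entire j loop.
         Each D[ii] in the strip is reused M-1 times while still in cache. */
      for (int i = 1; i < N; i += s)
          for (int j = 1; j < M; j++)
              for (int ii = i; ii < min_int(i + s, N); ii++)
                  D[ii] = D[ii] + B[j][ii];

      printf("D[1] = %f\n", D[1]);
      return 0;
  }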

How to Determine Safety and Profitability?
Safety
-A step back to Lecture 2
-Notion of reordering transformations
-Based on data dependences
Profitability
-Reuse analysis (and other cost models)
-Also based on data dependences, but simpler

Key Control Concept: Data Dependence
Question: When is parallelization guaranteed to be safe?
Answer: If there are no data dependences across reordered computations.
Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the accesses is a write.
Bernstein's conditions (1966): Let I_j be the set of memory locations read by process P_j, and O_j the set it updates. To execute P_j and another process P_k in parallel:
-I_j ∩ O_k = ∅  (write after read)
-I_k ∩ O_j = ∅  (read after write)
-O_j ∩ O_k = ∅  (write after write)
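As a quick worked illustration (my example, not from the slides): for the statements S1: a = b + c and S2: d = b * 2, the read sets are I_1 = {b, c} and I_2 = {b}, and the write sets are O_1 = {a} and O_2 = {d}; all three intersections are empty, so S1 and S2 may run in parallel. If S2 were instead b = a * 2, then I_1 ∩ O_2 = {b} and I_2 ∩ O_1 = {a} are nonempty, so the two statements must keep their original order.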

Data Dependence and Related Definitions
Actually, parallelizing compilers must formalize this to guarantee correct code.
Let's look at how they do it. It will help us understand how to reason about correctness as programmers.
Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.
A data dependence can either be between two distinct program statements or two different dynamic executions of the same program statement.
Source: "Optimizing Compilers for Modern Architectures: A Dependence-Based Approach", Allen and Kennedy, 2002, Ch. 2. (not required or essential)

Some Definitions (from Allen & Kennedy)
Definition 2.5:
-Two computations are equivalent if, on the same inputs, they produce identical outputs and the outputs are executed in the same order
Definition 2.6:
-A reordering transformation changes the order of statement execution without adding or deleting any statement executions
Definition 2.7:
-A reordering transformation preserves a dependence if it preserves the relative execution order of the dependence's source and sink

Fundamental Theorem of Dependence
Theorem 2.2: Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.

Brief Detour on Parallelizable Loops as a Reordering Transformation
Forall or Doall loops: loops whose iterations can execute in parallel (a particular reordering transformation)
Example:

  forall (i=1; i<=n; i++)
    A[i] = B[i] + C[i];

Meaning?
-Each iteration can execute independently of the others
-Free to schedule iterations in any order
Why are parallelizable loops an important concept?
-Source of scalable, balanced work
-Common to scientific, multimedia, graphics & other domains
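The forall above is pseudocode; as a rough sketch, the same doall loop could be expressed with OpenMP (covered in Lectures 8 and 9). The function name and signature below are mine, not from the slides, and the arrays are assumed to have at least n+1 elements since the loop is 1-based.

  #include <omp.h>

  /* A doall loop written as an OpenMP parallel for. Every iteration writes
     a distinct A[i] and reads only B[i] and C[i], so there are no
     cross-iteration dependences and any schedule of the iterations is legal. */
  void vector_add(int n, double *A, const double *B, const double *C) {
      #pragma omp parallel for
      for (int i = 1; i <= n; i++)
          A[i] = B[i] + C[i];
  }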

Data Dependence for Arrays
Recognizing parallel loops (intuitively):
-Find the data dependences in the loop
-No dependences crossing an iteration boundary => parallelization of the loop's iterations is safe

  for (i=2; i<5; i++)      /* loop-carried dependence */
    A[i] = A[i-2] + 1;

  for (i=1; i<=3; i++)     /* loop-independent dependence */
    A[i] = A[i] + 1;
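To spell out the labels (my wording, not on the slide): in the first loop, the value written to A[i] in one iteration is read as A[i-2] two iterations later, so the dependence crosses iteration boundaries (loop-carried) and the iterations cannot all run in parallel. In the second loop, each iteration reads and writes only its own A[i], so the dependence stays within a single iteration (loop-independent) and the loop is parallelizable.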

1. Characterize Iteration Space
Iteration instance: represented as coordinates in iteration space
-n-dimensional discrete Cartesian space for n-deep loop nests

  for (i=1; i<=5; i++)
    for (j=i; j<=7; j++)
      ...
  /* 1 ≤ i ≤ 5,  i ≤ j ≤ 7 */

Lexicographic order: sequential execution order of iterations
[1,1], [1,2], ..., [1,6], [1,7], [2,2], [2,3], ..., [2,6], ...
Iteration I (a vector) is lexicographically less than I', written I < I', iff there exists c such that (i_1, ..., i_{c-1}) = (i'_1, ..., i'_{c-1}) and i_c < i'_c.
[Figure: iteration space of this loop nest plotted over i and j]

2. Compare Memory Accesses across Dynamic Instances in Iteration Space

  N = 6;
  for (i=1; i<N; i++)
    for (j=1; j<N; j++)
      A[i+1][j+1] = A[i][j] * 2.0;

How to describe the relationship between two dynamic instances? e.g., I=[1,1] and I'=[2,2]
I=[1,1] writes A[2,2]; I'=[2,2] reads A[2,2]
[Figure: iteration space over i and j]

Distance Vectors
A loop has a distance vector D if there exists a data dependence from iteration vector I to a later vector I', and D = I' - I.
Since I' > I, D >= 0 (D is lexicographically greater than or equal to 0).

  N = 6;
  for (i=1; i<N; i++)
    for (j=1; j<N; j++)
      A[i+1][j+1] = A[i][j] * 2.0;

For this example, the distance vector is [1,1].
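For small loop bounds the distance vectors can be enumerated by brute force, which is a handy sanity check while learning the definitions. The sketch below (mine, not from the lecture) compares every write instance of the loop above against every read instance and prints the distance vector of each dependence it finds; it checks only the flow dependence from a write to a later read, which is the only kind present here. The index names iw, jw, ir, jr are my own.

  #include <stdio.h>

  /* Brute-force flow-dependence check for:
       for (i=1; i<N; i++)
         for (j=1; j<N; j++)
           A[i+1][j+1] = A[i][j] * 2.0;
     A write at (iw, jw) touches A[iw+1][jw+1]; a read at (ir, jr) touches
     A[ir][jr].  A dependence exists when both name the same element and
     the read occurs in a lexicographically later iteration than the write. */
  int main(void) {
      const int N = 6;
      for (int iw = 1; iw < N; iw++)
          for (int jw = 1; jw < N; jw++)
              for (int ir = 1; ir < N; ir++)
                  for (int jr = 1; jr < N; jr++) {
                      int same_loc = (iw + 1 == ir) && (jw + 1 == jr);
                      int later    = (ir > iw) || (ir == iw && jr > jw);
                      if (same_loc && later)
                          printf("distance vector [%d,%d]\n", ir - iw, jr - jw);
                  }
      return 0;   /* prints [1,1] for every dependent pair of iterations */
  }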

Distance and Direction Vectors
Distance vectors: (infinitely large set)
Direction vectors: (realizable if 0 or lexicographically positive)
Examples: [=,=], [=,<], [<,=], [<,<]
Common notation (distance -> direction symbol):
  0   ->  =
  +   ->  <
  -   ->  >
  +/- ->  *

Parallelization Test: 1-Dimensional Loop
Examples:

  for (j=1; j<N; j++)
    A[j] = A[j] + 1;

  for (j=1; j<N; j++)
    B[j] = B[j-1] + 1;

Dependence (distance and direction) vectors?
Test for parallelization:
-A 1-d loop is parallelizable if, for every data dependence D in the loop's dependence set, D = 0
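Working these out (my annotation, not on the slide): in the first loop, A[j] is read and written only within iteration j, so the only dependence has distance 0 (direction =) and the loop passes the test and is parallelizable. In the second loop, the B[j-1] read in iteration j was written in iteration j-1, giving distance 1 (direction <); the test fails and the iterations must run in order.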

n-Dimensional Loop Nests

  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      A[i][j] = A[i][j-1] + 1;

  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      A[i][j] = A[i-1][j-1] + 1;

Distance and direction vectors?
Definition: D = (d_1, ..., d_n) is loop-carried at level i if d_i is the first nonzero element.
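Working these out (my annotation, not on the slide): the first nest writes A[i][j] and reads A[i][j-1], so the distance vector is (0,1), the direction vector is [=,<], and the dependence is carried at level 2 (the j loop). The second nest reads A[i-1][j-1], so the distance vector is (1,1), the direction vector is [<,<], and the dependence is carried at level 1 (the i loop).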

Test for Parallelization
The i-th loop of an n-dimensional loop nest is parallelizable if there does not exist any level-i data dependence.
Equivalently, the i-th loop is parallelizable if for all dependences D = (d_1, ..., d_n), either
  (d_1, ..., d_{i-1}) > 0, or
  (d_1, ..., d_i) = 0
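Applying this to the two nests on the previous slide (again my annotation): the first nest's only dependence is (0,1), so the outer i loop is parallelizable (d_1 = 0) but the inner j loop is not. The second nest's dependence is (1,1), so the outer i loop is not parallelizable, while the inner j loop is, because the level-1 prefix (d_1) = (1) is already lexicographically positive.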

Back to Locality: Safety of Permutation
Intuition: cannot permute two loops i and j in a loop nest if doing so reverses the direction of any dependence.
Loops i through j of an n-deep loop nest are fully permutable if for all dependences D, either
  (d_1, ..., d_{i-1}) > 0, or
  for all k, i ≤ k ≤ j: d_k ≥ 0
Stated without proof: within the affine domain, the n-1 inner loops of an n-deep loop nest can be transformed to be fully permutable.

Simple Examples: 2-d Loop Nests
Distance vectors? OK to permute?

  for (i=0; i<3; i++)
    for (j=0; j<6; j++)
      A[i][j+1] = A[i][j] + B[j];

  for (i=0; i<3; i++)
    for (j=0; j<6; j++)
      A[i+1][j-1] = A[i][j] + B[j];
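Working these out (my annotation, not on the slide): the first nest has distance vector (0,1); after permutation it becomes (1,0), which is still lexicographically positive, so permutation is legal. The second nest has distance vector (1,-1); permutation would turn it into (-1,1), which is lexicographically negative and would reverse the dependence, so these two loops cannot be permuted.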

Legality of Tiling
Tiling = strip-mine and permutation
-Strip-mine does not reorder iterations
-Permutation must be legal, OR
-the strip size must be less than the dependence distance

Summary of Lecture
Motivation for locality optimization
Discussion of dependences and reuse analysis
Brief examination of loop transformations
-Specifically, permutation and tiling
-More next lecture