1 Optimizing Compilers: Managing the Cache
Bercovici Sivan

2 Overview
Motivation
Cache structure
Important observations
Techniques covered in the book
– Loop interchange
– Blocking
– Unaligned data
– Prefetching

3 Overview (cont.)
Issues and techniques not covered in the book
– Instruction cache
– Dynamic profiling-driven cache optimization

4 Motivation
Shorten fetch time.
Figure: the processor-DRAM performance gap, which grows about 50% per year (performance vs. time, with the CPU and DRAM curves diverging).

5 Motivation (cont.)
Solution: a cache
– A smaller, faster memory
The software problem
– Maximize cache performance

6 Memory structure
Hierarchical; each level is larger and slower than the one above:
– Registers: 100s of bytes, <10s of ns
– Cache: K bytes, ns-scale access times
– Memory: M bytes, 100 ns-1 us
– Disk
Data moves between levels in units of instructions/words (registers-cache), blocks (cache-memory), and pages (memory-disk).

7 Cache structure
Caches are specialized:
– Instruction cache
– Data cache
– What about a stack cache?

8 Cache structure (cont.)
Organized into blocks (lines)
– Each block holds multiple machine words
Maps the entire memory
Most caches use an LRU replacement strategy
Figure: cache organization; the address (bits 31..0) splits into tag, line-index, and offset fields; the line index selects an entry in the tag array, and a match between the stored tag and the address tag signals a hit in the data array.
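To make the field split concrete, here is a minimal runnable sketch; the 16-byte line (4 offset bits) and 256-line (8 index bits) geometry and the example address are illustrative assumptions, not taken from the slide.

program cache_fields
  implicit none
  integer, parameter :: offset_bits = 4   ! hypothetical 16-byte lines
  integer, parameter :: line_bits = 8     ! hypothetical 256-line cache
  integer :: addr, offset, line, tag
  addr = 123456789                                           ! example address
  offset = iand(addr, 2**offset_bits - 1)                    ! low bits: byte within the line
  line = iand(ishft(addr, -offset_bits), 2**line_bits - 1)   ! middle bits: index into the tag array
  tag = ishft(addr, -(offset_bits + line_bits))              ! high bits: compared on lookup to detect a hit
  print *, 'tag =', tag, ' line =', line, ' offset =', offset
end program cache_fields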

9 Observations
Temporal locality
– If a variable is referenced, it tends to be referenced again soon.
Spatial locality
– If a variable is referenced, nearby variables tend to be referenced soon.

10 Observations (cont.)
Temporal locality example
– Variables used inside a loop
Spatial locality example
– Iterating over array items
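Both show up in even the simplest loop; a minimal sketch (illustrative, not from the slides): the scalar s is touched on every iteration (temporal locality), while x(i) walks consecutive memory, so each cache block serves several iterations (spatial locality).

program locality
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), s
  integer :: i
  x = 1.0
  s = 0.0
  do i = 1, n
     s = s + x(i)   ! s: temporal reuse; x(i): spatial reuse within each block
  end do
  print *, s
end program locality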

11 How the cache exploits these observations
Temporal locality
– The cache keeps recently accessed data.
Spatial locality
– The cache brings whole blocks of data in from memory.

12 Loop interchange

13 Example
DO I = 1, M
  DO J = 1, N
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
I iterates over rows, J over columns.
Fortran arrays are column-major
– The first column is stored first, then the second column, and so on.
– C is the other way around (row-major).
No spatial reuse.
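Concretely (a worked layout example, not on the original slide): for an M x N Fortran array, element A(I, J) sits at offset (J-1)*M + (I-1) from the array base, so A(I, J) and A(I+1, J) are adjacent in memory while A(I, J) and A(I, J+1) are M elements apart. With J innermost, consecutive accesses are therefore M elements apart, and each typically touches a new cache block.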

14 Example, visually
DO I = 1, M
  DO J = 1, N
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
Figure: arrays A and B with the cache mapping overlaid; every access of the inner loop lands on a different block, so every access is a cache miss.

15 Example analysis
DO I = 1, M
  DO J = 1, N
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
2*N*M misses
– Because the innermost loop iterates over the non-contiguous dimension (one miss per access, for both A and B).

16 Example fixed (loop interchange)
DO J = 1, N
  DO I = 1, M
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
We now process column by column, which gives spatial reuse.

17 Example, visually
DO J = 1, N
  DO I = 1, M
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
Figure: arrays A and B with the cache mapping overlaid; the inner loop now walks down each column, reusing each block before moving on.

18 Analyzing the fixed example
DO J = 1, N
  DO I = 1, M
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
2*N*M/b misses
– b is the cache-block size (in array elements); each block now serves b consecutive accesses.
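The effect is easy to measure. A self-contained micro-benchmark sketch (the 2000x2000 size and the cpu_time-based timing are illustrative choices, not from the slides; note that at high optimization levels a compiler may perform the interchange itself):

program interchange_demo
  implicit none
  integer, parameter :: m = 2000, n = 2000
  real, allocatable :: a(:,:), b(:,:)
  integer :: i, j
  real :: t0, t1, t2
  allocate(a(m,n), b(m,n))
  a = 0.0
  b = 1.0
  call cpu_time(t0)
  do i = 1, m              ! bad order: inner loop strides across columns
     do j = 1, n
        a(i,j) = a(i,j) + b(i,j)
     end do
  end do
  call cpu_time(t1)
  do j = 1, n              ! interchanged: inner loop walks contiguous memory
     do i = 1, m
        a(i,j) = a(i,j) + b(i,j)
     end do
  end do
  call cpu_time(t2)
  print *, 'row order:', t1 - t0, 's   column order:', t2 - t1, 's'
end program interchange_demo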

19 A harder example
DO I = 1, N
  DO J = 1, M
    D(I) = D(I) + B(I,J)
  ENDDO
ENDDO
As written: NM misses for B, N/b for D.
After interchange: NM/b for B, NM/b for D.
When should we interchange?
– When N/b + NM - 2NM/b > 0
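A quick check (a worked instance, not on the original slide): the condition rearranges to NM(1 - 2/b) + N/b > 0, which holds for any block size b >= 2, so for this nest the interchange always pays off. With b = 4, for example, roughly NM + N/4 misses drop to NM/2.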

20 Loop interchange
Determine which loop should be innermost
– Strive to increase locality
Heuristic approach
– Compute a cost function for each loop
– Order the loops: cheapest innermost, most expensive outermost

21 Cost assignment
Cost 1 for references that do not depend on the loop's induction variable.
Cost N for references that use the induction variable to walk a non-contiguous space (N being the loop trip count).
Cost N/b for references that use the induction variable to walk a contiguous space.
Multiply a reference's cost by a loop's trip count if the reference varies with that loop's index.
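A worked application to the slide-19 nest (not on the original slide): with I innermost, B(I,J) walks contiguous memory (cost N/b) and varies with J (times M), while D(I) costs N/b and is invariant in J, for a total of NM/b + N/b. With J innermost, B(I,J) is non-contiguous (cost M) and varies with I (times N), while D(I) is invariant in J (cost 1) but varies with I (times N), for a total of NM + N. The first score is smaller, so the heuristic places I innermost, matching the interchange chosen on slide 19.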

22 Loop interchange (cont.)
Special notes
– Avoid over-counting references
  Don't count references that are already satisfied by temporal reuse (available from previous iterations).
  References may also share a cache block with other references in the same iteration.
– Not all loop orders are possible, due to data dependences
  Find the permutation that is both legal and has the minimal score.

23 Blocking

24 Back to the example
DO J = 1, M
  DO I = 1, N
    D(I) = D(I) + B(I,J)
  ENDDO
ENDDO
2NM/b misses
– D, at N elements, is too large to stay cached between J iterations, so it misses NM/b times just like B.

25 Example, visually
DO J = 1, M
  DO I = 1, N
    D(I) = D(I) + B(I,J)
  ENDDO
ENDDO
Figure: D and B with the cache-block size marked; by the time one column of B has been processed, the blocks of D have been evicted, so D misses again on the next column.

26 Back to the example (cont.)
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2,J)
    ENDDO
  ENDDO
ENDDO
Work on smaller strips (of size S), chosen so that a strip of D fits in the cache.

27 Example, visually
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2,J)
    ENDDO
  ENDDO
ENDDO
Figure: D and B with the cache-block size and the second strip marked; each strip of D stays resident while all M columns of B are processed against it.

28 Analysis
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2,J)
    ENDDO
  ENDDO
ENDDO
The cost of B does not change: NM/b.
The cost of D drops thanks to reuse: N/b
– No misses on D during the iterations over J.
Conclusion: (1 + 1/M)NM/b misses.
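Spelling the count out (a derivation consistent with the slide's totals): each strip covers S elements of D, i.e. S/b blocks, which miss once and then stay resident for all M iterations of J; over the N/S strips that is (N/S)(S/b) = N/b misses for D. B still misses on every new block, NM/b in total, so the sum is NM/b + N/b = (1 + 1/M)NM/b.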

29 Unaligned data
What if B is not aligned on a cache-block boundary?
At most one additional miss for each sub-column iteration
– NM/S additional misses in total.
Conclusion: (1 + 1/M + b/S)NM/b misses.
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2,J)
    ENDDO
  ENDDO
ENDDO

30 Unaligned data, visually
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2,J)
    ENDDO
  ENDDO
ENDDO
Figure: D and B with the strip size and cache alignment marked; a strip's first block of B straddles a block boundary, costing the additional miss.

31 Unaligned data (cont.)
What can be done?
– Enforce data alignment.
– Refine the loop-interchange score to include these misses as well.
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2,J)
    ENDDO
  ENDDO
ENDDO

32 Blocking: legality
Splitting into strips is always legal; the interchange is not.
procedure StripMineAndInterchange(L, m, k, o, S)
// L = {L1, L2, ..., Lm} is the loop nest to be transformed
// Lk is the loop to be strip-mined
// Lo is the outer loop which is to be just inside the by-strip loop after interchange
// S is the variable to use as strip size; its value must be positive
  let the header of Lk be DO I = L, N, D;
  split the loop into two loops, a by-strip loop:
    DO I = L, N, S*D
  and a within-strip loop:
    DO i = I, MIN(I+S*D-D, N), D
  around the loop body;
  interchange the by-strip loop to the position just outside of Lo;
end StripMineAndInterchange

33 Blocking: a harder example
DO I = 1, N
  DO J = 1, M
    A(J+1) = (A(J) + A(J+1))/2
  ENDDO
ENDDO
Because of the dependences, the loops cannot be interchanged.
Figure: the statement instances and the dependences between them across the iteration space.

34 Blocking: a closer look
DO I = 1, N
  DO J = 1, M
    A(J+1) = (A(J) + A(J+1))/2
  ENDDO
ENDDO
A block cannot be computed on its own because of the dependences, yet the loop as written performs badly because of low cache reuse.

35 The harder example: skew it
DO I = 1, N
  DO j = I, M+I-1
    A(j-I+2) = (A(j-I+1) + A(j-I+2))/2
  ENDDO
ENDDO
The inner loop is skewed by I (j = J+I-1): its range is 1..M for I=1, 2..M+1 for I=2, 3..M+2 for I=3, and so on.

36 The harder example: strip it
DO I = 1, N
  DO j = I, M+I-1, S
    DO jj = j, MIN(j+S-1, M+I-1)
      A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
    ENDDO
  ENDDO
ENDDO

37 The harder example: interchange the loops
DO j = 1, M+N-1, S
  DO I = MAX(1, j-M+1), MIN(j, N)
    DO jj = j, MIN(j+S-1, M+I-1)
      A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
    ENDDO
  ENDDO
ENDDO

38 The harder example: comparison
Before the transformation: NM/b misses. After: (M+N)*(1/b + 1/S) misses.

39 Triangular blocking
DO I = 2, N
  DO J = 1, I-1
    A(I, J) = A(I, I) + A(J, J)
  ENDDO
ENDDO
The iteration space is triangular: as I runs over 2, 3, 4, ..., J runs over 1, 1..2, 1..3, and so on.

40 Triangular: strip it
DO I = 2, N, K
  DO ii = I, MIN(I+K-1, N)
    DO J = 1, ii-1
      A(ii, J) = A(ii, ii) + A(J, J)
    ENDDO
  ENDDO
ENDDO
K-size strips; nothing important has changed yet.

41 Triangular: transform!
DO I = 2, N, K
  DO J = 1, I+K-2
    DO ii = MAX(J+1, I), MIN(I+K-1, N)
      A(ii, J) = A(ii, ii) + A(J, J)
    ENDDO
  ENDDO
ENDDO
Triangular loop interchange: we still work on the K-strips, and the MAX in the ii bound preserves the correct triangular limits (J < ii).

42 Blocking with parallelization
The dimension of parallelism may coincide with the dimension of sequential (stride-one) access.
– Solution: if multiple parallelization dimensions are available, avoid the stride-one dimension.
False sharing (see the sketch below)
– Data used by different processors sits on the same cache line, even though it is not the exact same data.
– Solution: language extensions for expressing the division of data among processors, with the memory layout arranged accordingly.
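A minimal sketch of false sharing and the layout fix (the thread count, iteration count, and assumed 64-byte line are illustrative; compile with an OpenMP-enabled compiler, e.g. gfortran -fopenmp):

program false_sharing
  implicit none
  integer, parameter :: nthreads = 4
  integer, parameter :: pad = 16            ! 16 reals = 64 bytes, one assumed cache line
  real :: packed(nthreads)                  ! all counters share one cache line
  real :: padded(pad, nthreads)             ! column-major: each counter gets its own line
  integer :: t, i
  packed = 0.0
  padded = 0.0
  !$omp parallel do private(i)
  do t = 1, nthreads
     do i = 1, 1000000
        packed(t) = packed(t) + 1.0         ! false sharing: the line ping-pongs between cores
     end do
  end do
  !$omp end parallel do
  !$omp parallel do private(i)
  do t = 1, nthreads
     do i = 1, 1000000
        padded(1, t) = padded(1, t) + 1.0   ! same work, but each thread owns its line
     end do
  end do
  !$omp end parallel do
  print *, sum(packed), sum(padded(1, :))
end program false_sharing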

43 Prefetch

44 “…And I don’t want to miss a thing…” (Aerosmith, ’98 optimization seminar)
Problematic misses:
– Data used for the first time
– Data reused in ways that cannot be predicted at compile time
DO I = 1, N
  A(I) = B(LOC(I))
ENDDO

45 Prefetch
Brings a line into the cache ahead of its use.
Typically does not cause a stall
– The line is loaded in parallel with continued execution.
Introduced by the programmer or the compiler.

46 Prefetch
Advantages
– Miss latencies can be hidden
  Assuming the prefetch can be issued far enough in advance
  Assuming the cache is large enough
Disadvantages
– The number of instructions to execute increases
– Useful data already in the cache may be evicted prematurely
– Data brought in by a prefetch might itself be evicted before it is used

47 Minimizing the disadvantages
The number of added prefetches must be close to the number actually needed.
Prefetches should not arrive “too early”.
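As a rule of thumb (illustrative numbers, not from the slides): if a miss costs about 100 cycles and one iteration of the loop body takes about 25 cycles, the prefetch for iteration i should be issued roughly ceil(100/25) = 4 iterations ahead of i; issuing it much earlier raises the risk that the line is evicted again before it is used.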

48 Identify prefetch opportunities
DO J=1, M
  DO I=1, 32
    A(I+1,J) = A(I,J) + C(J)
  ENDDO
ENDDO
If the group generator is not contained in a dependence cycle, a miss is expected on each iteration, unless references to the generator on subsequent iterations display temporal locality.
The generator misses on every new cache line, so insert a prefetch before references to generators.
(Here A(I,J) reads what A(I+1,J) wrote on the previous iteration, a RAW dependence; A(I+1,J) is the group generator.)

49 Acyclic name partitioning
Two cases:
– References to the generator do not iterate sequentially within the loop:
DO I=1, 32
  DO J=1, M
    A(I+1,J) = A(I,J) + C(J)
  ENDDO
ENDDO
– References have spatial locality within the loop:
DO J=1, M
  DO I=1, 32
    A(I+1,J) = A(I,J) + C(J)
  ENDDO
ENDDO

50 Acyclic name partitioning
Case I: references to the generator do not iterate sequentially within the loop
– Insert a prefetch before each reference to the generator.
– The final positioning of the prefetches is determined by the instruction scheduler.
DO I=1, 32
  DO J=1, M
    prefetch(A(I+1,J))
    A(I+1,J) = A(I,J) + C(J)
  ENDDO
ENDDO

51 Acyclic name partitioning
Case II: references have spatial locality within the loop
– Determine i0, the first iteration after the initial one that misses on the access to the generator.
– Determine the iteration delta between cache misses.
DO J=1, M
  DO I=1, 32
    A(I+1,J) = A(I,J) + C(J)
  ENDDO
ENDDO
The loop is then split into a pre-loop and a main loop, as the next slides show.

52 Acyclic with spatial reuse
Partition the loop into two parts:
– an initial subloop running from 1 to i0-1
– the remainder running from i0 to the end
In the example, i0 = 4.
Before:
DO I = 1, M
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
After:
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO

53 Acyclic with spatial reuse
Strip-mine the second loop into subloops of length delta. In the example, delta = 4:
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M, 4
  IU = MIN(M, I+3)
  DO ii = I, IU
    A(ii, J) = A(ii, J) + A(ii-1, J)
  ENDDO
ENDDO

54 Acyclic with spatial reuse
Insert a prefetch before the initial loop, and one at the top of each strip of the main loop:
prefetch(A(0,J))
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M, 4
  IU = MIN(M, I+3)
  prefetch(A(I, J))
  DO ii = I, IU
    A(ii, J) = A(ii, J) + A(ii-1, J)
  ENDDO
ENDDO

55 Acyclic with spatial reuse (summary)
Original:
DO I = 1, M
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
Split, strip-mined, and prefetched:
prefetch(A(0,J))
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M, 4
  IU = MIN(M, I+3)
  prefetch(A(I, J))
  DO ii = I, IU
    A(ii, J) = A(ii, J) + A(ii-1, J)
  ENDDO
ENDDO

56 Identify prefetch opportunities
DO J=1, M
  DO I=1, 32
    A(I+1,J) = A(I,J) + C(J)
  ENDDO
ENDDO
If the group generator is contained in a dependence cycle, a miss is expected only on the first few iterations of the carrying loop.
– The prefetch for the reference can be placed before the loop carrying the dependence.
(Here C(J) carries an input dependence around the I loop: the same word is re-read on every iteration.)

57 Putting it all together
Rearrange the loop nest so that the loop iterating sequentially over cache lines is innermost.
Split the innermost loop in two:
– a pre-loop up to the first iteration containing a generator reference that begins on a new cache line, and
– a main loop that begins with that iteration.
Insert the prefetches, as previously explained.

58 Example
Original:
DO J = 1, M
  DO I = 2, 33
    A(I, J) = A(I, J) * B(I)
  ENDDO
ENDDO
Transformed (B is prefetched once before the nest; A is prefetched per strip, with the main loop unrolled by the cache-line length of 4):
prefetch(B(2))
DO I = 5, 33, 4
  prefetch(B(I))
ENDDO
DO J = 1, M
  prefetch(A(2,J))
  DO I = 2, 4
    A(I, J) = A(I, J) * B(I)
  ENDDO
  DO I = 5, 32, 4
    prefetch(A(I, J))
    A(I, J) = A(I, J) * B(I)
    A(I+1, J) = A(I+1, J) * B(I+1)
    A(I+2, J) = A(I+2, J) * B(I+2)
    A(I+3, J) = A(I+3, J) * B(I+3)
  ENDDO
  prefetch(A(33, J))
  A(33, J) = A(33, J) * B(33)
ENDDO

59 Effectiveness of prefetching

60 What did we miss?

61 “…Sometimes is never quite enough…” (Alanis Morissette, ’95 optimization seminar)
Static analysis is often ineffective because information is missing:
– which accesses miss the cache at run time
– the addresses that miss

62 Profiling-based optimization
Dynamic optimization systems
– Collect information dynamically
– Optimize according to the profile
The collected information can drive re-compilation, optimizing accordingly, or run-time optimization (modifying the code at run time to insert prefetches).
– Example: ADORE
