
Optimizing compilers: Managing Cache. Bercovici Sivan.


1 Optimizing compilers: Managing Cache. Bercovici Sivan

2 Overview
Motivation
Cache structure
Important observations
Techniques covered in the book:
– Loop interchange
– Blocking
– Unaligned data
– Prefetching

3 Overview (cont.)
Issues and techniques not covered in the book:
– Instruction cache
– Dynamic, profiling-driven cache optimization

4 Motivation
Shorten fetch time.
Processor-DRAM performance gap: grows ~50% per year.
[Figure: CPU vs. DRAM performance over time]

5 Motivation (cont.)
Solution: cache, a faster memory.
The software problem: maximize cache performance.

6 Memory structure
Hierarchical. Each level is larger and slower than the one above (capacity grows, access time lengthens):
– Registers: 100s of bytes, <10s of ns
– Cache: KBytes, ~10s of ns (exchanged with memory in blocks)
– Memory: MBytes, 100 ns-1 us (exchanged with disk in pages)
– Disk

7 Cache structure
Specialized:
– Instruction cache
– Data cache
– What about a stack cache?

8 Cache structure (cont.)
Organized into blocks (lines) of multiple machine words.
Maps the entire memory.
Most caches use an LRU replacement strategy.
An address is split into tag, line, and offset fields; the tag (the block number) is compared against the tag array, and a match signals a hit in the data array.
[Figure: address fields indexing the tag and data arrays; tag comparison produces the hit signal]
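The field split can be sketched in code. A minimal illustration (mine, not from the slides), assuming 16-byte lines and 256 cache lines; the field widths are my own choice:

```python
# Decode an address into (tag, line, offset) fields.
# Assumed geometry: 16-byte lines (4 offset bits), 256 lines (8 index bits).
LINE_BYTES = 16
NUM_LINES = 256

def split_address(addr):
    offset = addr % LINE_BYTES                  # position within the line
    line = (addr // LINE_BYTES) % NUM_LINES     # which line of the cache
    tag = addr // (LINE_BYTES * NUM_LINES)      # compared against the tag array
    return tag, line, offset

# Addresses in the same 16-byte block share both tag and line index,
# which is exactly why spatial locality pays off.
assert split_address(0x10) == (0, 1, 0)
```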

9 Observations
Temporal locality: if a variable is referenced, it tends to be referenced again soon.
Spatial locality: if a variable is referenced, nearby variables tend to be referenced soon.

10 Observations (cont.)
Temporal locality example: variables used inside a loop.
Spatial locality example: iterating over array items.

11 How the cache exploits these observations
Temporal locality: the cache attempts to keep recently accessed data.
Spatial locality: the cache brings whole blocks of data from memory.

12 Loop interchange

13 Example
DO I = 1, M
  DO J = 1, N
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
I iterates over rows, J over columns.
Fortran arrays are column-major: the first column is stored first, then the second, and so on. C is the other way around (row-major).
Here there is no spatial reuse.

14 Example, visually
[Figure: cache mapping of A and B; with J innermost, each access lands in a different block and misses.]

15 Example analysis
DO I = 1, M
  DO J = 1, N
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
2*N*M misses, because the innermost loop iterates over the non-contiguous dimension.

16 Example fixed (loop interchange)
DO J = 1, N
  DO I = 1, M
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
We now process column by column, gaining spatial reuse.

17 Example, visually
[Figure: cache mapping of A and B; with I innermost, consecutive accesses fall in the same block.]

18 Analyzing the fixed example
DO J = 1, N
  DO I = 1, M
    A(I, J) = A(I, J) + B(I, J)
  ENDDO
ENDDO
2*N*M/b misses, where b is the cache-block size in array elements.
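The effect of interchange can be seen with a toy cache model. This is my own illustration, not the book's: a small LRU cache of blocks over a column-major array, counting misses for each loop order.

```python
from collections import OrderedDict

def count_misses(j_innermost, M=64, N=64, b=8, cache_blocks=16):
    """Count misses touching A(I,J) for all I,J, column-major storage."""
    cache = OrderedDict()   # resident block numbers, in LRU order
    miss = 0
    if j_innermost:
        order = ((i, j) for i in range(M) for j in range(N))
    else:
        order = ((i, j) for j in range(N) for i in range(M))
    for i, j in order:
        block = (j * M + i) // b      # column-major linearized address
        if block in cache:
            cache.move_to_end(block)
        else:
            miss += 1
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict least recently used
    return miss

# J innermost strides across columns: every access is a new block.
# I innermost walks down a column: one miss per block, N*M/b in total.
assert count_misses(j_innermost=True) == 64 * 64
assert count_misses(j_innermost=False) == 64 * 64 // 8
```

With the bad order the cache is too small to keep a row's blocks alive between visits, so every access misses; with the good order each block is missed exactly once.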

19 A harder example
DO I = 1, N
  DO J = 1, M
    D(I) = D(I) + B(I, J)
  ENDDO
ENDDO
Misses: NM for B, N/b for D.
After interchange: NM/b for B, NM/b for D.
When should we interchange? When N/b + NM - 2NM/b > 0.
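Plugging numbers into this condition (a quick check of my own, using the slide's miss counts):

```python
def misses_before(N, M, b):
    return N * M + N / b      # NM for B (non-contiguous inner loop) + N/b for D

def misses_after(N, M, b):
    return 2 * N * M / b      # both arrays stride-1 after interchange

def should_interchange(N, M, b):
    return misses_before(N, M, b) - misses_after(N, M, b) > 0

# For any realistic block size interchange is a clear win:
assert should_interchange(N=512, M=512, b=8)
# With b = 1 there is no spatial locality to exploit, and it is not:
assert not should_interchange(N=512, M=512, b=1)
```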

20 Loop interchange
Determine which loop should be innermost, striving to increase locality.
Heuristic approach:
– Compute a cost function for each loop.
– Order the loops: cheapest loop innermost, most expensive outermost.

21 Cost assignment
Cost is 1 for references that do not depend on the loop induction variable.
Cost is N for references based on the induction variable over a non-contiguous space.
Cost is N/b for references based on the induction variable over a contiguous space.
Multiply the cost by the loop trip count if the reference varies with the loop index.
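A sketch of how the ordering might be computed (my own formulation of the slide's rules; the stride labels and function names are assumptions):

```python
def reference_cost(stride_kind, trip, b=8):
    """Cost of one reference if the given loop were innermost."""
    if stride_kind == "invariant":
        return 1                # does not depend on the induction variable
    if stride_kind == "unit":
        return trip / b         # contiguous: one miss per cache block
    return trip                 # non-contiguous: one miss per iteration

def order_loops(loops, refs, trips, b=8):
    """Cheapest loop innermost; refs: per-reference {loop: stride_kind}."""
    cost = {L: sum(reference_cost(r.get(L, "invariant"), trips[L], b)
                   for r in refs)
            for L in loops}
    return sorted(loops, key=lambda L: cost[L])   # innermost first

# A(I,J) + B(I,J) in Fortran: I is the contiguous dimension, J is not.
refs = [{"I": "unit", "J": "nonunit"}, {"I": "unit", "J": "nonunit"}]
assert order_loops(["I", "J"], refs, {"I": 100, "J": 100})[0] == "I"
```

The real heuristic must then check each candidate permutation for legality against the data dependences, as the next slide notes.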

22 Loop interchange (cont.)
Special notes:
– Avoid over-counting references:
  Don't count references that are available due to temporal reuse (available to later iterations).
  References can share a cache block with other references in the same iteration.
– Not all loop orders are legal, due to data dependences: find the permutation that is both legal and has the lowest score.

23 Blocking

24 Back to the example
DO J = 1, N
  DO I = 1, M
    D(I) = D(I) + B(I, J)
  ENDDO
ENDDO
2NM/b misses.

25 Example, visually
DO J = 1, N
  DO I = 1, M
    D(I) = D(I) + B(I, J)
  ENDDO
ENDDO
[Figure: D and B laid against the cache blocks; D is evicted between column passes and misses again.]

26 Back to the example (cont.)
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2, J)
    ENDDO
  ENDDO
ENDDO
Work on smaller strips of size S.

27 Example, visually
[Figure: the same mapping with strips marked; the second strip of D stays resident across the whole J loop.]

28 Analysis
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2, J)
    ENDDO
  ENDDO
ENDDO
The cost of B does not change: NM/b.
The cost of D drops due to reuse: N/b, with no misses while iterating over J.
Total: (1 + 1/M) * NM/b.
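Strip mining only reorders the iterations, so the result is unchanged. A small equivalence check (my own, in Python for brevity):

```python
def plain(D, B, N, M):
    for j in range(M):                 # DO J = 1, M
        for i in range(N):             # DO I = 1, N
            D[i] += B[i][j]

def strip_mined(D, B, N, M, S=4):
    for i0 in range(0, N, S):          # by-strip loop, step S
        for j in range(M):
            for i in range(i0, min(i0 + S, N)):   # within-strip loop
                D[i] += B[i][j]

N, M = 10, 7
B = [[3 * i + j for j in range(M)] for i in range(N)]
D1, D2 = [0] * N, [0] * N
plain(D1, B, N, M)
strip_mined(D2, B, N, M)
assert D1 == D2    # same sums, better locality for D
```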

29 Unaligned data
What if B is not aligned on a cache-block boundary?
At most one additional miss per sub-column iteration: an additional NM/S.
Total: (1 + 1/M + b/S) * NM/b.
DO I = 1, N, S
  DO J = 1, M
    DO i2 = I, MIN(I+S-1, N)
      D(i2) = D(i2) + B(i2, J)
    ENDDO
  ENDDO
ENDDO

30 Unaligned data, visually
[Figure: strips of B straddling cache-block boundaries; each strip incurs one additional miss.]

31 Unaligned data (cont.)
What can be done?
– Enforce data alignment.
– Refine the loop-interchange score to include these misses as well.

32 Blocking: legality
Splitting into strips is always legal; the interchange is not.
procedure StripMineAndInterchange(L, m, k, o, S)
  // L = {L1, L2, ..., Lm} is the loop nest to be transformed
  // Lk is the loop to be strip-mined
  // Lo is the outer loop which is to be just inside the by-strip loop after interchange
  // S is the variable to use as strip size; its value must be positive
  let the header of Lk be DO I = L, N, D;
  split the loop into two loops,
    a by-strip loop: DO I = L, N, S*D
    and a within-strip loop: DO i = I, MIN(I+S*D-D, N), D around the loop body;
  interchange the by-strip loop to the position just outside of Lo;
end StripMineAndInterchange

33 Blocking: a harder example
DO I = 1, N
  DO J = 1, M
    A(J+1) = (A(J) + A(J+1)) / 2
  ENDDO
ENDDO
Due to the dependences, the loops cannot be interchanged.
[Figure: the statements and the dependences between them]

34 Blocking: a closer look
DO I = 1, N
  DO J = 1, M
    A(J+1) = (A(J) + A(J+1)) / 2
  ENDDO
ENDDO
The interchange is illegal due to the dependences, and performance is bad due to low cache reuse.

35 Harder example: skew it
DO I = 1, N
  DO j = I, M+I-1
    A(j-I+2) = (A(j-I+1) + A(j-I+2)) / 2
  ENDDO
ENDDO
For I = 1, 2, 3, ... the inner loop runs over j = 1..M, 2..M+1, 3..M+2, ...

36 Harder example: strip it
DO I = 1, N
  DO j = I, M+I-1, S
    DO jj = j, MIN(j+S-1, M+I-1)
      A(jj-I+2) = (A(jj-I+1) + A(jj-I+2)) / 2
    ENDDO
  ENDDO
ENDDO

37 Harder example: interchange the loops
DO j = 1, M+N-1, S
  DO I = MAX(1, j-M+1), MIN(j+S-1, N)
    DO jj = MAX(j, I), MIN(j+S-1, M+I-1)
      A(jj-I+2) = (A(jj-I+1) + A(jj-I+2)) / 2
    ENDDO
  ENDDO
ENDDO

38 Harder example: comparison
Before: NM/b misses. After: (M+N) * (1/b + 1/S) misses.
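Because every dependence still runs forward in the new order, the transformed nest computes exactly the same values. A check of my own, with the loop bounds spelled out as I derived them (jj starting at MAX(j, I) and I bounded by MIN(j+S-1, N)):

```python
def smooth(A, N, M):
    # Original: DO I = 1, N; DO J = 1, M; A(J+1) = (A(J) + A(J+1)) / 2
    for I in range(1, N + 1):
        for J in range(1, M + 1):
            A[J + 1] = (A[J] + A[J + 1]) / 2

def smooth_blocked(A, N, M, S=2):
    # Skewed (jj = J + I - 1), strip-mined on jj, and interchanged.
    for j in range(1, M + N, S):                       # by-strip loop
        for I in range(max(1, j - M + 1), min(j + S - 1, N) + 1):
            for jj in range(max(j, I), min(j + S - 1, M + I - 1) + 1):
                J = jj - I + 1
                A[J + 1] = (A[J] + A[J + 1]) / 2

N, M = 4, 5
A1 = [float(k) for k in range(M + 2)]
A2 = list(A1)
smooth(A1, N, M)
smooth_blocked(A2, N, M)
assert A1 == A2   # every read still sees the same last write
```

Equality is exact, not approximate: a legal reordering leaves every read paired with the same producing write, so the floating-point operations are identical.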

39 Triangular blocking
DO I = 2, N
  DO J = 1, I-1
    A(I, J) = A(I, I) + A(J, J)
  ENDDO
ENDDO
The iteration space is triangular: row I touches columns 1..I-1.

40 Triangular: strip it
DO I = 2, N, K
  DO ii = I, I+K-1
    DO J = 1, ii-1
      A(ii, J) = A(ii, ii) + A(J, J)
    ENDDO
  ENDDO
ENDDO
K-size strips. Nothing important has changed yet.

41 Triangular: transform!
DO I = 2, N, K
  DO J = 1, I+K-2
    DO ii = MAX(J+1, I), I+K-1
      A(ii, J) = A(ii, ii) + A(J, J)
    ENDDO
  ENDDO
ENDDO
A triangular loop interchange: we work on the K-strips while preserving the correct triangular loop limits.
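The blocked form covers exactly the same triangle of (row, column) pairs. A check of my own; I add a MIN(I+K-1, N) guard on the innermost bound so that K need not divide the range (that guard is my addition):

```python
def tri(A, N):
    # DO I = 2, N; DO J = 1, I-1: A(I,J) = A(I,I) + A(J,J)
    for I in range(2, N + 1):
        for J in range(1, I):
            A[I][J] = A[I][I] + A[J][J]

def tri_blocked(A, N, K=3):
    for I in range(2, N + 1, K):                       # K-size strips
        for J in range(1, I + K - 1):
            for ii in range(max(J + 1, I), min(I + K - 1, N) + 1):
                A[ii][J] = A[ii][ii] + A[J][J]

N = 7
A1 = [[10 * i + j for j in range(N + 1)] for i in range(N + 1)]
A2 = [row[:] for row in A1]
tri(A1, N)
tri_blocked(A2, N)
# The diagonal is only read, never written, so any cover of the
# triangle in any order gives the same result.
assert A1 == A2
```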

42 Blocking with parallelization
If the dimension of parallelism is also the dimension of sequential (stride-one) access, blocking suffers.
– Solution: if multiple parallelization dimensions are available, avoid the stride-one dimension.
False sharing: data used by different processors sits on the same cache line, though it is not the exact same data.
– Solution: language extensions expressing the division of data among processors, with the memory layout following it.

43 Prefetch

44 "...And I don't want to miss a thing..." (Aerosmith, '98 Optimization Seminar)
Problematic misses:
– Data used for the first time
– Data reused in ways that cannot be predicted at compile time
DO I = 1, N
  A(I) = B(LOC(I))
ENDDO

45 Prefetch
Brings a line into the cache.
Typically does not cause a stall: the line is loaded in parallel with continued execution.
Introduced by the programmer or the compiler.

46 Prefetch
Advantages:
– Miss latencies can be avoided, assuming the prefetch can be introduced far enough ahead and the cache is large enough.
Disadvantages:
– The number of instructions to execute increases.
– Useful data inside the cache may be evicted prematurely.
– Data brought in by a prefetch might itself be evicted before use.

47 Minimizing the disadvantages
The number of added prefetches must be close to what is actually needed.
Prefetches should not arrive too early.

48 Identify prefetch opportunities
DO J = 1, M
  DO I = 1, 32
    A(I+1, J) = A(I, J) + C(J)
  ENDDO
ENDDO
When the group generator is not contained in a dependence cycle (here, the RAW dependence from A(I+1,J) to A(I,J)), a miss is expected on each iteration, unless references to the generator on subsequent iterations display temporal locality.
The generator misses on every new cache line, so use prefetches before the references to generators.

49 Acyclic name partitioning
Two cases:
– Case I: references to the generator do not iterate sequentially within the loop:
DO I = 1, 32
  DO J = 1, M
    A(I+1, J) = A(I, J) + C(J)
  ENDDO
ENDDO
– Case II: references have spatial locality within the loop:
DO J = 1, M
  DO I = 1, 32
    A(I+1, J) = A(I, J) + C(J)
  ENDDO
ENDDO

50 Acyclic name partitioning
Case I: references to the generator do not iterate sequentially within the loop. Insert a prefetch before each reference to the generator; the final positioning of the prefetches is determined by the instruction scheduler.
DO I = 1, 32
  DO J = 1, M
    prefetch(A(I+1, J))
    A(I+1, J) = A(I, J) + C(J)
  ENDDO
ENDDO

51 Acyclic name partitioning
Case II: references have spatial locality within the loop.
– Determine i0, the first iteration after the initial iteration that causes a miss on the access to the generator.
– Determine delta, the iteration distance between misses in the cache.
The loop is then split into a pre-loop and a main loop:
DO J = 1, M
  DO I = 1, 32
    A(I+1, J) = A(I, J) + C(J)
  ENDDO
ENDDO

52 Acyclic with spatial reuse
Partition the loop into two parts: an initial subloop running from 1 to i0-1 and a remainder running from i0 to the end. In the example, i0 = 4.
Before:
DO I = 1, M
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
After:
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO

53 Acyclic with spatial reuse
Strip-mine the second loop into subloops of length delta. In the example, delta = 4.
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M, 4
  IU = MIN(M, I+3)
  DO ii = I, IU
    A(ii, J) = A(ii, J) + A(ii-1, J)
  ENDDO
ENDDO

54 Acyclic with spatial reuse
Insert a prefetch before the initial loop and prefetches before the inner loop:
prefetch(A(0, J))
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M, 4
  IU = MIN(M, I+3)
  prefetch(A(I, J))
  DO ii = I, IU
    A(ii, J) = A(ii, J) + A(ii-1, J)
  ENDDO
ENDDO

55 Acyclic with spatial reuse
The complete transformation, next to the original loop.
Original:
DO I = 1, M
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
Transformed:
prefetch(A(0, J))
DO I = 1, 3
  A(I, J) = A(I, J) + A(I-1, J)
ENDDO
DO I = 4, M, 4
  IU = MIN(M, I+3)
  prefetch(A(I, J))
  DO ii = I, IU
    A(ii, J) = A(ii, J) + A(ii-1, J)
  ENDDO
ENDDO
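The i0/delta bookkeeping behind this split can be sketched as a small planner. This is entirely my own illustration; the function and label names are invented:

```python
def plan_prefetches(first, last, align, b=4):
    """Iterations first..last touch consecutive array elements.
    align = offset (in elements) of iteration `first` within its cache
    block.  Returns i0, delta, and the planned prefetch points."""
    delta = b                              # misses recur once per block
    i0 = first + (b - align) % b           # first iteration on a new block
    if i0 == first:
        i0 = first + b                     # "after the initial iteration"
    plan = [("before_preloop", first)]
    plan += [("in_main_loop", i) for i in range(i0, last + 1, delta)]
    return i0, delta, plan

# The slides' example: I = 1..M reads A(I-1,J); A(0,J) opens a block of
# b = 4 elements, so iteration 1 sits at offset 1 and i0 = 4, delta = 4.
i0, delta, plan = plan_prefetches(first=1, last=16, align=1)
assert (i0, delta) == (4, 4)
assert [i for where, i in plan if where == "in_main_loop"] == [4, 8, 12, 16]
```

The planner reproduces the slide's shape: one prefetch ahead of the pre-loop, then one per delta-sized strip of the main loop.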

56 Identify prefetch opportunities
DO J = 1, M
  DO I = 1, 32
    A(I+1, J) = A(I, J) + C(J)
  ENDDO
ENDDO
When the group generator is contained in a dependence cycle (here, an input dependence), a miss is expected only on the first few iterations of the carrying loop, so the prefetch for the reference can be placed before the loop carrying the dependence.

57 Put it all together
Rearrange the loop nest so that the loop iterating sequentially over cache lines is innermost.
Split the innermost loop in two:
– a pre-loop up to the first iteration containing a generator reference that begins a new cache line, and
– a main loop that begins with the iteration containing that new-cache-line reference.
Insert the prefetches, as previously explained.

58 Example
Before:
DO J = 1, M
  DO I = 2, 33
    A(I, J) = A(I, J) * B(I)
  ENDDO
ENDDO
After:
prefetch(B(2))
DO I = 5, 33, 4
  prefetch(B(I))
ENDDO
DO J = 1, M
  prefetch(A(2, J))
  DO I = 2, 4
    A(I, J) = A(I, J) * B(I)
  ENDDO
  DO I = 5, 29, 4
    prefetch(A(I, J))
    A(I, J) = A(I, J) * B(I)
    A(I+1, J) = A(I+1, J) * B(I+1)
    A(I+2, J) = A(I+2, J) * B(I+2)
    A(I+3, J) = A(I+3, J) * B(I+3)
  ENDDO
  prefetch(A(33, J))
  A(33, J) = A(33, J) * B(33)
ENDDO

59 Effectiveness of prefetching

60 What did we miss?

61 "...Sometimes is never quite enough..." (Alanis Morissette, '95 Optimization Seminar)
Static analysis is often ineffective because information is missing:
– Run-time cache misses
– Miss address information

62 Profiling-based optimization
Dynamic optimization systems collect information at run time and optimize according to the profile:
– Use the collected information for re-compilation, optimizing accordingly.
– Use the collected information to perform run-time optimization (modifying code at run time to prefetch).
Example: ADORE.


