Using the Compiler to Improve Cache Replacement Decisions Zhenlin Wang, UMass Amherst Kathryn S. McKinley, UT Austin Arnold L. Rosenberg, UMass Amherst.


1 Using the Compiler to Improve Cache Replacement Decisions Zhenlin Wang, UMass Amherst Kathryn S. McKinley, UT Austin Arnold L. Rosenberg, UMass Amherst Charles C. Weems, UMass Amherst

2 Improving Cache Replacement Decisions: Motivation and Background
- LRU is not always effective
- Optimal cache replacement must peek into the future
- Compiler locality analysis determines data access patterns for numeric applications
- Cache line tag bit(s) and an ISA extension control cache replacement explicitly
- Replacement logic augments LRU with compiler hints

3 LRU vs. Compiler Control

    SUBROUTINE TEST(N)
    INTEGER A[N],B[N],C[N]
    DO I = 1,N
      C[I] = A[I] + B[I]
    ENDDO
    DO I = 1,N
      A[I] = C[I] * 5
    ENDDO
    END

Diagram: 2-way cache with LRU (N=128), showing Set 1, Set 2, Set 3, ..., Set 128 holding elements of A, B, and C (A[1], B[1], C[1] in Set 1 through A[128], B[128], C[128] in Set 128).
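
The LRU behavior on this example can be traced with a small simulation. This is a minimal sketch, not the authors' simulator. It assumes one word per cache line, arrays laid out back to back so that A[I], B[I], and C[I] fall in the same set, write-allocate on stores, and the access order read A, read B, write C in the first loop. The names `slide_trace` and `lru_misses` are made up for illustration.

```python
from collections import OrderedDict

N = 128          # array length from the slide
NUM_SETS = 128   # number of cache sets
WAYS = 2         # 2-way set associative

# Assumed layout: one word per line, arrays back to back,
# so A[i], B[i], and C[i] all map to set i.
A, B, C = 0, N, 2 * N

def slide_trace(n=N):
    """Word addresses touched by the two loops, in program order."""
    trace = []
    for i in range(n):            # C[I] = A[I] + B[I]
        trace += [A + i, B + i, C + i]
    for i in range(n):            # A[I] = C[I] * 5
        trace += [C + i, A + i]
    return trace

def lru_misses(trace, num_sets=NUM_SETS, ways=WAYS):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]   # oldest entry first
    misses = 0
    for addr in trace:
        lines = sets[addr % num_sets]
        if addr in lines:
            lines.move_to_end(addr)        # hit: becomes most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[addr] = True
    return misses

print(lru_misses(slide_trace()))   # 512 of the 640 accesses miss
```

Under these assumptions, storing C[I] evicts A[I] in the first loop, so every A[I] misses again in the second loop even though the cache is large enough to hold both A and C.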

4 Compiler Locality Analysis

    SUBROUTINE TEST(N)
    INTEGER A[N],B[N],C[N]
    DO I = 1,N
      C[I] = A[I] + B[I]
    ENDDO
    DO I = 1,N
      A[I] = C[I] * 5
    ENDDO
    END

Locality Graph
- Nodes: A[I] at N1 [1:N:1]; B[I] at N1 [1:N:1]; C[I] at N1 [1:N:1]; C[I] at N2 [1:N:1]; A[I] at N2 [1:N:1]
- Edge labels as extracted: Cross-loop [1:N:1]; Spatial (=,<); Spatial (=,<); temporal

5 An Abstract Model
- An optimal algorithm uses exact reuse distances
  - Given trace a b c d a c d e b f a, reuse distance of a is 4
- Reuse level: a range in which the next reuse will occur
  - [i,j] < [k,l] if j < k
  - For example, a reuse level of a is [3,5] (a b c d a c d e b f a)
- We combine data dependences with loop iteration points to compute reuse levels
  - For example, (=, <) < (<, =)
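
The numbers quoted for the trace can be checked with a short script. The slide does not define how distance is counted, so this sketch reports two common measures for each access: the gap in accesses until the next use, and the number of distinct items touched in between. Reading the "4" as the gap and the bounds of [3,5] as distinct-item counts is an interpretation, not something the slide states. `forward_reuse` and `level_less` are hypothetical helper names.

```python
def forward_reuse(trace):
    """For each access, return (gap, distinct) up to the next access of the
    same item, or None if the item is never accessed again."""
    result = []
    for i, item in enumerate(trace):
        try:
            j = trace.index(item, i + 1)   # position of the next use
        except ValueError:
            result.append(None)
            continue
        gap = j - i                            # accesses until the next use
        distinct = len(set(trace[i + 1:j]))    # distinct items in between
        result.append((gap, distinct))
    return result

def level_less(x, y):
    """Ordering on reuse levels from the slide: [i,j] < [k,l] if j < k."""
    return x[1] < y[0]

trace = "a b c d a c d e b f a".split()
reuse = forward_reuse(trace)
print(reuse[0])   # first access to a:  (4, 3)
print(reuse[4])   # second access to a: (6, 5)
```

Overlapping levels such as [3,5] and [4,6] are not ordered by `level_less` in either direction, which is why a reuse level is coarser than an exact reuse distance.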

6 The Architecture: Evict-Me bit
- Inspired by the Alpha 21264 prefetch-and-evict-next and evict instructions
- Each cache line has an extra evict-me bit
  - On a replacement, choose the cache line with the evict-me bit set
  - Use LRU policy if no evict-me bits are set
  - Extend ISA with load/store instructions that set the evict-me bit
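
The replacement rule above can be sketched as a victim-selection function for one full cache set. This is an illustration only: the `Line` record, the `last_used` timestamp standing in for the hardware LRU bits, and the tie-break among several marked lines (oldest first) are assumptions that the slide does not specify.

```python
from dataclasses import dataclass

@dataclass
class Line:
    tag: int
    last_used: int          # timestamp standing in for the LRU bits (assumption)
    evict_me: bool = False  # the extra per-line bit

def choose_victim(lines):
    """Pick the line to replace in one full cache set."""
    marked = [ln for ln in lines if ln.evict_me]
    candidates = marked or lines   # fall back to plain LRU when nothing is marked
    return min(candidates, key=lambda ln: ln.last_used)

# A marked line is chosen even though the other line is older.
print(choose_victim([Line(tag=1, last_used=5), Line(tag=2, last_used=9, evict_me=True)]).tag)  # 2
# With no bits set, the least recently used line is chosen.
print(choose_victim([Line(tag=1, last_used=5), Line(tag=2, last_used=9)]).tag)                 # 1
```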

7 Heuristics for Setting Evict-me Bits
- On a replacement, evict the cache line whose evict-me bit is set; otherwise, use the LRU bits
- Compiler heuristics:
  - Set the evict-me bit if the reuse distance of a reference is greater than the cache size
    - Intuition: even a fully associative cache cannot exploit the reuse
    - Reuse levels: [1, cache size], [cache size+1, ∞]
- Volume-based heuristics
  - Its reuse crosses nests whose data volume is greater than 2*cache size
  - Or its reuse crosses nests of nesting level >= 2

8 Algorithm for Setting Evict-me Bits
- Mark evict-me bit for an array reference if
  - It has no temporal locality in its nest
  - Its reuse crosses nests whose data volume > 2*cache size
- Spatial locality is resolved by run time address calculation or loop unrolling

    Do I = 1 : N
      …A(I)…
    ENDDO

    Do I = 1 : N
      …A(I)…
      …A(I+1)…
      …A(I+2)…
      …A(I+3)…
    ENDDO

Diagram: a cache line holding A[1], A[2], A[3], with evict-me bit values 0 and 1.
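
The marking rule can be written as a small predicate. This sketch reads the two conditions as a conjunction and treats a reference that is never reused as having unbounded volume before its next reuse. That reading is an assumption, chosen because it matches the example on slide 9, where only B[I] is marked. The function name and its parameters are hypothetical.

```python
import math

def mark_evict_me(temporal_in_nest, cross_nest_volume, cache_size):
    """Decide whether an array reference should set the evict-me bit.

    temporal_in_nest:  True if the reference is reused inside its own loop nest.
    cross_nest_volume: data volume (in words) touched before a reuse in a later
                       nest, or math.inf if there is no later reuse (assumption).
    cache_size:        cache capacity in words.
    """
    if temporal_in_nest:
        return False   # the reuse is close; keep the line
    return cross_nest_volume > 2 * cache_size

# Values from the slide 9 example: cache size 256 words, nest 1 volume 384 words.
print(mark_evict_me(False, 384, 256))       # A[I] or C[I] in nest 1: False
print(mark_evict_me(False, math.inf, 256))  # B[I], never reused: True
```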

9 Evict-me: An Example

    SUBROUTINE TEST(N)
    INTEGER A[N],B[N],C[N]
    DO I = 1,N
      C[I] = A[I] + B[I]
    ENDDO
    DO I = 1,N
      A[I] = C[I] * 5
    ENDDO
    END

2-way cache with evict-me (N=128); each entry is an array element followed by its evict-me bit
    Set 1      A[1] 0      C[1] 0
    Set 2      A[2] 0      C[2] 0
    Set 3      A[3] 0      C[3] 0
    ...
    Set 128    A[128] 0    C[128] 0
    Also shown: B[1] 1

Cache size = 256 words
Nest 1 volume = 384 words < 2*256
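
The effect of marking B[I] can be reproduced with a short simulation. As a sketch under stated assumptions: one word per line, arrays back to back so that A[I], B[I], and C[I] share a set, write-allocate stores, only the B[I] load carries the hint (as in the diagram), and every access overwrites the line's evict-me bit with its own hint. `slide_trace` and `count_misses` are names invented for this example.

```python
from collections import OrderedDict

N, NUM_SETS, WAYS = 128, 128, 2
A, B, C = 0, N, 2 * N   # assumed layout: one word per line, arrays back to back

def slide_trace(mark_b):
    """(address, evict-me hint) pairs for the two loops; only B[I] is hinted."""
    trace = []
    for i in range(N):                   # C[I] = A[I] + B[I]
        trace += [(A + i, False), (B + i, mark_b), (C + i, False)]
    for i in range(N):                   # A[I] = C[I] * 5
        trace += [(C + i, False), (A + i, False)]
    return trace

def count_misses(trace):
    """Misses in a 2-way cache that prefers marked lines and falls back to LRU."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]   # tag -> evict-me bit, oldest first
    misses = 0
    for addr, hint in trace:
        lines = sets[addr % NUM_SETS]
        if addr in lines:
            lines.move_to_end(addr)      # hit: becomes most recently used
        else:
            misses += 1
            if len(lines) == WAYS:
                marked = [t for t, bit in lines.items() if bit]
                victim = marked[0] if marked else next(iter(lines))
                del lines[victim]
        lines[addr] = hint               # the access sets or clears the bit (assumption)
    return misses

print(count_misses(slide_trace(mark_b=False)))  # plain LRU: 512
print(count_misses(slide_trace(mark_b=True)))   # B[I] marked: 384
```

With the hint, storing C[I] replaces B[I] instead of A[I], so both accesses in the second loop hit and only the 384 compulsory misses of the first loop remain.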

10 Experimental Framework
- Implemented in Scale, a compiler infrastructure developed at UMass
  - Scale includes optimizations such as partial redundancy elimination, scalar replacement, value numbering, sparse conditional constant propagation, register allocation, etc.
  - Generates SPARC assembly
- Simulate the evict-me cache with URSIM
  - Out-of-order execution
  - Lockup-free cache
  - SDRAM

Tool flow: Source code -> Scale -> SPARC assembly -> native assembler/linker -> SPARC executable -> URSIM

11 Cache configurations
- Both levels are lockup-free with 8 MSHRs each

Size and associativity
               Conf. 1        Conf. 2        Conf. 3
    Level 1    8K, 2-way      32K, 2-way     64K, 4-way
    Level 2    128K, 2-way    256K, 4-way    512K, 2-way

Latencies (cycles)
               Current    5 year projection
    Level 1    2          2
    Level 2    8          20
    Memory     48-200     200-500

12 Miss reduction (level 1)

13 Miss reduction (level 2)

14 Performance Impact of Evict-me (Conf. 2)

15 Evict-me and Prefetching Combined (Conf. 3)

16 Summary
- Compiler can improve cache replacement decisions
- Evict-me algorithm seldom degrades performance
- Architectural support for evict-me is practical
- Effectiveness depends on cache configuration, data set size, and access patterns

