

Presentation on theme: "CacheMiner: Run-Time Cache Locality Exploitation on SMPs" — Presentation transcript:

1 CacheMiner: Run-Time Cache Locality Exploitation on SMPs

[Title diagram: several CPUs, each with an on-chip cache and an off-chip cache, connected through an interconnection network to shared memory.]

2 Example

Program transformations for cache locality: Tiling

    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          A[i,j] = A[i,j] + B[i,k] * C[k,j]

For a matrix multiplication of 1000 x 1000:
- Untiled, computing one row of A touches 1000 elements of A, 1000 of B, and all 1000 x 1000 of C: 1,002,000 elements in total, far too many to stay in cache.
- Tiled with 32 x 32 blocks, computing one tile touches 32 x 32 of A, 32 x 1000 of B, and 1000 x 32 of C: 65,024 elements, a working set small enough to be reused from cache.
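Below is a minimal C sketch of the tiled version, assuming a fixed tile size of 32 and square N x N matrices; the function name and the use of double are illustrative, not from the original slides.

    #define N 1000
    #define T 32   /* tile size */

    void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
        /* Iterate over T x T tiles; the inner three loops stay within one
           tile's working set, so it can be reused from cache. */
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int kk = 0; kk < N; kk += T)
                    for (int i = ii; i < ii + T && i < N; i++)
                        for (int j = jj; j < jj + T && j < N; j++)
                            for (int k = kk; k < kk + T && k < N; k++)
                                A[i][j] += B[i][k] * C[k][j];
    }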

3 But it's hard for the compiler to analyse indirect accesses...

    void myfunc(int source_arr[], int key_arr[], int result_arr[], int n)
    {
        for (int i = 0; i < n; i++) {
            result_arr[i] += source_arr[key_arr[i]];   /* indirection! */
        }
    }

The data access pattern of the function depends on the contents of key_arr[], so it cannot be determined at compile time, only at run time. CacheMiner is especially useful for such scenarios.

4 Targeted Model

    for (i1 = lower_1; i1 < upper_1; i1++)
      for (i2 = lower_2; i2 < upper_2; i2++)
        for (i3 = lower_3; i3 < upper_3; i3++)
          ...
          for (ik = lower_k; ik < upper_k; ik++) {
            Task B = block of statements;
          }

k nested loops. Let B(t1, t2, ..., tk) denote the task B where t1, t2, ..., tk are particular values of the variables i1, i2, ..., ik respectively.

The tasks need to be data-independent of each other, i.e. for any two tasks B1 and B2:

    Out(B1) ∩ Out(B2) = ∅
    Out(B1) ∩ In(B2)  = ∅
    In(B1)  ∩ Out(B2) = ∅
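To make the model concrete, here is a small C illustration with k = 2, where each task reads and writes only the element indexed by its own (t1, t2), so any two distinct tasks satisfy the independence conditions above; the names are illustrative only.

    typedef struct { int t1, t2; } task_t;   /* one point in the iteration space */

    /* Task B(t1, t2): writes only out[t1*n + t2] and reads only in[t1*n + t2],
       so Out(B1) ∩ Out(B2) = ∅ and In/Out never overlap across distinct tasks. */
    void run_task(task_t t, double *out, const double *in, int n) {
        out[t.t1 * n + t.t2] = 2.0 * in[t.t1 * n + t.t2];
    }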

5 System Overview

    C program → Hint Addition (compiler) → [Access Pattern Estimation → Task Grouping → Task Partitioning → Task Scheduling] (run-time library)

Hint Addition: the compiler adds calls to library functions which provide hints to the run-time system.
1. Access Pattern Estimation: use the hints to estimate the pattern of accesses.
2. Task Grouping: group together tasks which access closely placed data into bins.
3. Task Partitioning: partition the total bins among the P processors to maximize data locality and also load sharing.
4. Task Scheduling: schedule the tasks on the processors, ensuring overall load balancing.

6 Step 1: Estimating Memory Accesses

Assumption: task B accesses only chunks of elements in multiple arrays.

Four hints are provided to the module:
a. Number of arrays accessed: n (compile time)
b. Size in bytes of each array: vector (s1, s2, ..., sn) (compile time)
c. Number of processors: p (compile time)
d. Access footprint B(a1, a2, ..., an): the starting access address in each of the n arrays for task B (run time)

Each task can then be seen as a point B(a1, a2, ..., an) in n-dimensional space.
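A minimal sketch of collecting hint (d) at run time for the indirect-access example from slide 3 (n = 2 arrays); the footprint_t type and function name are illustrative, not CacheMiner's actual interface.

    #include <stdint.h>

    typedef struct { uintptr_t a[2]; } footprint_t;   /* point in 2-D access space */

    /* Record each task's starting access address in source_arr and result_arr. */
    void collect_footprints(int source_arr[], int key_arr[], int result_arr[],
                            int n, footprint_t fp[]) {
        for (int i = 0; i < n; i++) {
            fp[i].a[0] = (uintptr_t)&source_arr[key_arr[i]];   /* array 1 */
            fp[i].a[1] = (uintptr_t)&result_arr[i];            /* array 2 */
        }
    }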

7 Example

Take int P[100] and int Q[200]:
- Memory layout of P: size = 100 * sizeof(int) = 400 bytes, starting address &P[0] = 1000, so P spans addresses 1000 to 1400.
- Memory layout of Q: size = 200 * sizeof(int) = 800 bytes, starting address &Q[0] = 100, so Q spans addresses 100 to 900.

[Diagram: a 2-D grid with the access dimension of P (1000-1400) on one axis and the access dimension of Q (100-900) on the other.]

Each task B(x, y) is a point in the 2-dimensional grid:
    x: starting access address in array 1 (P) for the task
    y: starting access address in array 2 (Q) for the task
For example, B1(1000, 900) and B2(1000, 100).

8 Step 2: Grouping Tasks by Locality

A. Shift to the origin: subtract each array's base address from the corresponding coordinate. The P dimension becomes 0-400 and the Q dimension 0-800; B1(1000, 900) becomes (0, 800) and B2(1000, 100) becomes (0, 0).

B. Shrink the dimensions by C/n (cache size divided by number of arrays). In the example n = 2 and cache size C = 200, so each dimension is shrunk by 200/2 = 100, leaving a 4 x 8 grid of bins. Tasks that fall into the same bin access closely placed data that can share the cache.
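A small sketch of this bin mapping, using the base addresses and cache size from the example; the names are illustrative.

    #include <stddef.h>
    #include <stdint.h>

    /* Shift a task's footprint to the origin, then divide by C/n to get its bin. */
    void task_to_bin(const uintptr_t addr[], const uintptr_t base[],
                     int n_arrays, size_t cache_size, size_t bin[]) {
        size_t shrink = cache_size / n_arrays;          /* e.g. 200 / 2 = 100 */
        for (int d = 0; d < n_arrays; d++)
            bin[d] = (addr[d] - base[d]) / shrink;      /* bin index per dimension */
    }

With base = {1000, 100} and cache_size = 200, task B2(1000, 100) maps to bin (0, 0).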

9 Step 3: Partitioning Bins among P Processors

We need to form P groups of bins such that the sharing between the groups is minimized. The problem is NP-complete, so a heuristic method is used to divide up the bin space:

i. Compute the prime factors of P, and divide one dimension of the bin space into Rj chunks for each prime factor Rj.

Example: suppose we have 6 processors. 6 = 2 x 3, so the x dimension is divided into 2 parts and the y dimension into 3 parts, a total of 2 x 3 = 6 distinct regions (all bins in one region are processed by one processor).

[Diagram: the 4 x 8 bin grid divided into 6 distinct regions.]
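A sketch of the prime-factor step; the function name is illustrative, and it assumes the bin space has at least as many dimensions as P has prime factors.

    /* Decompose p into prime factors, e.g. 6 -> {2, 3}; dimension j of the
       bin space is then cut into factors[j] chunks. Returns the factor count. */
    int prime_factors(int p, int factors[]) {
        int count = 0;
        for (int f = 2; f * f <= p; f++)
            while (p % f == 0) { factors[count++] = f; p /= f; }
        if (p > 1) factors[count++] = p;   /* leftover prime */
        return count;
    }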

10 Step 4: Adaptive Scheduling of Task Groups

Each processor has its own task list of bins and takes K bins from a list at a time.

Local scheduling: each processor processes bins from its own task list.
Global scheduling: when a processor finishes its task list, it starts processing the task list of the most heavily loaded processor.
Adaptive control: a processor takes K bins at a time; K changes depending on the number of remaining bins:

    K_i = max(p/2, K_{i-1} - 1)   if few bins remain in the task list (light load)
    K_i = min(2p, K_{i-1} + 1)    if many bins remain in the task list (heavy load)
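A direct C rendering of this update rule; the light/heavy-load threshold is an assumed parameter, since the slide does not give one.

    /* Adapt the grab size K, clamped between p/2 and 2p. 'remaining' is the
       number of bins left in the task list; 'threshold' is an assumed cutoff. */
    int update_k(int k_prev, int p, int remaining, int threshold) {
        if (remaining < threshold)                              /* light load */
            return (k_prev - 1 > p / 2) ? k_prev - 1 : p / 2;   /* max(p/2, K-1) */
        else                                                    /* heavy load */
            return (k_prev + 1 < 2 * p) ? k_prev + 1 : 2 * p;   /* min(2p, K+1) */
    }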

11 Results

[Charts comparing a manually optimized version against the CacheMiner version, for both a static access pattern and a dynamic access pattern.]

12 Summary

A framework to exploit run-time cache locality on SMPs:
- Targeted at nested-loop structures accessing a number of arrays.
- Especially useful for indirect accesses, where the data access pattern cannot be determined until run time.
- Overall phases: C program → Hint Addition (compiler) → Access Pattern Estimation → Task Grouping → Task Partitioning → Task Scheduling (library).

