

1 Slide 1: Platform-Independent Cache Optimization by Pinpointing Low-Locality Reuse. Kristof Beyls and Erik D’Hollander, International Conference on Computational Science, June 2004.

2 Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

3 1.a Introduction. Anti-law of Moore: by 2003, half of execution time is lost due to data cache misses. [Figure: relative processor and memory speed versus 1980 (log scale, 1980–2000), showing the widening processor–memory speed gap.]

4 1.b Observation: capacity misses dominate. Three kinds of cache misses (the 3 C’s): cold, conflict, capacity.

5 Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

6 2.a Reuse Distance. Definition: the reuse distance of a memory access is the number of unique memory locations accessed since the previous access to the same data. Example:

address:  A B C A B B A
distance: ∞ ∞ ∞ 2 2 0 1

7 2.b Reuse distance: property. Lemma: in a fully associative LRU cache with n lines, an access hits the cache ⇔ its reuse distance < n. Consequence: in every cache with n lines, a cache miss with distance d is a cold miss if d = ∞, a capacity miss if n ≤ d < ∞, and a conflict miss if d < n.

8 2.c Reuse distance histogram Spec95fp

9 2.d Classifying cache misses (SPEC95fp). [Chart: conflict vs. capacity misses as a function of cache size.]

10 2.e Reuse distance vs. cache hit probability

11 Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

12 3.a Removing capacity misses. The reuse distance must become smaller than the cache size. 1. Hardware: enlarge the cache. 2. Compiler: loop tiling, loop fusion. 3. Algorithm.

13 3.b Compiler optimizations: SGIpro for Itanium (SPEC95fp). [Chart: conflict vs. capacity misses before and after optimization.] 30% of conflict misses eliminated, 1% of capacity misses eliminated.

14 Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

15 4.a Objectives for cache visualization. Cache behavior is shown in the source code. Cache behavior is presented accurately and concisely. Independent of specific cache parameters (e.g. size, associativity, …). ⇒ Reuse distance makes it possible to meet all of the above objectives.

16 4.b Example: MCF

17 4.c Example: MCF. This loop causes 68% of the capacity misses (annotated in the visualization as 68.3% / sl=21%):

for( ; arc < stop_arcs; arc += nr_group ) {
  if( arc->ident > BASIC ) {
    red_cost = bea_compute_red_cost( arc );
    if( red_cost < 0 && arc->ident == AT_LOWER ||
        red_cost > 0 && arc->ident == AT_UPPER ) {
      basket_size++;
      perm[basket_size]->a = arc;
      perm[basket_size]->cost = red_cost;
      perm[basket_size]->abs_cost = ABS(red_cost);
    }
  }
}

18 Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

19 5.a Optimization: classification. 1. Eliminate memory accesses with poor locality (+++). 2. Reduce the reuse distance, i.e. keep data in cache between use and reuse (++). 3. Increase spatial locality (++). 4. Hide latency by prefetching (+).

20 5.b Three case studies from SPEC2000, all with a large memory bottleneck: Mcf (90%), optimization of a bus schedule; Art (87%), simulation of a neural network; Equake (66%), simulation of an earthquake. The percentage is the fraction of execution time the processor is stalled waiting for data from memory and cache (Itanium 1, 733 MHz).

21 5.c Equake. Every time step performs a sparse matrix–vector multiplication and a vector rescaling. Optimizations: 1. Long reuse distance between consecutive time steps: shorten the distance by performing multiple time steps on a limited part of the matrix. 2. Eliminate memory accesses: K[Anext][i][j] (3 accesses) → K[Anext*N*9 + 3*i + j] (1 access).

22 5.d Art (neural network). Poor spatial locality (0%–20%): a neuron is a C structure containing 8 fields, and every loop updates one field for each neuron. The fix transforms the array of structures into a structure of arrays:

typedef struct {        typedef struct {
  double I;               double *I;
  …                       …
  double R;               double *R;
} f1_neuron;            } f1_neurons;

f1_neuron f1_layer[N]  →  f1_neurons f1_layer
f1_layer[y].W          →  f1_layer.W[y]

23 5.e Mcf. Reordering the accesses is hard; therefore: prefetching.

for( ; arc < stop_arcs; arc += nr_group ) {
  #define PREFETCH_DISTANCE 8
  PREFETCH(arc + nr_group*PREFETCH_DISTANCE)
  if( arc->ident > BASIC ) {
    red_cost = bea_compute_red_cost( arc );
    if( red_cost < 0 && arc->ident == AT_LOWER ||
        red_cost > 0 && arc->ident == AT_UPPER ) {
      basket_size++;
      perm[basket_size]->a = arc;
      perm[basket_size]->cost = red_cost;
      perm[basket_size]->abs_cost = ABS(red_cost);
    }
  }
}

24 5.f Measurements

Processor    | L1 size | L1 assoc | L2 size | L2 assoc | L3 size | L3 assoc
AthlonXP     | 64K     | 2        | 256K    | 16       | -       | -
Itanium      | 16K     | 4        | 96K     | 6        | 2M      | 4
Alpha 21264  | 64K     | 2        | 8M      | 1        | -       | -

Processor  | Compiler
AthlonXP   | icc -O3
Itanium    | ecc -O3
Alpha      | cc -O5

25 5.g Reuse Distance Histograms

26 Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

27 Conclusion. Reuse distance predicts cache behavior accurately. Compiler optimizations are not powerful enough to remove a substantial portion of the capacity misses. The programmer often has a global overview of program behavior; however, cache behavior is invisible in the source code ⇒ visualization. Mcf, Art, Equake: 3x faster on average on different CISC/RISC/EPIC platforms, with identical source-code optimizations. Visualization of reuse distance enables portable, platform-independent cache optimization.

28 Questions?

