
Platform-Independent Cache Optimization by Pinpointing Low-Locality Reuse. Kristof Beyls and Erik D'Hollander. International Conference on Computational Science, June 2004.

Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

1.a Introduction. Anti-law of Moore: in 2003, half of execution time is lost due to data cache misses. [Chart: relative processor and memory speed versus 1980, showing the widening processor/memory speed gap.]

1.b Observation: capacity misses dominate. Three kinds of cache misses (the 3 C's): cold, conflict, capacity.

Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

2.a Reuse Distance. Definition: the reuse distance of a memory access is the number of unique memory locations accessed since the previous access to the same data. Example trace:
address:  A  B  B  A  C  B  A
distance: ∞  ∞  0  1  ∞  2  2

2.b Reuse Distance: property. Lemma: in a fully associative LRU cache with n lines, an access hits the cache ⇔ its reuse distance d < n. Consequence: in every cache with n lines, a cache miss with distance d is:
cold miss: d = ∞
capacity miss: n ≤ d < ∞
conflict miss: d < n

2.c Reuse distance histogram (SPEC95fp). [Figure: reuse distance histograms.]

2.d Classifying cache misses (SPEC95fp). [Chart: fraction of conflict vs. capacity misses as a function of cache size.]

2.e Reuse distance vs. cache hit probability

Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

3.a Removing capacity misses. To turn a capacity miss into a hit, the reuse distance must become smaller than the cache size. 1. Hardware: enlarge the cache. 2. Compiler: loop tiling, loop fusion. 3. Algorithm.

3.b Compiler optimizations: SGIpro for Itanium (SPEC95fp). [Chart: conflict vs. capacity misses before and after compiler optimization.] 30% of conflict misses eliminated, but only 1% of capacity misses.

Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

4.a Objectives for cache visualization. Cache behavior is shown in the source code. Cache behavior is presented accurately and concisely. The visualization is independent of specific cache parameters (e.g. size, associativity, …). ⇒ Reuse distance meets all of the above objectives.

4.b Example: MCF

for( ; arc < stop_arcs; arc += nr_group ) {
  if( arc->ident > BASIC ) {
    red_cost = bea_compute_red_cost( arc );
    if( red_cost < 0 && arc->ident == AT_LOWER
        || red_cost > 0 && arc->ident == AT_UPPER ) {
      basket_size++;
      perm[basket_size]->a = arc;
      perm[basket_size]->cost = red_cost;
      perm[basket_size]->abs_cost = ABS(red_cost);
    }
  }
}
This loop causes 68% of the capacity misses.
4.c Example: MCF (annotated: 68.3% / sl=21%)

Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

5.a Optimization: classification. 1. Eliminate memory accesses with poor locality (+++). 2. Reduce the reuse distance (keep data in cache between use and reuse) (++). 3. Increase spatial locality (++). 4. Hide latency by prefetching (+).

5.b Three case studies. From SPEC2000, with a large memory bottleneck: Mcf (90%): optimization of a bus schedule. Art (87%): simulation of a neural network. Equake (66%): simulation of an earthquake. The percentage is the fraction of execution time the processor is stalled waiting for data from memory and cache (Itanium 1, 733 MHz).

5.c Equake. For every time step: sparse matrix-vector multiplication and vector rescaling. Optimizations: 1. Long reuse distance between consecutive time steps: shorten the distance by performing multiple time steps on a limited part of the matrix. 2. Eliminate memory accesses: K[Anext][i][j] (3 accesses) → K[Anext*N*9 + 3*i + j] (1 access).

5.d Art (neural network). Poor spatial locality (0% - 20%). A neuron is a C structure containing 8 fields; every loop updates one field, for each neuron. Transformation from an array of structures to a structure of arrays:
Before:  typedef struct { double I; … double R; } f1_neuron;
         f1_neuron f1_layer[N];
After:   typedef struct { double* I; … double* R; } f1_neurons;
         f1_neurons f1_layer;
Accesses change from f1_layer[y].W to f1_layer.W[y].

5.e Mcf. Reordering of the accesses is hard; therefore: prefetching.
#define PREFETCH_DISTANCE 8
for( ; arc < stop_arcs; arc += nr_group ) {
  PREFETCH( arc + nr_group*PREFETCH_DISTANCE );
  if( arc->ident > BASIC ) {
    red_cost = bea_compute_red_cost( arc );
    if( red_cost < 0 && arc->ident == AT_LOWER
        || red_cost > 0 && arc->ident == AT_UPPER ) {
      basket_size++;
      perm[basket_size]->a = arc;
      perm[basket_size]->cost = red_cost;
      perm[basket_size]->abs_cost = ABS(red_cost);
    }
  }
}

5.f Measurements
processor   L1 size/assoc   L2 size/assoc   L3 size/assoc
AthlonXP    64K / 2-way     256K / 16-way   -
Itanium     16K / 4-way     96K / 6-way     2M / 4-way
Alpha       64K / 2-way     8M / 1-way      -
processor   compiler
AthlonXP    icc -O3
Itanium     ecc -O3
Alpha       cc -O5

5.g Reuse Distance Histograms

Overview 1. Introduction 2. Reuse Distance 3. Optimizing Cache Behavior: By Hardware, Compiler or Programmer? 4. Visualization 5. Case Studies 6. Conclusion

6. Conclusion. Reuse distance predicts cache behavior accurately. Compiler optimizations are not powerful enough to remove a substantial portion of the capacity misses. The programmer often has a global overview of program behavior; however, cache behavior is invisible in the source code ⇒ visualization. Mcf, Art, Equake: 3x faster on average, on different CISC/RISC/EPIC platforms, with identical source code optimizations. Visualization of reuse distance enables portable, platform-independent cache optimization.

Questions?