A Proposal for a New Hardware Cache Monitoring Architecture Martin Schulz, Jie Tao, Jürgen Jeitner, Wolfgang Karl Lehrstuhl für Rechnertechnik und Rechnerorganisation.


1 A Proposal for a New Hardware Cache Monitoring Architecture
Martin Schulz, Jie Tao, Jürgen Jeitner, Wolfgang Karl
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Technische Universität München
ACM SIGPLAN Workshop on Memory System Performance, Berlin, Germany, 16.6.2002
SMiLE

2 Why Monitoring?
© Martin Schulz, MSP 2002, 16th June, 2002
- Cache optimizations increasingly important
  - Increasing gap between CPU and memory speed
  - Need to use caches efficiently
  - Temporal and spatial locality
- Burden of optimization often on the user
  - Requires in-depth understanding of memory behavior
  - Cache/memory monitoring can provide this
- Current solutions insufficient
  - Few counters in current CPUs (without addresses)
  - Slow cache simulation / SW monitoring systems

3 A New Approach
- Address-related cache monitoring
  - Record all events in correlation with their address
  - Enable observation of individual data regions
  - Core: associative counter array
- Result: memory access histogram
  - Cumulative access behavior for each address
  - Split into accesses on the various buses
- Concept based on previous research in SMiLE
  - Hardware monitoring in NUMA environments
  - FPGA-based prototype shows feasibility

4 Outline
- Address-related cache monitoring
  - Associative counter principle
  - Multi-layer monitoring
- First experiments
  - Simulation setup
  - Access behavior and cache miss rates
  - Temporal information and phases
  - Tradeoff overhead vs. accuracy
- System integration issues
- Conclusions & future work

5 Requirements
- Observe all events individually / no sampling
  - Interesting events might otherwise be overlooked
- Keep address information of observed events
  - Important for optimization
  - Allows observation of different patterns within a single application
- Essential: keep hardware complexity low
  - Maintain feasibility of proposal
  - Recording of a full trace certainly not useful
  - Need for preprocessing in hardware

6 Associative Counter Array
- Observable events: addresses of accesses
- Limited number of counters (feasibility)
- Dynamic association of counters and events
  - Associations updated at each event
  - Cache-like behavior for counters
  - Existing associations are simply counted
- Dynamic eviction of counters when necessary
  - Not enough free counters anymore
  - Counter overflow
  - Stored as intermediate results in ring buffer
- Final result = SW aggregation of ring buffer

7 Principle
Associative Counter Array:
  #C   Addr-Tag   Counter
  #1   Addr 8     45
  #2   Addr 21    254
  #3   Addr 16    134
  #4   Not used   ---
Code / Execution: Load Addr 21, Load Addr 12, Load Addr 26, Load Addr 21, Load Addr 16, ...
[Figure: animation showing evicted counters written as (Addr, Value) entries to the Ring Buffer and aggregated into the Access Histogram]
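The principle above can be approximated in software. The following is a minimal sketch, not the proposed hardware design: the class name, the least-recently-used eviction policy, and the rule that a counter is written out as soon as it reaches `max_count` are all assumptions made for illustration, since the slides only state that counters are evicted dynamically and on overflow.

```python
from collections import Counter, OrderedDict


class AssociativeCounterArray:
    """Toy software model of an associative counter array (hypothetical)."""

    def __init__(self, num_counters=4, max_count=255):
        self.num_counters = num_counters
        self.max_count = max_count
        self.counters = OrderedDict()  # address tag -> count, least recent first
        self.ring_buffer = []          # evicted (address, count) pairs

    def access(self, addr):
        """Record one access to addr."""
        if addr in self.counters:
            # Existing association: simply count the event.
            self.counters[addr] += 1
            self.counters.move_to_end(addr)
            if self.counters[addr] >= self.max_count:
                # Assumed overflow rule: write the full counter out and free it.
                self.ring_buffer.append((addr, self.counters.pop(addr)))
            return
        if len(self.counters) >= self.num_counters:
            # No free counter left: evict the least recently used one (assumption).
            self.ring_buffer.append(self.counters.popitem(last=False))
        self.counters[addr] = 1

    def flush(self):
        """Write all live counters to the ring buffer."""
        while self.counters:
            self.ring_buffer.append(self.counters.popitem(last=False))

    def histogram(self):
        """Software aggregation of the ring buffer into an access histogram."""
        hist = Counter()
        for addr, count in self.ring_buffer:
            hist[addr] += count
        return hist


# Replay the load sequence from the slide with only two counters.
monitor = AssociativeCounterArray(num_counters=2)
for addr in [21, 12, 26, 21, 16]:
    monitor.access(addr)
monitor.flush()
histogram = monitor.histogram()
```

With two counters, address 21 is evicted once and re-associated later, so it appears twice in the ring buffer; the software aggregation step merges both entries into a single histogram bin.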

8 Adjusting the Granularity
- Tradeoff granularity vs. overhead
  - Many events triggered by counting for each address individually
  - Can lead to a large number of counter swap-outs
- Finest granularity often not useful/needed
  - On memory buses: accesses to cache lines
  - Details not needed for first overview
- User-definable granularity
  - Neighboring events are dynamically aggregated
  - Maximal aggregation distance can be controlled
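The effect of a coarser granularity can be illustrated with a small sketch. It is a simplification: the slide describes dynamic aggregation of neighboring events up to a maximal distance, whereas this example maps addresses onto fixed, aligned regions. The function names are hypothetical, and the 32-byte line size is taken from the experimental parameters later in the talk.

```python
from collections import Counter

CACHE_LINE_BYTES = 32  # line size used in the talk's experimental setup


def region_tag(addr, granularity_lines=1, line_bytes=CACHE_LINE_BYTES):
    """Map a byte address to the index of its aligned monitoring region."""
    return addr // (granularity_lines * line_bytes)


def aggregate(addresses, granularity_lines=1):
    """Count accesses per region at the given granularity (in cache lines)."""
    return Counter(region_tag(a, granularity_lines) for a in addresses)


addresses = [0, 8, 32, 40, 64, 100]
fine = aggregate(addresses, granularity_lines=1)    # four distinct regions
coarse = aggregate(addresses, granularity_lines=2)  # two distinct regions
```

Doubling the granularity halves the number of distinct tags in this trace, which is why fewer counters need to be swapped out, at the cost of resolution.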

9 Multilayer Monitoring (1)
- Single monitor in the system insufficient
  - Only capable of seeing events on one bus
  - No global relation of results
  - Impossible to compute miss rates
- One monitor for each level of the memory hierarchy
  - Each level observed by an individual, independent monitor
  - Results combined in software
- Resulting memory histogram globalized
  - Misses can be computed by comparing accesses before/after each cache
  - Allows computation of miss rates for each address

10 Multilayer Monitoring (2)
[Figure: memory hierarchy (CPU, L1 Cache, L2 Cache, Main Memory) with monitors connected to the evaluation software]
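Combining the per-level histograms in software might look like the sketch below. It assumes that every access a monitor sees on the bus below a cache corresponds to a miss in that cache (ignoring write-backs and similar traffic); the function name and the sample counts are hypothetical.

```python
def miss_rates(accesses_above, accesses_below):
    """Per-address miss rate of one cache level.

    accesses_above: histogram seen by the monitor in front of the cache.
    accesses_below: histogram seen by the monitor behind the cache,
                    interpreted here as that cache's misses (assumption).
    """
    rates = {}
    for region, total in accesses_above.items():
        if total > 0:
            rates[region] = accesses_below.get(region, 0) / total
    return rates


# Made-up histograms for the three monitors of a two-level hierarchy.
cpu_to_l1 = {0x100: 200, 0x120: 50}
l1_to_l2 = {0x100: 20, 0x120: 50}
l2_to_mem = {0x100: 5, 0x120: 10}

l1_miss = miss_rates(cpu_to_l1, l1_to_l2)
l2_miss = miss_rates(l1_to_l2, l2_to_mem)
```

In this made-up data, region 0x120 misses in L1 on every access but is mostly served by L2, a pattern that a single monitor on one bus could not reveal.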

11 Experimental Setup
- Design requires CPU integration
  - Evaluation using simulation
- SIMT / NUMA multiprocessor simulator
  - Has been used to evaluate the SMiLE monitor
  - Flexible memory system
  - Based on Augmint
  - Can also execute sequential code
- Monitor setup
  - Finest granularity = 1 cache line
  - Varying number of counters

12 Experimental Parameters
- Memory setup
  - 2-level cache hierarchy with 3 monitors
  - L1: 8 KB / direct mapped / 32 byte cache lines
  - L2: 64 KB / 2-way assoc. / 32 byte cache lines
- Applications (SPLASH-2)
  - WATER: Molecular-dynamics code
    - 64 molecules
  - RADIX: Sorting kernel
    - 65538 keys to sort
  - LU: Matrix decomposition
    - Dense matrix of size 128x128

13 First Results (WATER)

14 Cache Miss Rates (RADIX)

15 Temporal Information
- Current approach has no temporal information
  - Data aggregated over complete run
- Approach 1: Phase-based monitoring
  - Add monitor barriers after each phase
  - Full flush and histogram generation
  - Useful to separate startup and post-processing
  - Monitor individual iterations
- Approach 2: Finer-grain temporal divisions
  - Monitor barriers cause overheads
  - Alternative: enable/disable monitor repeatedly

16 Phase-based Monitoring (LU)

17 Monitoring Granularity
- Larger granularity can lead to inaccuracy
- Experimental evaluation using the LU code
  - Different granularities as multiples of cache lines
  - Error as sum of differences for each address divided by total number of memory accesses

  Gran.   1       2       3       4       5       6
  Error   0.00%   2.05%   4.61%   4.97%   6.34%   6.56%

- Small / acceptable error in accuracy
- But: limited resolution may hide some peaks
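The error metric defined above (sum of per-address differences divided by the total number of memory accesses) can be sketched as follows. The slide does not say how a coarse histogram is mapped back to individual cache lines; spreading each region's count evenly over its lines is an assumption made here, and all function names and sample counts are hypothetical.

```python
def coarsen(per_line, granularity):
    """Merge a per-cache-line histogram into regions of `granularity` lines."""
    coarse = {}
    for line, count in per_line.items():
        region = line // granularity
        coarse[region] = coarse.get(region, 0) + count
    return coarse


def spread_evenly(coarse, granularity):
    """Map region counts back to lines, assuming a uniform distribution."""
    return {
        region * granularity + offset: count / granularity
        for region, count in coarse.items()
        for offset in range(granularity)
    }


def histogram_error(reference, estimate):
    """Sum of per-address differences divided by the total access count."""
    total = sum(reference.values())
    if total == 0:
        return 0.0
    lines = set(reference) | set(estimate)
    diff = sum(abs(reference.get(l, 0) - estimate.get(l, 0)) for l in lines)
    return diff / total


# Made-up per-line access counts (granularity 1 is the reference).
exact = {0: 6, 1: 2, 2: 4, 3: 4}
approx = spread_evenly(coarsen(exact, 2), 2)
error = histogram_error(exact, approx)  # 4 / 16 = 0.25
```

Lines 0 and 1 share a region with unequal counts, so averaging them contributes all of the error here, while the evenly accessed lines 2 and 3 contribute none. This mirrors the note that limited resolution may hide peaks.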

18 Swap-Out Overhead
- Number of swap-outs for LU
  - Various numbers of counters and granularities
- 32-64 counters seem sufficient
- Low number of swap-outs at fine granularities

19 Integration
- Proposal requires an implementation inside the CPU
  - Direct access to memory buses
  - Recording speed must be equal to L1 clock
- Hardware integration
  - Main issue: How to store results?
  - Ring buffer has to be in main memory (due to size)
- Goal: No/minimal influence on system
  - Separate storage pipeline
  - Bypass write buffers and caches
  - Low impact on overall design

20 Software/OS Integration
- Monitor access
  - Kernel driver and user library
  - Configuration: special registers or memory maps
- Ring buffer evaluation
  - Periodic read-out (idle-loop / daemon)
  - Associated with monitor events (e.g. barriers)
- Address translation
  - Monitor records physical addresses, user needs virtual addresses
  - Needed: integration in OS VMM
  - Monitor barrier at every change in page table

21 Conclusions & Future Work
- Cache monitoring crucial for cache optimizations
  - Current solutions limited or insufficient
- Proposal: Address-related monitoring in hardware
  - Core: Associative counter array
  - Counters dynamically associated with addresses
  - If necessary: counters evicted to a ring buffer
  - Result: memory access histograms
- Future work: Software to evaluate results
  - Tools to visualize and analyze results
  - Long term: automatic/dynamic adaptation

22 For the curious...
- SMiLE Hardware Monitor: http://smile.in.tum.de/
- EP-Cache project: Performance Analysis and Tools based on the SMiLE hardware monitor http://www.scai.fhg.de/EP-CACHE/
- Email contact: schulzm@in.tum.de

