Slide 1: A Proposal for a New Hardware Cache Monitoring Architecture
Martin Schulz, Jie Tao, Jürgen Jeitner, Wolfgang Karl
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Technische Universität München
ACM SIGPLAN Workshop on Memory System Performance (MSP), Berlin, Germany, 16.6.2002
SMiLE
Slide 2: Why Monitoring?
© Martin Schulz, MSP 2002, 16th June, 2002
- Cache optimizations increasingly important
  - Increasing gap between CPU and memory speed
  - Need to use caches efficiently: temporal and spatial locality
- Burden of optimization often on the user
  - Requires in-depth understanding of memory behavior
  - Cache/memory monitoring can provide this
- Current solutions insufficient
  - Few counters in current CPUs (without addresses)
  - Slow cache simulation / SW monitoring systems
Slide 3: A New Approach
- Address-related cache monitoring
  - Record all events in correlation with their addresses
  - Enables observation of individual data regions
  - Core: associative counter array
- Result: memory access histogram
  - Cumulated access behavior for each address
  - Split into accesses on the various busses
- Concept based on previous research in SMiLE
  - Hardware monitoring in NUMA environments
  - FPGA-based prototype shows feasibility
Slide 4: Outline
- Address-related cache monitoring
  - Associative counter principle
  - Multi-layer monitoring
- First experiments
  - Simulation setup
  - Access behavior and cache miss rates
  - Temporal information and phases
  - Tradeoff: overhead vs. accuracy
- System integration issues
- Conclusions & future work
Slide 5: Requirements
- Observe all events individually / no sampling
  - Interesting events might otherwise be overlooked
- Keep address information of observed events
  - Important for optimization
  - Allows observation of different patterns within a single application
- Essential: keep hardware complexity low
  - Maintains feasibility of the proposal
  - Recording a full trace is certainly not useful
  - Need for preprocessing in hardware
Slide 6: Associative Counter Array
- Observable events: addresses of accesses
- Limited number of counters (for feasibility)
- Dynamic association of counters and events
  - Associations updated at each event
  - Cache-like behavior for counters
  - Existing associations are simply counted up
- Dynamic eviction of counters when necessary
  - No free counters left
  - Counter overflow
  - Evicted values stored as intermediate results in a ring buffer
- Final result = SW aggregation of the ring buffer
Slide 7: Principle
[Figure: worked example of the associative counter array. A load stream (Addr 21, Addr 12, Addr 26, Addr 21, Addr 16, ...) updates a four-entry array of (address tag, counter) pairs; when a counter is evicted, its (address, value) pair is written to a ring buffer, and software aggregates the ring buffer entries into the final access histogram.]
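The counter-array principle illustrated on this slide can be modeled in software. Below is a minimal sketch; the slides only specify "cache-like behavior", so the fully associative organization with LRU-style eviction, and the class and method names, are illustrative assumptions, not the paper's design.

```python
from collections import OrderedDict

class CounterArray:
    """Software model of the associative counter array (illustrative)."""

    def __init__(self, num_counters):
        self.num_counters = num_counters
        self.counters = OrderedDict()   # address tag -> counter value
        self.ring_buffer = []           # evicted (addr, count) intermediate results

    def record(self, addr):
        """Observe one access event for `addr`."""
        if addr in self.counters:
            # Existing association: simply count
            self.counters[addr] += 1
            self.counters.move_to_end(addr)
        else:
            if len(self.counters) >= self.num_counters:
                # No free counter left: swap out the oldest association
                victim, count = self.counters.popitem(last=False)
                self.ring_buffer.append((victim, count))
            self.counters[addr] = 1

    def histogram(self):
        """Final result: SW aggregation of ring buffer plus live counters."""
        hist = {}
        for addr, count in self.ring_buffer + list(self.counters.items()):
            hist[addr] = hist.get(addr, 0) + count
        return hist
```

Feeding the slide's example trace (Addr 21, 12, 26, 21, 16) through a four-counter array yields the histogram {21: 2, 12: 1, 26: 1, 16: 1}.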
Slide 8: Adjusting the Granularity
- Tradeoff: granularity vs. overhead
  - Counting each address individually triggers many events
  - Can lead to a large number of counter swap-outs
- Finest granularity often not useful/needed
  - On memory busses: accesses to cache lines
  - Details not needed for a first overview
- User-definable granularity
  - Neighboring events are dynamically aggregated
  - Maximal aggregation distance can be controlled
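The effect of a coarser granularity can be sketched as a post-processing step on a per-cache-line histogram. The fixed bucketing below is a simplifying assumption: the hardware aggregates neighboring events dynamically up to a maximal distance, whereas this sketch groups lines statically.

```python
def coarsen(line_histogram, granularity):
    """Aggregate a per-cache-line access histogram into buckets of
    `granularity` neighboring cache lines (a static approximation of
    the monitor's dynamic aggregation)."""
    coarse = {}
    for line_addr, count in line_histogram.items():
        bucket = line_addr // granularity        # group neighboring lines
        coarse[bucket] = coarse.get(bucket, 0) + count
    return coarse
```

With granularity 2, lines 0 and 1 fall into one bucket, lines 2 and 3 into the next, and so on; each coarsening step reduces the number of distinct counter associations and hence the number of swap-outs.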
Slide 9: Multilayer Monitoring (1)
- A single monitor in the system is insufficient
  - Only capable of seeing events on one bus
  - No global relation of results
  - Impossible to compute miss rates
- One monitor for each level of the memory hierarchy
  - Each level observed by an individual & independent monitor
  - Results combined in software
- Resulting memory histogram is globalized
  - Misses can be computed by comparing the busses before/after a cache
  - Allows computation of miss rates for each address
Slide 10: Multilayer Monitoring (2)
[Figure: memory hierarchy CPU → L1 cache → L2 cache → main memory, with a monitor attached to each bus feeding the evaluation software.]
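Combining the per-bus histograms in software makes per-address miss rates computable: every access a monitor observes on the bus below a cache is, by definition, a miss of that cache. A hedged sketch of this combination step (the function name and the dict-based histogram representation are assumptions for illustration):

```python
def miss_rates(upper_hist, lower_hist):
    """Per-address miss rate of the cache sitting between two monitored
    busses: accesses seen on the lower bus divided by accesses seen on
    the upper bus. Addresses absent from the lower bus never missed."""
    return {addr: lower_hist.get(addr, 0) / accesses
            for addr, accesses in upper_hist.items()
            if accesses > 0}
```

For example, an address with 10 accesses on the CPU-L1 bus and 2 on the L1-L2 bus has an L1 miss rate of 0.2 for that address.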
Slide 11: Experimental Setup
- Design requires CPU integration → evaluation using simulation
- SIMT / NUMA multiprocessor simulator
  - Has been used to evaluate the SMiLE monitor
  - Flexible memory system
  - Based on Augmint
  - Can also execute sequential code
- Monitor setup
  - Finest granularity = 1 cache line
  - Varying number of counters
Slide 12: Experimental Parameters
- Memory setup
  - 2-level cache hierarchy with 3 monitors
  - L1: 8 KB, direct mapped, 32-byte cache lines
  - L2: 64 KB, 2-way associative, 32-byte cache lines
- Applications (SPLASH-2)
  - WATER: molecular-dynamics code, 64 molecules
  - RADIX: sorting kernel, 65538 keys to sort
  - LU: matrix decomposition, dense matrix of size 128x128
Slide 13: First Results (WATER)
[Figure only; no textual content extracted.]
Slide 14: Cache Miss Rates (RADIX)
[Figure only; no textual content extracted.]
Slide 15: Temporal Information
- Current approach has no temporal information
  - Data aggregated over the complete run
- Approach 1: phase-based monitoring
  - Add monitor barriers after each phase
  - Full flush and histogram generation
  - Useful to separate startup and post-processing, or to monitor individual iterations
- Approach 2: finer-grain temporal divisions
  - Monitor barriers cause overhead
  - Alternative: enable/disable the monitor repeatedly
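Phase-based monitoring as described in Approach 1 can be sketched as a monitor that flushes its state at each barrier and emits one histogram per phase. The class and method names are illustrative assumptions; a real implementation would flush the hardware counters into the ring buffer at each barrier.

```python
class PhasedMonitor:
    """Illustrative sketch of phase-based monitoring with monitor barriers."""

    def __init__(self):
        self.current = {}   # per-address counts for the running phase
        self.phases = []    # one finished histogram per completed phase

    def record(self, addr):
        self.current[addr] = self.current.get(addr, 0) + 1

    def barrier(self):
        """Monitor barrier: full flush and histogram generation."""
        self.phases.append(self.current)
        self.current = {}
```

Placing a barrier after the startup phase and after each main-loop iteration would, for instance, separate initialization traffic from the steady-state access pattern.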
Slide 16: Phase-based Monitoring (LU)
[Figure only; no textual content extracted.]
Slide 17: Monitoring Granularity
- Larger granularity can lead to inaccuracy
- Experimental evaluation using the LU code
  - Different granularities as multiples of cache lines
  - Error = sum of differences for each address, divided by the total number of memory accesses

  Granularity (lines):  1      2      3      4      5      6
  Error:                0.00%  2.05%  4.61%  4.97%  6.34%  6.56%

- Small / acceptable error in accuracy
- But: limited resolution may hide some peaks
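The error metric described on this slide can be written out directly. Treating both measurements as per-address histograms (i.e. the coarse histogram already expanded back to addresses) is an assumption made to keep the sketch self-contained.

```python
def accuracy_error(reference_hist, measured_hist):
    """Sum over all addresses of the absolute difference between the
    fine-grained reference histogram and a coarser measurement,
    divided by the total number of memory accesses."""
    addrs = set(reference_hist) | set(measured_hist)
    diff = sum(abs(reference_hist.get(a, 0) - measured_hist.get(a, 0))
               for a in addrs)
    total = sum(reference_hist.values())
    return diff / total if total else 0.0
```

A coarse measurement that shifts one of ten accesses to a neighboring address would score an error of 0.2 under this metric, matching the order of magnitude of the table's percentages.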
Slide 18: Swap-Out Overhead
- Number of swap-outs for LU
- Various numbers of counters and granularities
- 32-64 counters seem sufficient
- Low number of swap-outs at fine granularities
Slide 19: Integration
- Proposal requires an implementation inside the CPU
  - Direct access to the memory busses
  - Recording speed must equal the L1 clock
- Hardware integration
  - Main issue: how to store results?
  - Ring buffer has to be in main memory (due to its size)
- Goal: no/minimal influence on the system
  - Separate storage pipeline
  - Bypass write buffers and caches
  - Low impact on overall design
Slide 20: Software/OS Integration
- Monitor access
  - Kernel driver and user library
  - Configuration via special registers or memory maps
- Ring buffer evaluation
  - Periodic read-out (idle loop / daemon)
  - Or associated with monitor events (e.g. barriers)
- Address translation
  - Monitor records physical addresses; the user needs virtual addresses
  - Needed: integration into the OS VMM
  - Monitor barrier at every change in the page table
Slide 21: Conclusions & Future Work
- Cache monitoring is crucial for cache optimizations
  - Current solutions are limited or insufficient
- Proposal: address-related monitoring in hardware
  - Core: associative counter array
  - Counters dynamically associated with addresses
  - If necessary, counters evicted to a ring buffer
  - Result: memory access histograms
- Future work: software to evaluate results
  - Tools to visualize and analyze results
  - Long term: automatic/dynamic adaptation
Slide 22: For the curious...
- SMiLE hardware monitor: http://smile.in.tum.de/
- EP-Cache project: performance analysis and tools based on the SMiLE hardware monitor
  http://www.scai.fhg.de/EP-CACHE/
- Email contact: schulzm@in.tum.de