Slide 1: A Proposal for a New Hardware Cache Monitoring Architecture
Martin Schulz, Jie Tao, Jürgen Jeitner, Wolfgang Karl
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR), Technische Universität München
ACM SIGPLAN Workshop on Memory System Performance (MSP), Berlin, Germany, 16.6.2002
SMiLE
Slide 2: Why Monitoring?
© Martin Schulz, MSP 2002, 16th June, 2002
- Cache optimizations increasingly important
  - Increasing gap between CPU and memory speed
  - Need to use caches efficiently: temporal and spatial locality
- Burden of optimization often on the user
  - Requires in-depth understanding of memory behavior
  - Cache/memory monitoring can provide this
- Current solutions insufficient
  - Few counters in current CPUs (without addresses)
  - Slow cache simulation / SW monitoring systems
Slide 3: A New Approach
- Address-related cache monitoring
  - Record all events in correlation with their addresses
  - Enables observation of individual data regions
  - Core: associative counter array
- Result: memory access histogram
  - Cumulated access behavior for each address
  - Split into accesses on the various busses
- Concept based on previous research in SMiLE
  - Hardware monitoring in NUMA environments
  - FPGA-based prototype shows feasibility
Slide 4: Outline
- Address-related cache monitoring
  - Associative counter principle
  - Multi-layer monitoring
- First experiments
  - Simulation setup
  - Access behavior and cache miss rates
  - Temporal information and phases
  - Tradeoff: overhead vs. accuracy
- System integration issues
- Conclusions & future work
Slide 5: Requirements
- Observe all events individually / no sampling
  - Interesting events might otherwise be overlooked
- Keep address information of observed events
  - Important for optimization
  - Allows observation of different patterns within a single application
- Essential: keep hardware complexity low
  - Maintains feasibility of the proposal
  - Recording a full trace is certainly not useful
  - Need for preprocessing in hardware
Slide 6: Associative Counter Array
- Observable events: addresses of accesses
- Limited number of counters (for feasibility)
- Dynamic association of counters and events
  - Associations updated at each event
  - Cache-like behavior for counters
  - Existing associations are simply counted up
- Dynamic eviction of counters when necessary
  - No free counters left
  - Counter overflow
  - Evicted values stored as intermediate results in a ring buffer
- Final result = SW aggregation of the ring buffer
Slide 7: Principle
[Figure: worked example of the associative counter array. A load stream (Addr 21, Addr 12, Addr 26, Addr 21, Addr 16, ...) updates a four-entry array of (address tag, counter) pairs; when a counter is evicted, its (address, value) pair is written to a ring buffer, and software aggregates the ring buffer entries into the final access histogram.]
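The counter-array principle illustrated on this slide can be modeled in software. Below is a minimal sketch; the slides only specify "cache-like behavior", so the fully associative organization with LRU-style eviction, and the class and method names, are illustrative assumptions, not the paper's design.

```python
from collections import OrderedDict

class CounterArray:
    """Software model of the associative counter array (illustrative)."""

    def __init__(self, num_counters):
        self.num_counters = num_counters
        self.counters = OrderedDict()   # address tag -> counter value
        self.ring_buffer = []           # evicted (addr, count) intermediate results

    def record(self, addr):
        """Observe one access event for `addr`."""
        if addr in self.counters:
            # Existing association: simply count
            self.counters[addr] += 1
            self.counters.move_to_end(addr)
        else:
            if len(self.counters) >= self.num_counters:
                # No free counter left: swap out the oldest association
                victim, count = self.counters.popitem(last=False)
                self.ring_buffer.append((victim, count))
            self.counters[addr] = 1

    def histogram(self):
        """Final result: SW aggregation of ring buffer plus live counters."""
        hist = {}
        for addr, count in self.ring_buffer + list(self.counters.items()):
            hist[addr] = hist.get(addr, 0) + count
        return hist
```

Feeding the slide's example trace (Addr 21, 12, 26, 21, 16) through a four-counter array yields the histogram {21: 2, 12: 1, 26: 1, 16: 1}.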
Slide 8: Adjusting the Granularity
- Tradeoff: granularity vs. overhead
  - Counting each address individually triggers many events
  - Can lead to a large number of counter swap-outs
- Finest granularity often not useful/needed
  - On memory busses: accesses to cache lines
  - Details not needed for a first overview
- User-definable granularity
  - Neighboring events are dynamically aggregated
  - Maximal aggregation distance can be controlled
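The effect of a coarser granularity can be sketched as a post-processing step on a per-cache-line histogram. The fixed bucketing below is a simplifying assumption: the hardware aggregates neighboring events dynamically up to a maximal distance, whereas this sketch groups lines statically.

```python
def coarsen(line_histogram, granularity):
    """Aggregate a per-cache-line access histogram into buckets of
    `granularity` neighboring cache lines (a static approximation of
    the monitor's dynamic aggregation)."""
    coarse = {}
    for line_addr, count in line_histogram.items():
        bucket = line_addr // granularity        # group neighboring lines
        coarse[bucket] = coarse.get(bucket, 0) + count
    return coarse
```

With granularity 2, lines 0 and 1 fall into one bucket, lines 2 and 3 into the next, and so on; each coarsening step reduces the number of distinct counter associations and hence the number of swap-outs.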
Slide 9: Multilayer Monitoring (1)
- A single monitor in the system is insufficient
  - Only capable of seeing events on one bus
  - No global relation of results
  - Impossible to compute miss rates
- One monitor for each level of the memory hierarchy
  - Each level observed by an individual & independent monitor
  - Results combined in software
- Resulting memory histogram is globalized
  - Misses can be computed by comparing the busses before/after a cache
  - Allows computation of miss rates for each address
Slide 10: Multilayer Monitoring (2)
[Figure: memory hierarchy CPU → L1 cache → L2 cache → main memory, with a monitor attached to each bus feeding the evaluation software.]
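Combining the per-bus histograms in software makes per-address miss rates computable: every access a monitor observes on the bus below a cache is, by definition, a miss of that cache. A hedged sketch of this combination step (the function name and the dict-based histogram representation are assumptions for illustration):

```python
def miss_rates(upper_hist, lower_hist):
    """Per-address miss rate of the cache sitting between two monitored
    busses: accesses seen on the lower bus divided by accesses seen on
    the upper bus. Addresses absent from the lower bus never missed."""
    return {addr: lower_hist.get(addr, 0) / accesses
            for addr, accesses in upper_hist.items()
            if accesses > 0}
```

For example, an address with 10 accesses on the CPU-L1 bus and 2 on the L1-L2 bus has an L1 miss rate of 0.2 for that address.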
Slide 11: Experimental Setup
- Design requires CPU integration → evaluation using simulation
- SIMT / NUMA multiprocessor simulator
  - Has been used to evaluate the SMiLE monitor
  - Flexible memory system
  - Based on Augmint
  - Can also execute sequential code
- Monitor setup
  - Finest granularity = 1 cache line
  - Varying number of counters
Slide 12: Experimental Parameters
- Memory setup
  - 2-level cache hierarchy with 3 monitors
  - L1: 8 KB, direct mapped, 32-byte cache lines
  - L2: 64 KB, 2-way associative, 32-byte cache lines
- Applications (SPLASH-2)
  - WATER: molecular-dynamics code, 64 molecules
  - RADIX: sorting kernel, 65538 keys to sort
  - LU: matrix decomposition, dense matrix of size 128x128
Slide 13: First Results (WATER)
[Figure only; no textual content extracted.]
Slide 14: Cache Miss Rates (RADIX)
[Figure only; no textual content extracted.]
Slide 15: Temporal Information
- Current approach has no temporal information
  - Data aggregated over the complete run
- Approach 1: phase-based monitoring
  - Add monitor barriers after each phase
  - Full flush and histogram generation
  - Useful to separate startup and post-processing, or to monitor individual iterations
- Approach 2: finer-grain temporal divisions
  - Monitor barriers cause overhead
  - Alternative: enable/disable the monitor repeatedly
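Phase-based monitoring as described in Approach 1 can be sketched as a monitor that flushes its state at each barrier and emits one histogram per phase. The class and method names are illustrative assumptions; a real implementation would flush the hardware counters into the ring buffer at each barrier.

```python
class PhasedMonitor:
    """Illustrative sketch of phase-based monitoring with monitor barriers."""

    def __init__(self):
        self.current = {}   # per-address counts for the running phase
        self.phases = []    # one finished histogram per completed phase

    def record(self, addr):
        self.current[addr] = self.current.get(addr, 0) + 1

    def barrier(self):
        """Monitor barrier: full flush and histogram generation."""
        self.phases.append(self.current)
        self.current = {}
```

Placing a barrier after the startup phase and after each main-loop iteration would, for instance, separate initialization traffic from the steady-state access pattern.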
Slide 16: Phase-based Monitoring (LU)
[Figure only; no textual content extracted.]
Slide 17: Monitoring Granularity
- Larger granularity can lead to inaccuracy
- Experimental evaluation using the LU code
  - Different granularities as multiples of cache lines
  - Error = sum of differences for each address, divided by the total number of memory accesses

  Granularity (lines):  1      2      3      4      5      6
  Error:                0.00%  2.05%  4.61%  4.97%  6.34%  6.56%

- Small / acceptable error in accuracy
- But: limited resolution may hide some peaks
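The error metric described on this slide can be written out directly. Treating both measurements as per-address histograms (i.e. the coarse histogram already expanded back to addresses) is an assumption made to keep the sketch self-contained.

```python
def accuracy_error(reference_hist, measured_hist):
    """Sum over all addresses of the absolute difference between the
    fine-grained reference histogram and a coarser measurement,
    divided by the total number of memory accesses."""
    addrs = set(reference_hist) | set(measured_hist)
    diff = sum(abs(reference_hist.get(a, 0) - measured_hist.get(a, 0))
               for a in addrs)
    total = sum(reference_hist.values())
    return diff / total if total else 0.0
```

A coarse measurement that shifts one of ten accesses to a neighboring address would score an error of 0.2 under this metric, matching the order of magnitude of the table's percentages.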
Slide 18: Swap-Out Overhead
- Number of swap-outs for LU
- Various numbers of counters and granularities
- 32-64 counters seem sufficient
- Low number of swap-outs at fine granularities
Slide 19: Integration
- Proposal requires an implementation inside the CPU
  - Direct access to the memory busses
  - Recording speed must equal the L1 clock
- Hardware integration
  - Main issue: how to store results?
  - Ring buffer has to be in main memory (due to its size)
- Goal: no/minimal influence on the system
  - Separate storage pipeline
  - Bypass write buffers and caches
  - Low impact on overall design
Slide 20: Software/OS Integration
- Monitor access
  - Kernel driver and user library
  - Configuration via special registers or memory maps
- Ring buffer evaluation
  - Periodic read-out (idle loop / daemon)
  - Or associated with monitor events (e.g. barriers)
- Address translation
  - Monitor records physical addresses; the user needs virtual addresses
  - Needed: integration into the OS VMM
  - Monitor barrier at every change in the page table
Slide 21: Conclusions & Future Work
- Cache monitoring is crucial for cache optimizations
  - Current solutions are limited or insufficient
- Proposal: address-related monitoring in hardware
  - Core: associative counter array
  - Counters dynamically associated with addresses
  - If necessary, counters evicted to a ring buffer
  - Result: memory access histograms
- Future work: software to evaluate results
  - Tools to visualize and analyze results
  - Long term: automatic/dynamic adaptation
Slide 22: For the curious...
- SMiLE hardware monitor: http://smile.in.tum.de/
- EP-Cache project: performance analysis and tools based on the SMiLE hardware monitor
  http://www.scai.fhg.de/EP-CACHE/
- Email contact: schulzm@in.tum.de