1 A Self-Tuning Cache Architecture for Embedded Systems Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky *Dept. of Electrical Engineering Dept. of Computer.

1 A Self-Tuning Cache Architecture for Embedded Systems
Chuanjun Zhang*, Frank Vahid**, and Roman Lysecky
*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported by the National Science Foundation and the Semiconductor Research Corporation.

2 Caches Consume Much Power
ARM920T and M*CORE: caches consume 50% of total processor system power (Segars 01, Lee et al. 99). Figure label: >50%.
Caches are frequently accessed, so they consume dynamic power.
Caches account for most of the transistors on a die, so they consume static power.
We showed that a configurable cache can cut that power nearly in half on average (Zhang et al. ISCA 03, ISVLSI 03).

3 Configurable Cache Architecture
Figure: a four-way set-associative base cache (ways W1 to W4), shown reconfigured as a two-way set-associative cache, as a direct-mapped cache, and with two ways shut down.
Way concatenation: the four-way base cache can be configured as two-way set associative or direct mapped.
Way shutdown: ways can be shut down with gated-Vdd control, using the sleep transistor method (Powell et al. ISLPED 2000). Diagram labels: Vdd, Gnd, bitline.
Line concatenation: one way, 16 bytes; 4 physical lines are filled from off-chip memory when the line size is 64 bytes. Diagram labels: counter, bus.
The way prediction unit can be turned on/off.
(Zhang et al. ISVLSI 03) (Zhang et al. ISCA 03)

4 Computing Total Memory-Related Energy
Considers CPU stall energy and off-chip memory energy.
Excludes CPU active energy.
Thus, represents all memory-related energy.
energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle
energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle
(We varied the k's to account for different system implementations.)
Measured quantities (underlined on the slide): cache_hits, cache_misses, and cycles come from SimpleScalar; the others come from our layout or data sheets.
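The energy equations on this slide can be sketched as a short Python function. This is only an illustration: the `EnergyParams` container, the function signature, and the numbers in the usage note are assumptions made for the example, not values from the presentation. The sketch uses the parameterized form of energy_miss (k_miss_energy * energy_hit).

```python
from dataclasses import dataclass


@dataclass
class EnergyParams:
    """Hypothetical container for the per-configuration constants."""
    energy_hit: float              # energy of one cache hit
    k_miss_energy: float           # ratio energy_miss / energy_hit
    k_static: float                # static share of the total per-cycle energy
    energy_total_per_cycle: float  # total cache energy per cycle


def energy_mem(cache_hits: int, cache_misses: int, cycles: int,
               p: EnergyParams) -> float:
    """Total memory-related energy, following the slide's equations."""
    energy_miss = p.k_miss_energy * p.energy_hit
    energy_static_per_cycle = p.k_static * p.energy_total_per_cycle
    energy_dynamic = cache_hits * p.energy_hit + cache_misses * energy_miss
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static
```

With made-up inputs such as `EnergyParams(1.0, 50.0, 0.5, 2.0)`, 100 hits, 2 misses, and 300 cycles, the dynamic part is 100 + 2 * 50 = 200 and the static part is 300 * 1.0 = 300, for a total of 500 energy units.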

5 Best Configuration Varies Across Applications

6 Cache Self-tuning Hardware
Simulation-based methods:
  Drawback: slowness. Seconds of real-time work may take tens of hours to simulate.
  Setting up the simulation tools may be difficult.
Self-tuning method:
  Incorporates a cache parameter tuner on an SoC platform.
  Detects the cache parameters with the lowest energy dissipation.
  The tuner sits to the side and collects the information used to calculate the energy.
  Diagram blocks: Processor, I$, D$, Tuner, Off-chip Memory.
A heuristic algorithm is needed:
  Searching all possible cache configurations is time consuming.
  Once other configurable parameters are considered (voltage levels, bus width, etc.), the search space quickly grows to millions of configurations.
  Cache flushing should be avoided.

7 Designing a Search Heuristic: Evaluating Impact of Cache Parameters on Miss Rate and Energy
Figure: average instruction cache miss rate and normalized energy of the benchmarks. Panel labels: one way; line size 32B.

8 Energy Dissipation of On-Chip Cache and Off-Chip Memory

9 Heuristic: Searching for the Least-Energy Cache Configuration
Search order: cache size, then line size, then associativity, then way prediction, ending at the least-energy cache configuration.
Diagram labels: W1, W2, W3, W4.
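A one-parameter-at-a-time search in this order can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tuner's actual logic: the function name `tune`, the rule of stopping a parameter's search at the first value that fails to lower the energy, and the `toy_energy` function are all hypothetical, and the sketch ignores any constraints between parameters (for example, which associativities a given cache size allows).

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, object]


def tune(search_order: List[Tuple[str, list]],
         energy_of: Callable[[Config], float]) -> Tuple[Config, float, int]:
    """Greedy search: tune one parameter at a time, in the given order.

    Returns (best configuration found, its energy, configurations evaluated).
    """
    # Start from the first candidate value of every parameter.
    config: Config = {name: values[0] for name, values in search_order}
    best = energy_of(config)
    evaluated = 1
    for name, values in search_order:
        for value in values[1:]:
            trial = dict(config)
            trial[name] = value
            energy = energy_of(trial)
            evaluated += 1
            if energy < best:
                config, best = trial, energy
            else:
                break  # assumed stop rule: no improvement, go to next parameter
    return config, best, evaluated


def toy_energy(cfg: Config) -> float:
    """Made-up energy surface for illustration only (not measured data)."""
    return (abs(cfg["size_kb"] - 4)
            + abs(cfg["line_b"] - 32) / 16
            + (cfg["assoc"] - 1) * 0.5
            + (0.25 if cfg["way_pred"] else 0.0))


SEARCH_ORDER = [
    ("size_kb", [2, 4, 8]),        # hypothetical candidate values
    ("line_b", [16, 32, 64]),
    ("assoc", [1, 2, 4]),
    ("way_pred", [False, True]),
]
```

Calling `tune(SEARCH_ORDER, toy_energy)` walks cache size first and way prediction last. The number of evaluations grows with the sum of the candidate counts rather than their product, which is the point of the heuristic.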

10 Implementing the Heuristic in Hardware
Figure: FSM and data path of the cache explorer. Diagram labels: input, hit energies, miss energies, static energies, hit num, miss num, exe time, multiplier, adder, register, mux, comparator, lowest energy, com_out, configure register, control, FSM.
Total size of the tuner: about 4,200 gates, or mm² in 0.18 micron CMOS technology.
Area overhead: compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead.
Power consumption: 2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5%.
Furthermore, the exploring hardware is used only during the exploring stage and can be shut down after the best configuration is determined.

11 Heuristic Time Complexity and Effectiveness
Time complexity:
  Searching the whole space: O(m x n x l x p)
  Heuristic: O(m + n + l + p)
  m: number of associativities, n: number of cache sizes, l: number of cache line sizes, p: way prediction on/off
Efficiency:
  On average, the heuristic searches 5 configurations instead of all 27.
  2 out of 19 benchmarks miss the lowest-power cache configuration.
  With a different search order (line size, associativity, way prediction, then cache size), 11 out of 19 benchmarks miss the best configuration.
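The gap between exhaustive and heuristic search can be made concrete with a small enumeration. The slides do not list the parameter values, so the space below is an assumption chosen to be consistent with the 27 configurations mentioned: three cache sizes whose allowed associativities depend on the size, three line sizes, and way prediction available only when the cache is set associative. All names and values are hypothetical.

```python
from itertools import product

# Assumed parameter space (not stated on the slides).
SIZE_TO_ASSOC = {8: [1, 2, 4], 4: [1, 2], 2: [1]}  # cache size in KB -> ways
LINE_SIZES = [16, 32, 64]                          # line size in bytes


def all_configs():
    """Enumerate every (size_kb, assoc, line_bytes, way_pred) combination."""
    configs = []
    for size_kb, assocs in SIZE_TO_ASSOC.items():
        for assoc, line in product(assocs, LINE_SIZES):
            # Assumption: way prediction only applies to set-associative caches.
            way_pred_options = [False, True] if assoc > 1 else [False]
            for way_pred in way_pred_options:
                configs.append((size_kb, assoc, line, way_pred))
    return configs


# Exhaustive search visits every configuration.
exhaustive_count = len(all_configs())

# The heuristic visits at most one pass per parameter: m + n + l + p.
heuristic_bound = 3 + 3 + 3 + 2
```

Under these assumptions the enumeration yields 27 configurations, while the heuristic's worst case is bounded by 11 evaluations; the slide reports about 5 on average.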

12 Energy Savings
On average, 40% energy reduction.
A conventional direct-mapped cache may consume unacceptable energy.
Figure annotation: 70% energy reductions.
Figure: energy savings when way concatenation, way shutdown, and cache line size concatenation are implemented. 100% stands for the energy consumption of a conventional four-way set-associative cache.
Legend: cnv: conventional cache; cfg: configurable cache; wc: way concatenation; ws: way shutdown; lc: line concatenation.
(C. Zhang, ACM TECS, to appear)

13 Conclusions
A highly configurable cache architecture:
  Reduces memory-access-related energy by 40% on average.
A self-tuning mechanism is proposed:
  A special cache parameter explorer.
  A heuristic algorithm to search the parameter space.
  Cache flushing is avoided.

