Fine-Grain CAM-Tag Cache Resizing Using Miss Tags

Fine-Grain CAM-Tag Cache Resizing Using Miss Tags
Michael Zhang and Krste Asanovic, {rzhang|krste}@lcs.mit.edu
ISLPED '02, August 12-14, Monterey, CA

Motivation
- Caches consume 30-60% of processor energy in embedded systems: 43% for the StrongARM-1, vs. 16% for the Alpha 21264.
- The working set of many applications is smaller than the cache size.
- The cache size can therefore be reduced adaptively to match the current working set.
- Deactivating unused portions of the cache circuitry saves both active power and leakage power.

Related Work
- Off-line techniques – Selective Ways: statically deactivates cache ways according to profiling information gathered before application execution. [Albonesi '99]
- On-line techniques – DRI-Cache: dynamically keeps the instruction cache miss rate under a preset bound. [Powell et al. '01]
- Line deactivation – Cache Decay: a per-cache-line counter tracks and turns off not-recently-used cache lines to reduce leakage. [Kaxiras et al. '01]
- Line deactivation – Adaptive Mode Control: deactivates only lines, not tags. [Zhou et al. '01] A hit in the tag of a deactivated line is a "sleep miss"; a miss in the tag is a "real miss". The ratio of sleep misses to real misses is used to adjust resizing intervals.
- Limit study: various design choices for RAM-tag caches are studied and compared in [Yang et al. '02].

Miss Tags – The Idea
[Diagram: resizable tag and data arrays alongside a non-resizable miss tag array]
- Miss tags: a second set of tags used as predictors.
- Fixed size; acts as the tag array of the full-sized cache.
- Checked only during a cache miss, to see whether the full-sized cache would have avoided the miss.
- Lies on the non-critical miss path, so it can be implemented with smaller, slower, non-leaky transistors.

Miss Tags – The Idea (hit in regular tags)
- If there are many hits in the regular tags, there are few accesses to the miss tags.
- The application has a small working set, so a smaller cache is probably sufficient.
- Hint to downsize the cache.

Miss Tags – The Idea (miss in regular tags, miss in miss tags)
- If an access misses in the regular tags and also misses in the miss tags, a larger cache is unlikely to help.
- Example: streaming applications with no temporal locality.
- Hint to downsize the cache.

Miss Tags – The Idea (miss in regular tags, hit in miss tags)
- If an access misses in the regular tags but hits in the miss tags, a larger cache is likely to help.
- Hint to upsize the cache. (A sketch covering all three cases follows.)
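
To make the three cases concrete, here is a minimal software model of the lookup logic. It is not from the paper: all identifiers are hypothetical, and the cache is modeled as direct-mapped over block addresses for brevity (the real design is a 32-way CAM-tag cache with FIFO replacement, described later).

/* Hypothetical toy model of the miss-tag decision logic above. */
#include <stdint.h>
#include <string.h>

#define FULL_LINES 1024u                 /* lines in the full-sized cache */

typedef struct {
    uint32_t tags[FULL_LINES];           /* regular (resizable) tag array */
    uint32_t miss_tags[FULL_LINES];      /* fixed, full-size duplicate tags */
    unsigned active_lines;               /* currently powered-on capacity */
    unsigned miss_tag_hits;              /* reset every resizing interval */
} mtr_cache_t;

void mtr_init(mtr_cache_t *c)
{
    memset(c->tags, 0xFF, sizeof c->tags);           /* all lines invalid */
    memset(c->miss_tags, 0xFF, sizeof c->miss_tags);
    c->active_lines  = FULL_LINES;
    c->miss_tag_hits = 0;
}

/* One access; `line` is the block address (byte address >> log2(line size)).
 * Returns 1 on a regular hit. */
int mtr_access(mtr_cache_t *c, uint32_t line)
{
    if (c->tags[line % c->active_lines] == line)
        return 1;                        /* hit: miss tags never consulted */

    /* Miss path only (off the critical path in hardware). */
    if (c->miss_tags[line % FULL_LINES] == line)
        c->miss_tag_hits++;              /* full-sized cache would have hit */
    /* else: the full-sized cache misses too, e.g. a streaming access */

    c->tags[line % c->active_lines] = line;  /* refill the resizable cache */
    c->miss_tags[line % FULL_LINES] = line;  /* keep the full-size image   */
    return 0;
}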

Resizing Illustration
[Plot: cache size vs. execution time, showing an initial period followed by resizing points at fixed intervals]
- Resizing point 1 – downsizing: # of miss tag hits < lower bound.
- Resizing point 2 – no action: lower bound <= # of miss tag hits <= upper bound.
- Resizing point 3 – upsizing: # of miss tag hits > upper bound.
- Measurement: # of miss tag hits per resizing interval.
- Parameters: resizing interval, upper bound, lower bound.
- Algorithm: adjust the cache size so that the # of miss tag hits stays within the upper and lower bounds in each resizing interval. (See the sketch below.)
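
A sketch of the per-interval decision, continuing the mtr_cache_t toy model above. The bounds [5, 10] echo the example given on the Performance slide; the step size and minimum size are illustrative values, not from the paper.

/* Called once per resizing interval (e.g. every 32K cycles). */
#define MIN_LINES   64u     /* never resize below this (assumed)       */
#define STEP        64u     /* lines enabled/disabled per decision     */
#define LOWER_BOUND 5u      /* bounds on miss tag hits per interval    */
#define UPPER_BOUND 10u

void mtr_resize_decision(mtr_cache_t *c)
{
    if (c->miss_tag_hits > UPPER_BOUND && c->active_lines < FULL_LINES)
        c->active_lines += STEP;    /* upsizing: a larger cache would help  */
    else if (c->miss_tag_hits < LOWER_BOUND && c->active_lines > MIN_LINES)
        c->active_lines -= STEP;    /* downsizing: working set already fits */
    /* otherwise the hit count is within bounds: no action */

    c->miss_tag_hits = 0;           /* start counting for the next interval */
}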

CAM-Tag Cache
- Popular among low-power processors: ARM3 ['89], StrongARM ['98], XScale ['01].
- Generally sub-banked: only one sub-bank is activated per access.
- Each sub-bank is a set; each line within a sub-bank is a way.
- All tags in the sub-bank are searched in parallel; the matched tag asserts the appropriate word line for the data read/write.
- Address split: | tag | bank | offset |. (See the sketch below.)
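
A toy model of the CAM-tag lookup, assuming the XScale-like geometry from the Experimental Setup slide (32 sub-banks of 32 ways each, 32-byte lines). The hardware compares all tags at once; the loop below only approximates that. All names are hypothetical.

#include <stdint.h>

#define NUM_BANKS  32u      /* 32 x 1 KB sub-banks = 32 KB            */
#define WAYS       32u      /* lines per sub-bank; each line is a way */
#define LINE_BYTES 32u

typedef struct {
    uint32_t tag[WAYS];
    uint8_t  valid[WAYS];
} subbank_t;

/* Address split: | tag | bank | offset | (offset selects bytes in line) */
int cam_lookup(const subbank_t banks[NUM_BANKS], uint32_t addr, unsigned *way)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t bank = line % NUM_BANKS;   /* only this sub-bank is activated */
    uint32_t tag  = line / NUM_BANKS;   /* broadcast to the sub-bank's CAM */

    for (unsigned w = 0; w < WAYS; w++) {       /* parallel in hardware */
        if (banks[bank].valid[w] && banks[bank].tag[w] == tag) {
            *way = w;       /* the matched tag asserts this word line */
            return 1;
        }
    }
    return 0;                                   /* no match: cache miss */
}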

MTR with CAM-Tag Cache
[Diagram: sub-bank structure with resizable tag and data arrays and a non-resizable miss tag array]
- MTR cache sub-bank configuration: each sub-bank has 8 equal partitions.
- For upsizing, an entire partition is turned on; for downsizing, only the last active line is turned off (conservative downsizing).
- Sub-bank resizing: each sub-bank is resized individually, with resizing spaced out evenly in time, so there is no burst of dirty writebacks. (A sketch of the resize operations follows.)
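
The asymmetric resize operations might look as follows for one sub-bank. A sketch only, reusing WAYS from the lookup model above; PARTITIONS is from this slide.

/* Per-sub-bank resizing: 8 equal partitions of WAYS/8 lines each.
 * `active` counts the powered-on lines in this sub-bank. */
#define PARTITIONS 8u
#define PART_LINES (WAYS / PARTITIONS)   /* 4 lines per partition here */

void subbank_upsize(unsigned *active)
{
    if (*active < WAYS)      /* turn on the whole next partition */
        *active = (*active / PART_LINES + 1) * PART_LINES;
}

void subbank_downsize(unsigned *active)
{
    if (*active > 1)         /* conservative: one line at a time     */
        (*active)--;         /* (write the line back first if dirty) */
}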

Hardware Modifications
- The miss tags: accessed only during a miss, so not on the critical path. Can be implemented using slow, non-leaky transistors and alternative serial/parallel RAM/CAM structures.
- Turning off cache lines (leakage reduction) – Gated-Vdd: adds a stacked N-type transistor to reduce leakage energy. [Powell et al. '01]
- Leakage-biased bitlines (leakage reduction): turns off the precharge of CAM/RAM bitlines and CAM match lines; automatically biases the voltage to minimize leakage. [Heo et al. '02]
- Hierarchical bitlines (active energy reduction): used to turn off portions of the cache block to reduce active energy. [Ghose & Kamble '99]
- Minimal cycle time impact: < 1.5% (from Gated-Vdd); no cycle time penalty without Gated-Vdd.
- Small area impact: ~10%, depending on implementation.

Hardware Modifications Cont'd
[Circuit diagrams: leakage-biased bitlines, Gated-Vdd, hierarchical bitlines]

Experimental Setup
- Modified SimpleScalar 3.0 simulator; single-issue, in-order processor.
- Baseline cache similar to the Intel XScale: 32 KB implemented as 32 1 KB sub-banks, 32-way set-associative with 32-byte cache lines, FIFO replacement policy per sub-bank.
- Benchmarks: SPECint2000 and SPECfp2000, 1.5 billion cycles of reference inputs.
- A baseline resizing scheme implemented for performance comparison: the current miss rate is compared against a fixed threshold; if above, upsize, otherwise downsize. Similar to DRI-Cache. (See the sketch below.)
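
For reference, the baseline policy reduces to a threshold test. A sketch under the same toy model, reusing FULL_LINES, MIN_LINES, and STEP from the earlier sketches; the threshold value is assumed, since the slides only say it is preset.

/* Baseline (DRI-Cache-like) decision, called once per interval. */
void baseline_resize_decision(unsigned misses, unsigned accesses,
                              unsigned *active_lines)
{
    const double THRESHOLD = 0.01;   /* preset miss rate (value assumed) */
    double miss_rate = accesses ? (double)misses / accesses : 0.0;

    if (miss_rate > THRESHOLD && *active_lines < FULL_LINES)
        *active_lines += STEP;       /* miss rate too high: upsize */
    else if (*active_lines > MIN_LINES)
        *active_lines -= STEP;       /* otherwise downsize */
}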

Dynamic Resizing Illustration
[Plots: D-cache miss rate vs. time; D-cache active size vs. time]
Miss rate and average active cache size obtained by applying MTR to our D-cache.

Small Working Set Example
The working set is very small, so a small cache is sufficient.

Low Temporal Locality Examples
High miss rate and no temporal locality: a small cache performs similarly to a large cache.

Adaptive Examples

CPI Comparison
- Each MTR data point (active cache size, CPI) is obtained by varying the resizing interval and the upper and lower bounds.
- Each baseline data point (active cache size, CPI) is obtained by varying the preset miss rate threshold.
- For optimal results with the MTR cache, parameters are picked to yield an average active I-cache size of 12 KB and D-cache size of 8 KB.

Energy Savings
[Plots: four panels spanning L2 = 16x L1 / leakage = 50% of total, through L2 = 128x L1 / leakage = 0% of total]
- Sensitivity analysis with two factors: the percentage of leakage energy in total energy (0% to 50%), and the L2 refill energy in multiples of L1 access energy (16x to 128x).
- Energy figures from simulation of extracted layout in TSMC 0.25 µm technology; writeback energy is included.

Performance
- Performance is affected by the ratio of the upper and lower bounds to the resizing interval: bounds of [5, 10] over a 32K-cycle interval behave roughly like [10, 20] over a 64K-cycle interval.
- Performance is also affected by the length of the resizing interval: a large interval yields fewer writebacks, but if the interval is too large, the cache loses its ability to resize; if it is too small, the cache size thrashes.
- Performance is consistent across all benchmarks: the duplicated tags act as a predictor for each benchmark, making parameter tuning easy.

MTR – A Summary
- The CAM-tag cache offers very fine-grained resizing: one cache line at a time, which avoids writeback bursts.
- The miss tag resizing algorithm resizes dynamically with no delay overhead: resizing operations fall completely on the miss path.
- Resizing is determined by the difference between the actual miss rate and the predicted miss rate of the full-sized cache.
- Resizing parameters can be tuned to work well for all benchmarks: no need for application-specific parameter tuning.
- Reduces both active and leakage energy.

Conclusion
- Proposed MTR, a dynamic cache resizing technique for CAM-tag caches.
- Uses a fixed-size duplicate tag array to track the miss rate of the full-sized cache.
- Negligible delay overhead (accessed only on a miss) and negligible energy overhead (slow, non-leaky transistors used).
- Achieves 28% to 56% energy savings for the D-cache and 34% to 49% for the I-cache, depending on the operating point.
- Minimal cycle time impact (< 1.5%) and small area impact (~10%).