Increasing the Cache Efficiency by Eliminating Noise
Philip A. Marshall

Outline
- Background
- Motivation for Noise Prediction
- Concepts of Noise Prediction
- Implementation of Noise Prediction
- Related Work
- Prefetching
- Data Profiling
- Conclusion

Background
- Cache fetch
  - On cache miss
  - Prefetch
- Exploiting spatial locality
  - Cache words are fetched in blocks
  - Neighboring block(s) are fetched on a cache miss
  - Results in fewer cache misses
  - Fetches words that aren't needed

Background
- Cache noise
  - Words that are fetched into the cache but never used
- Cache utilization
  - The fraction of words in the cache that are used
  - Represents how efficiently the cache is used
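In symbols, restating the definition above:

    \text{utilization} \;=\; \frac{\#\text{ fetched words that are eventually used}}{\#\text{ words fetched into the cache}}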

Motivation for Noise Prediction
- Level 1 data cache utilization is ~57% for the SPEC2K benchmarks [2]
- Fetching unused words:
  - Increases bandwidth requirements between cache levels
  - Increases hardware and power requirements
  - Wastes valuable cache space

[2] D. Burger et al., "Memory bandwidth limitations of future microprocessors," Proc. ISCA-23, 1996.

Motivation for Noise Prediction
- Cache block size
  - Larger blocks:
    - Exploit spatial locality better
    - Reduce cache tag overhead (worked example below)
    - Increase bandwidth requirements
  - Smaller blocks:
    - Reduce cache noise
- Any fixed block size results in suboptimal performance
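A worked example of the tag-overhead point; the cache size here is invented for illustration, not taken from the slides. Since the cache holds one tag per block, halving the block size doubles the number of tags:

    \#\text{tags} \;=\; \frac{\text{cache size}}{\text{block size}}, \qquad \frac{16\,\text{KB}}{64\,\text{B}} = 256 \quad\text{vs.}\quad \frac{16\,\text{KB}}{32\,\text{B}} = 512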

Motivation for Noise Prediction
- Sub-blocking
  - Only portions of a cache block are fetched
  - Decreases tag overhead by associating one tag with many sub-blocks
  - Fetched words must lie in contiguous sub-blocks of fixed size
  - High miss rate and cache noise for non-contiguous access patterns

Motivation for Noise Prediction
- By predicting which words will actually be used, cache noise can be reduced
- But: fetching fewer words could increase the number of cache misses

Concepts of Noise Prediction
- Selective fetching
  - For each block, fetch only the words that are predicted to be accessed
  - If no prediction is available, fetch the entire block
  - Uses a valid bit and a usage bit per word to track which words are present and which have been used (see the sketch below)
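A minimal sketch of that bookkeeping in C, assuming 8 words per block and an 8-bit mask per block; the names (block_bits, fill_block, access_word) are illustrative, not taken from the paper:

    /* Per-block bookkeeping for selective fetching. */
    #include <stdint.h>
    #include <stdio.h>

    #define WORDS_PER_BLOCK 8

    typedef struct {
        uint8_t valid; /* bit i set => word i was fetched and is present */
        uint8_t used;  /* bit i set => word i has been accessed since the fill */
    } block_bits;

    /* On a fill, fetch only the predicted words; with no prediction, fetch all. */
    void fill_block(block_bits *b, uint8_t predicted_mask) {
        b->valid = predicted_mask ? predicted_mask : 0xFF;
        b->used = 0;
    }

    /* On an access, a clear valid bit means the word was left behind: a miss. */
    int access_word(block_bits *b, unsigned word) {
        uint8_t bit;
        if (word >= WORDS_PER_BLOCK)
            return 0;
        bit = (uint8_t)(1u << word);
        if (!(b->valid & bit))
            return 0;      /* miss: the predictor treated this word as noise */
        b->used |= bit;    /* usage history, recorded for the predictor */
        return 1;          /* hit */
    }

    int main(void) {
        block_bits b;
        fill_block(&b, 0x0F);  /* prediction: only the low four words */
        printf("word 2: %s\n", access_word(&b, 2) ? "hit" : "miss");
        printf("word 6: %s\n", access_word(&b, 6) ? "hit" : "miss");
        return 0;
    }

The used mask is what feeds the predictor's usage history when the block is eventually evicted.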

Concepts of Noise Prediction
- Cache noise predictors
  - Phase Context Predictor (PCP): based on the usage pattern of the most recently evicted block
  - Memory Context Predictor (MCP): based on the MSBs of the memory address
  - Code Context Predictor (CCP): based on the MSBs of the PC
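One plausible mapping of the three contexts onto a prediction-table index, sketched in C; the table size and the MSB field widths are assumptions, not the paper's parameters:

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_ENTRIES 128u /* assumed predictor table size */

    /* PCP: indexed by the usage pattern of the most recently evicted block. */
    unsigned pcp_index(uint8_t evicted_usage) {
        return evicted_usage % TABLE_ENTRIES;
    }

    /* MCP: indexed by high-order bits of the missing memory address. */
    unsigned mcp_index(uint32_t addr) {
        return (addr >> 20) % TABLE_ENTRIES; /* assumed MSB field */
    }

    /* CCP: indexed by high-order bits of the PC of the missing load. */
    unsigned ccp_index(uint32_t pc) {
        return (pc >> 20) % TABLE_ENTRIES;   /* assumed MSB field */
    }

    int main(void) {
        printf("PCP %u, MCP %u, CCP %u\n",
               pcp_index(0x0F), mcp_index(0x80401000u), ccp_index(0x00402190u));
        return 0;
    }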

Concepts of Noise Prediction
- Prediction table size
  - Larger tables decrease the probability of "no prediction"
  - Smaller tables use less power
- A prediction is considered successful if all the needed words are fetched
  - Fetching extra words still counts as a success
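That success criterion is a subset test on per-word bit masks: the prediction succeeds when every needed word was fetched, even if extra words came along. A one-function C restatement (8-bit masks assumed); this fragment compiles as a standalone translation unit:

    #include <stdint.h>

    /* Success iff needed is a subset of fetched; extra fetched words add
       noise, not misses. */
    int prediction_success(uint8_t fetched_mask, uint8_t needed_mask) {
        return (fetched_mask & needed_mask) == needed_mask;
    }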

Concepts of Noise Prediction
- Improving prediction
  - Miss Initiator Based History (MIBH)
    - Keep separate histories according to which word in the block caused the miss
    - Improves predictability when the relative positions of the accessed words are fixed
    - Example: looping through an array of structs and accessing only one field
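A sketch of how MIBH could be organized: the history is selected by both the predictor context and the offset of the miss-initiating word. The table dimensions are assumptions:

    #include <stdint.h>
    #include <stdio.h>

    #define CONTEXTS        128 /* assumed number of predictor contexts */
    #define WORDS_PER_BLOCK 8

    /* One usage-pattern history per (context, miss-initiating word). */
    static uint8_t history[CONTEXTS][WORDS_PER_BLOCK];

    uint8_t mibh_predict(unsigned ctx, unsigned miss_word) {
        return history[ctx][miss_word];           /* 0 = no prediction yet */
    }

    void mibh_update(unsigned ctx, unsigned miss_word, uint8_t observed_usage) {
        history[ctx][miss_word] = observed_usage; /* recorded at eviction */
    }

    int main(void) {
        mibh_update(3, 1, 0x22); /* words 1 and 5 were used in this block */
        printf("prediction: 0x%02X\n", mibh_predict(3, 1));
        return 0;
    }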

Concepts of Noise Prediction
- Improving prediction
  - OR-ing Previous Two Histories (OPTH)
    - Increases predictability by looking at more than the most recent access
    - Reduces cache utilization
    - OR-ing more than two accesses reduces utilization substantially
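A sketch of an OPTH table entry: it keeps the two most recent usage patterns and predicts their union, which is why coverage rises while utilization drops (the entry layout is an assumption; this fragment compiles as a standalone translation unit):

    #include <stdint.h>

    typedef struct {
        uint8_t last; /* most recent usage pattern */
        uint8_t prev; /* the pattern before it */
    } opth_entry;

    /* Predict the union of the two most recent patterns. */
    uint8_t opth_predict(const opth_entry *e) {
        return (uint8_t)(e->last | e->prev);
    }

    void opth_update(opth_entry *e, uint8_t observed_usage) {
        e->prev = e->last;
        e->last = observed_usage;
    }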

Results
- Empirically, CCP provides the best results
- MIBH greatly increases predictability
- OPTH improves predictability only marginally while increasing cache noise
- Cache utilization increased from ~57% to ~92%

Related Work
- Existing work focuses on reducing cache misses, not on improving utilization
- Sub-blocked caches are used mainly to decrease tag overhead
- Some existing work predicts which sub-blocks to load in a sub-blocked cache
- No existing techniques predict and fetch non-contiguous words

Prefetching
- Prefetching reduces the cache miss rate
- Commonly, prefetching is implemented by also fetching the next block on a cache miss
- Prefetching increases cache noise and bandwidth requirements

Prefetching
- Noise prediction leads to more intelligent prefetching but requires extra hardware (see the sketch below)
- On average, prefetching with noise prediction leads to less energy consumption
- In the worst case, energy requirements increase
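One way to read "more intelligent prefetching" as code: on a miss, fill the missed block and prefetch its neighbor, but move only the words the predictor expects to be used. This is an interpretation for illustration, not the paper's design; predict_mask and fetch_words are invented stubs:

    #include <stdint.h>
    #include <stdio.h>

    /* Invented stub: a real design would consult the noise-predictor table. */
    uint8_t predict_mask(uint32_t block_num) {
        (void)block_num;
        return 0x0F; /* pretend: low four words predicted useful */
    }

    /* Invented stub: a real design would issue the bus transfer. */
    void fetch_words(uint32_t block_num, uint8_t mask) {
        printf("fetch block %u, word mask 0x%02X\n", (unsigned)block_num, mask);
    }

    /* Demand fill plus next-block prefetch, both filtered by the predictor. */
    void miss_with_prefetch(uint32_t block_num) {
        uint8_t m0 = predict_mask(block_num);
        uint8_t m1 = predict_mask(block_num + 1);
        fetch_words(block_num,     m0 ? m0 : 0xFF); /* demand fill */
        fetch_words(block_num + 1, m1 ? m1 : 0xFF); /* prefetch */
    }

    int main(void) {
        miss_with_prefetch(42);
        return 0;
    }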

Data Profiling
- For some benchmarks, few predictions are made
  - The predictor table is too small to hold all the word-usage histories
- Instead of increasing the table size, profile the data
- Profiling increases the prediction rate by ~7%
- Gains aren't as high as expected

Analysis of Noise Prediction
- Pros
  - Small increase in miss rate (0.1%)
  - Decreased power requirements in most cases
  - Decreased bandwidth requirements between cache levels
  - Adapts the effective block size to access patterns
  - Dynamic technique, but profiling can be used
  - Scalable to different predictor sizes

Analysis of Noise Prediction
- Cons
  - Increased hardware overhead
  - Increased power in the worst case
  - Not all programs benefit
  - Profiling provides limited improvement

Other Thoughts
- How were the benchmarks chosen?
  - 6 of 12 integer and 8 of 14 floating-point SPEC2K benchmarks were used
- Not all predictors were examined equally
  - A 22-bit MCP performed slightly worse than a 28-bit CCP
  - What about a 28-bit MCP?
- How can the efficiency of the prediction table be increased?