Program Phase Directed Dynamic Cache Way Reconfiguration

Presentation transcript:

Program Phase Directed Dynamic Cache Way Reconfiguration. Subhasis Banerjee, Surendra G, S. K. Nandy. Presented by: Xin Guan, Mar. 29, 2010

Outline: Introduction to Program Phase, Hardware Phase Detector, Cache Reconfiguration, Experiment Results

Outline: Introduction to Program Phase, Hardware Phase Detector, Cache Reconfiguration, Experiment Results

Program Phase What is a program phase? Informally, a phase is a period of execution whose characteristics are qualitatively different from those of the neighboring periods. How do we detect phases? Phase boundaries can be marked by the instruction stream (e.g., certain code sections), the data stream (e.g., data access patterns), or asynchronous external events (e.g., an incoming message).

Program Phase Mpeg2decode phase profile in terms of IPC (instructions per cycle), ROB (reorder buffer) occupancy, and issue rate.

Program Phase Conflict misses arise when the number of cache ways is insufficient, and program locality generates recurring conflict-miss patterns. In some cases, increasing cache associativity does not yield much performance improvement, so the associativity can instead be reconfigured downward to save power.

Outline: Introduction to Program Phase, Hardware Phase Detector, Cache Reconfiguration, Experiment Results

Hardware Phase Detector The tag identifies the cache block involved in a conflict miss; a per-set counter records how many conflict misses occur. (Figure: a counter array with one entry per cache set.)

Hardware Phase Detector Counting starts at the beginning of every interval; the per-set conflict-miss counts collected over the interval form an interval vector, which is then normalized. (Figure: interval vector built from the per-set counter values.)
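A minimal C++ sketch of this counting step, assuming one 64-bit counter per cache set and normalization by the interval's total conflict-miss count. The slide only says "Normalization", so that choice, and every name below, is illustrative rather than the paper's design.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class IntervalVectorCollector {
public:
    explicit IntervalVectorCollector(std::size_t num_sets) : counters_(num_sets, 0) {}

    // The cache model calls this whenever a conflict miss is detected in `set`.
    void recordConflictMiss(std::size_t set) { ++counters_[set]; }

    // Called at the end of every sampling interval: snapshot the per-set
    // counts, normalize them so intervals with different miss volumes are
    // comparable, then reset the counters for the next interval.
    std::vector<double> endInterval() {
        uint64_t total = 0;
        for (uint64_t c : counters_) total += c;

        std::vector<double> vec(counters_.size(), 0.0);
        if (total > 0) {
            for (std::size_t i = 0; i < counters_.size(); ++i)
                vec[i] = static_cast<double>(counters_[i]) / static_cast<double>(total);
        }
        std::fill(counters_.begin(), counters_.end(), 0);
        return vec;
    }

private:
    std::vector<uint64_t> counters_;  // one conflict-miss counter per cache set
};
```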

Hardware Phase Detector Clustering

Hardware Phase Detector Clustering: if the minimum distance between a new interval vector and the existing cluster centroids is below a threshold (e.g., d3 in the figure), the vector is assigned to that cluster, i.e., to the same phase; otherwise it starts a new cluster. (Figure: interval vectors x and y with pairwise distances d1, d2, d3.)

Hardware Phase Detector Phase History Table: every cluster corresponds to a phase.

Phase ID | Phase vector                    | Way configuration
#1       | Geometric centroid of cluster 1 | 2-way set associative
#2       | Geometric centroid of cluster 2 | 4-way set associative
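The clustering step and phase history table could look roughly like the sketch below. The Manhattan distance metric, the running-centroid update, and the default way configuration are assumptions; the slides only state that a vector joins the nearest cluster when the minimum distance is below the threshold, and that each phase entry stores a centroid and a way configuration.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Phase {
    std::vector<double> centroid;   // geometric centroid of the cluster
    int num_ways = 0;               // way configuration chosen for this phase
    std::size_t members = 1;        // interval vectors merged into this cluster
};

class PhaseHistoryTable {
public:
    PhaseHistoryTable(double threshold, int default_ways)
        : threshold_(threshold), default_ways_(default_ways) {}

    // Classify an interval vector: return the ID of the nearest existing
    // phase if it is within the threshold, otherwise create a new phase.
    std::size_t classify(const std::vector<double>& v) {
        std::size_t best = 0;
        double best_dist = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < phases_.size(); ++i) {
            double d = manhattan(v, phases_[i].centroid);
            if (d < best_dist) { best_dist = d; best = i; }
        }
        if (!phases_.empty() && best_dist < threshold_) {
            updateCentroid(phases_[best], v);
            return best;
        }
        Phase p;
        p.centroid = v;
        p.num_ways = default_ways_;
        phases_.push_back(p);
        return phases_.size() - 1;
    }

    int waysFor(std::size_t phase_id) const { return phases_[phase_id].num_ways; }
    void setWaysFor(std::size_t phase_id, int ways) { phases_[phase_id].num_ways = ways; }

private:
    static double manhattan(const std::vector<double>& a, const std::vector<double>& b) {
        double d = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
        return d;
    }
    static void updateCentroid(Phase& p, const std::vector<double>& v) {
        ++p.members;  // incremental (running) mean of the cluster members
        for (std::size_t i = 0; i < p.centroid.size(); ++i)
            p.centroid[i] += (v[i] - p.centroid[i]) / static_cast<double>(p.members);
    }

    double threshold_;
    int default_ways_;
    std::vector<Phase> phases_;
};
```

Here waysFor/setWaysFor stand in for the "Way configuration" column of the table above.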

Hardware Phase Detector Number of phases vs. threshold: according to the experiments, with a threshold of 1.1 most benchmarks exhibit about 8 phases in total.

Outline: Introduction to Program Phase, Hardware Phase Detector, Cache Reconfiguration, Experiment Results

Phase Directed Reconfiguration Architecture: every 2 million instructions the interval vector is computed and the current phase is identified. The phase ID is fed into the cache controller, which decides the way configuration; way-select signals enable or disable the pre-charge and sense-amplifier logic of each way.
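A sketch of how these pieces could be tied together at the 2-million-instruction boundary, reusing the illustrative IntervalVectorCollector and PhaseHistoryTable classes from the earlier sketches (the function and variable names are assumptions, not the paper's):

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kIntervalInstructions = 2000000;  // "every 2 million instructions"

// `active_ways` stands in for the way-select signals the cache controller drives.
void onInstructionCommit(uint64_t committed_instructions,
                         IntervalVectorCollector& collector,
                         PhaseHistoryTable& pht,
                         int& active_ways) {
    if (committed_instructions == 0 ||
        committed_instructions % kIntervalInstructions != 0)
        return;
    std::vector<double> vec = collector.endInterval();  // snapshot + normalize
    std::size_t phase = pht.classify(vec);              // find or create the phase
    active_ways = pht.waysFor(phase);                   // becomes the way configuration
}
```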

Phase Directed Reconfiguration Algorithm: if the miss rate is low enough, shut down one way to save power; if the miss rate is too high, enable one more way.
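A hedged sketch of that adjustment rule; the threshold values, bounds, and names are placeholders rather than the paper's parameters.

```cpp
struct WayTuner {
    int min_ways = 1;
    int max_ways = 4;              // assuming a 4-way set-associative L1 D-cache
    double low_miss_rate = 0.01;   // "low enough" threshold (illustrative value)
    double high_miss_rate = 0.05;  // "too high" threshold (illustrative value)

    // Called with the miss rate observed over the last interval; returns the
    // number of ways to enable for the next interval.
    int adjust(int current_ways, double miss_rate) const {
        if (miss_rate < low_miss_rate && current_ways > min_ways)
            return current_ways - 1;  // miss rate low enough: shut down one way
        if (miss_rate > high_miss_rate && current_ways < max_ways)
            return current_ways + 1;  // miss rate too high: enable one more way
        return current_ways;
    }
};
```

The adjusted way count would then be written back into the phase's history-table entry (e.g., via setWaysFor in the earlier sketch) so the same configuration is reapplied whenever that phase recurs.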

Disabled Cache Ways Coherency: valid cache blocks must remain accessible for future references, and data residing in disabled cache ways must stay coherent when the disabled way is enabled again. Three approaches: 1. Flush the disabled way. 2. Fill buffer. 3. Victim buffer.

Disabled Cache Ways Fill Buffer: a fill buffer moves data from the disabled way into an enabled way, at the cost of several penalty cycles. (Figure: blocks x, y, z migrating from the disabled way to an enabled way.)

Disabled Cache Ways Victim Buffer: instead of moving data from the disabled way to an enabled way, the data can be stored in a victim buffer. This approach is adopted in this implementation: it is simple and avoids frequent activation of the pre-charge and sense-amplifier logic. (Figure: blocks x, y, z moved into the victim buffer.)
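A small sketch of the victim-buffer idea under the same caveats: the structure, FIFO replacement, and capacity are illustrative, since the slide does not specify the buffer organization.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// One entry per block moved out of a disabled way; a real design would also
// keep the data payload and a dirty bit.
struct VictimEntry {
    uint64_t tag;
};

class VictimBuffer {
public:
    explicit VictimBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Called when a way is disabled: its valid blocks are pushed into the
    // buffer (FIFO replacement once the buffer is full).
    void insert(uint64_t tag) {
        if (entries_.size() == capacity_) entries_.pop_front();
        entries_.push_back(VictimEntry{tag});
    }

    // Probed after a miss in the enabled ways; a hit here supplies the block
    // without re-activating the disabled way's pre-charge and sense-amplifier
    // logic.
    bool probe(uint64_t tag) const {
        for (const VictimEntry& e : entries_) {
            if (e.tag == tag) return true;
        }
        return false;
    }

private:
    std::size_t capacity_;
    std::deque<VictimEntry> entries_;
};
```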

Outline: Introduction to Program Phase, Hardware Phase Detector, Cache Reconfiguration, Experiment Results

Experiment Results Average saving of 32% of L1 data cache power with an almost negligible loss of performance.

Conclusion A hardware program phase detector drives dynamic cache way reconfiguration, saving on average 32% of L1 data cache power with negligible performance degradation.

Questions?