Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim. Proceedings of the Design, Automation, and Test in Europe Conference (DATE'07), April 2007. Presented 2015/4/17.

Presentation transcript:


Abstract

Power consumption is of crucial importance to embedded systems. In such systems, the instruction memory hierarchy consumes a large portion of the total energy. A well-designed instruction memory hierarchy can greatly decrease energy consumption and increase performance. The performance of the instruction memory hierarchy is largely determined by the specific application: different applications achieve better energy-performance with different configurations of the instruction memory hierarchy. Moreover, applications often exhibit different phases during execution, each placing different demands on the processor and, in particular, the instruction memory hierarchy. For a given hardware resource budget, even better energy-performance may be achievable if the memory hierarchy can be reconfigured before each of these phases.

Abstract (cont.)

In this paper, we propose a new dynamically reconfigurable instruction memory hierarchy that takes advantage of these two characteristics to achieve significant energy-performance improvement. Our proposed instruction memory hierarchy, which we call DRIM, consists of four banks of on-chip instruction buffers. Each of these can be configured to function as a cache or as a scratchpad memory (SPM) according to the needs of an application and its execution phases. Our experimental results using six benchmarks from the MediaBench and MiBench suites show that DRIM can achieve significant energy reduction.

What's the Problem

- The instruction delivery system constitutes a significant portion of processor energy consumption, since instructions are fetched almost every cycle.
- Scratchpad memory (SPM) is more energy efficient than cache.
  - However, existing work on instruction SPM does not consider the phased behavior of applications during execution.

Related Works

- Reduce energy consumption in I-caches:
  - Shut down cache ways
  - Reconfigurable cache that adapts to the application [18, 1]
- Use pure SPM or hybrid SPM/cache architectures:
  - Static mapping of instructions into SPM [16, 9] (static architecture with static mapping)
  - Dynamic instruction replacement for SPM [7, 4, 14] (static architecture with dynamic instruction replacement)
  - Reconfigure the memory hierarchy (cache/SPM) for a given application [11, 15] (static architecture exploration with static mapping)
  - Dynamically reconfigurable data memory with cache and SPM [6]
- This paper: a dynamically reconfigurable instruction memory with cache and SPM
  - Dynamic architecture tuning (adapts to phases during execution)
  - Reconfiguration management algorithm
  - Reduces energy and instruction conflicts

Idea of the Dynamically Reconfigurable Instruction Memory (DRIM)

- Reconfigure the instruction memory architecture at runtime.
- Exploit the differing requirements between phases within an application.
- The four banks can be dynamically reconfigured as cache or SPM.

DRIM Architecture – Part 1

- Based on a four-way set-associative cache: the four banks are configured dynamically as cache or SPM.
- A configuration bit c_i is set to 1 when bank i is used as SPM.
- The tag bank is gated when its bank is configured as SPM.

DRIM Architecture – Part 2

- To decide whether an instruction resides in SPM, the upper- and lower-bound addresses of the instruction block residing in SPM are checked.
- Loading from main memory to SPM is performed by dedicated logic.
- Address bits [a9:a8] generate the data-bank selection signal D_i. Assuming each data bank is 256 bytes, the four banks cover the ranges 0x000–0x0FF, 0x100–0x1FF, 0x200–0x2FF, and 0x300–0x3FF.
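As a rough sketch (function names are illustrative, not from the paper), the SPM-residency check and data-bank selection described on this slide could be modeled as:

```python
# Illustrative model of DRIM's SPM-residency check and data-bank selection.
# Bank size and bit positions follow the slide's example: 256-byte banks,
# with address bits [9:8] selecting among the four banks.

BANK_SIZE = 256  # bytes per data bank

def spm_hit(addr, spm_lower, spm_upper):
    """An access hits SPM if it falls within the bounds of the
    instruction block currently loaded into SPM."""
    return spm_lower <= addr <= spm_upper

def bank_select(addr):
    """Bits [9:8] of the address choose one of the four data banks (D_i)."""
    return (addr >> 8) & 0x3

# Example: SPM currently holds the block 0x100-0x1FF.
assert spm_hit(0x150, 0x100, 0x1FF)
assert not spm_hit(0x250, 0x100, 0x1FF)
assert bank_select(0x150) == 1  # falls in the second bank's range
```

In hardware this is a pair of comparators plus a 2-bit slice of the address; the sketch only makes the decision procedure explicit.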

DRIM Architecture – Part 3

- The SPM_hit signal controls the gating of the tag and data banks (a bank-enable signal of 1 enables the bank; 0 disables it).
- Tag banks: if SPM_hit, all tag banks are gated; otherwise, only the tag banks configured as cache are searched.
- Data banks: if SPM_hit, the SPM bank is selected by D_i; otherwise, only the data banks configured as cache are searched.
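The gating rules above amount to a small decision procedure per fetch. A minimal sketch (names are hypothetical; the real logic is combinational hardware):

```python
# Sketch of the per-fetch bank-gating decision in DRIM.

def banks_to_activate(spm_hit, is_spm_bank, selected_bank):
    """Return (tag_enables, data_enables) for the four banks.

    is_spm_bank[i] is True when bank i is configured as SPM;
    selected_bank is the SPM bank chosen by D_i on an SPM hit.
    """
    if spm_hit:
        tag = [False] * 4                              # all tag banks gated
        data = [i == selected_bank for i in range(4)]  # only the SPM bank
    else:
        tag = [not spm for spm in is_spm_bank]   # search cache tag banks only
        data = [not spm for spm in is_spm_bank]  # search cache data banks only
    return tag, data

# Example: banks 0 and 1 are SPM; an SPM hit to bank 1 activates
# only data bank 1 and no tag banks.
tag, data = banks_to_activate(True, [True, True, False, False], 1)
assert tag == [False, False, False, False]
assert data == [False, True, False, False]
```

On a miss in SPM, only the banks still acting as cache are searched, which is where the energy saving over a conventional 4-way lookup comes from.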

Compiler Support for Dynamic Reconfiguration and Instruction Loading

1. Gather the required execution statistics: execution counts of CFG edges and the number of procedure invocations.
2. Optimize the instruction layout within each procedure: bring the frequently executed basic blocks together.
3. With the optimized instruction layout, determine the architectural configuration for the different phases: when and what to reconfigure, and which instructions to allocate to SPM.
4. Generate the code chunks and load them into SPM: group instruction blocks for SPM, and insert instructions for reconfiguration and for trace loading.

Preface to Reconfiguration and Instruction Allocation

- A Loop Procedure Hierarchy Graph (LPHG) represents the program, capturing all loops, procedure calls, and their relations.
- Most of the energy consumed by instruction fetch is assumed to occur inside loops:
  - If a loop's iteration count exceeds a threshold, it is beneficial to place it in SPM.
  - Deeper loops in the LPHG have higher execution frequency, so allocation starts from leaf loops and proceeds to their parent loops.
  - If a loop is larger than the SPM, the cache is used to buffer the rest of the loop.

Algorithm for Reconfiguration and Instruction Allocation

- Leaf node: allocate the frequently executed instructions inside the loop to SPM.
- Internal node: evaluate whether it is beneficial to allocate more SPM space from the free banks.
- When a parent loop is allocated, delete all reconfiguration points inserted in its child loops and add a new reconfiguration point at the entry of the loop, since only one code chunk can reside in SPM at a time.
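A simplified sketch of this bottom-up traversal, with made-up names and thresholds (the paper's actual evaluation function also models cache conflicts, and bank accounting is simplified here: children's banks are not reclaimed when the parent takes over):

```python
# Simplified sketch of the bottom-up LPHG allocation pass.

ITER_THRESHOLD = 100   # assumed iteration-count threshold
BANK_SIZE = 256        # bytes per SPM bank

class Loop:
    def __init__(self, name, size, iterations, children=()):
        self.name = name
        self.size = size              # bytes of frequently executed code
        self.iterations = iterations
        self.children = list(children)
        self.reconfig_point = False   # reconfiguration inserted at loop entry?

def allocate(loop, free_banks):
    """Walk the LPHG from the leaves up, deciding SPM allocation."""
    for child in loop.children:
        free_banks = allocate(child, free_banks)
    banks_needed = -(-loop.size // BANK_SIZE)  # ceiling division
    if loop.iterations > ITER_THRESHOLD and banks_needed <= free_banks:
        # Only one code chunk can reside in SPM at a time, so the
        # parent's reconfiguration point replaces its children's.
        for child in loop.children:
            child.reconfig_point = False
        loop.reconfig_point = True
        free_banks -= banks_needed
    return free_banks

# Hypothetical two-level nest: the hot outer loop subsumes the inner one.
inner = Loop("inner", size=200, iterations=1000)
outer = Loop("outer", size=400, iterations=500, children=[inner])
allocate(outer, free_banks=4)
assert outer.reconfig_point and not inner.reconfig_point
```

The key structural point the sketch preserves is that a parent's allocation cancels its children's reconfiguration points, matching the slide's note that only one code chunk can reside in SPM.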

Example of How to Evaluate Conflicts

The evaluation function considers a configuration beneficial when reducing the cache size does not severely increase the I-cache miss rate. In the example LPHG, the number inside each circle is the loop's iteration count, the number beside it is the loop size, and each bank has size 64 in the example's units.

1. Configure one bank as SPM and allocate it to loop E: the total size of the remaining cache banks (64x3) is larger than each of loops B, C, and D, so there is no conflict (safe).
2. Configure one more bank as SPM and move loop D into it: the remaining cache (64x2) is still larger than each of B and C (safe).
3. Configure one more bank as SPM and move loop B into it: the remaining cache (64x1) is smaller than loop C, causing severe cache conflicts.

Optimization: Hoist the Reconfiguration Position

- Goal: reduce the number of reconfigurations.
- If a loop does not have any sibling loops, hoist the reconfiguration point from the inner loop to the outer loop.
- Originally, reconfiguration happens at the entry of loop B and the code chunk is loaded into SPM every time the child loop executes; after the optimization, the reconfiguration point and SPM loading are performed once at the outer loop.

Experimental Setup

- DRIM is based on a 4-way set-associative I-cache; each bank is 256 bytes.
- Energy consumption is modeled using CACTI for 0.13μm technology; the logic that performs address checking and SPM control is also included.
- Energy per access is reported for the cache portion when DRIM is configured with 1, 2, 3, or 4 banks as cache, and for SPM (the energy of one data bank plus the energy overhead of accessing SPM).

Performance Improvement

- Average improvement: 15.6% in I-cache miss rate and 10.2% in execution time.
- The improvement comes from mapping the frequently executed instructions of important loops into SPM.

Energy Saving

- The energy reduction achieved by DRIM ranges from 14.3% to 65.2%, with an average of 41%.
- The reduction comes from the improved I-cache miss rate (fewer SDRAM accesses) and from the lower per-access energy of SPM compared with cache.
- There are energy savings even when there is no miss-rate reduction.

Conclusions

- This paper proposed DRIM, a low-power Dynamically Reconfigurable Instruction Memory in which I-cache banks can be configured as SPM for different applications as well as for different phases of an application's execution.
- A compilation flow supports DRIM by determining the reconfiguration points and the instructions allocated to SPM.
- Experimental results show that DRIM reduces energy consumption by up to 65.2%.

Comment on This Paper

- The DRIM architecture is clear and easy to understand; it also shows that the tag bank is not utilized when a bank is configured as SPM.
- The complex compiler framework makes it hard to migrate DRIM to other instruction set architectures (ISAs).