Cache Miss-Aware Dynamic Stack Allocation
Authors: S. Jang et al.
Conference: International Symposium on Circuits and Systems (ISCAS), 2007
Presenter: Tareq Hasan Khan, ECE, U of S
Literature review-4 (EE 800)

2 Outline
Introduction to Cache and Stack
Proposed Dynamic Stack Allocator
Cache Miss Predictor
Stack Pointer Manager
Results
Conclusion

3 Introduction
Cache
A small, high-speed on-chip memory.
Bridges the speed gap between the microprocessor and main memory.
For low-power embedded systems, cache misses must be reduced without increasing cache associativity.
Stack
A group of memory locations used for an application's local variables and temporary data, and for the return locations of function calls.
Last-In First-Out (LIFO) structure.
About half (49%) of all memory accesses are stack-related.

4 Dynamic Stack Allocator
Conventional stack allocation inserts and extracts data sequentially, without considering cache misses.
Proposed hardware: the Dynamic Stack Allocator (DSA), made up of two blocks.
Cache Miss Predictor (CMP): computes a cache miss probability for each cache line using the history of cache misses.
Stack Pointer Manager (SPM): selects a location for the stack pointer that has the lowest cache miss probability.
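As a rough illustration of where these two blocks sit, here is a software framing in C with hypothetical names; the paper implements both blocks in hardware beside the cache, so this only sketches the logical hook points.

    #include <stdint.h>

    /* Bodies are stubs here; the CMP side is sketched on slide 8 and the
       SPM side on slide 13. All names are illustrative assumptions. */
    void cmc_record_miss(uint16_t line_index) { (void)line_index; }
    int  spm_select_substack(void)            { return 0; }

    /* Cache-controller hook: every detected miss feeds the predictor. */
    void on_cache_miss(uint16_t line_index)
    {
        cmc_record_miss(line_index);
    }

    /* Function-call hook: the stack pointer is steered to the sub-stack
       with the lowest predicted cache miss probability. */
    int on_function_call(void)
    {
        return spm_select_substack();
    }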

5 Dynamic Stack Allocator (figure)

6 Outline
Introduction to Cache and Stack
Proposed Dynamic Stack Allocator
Cache Miss Predictor
Stack Pointer Manager
Results
Conclusion

7 Cache Miss Predictor (CMP)
Consists of two parts:
Cache Miss Controller (CMC)
Cache Miss (CM) buffer, which consists of "index" and "count" register pairs
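As a rough sketch of this layout in C (the paper only specifies index/count register pairs; the field widths and buffer depth below are our assumptions):

    #include <stdint.h>

    #define CM_DEPTH 8                 /* assumed number of entries */

    /* One CM buffer entry: an "index"/"count" register pair. */
    typedef struct {
        uint16_t index;                /* cache line index that missed */
        uint16_t count;                /* miss count recorded for that line */
    } cm_entry_t;

    typedef struct {
        cm_entry_t entry[CM_DEPTH];    /* managed by the CMC */
    } cm_buffer_t;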

8 Cache Miss Controller (CMC)
The cache controller detects cache misses by comparing the tags in the cache with the tag bits of the address requested by the processor.
When a cache miss is detected, the cache controller asserts a cache miss signal to notify the CMP that a miss has occurred, and also supplies the index of the missing line.
On a cache miss, the CMC saves the index in the CM buffer and increments its corresponding counter.
When the CM buffer is full, an entry is replaced according to the interval-based LRU policy.
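A minimal software model of this update path might look as follows; the real CMC is hardware, and the buffer depth and all names here are illustrative assumptions.

    #include <stdint.h>

    #define RCM_DEPTH 8                     /* assumed buffer depth */

    static uint16_t rcm_index[RCM_DEPTH];   /* recorded line indices */
    static uint16_t rcm_count[RCM_DEPTH];   /* per-index miss counters */
    static int      rcm_used;               /* number of occupied entries */

    /* Called when the cache controller signals a miss on line `idx`. */
    void cmc_record_miss(uint16_t idx)
    {
        for (int i = 0; i < rcm_used; i++) {
            if (rcm_index[i] == idx) {      /* index already recorded */
                rcm_count[i]++;             /* bump its counter */
                return;
            }
        }
        if (rcm_used < RCM_DEPTH) {         /* free entry: record index */
            rcm_index[rcm_used] = idx;
            rcm_count[rcm_used] = 1;
            rcm_used++;
        }
        /* When the buffer is full, entries are handed to the HCM buffer
           under the interval-based LRU policy (sketched on slide 10). */
    }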

9 Cache Miss (CM) buffer
Recent CM buffer (RCM buffer)
History CM buffer (HCM buffer)

10 Cache Miss (CM) buffer
Recent CM buffer (RCM buffer)
On a cache miss to cache line k, an associative lookup into the RCM buffer is performed using k.
If there is an entry with index k, the counter for line k is incremented.
If no match occurs and the RCM buffer is not full, the index is recorded in one of the empty entries and the corresponding counter is incremented.
History CM buffer (HCM buffer)
When the RCM buffer is full, entries of the HCM buffer are replaced with the contents of the RCM buffer according to the interval-based LRU policy: an index in the HCM buffer is replaced by an RCM index with a larger counter value.
In the interval-based LRU policy, this replacement comparison does not occur until the RCM buffer is full.
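A minimal sketch of one plausible reading of this interval-based replacement (each full-buffer RCM entry displaces the smallest HCM entry, but only if its counter is larger); depths and names are assumptions:

    #include <stdint.h>

    #define DEPTH 8                        /* assumed depth of both buffers */

    static uint16_t rcm_index[DEPTH], rcm_count[DEPTH];
    static uint16_t hcm_index[DEPTH], hcm_count[DEPTH];
    static int      rcm_used;

    /* Runs only once the RCM buffer is full (interval-based LRU). */
    void rcm_to_hcm_transfer(void)
    {
        for (int i = 0; i < DEPTH; i++) {
            int victim = 0;                /* HCM entry with smallest count */
            for (int j = 1; j < DEPTH; j++)
                if (hcm_count[j] < hcm_count[victim])
                    victim = j;
            if (rcm_count[i] > hcm_count[victim]) {
                hcm_index[victim] = rcm_index[i];   /* larger value wins */
                hcm_count[victim] = rcm_count[i];
            }
        }
        rcm_used = 0;                      /* RCM starts a new interval */
    }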

11 Outline
Introduction to Cache and Stack
Proposed Dynamic Stack Allocator
Cache Miss Predictor
Stack Pointer Manager
Results
Conclusion

12 Stack Pointer Manager (SPM)
When an application requires a stack, the SPM looks for a location that has the lowest cache miss probability, using the contents of the RCM and HCM buffers.

13 Stack Pointer Manager (SPM)
When a function is called, the SPM calculates the total cache miss probability within the searching window (R1, R2) of each sub-stack.
To do this, the SPM looks up and down the RCM and HCM buffers to check whether indices included in the searching window are present; for each one found, it adds the corresponding counter value to the total cache miss probability.
After this computation, the SPM compares the computed probability of each sub-stack with those of the other sub-stacks.
The SPM then dynamically selects the sub-stack with the lowest cache miss probability as the stack for the application.
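A software sketch of this selection step; the window bounds, the number of sub-stacks, and all names are our illustrative assumptions, not the paper's interface.

    #include <stdint.h>

    #define DEPTH      8       /* assumed CM buffer depth */
    #define NUM_STACKS 4       /* assumed number of sub-stacks */

    static uint16_t rcm_index[DEPTH], rcm_count[DEPTH];
    static uint16_t hcm_index[DEPTH], hcm_count[DEPTH];
    /* Searching window [r1, r2] of cache line indices per sub-stack. */
    static uint16_t win_r1[NUM_STACKS], win_r2[NUM_STACKS];

    /* Sum the recorded miss counts whose indices fall inside a window. */
    static uint32_t window_miss_weight(uint16_t r1, uint16_t r2)
    {
        uint32_t total = 0;
        for (int i = 0; i < DEPTH; i++) {
            if (rcm_index[i] >= r1 && rcm_index[i] <= r2)
                total += rcm_count[i];
            if (hcm_index[i] >= r1 && hcm_index[i] <= r2)
                total += hcm_count[i];
        }
        return total;
    }

    /* On a function call: pick the sub-stack whose searching window has
       the lowest total (predicted) cache miss probability. */
    int spm_select_substack(void)
    {
        int best = 0;
        uint32_t best_w = window_miss_weight(win_r1[0], win_r2[0]);
        for (int s = 1; s < NUM_STACKS; s++) {
            uint32_t w = window_miss_weight(win_r1[s], win_r2[s]);
            if (w < best_w) { best_w = w; best = s; }
        }
        return best;
    }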

14 Outline
Introduction to Cache and Stack
Proposed Dynamic Stack Allocator
Cache Miss Predictor
Stack Pointer Manager
Results
Conclusion

15 Results
Implemented within the OpenRISC 1200 microprocessor with an 8 KB direct-mapped data cache and an 8 KB direct-mapped instruction cache, each with a 16-byte line size.
The amount of data traffic between cache and main memory was measured for different sizes of the RCM and HCM buffers, normalized to 1 for the conventional scheme.
For FFT, traffic is 42% smaller than with the conventional scheme.
In some cases traffic increases, e.g., DFT with the DSA configuration RCM(5) and HCM(8).

16 Results…cont.
Variation of the amount of data traffic with the number of sub-stacks.
In all cases, the more sub-stacks there are, the smaller the amount of traffic, but the improvement is not very significant.

17 Results…cont.
An ASIC implementation of the DSA was done.
The maximum speed was 87 MHz.
The size of the DSA is 0.3 mm × 0.4 mm, which is about 1% of the total core area.

18 Conclusion
Proposed hardware for cache miss-aware dynamic stack allocation to reduce cache misses.
Based on the history of cache misses, the proposed scheme moves the stack pointer to a location expected to cause fewer cache misses.
Across various benchmarks, the DSA reduced traffic between cache and main memory by 4% to 42%.

19 Thanks