Hardware Caches with Low Access Times and High Hit Ratios
Xiaodong Zhang, College of William and Mary

Basics of Hardware Caches
- A data item is referenced by its memory address.
- It is first searched for in the cache.
- Three questions cover all cache operations:
  - How do we know whether it is in the cache?
  - If it is (a hit), how do we find it?
  - If it is not (a miss), how do we replace the data if the location is already occupied?

Direct-Mapped Cache
- The simplest scheme, but popular and efficient.
- Each access is mapped to exactly one location.
- The mapping: (memory address) mod (number of cache blocks)
- A standard block holds 4 bytes (a word), but blocks are increasingly longer to exploit spatial locality.

An Example of a Direct-Mapped Cache
All four memory addresses map to the same block of an 8-block cache:
- 7₁₀  = (00111)₂,  7 mod 8 = 7
- 15₁₀ = (01111)₂, 15 mod 8 = 7
- 23₁₀ = (10111)₂, 23 mod 8 = 7
- 31₁₀ = (11111)₂, 31 mod 8 = 7
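To make the arithmetic concrete, here is a minimal C sketch of the mapping, using the slide's 8-block cache and its four addresses:

    /* Direct-mapped placement: each address maps to exactly one block. */
    #include <stdio.h>

    int main(void) {
        unsigned num_blocks = 8;                /* cache size in blocks */
        unsigned addrs[] = {7, 15, 23, 31};     /* the slide's addresses */
        for (int i = 0; i < 4; i++)             /* all four collide on block 7 */
            printf("address %2u -> block %u\n", addrs[i], addrs[i] % num_blocks);
        return 0;
    }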

Nature of Direct-Mapped Caches
- If the number of cache blocks is a power of 2, the mapping is exactly the low-order log₂(cache size in blocks) bits of the address.
- If the cache has 2³ = 8 blocks, the 3 low-order bits of the address are the directly mapped location.
- These low-order bits are also called the cache index.

Tags in Direct-Mapped Caches
- The cache index alone is not enough, because multiple addresses map to the same block.
- A "tag" makes the distinction. The upper portion of the address forms the tag:

    | Tag (2 bits) | Cache index (3 bits) |

- If both the tag and the index match between memory and cache, a "hit" occurs.
- A valid bit is also attached to each cache block.

Allocation of Tag, Index, and Offset Bits
- Cache size: 64 KB.
- Block size: 4 bytes (2 bits for the byte offset).
- 64 KB = 16K blocks = 2¹⁴ (14 bits for the index).
- For a 32-bit memory address, 16 bits are left for the tag:

    | Tag (16 bits) | Index for 16K words (14 bits) | Byte offset (2 bits) |

- Each cache line holds a total of 49 bits: 32 bits of data (4 bytes), a 16-bit tag, and 1 valid bit.
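A small C sketch of the field split under this slide's parameters (2 offset bits, 14 index bits, 16 tag bits); the example address is an arbitrary placeholder:

    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 2    /* 4-byte blocks */
    #define INDEX_BITS 14    /* 16K blocks */

    int main(void) {
        uint32_t addr   = 0x12345678u;  /* hypothetical 32-bit address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%04x index=0x%04x offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }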

Set-Associative Caches
- Direct mapping to a single location causes a high miss rate.
- What if we increase the number of possible locations for each mapping?
- The number of locations per set is called the "associativity". A set contains more than one location.
  - Associativity = 1 for a direct-mapped cache.
- Set-associative mapping: (memory address) mod (number of sets)
- Associativity = total number of blocks for a fully associative cache.
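A toy C model of a set-associative lookup; the sizes (8 sets, 2 ways, word-sized blocks) and the function name are illustrative assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 8
    #define NUM_WAYS 2

    typedef struct { bool valid; uint32_t tag; } Line;
    static Line cache[NUM_SETS][NUM_WAYS];

    /* (memory address) mod (number of sets) picks the set;
       every way in the set is then searched for a matching tag. */
    bool lookup(uint32_t addr) {
        uint32_t set = addr % NUM_SETS;
        uint32_t tag = addr / NUM_SETS;   /* the remaining upper bits */
        for (int w = 0; w < NUM_WAYS; w++)
            if (cache[set][w].valid && cache[set][w].tag == tag)
                return true;              /* hit in way w */
        return false;                     /* miss */
    }

    int main(void) {
        cache[7][1] = (Line){ true, 0 };  /* preload address 7's line */
        return lookup(7) ? 0 : 1;         /* hit: exit status 0 */
    }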

Direct-Mapped vs. Set-Associative
[Figure: a direct-mapped cache with sets 0-7 beside a 2-way set-associative cache with sets 0-3 (ways 0 and 1); each address is placed at (address) mod (#sets).]

Cache Accesses
[Figure: the reference stream 2, 7, A, 4, 2, 7, A replayed on both caches above; the 2-way set-associative cache finishes with 4 misses, fewer than the direct-mapped cache.]

Direct-Mapped Cache Operations: Minimum Hit Time
[Figure: the CPU address splits into tag, set, and offset; the indexed block's data is read out while its stored tag is compared with the address tag (=?), so the data is ready as soon as the tags match.]

Set-Associative Cache Operations: Delayed Hit Time
[Figure: the address tag is compared against the stored tags of ways 0-3, and the comparison result drives a 4:1 multiplexor that forwards the hit way's data to the CPU; selection can start only after tag checking.]

Set-Associative Caches Reduce Miss Ratios
[Figure: miss ratios of 172.mgrid (SPEC 2000, multi-grid solver) drop as associativity increases.]

Trade-offs between High Hit Ratios (SA) and Low Access Times (DM)
- A set-associative cache achieves high hit ratios: about 30% higher than a direct-mapped cache.
- But it suffers long access times because:
  - the multiplexing logic adds delay during way selection, and
  - tag checking, selection, and data dispatching are sequential.
- A direct-mapped cache loads the data and checks the tag in parallel, minimizing the access time.
- Can we get both high hit ratios and low access times?
- The key is way prediction: speculatively determine which way holds the hit so that only that way is accessed.

Best Case of Way Prediction: First Hit
[Figure: the predicted way's tag and data are accessed directly and match. Cost: the way prediction only.]

Way Prediction: Non-first Hit (in the Set)
[Figure: the predicted way misses, and the other ways of the set are searched and one of them hits. Cost: way prediction + selection in the set.]

Worst Case of Way Prediction: Miss
[Figure: the predicted way and then the whole set miss, and the request goes to the lower-level memory. Cost: way prediction + selection in the set + the miss penalty.]

MRU Way Predictions
- Chang et al., ISCA '87 (for the IBM 370, by IBM); Kessler et al., ISCA '89 (Wisconsin).
- Mark the most recently used (MRU) block in each set.
- Access this block first. If it hits, the access time is low.
- If the prediction is wrong, search the other blocks.
- If the search fails in the set, it is a miss.
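A minimal C sketch of this probe order; the cache geometry and names are assumptions for illustration, not taken from the cited papers:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 64
    #define NUM_WAYS 4

    typedef struct { bool valid; uint32_t tag; } Line;
    static Line    cache[NUM_SETS][NUM_WAYS];
    static uint8_t mru[NUM_SETS];            /* MRU way of each set */

    /* Returns 0 on a first hit, 1 on a non-first hit, -1 on a miss. */
    int mru_access(uint32_t addr) {
        uint32_t set = addr % NUM_SETS;
        uint32_t tag = addr / NUM_SETS;
        uint8_t  m   = mru[set];
        if (cache[set][m].valid && cache[set][m].tag == tag)
            return 0;                        /* MRU guess right: low access time */
        for (int w = 0; w < NUM_WAYS; w++) { /* wrong guess: search the rest */
            if (w == m) continue;
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                mru[set] = (uint8_t)w;       /* re-mark the MRU block */
                return 1;                    /* non-first hit */
            }
        }
        return -1;                           /* miss in the set */
    }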

MRU Way Prediction
[Figure: a worked example with an MRU table over ways 0-3: one reference misses the set entirely, another hits a non-MRU way (a non-first hit).]

Limits of MRU Set-Associative Caches
- A first hit is not equivalent to a direct-mapped hit:
  - the MRU index must be fetched before the block can be accessed, so either the cache access cycle is lengthened or an additional cycle is needed.
- The MRU location is the only search entry in the set:
  - the first-hit ratio can be low when there are few repeated accesses to the same data (long reuse distances), such as in loops.
- An MRU cache reduces the access time of a set-associative cache to a certain degree, but:
  - a big gap remains between it and a direct-mapped cache, and
  - it can be worse than a plain set-associative cache when first hits to MRU locations are rare.

Multi-Column Caches: Fast Accesses and High Hit Ratios
- Zhang et al., IEEE Micro (W&M).
- Objectives:
  - Each first hit is equivalent to a direct-mapped access.
  - Maximize the number of first hits.
  - Minimize the latency of non-first hits.
  - Keep the additional hardware simple and low cost.

Basic Ideas of Multi-Column Caches
- A major location in a set is the direct-mapped location of the MRU block.
- A selected location is a direct-mapped location holding a non-MRU block.
- A selected-location index is maintained for each major location.
- A "swap" ensures that the block in the major location is always the MRU block.

Multi-Column Caches: Major Location
- The otherwise unused address bits, together with the set bits, generate a direct-mapped location: the major location.
  - [Figure: address fields Tag / Set / Offset, with the unused bits directing to the block (way) in the set.]
- A major-location mapping = a direct mapping.
- A major location only holds an MRU block, either just loaded from memory or just accessed.

Multi-Column Caches: Selected Locations
- Multiple blocks can be direct-mapped to the same major location, but only the MRU block stays there.
- The non-MRU blocks are stored in otherwise empty locations in the set: the selected locations.
- If those other locations are in use as major locations of their own groups, there is no space for selected ones.
- Swap (sketched below):
  - A block in a selected location is swapped into its major location when it becomes MRU.
  - A block in a major location is swapped out to a selected location after a new block is loaded into it from memory.
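A minimal C sketch of the major/selected access path and the swap, under an assumed 4-way geometry; here the way within the set is chosen by the low tag bits (the slide's "unused bits"), and bit-vector upkeep is omitted:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 64
    #define NUM_WAYS 4                       /* log2(4) = 2 "unused" bits */

    typedef struct { bool valid; uint32_t tag; } Line;
    static Line cache[NUM_SETS][NUM_WAYS];

    /* Returns 0 on a first hit (direct-mapped speed), 1 on a
       non-first hit, -1 on a miss. */
    int mc_access(uint32_t addr) {
        uint32_t set   = addr % NUM_SETS;
        uint32_t tag   = addr / NUM_SETS;
        uint32_t major = tag % NUM_WAYS;     /* bits just above the set index */
        if (cache[set][major].valid && cache[set][major].tag == tag)
            return 0;                        /* first hit at the major location */
        for (int w = 0; w < NUM_WAYS; w++) { /* search the selected locations */
            if (w == (int)major) continue;
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                Line t = cache[set][major];       /* swap: the MRU block moves */
                cache[set][major] = cache[set][w];/* into its major location,  */
                cache[set][w] = t;                /* old occupant moves out    */
                return 1;                    /* non-first hit */
            }
        }
        return -1;                           /* miss: fill major, swap old out */
    }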

Multi-Column: Indexing Selected Locations
- The selected locations associated with each major location are indexed for a fast search.
- [Figure: locations 0-3 with one bit vector per major location; major location 1 has two selected locations, at 0 and 2.]
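A sketch of one possible encoding of that index; the per-major-location bit vectors and the helper name are illustrative assumptions:

    #include <stdint.h>

    #define NUM_WAYS 4

    /* vec[major] has bit w set if way w holds a selected block
       displaced from major location `major`. The figure's example,
       "major location 1 has selected locations 0 and 2", would be
       vec[1] = 0x5 (bits 0 and 2 set). */
    typedef struct { uint8_t vec[NUM_WAYS]; } SelectedIndex;

    /* Iterate over flagged ways only, starting after way `after`
       (pass -1 to start from way 0); returns -1 when exhausted. */
    int next_selected(const SelectedIndex *si, unsigned major, int after) {
        for (int w = after + 1; w < NUM_WAYS; w++)
            if (si->vec[major] & (1u << w))
                return w;
        return -1;
    }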

Multi-Column Cache Operations
[Figure: a worked example over ways 0-3. One reference finds its block in a selected location rather than its major location (a non-first hit) and is swapped in; a later reference with no selected location places its block directly at the major location, giving a first hit.]

Performance Summary of Multi-Column Caches
- The hit ratio to the major locations is about 90%.
- The overall hit ratio is higher than that of a direct-mapped cache thanks to the high associativity, while the average access time stays as low as a direct-mapped cache's.
- A first hit is equivalent to a direct-mapped access.
- Non-first hits are faster than in set-associative caches.
- It outperforms not only set-associative caches but also column-associative caches (ISCA '93, MIT).

Comparing First-Hit Ratios between Multi-Column and MRU Caches
[Figure: first-hit ratios of multi-column vs. MRU caches across benchmarks.]

Some Complexity of Multi-Column Caches
- The search of the selected locations can be:
  - sequential, driven by the bit vectors of each set, or
  - parallel, with a multiplexor for the selection.
- If a mapping finds its major location occupied by a selected location of another major-location group:
  - replace it with the major location's own data, and
  - either search the bit vectors to clear the corresponding bit to 0, or simply accept a miss when that stale selected location is later searched.
- The index may be omitted by relying on swapping alone.
- Partial indexing: trace only one selected location.

Multi-Column Technique is Critical for Low-Power Caches
[Figure: trend across Intel processors from the Pentium through the Pentium Pro, Pentium II, and Pentium III to the Pentium 4. Source: Intel.com]

Importance of Low-Power Designs
- Portable systems:
  - limited battery lifetime.
- High-end systems:
  - cooling and package cost: above 40 W, each additional watt adds roughly $1, and air-cooling techniques are reaching their limits;
  - electricity bills;
  - reliability.

Low-Power Techniques
- Physical (CMOS) level
- Circuit level
- Logic level
- Architectural level
- OS level
- Compiler level
- Algorithm/application level

Trade-off between Performance and Power
- Objective for a general-purpose system:
  - reduce power consumption without degrading performance.
- Common solution:
  - access/activate resources only when necessary.
- Question:
  - when is "necessary"?

On-Chip Caches: Area and Power (Alpha 21264)
[Figure: the area and power shares of the on-chip caches in the Alpha 21264. Source: CoolChip Tutorial]

Standard Set-Associative Caches: An Energy Perspective
[Figure: a conventional access reads the tags and data arrays of all four ways in parallel, and a 4:1 multiplexor forwards the hit way to the CPU, so every access pays for N tag reads and N data reads.]

Phased Cache
[Figure: the tags of all four ways are compared first; only the hit way's data array is then read and forwarded to the CPU through the 4:1 multiplexor.]
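A C sketch of the two phases; the geometry and names are assumptions. A miss never touches the data arrays, which is why phased caches pay off when misses are frequent:

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_SETS 64
    #define NUM_WAYS 4

    typedef struct { bool valid; uint32_t tag; uint32_t data; } Line;
    static Line cache[NUM_SETS][NUM_WAYS];

    bool phased_read(uint32_t addr, uint32_t *out) {
        uint32_t set = addr % NUM_SETS;
        uint32_t tag = addr / NUM_SETS;
        int hit_way = -1;
        for (int w = 0; w < NUM_WAYS; w++)   /* phase 1: tags only */
            if (cache[set][w].valid && cache[set][w].tag == tag)
                hit_way = w;
        if (hit_way < 0)
            return false;                    /* miss: no data array was read */
        *out = cache[set][hit_way].data;     /* phase 2: one data way */
        return true;
    }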

Way-Prediction Cache
[Figure: the predicted way's tag and data are read first; the remaining ways are accessed only on a misprediction, with the 4:1 multiplexor forwarding the hit data to the CPU.]

Limits of Existing Techniques
- Way-prediction caches:
  - effective for cache hits;
  - good for programs with strong locality.
- Phased caches:
  - effective for cache misses;
  - good for programs with weak locality.

Cache Hit Ratios Are Very Different
[Figure: cache hit ratios vary widely across applications.]

Cache Optimization Subject to Both Power and Access Time
- Objectives:
  - pursue the lowest access latency and power consumption for both cache hits and misses;
  - achieve consistent power savings across a wide range of applications.
- Solution:
  - apply way prediction to cache hits and phased caching to misses;
  - the access-mode-prediction (AMP) cache;
  - Zhu and Zhang, IEEE Micro, 2002 (W&M).

AMP Cache
[Flowchart: the access-mode prediction selects one of two paths. Way-prediction path (predicted hit): access the predicted way (1 tag + 1 data); if the prediction is correct, done; if not, access all the other ways ((N-1) tags + (N-1) data). Phased path (predicted miss): access all N tags, then 1 data.]

Prediction and Way Prediction
- Prediction:
  - the access predictor is designed to predict whether the next access will be a hit or a miss;
  - the prediction result switches between the phased-cache and way-prediction techniques;
  - cache misses are clustered and program behavior is repetitive, so branch-prediction techniques are adopted.
- Way prediction:
  - multi-column is found to be the most effective.
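As a concrete illustration of "branch-prediction techniques are adopted", here is a 2-bit saturating-counter predictor in C; this exact scheme is an assumption for illustration, not necessarily the paper's predictor:

    #include <stdbool.h>

    /* 2-bit saturating counter: states 0..3, >= 2 predicts "hit".
       Clustered misses drag the counter down; runs of hits push it up. */
    static unsigned ctr = 3;

    bool predict_hit(void) { return ctr >= 2; }

    void train(bool was_hit) {               /* update once the access resolves */
        if (was_hit)  { if (ctr < 3) ctr++; }
        else          { if (ctr > 0) ctr--; }
    }

A predicted hit selects the way-prediction path; a predicted miss selects the phased path, matching the AMP flowchart above.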

Energy Consumption: Multi-Column over MRU Caches
[Figure: energy consumption of the multi-column cache relative to the MRU cache.]

Energy Consumption
[Figure: energy consumption comparison across the cache schemes.]

Conclusion
- The multi-column cache fundamentally addresses the performance issue, delivering both a high hit ratio and a low access time:
  - major-location mapping is dominant and has the minimum access time (= a direct-mapped access);
  - swapping increases the first-hit ratio in the major locations;
  - indexing the selected locations makes non-first hits fast.
- The multi-column cache is also an effective way-prediction mechanism for low power.