TLC: A Tag-Less Cache for Reducing Dynamic First Level Cache Energy


TLC: A Tag-Less Cache for Reducing Dynamic First Level Cache Energy
Presented by Rohit Reddy Takkala

Introduction
First-level caches are performance critical and are therefore optimized for speed. Modern processors reduce the miss ratio by using set-associative caches and optimize latency by reading all ways in parallel with the TLB (Translation Lookaside Buffer) and tag lookups. To reduce energy, phased caches (which add latency) and way prediction (which adds complexity) have been proposed; both read data only from the matching or predicted way. This work introduces a new cache structure: the Tag-Less Cache (TLC).

Introduction – Intel reports that 3%–13% of core power comes from TLBs and 12%–45% from caches. Of this, 80% of the TLB and cache energy is spent in the data arrays.

Brief Overview of Virtual Memory
Why virtual memory?
- Limited physical memory size
- Fragmentation
- Programs overwriting each other
Typical cache addressing schemes under virtual memory:
- Virtually Indexed, Virtually Tagged (VIVT)
- Physically Indexed, Physically Tagged (PIPT)
- Virtually Indexed, Physically Tagged (VIPT)

VIPT (Virtually Indexed, Physically Tagged): schematic overview of a 32 kB, 8-way VIPT cache with a 64-entry, 8-way TLB.
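To make the figure concrete, here is a minimal sketch (ours, not the paper's) of how a VIPT lookup splits the address, assuming 64 B lines, 64 sets, and 4 kB pages as in this configuration:

```c
#include <stdint.h>

/* Assumed geometry: 32 kB, 8-way, 64 B lines -> 64 sets; 4 kB pages. */
#define LINE_BITS 6   /* 64 B line offset */
#define SET_BITS  6   /* 64 sets          */
#define PAGE_BITS 12  /* 4 kB page offset */
#define WAYS      8

/* The set index (address bits 6..11) lies entirely inside the page
 * offset, so the cache can be indexed with the virtual address while
 * the TLB translates the page number in parallel. */
static inline uint32_t vipt_set_index(uint64_t vaddr)
{
    return (uint32_t)(vaddr >> LINE_BITS) & ((1u << SET_BITS) - 1);
}

/* After translation, the physical tag is compared against the tags of
 * all 8 ways, which were read in parallel. */
static inline uint64_t phys_tag(uint64_t paddr)
{
    return paddr >> (LINE_BITS + SET_BITS);
}
```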

Why TLC? To achieve performance similar to a VIPT cache while reducing the energy consumed in the first-level cache.
How?
- Avoid tag comparisons (by eliminating the tag array)
- Eliminate extra data reads (by reading the correct way directly)
- Filter out cache misses (via the valid bits in the CLT, the per-page cache-line table stored in the eTLB)

Design Overview of TLC: Tag-Less Cache (TLC) design overview for a 32 kB, 8-way cache with a 64-entry, 8-way eTLB (extended TLB).
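The structures in the figure can be summarized with a minimal sketch; the field names and widths are our assumptions for the 32 kB, 8-way, 4 kB-page configuration:

```c
#include <stdint.h>

#define LINES_PER_PAGE 64   /* 4 kB page / 64 B lines */
#define ETLB_ENTRIES   64

/* One CLT entry per cache line in the page: where (if anywhere)
 * that line currently lives in the data array. */
struct clt_entry {
    unsigned valid : 1;  /* is the line present in the cache?  */
    unsigned way   : 3;  /* which of the 8 ways holds the line */
};

/* An eTLB entry is a normal translation entry extended with the CLT,
 * which replaces the cache's tag array entirely. */
struct etlb_entry {
    uint64_t vpn;                          /* virtual page number  */
    uint64_t ppn;                          /* physical page number */
    struct clt_entry clt[LINES_PER_PAGE];
};
```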

Example 1 – Cache Hit
A cache hit requires two steps:
1) Look up the cache line's valid bit and way index in the eTLB. The CLT contains this information for every cache line within the page, but only the addressed line's way index is needed, so the cache index bits (IDX) of the virtual address select the correct cache line information from the CLT.
2) Read the data from the data array. The data-array address is generated by combining the way index from the CLT with the set index (IDX) from the virtual address, indexing the data array directly.
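A minimal sketch of these two steps, reusing the structures and constants from the sketches above (helper names are our own):

```c
#include <stdbool.h>

/* Step 1: the cache index bits (IDX) of the virtual address select the
 * addressed line's entry in the page's CLT.
 * Step 2: the stored way index is combined with the set index to read
 * exactly one data-array location. */
static inline bool tlc_hit_lookup(const struct etlb_entry *pte,
                                  uint64_t vaddr, uint32_t *slot)
{
    uint32_t line = (uint32_t)(vaddr >> LINE_BITS) & (LINES_PER_PAGE - 1);
    struct clt_entry e = pte->clt[line];
    if (!e.valid)
        return false;                /* miss filtered: no data read */
    *slot = vipt_set_index(vaddr) * WAYS + e.way;  /* one way read  */
    return true;
}
```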

Example 2 – Cache Miss
Since there are no tags, there are two ways to determine that the data is not in the cache: 1) the cache line's valid bit in the eTLB is not set, or 2) the page is not in the eTLB (i.e., a TLB miss). Energy is conserved on cache misses by not reading the data array at all; a VIPT cache reads all ways regardless of whether the access hits or misses. A cache miss is typically handled in two steps: first evicting the conflicting cache line, then inserting the new one. Both steps must update the eTLB, since it keeps track of what data is in the cache and where it is located. A miss is handled differently depending on whether 1) the cache line is missing but the page exists in the eTLB, or 2) both the cache line and the page are missing.
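A hedged sketch of the miss path; the eviction and fill helpers are hypothetical stand-ins for the hardware actions the slide describes:

```c
/* Hypothetical hardware actions, shown as declarations only.
 * evict_line() must also clear the evicted line's valid bit in its
 * owning page's CLT (that bookkeeping is omitted here). */
uint8_t choose_cache_victim(uint32_t set);
void evict_line(uint32_t set, uint8_t way);
void fill_line(uint32_t set, uint8_t way, uint64_t ppn, uint64_t vaddr);

/* Case 1: the page hit in the eTLB but the line's valid bit was clear.
 * Case 2 (an eTLB miss) first walks the page table and allocates an
 * eTLB entry (see the LAD replacement policy below), then proceeds
 * exactly as here. */
void tlc_handle_miss(struct etlb_entry *pte, uint64_t vaddr)
{
    uint32_t line = (uint32_t)(vaddr >> LINE_BITS) & (LINES_PER_PAGE - 1);
    uint32_t set  = vipt_set_index(vaddr);

    uint8_t way = choose_cache_victim(set);
    evict_line(set, way);                  /* 1: evict conflicting line */
    fill_line(set, way, pte->ppn, vaddr);  /* 2: insert the new line    */

    pte->clt[line].valid = 1;  /* the eTLB tracks what is cached where  */
    pte->clt[line].way   = way;
}
```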

Making TLC Faster – Cache Effectiveness
In 34% of SPEC's execution (Region C), additional cache misses arise from premature cache evictions caused by eTLB evictions; that is, cache lines are evicted before they can be reused because their pages were evicted from the eTLB. One way to reduce these cache line evictions is to use a larger eTLB, but that increases both size and energy.

Making TLC Faster – LAD (Least Allocated Data) Replacement
The goal of the new eTLB replacement policy is to minimize the number of cache line evictions caused by eTLB replacements. The page with the Least Allocated Data (LAD), i.e., the page with the fewest cache lines currently in the cache, is replaced. The amount of cached data for each eTLB entry can be computed either by summing the valid bits in the CLT or by keeping a counter that is incremented whenever a new cache line is inserted.
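Both counting options are easy to sketch with the structures above:

```c
/* Option A: derive the count on demand by summing the CLT valid bits. */
static inline unsigned lad_count(const struct etlb_entry *pte)
{
    unsigned n = 0;
    for (unsigned i = 0; i < LINES_PER_PAGE; i++)
        n += pte->clt[i].valid;
    return n;
}
/* Option B (not shown): keep a per-entry counter that is incremented
 * on every cache line insertion and decremented on eviction. */
```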

Making TLC Faster – LAD (Least Allocated Data) Replacement
LRU eTLB replacement evicts 1.88 cache lines per eTLB replacement on average (2.58 for Region C); LAD reduces this to 0.74 cache lines per eTLB replacement (1.11 for Region C). However, both the eTLB and cache miss ratios increase: the eTLB miss ratio rises because LRU captures temporal locality much better than LAD, and the cache miss ratio rises because LAD tends to evict pages that are about to be filled (a page is empty the first time it is used).

Making TLC Faster – LAD (Least Allocated Data) Replacement
LAD+LRU: LAD is applied only to the n least recently used pages; that is, of the n least recently used (LRU) pages, the one with the least allocated data (LAD) is selected for replacement. This keeps newly loaded pages that are still filling with data in the eTLB (LRU) while minimizing cache line evictions by replacing the page with the least data (LAD).
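A sketch of the combined policy, assuming the eTLB set is ordered from most (index 0) to least (index 7) recently used and n is a tuning parameter:

```c
/* Of the n least recently used entries in an 8-way eTLB set, pick the
 * one with the least allocated data; 1 <= n <= WAYS. */
static unsigned lad_lru_victim(const struct etlb_entry etlb_set[WAYS],
                               unsigned n)
{
    unsigned victim = WAYS - 1;             /* n == 1 degenerates to LRU */
    unsigned least  = lad_count(&etlb_set[victim]);
    for (unsigned i = WAYS - n; i < WAYS - 1; i++) {
        unsigned c = lad_count(&etlb_set[i]);
        if (c < least) { least = c; victim = i; }
    }
    return victim;
}
```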

Cache Effectiveness – Micro-Pages
Even with LAD+LRU, the miss ratio of TLC is still 0.45 percentage points higher than the VIPT baseline on average, and 1.5 percentage points higher for the worst simulation points. One reason is the discrepancy between the number of entries in the eTLB and in the cache (8× more cache lines than eTLB entries). Since every cache line must have an entry in the eTLB, the eTLB can become a bottleneck for applications with large data sets and sparse access patterns. Instead of naively adding more entries to the eTLB, micro-pages (pages smaller than 4 kB) are used to keep the eTLB small as more entries are added: a micro-page contains fewer cache lines and therefore needs less cache line location information. For example, a 512-entry eTLB with 512 B pages needs about 1/3 as many bits as a 512-entry eTLB with 4 kB pages.
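As a rough illustration of that 1/3 figure (our own back-of-the-envelope numbers, not the slides'): with 64 B lines, a 4 kB page holds 64 lines, so its CLT needs 64 × 4 = 256 bits (one valid bit plus a 3-bit way index per line), while a 512 B micro-page holds 8 lines and needs only 8 × 4 = 32 bits. Assuming, say, 48-bit virtual and 40-bit physical addresses, each 4 kB entry then costs roughly 36 + 28 + 256 = 320 bits (virtual tag + physical page number + CLT) versus 39 + 31 + 32 = 102 bits per 512 B entry, which is about 1/3.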

Summary
Various other optimizations, such as macro-page preloading, eTLB and cache communication, and higher associativity, are applied to bring performance as close as possible to VIPT. The simulation results show that the TLC design:
- is fast enough to support clock frequencies beyond 4 GHz,
- achieves the same performance (miss ratio and CPI) as a traditional VIPT cache, and
- uses 78% less dynamic runtime energy than a VIPT cache.