Non-Uniform Cache Architectures for Wire Delay Dominated Caches
Abhishek Desai, Bhavesh Mehta, Devang Sachdev, Gilles Muller


Plan
Motivation
What is NUCA?
UCA and ML-UCA
Static NUCA
Dynamic NUCA
Simulation Results

Motivation
Bigger L2 and L3 caches are needed:
–Programs are larger
–SMT requires a large cache for spatial locality
–Bandwidth demands on the package have increased
–Smaller process technologies permit more bits per mm²
Wire delays dominate in large caches:
–The bulk of the access time is spent routing to and from the banks, not in the bank accesses themselves

What is NUCA?
Data residing closer to the processor is accessed much faster than data residing physically farther from it
Example: the closest bank in a 16MB on-chip L2 cache built in a 50nm process technology could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles
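The 4-vs-47-cycle gap can be illustrated with a toy latency model. The grid dimensions, per-hop routing delay, and bank access time below are assumptions for illustration only, not figures from the slides.

```python
# Toy model of non-uniform access latency in a banked cache.
# Assumptions (illustrative): banks form a 2D grid, the controller sits
# at (0, 0), each hop costs a fixed number of cycles, and every bank has
# the same internal access time.

def access_latency(bank_row, bank_col, hop_cycles=1, bank_cycles=3):
    """Cycles to access bank (bank_row, bank_col) from the controller."""
    hops = bank_row + bank_col                   # Manhattan-distance routing
    return 2 * hops * hop_cycles + bank_cycles   # round trip + bank access

print(access_latency(0, 0))   # closest bank: just the bank access
print(access_latency(7, 3))   # farthest corner of an 8x4 grid: wire-dominated
```

Even this crude model shows the NUCA effect: as the grid grows, routing cycles dwarf the fixed bank access time.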

UCA and ML-UCA
UCA: avg. access time 255 cycles; 1 bank; 16MB; 50nm technology
ML-UCA (two levels, L2/L3): avg. access time 11/41 cycles; 8/32 banks; 16MB; 50nm technology

Static NUCA: S-NUCA-1
Avg. access time: 34 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: 20.9% wire overhead

S-NUCA-1 cache design

Static NUCA: S-NUCA-2
Avg. access time: 24 cycles
Banks: 32
Size: 16MB
Technology: 50nm
Area: 5.9% channel overhead

S-NUCA-2 cache design (figure: address bus, sense amplifiers)

Dynamic NUCA: D-NUCA
Avg. access time: 18 cycles
Banks: 256
Size: 16MB
Technology: 50nm
Lines migrate between banks over time (data migration)

Management of Data in D-NUCA
Mapping: How are data mapped to the banks, and in which banks can a datum reside?
Search: How is the set of possible locations searched to find a line?
Movement: Under what conditions should data migrate from one bank to another?

Simple Mapping (implemented)
Figure: 8 bank sets; each bank in a column holds one way (way 1 through way 4); one column of banks forms one bank set; all banks connect to the memory controller
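Simple mapping can be sketched as address arithmetic: low-order line-address bits select the bank set (a column of banks), and the line may reside in any bank of that column, one per way. The line size, bank-set count, and associativity below are assumptions chosen to match the figure, not parameters fixed by the slides.

```python
# Sketch of the "simple mapping" policy: a column of banks forms a bank
# set, and each bank in the column holds one way of that set.
# LINE_BYTES is an assumed line size; the other parameters follow the figure.

NUM_BANK_SETS = 8   # columns of banks (bank sets)
WAYS = 4            # banks per column, one way each
LINE_BYTES = 64     # assumed cache-line size

def candidate_banks(addr):
    """Return the (bank_set, way) banks that may hold the line at addr."""
    line = addr // LINE_BYTES
    bank_set = line % NUM_BANK_SETS        # low-order bits pick the column
    return [(bank_set, way) for way in range(WAYS)]

# A lookup only ever needs to search the banks of one column.
print(candidate_banks(5 * 64))   # line 5 maps to bank set 5
```

Because every line maps to exactly one column, the search problem (next slide) reduces to probing the banks of a single bank set.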

Fair and Shared Mapping
Figure: fair mapping and shared mapping, alternative ways of assigning banks to bank sets around the memory controller

Searching Cached Lines
Incremental search
Multicast search (implemented)
Limited multicast
Partitioned multicast
Smart search: ss-performance, ss-energy
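The two extremes of the list above can be contrasted with a small sketch: incremental search probes banks one at a time (fewer probes, higher latency on distant hits), while multicast search probes all candidate banks at once. The data model (a list of per-bank tag sets, ordered closest-first) and the probe-counting are assumptions for illustration.

```python
# Sketch contrasting two D-NUCA search policies over one bank set.
# `banks` is a list of sets of resident tags, ordered closest-first.

def incremental_search(banks, tag):
    """Probe banks one at a time, closest first; stop on the first hit."""
    probes = 0
    for contents in banks:
        probes += 1
        if tag in contents:
            return ("hit", probes)
    return ("miss", probes)

def multicast_search(banks, tag):
    """Send the request to all candidate banks at once (implemented policy)."""
    probes = len(banks)          # every bank is probed, but in parallel
    hit = any(tag in contents for contents in banks)
    return ("hit" if hit else "miss", probes)
```

The trade-off the slide's variants (limited/partitioned multicast, smart search) explore is exactly this one: incremental search minimizes bank probes (energy), multicast minimizes hit latency.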

Dynamic Movement of Lines
Goal: keep the LRU line in the furthest bank and the MRU line in the closest
One-bank promotion on a hit (implemented)
Policy on a miss:
–Which line is evicted? The line in the furthest (slowest) bank (implemented)
–Where is the new line placed? The closest (fastest) bank, or the furthest (slowest) bank (implemented)
–What happens to the victim line? Zero-copy policy (implemented), or one-copy policy
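The implemented choices above can be sketched for a single bank set, simplified (as an assumption) to one line per bank, with index 0 as the closest bank:

```python
# Sketch of the implemented D-NUCA movement policies for one bank set:
# one-bank promotion on a hit, insertion into the slowest bank on a miss,
# and zero-copy eviction (the victim is dropped, not copied elsewhere).
# Simplification (assumed): one resident line per bank; index 0 is closest.

def access(bank_set, tag):
    """Access `tag`; mutates bank_set, a list of tags ordered closest-first."""
    if tag in bank_set:
        i = bank_set.index(tag)
        if i > 0:   # one-bank promotion: swap one position toward the processor
            bank_set[i - 1], bank_set[i] = bank_set[i], bank_set[i - 1]
        return "hit"
    # Miss: the line in the slowest bank is the victim (zero copy -- dropped),
    # and the incoming line is placed in the slowest bank.
    bank_set[-1] = tag
    return "miss"

banks = ["a", "b", "c", "d"]
access(banks, "c")    # hit: "c" swaps one bank closer
access(banks, "x")    # miss: "d" evicted, "x" placed in the slowest bank
print(banks)          # ['a', 'c', 'b', 'x']
```

Repeated hits gradually walk a hot line toward the fastest bank, which is how D-NUCA approximates the "MRU closest" goal without global bookkeeping.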

Advantages of D-NUCA over ML-UCA
D-NUCA does not enforce inclusion, preventing redundant copies of the same line
In ML-UCA, the faster level may not match an application's working-set size: either too large (and thus slow) or too small (and thus incurring misses)

Configuration for simulation
Simulators: sim-alpha and CACTI
Simple mapping
Multicast search
One-bank promotion on each hit
Replacement policy: the block in the slowest bank is the victim on a miss

Hit Rate Distribution for D-NUCA

Simulation results – integer benchmarks

Simulation results – FP benchmarks

Summary
D-NUCA has the following strengths:
Low access latency
Technology scalability
Performance stability
Flattens the memory hierarchy