Non-Uniform Cache Architecture Prof. Hsien-Hsin S

Slides:



Advertisements
Similar presentations
Directory-Based Cache Coherence Marc De Melo. Outline Non-Uniform Cache Architecture (NUCA) Cache Coherence Implementation of directories in multicore.
Advertisements

1 Lecture 17: Large Cache Design Papers: Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, Cho and Jin, MICRO’06 Co-Operative Caching.
Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.
55:035 Computer Architecture and Organization Lecture 7 155:035 Computer Architecture and Organization.
Virtual Exclusion: An Architectural Approach to Reducing Leakage Energy in Multiprocessor Systems Mrinmoy Ghosh Hsien-Hsin S. Lee School of Electrical.
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs Mrinmoy Ghosh Hsien-Hsin S. Lee School.
1 Lecture 16: Large Cache Design Papers: An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al., ASPLOS’02 Distance Associativity.
Computer Architecture Evaluation, Simulation and Research OSU ECE OS Interaction with Cache Memories Dr. Sohum Sohoni School of Electrical and Computer.
University of Michigan Electrical Engineering and Computer Science University of Michigan Electrical Engineering and Computer Science University of Michigan.
HK-NUCA: Boosting Data Searches in Dynamic NUCA for CMPs Javier Lira ψ Carlos Molina ф Antonio González ψ,λ λ Intel Barcelona Research Center Intel Labs.
Lecture 12: DRAM Basics Today: DRAM terminology and basics, energy innovations.
1 Lecture 15: DRAM Design Today: DRAM basics, DRAM innovations (Section 5.3)
1 Lecture 15: Virtual Memory and Large Caches Today: TLB design and large cache design basics (Sections )
Non-Uniform Power Access in Large Caches with Low-Swing Wires Aniruddha N. Udipi with Naveen Muralimanohar*, Rajeev Balasubramonian University of Utah.
1 Lecture 16: Large Cache Innovations Today: Large cache design and other cache innovations Midterm scores  91-80: 17 students  79-75: 14 students 
Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture Zvika Guz, Idit Keidar, Avinoam Kolodny, Uri C. Weiser The Technion – Israel.
1 Lecture 19: Networks for Large Cache Design Papers: Interconnect Design Considerations for Large NUCA Caches, Muralimanohar and Balasubramonian, ISCA’07.
1 Lecture 8: Large Cache Design I Topics: Shared vs. private, centralized vs. decentralized, UCA vs. NUCA, recent papers.
Lecture 17: Virtual Memory, Large Caches
1 Lecture 11: Large Cache Design Topics: large cache basics and… An Adaptive, Non-Uniform Cache Structure for Wire-Dominated On-Chip Caches, Kim et al.,
1 Lecture 15: Large Cache Design Topics: innovations for multi-mega-byte cache hierarchies Reminders:  Assignment 5 posted.
CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CS 7810 Lecture 17 Managing Wire Delay in Large CMP Caches B. Beckmann and D. Wood Proceedings of MICRO-37 December 2004.
University of Utah 1 The Effect of Interconnect Design on the Performance of Large L2 Caches Naveen Muralimanohar Rajeev Balasubramonian.
1 Lecture 1: Introduction and Memory Systems CS 7810 Course organization:  5 lectures on memory systems  5 lectures on cache coherence and consistency.
Main Memory Background Random Access Memory (vs. Serial Access Memory) Cache uses SRAM: Static Random Access Memory –No refresh (6 transistors/bit vs.
TLC: Transmission Line Caches Brad Beckmann David Wood Multifacet Project University of Wisconsin-Madison 12/3/03.
Non-Uniform Cache Architectures for Wire Delay Dominated Caches Abhishek Desai Bhavesh Mehta Devang Sachdev Gilles Muller.
1 University of Utah & HP Labs 1 Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian.
Lecture 6. Multithreading & Multicore Processors Prof. Taeweon Suh Computer Science Education Korea University COM515 Advanced Computer Architecture.
Cache memory October 16, 2007 By: Tatsiana Gomova.
ECE8833 Polymorphous and Many-Core Computer Architecture Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Lecture 5 Non-Uniform Cache.
Design and Management of 3D CMP’s using Network-in-Memory Feihui Li et.al. Penn State University (ISCA – 2006)
ECE 4100/6100 Advanced Computer Architecture Lecture 13 Multithreading and Multicore Processors Prof. Hsien-Hsin Sean Lee School of Electrical and Computer.
CPU Cache Prefetching Timing Evaluations of Hardware Implementation Ravikiran Channagire & Ramandeep Buttar ECE7995 : Presentation.
1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.
Javier Lira (Intel-UPC, Spain)Timothy M. Jones (U. of Cambridge, UK) Carlos Molina (URV, Spain)Antonio.
Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture.
A Gentler, Kinder Guide to the Multi-core Galaxy Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Guest lecture for ECE4100/6100.
1 Lecture: Virtual Memory, DRAM Main Memory Topics: virtual memory, TLB/cache access, DRAM intro (Sections 2.2)
ECE 4100/6100 Advanced Computer Architecture Lecture 11 DRAM and Storage Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia.
Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs Javier Lira ψ Carlos Molina ф Antonio González λ λ Intel Barcelona Research.
1 Lecture 13: Cache, TLB, VM Today: large caches, virtual memory, TLB (Sections 2.4, B.4, B.5)
1 Lecture 14: DRAM Main Memory Systems Today: cache/TLB wrap-up, DRAM basics (Section 2.3)
Harmony: A Run-Time for Managing Accelerators Sponsor: LogicBlox Inc. Gregory Diamos and Sudhakar Yalamanchili.
Analysis of NUCA Policies for CMPs Using Parsec Benchmark Suite Javier Lira ψ Carlos Molina ф Antonio González λ λ Intel Barcelona Research Center Intel.
CMP L2 Cache Management Presented by: Yang Liu CPS221 Spring 2008 Based on: Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z.
ASPLOS’02 Presented by Kim, Sun-Hee.  Technology trends ◦ The rate of frequency scaling is slowing down  Performance must come from exploiting concurrency.
1 Lecture: Cache Hierarchies Topics: cache innovations (Sections B.1-B.3, 2.1)
Copyright © 2010 Houman Homayoun Houman Homayoun National Science Foundation Computing Innovation Fellow Department of Computer Science University of California.
1 Efficient System-on-Chip Energy Management with a Segmented Counting Bloom Filter Mrinmoy Ghosh- Georgia Tech Emre Özer- ARM Ltd Stuart Biles- ARM Ltd.
1 Lecture: Virtual Memory Topics: virtual memory, TLB/cache access (Sections 2.2)
Spring EE 437 Lillevik 437s06-l22 University of Portland School of Engineering Advanced Computer Architecture Lecture 22 Distributed computer Interconnection.
LRU-PEA: A Smart Replacement Policy for NUCA caches on Chip Multiprocessors Javier Lira ψ Carlos Molina ψ,ф Antonio González ψ,λ λ Intel Barcelona Research.
1 Lecture: DRAM Main Memory Topics: DRAM intro and basics (Section 2.3)
§ Georgia Institute of Technology, † Intel Corporation Initial Observations of Hardware/Software Co-simulation using FPGA in Architecture Research Taeweon.
1 Lecture 29: Interconnection Networks Papers: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton Interconnect Design.
1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)
University of Utah 1 Interconnect Design Considerations for Large NUCA Caches Naveen Muralimanohar Rajeev Balasubramonian.
Advanced Caches Smruti R. Sarangi.
CS 704 Advanced Computer Architecture
Lecture: Large Caches, Virtual Memory
Lecture 13: Large Cache Design I
Bank-aware Dynamic Cache Partitioning for Multicore Architectures
Directory-based Protocol
Introduction to CUDA Programming
A Case for Interconnect-Aware Architectures
Presentation transcript:

Non-Uniform Cache Architecture Prof. Hsien-Hsin S Non-Uniform Cache Architecture Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech Guest lecture for ECE4100/6100 for Prof. Yalamanchili

Non-Uniform Cache Architecture ASPLOS 2002 proposed by UT-Austin Facts Large shared on-die L2 Wire-delay dominating on-die cache 3 cycles 1MB 180nm, 1999 11 cycles 4MB 90nm, 2004 24 cycles 16MB 50nm, 2010

Multi-banked L2 cache Bank=128KB 11 cycles 2MB @ 130nm Bank Access time = 3 cycles Interconnect delay = 8 cycles

Multi-banked L2 cache Bank=64KB 47 cycles 16MB @ 50nm Bank Access time = 3 cycles Interconnect delay = 44 cycles

Static NUCA-1 Use private per-bank channel Tag Array Data Bus Address Bus Bank Sub-bank Predecoder Sense amplifier Wordline driver and decoder Use private per-bank channel Each bank has its distinct access latency Statically decide data location for its given address Average access latency =34.2 cycles Wire overhead = 20.9%  an issue

Static NUCA-2 Tag Array Bank Switch Data bus Predecoder Wordline driver and decoder Use a 2D switched network to alleviate wire area overhead Average access latency =24.2 cycles Wire overhead = 5.9%

Dynamic NUCA Data can dynamically migrate Move frequently used cache lines closer to CPU

Dynamic NUCA bank 8 bank sets one set way 0 way 1 way 2 way 3 Simple Mapping All 4 ways of each bank set needs to be searched Farther bank sets  longer access

Dynamic NUCA bank 8 bank sets one set way 0 way 1 way 2 way 3 Fair Mapping Average access time across all bank sets are equal

Dynamic NUCA bank 8 bank sets way 0 way 1 way 2 way 3 Shared Mapping Sharing the closet banks for farther banks