CS 7810 Lecture 17: Managing Wire Delay in Large CMP Caches. B. Beckmann and D. Wood, Proceedings of MICRO-37, December 2004.

Cache Design [Figure: cache organization; the address drives a decoder into the tag array and data array, comparators check the tags, and sense amps plus a mux+driver deliver the data]
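
The diagram's datapath can be mirrored in software. A minimal sketch of a set-associative lookup, assuming illustrative sizes (128 sets, 4 ways, 64-byte lines) that are not from the lecture: the index decoder becomes a modulo, the per-way comparators become tag compares, and the mux+driver becomes the copy-out of the hitting way.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_SETS   128   /* the address decoder selects one of these */
#define NUM_WAYS   4
#define LINE_BYTES 64

typedef struct {
    bool     valid[NUM_WAYS];
    uint64_t tag[NUM_WAYS];               /* tag array */
    uint8_t  data[NUM_WAYS][LINE_BYTES];  /* data array */
} CacheSet;

static CacheSet cache[NUM_SETS];

/* Returns true on a hit and copies the line out (the mux+driver stage). */
bool cache_lookup(uint64_t addr, uint8_t line_out[LINE_BYTES])
{
    uint64_t block = addr / LINE_BYTES;
    uint64_t set   = block % NUM_SETS;    /* decoder: pick the set */
    uint64_t tag   = block / NUM_SETS;
    for (int w = 0; w < NUM_WAYS; w++) {  /* comparators: one per way */
        if (cache[set].valid[w] && cache[set].tag[w] == tag) {
            memcpy(line_out, cache[set].data[w], LINE_BYTES);
            return true;
        }
    }
    return false;
}
```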

Capacity vs. Latency: 8 KB = 1 cycle, 32 KB = 2 cycles, 128 KB = 3 cycles.

Large L2 Caches. Issues to be addressed for Non-Uniform Cache Access (NUCA): mapping (which bank holds a block), searching (how a request finds it), and movement (when a block migrates between banks).
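
To make the mapping issue concrete, here is a minimal sketch of the usual static approach: interleave block addresses across banks, with access cost growing with the bank's distance from the CPU. The 16-bank 4x4 grid and the hop metric are assumptions for illustration, not the paper's configuration.

```c
#include <stdint.h>

#define NUM_BANKS   16   /* assumed: a 4x4 grid of L2 banks */
#define GRID_DIM    4
#define BLOCK_BYTES 64

/* Mapping: static interleave of block addresses across banks. */
int bank_of(uint64_t addr)
{
    return (int)((addr / BLOCK_BYTES) % NUM_BANKS);
}

/* Why mapping matters: latency grows with the distance a request must
 * travel. Manhattan hops from a CPU placed at the grid corner. */
int hops_to_bank(int bank)
{
    int row = bank / GRID_DIM;
    int col = bank % GRID_DIM;
    return row + col;
}
```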

Dynamic NUCA. Frequently accessed blocks are moved closer to the CPU, which reduces average latency. Partial (6-bit) tags are maintained close to the CPU, so a tag look-up can identify the potential location of a block or quickly signal a miss. Without partial tags, every possible location would have to be searched, either serially or in parallel. What if you optimize for power?
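
A minimal sketch of the partial-tag filter just described: 6-bit tag stubs kept near the CPU either rule out every bank (a fast, guaranteed miss) or narrow the search to a few candidate banks whose full tags must still be checked, since stubs can alias. Structure sizes, bit positions, and the one-way-per-set simplification are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS     16
#define SETS_PER_BANK 256
#define PARTIAL_BITS  6

/* One 6-bit stub per (bank, set), maintained near the CPU.
 * One way per set, for brevity. */
static bool    stub_valid[NUM_BANKS][SETS_PER_BANK];
static uint8_t stub_tag[NUM_BANKS][SETS_PER_BANK];

/* Returns a bitmask of banks that MIGHT hold the block. An empty mask
 * is a guaranteed miss, signaled without touching any remote bank; a
 * set bit is only a hint, and the bank's full tag must confirm it. */
uint32_t candidate_banks(uint64_t addr)
{
    uint32_t set  = (uint32_t)((addr >> 6) % SETS_PER_BANK);
    uint8_t  stub = (uint8_t)((addr >> 14) & ((1u << PARTIAL_BITS) - 1));
    uint32_t mask = 0;
    for (int b = 0; b < NUM_BANKS; b++)
        if (stub_valid[b][set] && stub_tag[b][set] == stub)
            mask |= 1u << b;
    return mask;
}
```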

DNUCA – CMP. Bank latencies range from 13-17 cycles (banks nearest the requesting core) to 65 cycles (the farthest banks). Allocation: static, based on the block's address. Migration: r.l → r.i → r.c → m.c → m.i → m.l, i.e., a block gradually migrates from the remote core's local banks toward the requesting core's local banks. Search: multicast to the 6 closest banks; on a miss there, multicast to the remaining 10. False misses: a block in transit between banks can be missed by the search even though it is on chip.
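
A sketch of the two-phase search above, assuming a 16-bank DNUCA with the 6 banks nearest the requesting core probed first. The search order and the one-block-per-bank stand-in for a real tag probe are illustrative, not the paper's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS  16
#define NEAR_BANKS 6

/* Toy stand-in for a bank's tag probe: one resident block per bank. */
static uint64_t resident_block[NUM_BANKS];

static bool bank_has_block(int bank, uint64_t block)
{
    return resident_block[bank] == block;
}

/* Banks listed closest-first for this core (illustrative ordering). */
static const int search_order[NUM_BANKS] =
    { 0, 1, 2, 3, 4, 5,                       /* phase 1: 6 near banks */
      6, 7, 8, 9, 10, 11, 12, 13, 14, 15 };   /* phase 2: the other 10 */

/* Returns the bank holding the block, or -1 on a miss. */
int dnuca_search(uint64_t block)
{
    for (int i = 0; i < NEAR_BANKS; i++)          /* first multicast */
        if (bank_has_block(search_order[i], block))
            return search_order[i];
    for (int i = NEAR_BANKS; i < NUM_BANKS; i++)  /* second multicast */
        if (bank_has_block(search_order[i], block))
            return search_order[i];
    /* If the block was migrating between banks while the probes were
     * in flight, we can reach here even though the block is on chip:
     * a "false miss". */
    return -1;
}
```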

Alternative Layout (from Huh et al., ICS’05)

Block Sharing

Hit Distribution

Block Migration Results. While block migration reduces average distance, it complicates search.

CMP-TLC (Transmission Line Caches). Pros: fast transmission-line wires enable uniform, low-latency access. Cons: low-bandwidth interconnect, high implementation cost, and more latency/complexity at the L2 interface.

Stride Prefetching. Prefetching algorithm: detect at least 4 accesses with a uniform stride, then allocate an entry in the stream buffer. The stream buffer has 8 entries, and each stream stays 6 (L1) or 25 (L2) accesses ahead of the demand stream.
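
A minimal sketch of this policy, keeping the slide's parameters (8 streams, 4-access confirmation, 6- or 25-access prefetch distance) but assuming a trivial training table and a placeholder issue_prefetch interface; positive strides only, for brevity.

```c
#include <stdint.h>

#define STREAMS 8   /* stream buffer entries, per the slide */
#define CONFIRM 4   /* uniform-stride accesses before allocation */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      count;      /* consecutive same-stride accesses seen */
    int      active;     /* stream buffer entry allocated? */
    int      distance;   /* accesses to stay ahead: 6 (L1) or 25 (L2) */
    uint64_t next_fetch; /* next address this stream will prefetch */
} Stream;

static Stream streams[STREAMS];

static void issue_prefetch(uint64_t addr) { (void)addr; /* placeholder */ }

/* Train on each access; slot would come from hashing the load PC. */
void stride_access(int slot, uint64_t addr, int is_l2)
{
    Stream *s = &streams[slot % STREAMS];
    int64_t d = (int64_t)(addr - s->last_addr);

    if (d != 0 && d == s->stride) {
        if (++s->count >= CONFIRM && !s->active) {
            s->active     = 1;               /* allocate the entry */
            s->distance   = is_l2 ? 25 : 6;
            s->next_fetch = addr + (uint64_t)s->stride;
        }
    } else {                                 /* stride broke: retrain */
        s->stride = d;
        s->count  = 1;
        s->active = 0;
    }
    s->last_addr = addr;

    /* Keep the stream its configured distance ahead of demand. */
    while (s->active && s->stride > 0 &&
           s->next_fetch < addr + (uint64_t)(s->distance * s->stride)) {
        issue_prefetch(s->next_fetch);
        s->next_fetch += (uint64_t)s->stride;
    }
}
```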

Combination of Techniques
