Jason F. Cantin, Mikko H. Lipasti, and James E. Smith

Presentation transcript:

Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking. Jason F. Cantin, Mikko H. Lipasti, and James E. Smith. International Symposium on Computer Architecture, June 7th, 2005.

Overview of Idea. Coarse-Grain Coherence Tracking monitors the coherence status of memory at a multi-line granularity, uses the coarse-grain information to identify requests that don’t need a coherence broadcast, and sends these requests directly to memory. June 7, 2005 ISCA 2005

Problem. Snoop-based systems support a limited number of processors: broadcast bandwidth is limited, and memory latency is increasing. [Diagram labels: Broadcast Network, Data Network, P, $, NC, MC, DRAM]

Opportunity. Some data requests don’t need a broadcast: requests for non-shared data, fetches of unmodified instructions, and write-backs. Some non-data requests don’t need to leave the processor: requests to upgrade a copy that is not shared, and requests to flush copies that are not cached elsewhere.

[Figure: Unnecessary Broadcasts]

Our Approach. Identify requests that don’t need a broadcast. Send data requests directly to memory, which reduces broadcast traffic and, in some systems, latency. Avoid sending non-data requests externally, which further reduces broadcast traffic and reduces latency.

Coarse-Grain Coherence Tracking. Memory is divided into coarse-grain regions. Each region is an aligned, power-of-two multiple of the cache line size and can range from two lines to a physical page. A cache-like structure, the Region Coherence Array (RCA), is added to each processor to monitor coherence at the granularity of regions.
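The region mapping described on this slide can be sketched as below. This is an illustration only, not the authors' implementation: the function name `region_of` and the default sizes (64-byte lines, 512-byte regions) are assumptions chosen for the example.

```python
def region_of(addr: int, line_size: int = 64, region_size: int = 512) -> tuple:
    """Return (region number, line index within the region) for a byte address.

    Assumes both sizes are powers of two and that regions are aligned,
    so the mapping reduces to integer division and remainder.
    """
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    if region_size < 2 * line_size or region_size & (region_size - 1):
        raise ValueError("region_size must be a power-of-two multiple (>= 2) of line_size")
    line_number = addr // line_size
    lines_per_region = region_size // line_size
    return line_number // lines_per_region, line_number % lines_per_region
```

With 4-byte lines and two lines per region (sizes inferred here from the tags in the example slides, so treat them as an assumption), addresses 1000₂ and 1100₂ map to lines 0 and 1 of the same region, 001.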

Coarse-Grain Coherence Tracking. Each RCA entry has an address tag, a state, and a count of the lines cached by the processor. The state indicates whether the processor and/or other processors are sharing or modifying lines in the region. On cache misses, the region state is read to determine whether a broadcast is necessary.
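A minimal sketch of the miss-time decision follows. It is a simplification under stated assumptions: the slides do not spell out the region states here, so the three-valued `others` field, the constant names, and the `needs_broadcast` function are hypothetical stand-ins for the real region protocol.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical summary of what other processors may hold in the region.
OTHERS_INVALID, OTHERS_CLEAN, OTHERS_DIRTY = "invalid", "clean", "dirty"

@dataclass
class RegionEntry:
    tag: int             # region address tag
    others: str          # assumed summary of other processors' status
    line_count: int = 0  # lines of this region cached by this processor

def needs_broadcast(entry: Optional[RegionEntry], wants_exclusive: bool) -> bool:
    """Decide on a cache miss whether the request must be broadcast."""
    if entry is None:
        return True   # RCA miss: nothing is known about the region
    if entry.others == OTHERS_INVALID:
        return False  # no other processor caches lines of the region
    if entry.others == OTHERS_CLEAN and not wants_exclusive:
        return False  # others hold only unmodified lines; memory is current
    return True       # others may hold modified lines, or exclusivity is needed
```

The second and third cases correspond to the "requests for non-shared data" and "fetches of unmodified instructions" opportunities listed earlier in the talk.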

Coarse-Grain Coherence Tracking. On snoops, the region state provides a response for the region, which is piggy-backed onto the conventional response and used to update other processors’ region state. The RCA maintains inclusion over the caches: when regions are evicted, their lines are evicted, because the RCA must respond correctly whenever a region’s lines are cached. The replacement algorithm uses the line count.
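One way a line-count-aware replacement policy might look is sketched below. The slides say only that replacement uses the line count and that regions with no lines cached are replaced first; the LRU tie-break, the `Way` record, and the `choose_victim` name are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Way:
    region_tag: int
    line_count: int  # lines of the region currently held in the cache
    last_used: int   # hypothetical timestamp, used only to break ties

def choose_victim(ways: List[Way]) -> Way:
    """Pick the RCA entry to evict from one set.

    Entries with no cached lines sort first (False < True), so evicting
    them forces no line evictions; ties go to the least recently used.
    """
    return min(ways, key=lambda w: (w.line_count > 0, w.last_used))
```

Because the RCA is inclusive, evicting a victim with `line_count > 0` would also require evicting that many lines from the cache, which is why empty regions are the preferred victims.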

Example: Conventional Snooping. P0 loads address 1000₂ and misses in its cache ($0). The request "Read: P0, 1000₂" is broadcast on the network and a snoop is performed. The responses are Invalid. The response is sent and the data is transferred from memory. In $0, the line with tag 0010 goes from Invalid to Pending to Exclusive. [Animated diagram: processors P0 and P1, caches $0 and $1, memories M0 and M1]

Coarse-Grain Coherence Tracking (Region Coherence Array added; two lines per region). P0 loads address 1000₂ and misses. The request "Read: P0, 1000₂" is broadcast on the network and a snoop is performed. The responses are "Invalid, Region Not Shared", so P0 has exclusive access to the region. The response is sent and the data is transferred from memory. In $0, the line with tag 0010 goes from Invalid to Pending to Exclusive. In P0’s RCA, the entry for region 001 goes from Invalid to Pending to DI. [Animated diagram: processors P0 and P1, caches $0 and $1 with their RCAs, memories M0 and M1]

Coarse-Grain Coherence Tracking (Region Coherence Array added; two lines per region). P0 loads address 1100₂: a cache miss, but a region hit. The region state is exclusive (region 001 is DI in P0’s RCA), so a broadcast is unnecessary. The direct request "Read: P0, 1100₂" is sent to memory and the data is transferred. In $0, the line with tag 0011 goes from Invalid to Pending to Exclusive. [Animated diagram: processors P0 and P1, caches $0 and $1 with their RCAs, memories M0 and M1]

Coarse-Grain Coherence Tracking (Region Coherence Array added; two lines per region). P1 stores to address 1000₂ and misses. The request "RFO: P1, 1000₂" is broadcast on the network, a snoop is performed, and it hits in P0’s cache. The response is "Owned, Region Owned", so the region is not exclusive anymore. The response is sent and the data is transferred. In $0, the line with tag 0010 goes from Exclusive to Pending to Invalid, and P0’s RCA entry for region 001 goes from DI to DD. In $1, the line with tag 0010 goes from Invalid to Pending to Modified, and P1’s RCA entry for region 001 goes from Invalid to Pending to DD. [Animated diagram: processors P0 and P1, caches $0 and $1 with their RCAs, memories M0 and M1]
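The three example accesses can be replayed with a toy model. This is a deliberately reduced sketch, not the protocol from the talk: it tracks only whether a region is held exclusively instead of the full set of region states, it assumes 4-byte lines (inferred from the tags on the slides), and the `Node` class and `access` function are hypothetical names.

```python
LINE_BITS = 2      # assumed: 4-byte lines, so address 1000 (binary) is line 0010
REGION_LINES = 2   # two lines per region, as in the example

class Node:
    def __init__(self, name):
        self.name = name
        self.lines = set()      # line numbers cached by this processor
        self.exclusive = set()  # regions known to be cached nowhere else

def access(node, others, addr, write=False):
    """Return 'hit', 'direct' or 'broadcast' for one access by `node`."""
    line = addr >> LINE_BITS
    region = line // REGION_LINES
    if region in node.exclusive:
        outcome = "hit" if line in node.lines else "direct"
        node.lines.add(line)
        return outcome              # region unshared: no snoop needed
    if line in node.lines and not write:
        return "hit"
    # Broadcast: every other node snoops and reports on the whole region.
    shared = False
    for other in others:
        if any(l // REGION_LINES == region for l in other.lines):
            shared = True
        if write:
            other.lines.discard(line)   # invalidate the remote copy
        other.exclusive.discard(region)  # requester now caches this region
    if not shared:
        node.exclusive.add(region)
    node.lines.add(line)
    return "broadcast"

p0, p1 = Node("P0"), Node("P1")
trace = [
    access(p0, [p1], 0b1000),              # P0 loads 1000: nothing known yet
    access(p0, [p1], 0b1100),              # P0 loads 1100: same region
    access(p1, [p0], 0b1000, write=True),  # P1 stores 1000
]
print(trace)  # ['broadcast', 'direct', 'broadcast']
```

After the third access, region 001 is no longer in `p0.exclusive`, mirroring the "region not exclusive anymore" step on the slide.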

Overhead. Storage space is needed for the RCA: a 3-6% storage overhead for the cache. Two bits are needed in the snoop response for the region response. A path to memory is needed to avoid broadcasts; this is simple with on-chip memory controllers and may leverage the data network.

Simulator. PHARMsim, an execution-driven simulator built on top of SimOS-PPC. Four 4-way superscalar out-of-order processors. Two-level hierarchy with split L1 and unified L2 caches. Separate address and data networks, similar to Fireplane. Region Coherence Array with the same number of sets and the same associativity as the L2.

Workloads. Scientific: Ocean, Raytrace, Barnes. Multiprogrammed: SPECint2000_rate. Commercial: TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000.

[Figure: Broadcasts Avoided]

[Figure: Snoop Traffic Reduction – Peak. Labeled values: 64%, 51%, 38%]

[Figure: Snoop Traffic Reduction – Average. Labeled values: 47%, 74%, 86%]

[Figure: Execution Time. Labeled value: 91.2%]

Remaining Opportunity. With 512B regions, ~10% of requests are broadcast unnecessarily. A third of the 10% are due to region false sharing. Half of the 10% miss in the RCA (potential for prefetching).
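Expressed as fractions of all requests, the breakdown on this slide works out roughly as follows. The figures are the approximate ones quoted above; the variable names are illustrative, and the slide does not attribute the remainder to any cause.

```python
unnecessary = 0.10                 # ~10% of requests, with 512B regions
false_sharing = unnecessary / 3    # a third of the 10%
rca_misses = unnecessary / 2       # half of the 10%
unattributed = unnecessary - false_sharing - rca_misses

for label, share in [("region false sharing", false_sharing),
                     ("RCA misses", rca_misses),
                     ("not attributed on the slide", unattributed)]:
    print(f"{label}: {share:.1%} of all requests")
```

That is about 3.3% of all requests for region false sharing, 5% for RCA misses, and about 1.7% left over.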

[Figure: Inclusion Overhead] Regions with no lines cached are replaced first.

Conclusion. Coarse-Grain Coherence Tracking reduces broadcast traffic (most data requests are sent directly to memory), reduces latency (many requests are not sent to a central arbitration point, and many non-data requests are not sent externally), and improves scalability and performance.

The End

[Figure: Inclusion Evictions]

Ordering. The ordering point is now the Region Coherence Array: a direct request is ordered once it accesses the RCA. Direct requests are serialized with respect to snoop requests, so a direct request occurs either before or after a snoop. All requests must appear to access and update the RCA atomically. No two processors can have exclusive access to a region at the same time (no races).
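The atomicity requirement can be illustrated with a software analogy. This is not how the hardware is built: the lock merely stands in for whatever mechanism makes each RCA access-and-update appear atomic, and the class and method names are hypothetical.

```python
import threading

class RegionCoherenceArray:
    """Toy model: one lock makes every RCA access-and-update atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self._exclusive = set()  # regions this processor holds exclusively

    def grant_exclusive(self, region):
        with self._lock:
            self._exclusive.add(region)

    def try_direct_request(self, region):
        """A direct request is ordered at the moment it reads the RCA.

        Returns True if the request may bypass the broadcast.
        """
        with self._lock:
            return region in self._exclusive

    def snoop(self, region):
        """An external request revokes exclusivity; returns the prior status."""
        with self._lock:
            was_exclusive = region in self._exclusive
            self._exclusive.discard(region)
            return was_exclusive
```

Because both paths take the same lock, a direct request observes the region state either entirely before or entirely after a snoop, never in between.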

Comparison to RegionScout
Optimization: CGCT targets latency; RegionScout targets power.
Avoids broadcast for non-shared data: Yes
Avoids broadcast for clean data: No
Avoids tag lookups on snoops: Yes, like Jetty
Region state storage: CGCT uses an inclusive cache; RegionScout uses a hash table and a small cache.
Region state transfer: CGCT uses 2 bits in the snoop response; RegionScout uses 1 bit in the snoop response.
Region protocol: CGCT has 7 states; RegionScout has effectively 4 states.

[Figure: Execution Time]