A Novel Directory-Based Non-Busy, Non-Blocking Cache Coherence. Huang Yongqin, Yuan Aidong, Li Jun, Hu Xiangdong. 2009 International Forum on Computer Science-Technology and Applications.

ABSTRACT The implementation of multiprocessor cache coherence and memory consistency can help homemade CPUs support a wide range of system designs. There has been a lot of research on various cache coherence protocols, such as the Piranha prototype system, GS320, and AMD64; this work proposes the NB2CC protocol. NB2CC divides serial processing into two steps: conflict detection and conflict resolution. Conflict detection is completed at the home node, while conflict resolution is distributed to the owners.

INTRODUCTION Directory-based cache coherence protocols are widely used in many systems because of their good scalability. Generally, there are two ways to solve the cache coherence problem: direction and indirection. Some researchers argue that directory-based protocols (DBPs) incur a performance penalty on sharing misses due to indirection: DBPs introduce a level of indirection to obtain scalability at the cost of increased sharing-miss latency. Those researchers believe that token counting is a new and promising approach.

Indirection protocols, also called traditional 3-hop protocols, have been widely used in many systems. When the home node does not hold the newest data, it forwards the request from the local node to the owner, and the forwarded request is serviced by the owner. Figure 1(a): Basic directory-based protocol.
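
As a rough illustration of the 3-hop flow in Figure 1(a), the minimal Python sketch below models a read request with plain dictionaries; the node names, directory layout, and the three_hop_read helper are hypothetical, not taken from the paper.

```python
# Minimal sketch of the 3-hop "indirection" flow of Figure 1(a).
# The directory layout, node names, and helper are assumptions.
def three_hop_read(addr, directory, caches, requester):
    entry = directory[addr]
    if entry["owner"] is not None:
        # Hop 1 was requester -> home; memory is stale, so:
        owner = entry["owner"]
        data = caches[owner][addr]                 # hop 2: home forwards to the owner
        caches[requester][addr] = data             # hop 3: owner replies to the requester
    else:
        caches[requester][addr] = entry["memory"]  # 2-hop case: home has the newest data
    entry["sharers"].add(requester)
    return caches[requester][addr]

directory = {0x100: {"owner": "P1", "memory": 0, "sharers": {"P1"}}}
caches = {"P0": {}, "P1": {0x100: 42}}
assert three_hop_read(0x100, directory, caches, "P0") == 42
```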

SGI Origin solves conflicts at the home node, shown as position A in Figure 1(b). When a request arrives at the home node, it sets the related block's state to active (we call it "busy"). All subsequent requests for that block are queued (at the home node or in the network) until the active request is deactivated, so a directory busy state is necessary in SGI Origin. Figure 1(b): Different conflict-resolution positions.
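
For contrast with NB2CC's non-busy approach, here is a minimal sketch of resolving conflicts at the home with a busy state, where later requests for an active block simply queue up; the class and method names are assumptions, not SGI Origin's actual implementation.

```python
# Minimal sketch of home-side conflict resolution with a "busy" directory
# state (position A in Figure 1(b)).
from collections import deque

class HomeEntry:
    def __init__(self):
        self.busy = False
        self.pending = deque()   # requests queued behind the active one

    def request(self, req):
        if self.busy:
            self.pending.append(req)   # serialized until the block is deactivated
            return "queued"
        self.busy = True               # block becomes active ("busy")
        return "in_service"

    def complete(self):
        if self.pending:
            return self.pending.popleft()   # next queued request becomes active
        self.busy = False                   # no waiter: deactivate the block
        return None

entry = HomeEntry()
assert entry.request("R1") == "in_service"
assert entry.request("R2") == "queued"   # R2 must wait at the home node
assert entry.complete() == "R2"
```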

Comparison In GS320, the global switch (the network) is where conflicts are resolved. In the Piranha prototype, the resolution position is moved to the end of the system (the owners). This protocol also introduces other innovative techniques that make great contributions to directory-based protocols, such as the clean-exclusive optimization, reply forwarding from remote owners, eager exclusive replies, and avoiding the use of negative acknowledgment (NAK) messages. However, this makes the protocol more specific to its own design and limits its applicability.

MATHEMATICS MODEL
- Serial processing in DBPs can be divided into two steps: conflict detection and conflict resolution.
- We define several sets for convenience of description.
- Request = {R1, R2, …, Rn-1} is the set of requests for a shared address block (e.g., address X) at a logical time.
- Home = {H1, H2, …, Hn-1} is the ordering of the requests in set Request as processed by the home directory. If Hi > Hj, then Ri is processed by the home directory before Rj.
- Local = {L1, L2, …, Ln-1} is the ordering of the requests in set Request as satisfied by the owners. If Li > Lj, then Ri is satisfied by the owner before Rj.
- Premise 1: For any requests Ri and Ri+1, Hi > Hi+1.
- Premise 2: If Hi > Hi+1, then Li > Li+1.
- Deduction 1: For any requests Ri and Rj, if Hi > Hj, then Li > Lj (apply Premise 2 transitively along the chain from Ri to Rj).
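
A minimal sketch of the ordering argument, assuming (as Premise 2 effectively requires) that forwarded requests travel from the home to the owner over a FIFO channel; the function names are ours, not the paper's.

```python
# Toy model: the home serializes conflicting requests (conflict detection,
# set Home) and the owner services the forwarded requests in arrival order
# (conflict resolution, set Local).  With FIFO delivery, the Local order
# equals the Home order, which is Deduction 1.
from collections import deque

def home_order(requests):
    return list(requests)            # arrival order at the home defines H1, H2, ...

def owner_service(forwarded):
    channel = deque(forwarded)       # FIFO channel: home -> owner
    served = []
    while channel:
        served.append(channel.popleft())
    return served

conflicting = ["R1", "R2", "R3"]     # requests for the same block X
assert owner_service(home_order(conflicting)) == conflicting   # Local follows Home
```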

NB2CC To achieve high efficiency and concurrency at small cost, NB2CC inherits many characteristics of traditional protocols, such as a relaxed memory model, the method of avoiding protocol deadlock, and the basic handling of request races.

A. Avoiding Deadlock NB2CC uses three virtual channels (VC0, VC1, and VC2) to eliminate the possibility of protocol deadlocks without resorting to NAKs/retries. The first channel (VC0) carries all requests (RQ1) from a processor to the home node. Messages from the home directory/memory (replies (ACK1) or forwarded messages (RQ2) to third party nodes or processors) are always carried on the second channel (VC1). The third channel (VC2) carries replies (ACK2) from a third-party node or processor to the requestor.
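
A minimal sketch of the channel assignment just described; the message names RQ1/ACK1/RQ2/ACK2 come from the slide, while the dependency check itself is our own illustration, not code from the paper.

```python
# Map each message class to a virtual channel so that every message only ever
# generates messages on a strictly higher-numbered channel.  The dependency
# chains RQ1 -> ACK1 and RQ1 -> RQ2 -> ACK2 therefore cannot form a cycle,
# which is what removes the need for NAKs/retries.
from enum import IntEnum

class VC(IntEnum):
    VC0 = 0   # requests from a processor to the home node
    VC1 = 1   # home replies and forwarded requests to third parties
    VC2 = 2   # replies from a third-party node back to the requestor

CHANNEL_OF = {
    "RQ1":  VC.VC0,   # local -> home
    "ACK1": VC.VC1,   # home  -> local (reply)
    "RQ2":  VC.VC1,   # home  -> owner (forwarded request)
    "ACK2": VC.VC2,   # owner -> local (reply)
}

def acyclic(chain):
    """Each step of a transaction must move to a strictly higher channel."""
    return all(CHANNEL_OF[a] < CHANNEL_OF[b] for a, b in zip(chain, chain[1:]))

assert acyclic(["RQ1", "ACK1"])
assert acyclic(["RQ1", "RQ2", "ACK2"])
```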

B. Non-Negative ACK The lack of NAKs/retries leads to a more efficient protocol and provides several important and desirable characteristics:
- Since an owner node is guaranteed to service a forwarded request, the directory state can be changed immediately (non-busy).
- We inherently eliminate the livelock and starvation problems that arise from the presence of NAKs.

C. P2P Order in VC1 Only
A requests:
- A (RQ1(A)) → Home (RQ2(A)) → Owner.
- At the same time, the transaction updates the directory state immediately.
- A coherence acknowledgment (Coh_ack1(A)) is sent to local node A.
B then requests:
- Home (RQ2(B)) → Owner A.
Two messages travel through the VC1 channel between the home node and node A: Coh_ack1(A) and RQ2(B). Point-to-point ordering on VC1 lets node A know their order.
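
A minimal sketch of the race above, assuming only that the single home-to-A link on VC1 delivers messages in the order they were sent (point-to-point order); the message names follow the slide, the rest is illustrative.

```python
# The home handles A's request, updates the directory at once (non-busy),
# sends Coh_ack1(A) to A, and only afterwards forwards B's request RQ2(B)
# to the new owner A.  FIFO delivery on the home -> A link of VC1 guarantees
# that A learns its own request is ordered before it must service B's.
from collections import deque

vc1_home_to_A = deque()
vc1_home_to_A.append("Coh_ack1(A)")   # sent first: A's transaction is ordered
vc1_home_to_A.append("RQ2(B)")        # sent second: forward B's request to A

delivered = []
while vc1_home_to_A:
    delivered.append(vc1_home_to_A.popleft())
assert delivered == ["Coh_ack1(A)", "RQ2(B)"]   # A services B only after Coh_ack1(A)
```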

D. Ownership Migrated Machine (OMM)
- Because of this technique, NB2CC guarantees that no more than two forwarded requests are sent to each owner.
- A cache line in the owned state holds the most recent, correct copy of the data, and other processors can also hold a copy of the most recent, correct data.

E. Illegible Invalidates Acknowledge (IIA) NB2CC supports an aggressive relaxed memory model, such as the Alpha memory model, which requires explicit memory barrier instructions to impose memory ordering. It supports eager exclusive replies, so it is possible for a request generated at the home node to be satisfied locally while remote invalidations caused by a previous operation have not yet been committed at the owners. Invalidation messages are injected from the home node, and the corresponding acknowledgments are gathered at the requesting node.

A receiving counter and a received counter are needed for IIA (48 bits or 64 bits are enough). The receiving counter (sent from the home node to the local node along with ACK1) records the number of acknowledgments the local node needs to receive. The received counter is incremented by 1 as soon as an invalidation acknowledgment arrives (from VC2). A memory barrier (MB) can complete when the two counters are equal. This greatly improves performance, because a RQ1 can complete without waiting for all invalidation acknowledgments.
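
A minimal sketch of the two-counter bookkeeping at the local node; the class, field, and method names are assumptions, not the paper's.

```python
# The receiving counter arrives with ACK1 and says how many invalidation
# acknowledgments this node must eventually collect; the received counter is
# bumped for every acknowledgment that arrives on VC2.  A memory barrier may
# complete only when the two counters match, but the RQ1 itself does not wait.
class LocalNode:
    def __init__(self):
        self.receiving = 0   # acks still owed to this node (from ACK1)
        self.received = 0    # invalidation acks collected so far (from VC2)

    def on_ack1(self, expected_acks):
        self.receiving += expected_acks   # eager exclusive reply: data usable now

    def on_inval_ack(self):
        self.received += 1

    def mb_can_complete(self):
        return self.received == self.receiving

node = LocalNode()
node.on_ack1(expected_acks=3)        # exclusive reply granted, 3 sharers to invalidate
assert not node.mb_can_complete()    # an MB would have to wait...
for _ in range(3):
    node.on_inval_ack()
assert node.mb_can_complete()        # ...until every invalidation is acknowledged
```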

Research shows that, as chip frequency increases, network delay plays an increasingly important role in system performance. Non-blocking processing of invalidations at the owner is a good way to hide this network delay. The combination of IIA, the aggressive relaxed memory model, and eager exclusive replies increases the concurrency of the system and gives NB2CC good performance.

F. Putting it all together Many techniques, such as deadlock avoidance, P2P order in VC1, and non-negative ACKs, which have already been used in other systems such as GS320 and Piranha, are incorporated into NB2CC.

ANALYSIS
- Only requests for the same block need to be delayed in this system, rather than blocking the request at the head of the queue.
- NB2CC is balanced and has no hot spot, providing a flexible interface for the programmer and compiler to reach high system performance.
- NB2CC is designed for a highly concurrent, pipelined system and solves the multiprocessor cache coherence problem in a software-transparent way.
- The low overhead and little dependence on the hardware implementation make the protocol largely implementation-independent.

NB2CC vs. GS320
- NB2CC does not support early commit, and an invalidation acknowledgment is required for the sake of regularity.
- Pipelined, non-blocking processing brings high efficiency when handling invalidations.
- The complex optimizations used in GS320 are avoided here, because we want a simple, regular, and efficient protocol.
- With respect to conflict resolution, the owners are responsible for this job in NB2CC, while in GS320 the global switch takes the responsibility.
- Hot spots are decentralized in our protocol.

Future work If needed, other techniques could be incorporated into this protocol, as long as they do not conflict with the basic rules provided here.

Thanks