Using Prediction to Accelerate Coherence Protocols. Authors: Shubhendu S. Mukherjee and Mark D. Hill. Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA), 1998.

Presentation transcript:

Using Prediction to Accelerate Coherence Protocols. Authors: Shubhendu S. Mukherjee and Mark D. Hill. Proceedings of the 25th Annual International Symposium on Computer Architecture. Publication date: 27 Jun-1 Jul 1998. Presenter: Naresh Sukumar.

Motivation. In multiprocessors using directory protocols, some memory references suffer long latencies on misses to remotely cached blocks. To ameliorate this latency, standard coherence protocols have been augmented with optimizations for specific sharing patterns (e.g., read-modify-write, producer-consumer, and migratory sharing). This paper aims to create general prediction logic that adapts to the actual sharing patterns encountered at run time.

What will be covered? Introduction to the directory protocol. General behavior of a predictor. The Cosmos coherence message predictor. Integrating Cosmos with a coherence protocol. Benchmarking Cosmos. Analysis of the results. Conclusions.

Introduction to the Directory Protocol. This is the preferred method of cache coherence in large-scale shared-memory multiprocessors. The protocol associates state with both caches and memory at the granularity of a cache block. To simplify discussion, the paper considers a full-map, write-invalidate directory protocol, and lists a sample of the coherence messages usually found in such protocols.
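As a point of reference, the following is a minimal sketch of the per-block state a full-map, write-invalidate directory keeps; the field names are illustrative, not taken from the paper.

    # One directory entry per memory block: a stable state plus, in a full-map
    # protocol, one presence bit per node (here represented as a set of node ids).
    from dataclasses import dataclass, field

    @dataclass
    class DirectoryEntry:
        state: str = "Idle"                        # e.g. Idle / Shared / Exclusive
        sharers: set = field(default_factory=set)  # nodes holding a read-only copy
        owner: int = -1                            # node holding the writable copy, -1 if none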

Disadvantages. The protocol often incurs multiple long-latency operations: a directory may need to exchange messages with other caches before it can respond to a processor's request for a memory block. A store to a block residing in another node's cache, for example, cannot complete until the directory has invalidated or retrieved that remote copy.

General Behavior of a Predictor. Predictors predict future sharing patterns and take actions to overlap coherence message activity with current work. Existing pattern-specific types include read-modify-write, pairwise sharing, dynamic self-invalidation, and migratory protocols. Predictors would sit beside each standard directory and cache module to monitor coherence activity and request appropriate actions.

The Cosmos Coherence Message Predictor: signature patterns; basic structure of Cosmos; updating Cosmos; adaptability to complex signatures; filtering noise; implementation issues for Cosmos.

Signature patterns. A signature pattern is the sequence of message signatures observed by the producer cache, the consumer cache, and the directory. In a slightly more complicated example, two consumers each send a get_ro_request; as shown later, the order in which these requests arrive does not matter.
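A hypothetical signature stream for such a producer-consumer case is shown below; the message names merely echo the get_ro_request mentioned above and are not the paper's full message vocabulary.

    # Each signature is a <message type, sender> pair observed for one cache block.
    # Illustrative names only, not the protocol's actual message set.
    signature_stream = [
        ("get_rw_req", "producer"),       # producer obtains a writable copy
        ("get_ro_request", "consumer1"),  # consumers then read the block back
        ("get_ro_request", "consumer2"),
    ]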

Basic Structure of Cosmos. The logical structure of the Cosmos coherence message predictor requires two pieces of information: the address of the cache block, since sharing patterns may differ across cache blocks, and the history of messages for that cache block.

Basic Structure of Cosmos (contd.). MHT: Message History Table, whose entries are per-block Message History Registers (MHRs). PHT: Pattern History Table. Obtaining a prediction from Cosmos works as sketched below.
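A minimal Python sketch of this lookup path, assuming simple in-memory dictionaries in place of the hardware tables; function and variable names are illustrative, not the paper's.

    # mht: block address -> MHR, a tuple of the most recent <message, sender>
    #      signatures seen for that block.
    # pht: block address -> {MHR value -> predicted next signature}.
    def predict_next(block_addr, mht, pht):
        """Return the predicted next <message, sender> signature, or None."""
        mhr = mht.get(block_addr, ())            # 1. index the MHT with the block address
        return pht.get(block_addr, {}).get(mhr)  # 2. use the MHR contents to index the PHT

    mht, pht = {}, {}
    prediction = predict_next(0x80, mht, pht)    # None until history is recorded for the block

Until some message history has been recorded for a block, the lookup simply yields no prediction.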

Updating Cosmos: index into the MHT with the address of the cache block; use the MHR entry to index the corresponding PHT; write the newly observed tuple as the prediction for that MHR value; then left-shift the tuple into the MHR for the cache block. A sketch of this update path follows.
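Continuing the dictionary-based sketch from the lookup above (again an assumed illustration, not the paper's hardware):

    def record_signature(block_addr, observed, mht, pht, depth=1):
        """Fold the <message, sender> signature just observed for this block into the tables."""
        mhr = mht.get(block_addr, ())                   # index the MHT with the block address
        pht.setdefault(block_addr, {})[mhr] = observed  # observed signature becomes the
                                                        # prediction for this history value
        mht[block_addr] = (mhr + (observed,))[-depth:]  # left-shift it into the MHR

    mht, pht = {}, {}
    record_signature(0x80, ("get_ro_request", "consumer1"), mht, pht)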

Adaptability to complex signatures. When the directory receives messages from two or three consumers, Cosmos adapts itself so that it becomes insensitive to the order in which those messages arrive. In general, Cosmos can adapt to complex message streams.

Filtering Noise. For example, if message B follows message A 99% of the time, then on seeing message A, Cosmos will predict that the next message is B. The prediction should not change when, on rare occasions, the messages arrive in the sequence A, C, B instead of A, B. Cosmos therefore uses a counter and updates the prediction only after two consecutive mispredictions for the same block.
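One way such a filter could look, assuming a single per-entry miss counter (a sketch of the idea, not the paper's exact mechanism):

    def filtered_update(entry, observed):
        """entry: {'prediction': signature or None, 'misses': int}; returns the updated entry."""
        if entry.get('prediction') == observed:
            entry['misses'] = 0                     # correct prediction: reset the counter
        else:
            entry['misses'] = entry.get('misses', 0) + 1
            if entry['misses'] >= 2:                # two consecutive mispredictions:
                entry['prediction'] = observed      # adopt the new pattern
                entry['misses'] = 0
        return entry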

Implementation issues for Cosmos. Cosmos is a two-level adaptive predictor. The first level, containing the MHRs, can be merged with the cache-block state already maintained at both directories and caches. The second level is more challenging, as it may require large amounts of memory; empirically, however, the memory overhead for 128-byte cache blocks was found to be less than 14% for an MHR depth of one.

Integrating Cosmos with a Coherence Protocol: mapping predictions to actions; determining when to perform actions; detecting and handling mispredictions. Actions fall into three classes: actions that move the protocol between two "legal" states; actions that move the protocol to a future state but do not expose this state to the processor; and actions that allow both the processor and the protocol to move to future states.

Modeling the Performance. For the simple model, the parameters are defined as follows: p, the prediction accuracy for each message; f, the fraction of the delay incurred on messages that are predicted correctly; and r, the penalty due to a mispredicted message. These feed a crude execution model that translates coherence message prediction rates into a parallel program's speedup.
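One plausible way to write such a model down (this exact form is an assumption for illustration, not necessarily the paper's equation): a correctly predicted message costs only the fraction f of its normal latency, while a mispredicted one costs the full latency plus the penalty r, so the expected latency of a coherence message relative to the unpredicted case is p*f + (1-p)*(1+r).

    def relative_message_latency(p, f, r):
        """Expected coherence-message latency, normalized to no prediction (assumed model)."""
        return p * f + (1.0 - p) * (1.0 + r)

    # Example: 90% accuracy, correct predictions hide 80% of the delay, 20% misprediction penalty.
    print(1.0 / relative_message_latency(p=0.9, f=0.2, r=0.2))  # about 3.3x on the message component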

Benchmarking Cosmos: the benchmarks that were run, and the prediction accuracy achieved for each of them.

Analysis of the Results. Filters increase prediction accuracy slightly, but only for predictors with an MHR depth of one. The time to reach steady-state prediction rates varies with the application. The memory requirement of Cosmos predictors is generally within 22%.

Conclusions. Cosmos is less complex than composing predictors for several directed optimizations into a single protocol. Cosmos can identify application-specific patterns that are not known a priori. Cosmos achieves high prediction accuracies of 80% and above for most applications. Compared to other optimizations, Cosmos requires more hardware resources to store, access, and update the MHT and PHT.

Thank You. Questions?