Using Prediction to Accelerate Coherence Protocols Shubu Mukherjee, Ph.D. Principal Hardware Engineer VSSAD Labs, Alpha Development Group Compaq Computer.

Using Prediction to Accelerate Coherence Protocols
Shubu Mukherjee, Ph.D., Principal Hardware Engineer
VSSAD Labs, Alpha Development Group, Compaq Computer Corporation, Shrewsbury, Massachusetts
Joint work with Mark D. Hill at the University of Wisconsin-Madison
Published in the Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA), 1998.

Distributed Shared-Memory Machine
[Figure: two nodes, each with a CPU, cache, directory hardware, and main memory, connected by a network]
  Memory is physically distributed for scalability
  Per-CPU caches cache remote memory
  Cache coherence via directory protocols

Reduce Directory Protocol Latency Using Prediction
[Figure: two message diagrams among the producer cache, the directory, and the consumer cache. First diagram: get_rw_request, inval_ro_request, inval_ro_response, get_rw_response. Second diagram: get_rw_request, get_rw_response, inval_ro_response. Legend: coherence protocol action, speculative action.]
  Dynamic Self-Invalidation (Lebeck & Wood, ISCA ’95)

Directed Predictors
Many examples
  Read-modify-write in SGI Origin (Laudon & Lenoski, ISCA ’97)
  Scalable Coherent Interface (SCI)’s pairwise sharing
  Protocols optimized for migratory sharing (Cox/Fowler, Stenstrom, et al., ISCA ’93)
  Dynamic Self-Invalidation (Lebeck & Wood, ISCA ’95)
  Competitive Update (Karlin, et al., Algorithmica ’88)
  Half-migratory optimization
  Compiler-directed prediction
Can we have a general predictor? => Cosmos
  + easier to compose multiple predictors
  + discover & adapt to application-specific patterns
  - more hardware

Cosmos: A General Predictor
Cosmos predictors for both cache (CP) and directory (DP)
Predictor issues
  what message to predict? (this talk)
  how to integrate with a real system? (NOT in this talk)
[Figure: two nodes connected by a network, each with a CPU, cache, directory hardware, and main memory; a CP sits with each cache and a DP with each directory]

Cosmos Overview
Given
  cache block address
  history of incoming coherence messages for the cache block (i.e., source processor and message type tuples)
Cosmos predicts
  next incoming coherence message for the cache block
Cosmos’ Structure
  two-level adaptive predictor
  resembles Yeh & Patt’s PAp branch predictor (ISCA ’92)
Cosmos’ Prediction Accuracy
  % for five parallel scientific applications

Outline
  Motivation & Overview
  Cosmos’ Structure
  Cosmos Results

Producer-Consumer Sharing Pattern
Cache Blocks Have Predictable Message Signatures
[Figure: message signatures for one cache block.
  DIRECTORY: get_rw_request from producer, inval_ro_response from consumer, inval_rw_response from producer, get_ro_request from consumer
  Producer Cache: get_rw_response, inval_rw_request
  Consumer Cache: get_ro_response, inval_ro_request]

Cosmos’ Basic Structure
[Figure: the global address of a cache block indexes the Message History Table (MHT), which in turn indexes the Pattern History Tables (PHT)]
  Parameterized by “depth” of MHT and “filters” for PHT
  (Reminiscent of Yeh and Patt’s PAp branch predictor)
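The two-level lookup described on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the hardware organization from the paper: the class name `CosmosSketch`, the use of dictionaries for both tables, and the "always overwrite on update" policy (no filter) are assumptions made here for clarity.

```python
from collections import defaultdict, deque


class CosmosSketch:
    """Hypothetical two-level predictor: per-block history, then per-block patterns."""

    def __init__(self, depth=1):
        self.depth = depth
        # First level (MHT): block address -> last `depth` (sender, msg_type) tuples.
        self.mht = defaultdict(lambda: deque(maxlen=depth))
        # Second level (PHTs): block address -> {history tuple -> predicted next tuple}.
        self.pht = defaultdict(dict)

    def predict(self, block):
        """Return the predicted next (sender, msg_type) tuple, or None if unknown."""
        history = tuple(self.mht[block])
        if len(history) < self.depth:
            return None  # not enough history recorded yet
        return self.pht[block].get(history)

    def update(self, block, message):
        """Record the message that actually arrived for this block."""
        history = tuple(self.mht[block])
        if len(history) == self.depth:
            # Unfiltered policy (an assumption): always remember the latest successor.
            self.pht[block][history] = message
        self.mht[block].append(message)


# Usage: a block whose incoming messages alternate between two tuples.
predictor = CosmosSketch(depth=1)
stream = [("P", "get_rw_request"), ("C", "inval_ro_response")] * 3
hits = 0
for msg in stream:
    if predictor.predict(0x40) == msg:
        hits += 1
    predictor.update(0x40, msg)
```

With depth one, the first three messages only train the tables, so the remaining three are predicted correctly in this toy stream.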

Cosmos’ Entries for Producer-Consumer Signature
Cosmos at the directory
[Figure: the producer-consumer message signature (DIRECTORY: get_rw_request from producer, inval_ro_response from consumer, inval_rw_response from producer, get_ro_request from consumer; Producer Cache (P): get_rw_response, inval_rw_request; Consumer Cache (C): get_ro_response, inval_ro_request) next to the resulting table entries: the global address of the cache block selects the MHT entry, whose contents index the PHT to give the prediction]
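As a rough illustration of what the directory-side entries for this signature might look like, the sketch below derives depth-one pattern entries from the four messages the directory receives. The temporal order of the cycle is an assumption inferred from the protocol (the slide text does not preserve it), and `pht_for_cycle` is a hypothetical helper, not a name from the paper.

```python
def pht_for_cycle(signature, depth=1):
    """Map each window of `depth` messages in a repeating signature to its successor."""
    n = len(signature)
    table = {}
    for i in range(n):
        history = tuple(signature[(i + k) % n] for k in range(depth))
        table[history] = signature[(i + depth) % n]
    return table


# Assumed order in which the directory sees messages for one block
# (P = producer cache, C = consumer cache).
DIRECTORY_SIGNATURE = [
    ("P", "get_rw_request"),
    ("C", "inval_ro_response"),
    ("C", "get_ro_request"),
    ("P", "inval_rw_response"),
]

entries = pht_for_cycle(DIRECTORY_SIGNATURE, depth=1)
for history, prediction in entries.items():
    print(history, "->", prediction)
```

Because the four tuples are distinct, a history of depth one is enough to identify the next message in this cycle; signatures that repeat a tuple would need a deeper history.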

Outline
  Motivation & Overview
  Cosmos’ Structure
  Cosmos Results

Evaluation Methodology
Traces of coherence messages
Simulator
  Wisconsin Wind Tunnel II (Mukherjee, et al., PAID ’97)
Simulated coherence protocol = Wisconsin Stache
  Full-map
  Simple COMA (main memory used as software cache)
  Reinhardt, et al., ISCA ’94
Simulated benchmarks
  appbt (NAS)
  barnes (SPLASH II)
  dsmc, moldyn, unstructured (Universities of Maryland & Wisconsin)

Cosmos’ Base Prediction Rate
  Overall accuracy = % (base)
  Low accuracy for barnes
    reassignment of logical data structures to different memory addresses

Example Signatures: Appbt
[Figure: message-signature graphs for the CACHE and the DIRECTORY; surviving labels: 94, inval_rw_request, upgrade_response, inval_ro_request, get_ro_response, get_ro_request, inval_rw_response, inval_ro_response, upgrade_request]
  Numbers for MHR of depth one, summarized for all cache blocks

Increasing Cosmos’ Accuracy
  Overall prediction accuracy = %
Other techniques
  filters (e.g., J. Smith’s saturating counters)
  subdividing coherence message stream (suggested by Sohi)
  available in Mukherjee, Ph.D. Thesis, May 1998
  ftp://ftp.cs.wisc.edu/wwt/Theses/mukherjee-1side.ps
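One plausible reading of a saturating-counter filter is sketched below: a pattern entry keeps its prediction until it has been wrong often enough to drain a small counter. The slide does not give the exact policy, so the class name `FilteredEntry`, the counter limit, and the replace-at-zero rule are all assumptions for illustration.

```python
class FilteredEntry:
    """Hypothetical PHT entry guarded by a saturating counter."""

    def __init__(self, prediction, max_count=1):
        self.prediction = prediction
        self.max_count = max_count  # saturation limit (assumed parameter)
        self.count = 0

    def train(self, actual):
        """Update the entry with the message that actually arrived."""
        if actual == self.prediction:
            # Correct prediction: gain confidence, up to the saturation limit.
            self.count = min(self.count + 1, self.max_count)
        elif self.count > 0:
            # Wrong, but confidence remains: keep the prediction, lose confidence.
            self.count -= 1
        else:
            # Wrong with no confidence left: adopt the new message.
            self.prediction = actual


# Usage: a single stray message does not evict an established prediction.
entry = FilteredEntry("get_ro_request", max_count=1)
entry.train("get_ro_request")   # confidence rises to 1
entry.train("upgrade_request")  # one-off noise: prediction survives
kept = entry.prediction
entry.train("upgrade_request")  # second miss in a row: prediction replaced
```

The point of the filter is to trade slower adaptation for resistance to occasional out-of-pattern messages.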

Cosmos’ Memory Overhead
[Table: ratio and ovhd for appbt, barnes, dsmc, moldyn, and unstruct. at each depth of MHR; the numeric values are not recoverable]
  Ratio = total number of PHT entries / total number of MHT entries
  Ovhd = average memory overhead per 128-byte block
  For MHR depth = 2, overhead < 13% for all, except barnes (35%)
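The relation between the ratio and the per-block overhead can be made concrete with a small calculation. The entry sizes below are assumptions chosen for illustration (a 2-byte tuple, and a PHT entry that stores its history plus one predicted tuple); the slide itself only defines the ratio and the 128-byte block size, and the input values in the usage lines are hypothetical, not results from the talk.

```python
def overhead_per_block(depth, ratio, tuple_bytes=2, block_bytes=128):
    """Estimate predictor storage per cache block as a fraction of the block size.

    depth       -- number of tuples kept in each MHR
    ratio       -- PHT entries per MHT entry (as defined on the slide)
    tuple_bytes -- assumed size of one (sender, msg_type) tuple
    block_bytes -- cache block size (128 bytes on the slide)
    """
    mhr_bytes = depth * tuple_bytes              # one MHR per block
    pht_entry_bytes = (depth + 1) * tuple_bytes  # history plus prediction (assumed layout)
    return (mhr_bytes + ratio * pht_entry_bytes) / block_bytes


# Hypothetical inputs, only to show the arithmetic.
example = overhead_per_block(depth=2, ratio=1.0)
print(f"{example:.1%}")
```

Under these assumptions the overhead grows linearly with the ratio, which is why an application that creates many pattern entries per block stands out in the table.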

Summary and Future Work
Cosmos Predictor
  predicts next coherence message for a cache block
  uses history information
  + simpler than composition of multiple directed predictors
  + adapts dynamically to application-specific coherence streams
  - requires more hardware than directed predictors
Cosmos’ Prediction Accuracy
  % for four applications
  % for barnes (reassignment of logical data structures)
Future Work
  improve Cosmos’ accuracy (e.g., Kaxiras/Goodman 1999, Lai/Falsafi 1999)
  integrate Cosmos with a coherence protocol (e.g., Lai/Falsafi 1999)