Predicting Coherence Communication by Tracking Synchronization Points at Run Time. Socrates Demetriades and Sangyeun Cho. 45th International Symposium on Microarchitecture, December 2012.


Predicting Coherence Communication by Tracking Synchronization Points at Run Time. Socrates Demetriades and Sangyeun Cho. 45th International Symposium on Microarchitecture, December 2012.

Coherence Communication [Shared Memory Model / Write-Invalidate Coherence Protocol] Block A is exclusive to T0. T13 issues a request to share A, so T13 "communicates" with T0 and block A is copied to T13. Coherence communication is the result of data sharing between threads running on a shared-memory multiprocessor with coherent private caches.

Coherence Communication [Shared Memory Model / Write-Invalidate Coherence Protocol] Block A is shared. T13 requests exclusive ownership (an upgrade), so T13 "communicates" with T0 & T6 to invalidate their copies. Communicating misses: all requests that must communicate with at least one other core.
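The two transactions above can be sketched as a toy write-invalidate directory model. This is an illustrative Python sketch (the `Directory` class and its methods are hypothetical, not the paper's simulator): on a read miss the requester contacts the exclusive owner, and on an upgrade it contacts every other copy holder.

```python
# Toy write-invalidate bookkeeping: the directory records which cores
# hold each block and returns the cores a requester must contact.
class Directory:
    def __init__(self):
        self.sharers = {}   # block -> set of core ids holding a copy
        self.owner = {}     # block -> core id with exclusive ownership

    def read_miss(self, block, core):
        """Return the cores the requester must communicate with."""
        targets = set()
        if block in self.owner:                # block exclusive elsewhere
            targets.add(self.owner.pop(block)) # owner downgrades to sharer
        self.sharers.setdefault(block, set()).update(targets)
        self.sharers[block].add(core)
        return targets

    def write_miss(self, block, core):
        """Upgrade: invalidate every other copy, take ownership."""
        targets = self.sharers.pop(block, set()) - {core}
        if self.owner.get(block, core) != core:
            targets.add(self.owner[block])
        self.owner[block] = core
        return targets
```

Replaying the slides' scenario: T0 writes A (no communication), T13's read miss contacts {T0}, and T13's later upgrade again contacts the remaining copy holder.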

Communication Overheads. Directory-based coherence protocol: a miss is routed indirectly through the directory (A: T0), which increases miss latency. Snoop-based coherence protocol: a miss is broadcast to all cores, which increases traffic.

Communication Prediction [Figure: on a miss to block A, the requester predicts the destination (T0) and sends the request directly]. Trade-off: accuracy vs. extra traffic.

Traditional Prediction Approaches 1. Simple temporal-based prediction: locality between consecutive misses. 2. ADDRESS-based prediction: locality based on the address of the request; the predictor table scales with the number of accessed addresses. 3. INSTRUCTION-based prediction: locality based on the static load/store instruction; the table scales with the number of static loads/stores.
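As a sketch of the storage cost these approaches imply, an ADDRESS-based predictor can be modeled as a table indexed by block address (illustrative Python; the class name and the simple union-based training policy are assumptions, not the paper's hardware):

```python
from collections import defaultdict

# ADDRESS-based destination predictor sketch: one table entry per
# accessed block address, remembering which cores serviced past misses.
class AddrPredictor:
    def __init__(self):
        self.table = defaultdict(set)   # block address -> predicted cores

    def predict(self, addr):
        return self.table[addr]         # empty set means "no prediction"

    def train(self, addr, actual_destinations):
        self.table[addr] |= actual_destinations
```

An INSTRUCTION-based predictor is structurally identical, only indexed by the PC of the static load/store. Either way the table grows with the number of accessed addresses or static instructions, which is the storage cost SP-prediction avoids.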

Contribution of this work: Synchronization Point based Prediction (SP-prediction). Inter-thread communication caused by coherence transactions is tightly related to the synchronization points in a parallel execution. Main idea: associate the communication behavior with synchronization points and use this association to predict the destinations of misses. Main advantage: very low storage cost, yet relatively high performance.

Outline: Introduction, Motivation & Observations, SP-Prediction, Evaluation, Conclusion

Why Synchronization Points? [Figure: four cores executing with SIGNAL/WAIT, LOCK/UNLOCK and BARRIER synchronization points (Pthread notation); arrows show the direction of shared-data communication across each sync point.]
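The role of a barrier as a boundary for communication can be illustrated with a small produce/consume sketch. Python's threading.Barrier stands in here for pthread_barrier_wait, and the workload itself is hypothetical:

```python
import threading

# Barriers split execution into sync-epochs: before the barrier each
# thread touches only its own slot; after it, every thread reads every
# other thread's contribution, so sharing flows across the sync point.
NUM_THREADS = 4
barrier = threading.Barrier(NUM_THREADS)
shared = [0] * NUM_THREADS
results = [0] * NUM_THREADS

def worker(tid):
    shared[tid] = tid * 10          # epoch A: each thread produces
    barrier.wait()                  # sync point: epoch A -> epoch B
    results[tid] = sum(shared)      # epoch B: each thread consumes all

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

On a machine with private caches, misses before the barrier would hit each thread's own data, while misses after it would fetch the other threads' blocks, so the communication pattern changes exactly at the sync point.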

Synchronization Epochs [Figure: communication distribution of Core 0 (# contacts per destination core ID), shown for the full interval vs. broken down by sync-epochs A-D between sync-points A-E. Benchmark: Bodytrack / 16 threads.] Each sync-epoch exhibits its own, much sharper communication distribution than the full interval.

[Figure: communication distribution of Core 0 for the same sync-epoch across different dynamic instances (A B A B A B ...); the distribution stays similar from instance to instance. Benchmark: Bodytrack / 16 threads.]

Outline: Introduction, Motivation & Observations, SP-Prediction, Evaluation, Conclusion

SP-prediction – Overview Monitor the destinations of each miss on each core. Extract a communication signature for each sync-epoch. Store and later reuse those signatures to predict misses in future sync-epoch instances. When initial predictions do not exist or are inaccurate, reconstruct the signatures within the sync-epoch. Sync points must be exposed to the hardware so it can sense the beginning and end of sync-epochs: –A dedicated instruction is inserted at the calling location of each synchronization point. –The PC, lock variable and type are extracted and passed to a history table.

SP-prediction: History-based. During a sync-epoch, Core 0 tracks the destination of its communication and extracts a hot communication set (the cores C0-C3 it contacts most). At the end of the epoch, the set is stored in the SP-table, indexed by the sync-point PC.
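Extracting a hot communication set can be sketched as a frequency filter over the destinations observed during the epoch (illustrative Python; the 20% threshold is an assumed parameter, not taken from the paper):

```python
from collections import Counter

# Keep as the "hot set" every core that received at least a given
# fraction of this epoch's communicating misses.
def hot_set(destinations, threshold=0.2):
    counts = Counter(destinations)
    total = sum(counts.values())
    return {core for core, n in counts.items() if n / total >= threshold}
```

For example, an epoch whose misses mostly went to core 1 with stray contacts to cores 2 and 3 yields the hot set {1}, which is what later gets stored under the sync-point PC.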

SP-prediction: History-based. When the same sync-point is reached again, the hot core set is retrieved from the SP-table (indexed by the sync-point PC) and used to predict the destination of each miss in the new epoch instance.

SP-prediction: History-based (for Locks). On a lock release, the releasing core's ID (here C0) is stored in the SP-table, indexed by the lock address.

SP-prediction: History-based (for Locks). On a lock acquire, the predictor is retrieved from the SP-table: misses inside the critical section are predicted to communicate with the last releasing core (here C0).
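The lock-specific predictor reduces to a tiny last-releaser table. A minimal sketch, with assumed method names:

```python
# Lock predictor sketch: on release, record the releasing core under
# the lock address; on acquire, predict that the lock-protected data
# currently lives at that core.
class LockPredictor:
    def __init__(self):
        self.last_releaser = {}     # lock address -> core id

    def on_release(self, lock_addr, core):
        self.last_releaser[lock_addr] = core

    def on_acquire(self, lock_addr):
        return self.last_releaser.get(lock_addr)   # None if never seen
```

This works because the previous lock holder wrote the shared data last, so its cache is the most likely destination for the new holder's misses.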

SP-prediction: First Sync-Epoch Instances. No history exists for a sync-point's first instance. Allow some warm-up time, then extract an "early" hot communication set and use it as the predictor for the rest of the epoch.

SP-prediction: Adaptive Recovery. When a sync-point is detected, the predictor is retrieved from the SP-table and used for each miss with high confidence. If prediction accuracy drops too low, a new hot communication set is extracted on the spot, and predictions continue with the new predictor.
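The recovery policy can be sketched as a confidence check over a window of misses (illustrative Python; the window size and accuracy floor are assumed parameters, not the paper's values):

```python
# Adaptive recovery sketch: use the retrieved hot set until its measured
# accuracy over a window of misses falls below a floor; then switch to
# rebuilding a new hot set from the misses observed on the spot.
class EpochPredictor:
    def __init__(self, hot_set, floor=0.5, window=16):
        self.hot_set = set(hot_set)
        self.floor = floor
        self.window = window
        self.hits = 0
        self.seen = 0
        self.rebuilding = False
        self.observed = []          # misses collected for the new hot set

    def observe(self, actual_dest):
        if self.rebuilding:
            self.observed.append(actual_dest)
            return
        self.seen += 1
        self.hits += actual_dest in self.hot_set
        if self.seen >= self.window and self.hits / self.seen < self.floor:
            self.rebuilding = True  # accuracy too low: recover on the spot
```

A predictor whose epoch suddenly contacts different cores trips the floor after one window and falls back to in-epoch reconstruction, which is the "first instance" path reused mid-epoch.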

Why SP-prediction In contrast to simple temporal prediction, it exploits application-defined, interval-based communication localities: –It is not restricted to temporal locality among consecutive misses. –It adapts faster to changes. –It can recall old and forgotten communication patterns. Compared to address- and instruction-based prediction, it has very low storage requirements: –The SP-table holds, on average, only 5-30 static sync points for a given application. It takes advantage of the existing programming paradigm while being transparent to the programmer.

Outline: Introduction, Motivation & Observations, SP-Prediction, Evaluation, Conclusion

Evaluation Methodology Workloads: from the SPLASH-2 & PARSEC suites; 5-30 static sync-epochs and 22-20,000 dynamic sync-epochs for the evaluated input sizes. SP-prediction implemented on top of a baseline directory protocol. Simulated machine configuration (based on Simics): in-order cores, private L1/L2 caches, directory slice, network logic, coherence logic.

Prediction Accuracy [Chart: prediction accuracy across benchmarks; 76% overall.]

Prediction Accuracy: 76%. Average destination set size (actual): 1.2; SP-prediction set size: 2.6.

Results: Latency & Bandwidth [Chart: latency and bandwidth results; annotations 13%, 18%, (5%).] Execution time improvement: 7% on average. Additional energy dissipation: <7% (14% NoC, 9% cache lookups), more than 90% lower compared to broadcasting.

Comparison with other Predictors [Chart: SP-prediction vs. other predictors; annotation: BEST POSITION.]

Comparison with other Predictors [Chart: comparison assuming infinite-entry prediction table storage.]

Comparison with other Predictors [Chart: comparison assuming 512-entry prediction table storage.]

Conclusions SP-prediction is a new, run-time, application-driven approach to communication prediction. It has very low storage requirements, an important property for emerging CMP implementations. It scales independently of core count and cache sizes. It takes advantage of the existing shared-memory programming paradigm and current consistency models.

Thank you for your attention!

Discussion The SP-table consumes considerably lower dynamic power than ADDR or INSTR tables: it is accessed only at sync-points, not on each miss. Thread migration is supported by tracking "logical" destinations. Projections for commercial workloads: critical sections (with unpredictable patterns) are effectively handled. SP-prediction is not perfect: –Coarse-grain sync-epochs may exhibit communication behavior that changes within the epoch. –Very fine sync-epochs cannot yield a good representative hot communication set. –Unless the sync-epoch is a critical section, unpredictable patterns cannot be discovered.