Carnegie Mellon Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry.

Presentation transcript:

Carnegie Mellon Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry School of Computer Science, Carnegie Mellon University †Dept. of Electrical & Computer Engineering, University of Toronto

Zhai, Colohan, Steffan and Mowry Carnegie Mellon Compiler Optimization of Memory-Resident Value Communication… Motivation Chip-level multiprocessing is becoming commonplace  We need parallel programs  UltraSPARC IV  2 UltraSPARC III cores  IBM POWER4  Sun MAJC  SiByte SB-1250 Can multithreaded processors improve the performance of a single application?

Why Is Automatic Parallelization Difficult? Automatic parallelization today  Must statically prove threads are independent  Constructing proofs is difficult due to ambiguous data dependences:  Complex control flow  Pointers and indirect references  Runtime inputs One solution: Thread-Level Speculation  An optimistic compiler is limited only by true dependences

Example while (...){ … x=hash[index1]; … hash[index2]=y;... } Time … = hash[19] … hash[21] =... check_dep() Thread 2 … = hash[33] … hash[30] =... check_dep() Thread 3 … = hash[3] … hash[10] =... check_dep() Thread 1 … = hash[10] … hash[25] =... check_dep() Thread 4 … = hash[31] … hash[12] =... check_dep() Thread 5 … = hash[9] … hash[44] =... check_dep() Thread 6 … = hash[27] … hash[32] =... check_dep() Thread 7  … = hash[10] … hash[25] =... check_dep() Thread 4 Retry  Processor 1 Processor 2 Processor 3 Processor 4

Frequently Dependent Scalars …=a a=… …=a a=…  Can identify scalars that always cause dependences Time Producer Consumer

Frequently Dependent Scalars …=a a=… …=a a=…  Dependent scalars should be synchronized [ASPLOS’02] Time Signal(a) Wait(a) Producer Consumer

Frequently Dependent Scalars …=a a=…  Dataflow analysis allows us to deal with complex control flow [ASPLOS’02] …=a a=… Time Producer Consumer

Communicating Memory-Resident Values Synchronize? Speculate?  Will speculation succeed? Time Load *p Store *q Load *p Store *q Producer Consumer

Speculation vs. Synchronization Sequential Execution / Speculative Parallel Execution Load *p  Speculation succeeds: efficient Time Load *p Store *q Load *p Store *q

Speculation vs. Synchronization Sequential Execution / Speculative Parallel Execution  Speculation fails: inefficient  Load *p Time Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q violation

Speculation vs. Synchronization Sequential Execution / Speculative Parallel Execution  Frequent dependences: Synchronize  Infrequent dependences: Speculate Load *p Time Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q Load *p Store *q

Performance Potential  Reducing failed speculation improves performance Detailed simulation: TLS support, 4-processor CMP, 4-way issue out-of-order superscalar, 10-cycle communication latency [Chart: normalized regional execution time, original vs. perfect memory-value prediction, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp]

Hardware vs. Compiler-Inserted Synchronization Store *q Load *p Memory Store *q Load *p Memory Store *q Load *p Memory Speculation Hardware-Inserted Synchronization [HPCA’02] Compiler-Inserted Synchronization [CGO’04] Time Signal() (stall) Producer Consumer Producer Consumer Producer Consumer Wait()

Issues in Synchronizing Memory-Resident Values  Static analysis  Which instructions to synchronize?  Inter-procedural dependences  Runtime  Detecting and recovering from improper synchronization Store *q Load *p Producer Consumer Time

Outline  Static analysis  Runtime checks  Results  Conclusions Load *p Producer Consumer Store *q Time

Compiler Passes Front End Back End foo.c foo.exe Insert Synchronization Profile Data Dependences Create Threads Schedule Instructions Decide what to Synchronize

Example work() push (head, entry) do { push (&set, element); work(); } while (test);

Example work() { if (condition(&set)) push (&set, element); } push (head, entry) do { push (&set, element); work(); } while (test);

Example work() { if (condition(&set)) push (&set, element); } push(head,entry) { entry->next = *head; *head = entry; } Load *head Store *head Load *head (work, push) Load *head (push) Store *head (work, push) Store *head (push) do { push (&set, element); work(); } while (test);

Compiler Passes Front End Back End Insert Synchronization Profile Data Dependences Create Threads Schedule Instructions Decide what to Synchronize foo.c foo.exe

Example work() { if (condition(&set)) push (&set, element); } do { push (&set, element); work(); } while (test); push(head,entry) { entry->next = *head; *head = entry; } Load *head (push) Store *head (push) Load *head (work, push) Store *head (work, push) Profile Information: Store *head (push) → Load *head (push): 990; Store *head (push) → Load *head (work, push): 10; Store *head (work, push) → Load *head (push): 10

Compiler Passes Front End Back End Insert Synchronization Profile Data Dependences Create Threads Schedule Instructions Decide what to Synchronize foo.c foo.exe

Dependence Graph Load *head (work, push) Store *head (work, push) Load *head (push) Store *head (push)  Pairs that need to be synchronized can be extracted from the dependence graph Infrequent dependences: occur in less than 5% of iterations

Compiler Passes Front End Back End Insert Synchronization Profile Data Dependences Create Threads Schedule Instructions Decide what to Synchronize foo.c foo.exe

Example work() { if (condition(&set)) push (&set, element); } do { push (&set, element); work(); } while (test); push(head,entry) { entry->next = *head; *head = entry; } Load *head (push) Store *head (push) 990 Synchronize these push_clone(head,entry) { wait(); entry->next = *head; *head = entry; signal(head, *head); } push_clone(&set, element);

Outline Static analysis  Runtime checks  Results  Conclusions Producer Consumer Store *q Load *p Time

Runtime Checks  Store *q and Load *p access the same memory address  No store modifies the forwarded address between Store *q and Load *p Signal(q, *q);  Producer forwards the address to ensure a match between the load and the store Producer Consumer Load *p Store *q Time

Ensuring Correctness Store *x Store *q and Load *p access the same memory address  No store modifies the forwarded address between Store *q and Load *p Consumer Producer  Hardware support  Similar to the memory conflict buffer [Gallagher et al., ASPLOS’94] Load *p Store *q Time

Ensuring Correctness  Hardware support: TLS hardware already knows which locations are stored to Store *q and Load *p access the same memory address  No store modifies the forwarded address between Store *q and Load *p Consumer Producer Store *y Load *p Store *q Time

Outline Static analysis Runtime checks  Results  Conclusions Producer Consumer Store *q Load *p Time

Experimental Framework Underlying architecture  4-processor, single-chip multiprocessor  speculation supported through coherence Simulator  superscalar, similar to MIPS R14K  10-cycle communication latency  models all bandwidth and contention Benchmarks  SPECint95 and SPECint2000, -O3 optimization  detailed simulation [Diagram: four processors with private caches connected by a crossbar]

Parallel Region Coverage [Chart: coverage for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp]  Coverage is significant  Average coverage: 54%

Compiler-Inserted Synchronization [Chart: normalized regional execution time per benchmark, broken into Failed Speculation / Synchronization Stall / Other / Busy; U=No Synchronization Inserted, C=Compiler-Inserted Synchronization]  Seven benchmarks speed up by 5% to 46%

Compiler- vs. Hardware-Inserted Synchronization [Chart: normalized regional execution time per benchmark, broken into Failed Speculation / Synchronization Stall / Other / Busy; C=Compiler-Inserted Synchronization, H=Hardware-Inserted Synchronization] Compiler and hardware [HPCA’02] each benefit different benchmarks: the hardware does better on some, the compiler on others

Combining Hardware and Compiler Synchronization [Chart: normalized regional execution time for go, m88ksim, gzip_comp, gzip_decomp, perlbmk and gap; C=Compiler-inserted synchronization, H=Hardware-inserted synchronization, B=Combining Both]  The combination is more robust than either technique individually

Related Work Compiler-inserted: Cytron, ICPP’86; Tsai & Yew, PACT’96; Zhai et al., CGO’04 Hardware-inserted (centralized vs. distributed table designs): Moshovos et al., ISCA’97; Cintra & Torrellas, HPCA’02; Steffan et al., HPCA’02

Conclusions Compiler-inserted synchronization for memory-resident value communication:  Effective in reducing speculation failure  Half of the benchmarks speed up by 5% to 46% (regional)  Combining hardware and compiler techniques is more robust  Neither consistently outperforms the other  They can be combined to track the best performer  Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware

Questions?

The Potential of Instruction Scheduling [Chart: normalized regional execution time per benchmark, broken into Failed Speculation / Synchronization Stall / Other / Busy; E=Early, C=Compiler-Inserted Synchronization, L=Late] Scheduling instructions has additional benefit for some benchmarks

Program Performance [Chart: normalized whole-program execution time per benchmark, broken into Failed Speculation / Synchronization Stall / Other / Busy; U=Un-optimized, C=Compiler-Inserted Synchronization, H=Hardware-Inserted Synchronization, B=Both compiler and hardware]

Which Technique Synchronizes This Load? [Chart: per benchmark, the fraction of loads synchronized by neither technique, by the compiler only, by the hardware only, or by both; U=Un-optimized, C=Compiler-Inserted Synchronization, H=Hardware-Inserted Synchronization, B=Both compiler and hardware]

Ensuring Correctness  Hardware support  Similar to the memory conflict buffer [Gallagher et al., ASPLOS’94] Store *q Load *p Store *x Store *q and Load *p access the same memory address  No store modifies the forwarded address between Store *q and Load *p Consumer Producer

Ensuring Correctness  Hardware support Use the forwarded value only if the synchronized pair is dependent: if q == p and there is no local store to *p, the consumer uses the forwarded value; otherwise it uses the memory value Store *q Load *p Store *x Signal(q, *q) Producer Consumer

Issues in Synchronizing Memory-Resident Values Inserting synchronization using compilers Ensuring correctness  Reducing synchronization cost Store *q Load *p Consumer Producer

Reducing Cost of Synchronization Before Instruction Scheduling Consumer Producer  Instruction scheduling algorithms are described in [ASPLOS’02] After Instruction Scheduling Producer Consumer

The Potential of Instruction Scheduling [Chart: normalized regional execution time for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place and gap, broken into Failed Speculation / Synchronization Stall / Other / Busy; E=Perfectly predicting synchronized memory-resident values, C=Compiler-inserted synchronization, L=Consumer stalls until previous thread commits]  Scheduling instructions could offer additional benefit

Using More Accurate Profiling Information [Chart: normalized regional execution time for gzip_comp, broken into Failed Speculation / Synchronization Stall / Other / Busy; U=No Instruction Scheduling, C=Compiler-Inserted Synchronization, R=Compiler-Inserted Synchronization profiled with the ref input set]  gzip_comp is the only benchmark sensitive to the profiling input