Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
† Dept. Elec. & Comp. Engineering, University of Toronto

Zhai, Colohan, Steffan and Mowry, Carnegie Mellon: Compiler Optimization of Memory-Resident Value Communication…

- 2 - Motivation
Chip-level multiprocessing is becoming commonplace:
• UltraSPARC IV (2 UltraSPARC III cores)
• IBM Power4
• Sun MAJC
• SiByte SB-1250
→ We need parallel programs
Can multithreaded processors improve the performance of a single application?

- 3 - Why Is Automatic Parallelization Difficult?
Automatic parallelization today:
• Must statically prove threads are independent
• Constructing proofs is difficult due to ambiguous data dependences:
  • Complex control flow
  • Pointers and indirect references
  • Runtime inputs
Optimistic compiler?
• Limited only by true dependences
→ One solution: Thread-Level Speculation

- 4 - Example
    while (...) {
        …
        x = hash[index1];
        …
        hash[index2] = y;
        ...
    }
[Figure: loop iterations run as speculative threads on Processors 1 to 4 over time; each thread ends with check_dep()]
Thread 1: … = hash[3];  hash[10] = ...
Thread 2: … = hash[19]; hash[21] = ...
Thread 3: … = hash[33]; hash[30] = ...
Thread 4: … = hash[10]; hash[25] = ...
Thread 5: … = hash[31]; hash[12] = ...
Thread 6: … = hash[9];  hash[44] = ...
Thread 7: … = hash[27]; hash[32] = ...
Thread 4 loads hash[10], which Thread 1 stores to, so Thread 4 fails its dependence check and is retried.
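The dependence check in this example can be sketched in C as follows. This is only an illustration of the idea, not the TLS hardware mechanism: the `epoch` structure, the `violates` helper, and the sliding window of `nproc - 1` concurrently running predecessors are assumptions made for the sketch, and the check conservatively assumes the later thread's load ran before the earlier thread's store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One speculative thread (one loop iteration): it loads hash[read_idx]
 * and later stores to hash[write_idx]. Field names are illustrative. */
struct epoch {
    int read_idx;
    int write_idx;
};

/* True if `later` loaded a location that the logically earlier thread
 * stores to. Conservative: assumes the load ran before the store. */
bool violates(const struct epoch *earlier, const struct epoch *later)
{
    return later->read_idx == earlier->write_idx;
}

/* Checks epoch i against the earlier epochs that may run concurrently
 * with it (at most nproc - 1 predecessors). Returns the index of the
 * first conflicting predecessor, or -1 if speculation succeeded. */
int check_dep(const struct epoch *epochs, size_t i, size_t nproc)
{
    assert(nproc >= 1);
    size_t first = (i + 1 >= nproc) ? i + 1 - nproc : 0;
    for (size_t j = first; j < i; j++) {
        if (violates(&epochs[j], &epochs[i]))
            return (int)j;
    }
    return -1;
}
```

With the indices from the slide, the fourth thread (index 3) conflicts with the first (index 0) because both touch hash[10]; the other threads find no conflicting predecessor.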

- 5 - Frequently Dependent Scalars
[Figure: timeline of a producer thread (… = a; a = …) and a consumer thread (… = a; a = …)]
→ Can identify scalars that always cause dependences

- 6 - Frequently Dependent Scalars
[Figure: the producer executes Signal(a) and the consumer executes Wait(a) around their accesses to a]
→ Dependent scalars should be synchronized [ASPLOS’02]

- 7 - Frequently Dependent Scalars
[Figure: producer and consumer timelines in which the accesses to a (… = a; a = …) sit inside complex control flow]
→ Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]

- 8 - Communicating Memory-Resident Values
[Figure: producer and consumer timelines, each executing Load *p followed by Store *q]
Synchronize? Speculate?
→ Will speculation succeed?

- 9 - Speculation vs. Synchronization
[Figure: Sequential Execution vs. Speculative Parallel Execution of threads that each perform Load *p and Store *q]
→ Speculation succeeds: efficient

- 10 - Speculation vs. Synchronization
[Figure: Sequential Execution vs. Speculative Parallel Execution; a violation between Store *q and Load *p forces threads to re-execute repeatedly]
→ Speculation fails: inefficient

- 11 - Speculation vs. Synchronization
[Figure: Sequential Execution vs. Speculative Parallel Execution of threads that each perform Load *p and Store *q]
→ Frequent dependences: Synchronize
→ Infrequent dependences: Speculate

- 12 - Performance Potential
Detailed simulation:
• TLS support
• 4-processor CMP
• 4-way issue, out-of-order superscalar
• 10-cycle communication latency
[Figure: normalized regional execution time (0 to 100) per benchmark, comparing Original with Perfect memory value prediction]
Benchmarks: m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, go
→ Reducing failed speculation improves performance

- 13 - Hardware vs. Compiler Inserted Synchronization
[Figure: three producer/consumer timelines, each with Store *q in the producer, Load *p in the consumer, and memory between them. Figure labels: Signal(), Wait(), (stall)]
• Speculation
• Hardware-inserted Synchronization [HPCA’02]
• Compiler-inserted Synchronization [CGO’04]

- 14 - Issues in Synchronizing Memory-Resident Values
Static analysis
• Which instructions to synchronize?
• Inter-procedural dependences
Runtime
• Detecting and recovering from improper synchronization
[Figure: producer Store *q and consumer Load *p over time]

- 15 - Outline
→ Static analysis
• Runtime checks
• Results
• Conclusions

- 16 - Compiler Passes
[Figure: foo.c enters the Front End and foo.exe leaves the Back End]
Passes shown:
• Insert Synchronization
• Profile Data Dependences
• Create Threads
• Schedule Instructions
• Decide what to Synchronize

- 17 - Example
    do {
        push(&set, element);
        work();
    } while (test);
Callees: work() and push(head, entry)

- 18 - Example
    do {
        push(&set, element);
        work();
    } while (test);

    work() {
        if (condition(&set))
            push(&set, element);
    }
Callee: push(head, entry)

- 19 - Example
    do {
        push(&set, element);
        work();
    } while (test);

    work() {
        if (condition(&set))
            push(&set, element);
    }

    push(head, entry) {
        entry->next = *head;
        *head = entry;
    }
Memory accesses, named by call path from the loop:
• Load *head (push), Store *head (push): the direct call to push
• Load *head (work, push), Store *head (work, push): the call to push made from work()

- 20 - Compiler Passes
[Figure: foo.c enters the Front End and foo.exe leaves the Back End]
Passes shown:
• Insert Synchronization
• Profile Data Dependences
• Thread Creation
• Instruction Scheduling
• Decide what to Synchronize

- 21 - Example
    do {
        push(&set, element);
        work();
    } while (test);

    work() {
        if (condition(&set))
            push(&set, element);
    }

    push(head, entry) {
        entry->next = *head;
        *head = entry;
    }

Profile Information
========================================================
Source                      Destination                 Frequency
Store *head (push)          Load *head (push)           990
Store *head (push)          Load *head (work, push)     10
Store *head (work, push)    Load *head (push)           10

- 23 - Dependence Graph
[Figure: graph with nodes Load *head (push), Store *head (push), Load *head (work, push), Store *head (work, push); edge weights shown: 990 and 10]
• Infrequent dependences: occur in less than 5% of iterations
→ Pairs that need to be synchronized can be extracted from the dependence graph
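A minimal C sketch of extracting the pairs to synchronize from the profiled edges, using the slide's 5% cutoff. The `dep_edge` structure and function names are hypothetical, and the total iteration count is an input the slides do not state; the usage below assumes 1000 iterations purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One profiled dependence edge: a store, the load in a later thread that
 * consumed its value, and how many iterations showed the dependence. */
struct dep_edge {
    const char *store;
    const char *load;
    unsigned    count;
};

/* A dependence is frequent when it occurs in at least threshold_pct
 * percent of iterations. Integer arithmetic avoids rounding issues
 * exactly at the threshold. */
bool is_frequent(const struct dep_edge *e, unsigned iterations,
                 unsigned threshold_pct)
{
    assert(iterations > 0);
    return (unsigned long long)e->count * 100u >=
           (unsigned long long)threshold_pct * iterations;
}

/* Copies pointers to the frequent edges into out (capacity >= n) and
 * returns how many were selected for synchronization. */
size_t select_sync_pairs(const struct dep_edge *edges, size_t n,
                         unsigned iterations, unsigned threshold_pct,
                         const struct dep_edge **out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_frequent(&edges[i], iterations, threshold_pct))
            out[k++] = &edges[i];
    }
    return k;
}
```

Under the 1000-iteration assumption, only the edge with frequency 990 (Store *head (push) to Load *head (push)) is selected; the two edges with frequency 10 fall below 5% and are left to speculation.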

- 25 - Example
Synchronize these: Store *head (push) → Load *head (push), frequency 990.
    do {
        push(&set, element);
        work();
    } while (test);

    work() {
        if (condition(&set))
            push(&set, element);
    }

    push(head, entry) {
        entry->next = *head;
        *head = entry;
    }

    push_clone(head, entry) {
        wait();
        entry->next = *head;
        *head = entry;
        signal(head, *head);
    }
The direct call in the loop becomes: push_clone(&set, element);
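The cloning step can be tried out with a small single-threaded C sketch. The names `tls_wait` and `tls_signal` are hypothetical stand-ins for the slide's wait() and signal() (renamed to avoid clashing with the C library's `signal`), and here they only record what would be forwarded rather than communicating between threads.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

/* Hypothetical stand-ins for the TLS primitives: in this single-threaded
 * sketch they only record what would be forwarded to the next thread. */
struct node **forwarded_addr;   /* address sent by the last signal */
struct node  *forwarded_value;  /* value sent by the last signal   */
unsigned      wait_count;       /* number of waits executed        */

void tls_wait(void) { wait_count++; }

void tls_signal(struct node **addr, struct node *value)
{
    forwarded_addr  = addr;
    forwarded_value = value;
}

/* Original push: still used on call paths whose dependences are
 * infrequent (for example via work()), so those stay speculative. */
void push(struct node **head, struct node *entry)
{
    entry->next = *head;
    *head = entry;
}

/* Clone used only at the call site with the frequent dependence. */
void push_clone(struct node **head, struct node *entry)
{
    assert(head != NULL && entry != NULL);
    tls_wait();                 /* wait for the previous thread's store */
    entry->next = *head;
    *head = entry;
    tls_signal(head, *head);    /* forward the address and the new value */
}
```

Keeping the original push alongside the clone means the call from work() pays no synchronization cost; only the call path that the profile marked as frequent is affected.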

- 26 - Outline
• Static analysis
→ Runtime checks
• Results
• Conclusions

- 27 - Runtime Checks
[Figure: producer executes Store *q, then Signal(q, *q); consumer executes Load *p]
Conditions to check:
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
→ Producer forwards the address to ensure a match between the load and the store

- 28 - Ensuring Correctness
[Figure: producer executes Store *q followed by Store *x; consumer executes Load *p]
Conditions to check:
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
→ Hardware support
• Similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

- 29 - Ensuring Correctness
[Figure: producer executes Store *q; consumer executes Store *y and Load *p]
Conditions to check:
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
→ Hardware support: TLS hardware already knows which locations are stored to

- 30 - Outline
• Static analysis
• Runtime checks
→ Results
• Conclusions

- 31 - Experimental Framework
Underlying architecture
• 4-processor, single-chip multiprocessor
• speculation supported through coherence
Simulator
• superscalar, similar to MIPS R14K
• 10-cycle communication latency
• models all bandwidth and contention
Benchmarks
• SPECint95 and SPECint2000, -O3 optimization
• detailed simulation
[Figure: processors (P) and caches (C) connected by a crossbar]

- 32 - Parallel Region Coverage
[Figure: coverage (0 to 100) per benchmark]
Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp
→ Coverage is significant
• Average coverage: 54%

- 33 - Compiler-Inserted Synchronization
[Figure: normalized regional execution time (0 to 100) per benchmark, bars U and C, broken down into Failed Speculation, Synchronization Stall, Other, and Busy]
U = No synchronization inserted; C = Compiler-inserted synchronization
Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp
Speedup labels on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%
→ Seven benchmarks speed up by 5% to 46%

- 34 - Compiler- vs. Hardware-Inserted Synchronization
[Figure: normalized regional execution time (0 to 100) per benchmark, bars C and H, broken down into Failed Speculation, Synchronization Stall, Other, and Busy; benchmarks are grouped under "Hardware does better" and "Compiler does better"]
C = Compiler-inserted synchronization; H = Hardware-inserted synchronization
Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp
→ Compiler and hardware [HPCA’02] each benefit different benchmarks

- 35 - Combining Hardware and Compiler Synchronization
[Figure: normalized regional execution time (0 to 100) per benchmark, bars C, H, and B, broken down into Failed Speculation, Synchronization Stall, Other, and Busy]
C = Compiler-inserted synchronization; H = Hardware-inserted synchronization; B = Both combined
Benchmarks: go, m88ksim, gzip_comp, gzip_decomp, perlbmk, gap
→ The combination is more robust than either technique individually

- 36 - Related Work
[Figure: classification of related work]
• Compiler-inserted: Zhai et al. CGO’04; Cytron ICPP’86
• Hardware-inserted: Moshovos et al. ISCA’97; Cintra & Torrellas HPCA’02; Steffan et al. HPCA’02
• Other labels in the figure: Centralized Table, Distributed Table, Tsai & Yew PACT’96

- 37 - Conclusions
Compiler-inserted synchronization for memory-resident value communication:
• Effective in reducing speculation failure
  • Half of the benchmarks speed up by 5% to 46% (regional)
• Combining hardware and compiler techniques is more robust
  • Neither consistently outperforms the other
  • Can be combined to track the best performer
→ Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware

- 38 - Questions?

- 39 - The Potential of Instruction Scheduling
[Figure: per-benchmark execution time (0 to 100), bars E, C, and L, broken down into Failed Speculation, Synchronization Stall, Other, and Busy]
E = Early; C = Compiler-inserted synchronization; L = Late
Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, mcf, crafty, parser, perlbmk, gap, gzip_comp, gcc, bzip2_comp
→ Scheduling instructions has additional benefit for some benchmarks

- 40 - Program Performance
[Figure: per-benchmark execution time (0 to 100), bars U, C, H, and B, broken down into Failed Speculation, Synchronization Stall, Other, and Busy]
U = Un-optimized; C = Compiler-inserted synchronization; H = Hardware-inserted synchronization; B = Both compiler and hardware
Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf, gzip_comp

- 41 - Which Technique Synchronizes This Load?
[Figure: per-benchmark breakdown (0 to 100), bars U, C, H, and B, split into loads synchronized by neither technique, by the compiler, by the hardware, and by both]
U = Un-optimized; C = Compiler-inserted synchronization; H = Hardware-inserted synchronization; B = Both compiler and hardware
Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf, gzip_comp

- 42 - Ensuring Correctness
[Figure: producer executes Store *q followed by Store *x; consumer executes Load *p]
Conditions to check:
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
→ Hardware support
• Similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

- 43 - Ensuring Correctness
[Figure: producer executes Store *q, Signal(q), Signal(*q), and Store *x; consumer executes Load *p]
Conditions to check:
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
→ Hardware support
Use the forwarded value only if the synchronized pair is dependent:
• Local store to *p? If NO, check q == p
• q == p? If YES, use the forwarded value; if NO, use the memory value
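One reading of this decision, sketched in C. The `forwarded` structure, the `synchronized_load` function, and the `local_store_to_p` flag are hypothetical names for illustration; in the talk these checks are performed with hardware support rather than in software. The sketch assumes that when the consumer has already stored to *p locally, its own memory value takes precedence over the forwarded one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the producer forwarded with its signal: the address q and the
 * value stored there at the time of the signal. */
struct forwarded {
    const int *addr;
    int        value;
};

/* Consumer-side load of *p. The forwarded value is used only when the
 * synchronized pair is actually dependent (q == p) and the consumer has
 * not stored to *p itself; otherwise the ordinary memory value is used. */
int synchronized_load(const int *p, const struct forwarded *fwd,
                      bool local_store_to_p)
{
    assert(p != NULL && fwd != NULL);
    if (!local_store_to_p && fwd->addr == p)
        return fwd->value;   /* addresses match: take the forwarded value */
    return *p;               /* mismatch or local store: read memory      */
}
```

Falling back to the memory value when the addresses differ is what makes an unnecessary synchronization harmless: the load then behaves like an ordinary speculative load.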

- 44 - Issues in Synchronizing Memory-Resident Values
• Inserting synchronization using compilers
• Ensuring correctness
→ Reducing synchronization cost
[Figure: producer Store *q and consumer Load *p]

- 45 - Reducing Cost of Synchronization
[Figure: producer and consumer timelines Before Instruction Scheduling and After Instruction Scheduling]
→ Instruction scheduling algorithms are described in [ASPLOS’02]

- 46 - The Potential of Instruction Scheduling
[Figure: normalized regional execution time (0 to 100) per benchmark, bars E, C, and L, broken down into Failed Speculation, Synchronization Stall, Other, and Busy]
E = Perfectly predicting synchronized memory-resident values
C = Compiler-inserted synchronization
L = Consumer stalls until previous thread commits
Benchmarks: m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gap
→ Scheduling instructions could offer additional benefit

- 47 - Using More Accurate Profiling Information
[Figure: normalized regional execution time (0 to 100) for gzip_comp, bars C, R, and U, broken down into Failed Speculation, Synchronization Stall, Other, and Busy]
U = No Instruction Scheduling
C = Compiler-inserted synchronization
R = Compiler-inserted synchronization (profiled with the ref input set)
→ gzip_comp is the only benchmark sensitive to profiling input
