Carnegie Mellon Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory.


1 Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
† Dept. Elec. & Comp. Engineering, University of Toronto

2 Motivation
Chip-level multiprocessing is becoming commonplace
- We need parallel programs
- UltraSPARC IV: 2 UltraSPARC III cores
- IBM Power 4
- Sun MAJC
- SiByte SB-1250
Can multithreaded processors improve the performance of a single application?
[Slide footer: Zhai, Colohan, Steffan and Mowry, Carnegie Mellon, Compiler Optimization of Memory-Resident Value Communication…]

3 Why Is Automatic Parallelization Difficult?
Automatic parallelization today
- Must statically prove threads are independent
- Constructing proofs is difficult due to ambiguous data dependences
  - Complex control flow
  - Pointers and indirect references
  - Runtime inputs
Optimistic compiler?
- Limited only by true dependences
- One solution: Thread-Level Speculation

4 Example
    while (...) {
        …
        x = hash[index1];
        …
        hash[index2] = y;
        ...
    }
Threads run speculatively on Processors 1 to 4; each thread ends with check_dep():
- Thread 1: … = hash[3]; hash[10] = ...
- Thread 2: … = hash[19]; hash[21] = ...
- Thread 3: … = hash[33]; hash[30] = ...
- Thread 4: … = hash[10]; hash[25] = ...
- Thread 4 (Retry): … = hash[10]; hash[25] = ...
- Thread 5: … = hash[31]; hash[12] = ...
- Thread 6: … = hash[9]; hash[44] = ...
- Thread 7: … = hash[27]; hash[32] = ...
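The retry of Thread 4 in the example above can be illustrated with a small C sketch. It is a simplified model, not the hardware mechanism from the talk: the names `struct epoch` and `check_dep` (as a function with this signature) are assumptions made for illustration, and the check is conservative in that it flags any load whose address is stored to by a logically earlier thread in the same window, regardless of timing.

```c
#include <assert.h>
#include <stddef.h>

/* One speculative thread (one loop iteration): it loads hash[load_idx]
   and later stores hash[store_idx]. Hypothetical type for illustration. */
struct epoch {
    int load_idx;
    int store_idx;
};

/* Returns 1 if thread `t` loaded a location that a logically earlier
   thread in the same window stores to, 0 otherwise. */
int check_dep(const struct epoch *window, size_t t) {
    assert(window != NULL);
    for (size_t earlier = 0; earlier < t; earlier++) {
        if (window[earlier].store_idx == window[t].load_idx)
            return 1;  /* dependence violated: thread t must retry */
    }
    return 0;
}
```

With the slide's first four threads, Thread 1 stores hash[10] and Thread 4 loads hash[10], so only Thread 4 is flagged. When Thread 4 is retried alongside Threads 5 to 7, no thread is flagged.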

5 Frequently Dependent Scalars
- Can identify scalars that always cause dependences
[Timing diagram: Producer and Consumer threads, each executing "… = a" and "a = …"]

6 Frequently Dependent Scalars
- Dependent scalars should be synchronized [ASPLOS’02]
[Timing diagram: Producer and Consumer threads, each executing "… = a" and "a = …", with Signal(a) and Wait(a)]

7 Frequently Dependent Scalars
- Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]
[Timing diagram: Producer and Consumer threads, each executing "… = a" and "a = …"]

8 Communicating Memory-Resident Values
- Synchronize?
- Speculate?
- Will speculation succeed?
[Timing diagram: Producer and Consumer threads, each executing Load *p and Store *q]

9 Speculation vs. Synchronization
- Speculation succeeds: efficient
[Timing diagram: Sequential Execution vs. Speculative Parallel Execution of Load *p and Store *q]

10 Speculation vs. Synchronization
- Speculation fails: inefficient
[Timing diagram: Sequential Execution vs. Speculative Parallel Execution of Load *p and Store *q, with a violation marked]

11 Speculation vs. Synchronization
- Frequent dependences: Synchronize
- Infrequent dependences: Speculate
[Timing diagram: Sequential Execution vs. Speculative Parallel Execution of Load *p and Store *q]

12 Performance Potential
Detailed simulation:
- TLS support
- 4-processor CMP
- 4-way issue, out-of-order superscalar
- 10-cycle communication latency
Reducing failed speculation improves performance
[Chart: Norm. Regional Exec. Time (0 to 100), Original vs. Perfect memory value prediction, for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, go]

13 Hardware vs. Compiler Inserted Synchronization
- Speculation
- Hardware-inserted Synchronization [HPCA’02]
- Compiler-inserted Synchronization [CGO’04]
[Timing diagrams, one per case: Producer executes Store *q, Consumer executes Load *p, both through Memory; labels shown: Signal(), Wait(), (stall)]

14 Issues in Synchronizing Memory-Resident Values
Static analysis
- Which instructions to synchronize?
- Inter-procedural dependences
Runtime
- Detecting and recovering from improper synchronization
[Timing diagram: Producer executes Store *q, Consumer executes Load *p]

15 Outline
- Static analysis
- Runtime checks
- Results
- Conclusions

16 Compiler Passes
[Diagram: foo.c enters the Front End, foo.exe leaves the Back End; passes shown:]
- Create Threads
- Profile Data Dependences
- Decide what to Synchronize
- Insert Synchronization
- Schedule Instructions

17 Example
    work()

    push (head, entry)

    do {
        push (&set, element);
        work();
    } while (test);

18 Example
    work() {
        if (condition(&set))
            push (&set, element);
    }

    push (head, entry)

    do {
        push (&set, element);
        work();
    } while (test);

19 Example
    work() {
        if (condition(&set))
            push (&set, element);
    }

    push(head,entry) {
        entry->next = *head;
        *head = entry;
    }

    do {
        push (&set, element);
        work();
    } while (test);
Accesses to *head, labeled by call path:
- Load *head (push)
- Store *head (push)
- Load *head (work, push)
- Store *head (work, push)

20 Compiler Passes
[Diagram: foo.c enters the Front End, foo.exe leaves the Back End; passes shown:]
- Thread Creating
- Profile Data Dependences
- Decide what to Synchronize
- Insert Synchronization
- Instruction Scheduling

21 Example
    work() {
        if (condition(&set))
            push (&set, element);
    }

    push(head,entry) {
        entry->next = *head;
        *head = entry;
    }

    do {
        push (&set, element);
        work();
    } while (test);
Accesses to *head, labeled by call path: Load *head (push), Store *head (push), Load *head (work, push), Store *head (work, push)

Profile Information
========================================================
Source                     Destination                Frequency
Store *head (push)         Load *head (push)          990
Store *head (push)         Load *head (work, push)    10
Store *head (work, push)   Load *head (push)          10

22 Compiler Passes
[Diagram: foo.c enters the Front End, foo.exe leaves the Back End; passes shown:]
- Thread Creating
- Profile Data Dependences
- Decide what to Synchronize
- Insert Synchronization
- Instruction Scheduling

23 Dependence Graph
- Nodes: Load *head (push), Store *head (push), Load *head (work, push), Store *head (work, push)
- Edge weights shown: 990, 10
- Infrequent dependences: occur in less than 5% of iterations
- Pairs that need to be synchronized can be extracted from the dependence graph
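As a rough illustration of extracting the pairs to synchronize, the following C sketch filters profiled dependence edges with the 5% rule stated on the slide. The type `struct dep_edge`, the function `select_frequent`, and the iteration count passed by the caller are assumptions introduced here; the talk does not give the total number of profiled iterations or the compiler's actual data structures.

```c
#include <assert.h>
#include <stddef.h>

/* One profiled store-to-load dependence edge (hypothetical layout). */
struct dep_edge {
    const char *store;  /* source access, e.g. "Store *head (push)" */
    const char *load;   /* destination access */
    unsigned count;     /* iterations in which the dependence occurred */
};

/* Copies pointers to the edges that occur in at least threshold_pct
   percent of `iterations` into `out` and returns how many were kept.
   `out` must have room for n pointers. */
size_t select_frequent(const struct dep_edge *edges, size_t n,
                       unsigned iterations, unsigned threshold_pct,
                       const struct dep_edge **out) {
    size_t kept = 0;
    assert(iterations > 0);
    for (size_t i = 0; i < n; i++) {
        /* count / iterations >= threshold_pct / 100, without floats */
        unsigned long long lhs = (unsigned long long)edges[i].count * 100u;
        unsigned long long rhs = (unsigned long long)threshold_pct * iterations;
        if (lhs >= rhs)
            out[kept++] = &edges[i];
    }
    return kept;
}
```

If the three profiled edges (990, 10, 10) are assumed to come from 1000 iterations, only the Store *head (push) to Load *head (push) edge passes the 5% threshold; the two edges with frequency 10 are left to speculation.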

24 Compiler Passes
[Diagram: foo.c enters the Front End, foo.exe leaves the Back End; passes shown:]
- Thread Creating
- Profile Data Dependences
- Decide what to Synchronize
- Insert Synchronization
- Instruction Scheduling

25 Example
    work() {
        if (condition(&set))
            push (&set, element);
    }

    push(head,entry) {
        entry->next = *head;
        *head = entry;
    }

    do {
        push (&set, element);
        work();
    } while (test);
Synchronize these: Store *head (push) and Load *head (push), frequency 990
    push_clone(head,entry) {
        wait();
        entry->next = *head;
        *head = entry;
        signal(head, *head);
    }
Call to the synchronized clone:
    push_clone(&set, element);
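To make the cloned procedure concrete, here is a single-threaded C sketch of push_clone. The names `tls_wait`, `tls_signal`, `struct channel`, `chan`, and `channel_init` are stand-ins invented for this sketch (the slide's `wait()` and `signal()` are TLS primitives whose real implementation is not shown in the talk); in particular, `tls_wait` only checks that a signal has arrived instead of stalling.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Hypothetical forwarding channel from one thread to its successor. */
struct channel {
    struct node **addr;  /* forwarded address (q) */
    struct node *value;  /* forwarded value (*q) */
    int ready;           /* nonzero once the predecessor has signalled */
};

struct channel chan;

/* The first thread has no predecessor, so its wait must not block. */
void channel_init(void) {
    chan.addr = NULL;
    chan.value = NULL;
    chan.ready = 1;
}

/* Stand-in for wait(): a real consumer would stall here. */
void tls_wait(void) {
    assert(chan.ready);
    chan.ready = 0;
}

/* Stand-in for signal(head, *head): forward both address and value. */
void tls_signal(struct node **addr, struct node *value) {
    chan.addr = addr;
    chan.value = value;
    chan.ready = 1;
}

/* Synchronized clone of push(), following the slide's structure. */
void push_clone(struct node **head, struct node *entry) {
    tls_wait();
    entry->next = *head;
    *head = entry;
    tls_signal(head, *head);
}
```

Forwarding the address together with the value is what later allows the consumer to check that its load really refers to the location the producer stored to.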

26 Outline
- Static analysis
- Runtime checks
- Results
- Conclusions

27 Runtime Checks
Conditions to check:
- Store *q and Load *p access the same memory address
- No store modifies the forwarded address between Store *q and Load *p
    Signal(q, *q);
- Producer forwards the address to ensure a match between the load and the store
[Timing diagram: Producer executes Store *q, Consumer executes Load *p]

28 Ensuring Correctness
Conditions to check:
- Store *q and Load *p access the same memory address
- No store modifies the forwarded address between Store *q and Load *p
Hardware support
- Similar to memory conflict buffer [Gallagher et al, ASPLOS’94]
[Timing diagram: Producer executes Store *q, Consumer executes Load *p, with an intervening Store *x]

29 Ensuring Correctness
Conditions to check:
- Store *q and Load *p access the same memory address
- No store modifies the forwarded address between Store *q and Load *p
Hardware support: TLS hardware already knows which locations are stored to
[Timing diagram: Producer executes Store *q, Consumer executes Load *p, with an intervening Store *y]

30 Outline
- Static analysis
- Runtime checks
- Results
- Conclusions

31 Experimental Framework
Underlying architecture
- 4-processor, single-chip multiprocessor
- speculation supported through coherence
Simulator
- superscalar, similar to MIPS R14K
- 10-cycle communication latency
- models all bandwidth and contention
Benchmarks
- SPECint95 and SPECint2000, -O3 optimization
- detailed simulation
[Diagram: blocks labeled P and C connected to a Crossbar]

32 Parallel Region Coverage
- Coverage is significant
- Average coverage: 54%
[Chart: coverage (0 to 100) for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]

33 Compiler-Inserted Synchronization
- Seven benchmarks speed up by 5% to 46%
- Speedups labeled on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%
Legend: U = No synchronization inserted, C = Compiler-Inserted Synchronization
Bar segments: Failed Speculation, Synchronization Stall, Other, Busy
[Chart: Norm. Regional Exec. Time (0 to 100), U vs. C, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]

34 Compiler- vs. Hardware-Inserted Synchronization
- Compiler and hardware [HPCA’02] each benefit different benchmarks
Legend: C = Compiler-Inserted Synchronization, H = Hardware-Inserted Synchronization
Bar segments: Failed Speculation, Synchronization Stall, Other, Busy
Chart annotations: "Hardware does better", "Compiler does better"
[Chart: Norm. Regional Exec. Time (0 to 100), C vs. H, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]

35 Combining Hardware and Compiler Synchronization
- The combination is more robust than either technique individually
Legend: C = Compiler-inserted synchronization, H = Hardware-inserted synchronization, B = Combining Both
Bar segments: Failed Speculation, Synchronization Stall, Other, Busy
[Chart: Norm. Regional Exec. Time (0 to 100), C vs. H vs. B, for go, m88ksim, gzip_comp, gzip_decomp, perlbmk, gap]

36 Related Work
Compiler-inserted
- Cytron, ICPP’86
- Zhai et al., CGO’04
Hardware-inserted
- Moshovos et al., ISCA’97
- Cintra & Torrellas, HPCA’02
- Steffan et al., HPCA’02
Also shown: Tsai & Yew, PACT’96
Column labels in the diagram: Centralized Table, Distributed Table

37 Conclusions
Compiler-inserted synchronization for memory-resident value communication:
- Effective in reducing speculation failure
  - Half of the benchmarks speed up by 5% to 46% (regional)
- Combining hardware and compiler techniques is more robust
  - Neither consistently outperforms the other
  - Can be combined to track the best performer
Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware

38 Questions?

39 The Potential of Instruction Scheduling
- Scheduling instructions has additional benefit for some benchmarks
Legend: E = Early, C = Compiler-Inserted Synchronization, L = Late
Bar segments: Failed Speculation, Synchronization Stall, Other, Busy
[Chart: E vs. C vs. L (0 to 100) for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, mcf, crafty, parser, perlbmk, gap, gzip_comp, gcc, bzip2_comp]

40 Program Performance
Legend: U = Un-optimized, C = Compiler-Inserted Synchronization, H = Hardware-Inserted Synchronization, B = Both compiler and hardware
Bar segments: Failed Speculation, Synchronization Stall, Other, Busy
[Chart: U vs. C vs. H vs. B (0 to 100) for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf, gzip_comp]

41 Which Technique Synchronizes This Load?
Legend: U = Un-optimized, C = Compiler-Inserted Synchronization, H = Hardware-Inserted Synchronization, B = Both compiler and hardware
Bar segments: Synchronized by neither technique, Synchronized by compiler, Synchronized by hardware, Synchronized by both
[Chart: U vs. C vs. H vs. B (0 to 100) for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf, gzip_comp]

42 Ensuring Correctness
Conditions to check:
- Store *q and Load *p access the same memory address
- No store modifies the forwarded address between Store *q and Load *p
Hardware support
- Similar to memory conflict buffer [Gallagher et al, ASPLOS’94]
[Timing diagram: Producer executes Store *q, Consumer executes Load *p, with an intervening Store *x]

43 Ensuring Correctness
Conditions to check:
- Store *q and Load *p access the same memory address
- No store modifies the forwarded address between Store *q and Load *p
Hardware support
- Use the forwarded value only if the synchronized pair is dependent
Producer: Store *q; Signal(q); Signal(*q)
Consumer: Load *p, with an intervening Store *x
[Flowchart: decisions "Local Store to *p" and "q == p" (branches labeled YES / NO) lead to "Use Forwarded Value" or "Use Memory Value"]
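The consumer-side decision can be sketched in C as follows. This is one plausible reading of the flowchart, since the transcript does not preserve which branch leads where: the forwarded value is used only when the consumer has not itself stored to *p and the forwarded address q equals p; otherwise the load reads memory as usual. The names `struct forwarded` and `synchronized_load` are hypothetical, and in the talk this check is performed by hardware, not by software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the producer sent with Signal(q); Signal(*q). */
struct forwarded {
    const int *addr;  /* q */
    int value;        /* *q at the time of the signal */
};

/* Value observed by the consumer's Load *p.
   local_store_to_p: the consumer thread already stored to *p itself. */
int synchronized_load(const int *p, const struct forwarded *fwd,
                      bool local_store_to_p) {
    assert(p != NULL && fwd != NULL);
    if (!local_store_to_p && fwd->addr == p)
        return fwd->value;  /* the synchronized pair really is dependent */
    return *p;              /* otherwise fall back to the memory value */
}
```

A mismatch between q and p therefore costs nothing in correctness: the load simply behaves like an ordinary speculative load.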

44 Issues in Synchronizing Memory-Resident Values
- Inserting synchronization using compilers
- Ensuring correctness
- Reducing synchronization cost
[Timing diagram: Producer executes Store *q, Consumer executes Load *p]

45 Reducing Cost of Synchronization
- Instruction scheduling algorithms are described in [ASPLOS’02]
[Timing diagrams: Producer and Consumer, Before Instruction Scheduling and After Instruction Scheduling]

46 The Potential of Instruction Scheduling
- Scheduling instructions could offer additional benefit
Legend:
- E = Perfectly predicting synchronized memory-resident values
- C = Compiler-inserted synchronization
- L = Consumer stalls until previous thread commits
Bar segments: Failed Speculation, Synchronization Stall, Other, Busy
[Chart: Norm. Regional Exec. Time (0 to 100), E vs. C vs. L, for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gap]

47 Using More Accurate Profiling Information
- gzip_comp is the only benchmark sensitive to profiling input
Legend:
- U = No Instruction Scheduling
- C = Compiler-Inserted Synchronization
- R = Compiler-Inserted Synchronization (profiled with the ref input set)
Bar segments: Failed Speculation, Synchronization Stall, Other, Busy
[Chart: Norm. Regional Exec. Time (0 to 100), U vs. C vs. R, for gzip_comp]

