
1 Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors
Marcelo Cintra, University of Edinburgh (http://www.inf.ed.ac.uk/home/mc)
Diego R. Llanos, Universidad de Valladolid (http://www.infor.uva.es/~diego)

2 Speculative parallelization on SMP

for(i=0; i<100; i++) {
    ... = A[L[i]];
    A[K[i]] = ...
}

- Assume there are no dependences and execute the iterations in parallel. In the slide's example, iteration J loads A[4] and stores A[5], iteration J+1 loads A[2] and stores A[2], and iteration J+2 loads A[5] and stores A[6]; the store to A[5] in iteration J and the load of A[5] in iteration J+2 form a cross-iteration RAW dependence.
- Accesses to shared data must be tracked at runtime to detect read-after-write (RAW) violations.
- If a violation is detected, the offending threads are squashed.
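To make the pattern concrete, here is a minimal C sketch of such a non-analyzable loop. Only the A[L[i]]/A[K[i]] access pattern comes from the slide; the loop body, sizes, and initialization are invented for illustration.

```c
/* Minimal sketch of the slide's non-analyzable loop. */
#include <stdio.h>

#define N 100

int A[N], L[N], K[N];

int main(void) {
    /* Whether iteration i depends on iteration j < i hinges on the
     * runtime contents of L and K: if K[j] == L[i], iteration i reads
     * a value written by iteration j (a cross-iteration RAW). The
     * compiler cannot prove this never happens, so it cannot
     * parallelize the loop; speculative parallelization runs the
     * iterations in parallel anyway and checks at runtime. */
    for (int i = 0; i < N; i++) {
        int tmp = A[L[i]];      /* exposed load               */
        A[K[i]] = tmp + 1;      /* possibly conflicting store */
    }
    printf("A[0] = %d\n", A[0]);
    return 0;
}
```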

3 Hardware vs. Software schemes
- Hardware schemes:
  + high performance
  - require changes to the processor, the caches, and the coherence controller
- Software schemes:
  + no hardware changes
  - poorer performance, caused by:
    - software management overhead
    - suboptimal scheduling
    - contention due to the need for synchronization

4 Wish List
- To reduce software overhead: use efficient speculative data structures and optimized operations.
- To schedule efficiently: minimize memory overhead while maximizing tolerance to load imbalance and violations.
- To reduce contention: avoid synchronization as much as possible.
- To avoid performance degradation under dependences: use a squash contention mechanism.

5 Outline
- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions

6 Speculative Access Structures
- Each thread works on its own version of the shared data structure: the slide's figure shows the shared array A[0..n] plus version copies for Thread A (iteration J) and Thread B (iteration J+1).
- A speculative access structure holds the state (NA, M, EL, ELM) of each version of each element.

7 Speculative Access Structure I: Simple Array
- An array of access states mapped directly onto a shadow copy of the user data array.
- States: NA (not accessed), EL (exposed loaded), M (modified), ELM (exposed loaded and modified).
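A minimal sketch of how the simple-array structure might be declared in C; the type and field names are illustrative assumptions, not taken from the paper.

```c
#include <stddef.h>

/* The four access states named on the slide. */
typedef enum {
    NA,    /* not accessed                */
    EL,    /* exposed loaded              */
    M,     /* modified                    */
    ELM    /* exposed loaded and modified */
} acc_state_t;

/* Simple-array structure: one state entry per element, mapped 1:1
 * onto a shadow (version) copy of the user array. */
typedef struct {
    acc_state_t *state;    /* access array, length n             */
    int         *version;  /* shadow copy of the user data array */
    size_t       n;
} simple_array_t;

/* O(1) lookup on a speculative load; commit, by contrast, must scan
 * all n entries of `state` to find the touched elements. */
static inline int lookup_on_load(simple_array_t *s, size_t i) {
    if (s->state[i] == NA)
        s->state[i] = EL;   /* first access is an exposed load */
    return s->version[i];
}
```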

8 Speculative Access Structure I: Simple Array (cont.)
- Cheap to look up on speculative memory operations: a load such as ... = A[2] directly indexes entry 2 of the access array and marks it EL.
- Expensive to search on commit: the entire access array must be scanned to find the touched elements.

9 Speculative Access Structure II: Indirection Array
- An array of indices recording which elements of the shadow data array were touched.

10 Speculative Access Structure II: Indirection Array (cont.)
- Cheap to look up on speculative memory operations: a load such as ... = A[2] marks entry 2 of the access array EL and appends index 2 to the indirection array.
- Cheap to search on commit: only the entries listed in the indirection array need to be scanned.
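A hedged sketch of the indirection-array variant; function and field names are assumptions, and the version-retrieval part of a load is elided.

```c
#include <stddef.h>

typedef enum { NA, EL, M, ELM } acc_state_t;   /* states from slide 7 */

/* Indirection-array variant: in addition to the per-element states,
 * keep a compact list of the indices that were actually touched. */
typedef struct {
    acc_state_t *state;     /* per-element access states            */
    int         *version;   /* shadow copy of the user data array   */
    size_t      *touched;   /* indirection array of touched indices */
    size_t       ntouched;  /* entries used in `touched`            */
} indir_array_t;

int spec_load(indir_array_t *s, size_t i) {
    if (s->state[i] == NA) {            /* first access to element i  */
        s->state[i] = EL;               /* mark it exposed loaded     */
        s->touched[s->ntouched++] = i;  /* remember it for the commit */
    }
    /* Retrieval of the most up-to-date version (slide 13) elided. */
    return s->version[i];
}

/* On commit, only touched[0 .. ntouched-1] is scanned, instead of the
 * whole access array as in the simple-array scheme. */
```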

11 Scheduling Threads
- Static: assign a chunk of N/P iterations to each processor.
  + Only P active threads, hence little memory overhead.
  - Poor tolerance to load imbalance and dependence violations.
- Dynamic: assign each of the N iterations dynamically.
  - N active threads, hence bigger memory structures.
  + Better tolerance to load imbalance and dependence violations.
- Our solution: a software version of an aggressive sliding window mechanism (Cintra, Martinez, and Torrellas; ISCA 2000).

12 Sliding window
- Schedule a window of W iterations at a time.
- Iterations inside the window are assigned to threads dynamically.
- When the non-speculative (oldest) thread finishes, the window is advanced.
- W trades off load balancing against the size of the version structures.
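A sketch of the window logic under simplifying assumptions: synchronization is elided and all names are invented. Per slide 25, W = 2*P to 4*P gave the best performance.

```c
#include <stdbool.h>

#define MAX_ITERS 1024
#define W 8                            /* window size, e.g. 2*P..4*P */

static long window_base = 0;           /* oldest (non-spec) iteration */
static long next_iter   = 0;           /* next unassigned iteration   */
static bool committed[MAX_ITERS];      /* iterations already committed */

/* An idle thread calls this to grab work; -1 means either the loop is
 * finished or the window is currently full of in-flight iterations. */
long grab_iteration(long n_iters) {
    if (next_iter >= n_iters) return -1;          /* loop finished  */
    if (next_iter >= window_base + W) return -1;  /* window is full */
    return next_iter++;
}

/* Called when the non-speculative thread finishes: slide the window
 * past every contiguously committed iteration. */
void advance_window(void) {
    while (window_base < MAX_ITERS && committed[window_base])
        window_base++;
}
```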

13 Memory operations
- Load operation (... = A[K[i]]):
  L1: Update the state of the element to EL.
  L2: Scan the predecessors' access arrays backwards for a version.
  L3: Obtain the most up-to-date version.
- Store operation (A[K[i]] = ...):
  S1: Perform the store of the new version.
  S2: Update the state of the element to M or ELM.
  S3: Scan the successors' access arrays forwards for violations.
- Correctness is guaranteed if these steps are globally performed in program order.
- But program order may not be respected, because of:
  - compiler reordering
  - relaxed memory consistency models
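The steps can be sketched as follows, assuming threads are ordered by iteration and the EL/M states are kept as bit flags (both bits set meaning ELM); all names are illustrative, and a small refinement is added at L1 so a thread's load after its own store is not counted as exposed.

```c
#include <stddef.h>

enum { EL_BIT = 1, MOD_BIT = 2 };   /* EL / M flags; both set = ELM */

#define T 8      /* number of threads  */
#define N 1024   /* number of elements */

unsigned char state[T][N];   /* per-thread access arrays  */
int version[T][N];           /* per-thread version copies */
int shared[N];               /* committed user array      */

void squash_from(int s);     /* squash thread s and its successors (not shown) */

int spec_load_protocol(int t, size_t i) {
    if (!(state[t][i] & MOD_BIT))     /* L1: mark exposed loaded unless */
        state[t][i] |= EL_BIT;        /* this thread already stored it  */
    for (int p = t; p >= 0; p--)      /* L2: scan self + predecessors   */
        if (state[p][i] & MOD_BIT)
            return version[p][i];     /* L3: most up-to-date version    */
    return shared[i];                 /* no spec version: committed one */
}

void spec_store_protocol(int t, size_t i, int v) {
    version[t][i] = v;                /* S1: store the new version      */
    state[t][i] |= MOD_BIT;           /* S2: mark M (or ELM)            */
    for (int s = t + 1; s < T; s++)   /* S3: scan successors            */
        if (state[s][i] & EL_BIT)
            squash_from(s);           /* RAW violation detected         */
}
```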

14 Race Conditions
- Certain interleavings of these operations lead to incorrect execution. Consider a thread executing iteration J performing a store (S1-S3) while a thread executing the later iteration J+K performs a load of the same element (L1-L3).
- Incorrect value: if the load's backward scan and version read (L2, L3) interleave with the store so that the state update (S2) becomes visible before the stored value itself (S1), the consumer can obtain an incorrect value.
- Violation not detected: if the load marks the element EL (L1) only after the store's forward scan (S3) has already completed, the RAW violation goes undetected.

15 Conservative Solution
- Enclose the operations in critical sections:

Load operation:
  # lock A
  L1: Update the state of the element to EL
  L2: Scan the access array backwards for a version
  L3: Obtain the most up-to-date version
  # unlock A

Store operation:
  # lock A
  S1: Perform the store of the new version
  S2: Update the state of the element to M or ELM
  S3: Scan the access array forwards for violations
  # unlock A

- Drawback: contention.
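A sketch of this conservative variant, with a single POSIX mutex standing in for the slide's lock A and reusing the helpers from the slide-13 sketch. A real implementation would likely lock at a finer grain, but the contention problem the slide warns about remains.

```c
#include <stddef.h>
#include <pthread.h>

/* Helpers from the slide-13 sketch. */
int  spec_load_protocol(int t, size_t i);
void spec_store_protocol(int t, size_t i, int v);

static pthread_mutex_t spec_lock = PTHREAD_MUTEX_INITIALIZER;

int spec_load_locked(int t, size_t i) {
    pthread_mutex_lock(&spec_lock);      /* # lock A   */
    int v = spec_load_protocol(t, i);    /* L1, L2, L3 */
    pthread_mutex_unlock(&spec_lock);    /* # unlock A */
    return v;
}

void spec_store_locked(int t, size_t i, int v) {
    pthread_mutex_lock(&spec_lock);      /* # lock A   */
    spec_store_protocol(t, i, v);        /* S1, S2, S3 */
    pthread_mutex_unlock(&spec_lock);    /* # unlock A */
}
```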

16 Our Solution: Memory Fences

Load operation:
  L1: Update the state of the element to EL
  # memory fence
  L2: Scan the access array backwards for a version
  L3: Obtain the most up-to-date version

Store operation:
  S1: Perform the store of the new version
  # memory fence
  S2: Update the state of the element to M or ELM
  # memory fence
  S3: Scan the access array forwards for violations

- All pending memory operations must complete before execution proceeds past a memory fence.
- This is the minimum set of memory fences needed.
- Critical sections are still necessary to protect the structures on thread starts, commits, and squashes.
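A sketch of the fence placement, using C11 fences as a stand-in for the membar instructions of the paper's UltraSPARC-III platform; declarations follow the slide-13 sketch. In strict C11 the shared accesses would additionally need to be atomic, which is elided here for clarity.

```c
#include <stddef.h>
#include <stdatomic.h>

enum { EL_BIT = 1, MOD_BIT = 2 };
#define T 8
#define N 1024
extern unsigned char state[T][N];
extern int version[T][N], shared[N];
extern void squash_from(int s);

int spec_load_fenced(int t, size_t i) {
    state[t][i] |= EL_BIT;                      /* L1                  */
    atomic_thread_fence(memory_order_seq_cst);  /* fence before L2     */
    for (int p = t - 1; p >= 0; p--)            /* L2: scan backwards  */
        if (state[p][i] & MOD_BIT)
            return version[p][i];               /* L3: newest version  */
    return shared[i];
}

void spec_store_fenced(int t, size_t i, int v) {
    version[t][i] = v;                          /* S1                  */
    atomic_thread_fence(memory_order_seq_cst);  /* fence before S2     */
    state[t][i] |= MOD_BIT;                     /* S2                  */
    atomic_thread_fence(memory_order_seq_cst);  /* fence before S3     */
    for (int s = t + 1; s < T; s++)             /* S3: scan forwards   */
        if (state[s][i] & EL_BIT)
            squash_from(s);
}
```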

17 Outline
- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions

18 Evaluation Environment
- Experiments executed on a real machine:
  - Sun Fire 6800 SMP with 24 UltraSPARC-III processors
  - OpenMP 2.0
- Applications with non-analyzable loops:
  - TREE, WUPWISE, MDG: no dependences
  - LUCAS, AP3M: dependences

19 Speedups of Loops: TREE
[Figure: speedup curves] Very close to the “ideal” DOALL speedup.

20 Speedups of Loops: WUPWISE
[Figure: speedup curves] Not as close to the “ideal” DOALL speedup, because of the huge speculative data size.

21 Importance of Indirection Array
[Figure: speedup comparison with and without the indirection array]

22 Cost of Violation Checks
- Systems evaluated:
  - Baseline: our scheme, with violation checks upon stores
  - sys2: same as Baseline, but with violation checks at commit

23 Cost of Violation Checks (cont.)
[Figure: speedup comparison] Checks upon loads and stores are not too expensive, and may outperform checks at commit on sparse access patterns.

24 Effects of Scheduling Schemes
- Systems evaluated:
  - Baseline: sliding window, advanced when the non-speculative thread finishes
  - sys3: sliding window advanced when all threads finish (the solution adopted by Dang et al. [IPDPS 2002])
  - sys4: fully dynamic scheduling with no partial commits (the solution adopted by Rundberg et al. [WSSMM 2000])

25 Effects of Scheduling Schemes (cont.)
[Figure: speedups with P = 4 processors] A fully dynamic schedule is not always feasible; the best performance is obtained for W = 2*P to 4*P.

26 Wish List Revisited
- To reduce software overhead: access and indirection arrays; early violation detection (on stores instead of during commits).
- To schedule efficiently: the aggressive sliding window mechanism.
- To reduce contention: memory fences instead of critical sections.
- To avoid performance degradation: a squash monitor with feedback.

27 Outline
- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions

28 Software-only speculative parallelization schemes
- SW-R-LRPD, Texas A&M University (IPDPS 2002):
  - less aggressive window (advanced when all threads finish)
  - violation checks when threads commit
- Chalmers University (WSSMM 2000):
  - fully dynamic scheme
  - violation checks upon stores
- IBM Research (SC 1998): a series of tests for various specific access behaviors
- TLDS, Carnegie Mellon University (tech. rep., 2001): speculation in a software DSM engine

29 Outline
- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions

30 Conclusions
- Systematic consideration of the design space and of cost/performance issues.
- A new, efficient, and robust software-only speculative parallelization scheme:
  - fine-tuned data structures
  - aggressive sliding window
  - reduced synchronization requirements
  - overhead monitors and feedback
- Very good performance:
  - 7% to 25% faster than previous schemes
  - 71% of the speedup of manual parallelization

31 Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors
Marcelo Cintra, University of Edinburgh (http://www.inf.ed.ac.uk/home/mc)
Diego R. Llanos, Universidad de Valladolid (http://www.infor.uva.es/~diego)

32 Data Structures Implementation
[Figure: user array of elements 0..n, with per-thread access structures (states NA, M, EL) and the corresponding version copies]

33 Squashing Threads
- Violations are detected by looking up the speculative access structures.
- Checking on every store:
  + only the element being accessed is checked
  + earlier violation detection
  ± frequent checks, which need some form of synchronization
- Checking at commit:
  - all elements must be checked
  + faster speculative memory operations

34 Squash contention mechanism
- Goal: avoid performance degradation in the presence of dependences.
- Implemented with commit and squash monitors.
- Once squashes exceed a given threshold, subsequent invocations of the same loop are executed sequentially.
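A minimal sketch of such a monitor; the threshold value and all names are assumptions, not taken from the paper.

```c
#include <stdbool.h>

#define SQUASH_THRESHOLD 4   /* assumed value, not from the paper */

typedef struct {
    long commits;    /* iterations committed in this loop */
    long squashes;   /* iterations squashed in this loop  */
} loop_monitor_t;

/* Runtime hooks, called as threads commit or are squashed. */
void on_commit(loop_monitor_t *m) { m->commits++; }
void on_squash(loop_monitor_t *m) { m->squashes++; }

/* Checked once per invocation of the loop: after the squash count
 * crosses the threshold, fall back to sequential execution. */
bool run_speculatively(const loop_monitor_t *m) {
    return m->squashes < SQUASH_THRESHOLD;
}
```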

35 Importance of Squash Monitors
[Figure: speedup comparison with and without squash monitors]

36 Application Characteristics

Application  Loop(s)                      % of Seq. Time  Spec. Data Size (KB)
TREE         accel_10                     94              < 1
MDG          interf_1000                  86
WUPWISE      muldeo_200’, muldoe_200’     41              12,000
AP3M         Shgravll_700                 78              3,000
LUCAS        mers_mod_square (line 444)   20              4,000

37 Speedups of Loops: MDG
[Figure: speedup curves] Very close to the “ideal” DOALL speedup.

38 Overall Speedups: TREE
[Figure: overall application speedups]

39 Overall Speedups: WUPWISE
[Figure: overall application speedups]

40 Overall Speedups: MDG
[Figure: overall application speedups]

41 Constrained Memory Overheads
[Figure: speedup comparison] Mixed results: either Baseline or sys4 performs best.

42 Related Work
- Hardware-based speculative parallelization schemes:
  - I-ACOMA, University of Illinois
  - HYDRA, Stanford
  - Multiplex, Purdue
  - Multiscalar, Wisconsin
  - Clustered Speculative Multithreading, UPC
  - TLDS, Carnegie Mellon
- Inspector-executor schemes:
  - Leung and Zahorjan (PPoPP 1993)
  - Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)

43 Related Work (cont.)
- Optimistic concurrency control schemes:
  - e.g., Herlihy (ACM TODS 1990); Kung and Robinson (ACM TODS 1981)
  - Only need to enforce that accesses to objects in critical sections do not overlap; no total order is required.
  - Applied to explicitly parallel applications.

