Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors

Marcelo Cintra, University of Edinburgh (http://www.inf.ed.ac.uk/home/mc)
Diego R. Llanos, Universidad de Valladolid (http://www.infor.uva.es/~diego)
Symp. on Principles and Practice of Parallel Programming - June 2003

Speculative Parallelization on SMP

for (i = 0; i < 100; i++) {
    ... = A[L[i]];
    A[K[i]] = ...;
}

Assume there are no dependences and execute iterations in parallel:
- Iteration J: ... = A[4]; A[5] = ...
- Iteration J+1: ... = A[2]; A[2] = ...
- Iteration J+2: ... = A[5]; A[6] = ...

Accesses to shared data must be tracked at runtime. Here iteration J+2 reads A[5], which iteration J writes: a RAW dependence. If the load in J+2 executes before the store in J, a violation is detected and the offending threads are squashed.
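The runtime tracking that the slide describes can be sketched as follows. This is a minimal single-threaded illustration with hypothetical names (`record_load`, `record_store`), not the paper's implementation: it only remembers, per element, the earliest iteration that performed an exposed load, and flags a store from an earlier iteration as a RAW violation.

```c
#include <stdbool.h>

#define N_ELEMS 8
#define NO_ITER (-1)

/* For each shared element, the earliest iteration that performed a
   speculative (exposed) load of it, or NO_ITER if none did. */
static int first_loader[N_ELEMS];

void spec_init(void) {
    for (int i = 0; i < N_ELEMS; i++)
        first_loader[i] = NO_ITER;
}

/* A thread running iteration `iter` speculatively loads element `elem`. */
void record_load(int elem, int iter) {
    if (first_loader[elem] == NO_ITER || iter < first_loader[elem])
        first_loader[elem] = iter;
}

/* A thread running iteration `iter` stores to element `elem`.
   Returns true if a logically later iteration already loaded the
   element: a RAW violation, so that thread must be squashed. */
bool record_store(int elem, int iter) {
    return first_loader[elem] != NO_ITER && first_loader[elem] > iter;
}
```

In the slide's example, iteration J+2 loads A[5] before iteration J stores it, which this check catches.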
Hardware vs. Software Schemes

Hardware schemes:
+ High performance
– Changes to the processor, caches, and coherence controller

Software schemes:
+ No hardware changes
– Poorer performance: software management overhead, suboptimal scheduling, and contention due to the need for synchronization
Wish List

To reduce software overhead: use efficient speculative data structures and optimized operations.
To schedule efficiently: minimize memory overhead while maximizing tolerance to load imbalance and violations.
To reduce contention: avoid synchronization as much as possible.
To avoid performance degradation: a squash contention mechanism.
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Speculative Access Structures

Use versions of the shared data structure: each thread (e.g., Thread A running iteration J, Thread B running iteration J+1) keeps its own version copy A[0]..A[n] of the shared structure. A speculative access structure holds the state (NA, M, EL, ELM) of each version of each element.
Speculative Access Structure I: Simple Array

An array of access states directly mapped to a shadow copy of the user data array:
- NA: not accessed
- EL: exposed loaded
- M: modified
- ELM: exposed loaded and modified
Speculative Access Structure I: Simple Array (cont.)

Cheap to look up on speculative memory operations: a load such as ... = A[2] simply marks the corresponding access-array entry EL.
Expensive to search on commits: the entire access array must be scanned.
Speculative Access Structure II: Indirection Array

An array of indices that records which elements of the shadow data array were touched (e.g., indices 1, 6, 4).
Speculative Access Structure II: Indirection Array (cont.)

A load such as ... = A[2] marks the access-array entry EL and appends index 2 to the indirection array.
Cheap to look up on speculative memory operations, and cheap to search on commits: only the recorded indices are scanned, not the whole access array.
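The interplay of the access array and the indirection array can be sketched as below. The function names (`mark_load`, `mark_store`) are hypothetical, and the state-transition rules follow the slide's definitions: a load is "exposed" only if the element was not already written locally.

```c
/* Access states for each element of a thread's version copy:
   not accessed, exposed loaded, modified, exposed loaded + modified. */
enum state { NA, EL, M, ELM };

#define N_ELEMS 8

static enum state access_arr[N_ELEMS];  /* simple array: one state per element */
static int indir[N_ELEMS];              /* indirection array: touched indices  */
static int n_touched;

/* Mark a speculative load of element e. */
void mark_load(int e) {
    if (access_arr[e] == NA) {          /* first touch: O(1) state update...   */
        indir[n_touched++] = e;         /* ...recorded so commit scans only    */
        access_arr[e] = EL;             /*    the touched elements             */
    }
    /* a load after a local store is not exposed: M and ELM stay unchanged */
}

/* Mark a speculative store to element e. */
void mark_store(int e) {
    if (access_arr[e] == NA) {
        indir[n_touched++] = e;
        access_arr[e] = M;
    } else if (access_arr[e] == EL) {
        access_arr[e] = ELM;            /* exposed load followed by a store */
    }
}
```

At commit time, only `n_touched` entries of `indir` need to be visited, which is the cheap-commit property of the indirection array.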
Scheduling Threads

Static: assign a chunk of N/P iterations to each processor.
+ Only P active threads, so little memory overhead
– Poor tolerance to load imbalance and dependence violations

Dynamic: dynamically assign each of the N iterations.
– N active threads, so bigger memory structures
+ Better tolerance to load imbalance and dependence violations

Our solution: a software version of an aggressive sliding-window mechanism (Cintra, Martinez, and Torrellas; ISCA 2000).
Sliding Window

Schedule a window of W iterations at a time (e.g., W = 4 over iterations 1..8, shared by Thread 1 and Thread 2). Iterations inside the window are assigned dynamically. When the non-speculative thread finishes, the window is advanced. The window size W trades off load balancing against the size of the version structures.
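The window logic can be sketched as follows. This is a simplified single-threaded model under assumed names (`grab_iteration`, `commit_oldest`); the real scheme assigns iterations to threads concurrently and supports squashes.

```c
/* Sliding window of W iteration slots over n_iters iterations.
   Threads grab the next unassigned iteration inside the window; when
   the non-speculative (oldest) iteration commits, the window slides. */
#define W 4

static int next_iter = 1;   /* next iteration to hand out      */
static int oldest = 1;      /* current non-speculative iteration */
static int n_iters = 8;

/* Returns the next iteration to run, or 0 if the window is full
   or the loop is exhausted. */
int grab_iteration(void) {
    if (next_iter > n_iters) return 0;      /* loop finished */
    if (next_iter >= oldest + W) return 0;  /* window full   */
    return next_iter++;
}

/* Called when the non-speculative iteration commits: advance window. */
void commit_oldest(void) {
    oldest++;
}
```

A thread that gets 0 back must wait: this is where a small W limits memory overhead (at most W live version structures) at the cost of idling under load imbalance.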
Memory Operations

Load operation (... = A[K[i]]):
L1: Update the state of the element to EL
L2: Scan the access arrays backwards for a version
L3: Obtain the most up-to-date version

Store operation (A[K[i]] = ...):
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access arrays forwards for violations

Correctness is guaranteed if these steps are globally performed in program order. But program order may not be respected, because of compiler reordering and the use of relaxed memory consistency models.
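The backward version search (L2/L3) and the store steps (S1/S2) can be sketched as below. This is a minimal single-threaded illustration with hypothetical names (`load_version`, `store_version`) and one version slot per iteration; it omits the forward violation scan (S3) and all synchronization.

```c
#define N_ITERS 8
#define N_ELEMS 8

static int shared_copy[N_ELEMS];            /* non-speculative data      */
static int version[N_ITERS + 1][N_ELEMS];   /* per-iteration version copy */
static int modified[N_ITERS + 1][N_ELEMS];  /* 1 if that iteration wrote e */

/* L2/L3: the thread running iteration j scans backwards, from its own
   copy through its predecessors, for the most up-to-date version of
   element e; it falls back to the shared copy if none wrote it. */
int load_version(int j, int e) {
    for (int k = j; k >= 1; k--)
        if (modified[k][e])
            return version[k][e];
    return shared_copy[e];
}

/* S1/S2: store the new version, then mark the element as modified. */
void store_version(int j, int e, int v) {
    version[j][e] = v;     /* S1: perform the store of the new version */
    modified[j][e] = 1;    /* S2: update the element's state           */
}
```

The race conditions on the next slide arise precisely because, without fences, S1/S2 here (and the load's state update) can become globally visible out of order.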
Race Conditions

Certain interleavings of these operations may lead to incorrect execution once the compiler or hardware reorders them. Consider a thread executing the store of iteration J while a thread executing iteration J+K performs the load, with the steps globally performed in the order S2, L2, S1, L3, S3, L1:

- S2 marks the element modified before S1 has actually written the data; the loader's L2 scan finds that version, and L3 then obtains an incorrect value.
- S3 scans forwards for violations before L1 has marked the element as exposed loaded, so the violation is not detected.
Conservative Solution

Enclose the operations in a critical section:

Load operation:
# lock A
L1: Update the state of the element to EL
L2: Scan the access arrays backwards for a version
L3: Obtain the most up-to-date version
# unlock A

Store operation:
# lock A
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access arrays forwards for violations
# unlock A

Drawback: contention.
Our Solution: Memory Fences

Load operation:
L1: Update the state of the element to EL
# memory fence
L2: Scan the access arrays backwards for a version
L3: Obtain the most up-to-date version

Store operation:
S1: Perform the store of the new version
# memory fence
S2: Update the state of the element to M or ELM
# memory fence
S3: Scan the access arrays forwards for violations

All pending memory operations must be performed before execution passes a memory fence. This is the minimum set of memory fences needed. Critical sections are still necessary to protect the structures on thread starts, commits, and squashes.
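On a C11 compiler, the fence placement above can be expressed with `atomic_thread_fence`. The original work targeted SPARC, so this is an illustrative analogue under assumed names, not the authors' code; the point is only where the fences sit relative to the numbered steps.

```c
#include <stdatomic.h>

static _Atomic int version_data;   /* the element's version copy        */
static _Atomic int elem_state;     /* 0 = NA, 1 = EL, 2 = M (simplified) */

void spec_store(int v) {
    /* S1: perform the store of the new version */
    atomic_store_explicit(&version_data, v, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* fence: S1 before S2 */
    /* S2: update the state of the element to M */
    atomic_store_explicit(&elem_state, 2, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* fence: S2 before S3 */
    /* S3: the forward violation scan would follow here */
}

int spec_load(void) {
    /* L1: update the state of the element to EL */
    atomic_store_explicit(&elem_state, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* fence: L1 before L2/L3 */
    /* L2/L3: scan backwards and obtain the most up-to-date version */
    return atomic_load_explicit(&version_data, memory_order_relaxed);
}
```

The fences guarantee that a concurrent storer that reaches S3 will see L1's EL mark (so the violation is detected), and that a loader that sees S2's M mark will also see S1's data.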
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Evaluation Environment

Experiments were executed on a real machine: a Sun Fire 6800 SMP with 24 UltraSPARC-III processors, using OpenMP 2.0.

Applications with non-analyzable loops studied:
- TREE, WUPWISE, MDG: no dependences
- LUCAS, AP3M: dependences
Speedups of Loops: TREE

Very close to the "ideal" DOALL speedup.
Speedups of Loops: WUPWISE

Not as close to the "ideal" DOALL speedup: huge speculative data size.
Importance of Indirection Array
Cost of Violation Checks

Systems evaluated:
- Baseline: our scheme, with violation checks upon stores
- sys2: same as Baseline, but with violation checks upon commits
Cost of Violation Checks (cont.)

Checks upon loads and stores are not too expensive, and may outperform checks at commit on sparse access patterns.
Effects of Scheduling Schemes

Systems evaluated:
- Baseline: sliding window moved when the non-speculative thread finishes
- sys3: sliding window moved when all threads finish (solution adopted by Dang et al. [IPDPS 2002])
- sys4: dynamic scheduling, no partial commits (solution adopted by Rundberg et al. [WSSMM 2000])
Effects of Scheduling Schemes (cont.)

With P = 4 processors, a fully dynamic schedule is not always feasible. The best performance is obtained for W = 2*P to 4*P.
Wish List Revisited

To reduce software overhead: access and indirection arrays; early violation detection (on stores instead of during commit).
To schedule efficiently: the aggressive sliding-window mechanism.
To reduce contention: memory fences instead of critical sections.
To avoid performance degradation: a squash monitor with feedback.
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Related Work: Software-only Speculative Parallelization Schemes

- SW-R-LRPD at Texas A&M University (IPDPS 2002): less aggressive window (moved when all threads finish); violation checks when threads commit
- Chalmers University (WSSMM 2000): dynamic scheme; violation checks upon stores
- IBM Research (SC 1998): series of tests for various specific behaviors
- TLDS at Carnegie Mellon University (tech. rep. 2001): speculation in a software DSM engine
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Conclusions

Systematic consideration of the design space and of cost/performance issues. A new, efficient, and robust software-only speculative parallelization scheme:
– Fine-tuned data structures
– Aggressive sliding window
– Reduced synchronization requirements
– Overhead monitors and feedback

Very good performance:
– 7% to 25% faster than previous schemes
– 71% of the speedup of hand-made, manual parallelization
Data Structures Implementation

(Figure: the user array of elements 0..n alongside the per-thread access structures, with states NA, M, EL, and the corresponding version copies.)
Squashing Threads

Violations are detected by looking up the speculative access structures.

Checking on every store:
+ Checks only the element being accessed
+ Detects violations earlier
± Frequent checks, which need some form of synchronization

Checking at commit:
– Must check all elements
+ Faster speculative memory operations
Squash Contention Mechanism

Goal: avoid performance degradation in the presence of dependences. Implemented with commit and squash monitors: after a given threshold, subsequent invocations of the same loop are executed sequentially.
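The threshold-based fallback can be sketched as below. The names (`report_squash`, `should_speculate`) and the threshold value are illustrative assumptions; the slides do not specify the actual threshold.

```c
/* Squash monitor: if a loop's speculative runs are squashed more than
   SQUASH_THRESHOLD times, later invocations of that loop fall back to
   sequential execution. The threshold value here is illustrative. */
#define SQUASH_THRESHOLD 3

static int squash_count;
static int run_sequentially;

/* Called by the runtime each time a speculative thread is squashed. */
void report_squash(void) {
    if (++squash_count >= SQUASH_THRESHOLD)
        run_sequentially = 1;
}

/* Consulted at each invocation of the loop. */
int should_speculate(void) {
    return !run_sequentially;
}
```

The feedback keeps a dependence-heavy loop from paying repeated squash-and-restart costs forever, which is the "avoid performance degradation" item of the wish list.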
Importance of Squash Monitors
Application Characteristics

Application | Loops                      | % of Seq. Time | Spec. Data Size (KB)
TREE        | accel_10                   | 94             | < 1
MDG         | interf_1000                | 86             |
WUPWISE     | muldeo_200', muldoe_200'   | 41             | 12,000
AP3M        | Shgravll_700               | 78             | 3,000
LUCAS       | mers_mod_square (line 444) | 20             | 4,000
Speedups of Loops: MDG

Very close to the "ideal" DOALL speedup.
Overall Speedups: TREE
Overall Speedups: WUPWISE
Overall Speedups: MDG
Constrained Memory Overheads

Mixed results: either Baseline or sys4 performs best.
Related Work

Hardware-based speculative parallelization schemes:
– I-ACOMA at the University of Illinois
– HYDRA at Stanford
– Multiplex at Purdue
– Multiscalar at Wisconsin
– Clustered Speculative Multithreading at UPC
– TLDS at Carnegie Mellon

Inspector-executor schemes:
– Leung and Zahorjan (PPoPP 1993)
– Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)
Related Work (cont.)

Optimistic concurrency control schemes:
– E.g., Herlihy (ACM TODS 1990); Kung and Robinson (ACM TODS 1981)
– Only need to enforce that accesses to objects in critical sections do not overlap; no total order is required
– Applied to explicitly parallel applications