Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors
Marcelo Cintra (University of Edinburgh) and Diego R. Llanos (Universidad de Valladolid)
Symp. on Principles and Practice of Parallel Programming - June

Speculative parallelization on SMP
for(i=0; i<100; i++) { ... = A[L[i]]; A[K[i]] = ... }
Assume no dependences and execute iterations in parallel:
Iteration J: ... = A[4]; A[6] = ...
Iteration J+1: ... = A[2]; A[2] = ...
Iteration J+2: ... = A[5]; A[5] = ...
Accesses to shared data are tracked at runtime; if a RAW violation is detected, the offending threads are squashed.
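The loop above cannot be analyzed at compile time because its subscripts come through the index arrays K and L. A minimal sketch (the helper name and example values are mine, not from the paper) of why the dependences are only known at runtime:

```c
#include <stddef.h>

/* Hypothetical runtime check: a RAW dependence exists between iterations
   i < j when iteration i stores to A[K[i]] and iteration j later loads
   A[L[j]] from the same location. Whether this happens depends entirely
   on the runtime contents of K and L, which the compiler cannot see. */
int has_raw_dependence(const int *K, const int *L, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (K[i] == L[j])
                return 1;   /* iteration j would read what iteration i wrote */
    return 0;
}
```

A speculative scheme runs the iterations in parallel anyway and performs this detection on the fly, element by element, rather than with an up-front inspector pass.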
Hardware vs. Software schemes
Hardware schemes:
+ High performance
– Changes to processor, caches, and coherence controller
Software schemes:
+ No hardware changes
– Poorer performance: software management overhead, suboptimal scheduling, contention due to the need for synchronization
Wish List
To reduce software overhead: use efficient speculative data structures and optimized operations
To schedule efficiently: minimize memory overhead while maximizing tolerance to load imbalance and violations
To reduce contention: avoid synchronization as much as possible
To avoid performance degradation: use a squash contention mechanism
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Speculative Access Structures
Use versions of the shared data structure: each thread (e.g., Thread A at iteration J, Thread B at iteration J+1) works on its own version copy of A[0..n].
A speculative access structure holds the state (NA, EL, M, ELM) of each version of each element.
Speculative Access Structure I: Simple Array
Array of access states directly mapped to a shadow copy of the user data array.
NA: not accessed
EL: exposed loaded
M: modified
ELM: exposed loaded and modified
Speculative Access Structure I: Simple Array
Cheap to look up on speculative memory operations: on ... = A[2], the access array entry for element 2 is set to EL.
Expensive to search on commits: the entire access array must be scanned.
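The per-element state transitions can be sketched as follows (the encoding is my own, assuming the four states named on the slide):

```c
/* The four access states of one element's version copy. */
typedef enum { NA, EL, M, ELM } acc_state;

/* A load is "exposed" only if the element has not been written locally
   first (NA -> EL); once the element is locally modified, later loads
   see the local value and the state does not change. */
acc_state after_load(acc_state s)  { return s == NA ? EL : s; }

/* A store marks the element modified; an earlier exposed load is
   remembered by upgrading EL to ELM instead of plain M. */
acc_state after_store(acc_state s) {
    if (s == NA) return M;
    if (s == EL) return ELM;
    return s;                      /* M and ELM stay as they are */
}
```

The EL/ELM distinction matters because only exposed loads can consume stale values and therefore trigger squashes.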
Speculative Access Structure II: Indirection Array
Array of indices that records which elements of the shadow data array were touched.
Speculative Access Structure II: Indirection Array
On ... = A[2], the access array entry for element 2 is set to EL and index 2 is appended to the indirection array.
Cheap to look up on speculative memory operations.
Cheap to search on commits: only the recorded indices are scanned.
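A sketch of how the indirection array keeps commit cost proportional to the number of touched elements rather than to the array size (all names and sizes here are my own illustration):

```c
#define NELEMS 1024
enum { S_NA, S_EL, S_M, S_ELM };

typedef struct {
    int state[NELEMS];     /* access array, all S_NA (== 0) initially */
    int touched[NELEMS];   /* indirection array: indices accessed so far */
    int ntouched;
} spec_struct;

/* Record the index only on the first access, so the commit loop can
   walk just touched[0..ntouched) instead of all NELEMS entries. */
void mark_store(spec_struct *s, int idx) {
    if (s->state[idx] == S_NA) {
        s->touched[s->ntouched++] = idx;   /* first touch: remember index */
        s->state[idx] = S_M;
    } else if (s->state[idx] == S_EL) {
        s->state[idx] = S_ELM;             /* exposed load then store */
    }                                      /* S_M / S_ELM unchanged */
}
```

On commit, only `ntouched` versions need to be copied back to the user array, which is the source of the speedup shown later for sparse access patterns.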
Scheduling Threads
Static: assign a chunk of N/P iterations to each processor
+ Only P active threads: little memory overhead
– Poor tolerance to load imbalance and dependence violations
Dynamic: dynamically assign each of the N iterations
– N active threads: bigger memory structures
+ Better tolerance to load imbalance and dependence violations
Our solution: software version of an aggressive sliding window mechanism†
† Cintra, Martinez and Torrellas; ISCA 2000
Sliding Window
Schedule a window of W out of the N iterations at a time.
Iterations are assigned to threads dynamically inside the window.
When the non-speculative thread finishes, the window is advanced.
Tradeoff between load balancing and size of version structures.
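The window bookkeeping can be sketched as below (an assumed shape for illustration; the actual implementation also tracks per-slot state and performs partial commits):

```c
/* Iterations may start only inside [base, base + W); committing the
   non-speculative iteration (== base) slides the window forward. */
typedef struct { int base, next, W; } window;

/* Dynamic assignment inside the window: returns the next iteration
   to run, or -1 if the window is full and the thread must wait. */
int try_start(window *w) {
    if (w->next >= w->base + w->W)
        return -1;
    return w->next++;
}

/* Called when the non-speculative thread finishes its iteration. */
void commit_nonspec(window *w) { w->base++; }
```

W bounds the number of live version structures (the memory overhead), while still allowing dynamic assignment within the window (the load-balance benefit), which is exactly the tradeoff the slide names.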
Memory operations
Load operation (... = A[K[i]]):
L1: Update state of the element to EL
L2: Scan backwards access array for version
L3: Obtain most up-to-date version
Store operation (A[K[i]] = ...):
S1: Perform the store of the new version
S2: Update state of the element to M or ELM
S3: Scan forwards access array for violations
Correctness is guaranteed if these steps are globally performed in program order. But program order may not be respected: compiler reordering; use of relaxed memory consistency models.
Race Conditions
Certain interleavings of operations may lead to incorrect execution. Consider a thread executing iteration J performing a store (S1, S2, S3) racing with a thread executing iteration J+K performing a load (L1, L2, L3):
–If the store is reordered so that S2 performs before S1, the load's backward scan (L2) may see the state already marked modified and pick that version, yet obtain (L3) a value the store (S1) has not yet written: incorrect value.
–If the load is reordered so that L2 and L3 perform before L1, the store's forward scan for violations (S3) may run before the element is marked exposed loaded (L1): the violation is not detected.
Conservative Solution
Enclose each operation in a critical section:
Load Operation:
# lock A
L1: Update state of the element to EL
L2: Scan backwards access array for version
L3: Obtain most up-to-date version
# unlock A
Store Operation:
# lock A
S1: Perform the store of the new version
S2: Update state of the element to M or ELM
S3: Scan forwards access array for violations
# unlock A
Drawback: contention
Our Solution: Memory Fences
Load Operation:
L1: Update state of the element to EL
# memory fence
L2: Scan backwards access array for version
L3: Obtain most up-to-date version
Store Operation:
S1: Perform the store of the new version
# memory fence
S2: Update state of the element to M or ELM
# memory fence
S3: Scan forwards access array for violations
All pending operations must be performed before passing a memory fence.
This is the minimum set of memory fences needed.
Critical sections are still necessary to protect structures on thread starts, commits, and squashes.
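With C11 atomics, the fence placement on the store side could be sketched like this (a hypothetical mapping for illustration; the paper targets SPARC, where the fences would be membar instructions, and the S3 scan is elided here):

```c
#include <stdatomic.h>

static _Atomic int version;   /* version copy of one element */
static _Atomic int state;     /* its access state: 0 = NA, 2 = M */

void spec_store(int v) {
    atomic_store_explicit(&version, v, memory_order_relaxed);   /* S1 */
    atomic_thread_fence(memory_order_seq_cst);  /* S1 visible before S2 */
    atomic_store_explicit(&state, 2, memory_order_relaxed);     /* S2 */
    atomic_thread_fence(memory_order_seq_cst);  /* S2 visible before S3 */
    /* S3: scan forward over more speculative threads' states here */
}
```

The fence between S1 and S2 prevents the incorrect-value race (a concurrent load can never see the state marked modified before the value is written); the fence between S2 and S3 prevents the undetected-violation race.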
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Evaluation Environment
Execution of experiments on a real machine: Sun Fire 6800 SMP with 24 UltraSPARC-III processors, OpenMP 2.0.
Study of applications with non-analyzable loops:
TREE, WUPWISE, MDG: no dependences
LUCAS, AP3M: dependences
Speedups of Loops: TREE
Very close to “ideal” DOALL speedup
Speedups of Loops: WUPWISE
Not so close to “ideal” DOALL speedup: huge spec data size
Importance of Indirection Array
Cost of Violation Checks
Systems evaluated:
Baseline: our scheme, with violation checks upon stores
sys2: same as Baseline, but violation checks upon commits
Cost of Violation Checks
May outperform checks at commit on sparse accesses.
Checks upon loads and stores are not too expensive.
Effects of Scheduling Schemes
Systems evaluated:
Baseline: sliding window moved when the non-speculative thread finishes
sys3: sliding window moved when all threads finish (solution adopted by Dang et al. [IPDPS 2002])
sys4: dynamic scheduling, no partial commits (solution adopted by Rundberg et al. [WSSMM 2000])
Effects of Scheduling Schemes
P = 4 processors.
A fully dynamic schedule is not always feasible.
Best performance for W = 2*P to 4*P.
Wish List Revisited
To reduce software overhead: access and indirection arrays; early violation detection (on stores instead of during commit)
To schedule efficiently: aggressive sliding window mechanism
To reduce contention: use of memory fences instead of critical sections
To avoid performance degradation: squash monitor with feedback
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Software-only speculative parallelization schemes
SW-R-LRPD at Texas A&M University (IPDPS 2002): less aggressive window (moved when all threads finish); violation checks when threads commit
Chalmers University (WSSMM 2000): dynamic scheme; violation checks upon stores
IBM Research (SC 1998): series of tests for various specific behaviors
TLDS at Carnegie Mellon University (tech. rep. 2001): speculation in a software DSM engine
Outline: Motivation, Our software-only scheme, Evaluation, Related Work, Conclusions
Conclusions
Systematic consideration of the design space and cost/performance issues.
New efficient and robust software-only speculative parallelization scheme:
–Fine-tuned data structures
–Aggressive sliding window
–Reduced synchronization requirements
–Overhead monitors and feedback
Very good performance:
–7 to 25% faster than previous schemes
–71% of the speedup of manual parallelization
Data Structures Implementation
The user array, together with per-thread version copies and access structures holding per-element states (NA, EL, M, ELM).
Squashing Threads
Violations are detected by looking up the speculative access structure:
On every store:
+ Check only the element being accessed
+ Earlier violation detection
± Frequent checks; need some form of synchronization
At commit:
– Check all elements
+ Faster speculative memory operations
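The per-store check (S3) can be sketched as a forward scan over the states that more speculative threads hold for the element just written (the thread/state layout is my own simplification):

```c
enum { T_NA, T_EL, T_M, T_ELM };

/* After thread t stores to an element, scan the access states that
   threads t+1 .. nthreads-1 hold for that same element. An exposed
   load (T_EL or T_ELM) there means a successor already read a stale
   value: a RAW violation, so that thread must be squashed. */
int first_violator(const int *states, int nthreads, int t) {
    for (int s = t + 1; s < nthreads; s++)
        if (states[s] == T_EL || states[s] == T_ELM)
            return s;
    return -1;   /* no violation detected */
}
```

Because only successors can be violated by a store, the scan runs forward, mirroring how the load's version search (L2) runs backward over predecessors.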
Squash contention mechanism
Goal: avoid performance degradation in the presence of dependences.
Implemented with commit and squash monitors.
After a given threshold of squashes, subsequent invocations of the same loop are executed sequentially.
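A minimal sketch of the monitor-with-feedback idea (the counter and threshold names are hypothetical):

```c
/* Once the squash count for a loop crosses a threshold, the runtime
   stops speculating on that loop and runs later invocations serially,
   avoiding repeated squash/restart overhead on dependence-heavy loops. */
typedef struct { int squashes, threshold, run_sequential; } squash_monitor;

void note_squash(squash_monitor *m) {
    if (++m->squashes >= m->threshold)
        m->run_sequential = 1;     /* feedback: fall back to sequential */
}
```

Before each invocation the runtime would consult `run_sequential` to choose between the speculative and the plain serial version of the loop.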
Importance of Squash Monitors
Application Characteristics
Application | Loops
TREE | accel_10
MDG | interf_1000
WUPWISE | muldeo_200’, muldoe_200’
AP3M | shgravll_700
LUCAS | mers_mod_square (line 444)
The table also reports, per loop, the % of sequential time and the speculative data size (KB).
Speedups of Loops: MDG
Very close to “ideal” DOALL speedup
Overall Speedups: TREE
Overall Speedups: WUPWISE
Overall Speedups: MDG
Constrained Memory Overheads
Mixed results: either Baseline or sys4 performs best.
Related Work
Hardware-based speculative parallelization schemes:
–I-ACOMA at University of Illinois
–HYDRA at Stanford
–Multiplex at Purdue
–Multiscalar at Wisconsin
–Clustered Speculative Multithreading at UPC
–TLDS at Carnegie Mellon
Inspector-Executor scheme:
–Leung and Zahorjan (PPoPP 1993)
–Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)
Related Work
Optimistic Concurrency Control schemes:
–E.g., Herlihy (ACM TODS 1990); Kung and Robinson (ACM TODS 1981)
–Only need to enforce that accesses to objects in critical sections do not overlap: no total order required
–Applied to explicitly parallel applications