Toward Efficient and Robust Software Speculative Parallelization on Multiprocessors

Marcelo Cintra, University of Edinburgh (http://www.inf.ed.ac.uk/home/mc)
Diego R. Llanos, Universidad de Valladolid (http://www.infor.uva.es/~diego)
Symp. on Principles and Practice of Parallel Programming - June 2003

Speculative Parallelization on SMP

for (i = 0; i < 100; i++) {
    ... = A[L[i]];
    A[K[i]] = ...;
}

Assume there are no dependences and execute iterations in parallel:
- Iteration J: ... = A[4]; A[5] = ...
- Iteration J+1: ... = A[2]; A[2] = ...
- Iteration J+2: ... = A[5]; A[6] = ...

Accesses to shared data must be tracked at runtime. Here iteration J+2 reads A[5], which iteration J writes: a RAW dependence. If the load in J+2 executes before the store in J, a violation is detected and the offending threads are squashed.
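The runtime tracking that the slide describes can be sketched as follows. This is a minimal single-threaded illustration with hypothetical names (`record_load`, `record_store`), not the paper's implementation: it only remembers, per element, the earliest iteration that performed an exposed load, and flags a store from an earlier iteration as a RAW violation.

```c
#include <stdbool.h>

#define N_ELEMS 8
#define NO_ITER (-1)

/* For each shared element, the earliest iteration that performed a
   speculative (exposed) load of it, or NO_ITER if none did. */
static int first_loader[N_ELEMS];

void spec_init(void) {
    for (int i = 0; i < N_ELEMS; i++)
        first_loader[i] = NO_ITER;
}

/* A thread running iteration `iter` speculatively loads element `elem`. */
void record_load(int elem, int iter) {
    if (first_loader[elem] == NO_ITER || iter < first_loader[elem])
        first_loader[elem] = iter;
}

/* A thread running iteration `iter` stores to element `elem`.
   Returns true if a logically later iteration already loaded the
   element: a RAW violation, so that thread must be squashed. */
bool record_store(int elem, int iter) {
    return first_loader[elem] != NO_ITER && first_loader[elem] > iter;
}
```

In the slide's example, iteration J+2 loads A[5] before iteration J stores it, which this check catches.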
Hardware vs. Software Schemes

Hardware schemes:
+ High performance
– Changes to the processor, caches, and coherence controller

Software schemes:
+ No hardware changes
– Poorer performance: software management overhead, suboptimal scheduling, and contention due to the need for synchronization
Wish List

To reduce software overhead: use efficient speculative data structures and optimized operations.
To schedule efficiently: minimize memory overhead while maximizing tolerance to load imbalance and violations.
To reduce contention: avoid synchronization as much as possible.
To avoid performance degradation: a squash contention mechanism.
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Speculative Access Structures

Use versions of the shared data structure: each thread (e.g., Thread A running iteration J, Thread B running iteration J+1) keeps its own version copy A[0]..A[n] of the shared structure. A speculative access structure holds the state (NA, M, EL, ELM) of each version of each element.
Speculative Access Structure I: Simple Array

An array of access states directly mapped to a shadow copy of the user data array:
- NA: not accessed
- EL: exposed loaded
- M: modified
- ELM: exposed loaded and modified
Speculative Access Structure I: Simple Array (cont.)

Cheap to look up on speculative memory operations: a load such as ... = A[2] simply marks the corresponding access-array entry EL.
Expensive to search on commits: the entire access array must be scanned.
Speculative Access Structure II: Indirection Array

An array of indices that records which elements of the shadow data array were touched (e.g., indices 1, 6, 4).
Speculative Access Structure II: Indirection Array (cont.)

A load such as ... = A[2] marks the access-array entry EL and appends index 2 to the indirection array.
Cheap to look up on speculative memory operations, and cheap to search on commits: only the recorded indices are scanned, not the whole access array.
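The interplay of the access array and the indirection array can be sketched as below. The function names (`mark_load`, `mark_store`) are hypothetical, and the state-transition rules follow the slide's definitions: a load is "exposed" only if the element was not already written locally.

```c
/* Access states for each element of a thread's version copy:
   not accessed, exposed loaded, modified, exposed loaded + modified. */
enum state { NA, EL, M, ELM };

#define N_ELEMS 8

static enum state access_arr[N_ELEMS];  /* simple array: one state per element */
static int indir[N_ELEMS];              /* indirection array: touched indices  */
static int n_touched;

/* Mark a speculative load of element e. */
void mark_load(int e) {
    if (access_arr[e] == NA) {          /* first touch: O(1) state update...   */
        indir[n_touched++] = e;         /* ...recorded so commit scans only    */
        access_arr[e] = EL;             /*    the touched elements             */
    }
    /* a load after a local store is not exposed: M and ELM stay unchanged */
}

/* Mark a speculative store to element e. */
void mark_store(int e) {
    if (access_arr[e] == NA) {
        indir[n_touched++] = e;
        access_arr[e] = M;
    } else if (access_arr[e] == EL) {
        access_arr[e] = ELM;            /* exposed load followed by a store */
    }
}
```

At commit time, only `n_touched` entries of `indir` need to be visited, which is the cheap-commit property of the indirection array.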
Scheduling Threads

Static: assign a chunk of N/P iterations to each processor.
+ Only P active threads, so little memory overhead
– Poor tolerance to load imbalance and dependence violations

Dynamic: dynamically assign each of the N iterations.
– N active threads, so bigger memory structures
+ Better tolerance to load imbalance and dependence violations

Our solution: a software version of an aggressive sliding-window mechanism (Cintra, Martinez, and Torrellas; ISCA 2000).
Sliding Window

Schedule a window of W iterations at a time (e.g., W = 4 over iterations 1..8, shared by Thread 1 and Thread 2). Iterations inside the window are assigned dynamically. When the non-speculative thread finishes, the window is advanced. The window size W trades off load balancing against the size of the version structures.
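The window logic can be sketched as follows. This is a simplified single-threaded model under assumed names (`grab_iteration`, `commit_oldest`); the real scheme assigns iterations to threads concurrently and supports squashes.

```c
/* Sliding window of W iteration slots over n_iters iterations.
   Threads grab the next unassigned iteration inside the window; when
   the non-speculative (oldest) iteration commits, the window slides. */
#define W 4

static int next_iter = 1;   /* next iteration to hand out      */
static int oldest = 1;      /* current non-speculative iteration */
static int n_iters = 8;

/* Returns the next iteration to run, or 0 if the window is full
   or the loop is exhausted. */
int grab_iteration(void) {
    if (next_iter > n_iters) return 0;      /* loop finished */
    if (next_iter >= oldest + W) return 0;  /* window full   */
    return next_iter++;
}

/* Called when the non-speculative iteration commits: advance window. */
void commit_oldest(void) {
    oldest++;
}
```

A thread that gets 0 back must wait: this is where a small W limits memory overhead (at most W live version structures) at the cost of idling under load imbalance.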
Memory Operations

Load operation (... = A[K[i]]):
L1: Update the state of the element to EL
L2: Scan the access arrays backwards for a version
L3: Obtain the most up-to-date version

Store operation (A[K[i]] = ...):
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access arrays forwards for violations

Correctness is guaranteed if these steps are globally performed in program order. But program order may not be respected, because of compiler reordering and the use of relaxed memory consistency models.
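The backward version search (L2/L3) and the store steps (S1/S2) can be sketched as below. This is a minimal single-threaded illustration with hypothetical names (`load_version`, `store_version`) and one version slot per iteration; it omits the forward violation scan (S3) and all synchronization.

```c
#define N_ITERS 8
#define N_ELEMS 8

static int shared_copy[N_ELEMS];            /* non-speculative data      */
static int version[N_ITERS + 1][N_ELEMS];   /* per-iteration version copy */
static int modified[N_ITERS + 1][N_ELEMS];  /* 1 if that iteration wrote e */

/* L2/L3: the thread running iteration j scans backwards, from its own
   copy through its predecessors, for the most up-to-date version of
   element e; it falls back to the shared copy if none wrote it. */
int load_version(int j, int e) {
    for (int k = j; k >= 1; k--)
        if (modified[k][e])
            return version[k][e];
    return shared_copy[e];
}

/* S1/S2: store the new version, then mark the element as modified. */
void store_version(int j, int e, int v) {
    version[j][e] = v;     /* S1: perform the store of the new version */
    modified[j][e] = 1;    /* S2: update the element's state           */
}
```

The race conditions on the next slide arise precisely because, without fences, S1/S2 here (and the load's state update) can become globally visible out of order.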
Race Conditions

Certain interleavings of these operations may lead to incorrect execution once the compiler or hardware reorders them. Consider a thread executing the store of iteration J while a thread executing iteration J+K performs the load, with the steps globally performed in the order S2, L2, S1, L3, S3, L1:

- S2 marks the element modified before S1 has actually written the data; the loader's L2 scan finds that version, and L3 then obtains an incorrect value.
- S3 scans forwards for violations before L1 has marked the element as exposed loaded, so the violation is not detected.
Conservative Solution

Enclose the operations in a critical section:

Load operation:
# lock A
L1: Update the state of the element to EL
L2: Scan the access arrays backwards for a version
L3: Obtain the most up-to-date version
# unlock A

Store operation:
# lock A
S1: Perform the store of the new version
S2: Update the state of the element to M or ELM
S3: Scan the access arrays forwards for violations
# unlock A

Drawback: contention.
Our Solution: Memory Fences

Load operation:
L1: Update the state of the element to EL
# memory fence
L2: Scan the access arrays backwards for a version
L3: Obtain the most up-to-date version

Store operation:
S1: Perform the store of the new version
# memory fence
S2: Update the state of the element to M or ELM
# memory fence
S3: Scan the access arrays forwards for violations

All pending memory operations must be performed before execution passes a memory fence. This is the minimum set of memory fences needed. Critical sections are still necessary to protect the structures on thread starts, commits, and squashes.
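On a C11 compiler, the fence placement above can be expressed with `atomic_thread_fence`. The original work targeted SPARC, so this is an illustrative analogue under assumed names, not the authors' code; the point is only where the fences sit relative to the numbered steps.

```c
#include <stdatomic.h>

static _Atomic int version_data;   /* the element's version copy        */
static _Atomic int elem_state;     /* 0 = NA, 1 = EL, 2 = M (simplified) */

void spec_store(int v) {
    /* S1: perform the store of the new version */
    atomic_store_explicit(&version_data, v, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* fence: S1 before S2 */
    /* S2: update the state of the element to M */
    atomic_store_explicit(&elem_state, 2, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* fence: S2 before S3 */
    /* S3: the forward violation scan would follow here */
}

int spec_load(void) {
    /* L1: update the state of the element to EL */
    atomic_store_explicit(&elem_state, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* fence: L1 before L2/L3 */
    /* L2/L3: scan backwards and obtain the most up-to-date version */
    return atomic_load_explicit(&version_data, memory_order_relaxed);
}
```

The fences guarantee that a concurrent storer that reaches S3 will see L1's EL mark (so the violation is detected), and that a loader that sees S2's M mark will also see S1's data.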
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Evaluation Environment

Experiments were executed on a real machine: a Sun Fire 6800 SMP with 24 UltraSPARC-III processors, using OpenMP 2.0.

Applications with non-analyzable loops studied:
- TREE, WUPWISE, MDG: no dependences
- LUCAS, AP3M: dependences
Speedups of Loops: TREE

Very close to the "ideal" DOALL speedup.
Speedups of Loops: WUPWISE

Not as close to the "ideal" DOALL speedup: huge speculative data size.
Importance of Indirection Array
Cost of Violation Checks

Systems evaluated:
- Baseline: our scheme, with violation checks upon stores
- sys2: same as Baseline, but with violation checks upon commits
Cost of Violation Checks (cont.)

Checks upon loads and stores are not too expensive, and may outperform checks at commit on sparse access patterns.
Effects of Scheduling Schemes

Systems evaluated:
- Baseline: sliding window moved when the non-speculative thread finishes
- sys3: sliding window moved when all threads finish (solution adopted by Dang et al. [IPDPS 2002])
- sys4: dynamic scheduling, no partial commits (solution adopted by Rundberg et al. [WSSMM 2000])
Effects of Scheduling Schemes (cont.)

With P = 4 processors, a fully dynamic schedule is not always feasible. The best performance is obtained for W = 2*P to 4*P.
Wish List Revisited

To reduce software overhead: access and indirection arrays; early violation detection (on stores instead of during commit).
To schedule efficiently: the aggressive sliding-window mechanism.
To reduce contention: memory fences instead of critical sections.
To avoid performance degradation: a squash monitor with feedback.
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Related Work: Software-only Speculative Parallelization Schemes

- SW-R-LRPD at Texas A&M University (IPDPS 2002): less aggressive window (moved when all threads finish); violation checks when threads commit
- Chalmers University (WSSMM 2000): dynamic scheme; violation checks upon stores
- IBM Research (SC 1998): series of tests for various specific behaviors
- TLDS at Carnegie Mellon University (tech. rep. 2001): speculation in a software DSM engine
Outline

- Motivation
- Our software-only scheme
- Evaluation
- Related Work
- Conclusions
Conclusions

Systematic consideration of the design space and of cost/performance issues. A new, efficient, and robust software-only speculative parallelization scheme:
– Fine-tuned data structures
– Aggressive sliding window
– Reduced synchronization requirements
– Overhead monitors and feedback

Very good performance:
– 7% to 25% faster than previous schemes
– 71% of the speedup of hand-made, manual parallelization
Data Structures Implementation

(Figure: the user array of elements 0..n alongside the per-thread access structures, with states NA, M, EL, and the corresponding version copies.)
Squashing Threads

Violations are detected by looking up the speculative access structures.

Checking on every store:
+ Checks only the element being accessed
+ Detects violations earlier
± Frequent checks, which need some form of synchronization

Checking at commit:
– Must check all elements
+ Faster speculative memory operations
Squash Contention Mechanism

Goal: avoid performance degradation in the presence of dependences. Implemented with commit and squash monitors: after a given threshold, subsequent invocations of the same loop are executed sequentially.
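The threshold-based fallback can be sketched as below. The names (`report_squash`, `should_speculate`) and the threshold value are illustrative assumptions; the slides do not specify the actual threshold.

```c
/* Squash monitor: if a loop's speculative runs are squashed more than
   SQUASH_THRESHOLD times, later invocations of that loop fall back to
   sequential execution. The threshold value here is illustrative. */
#define SQUASH_THRESHOLD 3

static int squash_count;
static int run_sequentially;

/* Called by the runtime each time a speculative thread is squashed. */
void report_squash(void) {
    if (++squash_count >= SQUASH_THRESHOLD)
        run_sequentially = 1;
}

/* Consulted at each invocation of the loop. */
int should_speculate(void) {
    return !run_sequentially;
}
```

The feedback keeps a dependence-heavy loop from paying repeated squash-and-restart costs forever, which is the "avoid performance degradation" item of the wish list.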
Importance of Squash Monitors
Application Characteristics

Application | Loops                      | % of Seq. Time | Spec. Data Size (KB)
TREE        | accel_10                   | 94             | < 1
MDG         | interf_1000                | 86             |
WUPWISE     | muldeo_200', muldoe_200'   | 41             | 12,000
AP3M        | Shgravll_700               | 78             | 3,000
LUCAS       | mers_mod_square (line 444) | 20             | 4,000
Speedups of Loops: MDG

Very close to the "ideal" DOALL speedup.
Overall Speedups: TREE
Overall Speedups: WUPWISE
Overall Speedups: MDG
Constrained Memory Overheads

Mixed results: either Baseline or sys4 performs best.
Related Work

Hardware-based speculative parallelization schemes:
– I-ACOMA at the University of Illinois
– HYDRA at Stanford
– Multiplex at Purdue
– Multiscalar at Wisconsin
– Clustered Speculative Multithreading at UPC
– TLDS at Carnegie Mellon

Inspector-executor schemes:
– Leung and Zahorjan (PPoPP 1993)
– Saltz, Mirchandaney, and Crowley (IEEE ToC 1991)
Related Work (cont.)

Optimistic concurrency control schemes:
– E.g., Herlihy (ACM TODS 1990); Kung and Robinson (ACM TODS 1981)
– Only need to enforce that accesses to objects in critical sections do not overlap; no total order is required
– Applied to explicitly parallel applications