Beyond Auto-Parallelization: Compilers for Many-Core Systems Marcelo Cintra University of Edinburgh


1 Beyond Auto-Parallelization: Compilers for Many-Core Systems Marcelo Cintra University of Edinburgh

2 Compilers for Parallel Computers (Today)
Moore for Less Keynote - September
• Auto-parallelizing compilers
  – “Holy grail”: convert sequential programs into parallel programs with little or no user intervention
  – Only partial success, despite decades of work
  – No performance debugging tools
• For explicitly parallel languages/annotations (e.g., OpenMP, Java Threads)
  – Main goal: correctly map high-level data and control flow to hardware/OS threads and communication
  – Secondary goal: perform simple optimizations specific to parallel execution
  – Simple correctness and performance debugging tools

3 Compilers for Parallel Computers (Future)
• Data flow/dependence analysis tools – unsafe/speculative
  – Probabilistic approaches
  – Profile-based approaches
• Multithreading-specific optimization toolbox
  – Including alternative/speculative parallel programming models (e.g., Transactional Memory (TM))
• Auto-parallelizing compilers – with speculation
  – Thread-level speculation (TLS)
  – Helper threads
Holistic parallelizing tool chain.

4 Why Be Speculative?
• Performance of programs is ultimately limited by control and data flows
• Most compiler optimizations exploit knowledge of control and data flows
• Techniques based on complete/accurate knowledge of control and data flows are reaching their limit
  – True for both sequential and parallel optimizations
Future compiler optimizations must rely on incomplete knowledge: speculative execution

5 Compilers for Parallel Computers (Future)
[Diagram; surviving labels: Dependence/Flow Analysis Tool, Parallelizing Compiler, Unsafe]

6 Outline
• Context and Motivation
• History and status-quo of auto-parallelizing compilers
  – Data dependence analysis for array-based programs
  – Data dependence analysis for irregular programs
• Auto-parallelizing compilers for TLS
  – TLS execution model (speculative parallelization)
  – Static compiler cost model (PACT’04, TACO’07)

7 Data Dependence Analysis for Arrays
• Based on mathematical evaluation of array index expressions within loop nests
• Progressively more capable analyses (e.g., GCD test, Banerjee test), but still restricted to affine loop index expressions
• Coupled with a mathematical framework to represent loop transformations (e.g., loop interchange, skewing) that can help expose more parallelism
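As a concrete illustration of the GCD test named on this slide, here is a minimal C sketch. It assumes two references of the affine form A[a*i + b] and A[c*j + d]; the function names and this parameterization are illustrative and not taken from the talk. A dependence needs an integer solution of a*i - c*j = d - b, which exists only if gcd(a, c) divides d - b. The test ignores loop bounds, so a "may depend" answer is conservative.

```c
#include <stdlib.h>

/* Greatest common divisor of two non-negative integers (Euclid). */
static int gcd(int x, int y) {
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/*
 * GCD test for the reference pair A[a*i + b] and A[c*j + d].
 * Returns 0 when the two references can never touch the same element,
 * 1 when a dependence cannot be ruled out (loop bounds are ignored).
 */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(abs(a), abs(c));
    if (g == 0)                 /* both subscripts are constants */
        return b == d;
    return (d - b) % g == 0;    /* solvable only if g divides d - b */
}
```

For example, A[2*i] and A[2*j + 1] touch only even and only odd elements respectively, so gcd_test_may_depend(2, 0, 2, 1) reports independence. A subscript such as A[L[i]] has no coefficients a and b at all, which is why the test is restricted to affine expressions.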

8 Data Dependence Analysis for Arrays
• What’s wrong with traditional data dependence?
  – Not all index expressions are affine or even statically defined (e.g., subscripted subscripts)
  – Not all loops are well structured (e.g., conditional exits, control flow)
  – Not all procedures are analyzable (e.g., unavailable code, aliasing, global data access)
  – Not all applications make intense use of arrays and loop nests (e.g., trees, hash tables, linked lists)

9 Data Dependence Analysis for Irregular Programs
• Based on ad-hoc analyses (e.g., pointer analysis, shape analysis, task graph analysis)
There isn’t a comprehensive data dependence analysis framework for irregular applications

10 Outline
• Context and Motivation
• History and status-quo of auto-parallelizing compilers
  – Data dependence analysis for array-based programs
  – Data dependence analysis for irregular programs
• Auto-parallelizing compilers for TLS
  – TLS execution model (speculative parallelization)
  – Static compiler cost model (PACT’04, TACO’07)

11 Thread Level Speculation (TLS)
• Assume no dependences and execute threads in parallel
• While speculating, buffer speculative data separately
• Track data accesses and monitor cross-thread violations
• Squash offending threads and restart them
• All this can be done in hardware, software, or a combination

    for(i=0; i<100; i++) {
      … = A[L[i]] + …
      A[K[i]] = …
    }

    Iteration J:    … = A[4] + …    A[5] = …
    Iteration J+1:  … = A[2] + …    A[2] = …
    Iteration J+2:  … = A[5] + …    A[6] = …
    RAW: A[5], written in iteration J, is read in iteration J+2
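The violation tracking described on this slide can be sketched sequentially in C. This is a simplified software model, not the hardware or runtime mechanism from the talk: the function name and the `window` parameter (number of iterations speculated at once) are hypothetical, and the sketch conservatively assumes that every load in a window is issued before any store from an earlier iteration in that window becomes visible.

```c
#include <stddef.h>

/*
 * Model of the loop
 *     for (i = 0; i < n; i++) { ... = A[L[i]] + ...; A[K[i]] = ...; }
 * run speculatively in groups of `window` consecutive iterations.
 * Returns how many iterations read an element that an earlier iteration
 * of the same group writes (a cross-thread RAW violation), i.e. how
 * many iterations would be squashed and restarted under this model.
 */
size_t count_raw_violations(const int *L, const int *K,
                            size_t n, size_t window) {
    size_t violations = 0;
    if (window == 0)
        return 0;
    for (size_t start = 0; start < n; start += window) {
        size_t end = (start + window < n) ? start + window : n;
        for (size_t j = start + 1; j < end; j++) {
            for (size_t i = start; i < j; i++) {
                if (K[i] == L[j]) {   /* earlier write feeds later read */
                    violations++;
                    break;            /* count each iteration once */
                }
            }
        }
    }
    return violations;
}
```

With the slide's three iterations (L = {4, 2, 5}, K = {5, 2, 6}) and a window of 3, the model reports one violation: the last iteration reads A[5] while the first one writes it. A window of 1 corresponds to sequential execution and never reports a violation.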

12 TLS Overheads
• Squash & restart: re-executing the threads
• Speculative buffer overflow: speculative buffer is full, thread stalls until it becomes non-speculative
• Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
• Load imbalance: processor waiting for its thread to become non-speculative so that it can commit

13 Coping with overheads: Cost Model!
• Compiler cost models are key to guide optimizations, but no such cost model exists for TLS
• Speculative parallelization can deliver significant speedup or slowdown
  – Several speculation overheads
  – Overheads are hard to estimate (e.g., squash?)
• A prediction of the value of speedup can be useful
  – e.g., multi-tasking environment
    • program A wants to run speculatively in parallel on 4 cores (predicted speedup 1.8)
    • other programs are waiting to be scheduled
    • OS decides it does not pay off
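The scheduling scenario on this slide can be made concrete with a small C sketch. The decision rule below is an assumption made for illustration, not something stated in the talk: it compares parallel efficiency (predicted speedup divided by cores requested) against a caller-chosen threshold. The function name and the `min_efficiency` parameter are hypothetical.

```c
/*
 * Decide whether granting `cores` cores to a speculative program is
 * worthwhile, given the speedup predicted by a cost model.
 * Returns 1 if the efficiency (speedup per core) reaches the
 * threshold, 0 otherwise.
 */
int tls_pays_off(double predicted_speedup, int cores,
                 double min_efficiency) {
    if (cores <= 0)
        return 0;
    double efficiency = predicted_speedup / (double)cores;
    return efficiency >= min_efficiency;
}
```

With the slide's numbers, a predicted speedup of 1.8 on 4 cores gives an efficiency of 0.45, so under an assumed threshold of 0.5 the cores would go to the waiting programs instead.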

14 TLS Overheads
• Squash & restart: re-executing the threads
  – Hard because violations are highly unpredictable
• Speculative buffer overflow: speculative buffer is full, thread stalls until it becomes non-speculative
  – Hard because write-sets are somewhat unpredictable
• Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
  – Hard because write-sets are somewhat unpredictable
• Load imbalance: processor waiting for its thread to become non-speculative so that it can commit
  – Hard because workloads are very unpredictable and order does matter due to the in-order commit requirement
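To see why order matters under in-order commit, consider the following toy timing model in C. It is an illustrative sketch, not the cost model from the PACT'04/TACO'07 papers: it assumes threads are assigned to processors round-robin, ignores squashes and buffer overflow, and charges a fixed dispatch and commit cost per thread. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 1024

/*
 * Estimated completion time of n speculative threads with execution
 * times t[0..n-1] on `procs` processors. Thread i runs on processor
 * i % procs, may start only after that processor's previous thread has
 * committed, and may commit only after thread i-1 has committed.
 */
double tls_time_in_order_commit(const double *t, size_t n, size_t procs,
                                double dispatch, double commit) {
    double commit_end[MAX_THREADS];
    assert(n > 0 && n <= MAX_THREADS && procs > 0);
    for (size_t i = 0; i < n; i++) {
        /* Processor becomes free when its previous thread commits. */
        double start = (i < procs) ? 0.0 : commit_end[i - procs];
        double finish = start + dispatch + t[i];
        /* In-order commit: wait for the predecessor if it is slower. */
        double prev = (i > 0) ? commit_end[i - 1] : 0.0;
        commit_end[i] = (finish > prev ? finish : prev) + commit;
    }
    return commit_end[n - 1];
}
```

With two processors and zero dispatch and commit cost, thread times {4, 4, 1, 1} finish at time 5, while the same work ordered as {4, 1, 4, 1} finishes at time 8: in the second ordering each short thread idles its processor while waiting for a long predecessor to commit.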

15 Our Compiler Cost Model: Highlights
• First fully static compiler cost model for TLS
• Can handle all TLS overheads in a single framework
  – Including load imbalance, which is not handled by any other cost model
• Produces not just a qualitative (“good” or “bad”) assessment of the TLS benefits but a quantitative value (i.e., expected speedup/slowdown)
• Can be easily integrated into most compilers at the intermediate representation level
• Simple and fast to compute

16 Speedup Distribution
[Chart not reproduced in transcript]
Very varied speedup/slowdown behavior

17 Model Accuracy (I): Outcomes
• Only 17% false positives (performance degradation)
• Negligible false negatives (missed opportunities)
• Most speedups/slowdowns correctly predicted by the model
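The outcome categories on this slide can be expressed as a small C classifier. This is a sketch under one stated assumption: a loop counts as a "gain" when its speedup is above 1.0, for both the predicted and the measured value. The type and function names are hypothetical.

```c
typedef enum {
    TRUE_POSITIVE,   /* predicted speedup, measured speedup */
    TRUE_NEGATIVE,   /* predicted slowdown, measured slowdown */
    FALSE_POSITIVE,  /* predicted speedup, measured slowdown:
                        performance degradation */
    FALSE_NEGATIVE   /* predicted slowdown, measured speedup:
                        missed opportunity */
} Outcome;

/* Compare a predicted speedup with the measured one. */
Outcome classify_prediction(double predicted, double measured) {
    int predicted_gain = predicted > 1.0;
    int measured_gain = measured > 1.0;
    if (predicted_gain && measured_gain)
        return TRUE_POSITIVE;
    if (!predicted_gain && !measured_gain)
        return TRUE_NEGATIVE;
    return predicted_gain ? FALSE_POSITIVE : FALSE_NEGATIVE;
}
```

A false positive is the costly case, because the compiler parallelizes a loop that then runs slower; a false negative only leaves a possible speedup unused.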

18 Current Developments
• Done:
  – Completed implementation of TLS code generator in GCC
• Doing:
  – Implementing cost model in this TLS GCC
  – Profiling TLS program behavior (with IBM and U. of Manchester)
• To Do:
  – Develop hybrid cost models based on static and profile information
  – Develop “intelligent” cost models based on Machine Learning (with U. of Manchester)

19 Summary
• Paraphrasing M. Snir† (UIUC): “parallel programming will have to become synonymous with programming”
• However, this calls for:
  – Better (and unsafe) data dependence analysis tools
  – Explicit (and speculative) parallel models
  – Auto-parallelizing (speculative) compilers
• Much work still needs to be done.
• At U. of Edinburgh:
  – Auto-parallelizing TLS compilers
  – TLS hardware
  – STM (software TM)
† Director of Intel+Microsoft’s UPCRC

20 Acknowledgments
• Research Team and Collaborators
  – Jialin Dou
  – Salman Khan
  – Polychronis Xekalakis
  – Nikolas Ioannou
  – Fabricio Goes
  – Constantino Ribeiro
  – Dr. G. Brown, Dr. M. Lujan, Prof. I. Watson (U. of Manchester)
  – Prof. Diego Llanos (U. of Valladolid)
• Funding
  – UK – EPSRC: GR/R65169/01, EP/G000697/1

21 Beyond Auto-Parallelization: Compilers for Many-Core Systems Marcelo Cintra University of Edinburgh

