
1 Beyond Auto-Parallelization: Compilers for Many-Core Systems
Marcelo Cintra, University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc

2 Moore for Less Keynote - September 2008
Compilers for Parallel Computers (Today)
• Auto-parallelizing compilers
– "Holy grail": convert sequential programs into parallel programs with little or no user intervention
– Only partial success, despite decades of work
– No performance debugging tools
• For explicitly parallel languages/annotations (e.g., OpenMP, Java Threads)
– Main goal: correctly map high-level data and control flow to hardware/OS threads and communication
– Secondary goal: perform simple optimizations specific to parallel execution
– Simple correctness and performance debugging tools

3 Compilers for Parallel Computers (Future)
• Data flow/dependence analysis tools, unsafe/speculative
– Probabilistic approaches
– Profile-based approaches
• Multithreading-specific optimization toolbox
– Including alternative/speculative parallel programming models (e.g., Transactional Memory (TM))
• Auto-parallelizing compilers, with speculation
– Thread-level speculation (TLS)
– Helper threads
A holistic parallelizing tool chain.

4 Why Be Speculative?
• Performance of programs is ultimately limited by control and data flows
• Most compiler optimizations exploit knowledge of control and data flows
• Techniques based on complete/accurate knowledge of control and data flows are reaching their limit
– True for both sequential and parallel optimizations
Future compiler optimizations must rely on incomplete knowledge: speculative execution

5 Compilers for Parallel Computers (Future)
[Diagram: a sequential program passes through a Dependence/Flow Analysis Tool (possibly unsafe) into a Parallelizing Compiler, yielding P-way parallel code; code that is less than P-way parallel is handled by an Auto-TLS Compiler targeting TLS/TM execution.]

6 Outline
• Context and Motivation
• History and status quo of auto-parallelizing compilers
– Data dependence analysis for array-based programs
– Data dependence analysis for irregular programs
• Auto-parallelizing compilers for TLS
– TLS execution model (speculative parallelization)
– Static compiler cost model (PACT'04, TACO'07)

7 Data Dependence Analysis for Arrays
• Based on mathematical evaluation of array index expressions within loop nests
• Progressively more capable analyses (e.g., GCD test, Banerjee test), but still restricted to affine loop index expressions
• Coupled with a mathematical framework to represent loop transformations (e.g., loop interchange, skewing) that can help expose more parallelism
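As an illustration of this style of analysis (not code from the talk), a minimal single-subscript GCD test might look like the sketch below; the function name and interface are my own:

```python
from math import gcd

def gcd_test(a1, a2, c1, c2):
    """Simplified GCD dependence test for two affine references
    A[a1*i + c1] and A[a2*j + c2] inside a loop: an integer solution
    to a1*i - a2*j = c2 - c1 can exist only if gcd(a1, a2) divides
    c2 - c1.  Returns True if a dependence MAY exist (the test is
    conservative), False if the accesses are provably independent."""
    return (c2 - c1) % gcd(a1, a2) == 0

# A[2*i] written, A[2*i + 1] read: gcd(2, 2) = 2 does not divide 1,
# so these references never touch the same element.
print(gcd_test(2, 2, 0, 1))   # False -> provably independent

# A[2*i] written, A[4*i + 2] read: gcd(2, 4) = 2 divides 2,
# so a dependence may exist and the loop cannot be parallelized safely.
print(gcd_test(2, 4, 0, 2))   # True -> possibly dependent
```

Note the asymmetry that makes such tests only partially successful in practice: a False answer proves independence, but a True answer is merely "maybe", and any non-affine subscript defeats the test entirely.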

8 Data Dependence Analysis for Arrays
• What's wrong with traditional data dependence analysis?
– Not all index expressions are affine or even statically defined (e.g., subscripted subscripts)
– Not all loops are well structured (e.g., conditional exits, complex control flow)
– Not all procedures are analyzable (e.g., unavailable code, aliasing, global data access)
– Not all applications make intense use of arrays and loop nests (e.g., trees, hash tables, linked lists)

9 Data Dependence Analysis for Irregular Programs
• Based on ad-hoc analyses (e.g., pointer analysis, shape analysis, task graph analysis)
There is no comprehensive data dependence analysis framework for irregular applications

10 Outline
• Context and Motivation
• History and status quo of auto-parallelizing compilers
– Data dependence analysis for array-based programs
– Data dependence analysis for irregular programs
• Auto-parallelizing compilers for TLS
– TLS execution model (speculative parallelization)
– Static compiler cost model (PACT'04, TACO'07)

11 Thread-Level Speculation (TLS)
• Assume no dependences and execute threads in parallel
• While speculating, buffer speculative data separately
• Track data accesses and monitor cross-thread violations
• Squash offending threads and restart them
• All this can be done in hardware, software, or a combination

    for(i=0; i<100; i++) {
        ... = A[L[i]] + ...
        A[K[i]] = ...
    }

Iteration J:   ... = A[4] + ...;  A[5] = ...
Iteration J+1: ... = A[2] + ...;  A[2] = ...
Iteration J+2: ... = A[5] + ...;  A[6] = ...   (RAW violation: J+2 reads A[5] before J's write commits)
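The execution model above can be mimicked in a few lines of purely illustrative, sequential Python: each "thread" runs one iteration against a snapshot of committed memory with a private write buffer, threads commit in order, and any thread whose read set overlaps an earlier thread's committed writes is squashed and re-executed. (The class and function names are hypothetical; real TLS systems do this with hardware or runtime support, not a simulation loop.)

```python
class SpecThread:
    """Speculative context for one loop iteration: loads are tracked
    for violation detection, stores go to a private write buffer."""
    def __init__(self, memory):
        self.memory = memory      # committed (non-speculative) state
        self.reads = set()
        self.writes = {}

    def load(self, addr):
        self.reads.add(addr)
        # read own speculative buffer first, else committed memory
        return self.writes.get(addr, self.memory[addr])

    def store(self, addr, value):
        self.writes[addr] = value


def run_with_tls(n_iters, body, memory):
    """Execute iterations 'in parallel' in batches, committing in order;
    a thread that read a location an earlier thread wrote in the same
    batch has a cross-thread RAW violation and is squashed along with
    its successors.  Returns the number of squashes."""
    i, squashes = 0, 0
    while i < n_iters:
        # speculatively run all remaining iterations against a snapshot
        threads = [SpecThread(memory) for _ in range(i, n_iters)]
        for j, t in enumerate(threads, start=i):
            body(j, t)
        # in-order commit with violation detection
        dirty = set()   # addresses written by threads committed this batch
        for j, t in enumerate(threads, start=i):
            if t.reads & dirty:        # RAW violation: squash j onward
                squashes += 1
                break
            memory.update(t.writes)    # commit the write buffer
            dirty |= t.writes.keys()
            i = j + 1
    return squashes


# The loop from the slide: each iteration reads A[L[i]], writes A[K[i]].
A = {n: 0 for n in range(10)}
L, K = [4, 2, 5], [5, 2, 6]

def body(i, t):
    t.store(K[i], t.load(L[i]) + 1)

print(run_with_tls(3, body, A))   # 1: iteration 2 read A[5] too early
print(A[6])                       # 2: the re-executed iteration saw A[5] == 1
```

Running it reproduces the slide's scenario: the third iteration speculatively reads A[5] before the first iteration's write to A[5] commits, gets squashed, and produces the correct value on re-execution.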

12 TLS Overheads
• Squash & restart: re-executing the threads
• Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
• Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
• Load imbalance: a processor waits for its thread to become non-speculative in order to commit

13 Coping with Overheads: a Cost Model!
• Compiler cost models are key to guiding optimizations, but no such cost model exists for TLS
• Speculative parallelization can deliver significant speedup or slowdown
– Several speculation overheads
– Overheads are hard to estimate (e.g., squashes)
• A prediction of the speedup value can be useful
– e.g., in a multi-tasking environment:
  • program A wants to run speculatively in parallel on 4 cores (predicted speedup 1.8)
  • other programs are waiting to be scheduled
  • the OS decides it does not pay off

14 TLS Overheads
• Squash & restart: re-executing the threads
– Hard to model because violations are highly unpredictable
• Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
– Hard because write sets are somewhat unpredictable
• Dispatch & commit: writing back speculative data into memory and starting the next speculative thread
– Hard because write sets are somewhat unpredictable
• Load imbalance: a processor waits for its thread to become non-speculative in order to commit
– Hard because workloads are very unpredictable and order matters due to the in-order commit requirement

15 Our Compiler Cost Model: Highlights
• First fully static compiler cost model for TLS
• Handles all TLS overheads in a single framework
– Including load imbalance, which is not handled by any other cost model
• Produces not just a qualitative ("good" or "bad") assessment of the TLS benefits but a quantitative value (i.e., expected speedup/slowdown)
• Can be easily integrated into most compilers at the intermediate-representation level
• Simple and fast to compute
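To make the idea of a quantitative estimate concrete, here is a toy sketch of how overhead terms might be combined into a predicted speedup. This is not the PACT'04/TACO'07 model; the formula, parameter names, and numbers below are illustrative assumptions only:

```python
def predicted_tls_speedup(thread_times, n_cores,
                          t_dispatch, t_commit, p_squash):
    """Illustrative (not the paper's) TLS speedup estimate.  Iterations
    run in batches of n_cores speculative threads; each batch pays for:
      - load imbalance: in-order commit means the slowest thread in the
        batch dominates the batch's execution time
      - dispatch & commit: a fixed per-batch overhead
      - squash & restart: an expected fraction p_squash of the batch's
        critical path is re-executed on average
    Returns estimated sequential time / estimated parallel time."""
    seq_time = sum(thread_times)                 # one-core baseline
    batches = [thread_times[i:i + n_cores]
               for i in range(0, len(thread_times), n_cores)]
    par_time = 0.0
    for batch in batches:
        work = max(batch)                        # imbalance: slowest dominates
        work += t_dispatch + t_commit            # fixed overheads
        work += p_squash * max(batch)            # expected re-execution
        par_time += work
    return seq_time / par_time

# 8 iterations on 4 cores, mild imbalance, 10% expected squashes
est = predicted_tls_speedup([1.0, 1.2, 1.0, 1.5, 1.0, 1.1, 1.3, 1.0],
                            n_cores=4, t_dispatch=0.05, t_commit=0.05,
                            p_squash=0.1)
print(round(est, 2))
```

Even this toy version shows why a quantitative answer matters: with the hypothetical numbers above the estimate comes out well below the 4x ideal, which is exactly the kind of signal an OS scheduler (as in the multi-tasking example on the previous slide) could use to decide whether speculative execution pays off.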

16 Speedup Distribution
[Chart: distribution of measured TLS speedups/slowdowns across benchmarks; very varied speedup/slowdown behavior]

17 Model Accuracy (I): Outcomes
• Only 17% false positives (performance degradation)
• Negligible false negatives (missed opportunities)
• Most speedups/slowdowns correctly predicted by the model

18 Current Developments
• Done:
– Completed implementation of a TLS code generator in GCC
• Doing:
– Implementing the cost model in this TLS GCC
– Profiling TLS program behavior (with IBM and U. of Manchester)
• To do:
– Develop hybrid cost models based on static and profile information
– Develop "intelligent" cost models based on machine learning (with U. of Manchester)

19 Summary
• Paraphrasing M. Snir† (UIUC): "parallel programming will have to become synonymous with programming"
• To get there, however, we still need:
– Better (and unsafe) data dependence analysis tools
– Explicit (and speculative) parallel models
– Auto-parallelizing (speculative) compilers
• Much work still needs to be done
• At U. of Edinburgh:
– Auto-parallelizing TLS compilers
– TLS hardware
– STM (software TM)
† Director of Intel+Microsoft's UPCRC

20 Acknowledgments
• Research Team and Collaborators
– Jialin Dou
– Salman Khan
– Polychronis Xekalakis
– Nikolas Ioannou
– Fabricio Goes
– Constantino Ribeiro
– Dr. G. Brown, Dr. M. Lujan, Prof. I. Watson (U. of Manchester)
– Prof. Diego Llanos (U. of Valladolid)
• Funding
– UK EPSRC: GR/R65169/01, EP/G000697/1

21 Beyond Auto-Parallelization: Compilers for Many-Core Systems
Marcelo Cintra, University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc

