Slide 1: Compiling with Multicore
Jeehyung Lee, 15-745, Spring 2009

Slide 2: Papers
- Automatic Thread Extraction with Decoupled Software Pipelining: fully automatic, fine-grained pipelining.
- A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs: semi-automatic, coarse-grained pipelining.

Slide 3: First paper
Automatic Thread Extraction with Decoupled Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David August (Princeton University)

Slide 4: What is the paper about?
- Despite the increasing use of multiprocessors, many single-threaded applications do not benefit.
- Let the compiler automatically extract threads and exploit latent pipeline parallelism.
- Extract non-speculative, truly decoupled threads through Decoupled Software Pipelining (DSWP).

Slide 5: Why decoupled pipelining?
Example: linked-list traversal.

Slide 6: Why decoupled pipelining? DOACROSS
Execution time ≈ #iterations × (LD latency + communication latency): the inter-core communication latency sits on the critical path of every iteration.

Slide 7: Why decoupled pipelining? DSWP
Execution time ≈ #iterations × LD latency: communication flows one way through the pipeline, so its latency overlaps with useful work instead of adding to every iteration.

Slide 8: DSWP
- The flow of data (dependences) between cores is acyclic.
- With inter-core queues, the threads are decoupled.
- Result: efficiency plus high tolerance for communication latency (see the sketch below).
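
To make the example concrete, here is a minimal sketch, not taken from the paper, of the linked-list loop split DSWP-style into a traversal thread and a work thread that communicate through a bounded queue. The queue, the node type, and the summing loop body are all invented for illustration, and pthreads stands in for the paper's inter-core queue mechanism.

```c
/* Sketch only: the linked-list loop split into a traversal thread and a
 * work thread that communicate through a bounded queue. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct node { int data; struct node *next; } node;

#define QSIZE 64
static node *q[QSIZE];
static int qhead, qtail, qcount;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static void produce(node *p) {               /* enqueue, blocking when full  */
    pthread_mutex_lock(&qlock);
    while (qcount == QSIZE) pthread_cond_wait(&not_full, &qlock);
    q[qtail] = p; qtail = (qtail + 1) % QSIZE; qcount++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&qlock);
}

static node *consume(void) {                 /* dequeue, blocking when empty */
    pthread_mutex_lock(&qlock);
    while (qcount == 0) pthread_cond_wait(&not_empty, &qlock);
    node *p = q[qhead]; qhead = (qhead + 1) % QSIZE; qcount--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&qlock);
    return p;
}

/* Thread 1: the pointer-chasing SCC (p = p->next).  Its critical path is
 * only the LD latency; it never waits for the work done on each node.      */
static void *traverse(void *arg) {
    for (node *p = arg; p != NULL; p = p->next)
        produce(p);
    produce(NULL);                           /* sentinel: loop exit          */
    return NULL;
}

/* Thread 2: the per-node work, fed through the queue.                      */
static void *work(void *arg) {
    long sum = 0;
    (void)arg;
    for (node *p; (p = consume()) != NULL; )
        sum += p->data;                      /* stands in for the loop body  */
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void) {
    node *head = NULL;
    for (int i = 0; i < 1000; i++) {         /* build a small test list      */
        node *n = malloc(sizeof *n);
        n->data = i; n->next = head; head = n;
    }
    pthread_t t1, t2;
    pthread_create(&t1, NULL, traverse, head);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

The property from the slide is visible here: data flows one way, from the pointer-chasing stage to the work stage, so the traversal only ever stalls when the queue fills.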

Slide 9: DSWP algorithm
1. Build the dependence graph.
2. Find strongly connected components (SCCs).
3. Create the DAG of SCCs.
4. Partition the DAG.
5. Split the code according to the partitions.
6. Add flows (communication) between partitions.

Slide 10: Build the dependence graph
Include every traditional dependence (data, control, and memory), plus the paper's extensions.

Slide 11: Find SCCs
- An SCC is a set of instructions that form a dependence cycle in the loop.
- Instructions inside one SCC cannot be split across pipeline stages, so each SCC stays in a single thread.
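
As a toy illustration of this step: the dependence graph below is hand-built to mirror the linked-list loop (it is not taken from the paper), and mutual reachability is used to group instructions into SCCs. A real compiler would use a linear-time algorithm such as Tarjan's instead of a transitive closure.

```c
#include <stdio.h>

/* Sketch: group the instructions of the linked-list loop into SCCs using a
 * simple transitive-closure test on a tiny hand-built dependence graph.
 *   node 0: p = p->next     node 1: t = p->data     node 2: sum += t
 * Edges are dependences, including the loop-carried ones (0->0, 2->2).      */
enum { N = 3 };
static const char *insn[N] = { "p = p->next", "t = p->data", "sum += t" };

int main(void) {
    int reach[N][N] = {
        {1, 1, 0},          /* 0 -> 0 (loop-carried p), 0 -> 1               */
        {0, 0, 1},          /* 1 -> 2                                        */
        {0, 0, 1},          /* 2 -> 2 (loop-carried sum)                     */
    };
    /* Transitive closure: reach[i][j] = there is a dependence path i..j.    */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (reach[i][k] && reach[k][j]) reach[i][j] = 1;

    /* Two instructions share an SCC iff each can reach the other.           */
    int scc[N];
    for (int i = 0; i < N; i++) {
        scc[i] = i;
        for (int j = 0; j < i; j++)
            if (reach[i][j] && reach[j][i]) { scc[i] = scc[j]; break; }
    }
    for (int i = 0; i < N; i++)
        printf("SCC %d: %s\n", scc[i], insn[i]);
    return 0;
}
```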

Slide 12: Create the DAG of SCCs
Collapse each SCC into a single node and update the dependence edges; the result is acyclic.

Slide 13: Partition the DAG
Partition the DAG nodes into n partitions (n <= number of processors), using a heuristic that maximizes load balance (a sketch follows this slide):
- Pick a number of partitions (threads).
- Fill partition 1 with nodes from the top of the DAG; when a partition is full (estimated by cycle count), move on to the next one.
- Repeat for each candidate thread count and keep the best-balanced partitioning.
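
A minimal sketch of the greedy filling described above, under the assumption that each SCC node carries an estimated cycle count and the nodes arrive in topological order; the paper's heuristic also weighs communication costs, which this toy version ignores.

```c
#include <stdio.h>

/* Sketch: greedily assign DAG-of-SCCs nodes, already in topological order,
 * to a fixed number of partitions (threads).  A partition is "full" once it
 * holds roughly total/nparts estimated cycles, then filling moves on.       */
static void partition(const int cycles[], int n, int nparts, int assign[]) {
    int total = 0;
    for (int i = 0; i < n; i++) total += cycles[i];
    int target = (total + nparts - 1) / nparts;   /* cycles per partition    */

    int part = 0, filled = 0;
    for (int i = 0; i < n; i++) {                 /* nodes in topological order */
        assign[i] = part;
        filled += cycles[i];
        if (filled >= target && part < nparts - 1) { part++; filled = 0; }
    }
}

int main(void) {
    /* Estimated cycles for 5 SCC nodes in topological order. */
    int cycles[] = { 30, 10, 25, 20, 15 };
    int assign[5];
    partition(cycles, 5, 2, assign);
    for (int i = 0; i < 5; i++)
        printf("SCC node %d -> thread %d\n", i, assign[i]);
    return 0;
}
```

In the full algorithm this filling is repeated for each candidate thread count and the best-balanced result is kept.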

Slide 14: Split the code and insert flows (done!)
- For each partition, emit the basic blocks relevant to its assigned SCC nodes.
- Insert produce/consume code for dependences that cross partitions.

Slide 15: Result
- 19.4% speedup on important benchmark loops, 9.2% overall.
- When core bandwidth is halved: single-threaded code slows down by 17.1%, while DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core.
- A promising enabler for thread-level parallelism (TLP)?

Slide 16: Second paper
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs
William Thies, Vikram Chandrasekhar, and Saman Amarasinghe (MIT)

Slides 17-18: What is the paper about?
- Despite the increasing use of multiprocessors, many single-threaded applications do not benefit (same motivation as before).
- Coarse-grained pipelining is more desirable, but it is especially hard to extract automatically from obfuscated C code.
- Let the programmer define the pipeline stages, and learn the practical dependences at runtime, for streaming applications.

Slide 19: Interface
Add annotations in the body of the top-level loop (an illustrative sketch follows).
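
The slide does not reproduce the annotation syntax, so the sketch below uses placeholder macro names (PIPELINE_LOOP_BEGIN, PIPELINE_STAGE, PIPELINE_LOOP_END) that are not necessarily the paper's. It only shows the shape of the interface: stage boundaries marked inside the body of the top-level loop. Here the macros expand to nothing so the example still compiles; the real tool expands its macros into the forked pipeline.

```c
#include <stdio.h>

/* Illustration only: placeholder annotation macros marking pipeline-stage
 * boundaries inside the top-level loop of a streaming program.             */
#define PIPELINE_LOOP_BEGIN()  /* per-iteration bookkeeping would go here    */
#define PIPELINE_STAGE()       /* a stage boundary would go here             */
#define PIPELINE_LOOP_END()    /* stage hand-off / cleanup would go here     */

static int  read_frame(int i)       { return i * 2; }       /* stub input    */
static int  transform_frame(int x)  { return x + 1; }       /* stub compute  */
static void write_frame(int y)      { printf("%d\n", y); }  /* stub output   */

int main(void) {
    for (int i = 0; i < 4; i++) {
        PIPELINE_LOOP_BEGIN();
        int x = read_frame(i);       /* stage 1: input                       */
        PIPELINE_STAGE();
        int y = transform_frame(x);  /* stage 2: compute                     */
        PIPELINE_STAGE();
        write_frame(y);              /* stage 3: output                      */
        PIPELINE_LOOP_END();
    }
    return 0;
}
```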

Slide 20: Dynamic analysis
The system builds a stream graph from the annotations. How does it find the dependences?

Slide 21: Dynamic analysis
Streaming applications tend to have a fixed pattern of dataflow (a stable flow) among pipeline stages.

Slide 22: Dynamic analysis
Run the application on training inputs and record every relevant store-load pair that crosses a pipeline boundary. This gives the practical dependences (see the sketch below).
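
One way to picture this profiling, as a hedged sketch rather than the paper's tool: shadow every stored address with the stage that last wrote it, and on each load record a (writer stage, reader stage) pair. The hook functions and the tiny direct-mapped shadow table below are invented for illustration; a real tool would instrument loads and stores automatically.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the profiling idea: shadow each address with the pipeline stage
 * that last stored to it; every load then reports a (writer -> reader)
 * stage pair.  Hooks are called by hand here, purely for illustration.      */
enum { SHADOW_SLOTS = 1 << 16, NSTAGES = 4 };

static int shadow_stage[SHADOW_SLOTS];        /* last writer stage, -1 = none */
static uintptr_t shadow_addr[SHADOW_SLOTS];
static int dep[NSTAGES][NSTAGES];             /* observed stage-to-stage deps */

static unsigned slot(const void *p) {
    return (unsigned)(((uintptr_t)p >> 2) & (SHADOW_SLOTS - 1));
}

static void on_store(const void *p, int stage) {
    unsigned s = slot(p);
    shadow_addr[s] = (uintptr_t)p;
    shadow_stage[s] = stage;
}

static void on_load(const void *p, int stage) {
    unsigned s = slot(p);
    if (shadow_addr[s] == (uintptr_t)p && shadow_stage[s] >= 0
        && shadow_stage[s] != stage)
        dep[shadow_stage[s]][stage]++;        /* store in one stage, load in another */
}

int main(void) {
    for (unsigned i = 0; i < SHADOW_SLOTS; i++) shadow_stage[i] = -1;

    int buf[8];
    for (int i = 0; i < 8; i++) {             /* stage 0 writes, stage 1 reads */
        buf[i] = i;        on_store(&buf[i], 0);
        int x = buf[i];    on_load(&buf[i], 1);
        (void)x;
    }
    for (int a = 0; a < NSTAGES; a++)
        for (int b = 0; b < NSTAGES; b++)
            if (dep[a][b])
                printf("stage %d -> stage %d: %d store-load pairs\n", a, b, dep[a][b]);
    return 0;
}
```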

Slide 23: Interface
- The tool shows the complete stream graph.
- The user decides whether this pipelining is acceptable: if yes, done; if not, revise the annotations.
- Iterate until satisfied.

Slide 24: Actual pipelining
When compiled, the annotation macros emit code that forks the original program into one process per pipeline stage (a minimal sketch follows).
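
A minimal two-stage sketch of that execution model, assuming fork() plus a pipe as the stage-to-stage channel; the macro-generated code in the paper handles arbitrary stages and the learned dependence sets, which this hand-written example does not.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch of the execution model: the annotated loop runs as one forked
 * process per pipeline stage, with stage-to-stage data carried over a pipe. */
int main(void) {
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t child = fork();
    if (child < 0) { perror("fork"); return 1; }

    if (child == 0) {                    /* stage 2: consume and print       */
        close(fd[1]);
        int v;
        while (read(fd[0], &v, sizeof v) == (ssize_t)sizeof v)
            printf("stage 2 got %d\n", v * v);
        close(fd[0]);
        _exit(0);
    }

    close(fd[0]);                        /* stage 1: produce the values      */
    for (int i = 0; i < 5; i++)
        if (write(fd[1], &i, sizeof i) != (ssize_t)sizeof i) break;
    close(fd[1]);                        /* EOF tells stage 2 to finish      */
    waitpid(child, NULL, 0);
    return 0;
}
```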

Slide 25: Result
- Average 2.78x speedup, maximum 3.89x, on a 4-core machine.
- Seems unsound (dependences come only from training runs) but practical (?).

