A Process Splitting Transformation for Kahn Process Networks
Sjoerd Meijer

Contents
Background
Problem Definition and Project Goal
Splitting
– Producer Selection
– Inter-process Communication
– Consumer Selection
Implementation
Conclusion and Further Work

Background
Parallelization is not new: a sequential application is forked into parallel work.
Classic example, matrix-matrix multiplication on a shared-memory machine (several CPUs, each with a private cache, connected via a memory bus to main memory):
– The master processor executes code up to the parallel loop
– The parallel iterations execute on the other processors
– All processors synchronize at the end of the parallel loop

    do i = 1, n
      do j = 1, n
        a(i,j) = ...
      end do
    end do
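This fork/join pattern is, for example, what OpenMP's parallel-for directive provides. A minimal sketch (my illustration, not from the slides; the array size and contents are arbitrary):

    #include <omp.h>

    #define N 512
    static double a[N][N];

    /* The master thread runs sequentially until the loop; the
     * iterations of the parallel loop are distributed over the worker
     * threads, and an implicit barrier synchronizes all of them at the
     * end of the loop. */
    void fill(void)
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = (double)i * j;   /* stands in for a(i,j) = ... */
    }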

Background
Applications are specified as parallel tasks. Example: JPEG decoder. [Task-graph figure omitted.]

Problem Definition
Cakesim (eCos + CCP) profile for the JPEG KPN. [Profiling chart omitted.]

Problem Definition
Goal: an automatic procedure for process splitting in KPNs, to take advantage of multiprocessor architectures.
[Figure: the original process network and the split-up network.]

Splitting – The Concept
Required:
– Determine the computationally expensive process: profiling or pragmas + static support
– Partition the Iteration Space (IS), where
  N = the number of times a process has to be split, and
  L = the loop-nest level at which the splitting takes place
To do:
– Duplicate code and FIFOs
– Add control for token production and consumption
A sketch of such a partitioning follows below.
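A hypothetical illustration of the modulo partitioning (ROWS, COLS and compute() are my placeholders, not names from the thesis): a process body with a 2-deep loop nest is split n_split ways at loop-nest level L = 1 (the outer loop), and each split-up process executes its share of the iteration space.

    #define ROWS 64
    #define COLS 64

    void compute(int i, int j);   /* stands in for the original loop body */

    /* Split-up process 'id' (0 <= id < n_split) gets every n_split-th
     * outer iteration: a modulo partition at loop level L = 1. */
    void split_process(int id, int n_split)
    {
        for (int i = 0; i < ROWS; i++) {
            if (i % n_split != id)
                continue;
            for (int j = 0; j < COLS; j++)
                compute(i, j);
        }
    }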

Techniques used:
Data dependence analysis:
– Data flow analysis
– Array data flow analysis
Tree transformations:
– Adding, removing, and duplicating tree statements
Compiler framework:
– GCC

Solution for KPNs
Four-step approach:
Computation:
1. Partitioning (computation)
Communication:
2. Interprocess communication
3. Token production
4. Token consumption
[Figure: the pipeline P1 → P2 → P3 becomes P1 → {P21, P22} → P3.]

Partitioning
The computation of the original process is partitioned over the resulting split-up processes. [Figure omitted.]

Interprocess Communication

    for (int i = 1; i < 10; i++)
        a[i] = a[i-1] + i;   // s1

Interprocess communication is given by the loop-carried dependence: a[i-1] at iteration i is produced at iteration i-1. If execution of statement s1 is distributed over different processes, the token a[i-1] needs to be communicated:

    // split-up process for the even iterations
    for (int i = 1; i < 10; i++) {
        if (i % 2 == 0)
            a[i] = a[i-1] + i;
    }

    // split-up process for the odd iterations
    for (int i = 1; i < 10; i++) {
        if (i % 2 == 1)
            a[i] = a[i-1] + i;
    }
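To make the token traffic concrete, here is a minimal sketch of the two split-up processes exchanging a[i-1] over FIFOs. The blocking fifo_read/fifo_write API, the channel names, and the initial value a[0] = 0 are my assumptions, not the thesis' actual interface:

    enum { CH_A2B, CH_B2A };
    void fifo_write(int ch, int v);   /* hypothetical blocking FIFO API */
    int  fifo_read(int ch);

    /* Split-up process for the even iterations: a[i-1] arrives as a
     * token from the odd process; the new a[i] is sent back as a token. */
    void p_even(void)
    {
        for (int i = 2; i < 10; i += 2) {
            int prev = fifo_read(CH_B2A);        /* token a[i-1]            */
            int cur  = prev + i;                 /* a[i] = a[i-1] + i       */
            if (i + 1 < 10)
                fifo_write(CH_A2B, cur);         /* a[i] needed at iter i+1 */
        }
    }

    /* Split-up process for the odd iterations; it owns the initial a[0]. */
    void p_odd(void)
    {
        int prev = 0;                            /* assumed a[0]            */
        for (int i = 1; i < 10; i += 2) {
            if (i > 1)
                prev = fifo_read(CH_A2B);        /* token a[i-1]            */
            int cur = prev + i;                  /* a[i] = a[i-1] + i       */
            if (i + 1 < 10)
                fifo_write(CH_B2A, cur);         /* a[i] needed at iter i+1 */
        }
    }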

Token Production & Consumption
Problems:
– PI. At the producer side: where to send the tokens to?
– PII. At the consumer side: from where to consume tokens?
Solutions for PI:
1. The producer filters the tokens (static solution)
2. The producer sends all tokens to all split-up processes (runtime solution)
Solutions for PII:
1. The consumer knows by itself when to switch (static solution)
2. Each producer sends a signal to the consumer when to switch reading data from a different FIFO (runtime solution)

Token Production – runtime vs. static
Originally, P1 sends 100 tokens to P2. After splitting P2 into P2' and P2'':
– Static solution: P1 filters, sending 50 tokens to P2' and 50 tokens to P2''.
– Runtime solution: P1 sends all 100 tokens to both P2' and P2''.
A sketch of both variants follows below.
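A sketch contrasting the two producer-side strategies; the fifo_write API, the channel names, the modulo-2 partition, and next_token() are my placeholders for the KPN runtime, not the thesis' code:

    enum { CH_TO_P2A, CH_TO_P2B };
    void fifo_write(int ch, int v);   /* hypothetical blocking FIFO API */
    int  next_token(int i);           /* placeholder for P1's real work */

    /* Static solution: P1 knows the modulo-2 partitioning and filters,
     * sending 50 tokens to each split-up process. */
    void produce_static(void)
    {
        for (int i = 0; i < 100; i++)
            fifo_write(i % 2 == 0 ? CH_TO_P2A : CH_TO_P2B, next_token(i));
    }

    /* Runtime solution: P1 stays oblivious to the partitioning and
     * sends all 100 tokens to both split-up processes; each consumer
     * picks out the tokens it needs. */
    void produce_runtime(void)
    {
        for (int i = 0; i < 100; i++) {
            int t = next_token(i);
            fifo_write(CH_TO_P2A, t);
            fifo_write(CH_TO_P2B, t);
        }
    }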

Token Consumption – runtime vs. static
Originally, P3 consumes 100 tokens from P2. After splitting P2 into P2' and P2'':
– Static solution: P3 consumes 50 tokens from each FIFO; the switch between FIFOs is known internally by the consumer.
– Runtime solution: P2' and P2'' each send 50 tagged tokens; the switch is communicated over the channels to the consumer.
A sketch of the tagged-token variant follows below.
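One possible tagged-token encoding for the runtime solution; the slides do not show the actual signaling, so the token layout and API below are my assumptions:

    typedef struct { int is_switch; int value; } token_t;

    enum { CH_FROM_P2A, CH_FROM_P2B };
    token_t fifo_read_token(int ch);   /* hypothetical blocking FIFO API */
    void    process(int v);            /* placeholder for P3's real work */

    /* Runtime solution: P3 starts on one FIFO and hops to the other
     * whenever a producer sends a switch token. */
    void consume_runtime(void)
    {
        int ch = CH_FROM_P2A;
        for (int consumed = 0; consumed < 100; ) {
            token_t t = fifo_read_token(ch);
            if (t.is_switch) {
                ch = (ch == CH_FROM_P2A) ? CH_FROM_P2B : CH_FROM_P2A;
            } else {
                process(t.value);
                consumed++;
            }
        }
    }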

Token Production & Consumption – static solution
Establish the data dependences across the processes. How? Via the Data Dependence function (DD) and its inverse DD⁻¹:
– DD⁻¹: Producer → Consumer
– DD: Consumer → Producer
However, DD cannot always be determined at compile time.

Token Production – static solution without DD⁻¹
Observation: the loop counters on the producer side equal the loop counters on the consumer side.

Token Production – static solution without DD⁻¹
Let DD⁻¹(w1, w2, w3) = (w4, w5, w6), and let P2 be the projection onto the second loop counter, so P2(DD⁻¹(w1, w2, w3)) = w5.
Since w5 = w2 (the observation above), it follows that
P2(DD⁻¹(w1, w2, w3)) % 2 = w2 % 2,
i.e. the producer can evaluate the modulo condition on its own counter w2 without computing DD⁻¹.
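A one-dimensional illustration (my example, not from the slides): for the loop a[i] = a[i-1] + i, we have DD(i) = i - 1 and DD⁻¹(i) = i + 1. With a modulo-2 partition, the token produced at iteration w is consumed at iteration DD⁻¹(w) = w + 1, i.e. by split-up process (w + 1) % 2. Because producer and consumer share the loop counter i, this target is computable directly from the producer's own counter.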

Token Consumption – static solution without DD
Similar to the production of tokens.

Runtime solution. [Figure omitted.]

Multiple split-up processes
Splitting into 3 processes: in the pipeline P1 → P2 → P3 → P4, the processes P2 and P3 are each replaced by three copies (P2', P2'', P2''' and P3', P3'', P3'''). [Figure omitted.]

Copy-nodes
[Figure: starting from P1 → P2 → P3 → P4, copy-nodes are first inserted between the processes; the splitting transformation then duplicates P2 and P3 into P2', P2'' and P3', P3''.]

Copy-nodes
Pros:
– Simple network structure
– The four-step splitting approach applies unchanged
Cons:
– More processes => more communication (can be improved) => overhead
A sketch of a copy-node follows below.
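The slides do not spell out the copy-node's behavior, but a natural reading is a process that duplicates each input token onto all of its output FIFOs. A minimal sketch, with the same hypothetical FIFO API as before:

    void fifo_write(int ch, int v);   /* hypothetical blocking FIFO API */
    int  fifo_read(int ch);

    /* A copy-node forwards every input token to both output FIFOs,
     * so the split-up processes downstream each see the full stream. */
    void copy_node(int in_ch, int out_a, int out_b)
    {
        for (;;) {
            int t = fifo_read(in_ch);
            fifo_write(out_a, t);
            fifo_write(out_b, t);
        }
    }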

Implementation
Used technique:
– Runtime solution (general)
Used framework:
– GCC (GNU Compiler Collection)
Advantages of GCC:
– Data dependence information is available
– Supported by a large community
– We are in contact with Sebastian Pop, maintainer and developer of various compiler phases, e.g. data dependence analysis, control flow, and induction variable analysis.

Implementation
Data dependence analysis (already present):
– scalars
– arrays
The Data Dependence Graph (DDG) is present only at the RTL level, not on Tree SSA.
Two new passes:
1. Create the DDG
2. Splitting

Implementation

    function foo() {
        ...
        // stmt1
        ...
        for (...) {
            // stmt2
        }
        ...
        // stmt3
        ...
    }

Is there a data dependence? Check the DDG. If there is no loop-carried data dependence, modify the Tree/CFG: duplicate the basic blocks and create the if-condition.

Implementation
1. Splitting pragma
2. Data dependence graph
3. Class definition reconstruction
4. Function cloning
5. Modulo condition insertion

Implementation
To do:
1. Copying of the class definition
2. Copying of the class member functions
3. Reconstruction of the network structure:
   – FIFOs
   – Network definition

Implementation
Final result:
– Data dependence information tells whether splitting is legal (no interprocess communication required)
– Semi-automatic transformation / case study

Results
Improvement of 21%. [Chart comparing the original KPN, the KPN with copy-nodes, and the KPN with processes split up into two.]

Future work: YAPI and CCP
Difference between active and passive connectors:
– Active connectors in YAPI are modeled as a thread
– Passive connectors do not run in a separate thread
More connectors in CCP: mesh, merge, and fork. [Figure omitted.]

Future Work
Connect GCC with SCOTTY. Which GCC branch?
– Main branch: may not accept the patch
– GOMP branch: targets parallelization + data dependence + network topology

Conclusion
– Only split up the most computationally expensive processes
– The transformation is profitable

Building threads within sequential applications
Another transformation: the creation of threads.
Why another transformation? It is widely applicable, also outside the context of KPNs.

Technique
The sequential program:

    #include <math.h>

    #define N 200
    float a[N], b[N];

    int main()
    {
        int i;
        for (i = 0; i < N; i++)
            a[i] = i;
        for (i = 0; i < N; i++)
            b[i] = sqrt(a[i]) + sqrt(a[i]*2) / (N*i) + log(i+1);
        return 0;
    }

is transformed so that the second loop runs in two threads, each guarded by a modulo condition:

    #include <math.h>
    #include <pthread.h>

    #define N 200
    float a[N], b[N];
    pthread_t thread1, thread2;

    /* Thread 1: the even iterations of the original loop. */
    void *pfunc1(void *arg)
    {
        for (int i = 0; i < N; i++) {
            if (i % 2 == 0)
                b[i] = sqrt(a[i]) + sqrt(a[i]*2) / (N*i) + log(i+1);
        }
        pthread_exit(0);
    }

    /* Thread 2: the odd iterations, analogous to pfunc1. */
    void *pfunc2(void *arg)
    {
        for (int i = 0; i < N; i++) {
            if (i % 2 == 1)
                b[i] = sqrt(a[i]) + sqrt(a[i]*2) / (N*i) + log(i+1);
        }
        pthread_exit(0);
    }

    int main()
    {
        for (int i = 0; i < N; i++)
            a[i] = i;
        pthread_create(&thread1, NULL, pfunc1, NULL);
        pthread_create(&thread2, NULL, pfunc2, NULL);
        pthread_join(thread1, NULL);   /* wait for both threads */
        pthread_join(thread2, NULL);   /* before main returns   */
        return 0;
    }

Technique: process splitting vs. thread creation
– Process splitting: extra input and output FIFOs
– Thread creation:
  – Threads compete for input tokens
  – Threads have unknown running times; as a result, the output order of tokens is not respected

Example – putting it all together. [Figure omitted.]

Data Dependencies
– True/flow dependence, S1 δf S2:
  S1: X = ...
  S2: ... = X
– Output dependence, S1 δo S2:
  S1: X = ...
  S2: X = ...
– Anti-dependence, S1 δa S2:
  S1: ... = X
  S2: X = ...
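A concrete loop (my illustration, not from the slides) exhibiting all three kinds of dependence on the scalar x:

    void deps(int n, const int *a, int *b, const int *c)
    {
        int x;
        for (int i = 0; i < n; i++) {
            x = a[i] + 1;   /* S1: writes x                              */
            b[i] = x;       /* S2: reads x  -> flow dependence S1 δf S2  */
            x = c[i];       /* S3: rewrites x -> output dep. S1 δo S3;   */
                            /*     x was read by S2 -> anti dep. S2 δa S3 */
            (void)x;        /* silence unused-value warnings             */
        }
    }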