Parallel Programming Models, Languages and Compilers


1 Parallel Programming Models, Languages and Compilers
Module 6

2 Points to be covered
Parallel Programming Models: Shared-Variable Model, Message-Passing Model, Data-Parallel Model, Object-Oriented Model, Functional and Logic Models
Parallel Languages and Role of Compilers: Language Features for Parallelism, Parallel Language Constructs, Optimizing Compilers for Parallelism
Code Optimization and Scheduling: Scalar Optimization within Basic Blocks, Local and Global Optimizations, Vectorization and Parallelization Methods, Code Generation and Scheduling, Trace Scheduling Compilation

3 Parallel Programming Model
A programming model is a simplified and transparent view of the computer hardware/software system. Parallel programming models are specifically designed for multiprocessors, multicomputers, or vector/SIMD computers.

4 Classification
We have five programming models:
Shared-Variable Model
Message-Passing Model
Data-Parallel Model
Object-Oriented Model
Functional and Logic Models

5 Shared Variable Model
In all programming systems, processors are active resources, while memory and I/O devices are passive resources. A program is a collection of processes. Parallelism depends on how interprocess communication (IPC) is implemented. The process address space is shared. To ensure orderly IPC, a mutual exclusion property requires that a shared object be accessed by only one process at a time.

6 Shared Variable Model (Subpoints)
Shared variable communication
Critical section
Protected access
Partitioning and replication
Scheduling and synchronization
Cache coherence problem

7 Shared Variable Communication
Used in multiprocessor programming. Shared-variable IPC demands the use of shared memory and mutual exclusion among multiple processes accessing the same set of variables.
[Figure: processes A, B and C communicating through a shared variable in common memory]

8 Critical Section
A critical section (CS) is a code segment accessing shared variables, which must be executed by only one process at a time and which, once started, must be completed without interruption.
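As an illustration only (not from the original slides), the C sketch below protects a shared variable with a POSIX mutex; the threads play the role of lightweight processes sharing one address space, and the names shared_sum, NTHREADS and worker are invented for this example.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long shared_sum = 0;                        /* shared variable in common memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section: at most one thread inside */
        shared_sum += 1;              /* access to the shared variable */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);      /* always NTHREADS * 100000 */
    return 0;
}

Compile with, for example, cc -pthread. Without the mutex the increments could interleave and the final value would be unpredictable, which is exactly what the mutual exclusion requirement prevents.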

9 Critical Section Requirements
It should satisfy the following requirements:
Mutual exclusion: at most one process executes in the CS at a time.
No deadlock in waiting: no circular wait by two or more processes.
No preemption: no interruption until completion.
Eventual entry: a process attempting to enter its CS will eventually succeed.

10 Protected Access
The granularity of the CS affects performance. If the CS is too large, it may limit parallelism due to excessive waiting by processes. If the CS is too small, it may add unnecessary code complexity and software overhead.

11 Four Operational Modes
Multiprogramming
Multiprocessing
Multitasking
Multithreading

12 Multiprogramming
Multiple independent programs run on a single processor or multiprocessor by time-sharing use of the system resources. When a program enters I/O mode, the processor switches to another program.

13 Multiprocessing
When multiprogramming is implemented at the process level on a multiprocessor, it is called multiprocessing. There are two types of multiprocessing: if interprocessor communications are handled at the instruction level, the multiprocessor operates in MIMD mode; if interprocessor communications are handled at the program, subroutine, or procedure level, the multiprocessor operates in MPMD mode.

14 Multitasking
A single program can be partitioned into multiple interrelated tasks that are executed concurrently on a multiprocessor. Thus multitasking provides parallel execution of two or more parts of a single program.
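For illustration only (the slide does not prescribe any particular tool), OpenMP sections are one way to express this partitioning in C; task_a and task_b are hypothetical parts of one program that may run on different processors.

#include <omp.h>
#include <stdio.h>

/* Two interrelated parts of a single program, invented for this sketch. */
static void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
static void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    #pragma omp parallel sections   /* the two sections may execute concurrently */
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}

Build with an OpenMP-capable compiler, for example cc -fopenmp.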

15 Multithreading
The traditional UNIX/OS has a single-threaded kernel, in which only one process can receive OS kernel service at a time. In a multiprocessor we extend the single kernel to be multithreaded. The purpose is to allow multiple threads of lightweight processes to share the same address space.

16 Partitioning and Replication
The goal of parallel processing is to exploit parallelism as much as possible with the lowest overhead. Program partitioning is a technique for decomposing a large program and data set into many small pieces for parallel execution by multiple processors. Program partitioning involves both programmers and compilers.

17 Partitioning and Replication
Program replication refers to duplication of the same program code for parallel execution on multiple processors over different data sets.

18 Scheduling and Synchronization
Scheduling is further classified as:
Static scheduling: conducted at compile time (or post-compile time). Its advantage is low overhead, but its shortcoming is a possible mismatch with the run-time profile of each task.
Dynamic scheduling: catches the run-time conditions; it requires fast context switching, preemption and much more OS support. Its advantage is better resource utilization, at the expense of higher scheduling overhead.
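A present-day illustration of the same trade-off, assuming an OpenMP compiler (the slide itself is tool-neutral): the schedule clause chooses between a fixed work assignment and run-time chunk distribution.

#include <stddef.h>

void scale(double *a, size_t n, double s)
{
    /* Static scheduling: the iteration-to-thread mapping is fixed in advance;
       low overhead, but a possible mismatch with the run-time cost of each iteration. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] *= s;

    /* Dynamic scheduling: chunks of 64 iterations are handed out at run time;
       better load balance at the expense of higher scheduling overhead. */
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < (long)n; i++)
        a[i] *= s;
}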

19 Cache Coherence & Protection
The multicache coherence problem demands an invalidation or update after each write operation.

20 Message Passing Model
Two processes D and E residing at different processor nodes may communicate with each other by passing messages through a direct network. The messages may be instructions, data, synchronization or interrupt signals. Multicomputers are considered loosely coupled multiprocessors.

21 IPC using Message Passing
[Figure: processes D and E exchanging messages via send/receive]

22 Synchronous Message Passing
No shared memory. No mutual exclusion needed. Sender and receiver processes synchronize, just like a telephone call. No buffer is used. If one process is ready to communicate and the other is not, the one that is ready must be blocked.
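A minimal MPI sketch of this rendezvous style (assuming an MPI installation; ranks and tag are illustrative): MPI_Ssend is a synchronous send that does not complete until the receiver has started the matching receive, so neither side proceeds until both are ready.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Blocks until process 1 has begun the matching receive. */
        MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Run with at least two processes, for example mpicc then mpirun -np 2.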

23 Asynchronous Message Passing
Does not require that message sending and receiving be synchronized in time or space. Arbitrary communication delay may be experienced, because the sender may not know if and when the message has been received until an acknowledgement is returned by the receiver. This scheme is like a postal service using mailboxes, with no synchronization between senders and receivers.
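A minimal MPI sketch of the mailbox style, under the same assumptions as above: MPI_Isend and MPI_Irecv post the communication and return immediately, and completion is only checked later with MPI_Wait, so sender and receiver are not synchronized in time.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, out = 7, in = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... the sender can do other work here; delivery time is unknown ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* completion plays the role of the acknowledgement */
    } else if (rank == 1) {
        MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... the receiver also continues independently ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("got %d\n", in);
    }
    MPI_Finalize();
    return 0;
}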

24 Data Parallel Model
Used in SIMD computers. Parallelism is handled by hardware synchronization and flow control. Fortran 90 is a data-parallel language. Requires predistributed data sets.

25 Data Parallelism
This technique is used in array processors (SIMD). A key issue is to match the problem size with the machine size.

26 Array Language Extensions
Various data-parallel languages are used, represented by high-level data types: CFD for the Illiac IV, DAP Fortran for the Distributed Array Processor, and C* for the Connection Machine. The target is to match the number of PEs with the problem size.

27 Object Oriented Model
Objects are dynamically created and manipulated. Processing is performed by sending and receiving messages among objects.

28 Concurrent OOP
OOP is needed because of its abstraction and reusability concepts. Objects are program entities that encapsulate data and operations in a single unit. Concurrent OOP manipulates objects concurrently.

29 Actor Model This is a framework for Concurrent OOP.
Actors are independent components that communicate via asynchronous message passing. Three primitives: create, send-to and become.

30 Parallelism in COOP
Three common patterns of parallelism:
1) Pipeline concurrency
2) Divide and conquer
3) Cooperative problem solving

31 Functional and Logic Models
Functional programming languages: Lisp, SISAL and Strand 88.
Logic programming languages: Concurrent Prolog and Parlog.

32 Functional Programming Model
It should not produce any side effects. There is no concept of storage, assignment or branching. Single-assignment and dataflow languages are functional in nature.

33 Logic Programming Models
Used for knowledge processing from large databases. Supports an implicit search strategy. AND-parallel execution and OR-parallel reduction techniques are used. Used in artificial intelligence.

34 Parallel Language and Compilers
A programming environment is a collection of software tools and system support. A parallel software programming environment is needed. Users are still forced to focus on hardware details rather than exploiting parallelism through high-level abstractions.

35 Language Features For Parallelism
Optimization features
Availability features
Synchronization/communication features
Control of parallelism
Data parallelism features
Process management features

36 Optimization Features
Theme: conversion of a sequential program into a parallel program. The purpose is to match software parallelism with hardware parallelism.

37 Software in Practice:
1) Automated parallelizers, e.g. the Express C automated parallelizer and the Alliant FX Fortran compiler.
2) Semiautomated parallelizers, which need compiler directives or programmer interaction.

38 Availability Features
Theme: enhance user-friendliness, make the language portable across a large number of parallel computers, and expand the applicability of software libraries.

39 1) Scalability: the language should be scalable to the number of processors and independent of the hardware topology.
2) Compatibility: compatible with an existing sequential language.
3) Portability: the language should be portable to shared-memory multiprocessors, message-passing multicomputers, or both.

40 Synchronization/Communication Features
Shared variables (locks) for IPC
Remote procedure calls
Dataflow languages
Mailboxes, semaphores, monitors

41 Control Of Parallelism
Coarse, medium and fine grain
Explicit vs. implicit parallelism
Global parallelism
Loop parallelism
Task parallelism
Divide-and-conquer parallelism

42 Data Parallelism Features
Theme: how data are accessed and distributed in either SIMD or MIMD computers.
1) Run-time automatic decomposition: data are automatically distributed with no user intervention.
2) Mapping specification: the user specifies the patterns by which input data are mapped to the hardware.

43 Virtual Processor Support
Virtual processors are created statically by the compiler and mapped to the physical processors.
Direct access to shared data: shared data are accessed directly, with operating system support.

44 Process Management Features
Theme: support efficient creation of parallel processes, multithreading/multitasking, program partitioning and replication, and dynamic load balancing at run time.

45 1) Dynamic process creation at run time
2) Creation of lightweight processes
3) Replication techniques
4) Partitioned networks
5) Automatic load balancing

46 Optimizing Compilers for Parallelism
The role of the compiler is to remove the burden of program optimization and code generation from the programmer. Three phases:
1) Flow analysis
2) Optimization
3) Code generation

47 Flow Analysis
Flow analysis reveals the program flow patterns in order to determine data and control dependences. It is carried out at various execution levels:
1) Instruction level: VLIW or superscalar processors.
2) Loop level: SIMD and systolic computers.
3) Task level: multiprocessors/multicomputers.

48 Program Optimization
Transformation of the user program to explore the hardware capabilities, aiming at better performance. The goals are to maximize the speed of code execution and to minimize code length. It includes local and global optimizations and machine-dependent transformations.

49 Parallel Code Generation
Compiler directives can be used to generate parallel code. Two families of optimizing compilers:
1) Parafrase and Parafrase 2
2) PFC and ParaScope

50 Parafrase and Parafrase 2
Parafrase transforms sequential Fortran 77 programs into parallel programs. Parafrase consists of about 100 program passes; a pass list identifies the dependences and converts the program into concurrent form. Parafrase 2 handles C and Pascal in addition to Fortran.

51 PFC and ParaScope
PFC translates Fortran 77 code into Fortran 90 code. The PFC package was extended to PFC+ for parallel code generation on shared-memory multiprocessors. PFC performs its analysis in the following steps:

52 PFC performs its analysis in the following steps:
1) Interprocedural flow analysis
2) Transformation
3) Dependence analysis
4) Vector code generation

53 Code Optimization and Scheduling
A compiler is software that translates source code into object code. Optimization demands effort from both the programmer and the compiler.

54 Code Optimization & Scheduling (Subpoints)
1) Scalar optimization within basic blocks
   Basic block scheduling
2) Local and global optimization
Local:
   Common subexpression elimination
   Constant propagation/folding
   Algebraic optimization to simplify expressions
   Instruction reordering
   Dead code elimination

55 Global:
   Global versions of the local optimizations
   Loop optimization
   Control flow optimization

56 Scalar Optimization within Basic Blocks
Two types of scheduling:
Static scheduling: the compiler performs the dependence analysis.
Dynamic scheduling: the hardware or the operating system performs the dependence analysis at run time.
Code scheduling methods ensure that control dependences, data dependences and resource dependences are properly handled during concurrent execution.

57 Basic Blocks
A basic block is a sequence of statements with the properties that (a) the flow of control can enter the basic block only through the first instruction in the block, that is, there are no jumps into the middle of the block, and (b) control leaves the block without halting or branching, except possibly at the last instruction in the block.

58 Partitioning instructions into basic blocks
INPUT: a sequence of three-address instructions.
OUTPUT: a list of the basic blocks for that sequence, in which each instruction is assigned to exactly one basic block.
METHOD: first, determine those instructions in the intermediate code that are leaders, that is, the first instructions in some basic block.

59 The rules for finding leaders are:
1. The first three-address instruction in the intermediate code is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a leader.
3. Any instruction that immediately follows a conditional or unconditional jump is a leader.
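A small worked illustration (hypothetical three-address code, not from the slides), applying the three rules:
 1. i = 1            leader (rule 1: first instruction)
 2. t1 = i * i       leader (rule 2: target of the jump in 4)
 3. i = i + 1
 4. if i <= n goto 2
 5. ...              leader (rule 3: follows a conditional jump)
The basic blocks are therefore {1}, {2, 3, 4} and {5, ...}.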

60 Example
   sum = 0
   do 10 i = 1, n
10 sum = sum + a(i)*a(i)

61 Three Address Code
 1. sum = 0              initialize sum
 2. i = 1                initialize loop counter
 3. if i > n goto 15     loop test, check for limit
 4. t1 = addr(a) - 4
 5. t2 = i * 4           a[i]
 6. t3 = t1[t2]
 7. t4 = addr(a) - 4
 8. t5 = i * 4           a[i]
 9. t6 = t4[t5]
10. t7 = t3 * t6         a[i]*a[i]
11. t8 = sum + t7
12. sum = t8             increment sum
13. i = i + 1            increment loop counter
14. goto 3
15. ...

62 Control Flow Graph (CFG)
[CFG figure. B1: 1. sum = 0; 2. i = 1. B2: 3. if i > n goto 15 (true edge to 15, false edge to the loop body). B3: 4. t1 = addr(a) - 4; 5. t2 = i*4; 6. t3 = t1[t2]; 7. t4 = addr(a) - 4; 8. t5 = i*4; 9. t6 = t4[t5]; 10. t7 = t3*t6; 11. t8 = sum + t7; 12. sum = t8; 13. i = i + 1; 14. goto 3. B4: 15. ...]

63 Common Subexpression Elimination
Before (the three-address code of slide 61):
 1. sum = 0
 2. i = 1
 3. if i > n goto 15
 4. t1 = addr(a) - 4
 5. t2 = i * 4
 6. t3 = t1[t2]
 7. t4 = addr(a) - 4
 8. t5 = i * 4
 9. t6 = t4[t5]
10. t7 = t3 * t6
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. ...
After: instructions 7-9 recompute t1, t2 and t3, so they are eliminated and 10 becomes 10a:
 1. sum = 0
 2. i = 1
 3. if i > n goto 15
 4. t1 = addr(a) - 4
 5. t2 = i * 4
 6. t3 = t1[t2]
10a t7 = t3 * t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. ...

64 Copy Propagation
Before (after common subexpression elimination):
 1. sum = 0
 2. i = 1
 3. if i > n goto 15
 4. t1 = addr(a) - 4
 5. t2 = i * 4
 6. t3 = t1[t2]
10a t7 = t3 * t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. ...
After: since 12 merely copies t8 into sum, 11a computes sum directly:
 1. sum = 0
 2. i = 1
 3. if i > n goto 15
 4. t1 = addr(a) - 4
 5. t2 = i * 4
 6. t3 = t1[t2]
10a t7 = t3 * t3
11. t8 = sum + t7
11a sum = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. ...
Instructions 11 and 12 are now dead and are removed by dead code elimination before the next step.

65 Invariant Code Motion
Before (after copy propagation and dead code elimination):
 1. sum = 0
 2. i = 1
 3. if i > n goto 15
 4. t1 = addr(a) - 4
 5. t2 = i * 4
 6. t3 = t1[t2]
10a t7 = t3 * t3
11a sum = sum + t7
13. i = i + 1
14. goto 3
15. ...
After: instruction 4 is loop invariant, so it is hoisted out of the loop as 2a:
 1. sum = 0
 2. i = 1
 2a t1 = addr(a) - 4
 3. if i > n goto 15
 5. t2 = i * 4
 6. t3 = t1[t2]
10a t7 = t3 * t3
11a sum = sum + t7
13. i = i + 1
14. goto 3
15. ...

66 Strength Reduction
Before (after invariant code motion):
 1. sum = 0
 2. i = 1
 2a t1 = addr(a) - 4
 3. if i > n goto 15
 5. t2 = i * 4
 6. t3 = t1[t2]
10a t7 = t3 * t3
11a sum = sum + t7
13. i = i + 1
14. goto 3
15. ...
After: the multiplication t2 = i * 4 inside the loop is replaced by the initialization 2b before the loop and the cheaper addition 11b inside the loop:
 1. sum = 0
 2. i = 1
 2a t1 = addr(a) - 4
 2b t2 = i * 4
 3. if i > n goto 15
 6. t3 = t1[t2]
10a t7 = t3 * t3
11a sum = sum + t7
11b t2 = t2 + 4
13. i = i + 1
14. goto 3
15. ...

67 Constant Propagation and Dead Code Elimination
The loop test is also rewritten in terms of t2 (2c t9 = n * 4, 3a if t2 > t9 goto 15), which makes i and instruction 13 unnecessary.
Before:
 1. sum = 0
 2. i = 1
 2a t1 = addr(a) - 4
 2b t2 = i * 4
 2c t9 = n * 4
 3a if t2 > t9 goto 15
 6. t3 = t1[t2]
10a t7 = t3 * t3
11a sum = sum + t7
11b t2 = t2 + 4
14. goto 3a
15. ...
After: the constant i = 1 is propagated into 2b, giving 2d t2 = 4; instructions 2 and 2b are then dead and are eliminated:
 1. sum = 0
 2a t1 = addr(a) - 4
 2c t9 = n * 4
 2d t2 = 4
 3a if t2 > t9 goto 15
 6. t3 = t1[t2]
10a t7 = t3 * t3
11a sum = sum + t7
11b t2 = t2 + 4
14. goto 3a
15. ...

68 New Control Flow Graph
After renumbering, the optimized code is:
 1. sum = 0
 2. t1 = addr(a) - 4
 3. t9 = n * 4
 4. t2 = 4
 5. if t2 > t9 goto 11
 6. t3 = t1[t2]
 7. t7 = t3 * t3
 8. sum = sum + t7
 9. t2 = t2 + 4
10. goto 5
11. ...
[CFG figure: the true edge of the test at 5 goes to 11; the false edge falls through into the loop body 6-10, which branches back to 5.]

69 Analysis and Optimizing Transformations
Local optimizations are performed by local analysis of a basic block.
Global optimizations require analysis of statements outside a basic block.
Local optimizations are performed first, followed by global optimizations.

70 Local Optimizations --- Optimizing Transformations of a Basic Block
Local common subexpression elimination
Dead code elimination
Copy propagation
Renaming of compiler-generated temporaries to share storage

71 Vectorization and Parallelization
Vectorization: the process of converting scalar loop operations into equivalent vector instructions.
Parallelization: converting sequential code into parallel code.
Vectorizing compiler: a compiler that performs vectorization automatically.
Vector hardware speeds up vector operations; multiprocessors are used to speed up parallel code.

72 Vectorization Methods
Use of temporary storage
Loop interchange
Loop distribution
Vector reduction
Node splitting

73 Vectorization Example
Consider a scalar loop for addition:
   Do 20 I=8,120,2
20    A(I)=B(I+3)+C(I+1)
After vectorization:
   A(8:120:2)=B(11:123:2)+C(9:121:2)

74 Use of Temporary Storage
Theme: in order to allow pipelined (vector) execution, use a temporary array to produce vector code.
   Do 20 I=1,N
      A(I)=B(I)+C(I)
20    B(I)=2*A(I+1)
The sequential execution order is:
   A(1)=B(1)+C(1)
   B(1)=2*A(2)
   A(2)=B(2)+C(2)
   B(2)=2*A(3)
   ...

75 By use of the temporary storage method we have:
   TEMP(1:N)=A(2:N+1)
   A(1:N)=B(1:N)+C(1:N)
   B(1:N)=2*TEMP(1:N)

76 Loop Interchange
Theme: exchange the inner loop with the outer loop to increase profitability (improvement in execution time).
Example:
   Do 20 I=2,N
   Do 10 J=2,N
S1:   A(I,J)=(A(I,J-1)+A(I,J+1))/2
10 Continue
20 Continue

77 After Loop Interchange
   Do 20 J=2,N
   Do 20 I=2,N
S1:   A(I,J)=(A(I,J-1)+A(I,J+1))/2
20 Continue
Now the inner loop can be vectorized as follows:
   A(2:N,J)=(A(2:N,J-1)+A(2:N,J+1))/2

78 Loop Distribution
Theme: distribute the loop and then vectorize it.
Example:
   DO I = 1, N
S1    A(I+1) = B(I) + C
S2    D(I) = A(I) + E
   ENDDO

79 Transformed to:
   DO I = 1, N
S1    A(I+1) = B(I) + C
   ENDDO
   DO I = 1, N
S2    D(I) = A(I) + E
   ENDDO
This leads to:
S1 A(2:N+1) = B(1:N) + C
S2 D(1:N) = A(1:N) + E

80 Vector Reduction
Theme: produces a scalar value from one or more data arrays, e.g. the sum, product, maximum or minimum of all elements in an array.
Example:
   DO 40 I=1,N
      A(I)=B(I)+C(I)
      S=S+A(I)
      AMAX=MAX(AMAX,A(I))
40 Continue

81 After Vector Reduction
   A(1:N)=B(1:N)+C(1:N)
   S=S+SUM(A(1:N))
   AMAX=MAX(AMAX,MAXVAL(A(1:N)))
where SUM, MAX and MAXVAL are all vector operations.

82 Node Splitting
The data dependence cycle can sometimes be broken by node splitting.
Example (original loop):
   Do 50 I=2,N
S1:   T(I)=A(I-1)+A(I+1)
S2:   A(I)=B(I)+C(I)
50 Continue
After node splitting:
   Do 50 I=2,N
S1a:  X(I)=A(I+1)
S2:   A(I)=B(I)+C(I)
S1b:  T(I)=A(I-1)+X(I)
50 Continue

83 Thus the new loop structure can be vectorized as follows:
S1a: X(2:N)=A(3:N+1)
S2:  A(2:N)=B(2:N)+C(2:N)
S1b: T(2:N)=A(1:N-1)+X(2:N)

84 Code Generation and Scheduling
Directed Acyclic Graph (DAG)
Register allocation
List scheduling
Cycle scheduling
Trace scheduling compilation

85 DAG
A DAG is constructed from a basic block. Within a basic block there is no branching (no backtracking), so we can construct a DAG for it.

86 Example of Constructing the DAG
t1 := 4 * i
   Step (1): create leaf nodes 4 and i0
   Step (2): create node *
   Step (3): attach identifier t1
t2 := a[t1]
   Step (1): create nodes labeled [] and a
   Step (2): find the previously created node (t1)
   Step (3): attach label t2
t3 := 4 * i
   Here we determine that node (4), node (i) and node (*) were already created, so we just attach t3 to the existing * node.
[DAG figure: a * node with children 4 and i0, labeled t1,t3; a [] node with children a0 and the * node, labeled t2]
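A minimal C sketch of the node-reuse idea behind this construction (illustrative only; the structures dag_node, make_leaf and find_or_make are invented for the example): before creating an operator node, the builder searches for an existing node with the same operator and children, which is exactly how t3 ends up attached to the node already built for t1.

#include <stdio.h>
#include <string.h>

#define MAXN 128   /* sketch only: no overflow handling */

struct dag_node {
    char op[8];        /* "*", "[]", "leaf", ... */
    int  left, right;  /* child indices, -1 for leaves */
    char label[32];    /* attached identifiers, e.g. "t1,t3" */
};

static struct dag_node nodes[MAXN];
static int nnodes = 0;

static int make_leaf(const char *name)
{
    for (int i = 0; i < nnodes; i++)           /* reuse an existing leaf */
        if (strcmp(nodes[i].op, "leaf") == 0 && strcmp(nodes[i].label, name) == 0)
            return i;
    strcpy(nodes[nnodes].op, "leaf");
    nodes[nnodes].left = nodes[nnodes].right = -1;
    strcpy(nodes[nnodes].label, name);
    return nnodes++;
}

static int find_or_make(const char *op, int l, int r, const char *id)
{
    for (int i = 0; i < nnodes; i++)
        if (strcmp(nodes[i].op, op) == 0 && nodes[i].left == l && nodes[i].right == r) {
            strcat(nodes[i].label, ",");       /* common subexpression found: */
            strcat(nodes[i].label, id);        /* just attach another identifier */
            return i;
        }
    strcpy(nodes[nnodes].op, op);
    nodes[nnodes].left = l;
    nodes[nnodes].right = r;
    strcpy(nodes[nnodes].label, id);
    return nnodes++;
}

int main(void)
{
    int four = make_leaf("4"), i0 = make_leaf("i0"), a0 = make_leaf("a0");
    int t1 = find_or_make("*", four, i0, "t1");  /* t1 := 4 * i  -> new node */
    find_or_make("[]", a0, t1, "t2");            /* t2 := a[t1]  -> new node */
    find_or_make("*", four, i0, "t3");           /* t3 := 4 * i  -> reuses t1's node, label "t1,t3" */
    for (int i = 0; i < nnodes; i++)
        printf("node %d: %s (%s)\n", i, nodes[i].op, nodes[i].label);
    return 0;
}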

87 Compiler Design
[Figure: the Source Program enters the Front End, which produces the Intermediate Language; the Back End then performs scheduling and register allocation.]

88 Why Register Allocation?
Storing and accessing variables in registers is much faster than accessing data in memory; this is the way operations are performed in load/store (RISC) processors. Therefore, in the interest of performance, if not by necessity, variables ought to be stored in registers. For performance reasons it is also useful to keep variables in registers as long as possible once they have been loaded.

89 The Goal Primarily to assign registers to variables.
However, the allocator often runs out of registers. It must decide which variables to "flush" out of registers to free them up, so that other variables can be brought in. This important indirect consequence of allocation is referred to as spilling.

90 Scheduling
List scheduling
Cycle scheduling
Trace scheduling

91 Trace Scheduling
Code compaction
Compensation code

92 Thank You All the best for exams….!!!
