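Extensions to the HPF Language for Implementing Parallel Programs that Manipulate Irregular Data Structures (« Extensions du langage HPF pour la mise en œuvre de programmes parallèles manipulant des structures de données irrégulières »), Frédéric Brégier, LaBRI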


Slide 1: Frédéric Brégier, LaBRI. Thesis presented at the Université de Bordeaux I on 21 December 1999.

Slide 2: Context of the Work
- Parallel programs obtained by compilation
- HPF: the standard for data-parallel (regular) programs
- Irregular programs still need investment: efficiency is poor
- Optimizations at compile time
- Optimizations at run time (code generated at compile time)

Slide 3: Plan
- Optimizations at compile time
  - Irregular Data Structure (IDS)
  - A tree to represent an IDS
- Optimizations at run time
  - Inspection-execution principles
  - Irregular communications: irregular active processor sets
  - Irregular iteration spaces
  - Scheduling of loops with partial loop-carried dependencies
  - New data-parallel irregular operation: progressive irregular prefix operation
- Conclusion and perspectives

Slide 4: HPF (High Performance Fortran)
- Data-parallel language: HPF 1.0 in May 1993, HPF 2.0 in January 1997
- Fortran 95 source code + structured comments (!HPF$) giving distributions and parallel properties
- Target code: SPMD parallel code
- « Owner computes » rule: run-time guards and generated communications
For the statement
    A(I) = B(J) + X
the generated SPMD code is:
    IF (B(J) is local) THEN
      Send(B(J) to Owner(A(I)))
    END IF
    IF (A(I) is local) THEN
      Receive(TMP from Owner(B(J)))
      A(I) = TMP + X
    END IF
[Figure: arrays A, B and scalars X, Y mapped onto the processors]
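To make the « owner computes » rule concrete, here is a small Python simulation of the guarded SPMD code for A(I) = B(J) + X. It is a sketch under assumptions of ours: A and B are BLOCK-distributed, processors are numbered from 0, the (I, J) pairs are hypothetical, and no message is sent when both operands live on the same processor.

```python
from collections import defaultdict

def block_owner(index, n, nprocs):
    """0-based owner of the 1-based `index` under an HPF-style BLOCK distribution."""
    block = -(-n // nprocs)  # ceiling(n / nprocs)
    return (index - 1) // block

def owner_computes(a, b, x, pairs, nprocs):
    """Simulate the SPMD code generated for A(I) = B(J) + X, one statement per (I, J).

    Arrays are stored 0-based; indices in `pairs` are 1-based as in Fortran.
    Returns the updated copy of `a` and the number of messages exchanged.
    """
    a = list(a)
    mailbox = defaultdict(list)  # (destination, source) -> FIFO of values
    messages = 0
    for i, j in pairs:
        owner_a = block_owner(i, len(a), nprocs)
        owner_b = block_owner(j, len(b), nprocs)
        # Phase 1: every processor evaluates the guard "B(J) is local".
        for pid in range(nprocs):
            if pid == owner_b and pid != owner_a:
                mailbox[(owner_a, pid)].append(b[j - 1])  # Send(B(J) to Owner(A(I)))
                messages += 1
        # Phase 2: every processor evaluates the guard "A(I) is local".
        for pid in range(nprocs):
            if pid == owner_a:
                if pid == owner_b:
                    tmp = b[j - 1]  # both operands local (our assumption): no message
                else:
                    tmp = mailbox[(pid, owner_b)].pop(0)  # Receive(TMP from Owner(B(J)))
                a[i - 1] = tmp + x
    return a, messages
```

For example, `owner_computes([0] * 8, list(range(10, 18)), 1, [(1, 8), (5, 2), (3, 3)], nprocs=3)` needs two messages: only the third statement has both operands on the same processor.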

Slide 5: Optimizations at Compile Time
When the loop iteration space and the distribution are affine expressions, the compiler computes local loop bounds; otherwise the loop is not optimizable.
    !HPF$ INDEPENDENT
    DO I = 1, N
      A(I) = A(I) + 1
    END DO

    ! Cyclic distribution case
    DO I = PID+1, N, NOP
      A(I) = A(I) + 1
    END DO

    ! Block distribution case (N divisible by NOP, BLOC = N/NOP)
    LB = BLOC * PID + 1
    UB = min(N, LB+BLOC-1)
    DO I = LB, UB
      A(I) = A(I) + 1
    END DO

    ! Indirect distribution: not optimizable, one run-time guard per iteration
    DO I = 1, N
      IF (A(I) is local) THEN
        A(I) = A(I) + 1
      END IF
    END DO
Irregular = « what is not regular », i.e. not optimizable at compile time.
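A Python sketch of the three ways a processor can find its local iterations. Processor numbers PID start at 0 and the indirect owner map is a hypothetical example; the block case uses BLOC = N/NOP and the upper bound LB+BLOC-1.

```python
def cyclic_local_iterations(n, pid, nop):
    """DO I = PID+1, N, NOP: iterations owned by processor PID under CYCLIC."""
    return list(range(pid + 1, n + 1, nop))

def block_local_iterations(n, pid, nop):
    """Closed-form local bounds under BLOCK, assuming N is divisible by NOP."""
    bloc = n // nop
    lb = bloc * pid + 1
    ub = min(n, lb + bloc - 1)  # last index of this processor's block
    return list(range(lb, ub + 1))

def indirect_local_iterations(n, pid, owner_map):
    """No closed form: scan all N iterations and test ownership at run time.

    owner_map is a hypothetical list giving the 0-based owner of each A(I).
    """
    return [i for i in range(1, n + 1) if owner_map[i - 1] == pid]
```

With an upper bound of LB+BLOC instead, consecutive blocks would share one iteration and the local iteration sets would no longer partition 1..N.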

Slide 6: Plan (outline repeated; next part: Irregular Data Structure (IDS))

Slide 7: Irregular Data Structure (IDS)
Standard irregular format: indirect access arrays, for example CSC (Compressed Sparse Column), on a matrix with columns I to VIII:
- JA(1:9): start of each column in IA and DA
- IA(1:20): row numbers of the non-zero values
- DA(1:20): non-zero values of A
Accesses:
    A(1,1) = DA(JA(1))            (IA(JA(1)) = 1)
    A(6,4) = DA(JA(4)+1)          (IA(JA(4)+1) = 6)
    A(:,4) = DA(JA(4):JA(5)-1)    (non-zeros of column 4)
Irregular distribution formats:
    !HPF$ DISTRIBUTE JA(BLOCK)
    !HPF$ DISTRIBUTE IA(GEN_BLOCK(/5, 10, 5/))
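The CSC accesses above can be checked with a short Python sketch. It keeps the Fortran 1-based conventions for JA, IA and DA; the 3x3 matrix in the usage is hypothetical, and gen_block_owner is our own helper showing how GEN_BLOCK(/5, 10, 5/) assigns IA(1:20) to three processors.

```python
def dense_to_csc(dense):
    """Build 1-based CSC arrays: JA (column starts), IA (row numbers), DA (values)."""
    nrows, ncols = len(dense), len(dense[0])
    ja, ia, da = [1], [], []
    for col in range(ncols):
        for row in range(nrows):
            if dense[row][col] != 0.0:
                ia.append(row + 1)
                da.append(dense[row][col])
        ja.append(len(da) + 1)  # JA(col+1) points one past the last entry of col
    return ja, ia, da

def csc_get(ja, ia, da, i, j):
    """Return A(i, j) (1-based) by scanning DA(JA(j):JA(j+1)-1)."""
    for k in range(ja[j - 1], ja[j]):  # k is a 1-based position in IA and DA
        if ia[k - 1] == i:
            return da[k - 1]
    return 0.0

def gen_block_owner(k, sizes):
    """1-based owner of element k under GEN_BLOCK(sizes), e.g. sizes = (5, 10, 5)."""
    upper = 0
    for proc, size in enumerate(sizes, start=1):
        upper += size
        if k <= upper:
            return proc
    raise IndexError(k)
```

Note that the owner of DA(JA(j)) under GEN_BLOCK has no simple relation to the owner of JA(j) under BLOCK, which is exactly the unknown alignment discussed next.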

Slide 8: Problems at Compile Time
- Distribution: the alignment between the arrays of the IDS (e.g. JA(1:9) and DA(1:20)) is unknown
- Data accesses: indexes are unknown because of indirection; in DA(JA(4)+1), the value of JA(4) is only known at run time
- Consequence: additional run-time guards and communications, hence inefficient SPMD code

Slide 9: Related Works
- Regular to irregular compilation
  - Bik and Wijshoff: « Sparse Compiler »; sparse matrix with known topology; regular analysis + known topology; IDS chosen by the compiler
  - Pingali et al.: relational description (between components and access functions); non-standard and difficult notations
- Compilation of irregular programs
  - Vienna Fortran Compilation System: SPARSE directive specifying the storage format; limited to the storage formats known by the compiler

Slide 10: Plan (outline repeated; next part: a tree to represent an IDS)

Slide 11: The Tree, a Generic Data Structure with Hierarchical Access
[Figure: from a sparse matrix with columns I to VIII to a tree]
Representation in HPF2 as a Fortran 95 derived data type:
    type level2
      integer :: ROW   ! row number
      real    :: VAL   ! non-zero value
    end type level2
    type level1
      type (level2), pointer :: COL(:)   ! column
    end type level1
    type (level1), allocatable :: A(:)   ! matrix with a hierarchical access by column
    !HPF$ TREE
Correspondence (j is the rank of a non-zero inside column i; its row number is A(i)%COL(j)%ROW):
    Tree               Matrix                     CSC
    A(i)%COL(j)%VAL    j-th non-zero of A(:,i)    DA(JA(i)+j-1)
    A(i)%COL(:)%VAL    non-zeros of A(:,i)        DA(JA(i):JA(i+1)-1)
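A minimal Python sketch of the correspondence in the table: it builds the tree from CSC arrays and checks that A(i)%COL(j)%VAL equals DA(JA(i)+j-1). Level2 mirrors the Fortran type level2; the conversion functions and the sample arrays are illustrative assumptions, not part of the TREE directive.

```python
from dataclasses import dataclass

@dataclass
class Level2:
    """One non-zero entry, mirroring the Fortran type level2."""
    row: int    # row number (1-based)
    val: float  # non-zero value

def csc_to_tree(ja, ia, da):
    """Return tree[i-1][j-1] == A(i)%COL(j): the j-th non-zero of column i."""
    ncols = len(ja) - 1
    return [[Level2(ia[k - 1], da[k - 1]) for k in range(ja[i], ja[i + 1])]
            for i in range(ncols)]

def tree_column_values(tree, i):
    """A(i)%COL(:)%VAL, i.e. DA(JA(i):JA(i+1)-1)."""
    return [entry.val for entry in tree[i - 1]]
```

The tree removes one level of indirection from the user's code: column i is reached directly as `tree[i-1]` instead of through JA.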

Slide 12: Distribution of a TREE
[Figure: columns I to VIII of the tree mapped onto the processors]
    !HPF$ DISTRIBUTE A(BLOCK)
    !HPF$ DISTRIBUTE A(INDIRECT(/1,2,3,2,1,2,3,1/))
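The two distributions can be compared with a few lines of Python, assuming 8 tree columns on 3 processors (the processor count implied by the INDIRECT map) and HPF's BLOCK rule with block size ceiling(N/P). The helper names are ours.

```python
def block_owners(n, nprocs):
    """1-based owner of each column under HPF BLOCK (block size = ceiling(n / nprocs))."""
    size = -(-n // nprocs)
    return [(k - 1) // size + 1 for k in range(1, n + 1)]

def local_columns(owners, proc):
    """1-based indices of the tree columns A(k) mapped onto processor `proc`."""
    return [k for k, p in enumerate(owners, start=1) if p == proc]

# !HPF$ DISTRIBUTE A(INDIRECT(/1,2,3,2,1,2,3,1/)): the owner of every column is explicit.
INDIRECT_MAP = [1, 2, 3, 2, 1, 2, 3, 1]
```

BLOCK gives processor 3 only columns VII and VIII, while the INDIRECT map lets the user choose any placement, for instance to reduce communications.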

Slide 13: Example of Improvement
With a TREE:
    !HPF$ DISTRIBUTE A(BLOCK)
    !HPF$ INDEPENDENT
    FORALL (I = 3:N-2)
      A(I)%COL(:)%VAL = A(I-2)%COL(:)%VAL + A(I+2)%COL(:)%VAL
    END FORALL
Generated code: communications on the frontiers only, as with SHADOW in HPF2:
    local_bound(A(:), lb, ub)
    TMP(lb:ub) = Local Copy of Local Part(A(lb:ub))
    Shadow_Update(TMP(:), -2, +2)
    local_bound(A(3:N-2), lb, ub)
    DO I = lb, ub
      A(I)%COL(:)%VAL = TMP(I-2)%COL(:)%VAL + TMP(I+2)%COL(:)%VAL
    END DO
With CSC arrays:
    !HPF$ DISTRIBUTE DA(GEN_BLOCK(array))
    !HPF$ INDEPENDENT
    FORALL (I = 3:N-2)
      DA(IA(I):IA(I+1)-1) = DA(IA(I-2):IA(I-1)-1) + DA(IA(I+2):IA(I+3)-1)
    END FORALL
Generated code: the bounds IA(I-2) and IA(I-1)-1 are unknown at compile time, hence a global copy and broadcast of DA:
    TMP(:) = Global Copy with BCAST(DA(:))
    DO I = 3, N-2
      local_bound(DA(IA(I):IA(I+1)-1), lb, ub)
      DO J = lb, ub
        DA(J) = TMP(J1) + TMP(J2)   ! J1, J2: matching positions in columns I-2 and I+2
      END DO
    END DO
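The frontier-only communication of the TREE version can be sketched in Python. This is a simplification with assumptions of ours: each tree column is reduced to a single number, the array is BLOCK-distributed, and the shadow width is 2 as in Shadow_Update(TMP(:), -2, +2). Reads come from the old values, as FORALL semantics require.

```python
def block_ranges(n, nprocs):
    """1-based (lb, ub) of each processor's block (block size = ceiling(n / nprocs))."""
    size = -(-n // nprocs)
    return [(p * size + 1, min(n, (p + 1) * size)) for p in range(nprocs)]

def shadow_stencil(a, nprocs, width=2):
    """A(I) = A(I-2) + A(I+2) for I = 3..N-2 using shadow (ghost) regions.

    Each processor copies its block into TMP, extended by `width` elements
    received from its neighbours. Returns the new array and the number of
    elements that crossed processor boundaries.
    """
    n = len(a)
    result = list(a)
    exchanged = 0
    for lb, ub in block_ranges(n, nprocs):
        if lb > ub:
            continue  # empty block
        lo, hi = max(1, lb - width), min(n, ub + width)  # block + shadow, clipped to 1..N
        tmp = {k: a[k - 1] for k in range(lo, hi + 1)}   # local copy + Shadow_Update
        exchanged += (lb - lo) + (hi - ub)
        for i in range(max(lb, 3), min(ub, n - 2) + 1):  # local_bound(A(3:N-2), lb, ub)
            result[i - 1] = tmp[i - 2] + tmp[i + 2]
    return result, exchanged
```

With 12 elements on 3 processors, only 8 elements cross processor boundaries, whereas the CSC version broadcasts the whole DA array to every processor.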

Slide 14: Arrays vs. Trees/Derived Types
- Arrays: DALIB, MPI
- Trees/derived types: DALIB, TriDenT, MPI
[Figure: the matrix with columns I to VIII stored as arrays and as a tree]

Slide 15: Matrix-Vector Product
[Figure: results on the IBM SP2 at LaBRI, 4096x4096 matrix]

Slide 16: TREE, Advantages and Disadvantages
Advantages:
- Fewer indirections
- Fewer unknown alignments
- Better compile-time analysis (locality and dependence)
- Generic (defined by the user)
- Low overhead
Disadvantages:
- Not necessarily implemented in HPF compilers: portability
- Irregular code must be rewritten (with derived types)

Slide 17: Plan (outline repeated; next part: inspection-execution principles)

Slide 18: Inspection-Execution
- Inspection: scan the program to analyze it and collect useful information
- Execution: perform the actual computations according to the optimized scheme derived from the inspected information
Original program:
    DO STEP = 1, S
      DO I = 1, N
        A(I) = B(INDEX(I))
      END DO
      Modify B
    END DO
Inspection (once):
    DO I = 1, N
      if (A(I) is local) then
        Add INDEX(I) to local_index
      end if
    END DO
    Exchange info on local_index (which indexes to send, which to receive)
Execution (at every step):
    DO STEP = 1, S
      Gather(B(local_index(:)) into Copy_B)
      I_local = 1
      DO I = 1, N
        if (A(I) is local) then
          A(I) = Copy_B(I_local)
          I_local = I_local + 1
        end if
      END DO
      Modify B
    END DO
Inspection-execution schemes are often iterative, so the inspection cost is amortized over the steps.
Related works:
- PARTI: iterative scheme
- CHAOS: iterative and adaptive scheme (by steps), integrated in Fortran D and in the Vienna Fortran Compilation System
- PILAR: iterative and multi-phase scheme whose basic element is a section, in the PARADIGM compiler
- ADAPTOR: TRACE directive, dynamic adaptive scheme
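A Python sketch of the inspector/executor pair, under assumptions of ours: A and B share a hypothetical CYCLIC mapping, processors are numbered from 0, and "Modify B" is replaced by adding 1 to every element. The inspector runs once; the executor reuses its communication schedule at every step.

```python
def inspector(index, owner_a, owner_b, nprocs):
    """Inspection, run once: what each processor needs and who must send it.

    Returns local_index[p] (the INDEX(I) values p needs, in loop order) and
    schedule[(src, dst)] (the B positions src sends to dst at every step).
    """
    local_index = {p: [] for p in range(nprocs)}
    schedule = {}
    for i, k in enumerate(index, start=1):
        p = owner_a(i)
        local_index[p].append(k)              # Add INDEX(I) to local_index
        src = owner_b(k)
        if src != p:
            needed = schedule.setdefault((src, p), [])
            if k not in needed:               # send each remote value only once
                needed.append(k)
    return local_index, schedule

def executor(a, b, local_index, schedule, owner_a, owner_b, nprocs):
    """Execution, run at every step: gather through the schedule, then compute."""
    a = list(a)
    for p in range(nprocs):
        # Values visible on p: its own part of B plus the messages addressed to p.
        available = {k: b[k - 1] for k in range(1, len(b) + 1) if owner_b(k) == p}
        for (_, dst), ks in schedule.items():
            if dst == p:
                available.update({k: b[k - 1] for k in ks})
        copy_b = [available[k] for k in local_index[p]]  # Gather into Copy_B
        i_local = 0
        for i in range(1, len(a) + 1):
            if owner_a(i) == p:
                a[i - 1] = copy_b[i_local]
                i_local += 1
    return a

def iterative_loop(a, b, index, steps, nprocs):
    owner = lambda k: (k - 1) % nprocs       # hypothetical CYCLIC mapping of A and B
    local_index, schedule = inspector(index, owner, owner, nprocs)  # once
    for _ in range(steps):                   # DO STEP = 1, S
        a = executor(a, b, local_index, schedule, owner, owner, nprocs)
        b = [v + 1 for v in b]               # placeholder for "Modify B"
    return a
```

The result is the same as running A(I) = B(INDEX(I)) sequentially at every step, but the ownership tests on INDEX and the message lists are computed only once.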

Slide 19: ON HOME Directive, to Control the Computation Mapping
    !HPF$ ALIGN (I) WITH A(I) :: B, C
    !HPF$ INDEPENDENT
    DO I = 1, N
      C(INDEX(I)) = A(I) * B(I)
    END DO
Owner-computes code (the computation happens on the owner of C(INDEX(I)), so two values travel):
    DO I = 1, N
      if (A(I) is local) then
        call Send(A(I) to Owner(C(INDEX(I))))
        call Send(B(I) to Owner(C(INDEX(I))))
      end if
      if (C(INDEX(I)) is local) then
        call Receive(TMP1 from Owner(A(I)))
        call Receive(TMP2 from Owner(A(I)))
        C(INDEX(I)) = TMP1 * TMP2
      end if
    END DO
With !HPF$ ON HOME (A(I)) on the loop body, the product is computed on the owner of A(I) and a single value is sent:
    DO I = 1, N
      if (A(I) is local) then
        TMP = A(I) * B(I)
        call Send(TMP to Owner(C(INDEX(I))))
      end if
      if (C(INDEX(I)) is local) then
        call Receive(TMP from Owner(A(I)))
        C(INDEX(I)) = TMP
      end if
    END DO
HPF2: communication optimizations with active processor sets.
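The effect of ON HOME (A(I)) on the number of messages can be counted with a small Python model. We assume a CYCLIC mapping shared by A, B and C (B and C aligned with A) and that INDEX is a permutation, so the INDEPENDENT assertion holds; the data are hypothetical.

```python
def execute(a, b, index, nprocs, on_home_a):
    """C(INDEX(I)) = A(I) * B(I) with B and C aligned with A.

    on_home_a=False: owner-computes on C, so A(I) and B(I) are sent separately.
    on_home_a=True : ON HOME (A(I)), so only the product TMP is sent.
    Returns C and the number of messages.
    """
    owner = lambda k: (k - 1) % nprocs      # hypothetical CYCLIC mapping
    c = [0.0] * len(index)
    messages = 0
    for i, k in enumerate(index, start=1):
        remote = owner(i) != owner(k)       # Owner(A(I)) differs from Owner(C(INDEX(I)))?
        tmp = a[i - 1] * b[i - 1]           # same value, computed on a different processor
        if remote:
            messages += 1 if on_home_a else 2
        c[k - 1] = tmp
    return c, messages
```

The result is identical in both modes; only the communication volume changes, halving here because every remote iteration sends one value instead of two.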

Slide 20: Irregular Active Processor Sets
[Figure: sparse matrix A with columns I to VIII and vector B; the active sets are ON HOME A(1,I) + ON HOME A(1,V), ON HOME A(2,II) + ON HOME A(2,V), and ON HOME A(3,III)]
Example:
    !HPF$ ALIGN A(*,K) WITH B(K)
    B(K) = SUM(A(K,:))
Restricting the computation to the processors that own a non-zero of row K gives:
- fewer active processors in collective communications
- fewer communications (reduction or broadcast)
- fewer synchronizations
Extensions to the ON HOME directive (columns I to VIII are numbered 1 to 8):
    !HPF$ ON HOME (A(K,:))
    !HPF$ ON HOME (A(K,INDEX(K)))
    FORALL (J=1:8, J.eq.K .or. A(K,J).ne.0.0)
    !HPF$ ON HOME (A(K,J), J=1:8, J.eq.K .or. A(K,J).ne.0.0)
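A Python sketch of how an irregular active processor set could be derived for B(K) = SUM(A(K,:)). The sparsity of rows 1 to 3 follows the figure (non-zeros in columns I and V, II and V, III); reusing INDIRECT(/1,2,3,2,1,2,3,1/) as the column-to-processor map is a hypothetical choice, and the values are made up.

```python
def active_set(row, k, owner_of_col):
    """Processors of ON HOME (A(K,J), J=1:N, J.eq.K .or. A(K,J).ne.0.0).

    row          : A(K, 1:N) as a list
    owner_of_col : list, 1-based column -> processor (A distributed by columns)
    """
    return sorted({owner_of_col[j - 1] for j in range(1, len(row) + 1)
                   if j == k or row[j - 1] != 0.0})

def row_sum(row, k, owner_of_col):
    """B(K) = SUM(A(K,:)): each active processor sums its local non-zeros and only
    the active processors take part in the reduction. Returns (B(K), set size)."""
    partial = {p: 0.0 for p in active_set(row, k, owner_of_col)}
    for j, value in enumerate(row, start=1):
        if value != 0.0:
            partial[owner_of_col[j - 1]] += value
    return sum(partial.values()), len(partial)
```

For row 1 the reduction involves a single processor instead of all three, which is where the savings in communications and synchronizations come from.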

Slide 21: Cholesky Example, TREE and Active Processor Set (matrix stored by columns)
    DO K = 1, N
      allocate (TMP(N))
      TMP(:) = 0.0
    !HPF$ ON HOME (A(K,J), J = 1:K, J.eq.K .or. A(K,J).ne.0.0), NEW(TMP), BEGIN
    !HPF$ INDEPENDENT, REDUCTION (TMP(:))
      DO J = 1, K-1
        IF (A(K,J).ne.0.0) THEN
          CMOD (TMP, A(:,J))
        END IF
      END DO
    !HPF$ END ON
      A(:,K) = A(:,K) + TMP(:)
      CDIV (A(:,K))
      deallocate (TMP)
    END DO
[Figure: results on the IBM SP2 at LaBRI, 2D grid 255x255]
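CMOD and CDIV are not defined on the slide. The Python sketch below assumes the usual left-looking Cholesky meaning (CMOD accumulates minus the contribution of column J into TMP, CDIV scales column K by the square root of its diagonal) and uses a dense list of lists instead of a TREE. The test on A(K,J) is the same one that defines the active processor set.

```python
import math

def cmod(tmp, a, k, j):
    """Accumulate minus the contribution of column J to column K (rows K..N)."""
    for i in range(k, len(a)):
        tmp[i] -= a[i][j] * a[k][j]

def cdiv(a, k):
    """Scale column K by the square root of its diagonal entry."""
    a[k][k] = math.sqrt(a[k][k])
    for i in range(k + 1, len(a)):
        a[i][k] /= a[k][k]

def cholesky_left_looking(a):
    """In-place left-looking Cholesky (0-based); the lower triangle ends up holding L."""
    n = len(a)
    for k in range(n):
        tmp = [0.0] * n                # TMP(:) = 0.0
        for j in range(k):
            if a[k][j] != 0.0:         # only columns J with A(K,J) /= 0 contribute
                cmod(tmp, a, k, j)
        for i in range(k, n):
            a[i][k] += tmp[i]          # A(:,K) = A(:,K) + TMP(:)
        cdiv(a, k)
    return a
```

The upper triangle is never read or written, so L is taken from the lower triangle of the result.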

Slide 22: Plan (outline repeated; next part: irregular iteration spaces)

Slide 23: Irregular Iteration Space
    !HPF$ DISTRIBUTE A(:,BLOCK)
    !HPF$ INDEPENDENT, REDUCTION(B)
    DO J = 1, K-1
      IF (A(K,J).ne. 0.0) THEN
        …
      END IF
    END DO
[Figure: results on the IBM SP2 at LaBRI, 2D grid 255x255]
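The loop body is elided on the slide. In this Python sketch, B = B + A(K,J) is a hypothetical stand-in; the point is that an inspection builds, once, the list of local J with A(K,J) non-zero under A(:,BLOCK), so the execution no longer tests every J from 1 to K-1.

```python
def inspect_iteration_spaces(a, nprocs):
    """Inspection: space[p][k] lists the J < K owned by p with A(K,J) != 0.

    Valid as long as the sparsity pattern of A does not change.
    """
    n = len(a)
    size = -(-n // nprocs)                   # BLOCK size for A(:,BLOCK)
    space = [[[] for _ in range(n + 1)] for _ in range(nprocs)]
    for k in range(1, n + 1):
        for j in range(1, k):
            if a[k - 1][j - 1] != 0.0:
                space[(j - 1) // size][k].append(j)
    return space

def local_partial_sum(a, k, p, space):
    """Processor p's contribution to REDUCTION(B) for row K.

    B = B + A(K,J) is a hypothetical stand-in for the elided loop body.
    """
    b = 0.0
    for j in space[p][k]:                    # local J with A(K,J) != 0, no guard needed
        b += a[k - 1][j - 1]
    return b
```

Each processor's loop length becomes the number of its local non-zeros in row K rather than K-1, which matters when rows are very sparse.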

Slide 24: Plan (outline repeated; next part: scheduling of loops with partial loop-carried dependencies)

Slide 25: Loops with Partial Loop-Carried Dependencies
Loop-carried dependencies:
    DO I = 1, N
      DO J = 1, I-1
        A(I) = A(I) + A(J)
      END DO
    END DO
Partial loop-carried dependencies:
    DO I = 1, N
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          A(I) = A(I) + A(J)
        END IF
      END DO
    END DO
Precomputable partial loop-carried dependencies (PPLD loop): TEST is never modified, so the dependencies can be computed before the loop executes.
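A Python sketch of why PPLD loops can be exploited: since TEST never changes, the dependence graph can be built before the loop and the iterations grouped into independent wavefronts. The TEST relation in the usage is hypothetical, and wavefront grouping is our illustration, not the scheduling algorithm of the thesis.

```python
def dependence_dag(test, n):
    """Predecessors of iteration I: every J < I with TEST(I, J) true (1-based)."""
    return {i: [j for j in range(1, i) if test(i, j)] for i in range(1, n + 1)}

def wavefronts(preds):
    """Group iterations into levels; iterations of the same level are independent."""
    level = {}
    for i in sorted(preds):                  # predecessors always have smaller indices
        level[i] = 1 + max((level[j] for j in preds[i]), default=0)
    fronts = {}
    for i, lvl in level.items():
        fronts.setdefault(lvl, []).append(i)
    return [fronts[lvl] for lvl in sorted(fronts)]

def sequential(a, test):
    """The original PPLD loop."""
    a = list(a)
    for i in range(1, len(a) + 1):
        for j in range(1, i):
            if test(i, j):
                a[i - 1] += a[j - 1]
    return a

def by_wavefront(a, test):
    """Same loop, executed front by front; each front only reads finalized values."""
    a = list(a)
    preds = dependence_dag(test, len(a))
    for front in wavefronts(preds):
        updates = {i: a[i - 1] + sum(a[j - 1] for j in preds[i]) for i in front}
        for i, value in updates.items():
            a[i - 1] = value
    return a
```

With TEST(I,J) true only when I-J = 2, iterations form two chains and every front contains two independent iterations.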

Slide 26: PPLD Loop
    DO I = 1, N
    !HPF$ ON HOME (A(J), J=I .or. TEST(I,J))
      B = 0.0
    !HPF$ INDEPENDENT, REDUCTION(B)
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + B
    !HPF$ END ON
    END DO

Slide 27: PPLD Loop Scheduling
- One task is associated with each iteration
- Precomputable partial loop-carried dependencies = task graph
- Scheduling problem in the HPF context:
  - known mapping (HPF data distribution => task mapping)
  - data distribution => possibly multi-processor tasks
  - « Scheduling multi-processor tasks on dedicated processors »
- Related work:
  - complexity: NP-hard problem (Drozdowski 97, Krämer 95)
  - Wennink 95: scheduling algorithm
  - PYRROS / RAPID libraries: precomputable task graph with mono-processor tasks (inspection-execution)

Slide 28: Scheduling Tasks Associated with a PPLD Loop
1) DAG generation: new SCHEDULE directive
2) Scheduling: simple and Wennink's scheduling
3) Execution: static / dynamic execution; single-thread / multi-thread execution
4) Experimental results

Slide 29: SCHEDULE Directive
Dependencies between iterations are computed by inspection-execution:
    !HPF$ SCHEDULE (J = 1:I-1, TEST(I,J))
    DO I = 1, N
    !HPF$ ON HOME (A(J), J=I .or. TEST(I,J))
      B = 0.0
    !HPF$ INDEPENDENT, REDUCTION(B)
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + B
    !HPF$ END ON
    END DO

Slide 30: Distributed Scheduling Algorithms
- Simple scheduling: each processor schedules its local tasks only
- Order in task scheduling: priority criterion based on the critical path
- Scheduling must be coherent between processors to prevent deadlock
- Step-by-step scheduling algorithm
- A list gives the order of task execution
[Figure: task graph and the resulting task lists on processors a, b, c and d]
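One way to obtain coherent local orders is sketched in Python below; it is our assumption, not necessarily the thesis's algorithm. With unit costs, a task's priority (longest path in edges to the virtual End vertex) is strictly greater than each successor's, so sorting local tasks by decreasing priority agrees with a global topological order and cannot deadlock. The task graph and mapping in the usage are hypothetical.

```python
def bottom_levels(succs, tasks):
    """Static priority: longest path, in edges, from each task to the virtual End vertex."""
    memo = {}
    def bl(t):
        if t not in memo:
            memo[t] = 1 + max((bl(s) for s in succs.get(t, [])), default=0)
        return memo[t]
    return {t: bl(t) for t in tasks}

def local_orders(owner, succs):
    """Each processor orders only its local tasks: higher priority first, then lower rank."""
    prio = bottom_levels(succs, owner.keys())
    orders = {}
    for t in sorted(owner, key=lambda t: (-prio[t], t)):
        orders.setdefault(owner[t], []).append(t)
    return orders

def simulate(owner, succs, orders):
    """Unit-cost execution of the local lists; returns the finish time of every task."""
    preds = {t: [] for t in owner}
    for t, ss in succs.items():
        for s in ss:
            preds[s].append(t)
    finish, pos, clock = {}, {p: 0 for p in orders}, {p: 0 for p in orders}
    while len(finish) < len(owner):
        progressed = False
        for p, lst in orders.items():
            if pos[p] < len(lst):
                t = lst[pos[p]]
                if all(q in finish for q in preds[t]):  # wait for every predecessor
                    start = max([clock[p]] + [finish[q] for q in preds[t]])
                    finish[t] = start + 1
                    clock[p] = finish[t]
                    pos[p] += 1
                    progressed = True
        if not progressed:
            raise RuntimeError("deadlock: incoherent local orders")
    return finish
```

Swapping two dependent tasks in one processor's list (for example running 5 before 3 on processor a) would make the simulation raise the deadlock error, which is the coherence problem the slide mentions.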

Slide 31: Scheduling
Wennink's scheduling: multi-processor tasks + insertion principle
[Figure: schedules obtained with the simple and with Wennink's algorithm]
Complexity:
                    Simple        Wennink
    Computations    O(N log N)    O(N²)
    Memory          O(|E|)        O(N² + |E|)

Slide 32: Static Execution / Dynamic Execution
- HPF context: task costs are not known at compile time => unit costs
- Static critical path = longest path (in edges) to the virtual « End » vertex
- Static scheduling: static order of execution
- Iterative program: the first iteration records the task times, then the tasks are rescheduled
- Dynamic scheduling
[Figure: task graph with tasks t1 to t11 and the virtual End vertex; static and dynamic orders on processors a to d]
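The difference between static and dynamic priorities fits in a few lines of Python: the same critical-path computation runs with unit costs and with task times recorded during a first iteration. The graph and the measured times are hypothetical.

```python
def critical_path(succs, tasks, cost):
    """Longest path from each task to the virtual End vertex, each task weighted by cost(t)."""
    memo = {}
    def cp(t):
        if t not in memo:
            memo[t] = cost(t) + max((cp(s) for s in succs.get(t, [])), default=0)
        return memo[t]
    return {t: cp(t) for t in tasks}

def priority_order(succs, tasks, cost):
    """Tasks by decreasing critical path, ties broken by lower rank."""
    cp = critical_path(succs, tasks, cost)
    return sorted(tasks, key=lambda t: (-cp[t], t))
```

With unit costs the two chains 1 -> 2 and 3 -> 4 look equivalent and the lower ranks go first; once the first iteration shows that task 4 is expensive, the dynamic order starts with chain 3 -> 4.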

Slide 33: Single-Thread / Multi-Thread Execution
Independent tasks on the same processor with the same priority: which task goes first?
- Single thread: the task with the lower rank first
- Multi-thread: both, using a user-mode thread system (Marcel, from PM² High Perf)
Communications are overlapped by computations.
[Figure: timelines of tasks K and K' (computations, waiting for communication, communications), single-thread vs. multi-thread]

Slide 34: Experimental Results, Matrix Stored by Columns
Cholesky factorization of a sparse matrix with column-block access
- Irregular data structure: TREE
- Distribution: INDIRECT (minimizing communications)
Versions:
- VSet: V0 + Set
- Stat: VSet + SCHEDULE (static simple scheduling)
- Dyn: VSet + SCHEDULE (dynamic simple scheduling)
- Stat_th: Stat + threads
- W: VSet + SCHEDULE (dynamic Wennink's scheduling)
[Figure: results on the IBM SP2 at LaBRI, 2D grid 511x511]

Slide 35: Plan (outline repeated; next part: progressive irregular prefix operation)

Slide 36: Irregular Progressive PREFIX Operation
- The irregular progressive PREFIX operation is found in PPLD loops
- Irregular coefficient
- Independencies are exploited with specific communication schemes

Slide 37: Irregular Progressive PREFIX Operation
[Figure: asynchronous communication vs. synchronous REDUCTION]

Slide 38: Irregular Progressive PREFIX Operation
The PREFIX directive/clause differs from the REDUCTION clause. Two equivalent sequential forms of the loop:
    DO I = 1, N
      DO J = I+1, N
        IF (TEST(J,I)) THEN
          A(J) = A(J) + A(I)
        END IF
      END DO
    END DO

    DO I = 1, N
      B = 0.0
      DO J = 1, I-1
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + B
    END DO
Directives: !HPF$ INDEPENDENT, REDUCTION(B) versus !HPF$ INDEPENDENT, PREFIX(B) or !HPF$ PREFIX(B).
Generated code with PREFIX:
    Inspection(A, TEST)
    DO I = lb, ub                    ! ON HOME A(I)
      Finalize(A(I))                 ! receive the contributions previously sent
      DO J = I+1, N
        IF (TEST(J,I)) THEN
          A'(J) = A'(J) + A(I)       ! sent when ready
        END IF
      END DO
    END DO
Generated code with REDUCTION:
    DO I = 1, N                      ! active set Set(I)
      B = 0.0
      DO J = lb, ub                  ! ON HOME A(J)
        IF (TEST(I,J)) THEN
          B = B + A(J)
        END IF
      END DO
      A(I) = A(I) + REDUCTION(B)
    END DO
[Figure: results on the IBM SP2 at LaBRI]
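A Python sketch checking that the PREFIX-style code (push contributions into partial sums A', then finalize) computes the same result as the REDUCTION-style loop. The CYCLIC owner map, the number of processors and the TEST relation in the usage are hypothetical, and the asynchronous messages are modeled by per-processor arrays of partial sums.

```python
def pull_reduction(a, test):
    """REDUCTION form: A(I) gathers contributions from every J < I with TEST(I,J)."""
    a = list(a)
    for i in range(1, len(a) + 1):
        b = sum(a[j - 1] for j in range(1, i) if test(i, j))
        a[i - 1] += b
    return a

def push_prefix(a, test, nprocs):
    """PREFIX form: once A(I) is final, its owner pushes it into partial sums A'(J)
    for every J > I with TEST(J,I); the owner of A(J) finalizes it by adding the
    partial sums received from every processor."""
    n = len(a)
    a = list(a)
    owner = lambda k: (k - 1) % nprocs          # hypothetical CYCLIC mapping
    partial = [[0] * n for _ in range(nprocs)]   # partial[p][J-1] = A'(J) held on p
    for i in range(1, n + 1):
        a[i - 1] += sum(partial[p][i - 1] for p in range(nprocs))  # Finalize(A(I))
        p = owner(i)
        for j in range(i + 1, n + 1):
            if test(j, i):
                partial[p][j - 1] += a[i - 1]    # contribution "sent when ready"
    return a
```

Both forms use the same relation (TEST(I,J) with J < I), only the direction of the data flow changes: the REDUCTION form needs a synchronous collective at every I, while the PREFIX form lets contributions travel as soon as A(I) is final.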

Slide 39: Irregular Progressive PREFIX Operation, Cholesky Example
Irregular coefficient = 0.1%
[Figure: results on the IBM SP2 at LaBRI, 2D grid 511x511]

Slide 40: Conclusion
- TREE: irregular data structure giving more information at compile time; locality and dependence analysis => TriDenT
- Inspection/execution for the information still unknown at compile time => CoLUMBO
  - Irregular active processor sets: fundamental inspection/execution; improvement of up to a factor of 10
  - Irregular iteration spaces: minor improvement
  - Loops with partial loop-carried dependencies: DAG associated with the loop iterations; semi-automatic task scheduling at run time
  - PREFIX operation
  - Inspection costs repaid after only one iteration
- Experimental results: efficiency close to hand-written code (time ratio between 1.25 and 2.5)

Slide 41: Perspectives
- Integration in an HPF compiler: preliminary experiments
  - TREE: ADAPTOR
  - Set inspection/execution and PREFIX inspection/execution: NESTOR (Silber 98)
- Transposition to other parallel languages
  - Irregular data structures are always a problem => TREE
  - Irregular iteration spaces
  - OpenMP: virtual shared memory => data distribution; irregular active processor sets

