Eliminating affinity tests and simplifying shared accesses in UPC Rahul Garg*, Kit Barton*, Calin Cascaval** Gheorghe Almasi**, Jose Nelson Amaral* *University.


1 Eliminating affinity tests and simplifying shared accesses in UPC
Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*
*University of Alberta, **IBM Research

2 UPC : Unified Parallel C
Partitioned Global Address Space
[Figure: the address space partitioned among threads 0 to 5, THREADS = 6]

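3 Shared arrays
Arrays can be shared between all threads.
E.g. : shared [2] double A[9]; assuming THREADS=3
1-d block-cyclic distribution : similar to HPF cyclic(k)
[Figure: elements 0 to 8 of A; with blocks of 2 over 3 threads, elements 0-1 and 6-7 sit on thread 0, 2-3 and 8 on thread 1, 4-5 on thread 2]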
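The ownership rule behind this layout can be sketched in plain C (not UPC). This is a minimal illustration that assumes blocks are dealt to threads round-robin starting at thread 0; `owner_1d` is a hypothetical helper, not a UPC library call.

```c
#include <assert.h>

/* Owning thread of element i of `shared [bf] T A[n]` under a 1-d
   block-cyclic layout: consecutive blocks of bf elements are handed
   to threads 0, 1, ..., threads-1 and then around again.
   Hypothetical helper for illustration only. */
int owner_1d(int i, int bf, int threads)
{
    assert(i >= 0 && bf > 0 && threads > 0);
    return (i / bf) % threads;
}
```

Calling `owner_1d(i, 2, 3)` for i = 0..8 corresponds to the slide's `shared [2] double A[9]` example with THREADS=3.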

4 Vector addition example
    #include <upc.h>
    shared [2] double A[10];
    shared [3] double B[10], C[10];
    int main(){
        int i;
        /* iteration i runs on the thread that owns C[i] */
        upc_forall(i=0; i<10; i++; &C[i])
            C[i] = A[i] + B[i];
    }

5 Outline of talk
- upc_forall loops : syntax and uses
- Compiling upc_forall loops
- Data distributions in UPC
- Multiblocking distributions
- Privatization of access
- Results

6 upc_forall and affinity tests
upc_forall is a work distribution construct.
Form :
    shared [BF] double A[M];
    upc_forall(i=0; i<N; i++; &A[i]){
        //loop body
    }
The fourth expression (&A[i] here) is the "affinity test" expression. It determines which thread executes which iteration.

7 Affinity test elimination : naive
Source :
    shared [BF] double A[M];
    upc_forall(i=0; i<M; i++; &A[i]){
        //loop body
    }
Generated :
    shared [BF] double A[M];
    for(i=0; i<M; i++){
        if(upc_threadof(&A[i])==MYTHREAD){
            //loop body
        }
    }

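8 Affinity test elimination : optimized
Source :
    shared [BF] double A[M];
    upc_forall(i=0; i<M; i++; &A[i]){
        //loop body
    }
Generated :
    shared [BF] double A[M];
    /* outer loop steps from one block owned by this thread to the next */
    for(i=MYTHREAD*BF; i<M; i+=(BF*THREADS)){
        /* inner loop walks the block; j<M guards a partial last block */
        for(j=i; j<i+BF && j<M; j++){
            //loop body, with j in place of i
        }
    }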
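As a sanity check on the two translations, the sketch below models both loop shapes in plain C and records which indices a given thread would execute. It is an illustration under assumptions, not compiler output: `me` stands in for MYTHREAD, the expression `(i / bf) % threads` stands in for `upc_threadof(&A[i])`, and both function names are hypothetical.

```c
#include <assert.h>

/* Naive form: visit every index and keep those whose block belongs to `me`.
   Writes the selected indices to out[] and returns how many there are. */
int iters_naive(int m, int bf, int threads, int me, int *out)
{
    assert(m >= 0 && bf > 0 && threads > 0 && me >= 0 && me < threads);
    int n = 0;
    for (int i = 0; i < m; i++) {
        if ((i / bf) % threads == me) {   /* models the affinity test */
            out[n++] = i;
        }
    }
    return n;
}

/* Strip-mined form: jump from one owned block to the next, then walk
   the block. The j < m guard handles a partial last block. */
int iters_strided(int m, int bf, int threads, int me, int *out)
{
    assert(m >= 0 && bf > 0 && threads > 0 && me >= 0 && me < threads);
    int n = 0;
    for (int i = me * bf; i < m; i += bf * threads) {
        for (int j = i; j < i + bf && j < m; j++) {
            out[n++] = j;
        }
    }
    return n;
}
```

Both functions should produce the same index list, in the same order, for every thread. The strided form does so without evaluating an ownership test per iteration.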

9 Integer Affinity Tests
Source :
    upc_forall(i=0; i<M; i++; i){
        //loop body
    }
Generated :
    for(i=MYTHREAD; i<M; i+=THREADS){
        //loop body
    }
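With an integer affinity expression, iteration i belongs to thread i mod THREADS, which is why the generated loop can simply start at MYTHREAD and step by THREADS. A small plain-C sketch of that equivalence follows; `threads` and `me` stand in for THREADS and MYTHREAD, and both function names are hypothetical.

```c
#include <assert.h>

/* Integer affinity: iteration i runs on the thread numbered i mod threads. */
int runs_iteration(int i, int threads, int me)
{
    assert(i >= 0 && threads > 0 && me >= 0 && me < threads);
    return i % threads == me;
}

/* Number of iterations in [0, m) that the strided loop gives to `me`.
   Every index it visits is checked against the affinity rule above. */
int count_cyclic(int m, int threads, int me)
{
    assert(m >= 0 && threads > 0 && me >= 0 && me < threads);
    int n = 0;
    for (int i = me; i < m; i += threads) {
        assert(runs_iteration(i, threads, me));
        n++;
    }
    return n;
}
```

Summing `count_cyclic` over all threads gives back the full iteration count, so no iteration is dropped or executed twice.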

10 Data distributions for shared arrays
- The official UPC spec only supports 1-d block cyclic.
- The IBM xlupc compiler supports a more general data distribution : 'multi-dimensional blocking'.
- E.g. : shared [2][3] double A[5][5];
- Divide the array into multidimensional tiles.
- Distribute the tiles among processors in cyclic fashion.
- More general than the UPC spec, but not as general as ScaLAPACK or HPF.

11 Multidimensional Blocking
    shared [2][2] double A[5][5];
Owning thread of each element (the labels 0 to 3 imply four threads) :
    0 0 1 1 2
    0 0 1 1 2
    3 3 0 0 1
    3 3 0 0 1
    2 2 3 3 0
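The thread labels on this slide are consistent with numbering the tiles in row-major order and dealing them to threads cyclically. The plain-C sketch below encodes that reading. The row-major tile numbering is an assumption inferred from the slide, not a documented xlupc rule, and `owner_2d` is a hypothetical helper.

```c
#include <assert.h>

/* Owning thread of element (i, j) of `shared [bf0][bf1] T A[rows][cols]`,
   assuming tiles are numbered row by row and assigned to threads
   round-robin (an assumption inferred from the slide's figure). */
int owner_2d(int i, int j, int bf0, int bf1, int cols, int threads)
{
    assert(i >= 0 && j >= 0 && j < cols);
    assert(bf0 > 0 && bf1 > 0 && threads > 0);
    int tiles_per_row = (cols + bf1 - 1) / bf1;      /* ceiling division */
    int tile = (i / bf0) * tiles_per_row + (j / bf1);
    return tile % threads;
}
```

For the slide's 5x5 array with 2x2 tiles and four threads, the call is `owner_2d(i, j, 2, 2, 5, 4)`.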

12 Locality analysis and privatization
Consider :
    shared [2][3] double A[5][6], B[5][6];
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }
What code should we generate for the references A[i][j] and B[i+1][j]?

13 Shared access code generation
Source :
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }
Generated (every shared access becomes a runtime call) :
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            val = shared_deref(B,i+1,j);
            shared_assign(A,i,j,val);
        }
    }

14 Shared access code generation
Do we really need the function calls?
- A[i][j] is the affinity expression, so it is always local to the executing thread : it should only need a memory load/store.
- What about B[i+1][j] on an SMP? This should be just a load.
- What about on hybrids?
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }

15 Locality Analysis
[Figure: the area belonging to thread 0, and the area referenced by thread 0 for B[i+1][j]]
    for(i=0; i<4; i++)
        upc_forall(j=0; j<4; j++; &A[i][j])
            A[i][j] = B[i+1][j];

16 Locality Analysis : Intuition
- The locality can only change if the index (i+1) crosses a block boundary in some dimension.
- Block boundaries : 0, BF, 2*BF, ...
- (i+1)%BF==0 means (i+1) is on a block boundary.
- So we only need to test (i+1)%BF==0 to find the places where locality can change!
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }

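17 Locality Analysis
- Define the offset vector : [k1 k2]. Here k1=1, k2=0. k1 and k2 are integer constants.
- The reference crosses a block boundary at (i+k1)%BF==0.
- Cases (for k1 not a multiple of BF) :
    i%BF == (BF-k1%BF) : index i+k1 lands on a block boundary
    i%BF <  (BF-k1)    : index i+k1 stays in the same block as i
- We refer to this split point as a 'cut'.
    for(i=0; i<4; i++){
        upc_forall(j=0; j<4; j++; &A[i][j]){
            A[i][j] = B[i+1][j];
        }
    }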
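The cut condition can be checked against a direct block comparison. The plain-C sketch below assumes a single dimension, a non-negative index i, and an offset k with 0 <= k < bf. Both function names are hypothetical.

```c
#include <assert.h>

/* Direct test: do i and i+k fall in the same block of size bf? */
int same_block_by_division(int i, int k, int bf)
{
    assert(i >= 0 && bf > 0 && k >= 0 && k < bf);
    return (i + k) / bf == i / bf;
}

/* Cut-based test: only the position of i inside its block matters.
   Write i = q*bf + r with 0 <= r < bf; then i+k stays in block q
   exactly when r + k < bf, i.e. r < bf - k. */
int same_block_by_cut(int i, int k, int bf)
{
    assert(i >= 0 && bf > 0 && k >= 0 && k < bf);
    return i % bf < bf - k;
}
```

With bf = 2 and k = 1 the cut test reduces to `i % 2 < 1`, which matches the offset vector k1=1 on this slide.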

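18 Shared access code generation
    for(i=0; i<4; i++){
        if(i%2<1){
            /* i+1 stays in the same block as i : B[i+1][j] is local */
            upc_forall(j=0; j<4; j++; &A[i][j]){
                val = memory_load(B,i+1,j);
                memory_store(A,i,j,val);
            }
        }else{
            /* i+1 crosses into the next block : go through the runtime */
            upc_forall(j=0; j<4; j++; &A[i][j]){
                val = shared_deref(B,i+1,j);
                memory_store(A,i,j,val);
            }
        }
    }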

19 Locality analysis : algorithm
For each shared reference in the loop :
- Check if the blocking factor matches
- Check if the distance vector is constant
- If the reference is eligible :
    - Generate cut expressions
    - Put each cut in a sorted "cut list"
Then :
- Replicate the loop body as necessary
- Insert a memory load/store if the reference is local, otherwise insert an RTS call
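One possible reading of the "sorted cut list" step, for a single loop dimension, is sketched below in plain C. It is an illustration under assumptions, not the compiler's implementation: offsets are taken to be non-negative constants, each offset k with k % bf != 0 contributes the cut bf - k % bf, and the name `build_cut_list` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Three-way integer comparison for qsort. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Build the sorted, duplicate-free cut list for one loop dimension.
   offsets[] holds the constant index offsets of the eligible references.
   cuts[] must have room for n entries. Returns the number of cuts. */
int build_cut_list(const int *offsets, int n, int bf, int *cuts)
{
    assert(bf > 0 && n >= 0);
    int m = 0;
    for (int r = 0; r < n; r++) {
        assert(offsets[r] >= 0);
        int k = offsets[r] % bf;
        if (k > 0) {
            cuts[m++] = bf - k;   /* k == 0 never changes block position */
        }
    }
    qsort(cuts, (size_t)m, sizeof cuts[0], cmp_int);
    int u = 0;                    /* compact away repeated cuts */
    for (int r = 0; r < m; r++) {
        if (u == 0 || cuts[r] != cuts[u - 1]) {
            cuts[u++] = cuts[r];
        }
    }
    return u;
}
```

The returned cuts c1 < c2 < ... split the positions inside a block into the ranges [0, c1), [c1, c2), ..., [last, bf). Within each range every eligible reference keeps a fixed block relative to i, so a replicated loop body for that range can commit to either a plain load/store or a runtime call per reference.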

20 Improvements of locality analysis in isolation

21 Improvements of affinity test elimination in isolation

22 Results : Vector addition

23 Matrix-vector multiplication

24 Matrix-vector scalability

25 Conclusions
UPC requires extensive compiler support :
- upc_forall is a challenging construct to compile efficiently
- Shared access implementation requires compiler support
Optimizations working together produce good results :
- Compiler optimizations can produce >80x speedup over unoptimized code
- If one optimization fails, then results can still be bad

