Operating Systems and Architectures CS-M98: Coursework Solution. Dr. Benjamin Mora, Swansea University.


1 Operating Systems and Architectures CS-M98: Coursework Solution

2 Marking range
    >97: Full understanding of problem and solution. Ready for employment in the HPC sector. None of you (some very close though)!
    70 to 97: Almost there with multithreading. Just need to see and understand the solution. Most students are in this category.
    50 to 70: Real issues with multithreading concepts, merging temporary results, and a few basic C errors. Some hard work is really needed to understand the full solution.
    <50: Issues with basic (C) programming and algorithmic concepts, including pointers and creating data structures. Catching up is crucial!

3 Q1
    Alignment of data. Similar to the lab exercise. See CPU part 1. 35 marks.

4 Q1
    void AoS_to_SoA (float *image, int x, int y)
    {
        //Over-allocate so each plane can be shifted to a 32-byte boundary (PADDING must be at least 8 floats)
        imageRed=new float[x*y+PADDING];
        imageGreen=new float[x*y+PADDING];
        imageBlue=new float[x*y+PADDING];
        //Offset (in floats) of each allocation inside its 32-byte block.
        //The pointer itself is cast, not the value it points to.
        unsigned long long alignR=(((unsigned long long) imageRed)&31)/4;
        unsigned long long alignG=(((unsigned long long) imageGreen)&31)/4;
        unsigned long long alignB=(((unsigned long long) imageBlue)&31)/4;
        //Advance to the next 32-byte boundary (moves by 1 to 8 floats)
        alignedRed=imageRed+8-alignR;
        alignedGreen=imageGreen+8-alignG;
        alignedBlue=imageBlue+8-alignB;
        float *R=alignedRed;
        float *G=alignedGreen;
        float *B=alignedBlue;
        //Split the interleaved RGB image into three separate planes
        for (int i=0;i<x*y;i++)
        {
            R[i]=image[3*i];
            G[i]=image[3*i+1];
            B[i]=image[3*i+2];
        }
    }
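The alignment step can also be written by rounding the address up to the next multiple of 32, which is a slightly different technique from the slide's offset arithmetic. The following is only a minimal sketch: `alignTo32` and `splitRGB` are hypothetical helper names that do not appear in the coursework, and the caller is assumed to have over-allocated by at least 7 floats.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the slides): return the first address at or
// after p that is a multiple of 32 bytes. Assumes the caller allocated at
// least 7 extra floats so the shifted pointer stays inside the buffer.
float *alignTo32(float *p) {
    std::uintptr_t address = reinterpret_cast<std::uintptr_t>(p);
    std::uintptr_t aligned = (address + 31) & ~static_cast<std::uintptr_t>(31);
    return reinterpret_cast<float *>(aligned);
}

// Hypothetical helper: split an interleaved RGB image (array of structures)
// into three separate planes (structure of arrays).
void splitRGB(const float *image, std::size_t pixels,
              float *r, float *g, float *b) {
    for (std::size_t i = 0; i < pixels; i++) {
        r[i] = image[3 * i];
        g[i] = image[3 * i + 1];
        b[i] = image[3 * i + 2];
    }
}
```

Unlike the slide's version, this variant leaves an already aligned pointer where it is instead of moving it forward by 8 floats.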

5 Q2: Loop for K iterations
    for (int k=0;k<knnIterations;k++)
    {
        //1.init seed sums to 0
        for (int seed=0;seed<N;seed++)
        {
            seedSums[0][seed]=0;
            seedSums[1][seed]=0;
            seedSums[2][seed]=0;
            seedCounters[seed]=0;
        }
        …

6 Q2: Then
        …
        //2. Determine and compute average of closer seeds
        for (int pixel=0;pixel<x*y*3;pixel+=3)
        {
            float maxDistance=10;
            int found=-1;
            for (int seed=0;seed<N;seed++) //Loop to be optimized
            {
                float dx=image[pixel+0]-seeds[0][seed];
                float dy=image[pixel+1]-seeds[1][seed];
                float dz=image[pixel+2]-seeds[2][seed];
                float distanceSquare=dx*dx+dy*dy+dz*dz;
                if (distanceSquare<maxDistance)
                {
                    //A closer seed has been found
                    maxDistance=distanceSquare;
                    found=seed;
                }

7 Q2: Recompute new seeds
        //Last step for the iteration: compute average and update the current seed list
        for (int seed=0;seed<N;seed++)
        {
            if (seedCounters[seed]>0.01)
            {
                seeds[0][seed]=seedSums[0][seed]/seedCounters[seed];
                seeds[1][seed]=seedSums[1][seed]/seedCounters[seed];
                seeds[2][seed]=seedSums[2][seed]/seedCounters[seed];
            }
        …//End of iteration

8 Q2: Optimizing the inner loop
    Process 8 pixels at a time.
    Compare 8 pixels against one seed!
    Some were confused and tried 8 pixels vs 8 seeds.
    Use cmplt and blend to replace the condition.
    Two blend instructions are needed!
    Some replicated the mask computations!
    The part after the inner loop cannot be parallelized though.
    Still a good speed-up using SIMD, especially when the number of seeds > 32.
    Many ways to do it.
    Extra cast computations were done by all of you!
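The "8 pixels against one seed" pattern with a compare followed by two selects can be illustrated without any SIMD hardware. The sketch below is a portable scalar emulation, not the course's `float8` library: `Lanes` and `nearestSeed8` are hypothetical names, and the ternary expressions merely stand in for the `cmplt`/`blend` pair (a compiler may or may not turn them into real select instructions).

```cpp
#include <array>
#include <cstddef>

// Portable stand-in for an 8-lane SIMD register (assumption: the course's
// float8 type wraps an AVX register; here it is emulated with plain floats).
using Lanes = std::array<float, 8>;

// Hypothetical helper: for 8 pixels held as separate R, G, B lanes, return
// the index of the nearest seed in each lane. Every seed is compared against
// all 8 pixels, and the result is selected rather than branched on.
std::array<int, 8> nearestSeed8(const Lanes &r, const Lanes &g, const Lanes &b,
                                const float *seedR, const float *seedG,
                                const float *seedB, int seedCount) {
    Lanes best;
    best.fill(10.0f);              // same initial bound as in the slides
    std::array<int, 8> found;
    found.fill(-1);
    for (int seed = 0; seed < seedCount; seed++) {
        for (std::size_t lane = 0; lane < 8; lane++) {
            float dx = r[lane] - seedR[seed];
            float dy = g[lane] - seedG[seed];
            float dz = b[lane] - seedB[seed];
            float d2 = dx * dx + dy * dy + dz * dz;
            bool closer = d2 < best[lane];               // role of cmplt
            best[lane] = closer ? d2 : best[lane];       // first blend
            found[lane] = closer ? seed : found[lane];   // second blend
        }
    }
    return found;
}
```

The same mask (`closer`) feeds both selects, which is why the mask only needs to be computed once per seed.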

9 Q2
    Optimization comes from:
        Processing 8 pixels at a time.
        Removing the branch (no if/then).
    Still tricky to get a good speed-up.
    Going further:
        Loop unrolling.
        Minimize the number of computations inside the inner loop.
        Put all constant operations like set1 outside the loop.
        Avoid shared cache lines when multithreading!
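On the last point: when each thread writes to its own slice of a shared array, neighbouring slices can still end up on the same cache line unless they are padded. One possible way to avoid that is sketched below, assuming 64-byte cache lines; `kCacheLine`, `kSeeds`, and `ThreadSums` are hypothetical names chosen for this illustration.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines
constexpr int kSeeds = 16;              // hypothetical seed count

// Per-thread accumulators. alignas makes every instance start on its own
// cache line and pads sizeof(ThreadSums) up to a multiple of kCacheLine,
// so two threads never write to the same line.
struct alignas(kCacheLine) ThreadSums {
    float sums[3][kSeeds];
    float counters[kSeeds];
};

static_assert(alignof(ThreadSums) == kCacheLine,
              "each instance starts on a cache-line boundary");
static_assert(sizeof(ThreadSums) % kCacheLine == 0,
              "adjacent array elements do not share a cache line");
```

An array such as `ThreadSums perThread[4];` then gives each of four threads a private, line-aligned block to accumulate into.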

10 Q2: Loop for K iterations
    float seedSums[3][N];
    float seedCounters[N];
    //Seed initialization;
    for(int j=0;j<3;j++)
        for(int i=0;i<N;i++)
            seeds[j][i]=(rand()+0.5f)/(RAND_MAX+1.f);
    for (int k=0;k<knnIterations;k++)
    {
        for (int seed=0;seed<N;seed++)
        {
            seedSums[0][seed]=0;
            seedSums[1][seed]=0;
            seedSums[2][seed]=0;
            seedCounters[seed]=0;
        }

11 Q2: Loop for K iterations
    float seedSums[3][N];
    float seedCounters[N];
    float8 seedId[N];
    for (int seed=0;seed<N;seed++)
        seedId[seed]=set1((float &) seed);
    for(int j=0;j<3;j++)
        for(int i=0;i<N;i++)
            seeds[j][i]=(rand()+0.5f)/(RAND_MAX+1.f);
    for (int k=0;k<knnIterations;k++)
    {
        float8 seeds8[3][N];
        for (int seed=0;seed<N;seed++)
        {
            seedSums[0][seed]=0;
            seedSums[1][seed]=0;
            seedSums[2][seed]=0;
            seedCounters[seed]=0;
            seeds8[0][seed]=set1(seeds[0][seed]);
            seeds8[1][seed]=set1(seeds[1][seed]);
            seeds8[2][seed]=set1(seeds[2][seed]);
        }

12 Q2: Then
        …
        //2. Determine and compute average of closer seeds
        for (int pixel=0;pixel<x*y*3;pixel+=3)
        {
            float maxDistance=10;
            int found=-1;
            for (int seed=0;seed<N;seed++) //Loop to be optimized
            {
                float dx=image[pixel+0]-seeds[0][seed];
                float dy=image[pixel+1]-seeds[1][seed];
                float dz=image[pixel+2]-seeds[2][seed];
                float distanceSquare=dx*dx+dy*dy+dz*dz;
                if (distanceSquare<maxDistance)
                {
                    //A closer seed has been found
                    maxDistance=distanceSquare;
                    found=seed;
                }

13 Q2: Then
        float8 *R=(float8 *) alignedRed;
        float8 *G=(float8 *) alignedGreen;
        float8 *B=(float8 *) alignedBlue;
        for (int pixel=0;pixel<x*y;pixel+=8)
        {
            float8 maxDistance=set1(10);
            float8 found8=set1(-1.f); //Just for initialization
            for (int seed=0;seed<N;seed++) //Loop to be optimized
            {
                float8 dx=sub8(R[0],seeds8[0][seed]);
                float8 dy=sub8(G[0],seeds8[1][seed]);
                float8 dz=sub8(B[0],seeds8[2][seed]);
                float8 distanceSquare=add8(add8(mul8(dx,dx),mul8(dy,dy)),mul8(dz,dz));
                float8 comparison=cmplt8(distanceSquare,maxDistance);
                maxDistance=blend8(maxDistance,distanceSquare,comparison);
                found8=blend8(found8,seedId[seed],comparison);
            }

14 Q2: Then
            //Sum the pixel values to the appropriate seed
            for (int i=0;i<8;i++)
            {
                int found=(int&) found8.m256_f32[i];
                seedCounters[found]+=1.;
                seedSums[0][found]+=((float *) R)[i];
                seedSums[1][found]+=((float *) G)[i];
                seedSums[2][found]+=((float *) B)[i];
            }
            R++; G++; B++;
        }
        …
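The casts `(float &) seed` and `(int&) found8.m256_f32[i]` do not convert values; they carry the raw bits of an integer seed index through a float lane and read them back afterwards. A more portable way to express the same bit copy is `std::memcpy`, sketched here with two hypothetical helper names that are not part of the coursework code.

```cpp
#include <cstring>

static_assert(sizeof(int) == sizeof(float),
              "the bit copy assumes int and float have the same size");

// Hypothetical helper: store the bits of an int inside a float, unchanged.
float intBitsAsFloat(int value) {
    float result;
    std::memcpy(&result, &value, sizeof result);
    return result;
}

// Hypothetical helper: recover the int whose bits were stored in the float.
int floatBitsAsInt(float value) {
    int result;
    std::memcpy(&result, &value, sizeof result);
    return result;
}
```

This works for the blend-based selection because a blend only moves bits between registers and performs no arithmetic on them.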

15 Q2: Recompute new seeds
    Still the same!!!
        //Last step for the iteration: compute average and update the current seed list
        for (int seed=0;seed<N;seed++)
        {
            if (seedCounters[seed]>0.01)
            {
                seeds[0][seed]=seedSums[0][seed]/seedCounters[seed];
                seeds[1][seed]=seedSums[1][seed]/seedCounters[seed];
                seeds[2][seed]=seedSums[2][seed]/seedCounters[seed];
            }
        …//End of iteration

16 Q3
    Most of you got the principles more or less right.
    The practical implementation was wrong!
    Barriers were sometimes at the wrong location.
    Most of you added extra, unneeded barriers.
    Mutexes were accepted.
    Putting a lock on every seed change is too much and hurts performance!
    Error: only using the results from one thread at each iteration.

17 Q3: Idea
    Break down the image into 4 pieces.
    For each thread iteration:
        Copy seeds into local variables (performance).
        Loop over the current chunk of pixels.
        Compute seedSums and seedCounters the same way.
        Copy results into globally visible but separate variables.
    Barrier.
    One thread:
        Adds the results from the other threads to its own results.
        Then computes the RGB average and updates the seeds.
    Barrier.
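The two-barrier pattern described above (publish partial results, barrier, one thread merges, barrier) can be tried out on a much simpler problem. The sketch below sums an array with POSIX threads; it is an illustration under assumptions, not the coursework solution, and all names (`kThreads`, `gPartial`, `worker`, `parallelSum`) are hypothetical.

```cpp
#include <cstdint>
#include <pthread.h>

namespace {
constexpr int kThreads = 4;        // hypothetical thread count
pthread_barrier_t gBarrier;
const float *gData = nullptr;
int gCount = 0;
float gPartial[kThreads];          // one slot per thread, no locking needed
float gTotal = 0.0f;

void *worker(void *arg) {
    int id = static_cast<int>(reinterpret_cast<std::intptr_t>(arg));
    int chunk = gCount / kThreads;
    int first = id * chunk;
    int last = (id == kThreads - 1) ? gCount : first + chunk;
    float local = 0.0f;            // accumulate privately first
    for (int i = first; i < last; i++) local += gData[i];
    gPartial[id] = local;          // publish in a separate global slot
    pthread_barrier_wait(&gBarrier);   // all partial results are now written
    if (id == 0) {                 // a single thread merges the results
        float total = 0.0f;
        for (int t = 0; t < kThreads; t++) total += gPartial[t];
        gTotal = total;
    }
    pthread_barrier_wait(&gBarrier);   // merged result is visible to everyone
    return nullptr;
}
}  // namespace

float parallelSum(const float *data, int count) {
    gData = data;
    gCount = count;
    pthread_barrier_init(&gBarrier, nullptr, kThreads);
    pthread_t threads[kThreads];
    for (int i = 0; i < kThreads; i++)
        pthread_create(&threads[i], nullptr, worker,
                       reinterpret_cast<void *>(static_cast<std::intptr_t>(i)));
    for (int i = 0; i < kThreads; i++)
        pthread_join(threads[i], nullptr);
    pthread_barrier_destroy(&gBarrier);
    return gTotal;
}
```

Only two barriers per round are required: one before the merge, so that every partial result exists, and one after it, so that no thread continues with stale data. On most toolchains this needs to be built with `-pthread`.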

18 Q3: Creating threads
    void knnCompressionSIMDPosix(float *image, int x, int y)
    {
        AoS_to_SoA(image,x,y);
        threadJobSize=x*y/nbThreads;
        pthread_t threads[nbThreads];
        pthread_barrier_init(&barrier, NULL, nbThreads);
        //Pass the thread index through the pointer argument; widen it first so the cast matches the pointer size
        for (int i=0;i<nbThreads;i++)
            pthread_create(&threads[i], NULL, posixThread, (void *) (long long) i);
        for (int i=0;i<nbThreads;i++) //separate loop
            pthread_join(threads[i], NULL);
    }

19 Q3: Thread's job
    void * posixThread(void *arg)
    {
        long long threadNumber=(long long) arg;
        int firstPixel=threadNumber*threadJobSize;
        int lastPixel=firstPixel+threadJobSize;
        float seedSums[3][N];
        float seedCounters[N];
        //Seed initialization;
        float8 seedId[N];
        for (int seed=0;seed<N;seed++)
            seedId[seed]=set1((float &) seed);
        if (threadNumber==0)
            for(int j=0;j<3;j++)
                for(int i=0;i<N;i++)
                    seeds[j][i]=(rand()+0.5f)/(RAND_MAX+1.f);
        pthread_barrier_wait(&barrier);

20 Q3: Thread's job
        for (int k=0;k<knnIterations;k++)
        {
            … Seed initialization is the same
            float8 *R=(float8 *) (alignedRed+firstPixel);
            float8 *G=(float8 *) (alignedGreen+firstPixel);
            float8 *B=(float8 *) (alignedBlue+firstPixel);
            for (int pixel=firstPixel;pixel<lastPixel;pixel+=8)
            {
                … loop code does not change …
                R++;G++;B++;
            }

21 Q3: Merging results
            for (int seed=0;seed<N;seed++)
            {
                temporaryResults[threadNumber][0][seed]=seedSums[0][seed];
                temporaryResults[threadNumber][1][seed]=seedSums[1][seed];
                temporaryResults[threadNumber][2][seed]=seedSums[2][seed];
                temporaryCounters[threadNumber][seed]=seedCounters[seed];
            }
            pthread_barrier_wait(&barrier);

22 Q3: Merging results
            if (threadNumber==0)
            {
                for (int thread=1;thread<nbThreads;thread++)
                    for (int seed=0;seed<N;seed++)
                    {
                        temporaryResults[0][0][seed]+=temporaryResults[thread][0][seed];
                        temporaryResults[0][1][seed]+=temporaryResults[thread][1][seed];
                        temporaryResults[0][2][seed]+=temporaryResults[thread][2][seed];
                        temporaryCounters[0][seed]+=temporaryCounters[thread][seed];
                    }
                …

23 Q3: Merging results
                for (int seed=0;seed<N;seed++)
                {
                    if (temporaryCounters[0][seed]>0.01)
                    {
                        seeds[0][seed]=temporaryResults[0][0][seed]/temporaryCounters[0][seed];
                        seeds[1][seed]=temporaryResults[0][1][seed]/temporaryCounters[0][seed];
                        seeds[2][seed]=temporaryResults[0][2][seed]/temporaryCounters[0][seed];
                    }
                }
            } //end condition threadNumber==0
            //The barrier sits outside the condition so that every thread reaches it
            pthread_barrier_wait(&barrier); //end of iteration, seeds have been updated!

