Published by Damaris Pryde. Modified over 4 years ago.

1
Presented by: Tal Klein, Omer Manor

2
Digital Interactive Photomontage
The project focuses on digital photomontage: a computer-assisted framework for combining parts of a set of photographs into a single composite picture.
We focused on one feature: extended depth of field (DOF).
Extended DOF matters most in macro photography, where the depth of field is very shallow.

3
Digital Interactive Photomontage
Extended DOF allows a photographer to take several pictures of the same frame, focusing on a different area in each picture, and then combine them into one image.
Alongside its benefits in photography, extended DOF is a "Heavy Resource Consumer" due to the complex calculations and image manipulations it needs. Our goal was therefore to speed up this process.

4
Digital Interactive Photomontage

5
System Configuration
Intel® Core 2 Duo E6600 @ 2.4 GHz
2 GB RAM
Microsoft Windows XP x64
Due to the nature of our platform (2 cores), we assumed that optimization could achieve a major boost in performance.

6
The Optimization Process
1. Analyzing the application
2. Code Optimization
3. SIMD
4. Multithreading

7
Analyzing The Application
We analyzed the application in 3 different ways:
1. Intel's VTune performance analyzer, to search for our program's bottlenecks.
2. Counters of our own, added to functions we suspected of being called many times.
3. Call graph (using Intel's VTune).
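The hand-written counters from item 2 could look roughly like the sketch below. It is an illustration only, not the presenters' instrumentation: `callCounts`, `callCount`, `COUNT_CALL`, `squaredDiff` and `counterDemo` are hypothetical names, and the registry is not thread-safe, so it suits only the serial build.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry mapping a function name to its call count.
// Not thread-safe: meant for profiling the single-threaded build.
inline std::map<std::string, unsigned long>& callCounts() {
    static std::map<std::string, unsigned long> counts;
    return counts;
}

// Read a counter without creating an entry for unknown names.
inline unsigned long callCount(const std::string& name) {
    auto it = callCounts().find(name);
    return it == callCounts().end() ? 0UL : it->second;
}

// Put this at the top of a function suspected of being called very often.
#define COUNT_CALL() (++callCounts()[__func__])

// Stand-in for a small hot function (not taken from the application).
inline int squaredDiff(int a, int b) {
    COUNT_CALL();
    int k = a - b;
    return k * k;
}

// Usage: five calls must raise the counter by exactly five.
inline void counterDemo() {
    unsigned long before = callCount("squaredDiff");
    for (int i = 0; i < 5; ++i) squaredDiff(i, 2);
    assert(callCount("squaredDiff") - before == 5);
}
```

Dumping the registry at program exit gives a rough ranking of the most frequently called functions, which can then be cross-checked against the profiler's hotspots.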

8
Analyzing The Application
transformedPixel() - declared an unnecessary variable.
getDataCost() - we optimized the code and used SIMD instructions.
BVZ_interaction_penalty() - we optimized the code by merging two loops into one and used SIMD instructions.
displace() - we replaced the function call with a macro.
BVZ_data_penalty() - calls displace(), which we changed into a macro.

9
BVZ_Expand() - the function that calls these small functions and consumes the most CPU time. We used multithreading on it.

10
Code Optimization
1. Replacing FP variables with integer variables when no FP operation is needed
2. Merging 2 consecutive "for loops" into one
3. Removing the first of two assignments to the same pointer when the first value is never used
4. Replacing a function call with inline code (a macro)
5. Removing an unnecessary variable declaration

11
Code Optimization
Replacement of FP variables with integer variables
Merging of 2 consecutive "for loops" into one

Original Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c, k;
        float a,M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
        }
        M += sqrt(a);

Optimized Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c;
        int ap = 0, anp = 0;
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        M = sqrt(float(ap)) + sqrt(float(anp));

12
Code Optimization
Two assignments to the same pointer without using the 1st assignment

Original Code:
    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->_imptr(d,Coord(x,y));
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y

Optimized Code:
    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y

13
Code Optimization
Code replacement instead of function

Original Code:
    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        assert(0);
        Coord dp = p;
        _displace(dp,d);

Optimized Code:
    #define _displacedef(p,l) _idata->images(l)->displace(p)

    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        Coord dp = p;
        _displacedef(dp,d);

14
Code Optimization
Unnecessary variable declaration

Original Code:
    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            unsigned char *res = _data + 3*(p.y * _size.x + p.x);
            return res;
        }
        else
            return __black;
    }

Optimized Code:
    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            return (_data + 3*(p.y * _size.x + p.x));
        }
        else
            return __black;
    }

15
Code Optimization
Optimized Code vs. Original Code: Time Based Comparison
[Chart: Original Code vs. Optimized Code]
18% improvement!

16
SIMD - Single Instruction Multiple Data
The main point of using SIMD instructions is that 128-bit registers are available to us, so we should use them wisely.
We used these 128-bit registers in places in our code where we expected them to boost the application's performance.

17
SIMD

Original Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c, k;
        float a,M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M += sqrt(a);
            M /= 6.f;

Optimized Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c;
        __m128 SimdM;
        int ap = 0, anp = 0;
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        SimdM = _mm_sqrt_ps (_mm_set_ps(0,0,float(ap),float(anp)));
        M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f;

18
SIMD
In the following example we used SIMD to compute the dot product of 2 vectors.
To make the process efficient, the data must be aligned in memory on a 16-byte boundary, so we declared the array with the __declspec(align(16)) specifier.
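For comparison, here is a minimal, compiler-portable sketch of an SSE dot product over 16-byte-aligned data. It is not the presenters' code: `dotProductSse` is a hypothetical name, standard `alignas(16)` stands in for the MSVC-specific `__declspec(align(16))`, and the horizontal sum goes through `_mm_store_ps` instead of the MSVC-only `m128_f32` member. It assumes an x86 or x86-64 target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Dot product of two float arrays of length n.
// Precondition: both pointers are 16-byte aligned, as _mm_load_ps requires.
inline float dotProductSse(const float* a, const float* b, std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);

    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_load_ps(a + i);  // aligned load of 4 floats
        __m128 vb = _mm_load_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));  // 4 multiply-adds at once
    }

    // Horizontal sum: spill the 4 partial sums and add them in scalar code.
    alignas(16) float partial[4];
    _mm_store_ps(partial, acc);
    float sum = partial[0] + partial[1] + partial[2] + partial[3];

    // Scalar tail for the elements that do not fill a whole register
    // (for a 5x5 kernel this is the 25th element).
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

With a 5x5 kernel (25 values), the vector loop covers the first 24 elements in 6 iterations and the scalar tail handles the last one, which mirrors the structure of the slide's getDataCost example.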

19
SIMD

Original Code:
    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y,x, i;
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->_imptr(d,Coord(x,y));
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
                mean += lum*_gaussianK5[i];
            } // x
        } // y
        mean /= .9997f;

Optimized Code:
    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y,x, i;
        __declspec(align(16)) float lumarr[25];
        __m128 SimdMult;
        __m128 SimdMean;
        __m128 *pLumArr = (__m128*)lumarr;
        __m128 *pGaussArr = (__m128*)_gaussianK5;
        SimdMean = _mm_set1_ps (0.f);
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lumarr[i] = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            } // x
        } // y
        for (i = 0; i < 24; i+=4) {
            SimdMult = _mm_mul_ps (*pLumArr, *pGaussArr);
            SimdMean = _mm_add_ps (SimdMult, SimdMean);
            pLumArr++;
            pGaussArr++;
        }
        mean = SimdMean.m128_f32[0] + SimdMean.m128_f32[1] + SimdMean.m128_f32[2] + SimdMean.m128_f32[3];
        mean = (mean + lumarr[24]*_gaussianK5[24])/.9997f;

20
SIMD Optimization
SIMD vs. Original Code: Time Based Comparison
[Chart: Optimized Code vs. Original Code]
1.5% improvement??

21
SIMD
Instead of keeping the data (the variables ap and anp) in registers, the compiler stores it in memory. The sqrtps instruction then reads a full 128 bits from the locations that were just written, which causes a store-forwarding stall.
The use of SIMD speeds up the function by approximately 1 sec. However, the delay caused by the store forwarding is larger than the speedup gained from SIMD, and so we got a slowdown.

22
SIMD
Store Forwarding Blocked

    SimdM = _mm_sqrt_ps (_mm_set_ps(0,0,float(ap), float(anp)));
    0041133B xorps xmm0,xmm0
    0041133E sub eax,edi
    00411340 mov edi,dword ptr [esp+18h]
    00411344 add ecx,ebx
    00411346 movzx ebx,byte ptr [edi+2]
    0041134A mov edi,dword ptr [esp+20h]
    0041134E movzx edi,byte ptr [edi+2]
    00411352 sub edi,ebx
    00411354 mov ebx,eax
    00411356 imul ebx,eax
    00411359 mov eax,edi
    0041135B imul eax,edi
    0041135E movss dword ptr [esp+2Ch],xmm0
    00411364 movss dword ptr [esp+28h],xmm0
    0041136A add ecx,ebx
    0041136C cvtsi2ss xmm0,ecx
    00411370 movss dword ptr [esp+24h],xmm0
    00411376 add edx,eax

    M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f; if (_cuttype == C_GRAD) {
    00411378 cmp dword ptr [esi+50h],1
    0041137C cvtsi2ss xmm0,edx
    00411380 movss dword ptr [esp+20h],xmm0
    00411386 sqrtps xmm0,xmmword ptr [esp+20h]
    0041138B movaps xmmword ptr [esp+20h],xmm0
    00411390 movss xmm0,dword ptr [esp+24h]
    00411396 addss xmm0,dword ptr [esp+20h]
    0041139C mulss xmm0,dword ptr [__real@3e2aaaab (5397A0h)]
    004113A4 movss dword ptr [esp+0Ch],xmm0
    004113AA jne 004114AE

23
Multithreading
Our major attempt to improve the original application was to divide the massive calculation into two independent threads that run simultaneously, one on each core.
The main procedure used in this application is the function "compute".

24
Multithreading
Original Compute function flow
ITER_MAX - Defined so the external loop won't loop forever.
N - Number of pictures in the stack.
Step - Image index descriptor.
BVZ_Expand - Calculates max flow on the image's labels and returns the energy of the current step. According to the calculation, it also updates the final image labels (the outcome).
Inner Loop - Executed one time on each image.
External Loop - Runs as long as there is improvement in the max flow calculation. As long as the old energy (from the previous step) is bigger than the new one, we continue the iterations to the next image. If no improvement in the flow was made, we have achieved the maximum improvement and the function ends.
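The control flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the application's code: `computeSketch` and `ExpandFn` are hypothetical names, the callback stands in for BVZ_Expand and is assumed to return the energy after trying one image, and the stopping rule is taken to be "a full pass over all images brought no improvement".

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for BVZ_Expand: tries the expansion for image
// `step` given the current energy and returns the resulting energy.
using ExpandFn = std::function<float(int step, float oldEnergy)>;

// Serial compute loop: the external loop repeats (at most iterMax times)
// while a pass over the n images still lowers the energy.
inline float computeSketch(int n, int iterMax, float initialEnergy,
                           const ExpandFn& expand) {
    assert(n > 0 && iterMax > 0);
    float eOld = initialEnergy;
    for (int iter = 0; iter < iterMax; ++iter) {   // external loop
        bool improved = false;
        for (int step = 0; step < n; ++step) {     // inner loop: once per image
            float eNew = expand(step, eOld);
            if (eNew < eOld) {                     // old energy bigger: keep going
                eOld = eNew;
                improved = true;
            }
        }
        if (!improved) break;                      // no improvement: done
    }
    return eOld;
}
```

Each step needs the energy produced by the previous step, and this dependency is what forces synchronization once the steps are split across two threads.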

25
Multithreading
Optimized Compute function flow
Our goal was to parallelize the energy computation in each step so that we can advance by 2 steps in each iteration.
We calculate the odd steps (images) in thread 1 and the even steps in thread 2.
Thread synchronization appears in two places:
BVZ_Expand - The max-flow calculation is parallelized (both threads run it). At this point thread 2 waits for thread 1 to finish its energy calculation and label updates, so that thread 2 has the right E_old.
Compute - If thread 1 changed the labels, thread 2 must recalculate the last step on the updated labels.

26
Multithreading Optimization
Multithreading vs. Original Code: Time Based Comparison
[Chart: Optimized Code vs. Original Code]
25% improvement!

27
Multithreading Optimization
Multithreading vs. Original Code: Time Based Comparison
Theoretically, with 2 threads working simultaneously we would expect the runtime to drop by up to 50%.
Because the results of each thread depend on the previous iteration, synchronization points are required in the code.
Those synchronization points halt the threads and therefore cause delays.
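The gap between the theoretical and the measured gain can be put in numbers with Amdahl's law. The sketch below is a general model, not an analysis from the presentation: `amdahlSpeedup` and `runtimeReduction` are hypothetical helper names, and the model ignores the cost of the synchronization itself.

```cpp
#include <cassert>

// Amdahl's law: speedup on n threads when a fraction p (0..1) of the
// serial runtime can run in parallel and the rest stays serial.
inline double amdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// The same result expressed as the fraction of the original runtime saved.
inline double runtimeReduction(double p, int n) {
    return 1.0 - 1.0 / amdahlSpeedup(p, n);
}
```

On 2 threads, a fully parallel program (p = 1) runs twice as fast, which is a 50% runtime reduction. Under this model, a 25% reduction would correspond to roughly half of the runtime (p = 0.5) effectively running in parallel. That figure only illustrates the model and is not something the presenters measured.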

28
Multithreading 2
Second Attempt
We tried to improve the speed by taking a different approach to the synchronization.
In this attempt, each thread changes the labels (temporary labels) in its own memory segment, and we merge the results after both threads complete.
Each label that thread 2 changes is marked using an auxiliary array.
In the merging process, the labels are updated using the temp labels of thread 1, unless the specific label was also changed by thread 2. In that case, the label is updated using thread 2's temp labels.
We can see, however, that the results do not differ noticeably.
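The merge rule described above can be sketched as follows. This is an illustration only: `mergeLabels`, `Label` and the container types are assumptions, since the slides do not show how the temporary labels or the auxiliary array are stored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: a label is an image index (the slides pass labels as ushort).
using Label = unsigned short;

// Merge the per-thread temporary labels after both threads finish.
// changedBy2[i] is the auxiliary mark: true if thread 2 changed label i.
inline std::vector<Label> mergeLabels(const std::vector<Label>& temp1,
                                      const std::vector<Label>& temp2,
                                      const std::vector<bool>& changedBy2) {
    assert(temp1.size() == temp2.size());
    assert(temp1.size() == changedBy2.size());
    std::vector<Label> merged(temp1.size());
    for (std::size_t i = 0; i < merged.size(); ++i) {
        // Thread 1's label is the default; thread 2 wins where it made a change.
        merged[i] = changedBy2[i] ? temp2[i] : temp1[i];
    }
    return merged;
}
```

Because each thread writes only to its own buffer, no locking is needed while the threads run. The trade-off is one extra pass over the labels for the merge.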

29
Multithreading Optimization
Multithreading 2 vs. Original Code: Time Based Comparison
[Chart: Optimized Code vs. Original Code]
99.8% improvement!

30
Multithreading
Thread Profiler view of Multithreaded Code
[Thread Profiler screenshot; annotations mark our code's 2 threads]
Our program's utilization is high: except for our threads' sync points, both cores are working.

31
Multithreading
Thread Profiler view of Multithreaded Code
The serial part at the beginning of the program is required to load the image stack and to compute the first energy of the image. After that, our threads begin their computation.

32
Multithreading
Thread Profiler view of Multithreaded Code
The threads' sync points take little time: most of the code's runtime is spent with both threads running simultaneously.

33
Intel® Compiler
The Intel compiler did not run on our SIMD configuration (class error).
We used the Intel compiler on the 3 configurations we made and compared the runtimes with those of the same configurations built with Visual Studio's compiler.

34
Intel® Tuning Assistant
Using Intel's Tuning Assistant, we found no significant areas where our code caused a slowdown.
All events collected by the Tuning Assistant indicate that our optimization is satisfactory.
The store forwarding issue with SIMD was not detected as a "hotspot" because the time consumed by the faulty code was 0.7% of the overall time spent on the entire function (less than 1%).

35
Optimization Summary Time Comparison

36
Optimization Summary Speed Up Comparison (Original is 100%)

37
Thank you, Tal Klein & Omer Manor
