Published by Damaris Pryde. Modified over 2 years ago.

1
Presented by: Tal Klein, Omer Manor

2
Digital Interactive Photomontage
The project focuses on digital photomontage: a computer-assisted framework for combining parts of a set of photographs into a single composite picture.
We focused on one feature: extended depth of field (DOF).
Extended DOF matters most in macro photography, where the depth of field is very shallow.

3
Digital Interactive Photomontage
Extended DOF lets a photographer take several pictures of the same frame, focusing on a different area in each picture, and then combine them into one image.
Alongside its benefits in photography, extended DOF is a "heavy resource consumer" because of the complex calculations and image manipulations it needs; our goal was therefore to speed up this process.

4
Digital Interactive Photomontage

5
System Configuration
Intel® Core 2 Duo, 2.4 GHz
2 GB RAM
Microsoft Windows XP x64
Because our platform has 2 cores, we assumed that optimization could achieve a major boost in performance.

6
The Optimization Process
Analyzing the application
Code Optimization
SIMD
Multithreading

7
Analyzing The Application
We analyzed the application in 3 different ways:
1. The VTune performance analyzer, to search for our program's bottlenecks.
2. Counters of our own, added to functions we suspected were called many times.
3. A call graph (using Intel's VTune).
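The hand-added counters in item 2 could look roughly like the sketch below. Every name in it (CallCounters, interactionPenaltyStub, dataCostStub, runWorkload) is a hypothetical stand-in rather than code from the project; it only shows the idea of incrementing a counter at the top of each suspected hot function and reading the totals after a run.

```cpp
#include <cassert>

// Hypothetical per-function call counters (illustrative names only).
struct CallCounters {
    unsigned long long interactionPenalty = 0;
    unsigned long long dataCost = 0;
};

static CallCounters g_counters;

// Stand-in for a suspected hot function: count the call, then do the work.
static int interactionPenaltyStub(int a, int b) {
    ++g_counters.interactionPenalty;
    int d = a - b;
    return d * d;
}

// Second stand-in, called less often by the toy workload below.
static int dataCostStub(int v) {
    ++g_counters.dataCost;
    return v * 2;
}

// Toy workload; returns a checksum so the calls cannot be optimized away.
static long long runWorkload(int n) {
    assert(n >= 0);
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += interactionPenaltyStub(i, i + 1);    // every iteration
        if (i % 10 == 0) sum += dataCostStub(i);    // every 10th iteration
    }
    return sum;
}
```

After runWorkload returns, the counter values show which function dominates the call count, which is the kind of hint that complements a profiler's time-based view.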

8
Analyzing The Application
transformedPixel() - declares an unnecessary variable.
getDataCost() - we optimized the code and used SIMD instructions.
BVZ_interaction_penalty() - we optimized the code by merging two loops into one and used SIMD instructions.
displace() - we replaced the function call with a macro.
BVZ_data_penalty() - calls displace(), which we changed into a macro.

9
BVZ_Expand() - the function that calls the small functions and consumes the most CPU time. We applied multithreading to it.

10
Code Optimization
Replacement of FP variables with integer variables when no FP operation is needed
Merging of 2 consecutive "for" loops into one
Two assignments to the same pointer without using the 1st assignment
Code replacement (macro) instead of a function call
Unnecessary variable declaration

11
Code Optimization
Replacement of FP variables with integer variables
Merging of 2 consecutive "for" loops into one

Original Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c, k;
        float a, M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
        }
        M += sqrt(a);

Optimized Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c;
        int ap = 0, anp = 0;
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        M = sqrt(float(ap)) + sqrt(float(anp));
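To check that this kind of rewrite preserves the result, the two variants can be isolated into free functions and compared. The sketch below is a simplified stand-in, not the project's code: the function names and the four-pointer signature are assumptions, and each pointer is assumed to reference 3 colour bytes. The sum of three squared byte differences is at most 195075, which a float represents exactly, so both versions should agree bit for bit.

```cpp
#include <cassert>
#include <cmath>

// Two-loop version: one loop per pixel pair, float accumulator.
float penaltyTwoLoops(const unsigned char* a0, const unsigned char* b0,
                      const unsigned char* a1, const unsigned char* b1) {
    float acc = 0.f;
    for (int c = 0; c < 3; ++c) {
        int k = a0[c] - b0[c];
        acc += float(k * k);
    }
    float m = std::sqrt(acc);
    acc = 0.f;
    for (int c = 0; c < 3; ++c) {
        int k = a1[c] - b1[c];
        acc += float(k * k);
    }
    return m + std::sqrt(acc);
}

// Fused version: a single loop with integer accumulators,
// converted to float only once, right before the square roots.
float penaltyFused(const unsigned char* a0, const unsigned char* b0,
                   const unsigned char* a1, const unsigned char* b1) {
    int s0 = 0, s1 = 0;
    for (int c = 0; c < 3; ++c) {
        int k0 = a0[c] - b0[c];
        int k1 = a1[c] - b1[c];
        s0 += k0 * k0;
        s1 += k1 * k1;
    }
    return std::sqrt(float(s0)) + std::sqrt(float(s1));
}
```

A quick comparison on a handful of pixel values is enough to catch an indexing slip introduced while merging the loops.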

12
Code Optimization
Two assignments to the same pointer without using the 1st assignment

Original Code:
    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->_imptr(d,Coord(x,y));
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y

Optimized Code:
    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y

13
Code Optimization
Code replacement (macro) instead of a function call

Original Code:
    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        assert(0);
        Coord dp = p;
        _displace(dp,d);

Optimized Code:
    #define _displacedef(p,l) _idata->images(l)->displace(p)

    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        Coord dp = p;
        _displacedef(dp,d);

14
Code Optimization
Unnecessary variable declaration

Original Code:
    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            unsigned char *res = _data + 3*(p.y * _size.x + p.x);
            return res;
        }
        else
            return __black;
    }

Optimized Code:
    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            return (_data + 3*(p.y * _size.x + p.x));
        }
        else
            return __black;
    }

15
Code Optimization
Optimized Code vs. Original Code: Time-Based Comparison
[Chart: Original Code vs. Optimized Code]
18% improvement!

16
SIMD - Single Instruction Multiple Data
The main point of the SIMD instructions is that 128-bit registers are available to us, so a single instruction can process several values at once (for example, four 32-bit floats), provided we use them wisely.
We used these 128-bit registers in the places in our code where we thought they would boost our application's performance.

17
SIMD

Original Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c, k;
        float a, M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M += sqrt(a);
            M /= 6.f;

Optimized Code:
    float PortraitCut::BVZ_interaction_penalty {
        int c;
        __m128 SimdM;
        int ap = 0, anp = 0;
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        SimdM = _mm_sqrt_ps(_mm_set_ps(0,0,float(ap),float(anp)));
        M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f;

18
In the following example we used SIMD to compute a dot product of 2 vectors.
To make the process efficient, the data must be aligned in memory, so we used the __declspec(align(16)) specifier.

19
SIMD

Original Code:
    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y, x, i;
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->_imptr(d,Coord(x,y));
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lum = .3086f * (float)I[0] f * (float)I[1] + .082f * (float)I[2];
                mean += lum*_gaussianK5[i];
            } // x
        } // y
        mean /= .9997f;

Optimized Code:
    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y, x, i;
        __declspec(align(16)) float lumarr[25];
        __m128 SimdMult;
        __m128 SimdMean;
        __m128 *pLumArr = (__m128*)lumarr;
        __m128 *pGaussArr = (__m128*)_gaussianK5;
        SimdMean = _mm_set1_ps(0.f);
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lumarr[i] = .3086f * (float)I[0] f * (float)I[1] + .082f * (float)I[2];
            } // x
        } // y
        for (i = 0; i < 24; i+=4) {
            SimdMult = _mm_mul_ps(*pLumArr, *pGaussArr);
            SimdMean = _mm_add_ps(SimdMult, SimdMean);
            pLumArr++;
            pGaussArr++;
        }
        mean = SimdMean.m128_f32[0] + SimdMean.m128_f32[1] + SimdMean.m128_f32[2] + SimdMean.m128_f32[3];
        mean = (mean + lumarr[24]*_gaussianK5[24])/.9997f;
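The dot-product pattern used here (24 elements in groups of four, plus one leftover element) can be tried in isolation. The sketch below is a portable approximation, not the project's code: dotScalar, dotSimd, kTaps and the PM_HAVE_SSE guard are hypothetical names, alignas(16) stands in for __declspec(align(16)), and the lanes are read back through _mm_store_ps instead of the MSVC-specific m128_f32 member. On targets without SSE it falls back to the scalar loop.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define PM_HAVE_SSE 1
#else
#define PM_HAVE_SSE 0
#endif

const int kTaps = 25;  // 5x5 window, as in the loops above

// Reference: plain scalar dot product over 25 elements.
float dotScalar(const float* a, const float* b) {
    float s = 0.f;
    for (int i = 0; i < kTaps; ++i) s += a[i] * b[i];
    return s;
}

// SSE version: both arrays are assumed to be 16-byte aligned.
float dotSimd(const float* a, const float* b) {
#if PM_HAVE_SSE
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 24; i += 4) {
        // Multiply four pairs at once and accumulate per lane.
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    }
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);               // horizontal sum of the 4 lanes
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    return s + a[24] * b[24];               // the 25th element is handled separately
#else
    return dotScalar(a, b);
#endif
}
```

Because the SIMD version adds the products in a different order than the scalar loop, the two results can differ in the last bits for general floating-point inputs; a comparison with a small tolerance is the safer check in practice.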

20
SIMD Optimization
SIMD vs. Original Code: Time-Based Comparison
[Chart: Optimized Code vs. Original Code]
1.5% improvement??

21
SIMD
Instead of keeping the data (the variables ap and anp) in registers, the compiler stores them in memory, which blocks store forwarding when the sqrtps instruction reads them back as a single 128-bit operand.
The use of SIMD speeds up the function by approximately 1 sec; however, the delay caused by the blocked store forwarding is larger than the speedup SIMD gained, and so we got a slowdown.
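One way to try to avoid the round trip through memory is to build the vector and extract the result with register-only intrinsics. The sketch below is an assumption-laden illustration, not the authors' fix: sumOfRoots and PM_HAVE_SSE are hypothetical names, and whether the compiler really keeps everything in registers depends on the compiler and its settings, so the generated assembly would still need to be inspected.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define PM_HAVE_SSE 1
#else
#define PM_HAVE_SSE 0
#endif

// Returns sqrt(ap) + sqrt(anp) for two non-negative integer sums.
float sumOfRoots(int ap, int anp) {
    assert(ap >= 0 && anp >= 0);
#if PM_HAVE_SSE
    const __m128 zero = _mm_setzero_ps();
    // Convert each int directly into lane 0 of a register,
    // then interleave: v = { anp, ap, 0, 0 }.
    __m128 v = _mm_unpacklo_ps(_mm_cvtsi32_ss(zero, anp),
                               _mm_cvtsi32_ss(zero, ap));
    __m128 r = _mm_sqrt_ps(v);                                  // both roots at once
    __m128 hi = _mm_shuffle_ps(r, r, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast lane 1
    return _mm_cvtss_f32(_mm_add_ss(r, hi));                    // lane 0 + lane 1
#else
    return std::sqrt(float(ap)) + std::sqrt(float(anp));
#endif
}
```

With only two square roots per call, the plain scalar version may well be just as fast; the sketch is mainly useful for comparing the generated code against the listing that shows the blocked store forward.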

22
SIMD
Store Forwarding Blocked

M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f;
if (_cuttype == C_GRAD) {
    cmp      dword ptr [esi+50h],
    cvtsi2ss xmm0,edx
    movss    dword ptr [esp+20h],xmm
    sqrtps   xmm0,xmmword ptr [esp+20h]
    movaps   xmmword ptr [esp+20h],xmm
    movss    xmm0,dword ptr [esp+24h]
    addss    xmm0,dword ptr [esp+20h]
    mulss    xmm0,dword ptr (5397A0h)]
    movss    dword ptr [esp+0Ch],xmm
    jne

SimdM = _mm_sqrt_ps (_mm_set_ps(0,0,float(ap), float(anp)));
    xorps    xmm0,xmm
    sub      eax,edi
    mov      edi,dword ptr [esp+18h]
    add      ecx,ebx
    movzx    ebx,byte ptr [edi+2]
    mov      edi,dword ptr [esp+20h]
    movzx    edi,byte ptr [edi+2]
    sub      edi,ebx
    mov      ebx,eax
    imul     ebx,eax
    mov      eax,edi
    imul     eax,edi
    movss    dword ptr [esp+2Ch],xmm
    movss    dword ptr [esp+28h],xmm
    add      ecx,ebx
    cvtsi2ss xmm0,ecx
    movss    dword ptr [esp+24h],xmm
    add      edx,eax

23
Multithreading
Our major attempt to improve the original application was to divide the massive calculation into two independent threads that run simultaneously, one on each core.
The main procedure used in this application is the function "compute".

24
Multithreading
Original Compute function flow
ITER_MAX - Defined so the external loop won't loop forever.
N - Number of pictures in the stack.
Step - Image index descriptor.
BVZ_Expand - Calculates the max flow on the image's labels and returns the energy of the current step. According to the calculation, it also updates the final image labels (the outcome).
Inner Loop - Executed once for each image.
External Loop - Runs as long as there is improvement in the max-flow calculation. As long as the old energy (from the previous step) is bigger than the new one, we continue iterating to the next image. If no improvement in the flow was made, we have achieved the maximum improvement and the function ends.
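The flow described above can be sketched as a pair of nested loops. This is a simplified reading of the slide, not the project's compute function: computeSketch, the expand callback (a stand-in for BVZ_Expand that only returns an energy), the iterationsOut parameter and the value chosen for ITER_MAX are all assumptions made for illustration.

```cpp
#include <cassert>
#include <functional>

const int ITER_MAX = 100;  // assumed value; only its role as a safety cap matters

// expand(step, E_old) is a hypothetical stand-in for BVZ_Expand:
// it returns the energy after trying image `step` against the current labels.
double computeSketch(int N, double E_initial,
                     const std::function<double(int, double)>& expand,
                     int* iterationsOut) {
    assert(N > 0);
    double E_old = E_initial;
    int iter = 0;
    bool improved = true;
    // External loop: repeat while some step lowered the energy.
    while (improved && iter < ITER_MAX) {
        improved = false;
        // Inner loop: executed once for each image in the stack.
        for (int step = 0; step < N; ++step) {
            double E = expand(step, E_old);
            if (E < E_old) {       // the new energy is lower: keep it
                E_old = E;
                improved = true;
            }
        }
        ++iter;
    }
    if (iterationsOut) *iterationsOut = iter;
    return E_old;
}
```

Each call to expand depends on the E_old produced by the previous step, and this dependency is what makes the loop hard to split across threads.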

25
Multithreading
Optimized Compute function flow
Our goal was to parallelize the energy computation in each step so that we can advance by 2 steps in each iteration. We calculate the odd steps (images) in thread 1 and the even steps in thread 2.
Thread synchronization appears in two places:
BVZ_Expand - The calculation part of the max flow is parallelized (both threads); at this point thread 2 waits for thread 1 to finish its energy calculation and label updates. Thread 2 then has the right E_old.
Compute - If thread 1 changed the labels, thread 2 must recalculate the last step on the updated labels.

26
Multithreading Optimization
Multithreading vs. Original Code: Time-Based Comparison
[Chart: Optimized Code vs. Original Code]
25% improvement!

27
Multithreading Optimization
Multithreading vs. Original Code: Time-Based Comparison
Theoretically, with 2 threads working simultaneously we would expect a 50% speedup.
Because the results of each thread depend on the previous iteration, synchronization points are required in the code.
Those synchronization points halt the threads and therefore cause delays.
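As a rough reading of these numbers, one can assume that the time spent waiting at sync points behaves like a serial fraction and apply Amdahl's law for 2 threads. This model is an assumption added here, not something stated in the presentation; p denotes the fraction of the original run time T_1 that effectively runs in parallel, and T_2 is the two-thread run time.

```latex
\begin{align*}
T_2 &= T_1\left((1-p) + \frac{p}{2}\right) = T_1\left(1 - \frac{p}{2}\right) \\
p = 1 &\;\Rightarrow\; T_2 = 0.5\,T_1 \quad \text{(the ideal 50\% reduction)} \\
T_2 = 0.75\,T_1 &\;\Rightarrow\; 1 - \frac{p}{2} = 0.75 \;\Rightarrow\; p = 0.5
\end{align*}
```

Under this simplified model, the measured 25% improvement would correspond to only about half of the run time being effectively parallel.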

28
Multithreading 2
Second Attempt
We tried to enhance the speed by taking a different approach to the synchronization.
In this attempt, each thread changes the labels (temporary labels) in its own memory segment, and we merge the results after both threads have completed.
Each label that thread 2 changes is marked in an auxiliary array.
In the merging process, the labels are updated from thread 1's temporary labels, unless the specific label was also changed by thread 2 afterwards; in that case the label is updated from thread 2's temporary labels.
We can see, however, that the results show no noticeable differences.
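The merge step described above can be sketched as a single pass over the labels. The names mergeLabels, t1Labels, t2Labels and changedByT2, as well as the use of std::vector and unsigned short, are hypothetical; the project's actual data layout is not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merge the per-thread temporary labels into the final labelling.
// changedByT2[i] plays the role of the auxiliary array: it is true
// when thread 2 modified label i.
std::vector<unsigned short> mergeLabels(
        const std::vector<unsigned short>& t1Labels,
        const std::vector<unsigned short>& t2Labels,
        const std::vector<bool>& changedByT2) {
    assert(t1Labels.size() == t2Labels.size());
    assert(t1Labels.size() == changedByT2.size());
    std::vector<unsigned short> merged(t1Labels.size());
    for (std::size_t i = 0; i < t1Labels.size(); ++i) {
        // Thread 1's label is the default; thread 2's later change wins.
        merged[i] = changedByT2[i] ? t2Labels[i] : t1Labels[i];
    }
    return merged;
}
```

Since each thread writes only to its own buffer, the threads need no synchronization while they run; the cost moves into this single merge pass at the end.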

29
Multithreading Optimization
Multithreading 2 vs. Original Code: Time-Based Comparison
[Chart: Optimized Code vs. Original Code]
99.8% improvement!

30
Multithreading
Thread Profiler view of Multithreaded Code
[Figure: Thread Profiler timeline, annotated "Our 2 threads"]
Our program's utilization is high and, except at our threads' sync points, both cores are working.

31
Multithreading
Thread Profiler view of Multithreaded Code
The serial part at the beginning of the program is required to load the image stack and to compute the first energy of the image; after that, our threads begin their computation.

32
Multithreading
Thread Profiler view of Multithreaded Code
The threads' sync points take little time; most of the code runs on both threads simultaneously.

33
Intel® Compiler
The Intel compiler did not run on our SIMD configuration (class error).
We used the Intel compiler on 3 of our configurations and compared their run times with those of the same configurations built with Visual Studio's compiler.

34
Intel® Tuning Assistant
Using Intel's Tuning Assistant, we found no significant areas where our code caused a slowdown.
All events collected by the Tuning Assistant indicate that our optimization is satisfactory.
The store forwarding issue in the SIMD code was not detected as a "hotspot" because the time consumed by the faulty code was 0.7% of the overall time spent in the entire function (less than 1%).

35
Optimization Summary Time Comparison

36
Optimization Summary Speed Up Comparison (Original is 100%)

37
Thank you, Tal Klein & Omer Manor
