
1 Software performance enhancement using multithreading and architectural considerations. Prepared by: Evgeni Gokhfeld, Konstantin Muradov. 03/2007

2 Chosen application Interactive Digital Photomontage, a graphics manipulation tool. Optimized routine: Panoramic Stitching. Download page: http://grail.cs.washington.edu/projects/photomontage/

3 Interactive Digital Photomontage An interactive environment for sophisticated image manipulations: relighting, extended depth of field, panoramic stitching, clean-plate production, and more…

4 What it looks like

5 Achievements 58% boost in Precise Mode (100% identical result). 100% boost in Fast Mode (visually identical result; allows a certain skew).

6 Algorithm Description The program operates as follows: the input files are read, and a big blank picture is created: the COMPOSITE. Using a graph-cut optimization algorithm, good seams are chosen to combine the source images and place them on the composite. We have N source images S_1, ..., S_N, and the algorithm chooses a source image S_i for each pixel p.

7 Algorithm Description (cont.) The mapping between pixels and source images is a labeling L(p). A seam exists between two neighboring pixels p, q if L(p) ≠ L(q). The inner loop at the t'th iteration takes a specific label α and a current labeling L_t, and computes an optimal labeling L_{t+1} such that L_{t+1}(p) = L_t(p) or L_{t+1}(p) = α. The outer loop iterates over each possible label.
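A minimal sketch of this iteration structure (our illustration; the identifiers are ours, and computeAlphaExpansion() stands in for the graph-cut step the project actually uses):

    #include <vector>

    typedef std::vector<int> Labeling;   // L(p): a source image index per pixel

    // Stand-ins for the project's routines (hypothetical signatures):
    double cost(const Labeling &L);
    Labeling computeAlphaExpansion(const Labeling &L, int alpha);

    Labeling optimize(Labeling L, int numLabels) {
        bool improved = true;
        while (improved) {                 // stop after a full pass over all
            improved = false;              // labels with no cost reduction
            for (int alpha = 0; alpha < numLabels; ++alpha) {      // outer loop
                Labeling Lnext = computeAlphaExpansion(L, alpha);  // inner step:
                // every pixel keeps its label or switches to alpha, i.e.
                // Lnext(p) == L(p) or Lnext(p) == alpha
                if (cost(Lnext) < cost(L)) { L = Lnext; improved = true; }
            }
        }
        return L;
    }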

8 Algorithm Description (cont.) The algorithm terminates when it has passed over all labels and failed to reduce the cost function. The cost function C of a pixel labeling L is the sum of a data penalty C_d over all pixels p and an interaction penalty C_i over all pairs of neighboring pixels p, q:

    C(L) = \sum_p C_d(p, L(p)) + \sum_{p,q} C_i(p, q, L(p), L(q))

9 Algorithm Description (cont.) The data penalty C_d is the distance to the image objective: the Euclidean distance in RGB space of the source image pixel S_{L(p)}(p) from the original composite. The interaction penalty C_i is the distance to the seam objective. The seam objective is 0 if L(p) = L(q).

    C(L) = \sum_p C_d(p, L(p)) + \sum_{p,q} C_i(p, q, L(p), L(q))

10 Algorithm Description (cont.) If L(p) ≠ L(q), the interaction penalty is:

    C_i(p, q, L(p), L(q)) = \|S_{L(p)}(p) - S_{L(q)}(p)\| + \|S_{L(p)}(q) - S_{L(q)}(q)\|

The algorithm employs fast approximate energy minimization via graph cuts, known as "alpha expansion". When this seam penalty is used, many of the theoretical guarantees of the alpha expansion algorithm are lost; however, in practice the authors have found it still gives good results.
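To make the two penalties concrete, here is a small sketch in plain C++ (our naming; the project's actual routines differ). It assumes pixels are 3-byte RGB triples:

    #include <cmath>

    // Euclidean distance between two RGB pixels (3 bytes each).
    static double rgbDist(const unsigned char *a, const unsigned char *b) {
        double s = 0;
        for (int c = 0; c < 3; ++c) {
            double k = double(a[c]) - double(b[c]);
            s += k * k;
        }
        return sqrt(s);
    }

    // Data penalty C_d(p, L(p)): distance of the chosen source pixel
    // S_L(p)(p) from the image objective (the original composite).
    double dataPenalty(const unsigned char *srcPix, const unsigned char *objPix) {
        return rgbDist(srcPix, objPix);
    }

    // Seam penalty C_i(p, q, L(p), L(q)) for neighbors p, q with L(p) != L(q):
    // ||S_L(p)(p) - S_L(q)(p)|| + ||S_L(p)(q) - S_L(q)(q)||; zero otherwise.
    double seamPenalty(const unsigned char *Slp_at_p, const unsigned char *Slq_at_p,
                       const unsigned char *Slp_at_q, const unsigned char *Slq_at_q) {
        return rgbDist(Slp_at_p, Slq_at_p) + rgbDist(Slp_at_q, Slq_at_q);
    }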

11 Optimization Steps Types & General Optimizations In the original code there were floating-point variables that are not necessary, because the source values are integers!

Original code:

    int c, k;
    float a = 0, M;
    for (c = 0; c < 3; ++c) {
        k = Il[c] - Inl[c];
        a += k*k;
    }
    M = sqrt(a);

Optimized code:

    int c, k, a = 0;
    float M;
    for (c = 0; c < 3; ++c) {
        k = Il[c] - Inl[c];
        a += k*k;
    }
    M = sqrt(float(a));

12 Optimization Steps The above loop is repeated twice (with different pointers), so we rewrote the two as one loop. The variable 'a' is zero in most cases, so we added a condition before the SQRT computation.

Original code:

    a = 0;
    Il = _imptr(l, p); Inl = _imptr(nl, p);
    for (c = 0; c < 3; ++c) {
        k = Il[c] - Inl[c];
        a += k*k;
    }
    M = sqrt(a);

    a = 0;
    Il = _imptr(l, np); Inl = _imptr(nl, np);
    for (c = 0; c < 3; ++c) {
        k = Il[c] - Inl[c];
        a += k*k;
    }
    M += sqrt(a);

Optimized code:

    a1 = 0; a2 = 0;
    Il1 = _imptr(l, p);  Inl1 = _imptr(nl, p);
    Il2 = _imptr(l, np); Inl2 = _imptr(nl, np);
    for (c = 0; c < 3; ++c) {
        k1 = Il1[c] - Inl1[c];
        k2 = Il2[c] - Inl2[c];
        a1 += k1*k1;
        a2 += k2*k2;
    }
    M  = a1 == 0 ? 0 : sqrt(float(a1));
    M += a2 == 0 ? 0 : sqrt(float(a2));

13 Optimization Steps SIMD instructions & registers SIMD candidate: the Euclidean distance between two RGB vectors:

    for (c = 0; c < 3; ++c) {
        k1 = Il1[c] - Inl1[c];
        k2 = Il2[c] - Inl2[c];
        a1 += k1*k1;
        a2 += k2*k2;
    }
    M  = a1 == 0 ? 0 : sqrt(float(a1));
    M += a2 == 0 ? 0 : sqrt(float(a2));

14 Optimization Steps (SIMD) If the SQRT has to be computed at least once, compute both values using SIMD:

    for (c = 0; c < 3; ++c) {
        k1 = Il1[c] - Inl1[c];
        k2 = Il2[c] - Inl2[c];
        a1 += k1*k1;
        a2 += k2*k2;
    }
    if (a1 | a2) {
        __m128d simd_M = _mm_sqrt_pd(_mm_set_pd(double(a1), double(a2)));
        M = simd_M.m128d_f64[0] + simd_M.m128d_f64[1];
    }

15 Optimization Steps (SIMD) Why not use SIMD for the computations done by the loop itself? The loop counter goes from 0 to 2, so there are three sequential bytes used as memory operands, and "three bytes" is not a good granularity for SIMD instructions. The memory operands are not aligned in memory (*Il1, *Inl1, *Il2, *Inl2 can start at any address), so some of the efficient load operations cannot be performed. However, we did code the loop with SIMD instructions to confirm that we indeed get a performance penalty: the (unrolled) non-SIMD loop had far fewer instructions than the SIMD version. The project paper provides one variant of the SIMD code replacing the loop, to show the complexity of the solution.
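The paper's SIMD variant is not reproduced on the slides; the sketch below is our own hypothetical illustration of what SIMD-izing the three-byte loop entails. Note the scalar gathers needed to place the unaligned byte triples into vector lanes, which is exactly the overhead that makes the SIMD version lose:

    #include <emmintrin.h>  // SSE2

    // Squared RGB distance of two 3-byte pixels via SSE2 (hypothetical sketch).
    static int sqDist3_sse2(const unsigned char *Il, const unsigned char *Inl) {
        // Scalar gathers: scatter the three bytes into 16-bit lanes
        // (this is the part that eats the SIMD benefit).
        __m128i va = _mm_set_epi16(0, 0, 0, 0, 0, Il[2], Il[1], Il[0]);
        __m128i vb = _mm_set_epi16(0, 0, 0, 0, 0, Inl[2], Inl[1], Inl[0]);
        __m128i d  = _mm_sub_epi16(va, vb);   // per-channel differences
        __m128i sq = _mm_madd_epi16(d, d);    // adjacent d*d pairs summed to 32-bit
        int out[4];
        _mm_storeu_si128((__m128i *)out, sq);
        return out[0] + out[1];               // d0^2 + d1^2 + d2^2
    }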

16 Optimization Steps Cache Alignment The majority of the problems are in the norm computation: non-aligned random accesses to triples of bytes (RGB vectors). Unpacking the data would significantly increase memory consumption (possibly even beyond 2GB!). The maxflow graph construction (also memory-consuming) has similar problems. But even if we did align the graph nodes in memory, it probably wouldn't lead to a significant speedup: the nodes of each graph are used rarely, and if a certain node was accessed, it will probably not be accessed again soon. Another direction was to reuse the memory occupied by the nodes across iterations of the algorithm, instead of performing new() and delete() each time, or to allocate one big chunk of memory and divide it into nodes, so that logically connected nodes reside close together in memory and the hardware prefetcher brings them in one after another. For both ideas no straightforward solution was found, and building a generic mechanism would amount to writing a kind of Memory Manager or Garbage Collector, which is beyond the scope of this project.
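For the "one big chunk" idea, here is a sketch of what a minimal node arena could look like (hypothetical; this was not implemented in the project):

    #include <cstddef>
    #include <vector>

    // Nodes allocated together stay adjacent in memory, which helps the
    // hardware prefetcher; reset() recycles the whole arena in the next
    // iteration instead of per-node new()/delete().
    template <class Node>
    class NodeArena {
        std::vector<Node> pool;
        std::size_t used;
    public:
        explicit NodeArena(std::size_t capacity) : pool(capacity), used(0) {}
        Node *alloc() { return used < pool.size() ? &pool[used++] : 0; }
        void reset()  { used = 0; }
    };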

17 Optimization Steps Cache misses With the help of VTune we identified the code where a lot of cache misses take place while searching the graph. It happens in the Graph::process_source_orphan routine, called from Graph::maxflow:

    while (1) {
        if (j->TS == TIME) { d += j->DIST; break; }
        a = j->parent;
        d++;
        if (a == TERMINAL) { j->TS = TIME; j->DIST = 1; break; }
        if (a == ORPHAN) { d = INFINITE_D; break; }
        j = a->head;
    }

In particular, the access a->head gets the cache miss. We would naturally like to prefetch this data, but to do so we first need to check that the node 'a' is not TERMINAL or ORPHAN, which effectively means reaching the access itself. We also tried to insert the prefetch instruction at the end of the loop to bring in the data for the next iteration, but since the same check must be performed (one pointer level deeper), this solution failed as well.
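Our reconstruction of the attempted fix (not the project's exact code) shows why the prefetch cannot help: it only becomes legal after the TERMINAL/ORPHAN checks, one instruction before the demand load it was supposed to hide:

    #include <xmmintrin.h>  // _mm_prefetch

    while (1) {
        if (j->TS == TIME) { d += j->DIST; break; }
        a = j->parent;
        d++;
        if (a == TERMINAL) { j->TS = TIME; j->DIST = 1; break; }
        if (a == ORPHAN) { d = INFINITE_D; break; }
        // Only now do we know 'a' is a real node, but the demand load
        // of a->head follows immediately, so no latency is hidden:
        _mm_prefetch((const char *)a, _MM_HINT_T0);
        j = a->head;
    }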

18 Optimization Steps Threading The original code was single-threaded. A serious obstacle: it is an iterative algorithm! The 1st attempt came after an observation: about half of the steps retain the same max flow, so it's possible to pre-compute the next step. If the max flow is retained, we win a step…

19 Optimization Steps Drawback: waste of CPU time. Improvement: stop the worker when the main thread discovers a max flow change; otherwise wait for the worker to finish. Theoretically this should give about a 25% speedup; in practice it is much lower, because the first steps are VERY heavy.

[Diagram: the main thread runs Step 1 while the worker thread pre-computes Step 2. Step 1 reduced the flow (YES), so the pre-computed Step 2 is discarded, the main thread runs Step 2, and the worker pre-computes Step 3. Step 2 did not reduce the flow (NO), so we gained a step: from Step 2 directly to Step 4, with the worker pre-computing Step 5. A GUI thread shows intermediate results between steps and sleeps most of the time.]
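A sketch of the speculative scheme (our illustration only: it uses std::thread for brevity, while the 2007 project necessarily used an older threading API, and all names here are ours):

    #include <thread>
    #include <vector>

    typedef std::vector<int> Labeling;
    struct StepResult { Labeling labeling; bool flowChanged; };

    // Stand-in for the project's per-step graph-cut computation:
    StepResult computeStep(const Labeling &L, int step);

    void runSpeculative(Labeling L, int numSteps) {
        for (int t = 1; t <= numSteps; ++t) {
            StepResult spec;
            // Worker guesses step t will not change the max flow and
            // pre-computes step t+1 from the current labeling.
            std::thread worker([&spec, &L, t] { spec = computeStep(L, t + 1); });
            StepResult cur = computeStep(L, t);   // main thread: step t
            worker.join();          // (improvement: stop it early on a change)
            if (cur.flowChanged) {
                L = cur.labeling;   // speculation wasted; discard it
            } else {
                L = spec.labeling;  // guess was right: we won a step,
                ++t;                // jumping directly past step t+1
            }
        }
    }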

20 Optimization Steps 2nd attempt: let's sacrifice binary identity! The authors themselves introduced the seam penalty and lost the theoretical guarantees of the algorithm. Let's compute 2 neighboring steps simultaneously in different threads and then combine the results. We then have 2 label sets for each pixel, at steps t and t+1: L_t(p) and L_{t+1}(p). If there is no improvement at step t, the whole labeling L_{t+1} is used.

21 Optimization Steps Otherwise, we use L_t and merge L_{t+1} into it: If L_{t+1} has the same or worse max flow than L_t, it's discarded. If not, then for each pixel p: if L_{t+1} decided to move p, take L_{t+1}(p); otherwise take L_t(p). The results are visually identical. There is no binary identity, as expected. The difference in the flow from the original product is < 0.01%, which shows that the results are indeed very close.
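A sketch of this merge rule (our naming; we also assume "decided to move p" means the label differs from the common base labeling L0 that both steps started from, which the slide leaves implicit):

    #include <cstddef>
    #include <vector>

    typedef std::vector<int> Labeling;

    // Merge the two labelings computed in parallel for steps t and t+1.
    // flowT / flowT1 are the resulting max-flow values; lower is better.
    Labeling merge(const Labeling &L0, const Labeling &Lt, const Labeling &Lt1,
                   double flowT, double flowT1) {
        if (flowT1 >= flowT)
            return Lt;                  // same or worse max flow: discard L_{t+1}
        Labeling out(Lt.size());
        for (std::size_t p = 0; p < Lt.size(); ++p)
            out[p] = (Lt1[p] != L0[p]) ? Lt1[p]   // L_{t+1} moved p: take it
                                       : Lt[p];   // otherwise keep L_t's label
        return out;
    }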

22 Optimization Steps Sanity check: we also generate this result with a single thread. Remark: the program runs much faster on a Core Duo machine than on a P4. There is still a slight penalty for synchronizing before the merging steps.

[Diagram: the main thread computes Steps 1, 3, 5, ... while the worker thread computes Steps 2, 4, 6, ...; after each pair the results are merged and the step counter advances. A GUI thread shows intermediate results between steps and sleeps most of the time.]

23 Optimization Steps Proper compiler flags We also got a benefit from the following compiler flags: SSE usage (the compiler generates SSE & SSE2 instructions where appropriate) and the FAST floating-point model instead of PRECISE. Credit goes to Koby.
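The slide does not name the exact switches; on the Visual C++ compiler of that era these settings most likely correspond to flags along these lines (the file name is a placeholder):

    cl /O2 /arch:SSE2 /fp:fast photomontage.cpp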

24 Optimization Steps Compiling with the Intel® Compiler Compilation with the Intel® compiler failed due to an internal compiler error: the link step reported invalid debug info in an object file. The project's code was delivered to Koby so Intel can analyze and correct the issue. Intel® Thread Checker The program was checked and found correct (see the documentation for details & screenshots). Due to the high memory consumption we had to use reduced source image sets.

25 Optimization Steps Tuning Assistant Advice Main issues: DTLB/ITLB misses (and page walks), page faults, cache misses, and ROB full (memory accesses took a lot of time). All of the above are expected in such a memory-consuming application.

26 Results (Times) Run time and VTune counters per configuration (C0/C1: the two CPU cores):

    Configuration                 Run Time   Clock Ticks C0   Clock Ticks C1   Instr. Retired C0   Instr. Retired C1
    Original                        183.24           184324                0              169269                   0
    Allow SSE2                      180.16           177408                0              154714                   0
    Fast FP                         141.1            138136                0              143353                   0
    General Optimizations           120.86           124595                0              121143                   0
    SIMD SQRT                       149.93           155067                0              151212                   0
    Threading - Precise             141.29           114553            91069               99401               80146
    Threading - Fast                108.46            91936            71070               92022               62824
    All Optimizations - Precise     115.58            99759            73484               87453               63931
    All Optimizations - Fast         89.76            78247            59923               78372               52354
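A quick cross-check of the Run Time column against the boost figures on slide 5: 183.24 / 115.58 ≈ 1.585, the reported 58% boost in Precise Mode, and 183.24 / 89.76 ≈ 2.042, the reported (roughly) 100% boost in Fast Mode.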

27 Results (Boost)

28 More VTune snapshots

29

30

