16bit 3D Convolution Implementation: SSE + OpenMP, Benchmarking on Penryn
Dr. Zvi Danovich, Senior Application Engineer
Copyright © 2007 Intel Corporation.

Copyright © 2008 Intel Corporation.
Agenda
- Mathematics of 3D convolution
- Main idea of SSE implementation of 1D convolution
- Basic routine of algorithm: 2D convolution – 1 line
- Main routine of algorithm: 3D convolution – line by line
- Adding OpenMP, benchmarking, conclusions

3D convolution: what is it?

A 3D convolution with a 3x3x3 kernel K is computed for each pixel P as

$$P_{x,y,z} = \sum_{k=-1}^{1}\sum_{j=-1}^{1}\sum_{i=-1}^{1} K_{i,j,k}\, p_{x+i,\,y+j,\,z+k}$$

where p denotes the source pixels and K the convolution kernel values. In other words, each new pixel is the sum of 27 products of source pixel values with the corresponding kernel values inside the kernel cube.
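As a reference point, the 27-product sum can be written directly in scalar form. This is a minimal sketch, not code from the talk: the function name `conv3d_point`, the nested-list layout `vol[z][y][x]` and the kernel indexing `kernel[k+1][j+1][i+1]` are assumptions made for illustration.

```python
def conv3d_point(vol, kernel, z, y, x):
    """Sum of the 27 products around vol[z][y][x].

    Assumed layout: vol[z][y][x] for pixels, kernel[k+1][j+1][i+1]
    for the kernel value at offset (i, j, k) along (x, y, z).
    """
    total = 0
    for k in (-1, 0, 1):          # slice offset
        for j in (-1, 0, 1):      # line offset
            for i in (-1, 0, 1):  # pixel offset within the line
                total += kernel[k + 1][j + 1][i + 1] * vol[z + k][y + j][x + i]
    return total

# A 3x3x3 volume holding the values 0..26 and a kernel of all ones:
vol = [[[9 * z + 3 * y + x for x in range(3)] for y in range(3)] for z in range(3)]
ones = [[[1] * 3 for _ in range(3)] for _ in range(3)]
centre = conv3d_point(vol, ones, 1, 1, 1)  # 0 + 1 + ... + 26 = 351
```

A scalar routine like this is only useful as a correctness check for a vectorized version; it performs 27 multiplications per output pixel with no reuse.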

Recombination from 1D convolutions

If the 1D convolution along the line at offset j within the slice at offset k is defined as

$$C^{j,k}_{x} = \sum_{i=-1}^{1} K_{i,j,k}\, p_{x+i,\,y+j,\,z+k}$$

then the final line of the 3D convolution is

$$P_{x,y,z} = \sum_{k=-1}^{1}\sum_{j=-1}^{1} C^{j,k}_{x}$$

i.e. the 3D convolution can be presented as a double sum of nine 1D convolutions: 3 planes with 3 lines in each plane.
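The decomposition can be checked numerically. The sketch below assumes a nested-list volume `vol[z][y][x]` and a kernel stored as `kernel[k+1][j+1]` holding the three taps of one line; the names `conv1d`, `conv3d_line` and `direct` are hypothetical, and only interior pixels are computed.

```python
def conv1d(line, k):
    """3-tap 1D convolution for interior pixels:
    k[0]*p[x-1] + k[1]*p[x] + k[2]*p[x+1]."""
    return [k[0] * line[x - 1] + k[1] * line[x] + k[2] * line[x + 1]
            for x in range(1, len(line) - 1)]

def conv3d_line(vol, kernel, z, y):
    """One output line as the double sum of nine 1D convolutions
    (3 slices x 3 lines per slice)."""
    out = [0] * (len(vol[z][y]) - 2)
    for k in (-1, 0, 1):
        for j in (-1, 0, 1):
            part = conv1d(vol[z + k][y + j], kernel[k + 1][j + 1])
            out = [a + b for a, b in zip(out, part)]
    return out

def direct(vol, kernel, z, y, x):
    """Direct 27-product sum, used only for comparison."""
    return sum(kernel[k + 1][j + 1][i + 1] * vol[z + k][y + j][x + i]
               for k in (-1, 0, 1) for j in (-1, 0, 1) for i in (-1, 0, 1))

# Arbitrary test data: 3 slices, 3 lines per slice, 6 pixels per line.
vol = [[[(7 * z + 5 * y + 3 * x) % 11 for x in range(6)]
        for y in range(3)] for z in range(3)]
kernel = [[[k * 9 + j * 3 + i - 13 for i in range(3)]
           for j in range(3)] for k in range(3)]
line = conv3d_line(vol, kernel, 1, 1)
```

Comparing `line` with `direct(...)` for x = 1..4 confirms that both formulations give the same values.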

Main part of algorithm: 1D convolution, idea of implementation

Start from 3 sequential QUADs of the source line (pixels $p_{-4}..p_{-1}$, $p_0..p_3$ and $p_4..p_7$) and multiply all three by each of the kernel values, denoted $k_-$, $k_c$, $k_+$.

[Figure: the source pixels p and the three rows of product QUADs ($k_-$, $k_c$, $k_+$) after Multiplication, followed by Selection by PALIGNR.]

Using PALIGNR, select the QUAD shifted left for the products with $k_-$ and the QUAD shifted right for the products with $k_+$. Sum them with the unshifted QUAD of products with $k_c$:

$$\begin{aligned}
P_0 &= k_-\,p_{-1} + k_c\,p_0 + k_+\,p_1 \\
P_1 &= k_-\,p_0 + k_c\,p_1 + k_+\,p_2 \\
P_2 &= k_-\,p_1 + k_c\,p_2 + k_+\,p_3 \\
P_3 &= k_-\,p_2 + k_c\,p_3 + k_+\,p_4
\end{aligned}$$

The resulting sums are the convolution expressions for the central QUAD.
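The multiply-then-select idea can be modelled in scalar form. This is a hypothetical Python model, not the SSE code from the talk: a QUAD is a list of four integers with lane 0 first, `alignr(hi, lo, n)` imitates PALIGNR with a shift of n 32-bit lanes, and all function names are assumptions.

```python
def mul_quad(quad, k):
    """Multiply every lane of a QUAD by one kernel value."""
    return [k * p for p in quad]

def alignr(hi, lo, n):
    """Model of PALIGNR on 32-bit lanes: concatenate hi:lo (lo in the
    low lanes), shift right by n lanes, keep the low four lanes."""
    return (lo + hi)[n:n + 4]

def conv1d_central_quad(prev, cur, nxt, k_minus, k_c, k_plus):
    """1D convolution of the central QUAD from three product rows."""
    # k- products of p[-1..2]: last lane of prev, first three of cur
    minus = alignr(mul_quad(cur, k_minus), mul_quad(prev, k_minus), 3)
    # kc products of p[0..3]: unshifted
    centre = mul_quad(cur, k_c)
    # k+ products of p[1..4]: last three lanes of cur, first of nxt
    plus = alignr(mul_quad(nxt, k_plus), mul_quad(cur, k_plus), 1)
    return [a + b + c for a, b, c in zip(minus, centre, plus)]

# p[-4] .. p[7], split into three sequential QUADs
pixels = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
prev, cur, nxt = pixels[0:4], pixels[4:8], pixels[8:12]
result = conv1d_central_quad(prev, cur, nxt, 1, 2, 3)  # [320, 380, 440, 500]
```

With 32-bit lanes, a shift of n lanes corresponds to a PALIGNR byte count of 4n. Note that the products are computed once per QUAD and then realigned, rather than realigning the pixels and multiplying three times.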

Basic routine of algorithm: 2D convolution – 1 line

The main loop treats sequential EIGHTs of 16-bit pixels for 3 adjacent lines (unrolled inside one step). The 1D convolution (in 32-bit form) is computed for the 2 QUADs of each EIGHT, and the results for the 3 lines are summed up, forming the 2D convolution results.

To avoid "if"s in the main loop, the very first step is separated into a prolog, which is simpler than the general step.

Below is the description of the computations for 1 line (of the 3 lines) in a general main-loop step. It starts by loading an EIGHT of 16-bit source pixels and unpacking them into two 32-bit QUADs:

[Figure: Load EIGHT of 16-bit source pixels p0..p7; Shuffle; equivalent to the first unpacked 32-bit QUAD (p0..p3) and the second unpacked 32-bit QUAD (p4..p7).]

Basic routine of algorithm: 2D convolution – 1 line

Multiply the 2 unpacked QUADs by the three K values (denoted $k_-$, $k_c$, $k_+$) using SSE4 mullo_epi32, which gives 6 product QUADs. Treat them together with the 2 similar product QUADs saved at the previous loop step.

[Figure: the $k_-$, $k_c$ and $k_+$ product QUADs of pixels 0..7 after Multiplication, next to the saved product QUADs from the previous step; RED, GREEN and YELLOW frames mark the QUADs selected for sums (1), (2) and (Prev).]

Using PALIGNR, select the appropriate QUADs and start or continue forming 3 sum QUADs:
- (1) RED frame: 2D convolution of the 1st source QUAD; finalized and stored at the end of the current step.
- (2) GREEN frame: 2D convolution of the 2nd source QUAD; finalized and stored at the end of the next step or the epilog.
- (Prev) YELLOW frame: 2D convolution of the previous 2nd source QUAD; finalized and stored at the end of the current step.

Therefore, at the end of the current step, 2 resulting 2D convolution QUADs are stored: the PREVIOUS 2nd and the CURRENT 1st.

Basic routine of algorithm: 2D convolution – 1 line, finalizing

As already mentioned, each step treats and sums up data from 3 adjacent lines: it performs the computations from the previous slides for the 2 other lines as well, with the corresponding sets of kernel components.

The prolog step does not include the PREVIOUS sum computation and therefore does not store it. The epilog step computes and stores the very last 2D convolution QUAD, in the same way as the PREVIOUS computation in a regular step.

Finally, the above routine builds ONE 32-bit line of 2D convolution results.
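The prolog / main loop / epilog structure of the line routine can be sketched as a scalar model. Everything here is an illustration under stated assumptions rather than the original routine: QUADs are Python lists, `alignr` imitates PALIGNR on 32-bit lanes, pixels outside the line are treated as zero (the slides do not state the border handling), and the prolog reuses the general `step` with a zeroed saved state and discards the PREVIOUS part instead of omitting it.

```python
ZERO = [0, 0, 0, 0]

def alignr(hi, lo, n):
    """PALIGNR model on 32-bit lanes (lo occupies the low lanes)."""
    return (lo + hi)[n:n + 4]

def scale(quad, k):
    return [k * p for p in quad]

def add(*quads):
    return [sum(lanes) for lanes in zip(*quads)]

def step(lines, kernel, x, saved):
    """One EIGHT at offset x, summed over the 3 lines.
    saved holds, per line, the k- and k+ products of the previous 2nd QUAD."""
    tail, sum1, sum2, new_saved = ZERO, ZERO, ZERO, []
    for line, (km, kc, kp), (prev_m, prev_p) in zip(lines, kernel, saved):
        a, b = line[x:x + 4], line[x + 4:x + 8]
        am, ap, bm, bp = scale(a, km), scale(a, kp), scale(b, km), scale(b, kp)
        tail = add(tail, alignr(ap, prev_p, 1))            # completes (Prev)
        sum1 = add(sum1, alignr(am, prev_m, 3), scale(a, kc),
                   alignr(bp, ap, 1))                      # sum (1), complete
        sum2 = add(sum2, alignr(bm, am, 3), scale(b, kc))  # sum (2), partial
        new_saved.append((bm, bp))
    return tail, sum1, sum2, new_saved

def conv2d_line(lines, kernel):
    """lines: 3 adjacent lines, length a multiple of 8;
    kernel: 3 rows of (k-, kc, k+). Returns one 2D convolution line."""
    out = []
    # Prolog: nothing from a previous step to finalize or store.
    _, sum1, pending, saved = step(lines, kernel, 0, [(ZERO, ZERO)] * 3)
    out += sum1
    # Main loop: store the PREVIOUS 2nd QUAD and the CURRENT 1st QUAD.
    for x in range(8, len(lines[0]), 8):
        tail, sum1, sum2, saved = step(lines, kernel, x, saved)
        out += add(pending, tail) + sum1
        pending = sum2
    # Epilog: finalize the very last QUAD like a PREVIOUS sum.
    tail = add(*[alignr(ZERO, prev_p, 1) for _, prev_p in saved])
    out += add(pending, tail)
    return out

lines = [[(3 * r + 5 * x) % 17 for x in range(16)] for r in range(3)]
kernel = [(1, 2, 1), (2, 4, 2), (1, 2, 1)]
conv_line = conv2d_line(lines, kernel)
```

The point of the split is that the main loop body contains no conditionals: the only step that lacks a predecessor is handled before the loop, and the only QUAD that lacks a successor is handled after it.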

Main routine of algorithm: 3D convolution – line by line

To build the full 3D convolution stack, this routine runs over the lines (inner loop) of all slices (outer loop). For each source line it computes three 32-bit 2D convolution lines, based on the previous, current and next slices, using the "2D convolution – 1 line" routine described above.

[Figure: lines -1, 0 and 1 of Slice -1 (previous), Slice 0 (current) and Slice 1 (next) each go through 2D convolution, followed by summing up.]

The resulting 3D convolution line is built by summing up these 3 lines, normalizing by an arithmetic shift and converting the result to 16 bits as follows:

[Figure: the QUADs 0..3 and 4..7 of the Line -1, Line 0 and Line +1 2D convolutions are summed up into the 32-bit 3D convolution; after the shift the values are actually 16-bit; packs_epi32 produces the final 16-bit 3D convolution EIGHT, which is stored.]
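The sum, shift and pack sequence can be modelled in a few lines. This sketch is an assumption-laden illustration: `finalize_eight` and its `shift` parameter are hypothetical names (the slides do not give the shift amount), and `packs_epi32` here is a plain-Python model of the signed-saturating pack, not the intrinsic itself.

```python
def packs_epi32(lo_quad, hi_quad):
    """Model of packs_epi32: two 32-bit QUADs become eight 16-bit values,
    saturated to the signed 16-bit range [-32768, 32767]."""
    return [max(-32768, min(32767, v)) for v in lo_quad + hi_quad]

def finalize_eight(conv_prev, conv_cur, conv_next, shift):
    """Sum three 32-bit 2D convolution EIGHTs, normalize by an
    arithmetic right shift, and pack to 16 bits."""
    total = [a + b + c for a, b, c in zip(conv_prev, conv_cur, conv_next)]
    shifted = [v >> shift for v in total]  # Python's >> on ints is arithmetic
    return packs_epi32(shifted[:4], shifted[4:])

eight = finalize_eight(
    [16, 32, -16, 0, 1000000, -1000000, 7, -7],
    [16] * 8,
    [0] * 8,
    shift=4,
)
# eight == [2, 3, 0, 1, 32767, -32768, 1, 0]
```

The two out-of-range lanes show the saturation; with a kernel whose normalization matches the shift, the slide notes that the shifted values already fit in 16 bits, so saturation acts only as a safety net.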

Parallelizing by OpenMP and benchmarking

To parallelize the above algorithm with OpenMP over the outer (slices) loop, three 32-bit working lines are allocated for each thread.

Benchmarks with and without OpenMP on a 2-way HPTN machine (8 cores), SSE only vs. SSE+OpenMP:

Runs                                       Serial/SSE   SSE/(SSE+OpenMP)   Serial/(SSE+OpenMP)
3 (equivalent of 3D gradient computation)  ~3           ~5.5               ~16.3
10                                         ~3           ~6.3               ~18.6

The SSE speed-up (3x) is close to the theoretical limit of 4x for operations on vectors of four 32-bit values. The additional OpenMP speed-up (5.5x-6.3x) brings the overall speed-up to 16.3x-18.6x.
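The work decomposition behind the parallel version can be illustrated without OpenMP. The sketch below swaps in Python's `ThreadPoolExecutor` purely to show that output slices are independent once each worker owns its three working lines; it is not expected to reproduce the speed-ups above (CPython threads do not run this kind of pure-Python arithmetic in parallel), and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def conv2d_line(slice_, kernel2d, y):
    """Interior 2D convolution of line y of one slice (scalar reference)."""
    width = len(slice_[0])
    return [sum(kernel2d[j + 1][i + 1] * slice_[y + j][x + i]
                for j in (-1, 0, 1) for i in (-1, 0, 1))
            for x in range(1, width - 1)]

def conv3d_slice(vol, kernel, z):
    """All interior lines of output slice z."""
    out = []
    for y in range(1, len(vol[z]) - 1):
        # Three working lines, local to this call and hence to its worker.
        work = [conv2d_line(vol[z + k], kernel[k + 1], y) for k in (-1, 0, 1)]
        out.append([a + b + c for a, b, c in zip(*work)])
    return out

def conv3d(vol, kernel, workers=1):
    """Outer (slices) loop, run serially or distributed over workers."""
    slices = range(1, len(vol) - 1)
    if workers == 1:
        return [conv3d_slice(vol, kernel, z) for z in slices]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda z: conv3d_slice(vol, kernel, z), slices))

# Arbitrary test data: 5 slices, 4 lines per slice, 6 pixels per line.
vol = [[[(7 * z + 5 * y + 3 * x) % 11 for x in range(6)]
        for y in range(4)] for z in range(5)]
kernel = [[[k * 9 + j * 3 + i - 13 for i in range(3)]
           for j in range(3)] for k in range(3)]
serial = conv3d(vol, kernel, workers=1)
parallel = conv3d(vol, kernel, workers=4)
```

Because the source volume is only read and every worker writes to its own buffers, no synchronization is needed inside the slices loop, which is what makes it a natural target for a parallel-for construct.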
