Published by Anika Bouler. Modified over 3 years ago.

1
ACCELERATING MULTIMEDIA APPLICATIONS USING THE INTEL SSE AND AVX ISA
MIN LI
05/08/2013

2
INTEL SSE AND AVX ISA
Intel ISA
  SSE1, SSE2, SSE3, SSE4 (SSE4.1, SSE4.2)
  SSE4.2: specialized for string and text applications (suitable for applications like template matching and genome sequence comparison)
  AVX (mainly for floating-point operations)
    AVX1: 256 bits
    AVX2: 256 bits (extends the integer instructions to 256 bits, among other additions)
XMM and YMM registers
  XMM: 128 bits
  YMM: 256 bits
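To make the register widths concrete, here is a minimal sketch showing how many 32-bit floats fit in an XMM and a YMM register, plus a four-wide addition done with one SSE instruction. The names `XMM_FLOATS`, `YMM_FLOATS`, and `add4` are hypothetical and do not come from the slides.

```c
#include <immintrin.h>  /* declares __m128 (XMM) and __m256 (YMM) */

/* Number of 32-bit floats that fit in one register of each kind. */
enum {
    XMM_FLOATS = sizeof(__m128) / sizeof(float),  /* 128 bits -> 4 floats */
    YMM_FLOATS = sizeof(__m256) / sizeof(float)   /* 256 bits -> 8 floats */
};

/* Adds four pairs of floats with a single SSE add.
   Unaligned loads/stores are used so the arrays need no special alignment. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

An AVX version of `add4` would process eight floats per instruction in the same way, using the `_mm256_` counterparts of these intrinsics.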

3
INTEL OPENCV LIBRARY
OpenCV library
  A variety of multimedia applications
  Object detection, face recognition, image processing…
Good candidate for speedup using the Intel SSE or AVX ISA
  Intensive computations
I made videos on YouTube to show some tricks in using the OpenCV library:
  https://www.youtube.com/watch?v=ISap9zEGE2I
  https://www.youtube.com/watch?v=pqSgT0quMBc

4
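GUIDELINES FOR ENABLING THE ISA
Intel SSE and AVX
  Run cat /proc/cpuinfo and make sure SSE and AVX are enabled. Otherwise, enable them.
As the output shows:
  All SSE extensions are activated.
  However, only AVX1 is activated, which means the 256-bit YMM registers can be used for floating-point operations only; integer SIMD operations are still limited to the 128-bit XMM registers.
Note: AVX2 requires newer hardware; the first processors to support it (Intel Haswell) ship in mid-2013.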
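The same check can also be done from inside a program. The sketch below assumes GCC or Clang on x86, whose `__builtin_cpu_supports` builtin reports the feature flags of the running CPU. The helper names (`has_sse42`, `has_avx`, `has_avx2`, `print_simd_support`) are hypothetical.

```c
#include <stdio.h>

/* Each helper returns 1 if the running CPU reports the feature, else 0.
   __builtin_cpu_supports is a GCC/Clang builtin for x86 targets. */
static int has_sse42(void) { return __builtin_cpu_supports("sse4.2") != 0; }
static int has_avx(void)   { return __builtin_cpu_supports("avx") != 0; }
static int has_avx2(void)  { return __builtin_cpu_supports("avx2") != 0; }

/* Prints a summary comparable to the flags line of /proc/cpuinfo. */
static void print_simd_support(void)
{
    /* Only strictly required before constructors run; harmless otherwise. */
    __builtin_cpu_init();
    printf("SSE4.2: %s\n", has_sse42() ? "yes" : "no");
    printf("AVX:    %s\n", has_avx()   ? "yes" : "no");
    printf("AVX2:   %s\n", has_avx2()  ? "yes" : "no");
}
```

A program can use such a check to choose between a scalar path and an SSE or AVX path at run time instead of assuming the instructions exist.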

6
ACCELERATION CASE I
Original:
    for( int i = 0; i < length; i += 4 ){
        double t0 = d1[i] - d2[i];
        double t1 = d1[i+1] - d2[i+1];
        double t2 = d1[i+2] - d2[i+2];
        double t3 = d1[i+3] - d2[i+3];
        total_cost += t0*t0 + t1*t1 + t2*t2 + t3*t3;
    }
After modification:
    /* d1 and d2 must be float arrays aligned to 16 bytes for _mm_load_ps. */
    int chunk = length / 4;
    for( i = 0; i < chunk; i++ ){
        __m128 m0, m1, m2;
        m0 = _mm_load_ps(&d1[4 * i]);
        m1 = _mm_load_ps(&d2[4 * i]);
        m1 = _mm_sub_ps(m0, m1);                            /* four differences */
        m1 = _mm_mul_ps(m1, m1);                            /* squared */
        m1 = _mm_hadd_ps(m1, m1);                           /* pairwise sums (SSE3) */
        m2 = _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2,3,0,1));
        m1 = _mm_add_ps(m1, m2);                            /* every lane holds the 4-element sum */
        total_cost += _mm_cvtss_f32(m1);
        if( total_cost > best ) break;
    }
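As a self-contained variant of the same idea, the sketch below computes the sum of squared differences with plain SSE. It differs from the slide code in three ways: it uses unaligned loads, it keeps four partial sums in a register and reduces them once after the loop instead of once per iteration, and it handles lengths that are not a multiple of 4. The early exit on `best` is left out. The function names `ssd_scalar` and `ssd_sse` are hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar reference: sum of squared differences. */
static float ssd_scalar(const float *a, const float *b, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = a[i] - b[i];
        total += d * d;
    }
    return total;
}

/* SSE version: four lanes accumulate in parallel, reduced once at the end. */
static float ssd_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }

    /* Horizontal reduction of acc = [s0 s1 s2 s3] into lane 0. */
    __m128 hi    = _mm_movehl_ps(acc, acc);   /* [s2 s3 s2 s3] */
    __m128 sum2  = _mm_add_ps(acc, hi);       /* [s0+s2 s1+s3 . .] */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float total  = _mm_cvtss_f32(_mm_add_ss(sum2, lane1));

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++) {
        float d = a[i] - b[i];
        total += d * d;
    }
    return total;
}
```

Because floating-point addition is not associative, the SSE result can differ from the scalar one in the last bits for general inputs. The two agree exactly when every intermediate sum is exactly representable.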

7
ACCELERATION CASE II
Original:
    float minval = FLT_MAX, maxval = -FLT_MAX;
    for( i = 0; i < N; i++, ++it )
    {
        float v = *(const float*)it.ptr;
        if( v < minval ) { minval = v; minidx = it.node()->idx; }
        if( v > maxval ) { maxval = v; maxidx = it.node()->idx; }
    }
    if( _minval ) *_minval = minval;
    if( _maxval ) *_maxval = maxval;
After modification:
    /* minArray, maxArray, minPos and maxPos are assumed to be
       initialized from the first chunk (i = 0) before this loop. */
    __m128 m0, m1, m2, m3, m4, minArray, maxArray;
    int chunk = N / 4;
    for( i = 1; i < chunk; i++ ){
        m0 = _mm_load_ps( (const float*)it.ptr );
        it += 4;
        m1 = _mm_min_ps(m0, minArray);
        m2 = _mm_max_ps(m0, maxArray);
        m3 = _mm_cmp_ps(m0, minArray, _CMP_LT_OS);   /* all-ones lanes where m0 < minArray (AVX encoding) */
        m4 = _mm_cmp_ps(m0, maxArray, _CMP_GT_OS);   /* all-ones lanes where m0 > maxArray */
        int* mask1 = (int*) &m3;
        int* mask2 = (int*) &m4;
        for( int j = 0; j < 4; j++ ){
            if( mask1[j] == -1 ) minPos[j] = 4 * i + j;
            if( mask2[j] == -1 ) maxPos[j] = 4 * i + j;
        }
        minArray = m1;   /* keep the running minima, not the comparison mask */
        maxArray = m2;   /* keep the running maxima */
    }
    /* A final pass over the four lanes still has to pick the overall min/max and their positions. */
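The per-lane tracking described on this slide can be sketched end to end as follows. This is an illustration on a plain contiguous float array, not the OpenCV sparse-matrix iterator used in the original. It uses only SSE, reads the comparison result with `_mm_movemask_ps` instead of casting the register to `int*`, and adds the final reduction over the four lanes. The function name `minmax_idx_sse` is hypothetical.

```c
#include <stddef.h>
#include <float.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Finds the minimum and maximum of v[0..n-1] together with their positions.
   Each of the four lanes tracks its own running min/max and position;
   the lanes are merged after the loop. On ties between lanes the lowest
   lane wins, which is not necessarily the lowest index. */
static void minmax_idx_sse(const float *v, size_t n,
                           float *minval, size_t *minidx,
                           float *maxval, size_t *maxidx)
{
    __m128 vmin = _mm_set1_ps(FLT_MAX);
    __m128 vmax = _mm_set1_ps(-FLT_MAX);
    size_t min_pos[4] = {0, 0, 0, 0};
    size_t max_pos[4] = {0, 0, 0, 0};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(v + i);
        /* One bit per lane: set where x improves on the running value. */
        int lt = _mm_movemask_ps(_mm_cmplt_ps(x, vmin));
        int gt = _mm_movemask_ps(_mm_cmpgt_ps(x, vmax));
        for (int j = 0; j < 4; j++) {
            if (lt & (1 << j)) min_pos[j] = i + (size_t)j;
            if (gt & (1 << j)) max_pos[j] = i + (size_t)j;
        }
        vmin = _mm_min_ps(x, vmin);
        vmax = _mm_max_ps(x, vmax);
    }

    /* Merge the four lanes. */
    float lane_min[4], lane_max[4];
    _mm_storeu_ps(lane_min, vmin);
    _mm_storeu_ps(lane_max, vmax);
    *minval = FLT_MAX;  *minidx = 0;
    *maxval = -FLT_MAX; *maxidx = 0;
    for (int j = 0; j < 4; j++) {
        if (lane_min[j] < *minval) { *minval = lane_min[j]; *minidx = min_pos[j]; }
        if (lane_max[j] > *maxval) { *maxval = lane_max[j]; *maxidx = max_pos[j]; }
    }

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; i++) {
        if (v[i] < *minval) { *minval = v[i]; *minidx = i; }
        if (v[i] > *maxval) { *maxval = v[i]; *maxidx = i; }
    }
}
```

The inner loop over `j` is still scalar, which is one reason position tracking costs more than finding the values alone.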

8
LOAD OF STRUCTURES
Structures like this:
    typedef struct point_ {
        int x;
        int y;
    } point;
    point* points;   /* points[0].x, points[0].y, points[1].x, points[1].y, ... */
_mm_load_ only takes consecutive memory!
What does the data look like inside the XMM register?
How to achieve the following using the SSE and AVX ISA?
    In memory:  X0 Y0 X1 Y1 X2 Y2 X3 Y3
    Wanted:     X0 X1 X2 X3  and  Y0 Y1 Y2 Y3
Not easy!
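For four points, one way to get this layout with SSE2 alone is two 128-bit loads followed by two `_mm_shuffle_ps` operations. This is a different technique from the AVX permute-and-blend sequence in the slides, shown only to illustrate the rearrangement. The function name `deinterleave4` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

typedef struct point_ {
    int x;
    int y;
} point;

/* Splits four interleaved points into separate x and y arrays,
   converting the integers to floats along the way. */
static void deinterleave4(const point *p, float xs[4], float ys[4])
{
    /* lo = [x0 y0 x1 y1], hi = [x2 y2 x3 y3] */
    __m128 lo = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)p));
    __m128 hi = _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)(p + 2)));

    /* Even elements of lo and hi are the x values, odd elements the y values. */
    _mm_storeu_ps(xs, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)));
    _mm_storeu_ps(ys, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)));
}
```

Storing the coordinates as two separate arrays in the first place (structure of arrays) would avoid the shuffles entirely, at the cost of changing the data layout.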

9
PERMUTE AND BLEND
    /* &points[4 * i] must be 32-byte aligned for _mm256_load_si256. */
    (1)  __m256i temp = _mm256_load_si256((__m256i*) &points[4 * i]);
    (2)  __m256 temp2 = _mm256_cvtepi32_ps(temp);
    (3)  __m256i mask1 = _mm256_setr_epi32(0,2,1,3, 1,3,0,2);   /* per-element selectors within each 128-bit lane */
    (4)  __m256 temp3 = _mm256_permutevar_ps(temp2, mask1);
    (5)  __m256 temp4 = _mm256_permute2f128_ps(temp3, temp3, 0x01);   /* swap the two lanes */
    (6)  temp3 = _mm256_blend_ps(temp3, temp4, 0x33);   /* 0b00110011: elements 0,1,4,5 come from temp4 */
    (7)  __m256i mask2 = _mm256_setr_epi32(2,3,0,1, 0,1,2,3);
    (8)  temp3 = _mm256_permutevar_ps(temp3, mask2);
    (9)  __m128 m1 = _mm256_extractf128_ps(temp3, 1);
    (10) __m128 m2 = _mm256_extractf128_ps(temp3, 0);
Register contents (lowest element first; | separates the two 128-bit lanes):
    after (2),  temp2:  X0 Y0 X1 Y1 | X2 Y2 X3 Y3
    after (4),  temp3:  X0 X1 Y0 Y1 | Y2 Y3 X2 X3
    after (5),  temp4:  Y2 Y3 X2 X3 | X0 X1 Y0 Y1
    after (6),  temp3:  Y2 Y3 Y0 Y1 | X0 X1 X2 X3
    after (8),  temp3:  Y0 Y1 Y2 Y3 | X0 X1 X2 X3
    after (9),  m1:     X0 X1 X2 X3
    after (10), m2:     Y0 Y1 Y2 Y3
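When eight points are processed at a time, the rearrangement needs fewer steps per point and fills both YMM outputs completely. The sketch below is an alternative to the sequence on this slide, not the author's method: two 256-bit loads, two `_mm256_permute2f128_ps` lane regroupings, and two `_mm256_shuffle_ps` operations. It uses only AVX1 instructions. The function name `deinterleave8_avx` is hypothetical, and the `target("avx")` attribute is a GCC/Clang extension used here so the function builds without `-mavx`; callers are assumed to check for AVX support before calling it.

```c
#include <immintrin.h>  /* AVX intrinsics */

typedef struct point_ {
    int x;
    int y;
} point;

/* Splits eight interleaved points into separate x and y arrays (as floats).
   Requires a CPU with AVX; unaligned loads and stores are used throughout. */
__attribute__((target("avx")))
static void deinterleave8_avx(const point *p, float xs[8], float ys[8])
{
    /* a = [x0 y0 x1 y1 | x2 y2 x3 y3], b = [x4 y4 x5 y5 | x6 y6 x7 y7] */
    __m256 a = _mm256_cvtepi32_ps(_mm256_loadu_si256((const __m256i *)p));
    __m256 b = _mm256_cvtepi32_ps(_mm256_loadu_si256((const __m256i *)(p + 4)));

    /* Regroup 128-bit lanes so that each lane pair covers four consecutive points:
       lo = [x0 y0 x1 y1 | x4 y4 x5 y5], hi = [x2 y2 x3 y3 | x6 y6 x7 y7] */
    __m256 lo = _mm256_permute2f128_ps(a, b, 0x20);
    __m256 hi = _mm256_permute2f128_ps(a, b, 0x31);

    /* Within each lane, even elements are x values and odd elements are y values. */
    _mm256_storeu_ps(xs, _mm256_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)));
    _mm256_storeu_ps(ys, _mm256_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)));
}
```

Whether this is faster than a scalar loop depends on how much work follows the rearrangement; for a short computation the shuffles can still dominate.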

10
SIMULATION RESULTS
Case II finds not only the min/max values but also their positions
Loading structures adds too much overhead

11
CONCLUSION AND FUTURE WORK
OpenCV is suitable for SSE or AVX acceleration
A single, self-contained task has a better chance of getting a speedup
Loading and arranging a structure is a really cumbersome task
Hints for smart automated compilation (such as for loading structures)
Suggestions for expanding the ISA (introducing new instructions)
