Published by Alfred Hight. Modified over 9 years ago.
1
Intel® Software Development Products
Getting Existing Code to Auto-Vectorize with Intel® Composer XE
Alex Wells and Brandon Hewitt
March 10, 2011
Copyright © 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
2
Before We Start
All webinars are recorded and will be available on Intel Learning Lab within a week
  – http://www.intel.com/go/learninglab
Use the Question Module to ask questions
If you have audio issues, give it 5 seconds to resolve.
  – If audio issues persist, we suggest:
  – Drop and reenter the webinar
  – Call into the phone bridge
3
How Intel® Composer XE can improve your application performance
Intel® Composer XE is the latest compiler offering from Intel for 32-bit and 64-bit applications.
What are SIMD instructions and how do they help performance
What is auto-vectorization and how does it help generate optimized SIMD code
What are the typical obstacles to using the auto-vectorizer and how do I resolve them
What’s improved in the vectorizer in Composer XE
Skinning Kernel Example
Using high-level objects while still vectorizing
Using Strided Array Access to Vectorize Arrays of Structures
Combining these Techniques into a Kernel Template Library
Performance with Intel® Advanced Vector Extensions
More options and info
4
SIMD: Single Instruction, Multiple Data
Scalar mode
  – one instruction produces one result
SIMD processing
  – With Intel® Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX) instructions
  – one instruction can produce multiple results
[Figure: in scalar mode, X + Y yields a single result; in SIMD mode, x0..x7 and y0..y7 are added in one instruction to give x0+y0 .. x7+y7]
5
Automatic Vectorization by Compiler Translates Loops into SIMD Parallelism
The loop is strip-mined (unrolled), with a strip length of
  – 8 for floats with Intel® AVX
  – 4 for floats with Intel® SSE

    for (i=0;i<=MAX;i++)
        c[i]=a[i]+b[i];

[Figure: 128-bit registers; A[7]..A[0] and B[7]..B[0] are added element by element to produce C[7]..C[0]]
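The strip-mining transformation described on this slide can be pictured by doing it by hand. The sketch below is an illustration only, not the compiler's actual output: `add_scalar`, `add_stripmined` and the strip length `kStrip` are names invented for this example, and the fixed-length inner loop merely stands in for what a single SIMD add would do.

```cpp
#include <cstddef>
#include <vector>

// Illustrative strip length: 4 floats fit in one 128-bit SSE register.
constexpr std::size_t kStrip = 4;

// Reference version: one element per iteration.
void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Hand strip-mined version: the main loop advances kStrip elements at a
// time, and a remainder loop handles the leftover n % kStrip elements.
void add_stripmined(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + kStrip <= n; i += kStrip) {
        // This fixed-length inner loop models one SIMD add.
        for (std::size_t lane = 0; lane < kStrip; ++lane)
            c[i + lane] = a[i + lane] + b[i + lane];
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];  // scalar remainder
}
```

Both functions produce the same output for any `n`; the strip-mined form only regroups the iterations so that each group maps onto one vector operation.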
6
Did my loop auto-vectorize?

    icc -vec-report1 -c Multiply.c
    Multiply.c(31): (col. 3) remark: LOOP WAS VECTORIZED.

Vectorization and reports are enabled only at -O2 and above
Intel® SSE2 enabled by default
Intel® AVX enabled with /Q[a]xAVX, -[a]xAVX
  – Read the documentation to find the option that works best for you
If you use /arch:IA32 or -mia32, vectorizer is disabled.
/Qvec-reportN or -vec-reportN options enable reports
  – N is a number from 0-5
  – 1 reports any loops vectorized
  – Recommend 3 to get information on loops vectorized and not vectorized and why
7
Obstacles to Successful Vectorization: Pointer Aliasing and Loop iteration dependencies
Vectorization entails changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once.
Vectorization is only possible if this change of order does not change the results of the calculation.
One major cause can be the assumption of pointer aliasing

    void add(char *cp_a, char *cp_b, int n) {
      for (int i = 0; i < n; i++) {
        cp_a[i] += cp_b[i];
      }
    }

The compiler must by default assume cp_a and cp_b can overlap in memory, thus causing potential loop dependencies (e.g. a write to cp_a[1] might affect the read from cp_b[12])
For simple cases like this one, the compiler can insert checks at runtime for aliasing and still vectorize
  – Extra checks can impact performance
  – Difficult to resolve for more complicated algorithms
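To see why overlap forces the compiler to be conservative, it helps to compare the scalar order with a vector-like order on overlapping buffers. This is a sketch under stated assumptions: `add_scalar`, `add_blocked` and `kLanes` are hypothetical names, and the two inner loops of `add_blocked` only imitate a vector load followed by a vector add and store.

```cpp
constexpr int kLanes = 4;  // illustrative SIMD width

// Scalar order, as in the add() function on the slide.
void add_scalar(char* cp_a, const char* cp_b, int n) {
    for (int i = 0; i < n; i++) cp_a[i] += cp_b[i];
}

// Models a vectorized loop: each block of kLanes source elements is
// loaded in full before any result of that block is stored.
void add_blocked(char* cp_a, const char* cp_b, int n) {
    int i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        char tmp[kLanes];
        for (int l = 0; l < kLanes; ++l) tmp[l] = cp_b[i + l];   // "vector load"
        for (int l = 0; l < kLanes; ++l) cp_a[i + l] += tmp[l];  // "vector add and store"
    }
    for (; i < n; ++i) cp_a[i] += cp_b[i];  // scalar remainder
}
```

Called on disjoint buffers the two versions agree. Called with `cp_a = buf + 1` and `cp_b = buf`, the scalar version feeds each freshly written value into the next iteration while the blocked version does not, so the results differ; this is the dependence the compiler has to rule out.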
8
Solutions for Loop Dependencies and Pointer Aliasing
#pragma ivdep
  – Just place before the loop, and compiler will assume no dependencies unless it can prove otherwise.
  – Compiler may still be able to prove dependencies which will still cause auto-vectorization problems
  – Applied per loop
restrict keyword
  – Use with pointer declarations to assert that memory is only referenced through that pointer
  – Requires extra compiler option (/Qrestrict or -restrict, or C99)
  – Applied per pointer
Also compiler options to affect compiler assumptions about loop aliasing
  – /Qansi-alias, /Oa, /Qalias-args- (Windows*)
  – -ansi-alias, -fno-alias, -fargument-noalias (Linux*/Mac OS*)
Note that any of these changes may result in incorrect code if the assumptions being changed are not correct (e.g. your pointers do overlap)
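A minimal sketch of the two source-level solutions follows. It makes some assumptions: the function names are invented, the `__restrict` spelling is used because `restrict` itself is a C99 keyword and many C++ compilers accept `__restrict` as an extension, and `#pragma ivdep` is guarded so that only the Intel compiler sees it.

```cpp
#include <cstddef>

// Per-pointer promise: inside this function, memory reached through dst
// is not reached through src. Passing overlapping buffers breaks the
// promise and may give wrong results.
void add_restrict(float* __restrict dst, const float* __restrict src,
                  std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}

// Per-loop hint: ask the compiler to ignore assumed (unproven) dependences.
void add_ivdep(float* dst, const float* src, std::size_t n) {
#if defined(__INTEL_COMPILER)
#pragma ivdep
#endif
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}
```

The `restrict` form travels with the pointer declaration, so it covers every loop in the function; the pragma covers only the loop that follows it.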
9
Obstacles to Successful Vectorization: Function Calls
Vectorization cannot occur when function calls are made in the loop
  – Can be hard to recognize that calls are happening, especially in C++
Solutions
  – Inline wherever possible
  – Use __forceinline on function definitions
  – Use #pragma forceinline recursive to inline all calls (and nested calls) in a statement
  – Use __declspec(vector) on function declarations and definitions to safely auto-vectorize them
  – Part of Intel® Cilk™ Plus
  – Many standard math library functions already vectorize
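As a small illustration of the "inline wherever possible" advice, the sketch below calls a helper from a loop and requests forced inlining. `KERNEL_INLINE`, `scale_and_bias` and `transform` are hypothetical names; the macro is only an assumption about how one might map `__forceinline` onto compilers that spell it differently.

```cpp
#include <cstddef>

// Hypothetical portability macro: __forceinline is the spelling used on the
// slide; GCC-compatible compilers use an attribute instead.
#if defined(_MSC_VER)
#define KERNEL_INLINE __forceinline
#elif defined(__GNUC__)
#define KERNEL_INLINE inline __attribute__((always_inline))
#else
#define KERNEL_INLINE inline
#endif

// Small helper that must disappear into the loop body for the loop to
// remain a candidate for vectorization.
KERNEL_INLINE float scale_and_bias(float x, float scale, float bias) {
    return x * scale + bias;
}

void transform(const float* in, float* out, std::size_t n,
               float scale, float bias) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = scale_and_bias(in[i], scale, bias);
}
```

Once the helper is inlined, the loop body is plain arithmetic on `in[i]` and `out[i]`, which is the form the vectorizer can analyze.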
10
Obstacles to Successful Vectorization: Not Enough Work
Compiler may decide that vectorizing a loop will not generate more efficient code.
Solution
  – Use #pragma vector always to override compiler heuristics
  – Note that compiler will still not vectorize if it determines it’s unsafe or illegal to do so
  – Can use #pragma vector always assert to give a compile-time error if loop does not vectorize
  – Be sure that vectorization does improve performance before use
11
Obstacles to Successful Vectorization: C++ STL Data Types
Can be done, but you need to help the compiler

    $ cat vec-stl-1.cpp
    #include <vector>
    std::vector<double> XX, YY;

    void foo(int iters, int x, int y)
    {
      for (int i=0; i < iters; i++) {
        XX[x+i] += YY[y+i];
      }
    }
    $ icc -vec-report3 -c vec-stl-1.cpp
    vec-stl-1.cpp(6): (col. 3) remark: loop was not vectorized: existence of vector dependence.
    vec-stl-1.cpp(7): (col. 7) remark: vector dependence: assumed FLOW dependence between _M_start line 7 and _M_start line 7.

Some similar problems with loop dependence (due to STL implementation)
Also problem with compiler not recognizing global addresses as invariant
12
Solution for Enabling Auto-vectorization

    $ cat vec-stl-2.cpp
    #include <vector>
    std::vector<double> XX, YY;

    void foo(int iters, int x, int y)
    {
      double *x1 = &XX[x];
      double *y1 = &YY[y];
    #pragma ivdep
      for (int i=0; i < iters; i++) {
        x1[i] += y1[i];
      }
    }
    $ icc -vec-report3 -c vec-stl-2.cpp
    vec-stl-2.cpp(9): (col. 3) remark: LOOP WAS VECTORIZED.
13
Obstacles to Successful Vectorization: Data Alignment
Data alignment (to 16 bytes) was a major issue for Intel® SSE performance prior to Intel® Core™ i7
It’s still a potential performance issue, as the compiler will generate runtime checks for alignment
It will become important again in the next generation of Core i7 (with Intel® AVX instructions), this time to 32 bytes
Two places where explicit coding is needed
  – When declaring new pointers
  – When declaring function arguments
14
Data Alignment Solutions
For new arrays:
  – __declspec(align(N)) type name[bounds];
For new dynamically allocated memory:
  – type * p = (type*) _mm_malloc(size, N);
  – _mm_free(p);
  – Threading Building Blocks memory allocator
  – scalable_aligned_malloc / scalable_aligned_free
  – Can also overload new/delete operators for classes
For function arguments:
  – __assume_aligned(name, N);
For specific loops:
  – #pragma vector aligned/unaligned
Can cause runtime exceptions if assumption is false
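For readers without the Intel-specific spellings at hand, standard C++11 `alignas` can express the same intent as `__declspec(align(N))` for new arrays. The sketch below is an assumption-laden illustration, not the slide's own code: `g_samples`, `Vec4` and `is_aligned` are invented names, and the helper simply checks an address against a boundary before an aligned code path is trusted.

```cpp
#include <cstddef>
#include <cstdint>

// Standard C++11 counterpart of __declspec(align(32)) float g_samples[256];
alignas(32) float g_samples[256];

// A type whose every instance starts on a 16-byte boundary.
struct alignas(16) Vec4 {
    float v[4];
};

// Returns true when p sits on a multiple of `boundary` bytes.
inline bool is_aligned(const void* p, std::size_t boundary) {
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}
```

A check like `is_aligned(ptr, 32)` is cheap insurance before promising alignment to the compiler, since a false alignment promise can fault at runtime.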
15
Improvements in Intel® Composer XE (compared to previous versions of Intel® C++ Compiler)
Mixed data types
  – Loops containing mixed data types (such as float and double) will now auto-vectorize

    $ cat mix-data.c
    void foo(int n, float *restrict A, double *restrict B){
      int i;
      float t = 0.0f;
      for (i=0; i<n; i++) {
        A[i] = t;
        B[i] = t;
        t += 1.0f;
      }
    }

11.1 Update 7:

    $ icc -vec-report3 -restrict -c mix-data.c
    mix-data.c(4): (col. 3) remark: loop was not vectorized: unsupported data type.

12.0 Update 2:

    $ icc -vec-report3 -restrict -c mix-data.c
    mix-data.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
16
Improvements in Intel® Composer XE (compared to previous versions of Intel® C++ Compiler)
Support for more complicated nested if statements

    $ cat vec-if.c
    void foo(int n, double *A, double *B, double *C){
      int i;
    #pragma ivdep
      for (i=0; i<n; i++){
        if (A[i] > 0 || B[i] < 0) {
          C[i] = 0;
        } else {
          C[i] = 1;
        }
      }
    }

11.1 Update 7:

    $ icc -vec-report3 -c vec-if.c
    vec-if.c(9): (col. 7) remark: loop was not vectorized: statement cannot be vectorized.

12.0 Update 2:

    $ icc -vec-report3 -c vec-if.c
    vec-if.c(4): (col. 3) remark: LOOP WAS VECTORIZED.
17
Application of these Techniques in Skinning Kernel Example
18
For Auto-Vectorization, data needs to be accessed component wise as arrays
Skinning Kernel Example: compute influence of a joint over a set of vertices.

    out[i] = (offset[i]*joint)*normalizedWeight[i];

Array per component of Input And Output data streams
Expand high level math to be performed per component

    #pragma vector always assert
    #pragma ivdep
    for(unsigned int i=0; i < count; ++i) {
        const float x = offsetX[i];
        const float y = offsetY[i];
        const float z = offsetZ[i];
        const float nw = normalizedWeight[i];
        outX[i] = (x * joint.m[0][0] + y * joint.m[1][0] + z * joint.m[2][0] + joint.m[3][0]) * nw;
        outY[i] = (x * joint.m[0][1] + y * joint.m[1][1] + z * joint.m[2][1] + joint.m[3][1]) * nw;
        outZ[i] = (x * joint.m[0][2] + y * joint.m[1][2] + z * joint.m[2][2] + joint.m[3][2]) * nw;
    }
19
You can express a kernel’s algorithm using higher level objects, like Point or Matrix and still Auto-Vectorize!
Within a Kernel Body
  – Local objects can be created on the stack
  – Only arrays external to the kernel body need to be accessed by control loop’s index
  – Fixed offset Data Members can be accessed.
  – Methods may be called as long as they inline
Auto-Vectorization will still work!

    float length = offset.length();

    Vector3 scaledOffset = offset*scale;

    const Vector3 vertex(offsetX[i], offsetY[i], offsetZ[i]);

    float lengthSquared = (offset.x*offset.x) + (offset.y*offset.y) + (offset.z*offset.z);
    float length = sqrt(lengthSquared);
20
Define high level helper classes for use inside the kernel body
All helper classes & methods inline code to the stack
Compiler optimizes out needless copies
  – By the time the auto-vectorizer gets to it, the code resembles the long hand per component version
Algorithm Expressed using High Level Objects

    const Matrix4x3 joint(usersJoint);
    #pragma vector always assert
    #pragma ivdep
    for(unsigned int i=0; i < count; ++i) {
        const Vector3 offset(offsetX[i], offsetY[i], offsetZ[i]);
        const float nw = normalizedWeight[i];
        const Point3 out = (offset*joint)*nw;
        outX[i]=out.x();
        outY[i]=out.y();
        outZ[i]=out.z();
    }
21
Anything special about the helper classes?
Flat data layout
All methods are __forceinline
Otherwise plain old C++

    struct Matrix4x3 {
    private:
        float m00; float m01; float m02;
        float m10; float m11; float m12;
        float m20; float m21; float m22;
        float m30; float m31; float m32;
    public:
        //... implement methods
    };

vs.

    struct Matrix4x3 {
    private:
        float m[4][3];
    public:
        //... implement methods
    };

    __forceinline Vector3 Vector3::operator*(float aScalar)
    {
        x() *= aScalar;
        y() *= aScalar;
        z() *= aScalar;
        return *this;
    }
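The helper-class idea can be tried with a very small stand-in. The sketch below is not the presenters' `Vector3`; it is a hypothetical minimal version with public members and a non-mutating `operator*`, assumed here only to show a flat layout, methods defined in the class body (so they are implicitly inline), and a kernel loop that builds the object from per-component arrays.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical minimal helper: three floats, no pointers, no virtuals.
struct Vector3 {
    float x, y, z;

    Vector3(float ax, float ay, float az) : x(ax), y(ay), z(az) {}

    // Returns a scaled copy; defined in-class, hence implicitly inline.
    Vector3 operator*(float s) const { return Vector3(x * s, y * s, z * s); }

    float length() const { return std::sqrt(x * x + y * y + z * z); }
};

// Kernel loop: only the external arrays are indexed by i; everything else
// is a local object on the stack.
void scaled_lengths(const float* offsetX, const float* offsetY,
                    const float* offsetZ, float scale,
                    float* out, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const Vector3 offset(offsetX[i], offsetY[i], offsetZ[i]);
        const Vector3 scaledOffset = offset * scale;
        out[i] = scaledOffset.length();
    }
}
```

After inlining, the loop body reduces to a handful of multiplies, adds and a square root on `offsetX[i]`, `offsetY[i]` and `offsetZ[i]`, which is the long-hand form the earlier slides show.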
22
Multiple Arrays: getting data to the “vector” SSE registers in the CPU
SIMD lanes (registers) must be loaded with the same logical kernel value from multiple elements.
Best is linear memory access (contiguous array per component) - 1 instruction to load
Multiple arrays can lose cache locality and pressure hardware prefetchers and the TLB vs. single AOS

    C[i] = A[i] + B[i];

[Figure: Array A (A0..A15) and Array B (B0..B15) are loaded four elements at a time into the CPU SSE registers XMM0 and XMM1 and added; the sums A0+B0 .. A7+B7 are stored to Array C (C0..C15)]
23
Tiled SOA: A Hybrid Data Layout
We can bring back cache locality by changing the data layout to an Array Of Structures Of Fixed Length Arrays
  – Let’s just call it “Tiled SOA”.
  – Requires outer loop to walk through tiles while inner loop vectorizes over the tile’s Width
A Single Data Stream made up of Blocks of Fixed Sized Arrays.
  – Provides linear memory access pattern

    const size_t Width = 4;
    struct InputTile {
        float a[Width];
        float b[Width];
    };
    InputTile inStreamOfTiles[2500];

    for(size_t tileIndex=0u; tileIndex < 2500; ++tileIndex) {
        const InputTile & tile = inStreamOfTiles[tileIndex];
        const float * A = tile.a;
        const float * B = tile.b;
        float * C_forTile = &C[tileIndex*Width];
        #pragma ivdep
        for(size_t index=0u; index < Width; ++index) {
            C_forTile[index] = A[index] + B[index];
        }
    }

[Figure: the tile stream in memory is A0..A3, B0..B3, A4..A7, B4..B7, A8..A11, B8..B11, A12..A15, B12..B15; each group of four is loaded into the CPU SSE registers XMM0 and XMM1]
24
Array of Structures: Use Strided Array Access to pack component data into SIMD lanes
If A and B were in AOS format…

    struct Input {
        float a;
        float b;
    };
    Input in[10000];

    C[i] = in[i].a + in[i].b;

Keep a pointer to each component
Use strided array access to operate on the correct component within the elements

    float *A = &in[0].a;
    float *B = &in[0].b;
    const unsigned int Stride=2;

    C[i] = A[i*Stride] + B[i*Stride];

[Figure: Array in holds A0, B0, A1, B1, ... A7, B7 interleaved; the A and B components are gathered into the CPU SSE registers XMM0 and XMM1, added, and the sums A0+B0 .. A3+B3 are stored to Array C (C0..C7)]
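The strided-access snippets on this slide can be assembled into one compilable function. This is a sketch under assumptions: the function name `add_aos_strided` and the `n` parameter are invented, and the technique relies on `Input` being a plain struct of two floats with no padding, which the `static_assert` checks.

```cpp
#include <cstddef>

struct Input {
    float a;
    float b;
};

// The stride arithmetic below is only valid when Input is exactly two floats.
static_assert(sizeof(Input) == 2 * sizeof(float), "unexpected padding in Input");

// Adds the a and b members of each element using one pointer per component
// and a fixed stride measured in floats.
void add_aos_strided(const Input* in, float* C, std::size_t n) {
    const float* A = &in[0].a;
    const float* B = &in[0].b;
    const std::size_t Stride = sizeof(Input) / sizeof(float);  // 2 here
    for (std::size_t i = 0; i < n; ++i) {
        C[i] = A[i * Stride] + B[i * Stride];
    }
}
```

Written this way, each component looks to the compiler like an ordinary array with a constant stride, which it can load into SIMD lanes with strided or shuffled loads.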
25
Improving AOS access
Maintaining a pointer per component increases register pressure as different pointers are used to access each component.
Alternatively, the data accesses could be defined in terms of a single pointer + fixed offset to the data member of the structure + (stride of structure * index)
stddef.h contains a macro that computes the fixed offset to a structure’s data member at compile time: offsetof(structure, member)

    typedef unsigned char BYTE;
    const BYTE * inArray = reinterpret_cast<const BYTE *>(&in[0]);
    for(size_t index=0u; index < ...
        float localA = *(reinterpret_cast<const float *>(pointerA));
        float localB = *(reinterpret_cast<const float *>(pointerB));
        C[index] = localA + localB;
    }

Would be nice to have a template library that simplifies accessing AOS and Tiled SOA data for vectorization…
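The single-pointer scheme described on this slide (base pointer + fixed member offset + structure stride * index) can be sketched as follows. The function name `add_aos_offsets`, the `count` parameter and the exact way `pointerA` and `pointerB` are computed are assumptions made for this illustration, guided by the slide's description and its use of `offsetof`.

```cpp
#include <cstddef>  // offsetof, std::size_t

struct Input {
    float a;
    float b;
};

typedef unsigned char BYTE;

// Reads both members of every element through one byte pointer.
void add_aos_offsets(const Input* in, float* C, std::size_t count) {
    const BYTE* inArray = reinterpret_cast<const BYTE*>(in);
    for (std::size_t index = 0u; index < count; ++index) {
        // Start of the current element: base + stride * index.
        const BYTE* element = inArray + sizeof(Input) * index;
        // Member addresses: element start + compile-time fixed offset.
        const BYTE* pointerA = element + offsetof(Input, a);
        const BYTE* pointerB = element + offsetof(Input, b);
        const float localA = *reinterpret_cast<const float*>(pointerA);
        const float localB = *reinterpret_cast<const float*>(pointerB);
        C[index] = localA + localB;
    }
}
```

Only `inArray` has to stay live across iterations; the offsets and the stride are compile-time constants, which is the register-pressure saving the slide refers to.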
26
Kernel Template Library (KTL)
Express algorithms with High level helper objects
Maps accessing multiple arrays into high level objects
Hides accessing AOS as Strided Arrays for the user
Support Tiled SOA data streams
Apply the kernel to each element of the data streams in a data parallel manner
  – vectorized,
  – threaded,
  – vectorized + threaded,
  – or serial (for non vectorizing compilers or comparison)
Utilizes only C++0x standard features allowing other compilers to still compile the code.
Generate code that optimizes well
27
Kernel Code that Optimizes Well
GOAL: layout code to give the compiler all available information, enabling it to perform better optimizations.
Put the constants, loop control, and kernel body on the same stack level with no interruptions.
  – A function call interrupts the optimizer’s view of the stack.
No Function calls
  – kernel body must inline inside the control loop
  – All functions or methods called must inline
Constants and variables used inside the kernel must be declared on the stack
  – A reference or pointer to a constant isn’t good enough
  – The compiler can’t assume it hasn’t changed in-between iterations
  – With a local stack instance the compiler can prove it wasn’t changed and only load a register once
Identify array data as independent
28
Code Layout that Optimizes Well

    #pragma forceinline recursive
    {
        // Define the context for the kernel body here
        // local stack variables for all constants
        Constant1 constant1;
        Constant2 constant2;
        //...
        const Array1 * array1;
        Array2 * array2;
        //...
        // Loop control structure
        #pragma ivdep
        for(size_t index=0; index < ElementCount; ++index)
        {
            // define kernel body here, using only context variables
            // declared on the stack above
            array2[index] = array1[index]*constant1 + constant2;
        }
    }

All of the objects and helpers need to inline for auto-vectorization to work
Identify array data as independent
No function calls between local declarations and loop to interrupt stack
29
C++0x Lambda Functions with a template function can generate such a Code Layout

    // User code defines kernel as lambda function that operates
    // on a single element
    auto kernel = [=](const size_t &iIndex) {
        array2[iIndex] = array1[iIndex]*constant1 + constant2;
    };
    // template library function generates the code for loop control
    layoutCodeForKernel(kernel, ElementCount);

    template <typename KernelT>
    __declspec(noinline) void layoutCodeForKernel(const KernelT &iKernel, const size_t iElementCount)
    {
        #pragma forceinline recursive
        {
            // Copy kernel closure onto the local stack
            KernelT localKernel(iKernel);
            // Loop control structure
            #pragma ivdep
            for(size_t index=0u; index < iElementCount; ++index) {
                localKernel(index);
            }
        }
    }

Use “auto” keyword to hide type of the right hand side
Use Lambda Function with capture by value to define the body of Kernel
Any external variables accessed will be implicitly captured by value inside the Lambda Function’s closure
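The lambda-plus-template pattern on this slide compiles with any C++11 compiler once the Intel-specific pieces are left out. The sketch below keeps the slide's `layoutCodeForKernel` shape but drops `__declspec(noinline)`, `#pragma forceinline recursive` and `#pragma ivdep`; the wrapper `scaleAndOffset` and its parameter names are assumptions added so the example stands alone.

```cpp
#include <cstddef>

// Generates the loop control around a kernel that handles one element.
template <typename KernelT>
void layoutCodeForKernel(const KernelT& iKernel, const std::size_t iElementCount) {
    // Copy the kernel closure onto the local stack so the captured
    // constants are local to this frame.
    KernelT localKernel(iKernel);
    for (std::size_t index = 0u; index < iElementCount; ++index) {
        localKernel(index);
    }
}

// User code: the kernel is a lambda that captures its context by value.
void scaleAndOffset(const float* array1, float* array2,
                    std::size_t elementCount,
                    float constant1, float constant2) {
    auto kernel = [=](const std::size_t& iIndex) {
        array2[iIndex] = array1[iIndex] * constant1 + constant2;
    };
    layoutCodeForKernel(kernel, elementCount);
}
```

Capturing by value copies the pointers and constants into the closure, so after inlining the compiler sees the same "constants, loop control and kernel body on one stack level" layout as the hand-written version on the previous slide.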
30
Kernel Template Library Example
Using KTL to vectorize our Skinning example when data is in Array of Structures (AOS) format.

    struct InputElement {
        Vector3 offset;
        float weight;
    };
    struct OutputElement {
        Vector3 position;
    };
    const size_t elementCount = 10000;
    InputElement containerInput[elementCount];
    OutputElement containerOutput[elementCount];
    const Matrix4x3 * joint;
31
Kernel Template Library Example
Using KTL to vectorize our Skinning example when data is in Array of Structures (AOS) format.

    const InputElement * inArray = &containerInput[0];
    OutputElement * outArray = &containerOutput[0];
    const Matrix4x3 localJoint(*joint);
    auto kernel = [=](const ElementAccess &access) {
        // Access array pointers as ktl elements
        // (ktl elements encapsulate strided arrays and loop control indexes)
        auto inElem = access.inputElementOfAos(inArray);
        auto outElem = access.outputElementOfAos(outArray);
        // Get all input data out of the element(s) into local high level objects
        Vector3 localOffset;
        float localWeight;
        inElem.get (localOffset);
        inElem.get (localWeight);
        // Express algorithm using high level objects
        const Vector3 result = (localOffset*localJoint)*localWeight;
        // Set all output element(s) with the results
        outElem.set (result);
    };
    ktl::Engine::vectorizeKernel(kernel, elementCount);

ElementAccess will create elements that encapsulate getting or setting data for current iteration
Use C++0x keyword “auto” to hide template types
Use C++0x Lambda Function to define body of Kernel
Get the element data onto the stack in terms of High Level Objects
Algorithm Expressed using High Level Objects
Store the results out in terms of High Level Objects
Ask the Engine to vectorize the kernel body
32
Kernel Template Library Example
Can use MACROS to simplify

    KTL_KERNEL_BEGIN(kernel) {
        KTL_ELEM_IN_AOS(inElem, inArray)
        KTL_ELEM_OUT_AOS(outElem, outArray);
        KTL_LOCAL_FROM_MEMBER(localOffset, inElem, offset)
        KTL_LOCAL_FROM_MEMBER(localWeight, inElem, weight)
        const Vector3 result = (localOffset*localJoint)*localWeight;
        KTL_SET_MEMBER_WITH(outElem, position, result)
    } KTL_KERNEL_END
    ktl::Engine::vectorizeKernel(kernel, elementCount);
33
Intel® AVX performance
Automatic CPU dispatch to add Intel® AVX code paths
  – /QaxAVX /Qdiag-enable:cpu-dispatch
Weighted Joint Deformation
256 elements (fits in L1), 1000000 iterations
Test Platform
Intel® Core™ i7 2nd generation processor [Sandy Bridge]
  – 3.10 GHz, 6 MB L3, 4 GB RAM, Windows Server 2008* R2
Compiled with Intel® C++ Composer XE [Version 12.0.0.104 for Intel® 64]
Microsoft Visual Studio 2010* Release Config plus: /O3 /Oi /Ot /fp:fast /QaxAVX
34
What’s Next for Kernel Template Library?
Containers
  – Runtime Tiled SOA defined and accessed by AOS
  – Take on managing the tiled structure of arrays, allowing the user to pretend it’s an AOS for setup/editing and when defining a kernel, while under the hood it will be Tiled SOA.
More Data Types for use in Kernel
  – Current data types are just for example purposes
Gather/Scatter or other staging data streams
  – Gather and scatter currently don’t vectorize, so an additional stage needs to be introduced to gather/scatter to/from an intermediate buffer before the rest of the kernel can vectorize.
35
More Options and Information
Lots of other ways to get the benefits of vectorization and SIMD instructions
  – Automatic CPU Dispatch (/Qax, -ax)
  – Manual CPU Dispatch (APIs to dispatch multiple versions of specific functions explicitly)
  – Intel® Cilk™ Plus Array Notations and #pragma simd
  – Intel® Integrated Performance Primitives and Math Kernel Library
KTL is available at http://software.intel.com/en-us/articles/kernel-template-library/
More presentations at http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
Go to http://software.intel.com for more
36
Optimization Notice
Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel Compiler User and Reference Guides” under “Compiler Options.” Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors. Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel Streaming SIMD Extensions 3 (Intel SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision #20110228
37
Legal Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548- 4725, or go to: http://www.intel.com/design/literature.htm Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. 
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/?wapkw=(Samples+Software+License+Agreement) Note: The below disclaimer should be included whenever the general performance disclaimer is used, but should be numbered separately: Configurations: Details on slide 33. For more information go to http://www.intel.com/performance Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number Intel, Core i7 and Cilk Plus are trademarks of Intel Corporation in the U.S. and/or other countries. Copyright © 2011 Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
38
Q&A