Template Library for Vector Loops: A presentation of P0075 and P0076


1 Template Library for Vector Loops: A presentation of P0075 and P0076
Pablo Halpern, Intel Corp.

2 Overview
- What are the goals of vector and parallel extensions?
- An overview of the Parallelism TS
- Summary of proposed index-based loops (P0075)
  - Library syntax for vector and parallel loops based on indexes, not iterators
  - Support for arbitrary reductions and inductions
- Description of proposed vector execution policies (P0076)
  - Range of vector architectures supported
  - Wavefront execution: how vector execution differs from thread parallelism
  - The difference between unseq and vec execution policies

3 What are the goals of vector and parallel extensions?
- Efficient exploitation of modern parallel hardware
  - Multicore processors
  - Vector (SIMD) units
  - GPUs and other coprocessors
- Conformance to the style and tradition of modern C++
- Friendly to programmers already familiar with other parallel-programming systems
- Reasonable conformance to thread and vector progress assumptions

4 Summary of the Parallelism TS (N4507)
A collection of algorithms that can be executed in parallel using one of a set of parallel execution policies. The parallel execution policies defined in N4507 are:
- sequential_execution_policy (seq): no parallelism
- parallel_execution_policy (par): thread-based parallelism
- parallel_vector_execution_policy (par_vec): same as par but with restricted synchronization, allowing use of SIMD vector units
Notably absent: vector_execution_policy (vec), which would give vector order of evaluation.

Example:
  parallel::for_each(parallel::par, v.begin(), v.end(),
                     [&](double& x){ f(x, 9.5); });

5 Index-based loops (P0075): Overview

P0075:
  for_loop(par, 0, n, [&](int i) {
      A[i] = f(B[i], C[2*i]);
  });

OpenMP equivalent:
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
      A[i] = f(B[i], C[2*i]);
  }

6 Strided loops and flow control

P0075:
  for_loop_strided(par, n, 0, -2, [&](int i) {
      if (B[i] < 0) return;   // return from the lambda, i.e. skip to the next iteration
      A[i] = f(B[i], C[2*i]);
  });

OpenMP equivalent:
  #pragma omp parallel for
  for (int i = n; i > 0; i -= 2) {
      if (B[i] < 0) continue;
      A[i] = f(B[i], C[2*i]);
  }

7 Induction variables

P0075:
  int j = 0;
  for_loop(par, 0, n, induction(j, 2),
      [&](int i, int jv) {        // the parameter could reuse the name "j"
          A[i] = f(B[i], C[jv]);
      });
  assert(j == 2*n);

OpenMP equivalent:
  int j = 0;
  #pragma omp parallel for
  for (int i = 0; i < n; ++i, j += 2) {
      A[i] = f(B[i], C[j]);
  }
  assert(j == 2*n);

8 Reduction

P0075:
  float sum = 0.0;
  for_loop(par, 0, n, reduction_plus(sum),
      [&](int i, float& sum) {    // reuses the name "sum"; this is a local (race-free) partial sum
          sum += f(B[i], C[2*i]);
      });

OpenMP equivalent:
  float sum = 0.0;
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < n; ++i) {
      sum += f(B[i], C[2*i]);
  }

9 User-defined Reductions

P0075:
  MyType accum;
  constexpr MyType ident{…};
  MyType op(MyType, MyType);

  for_loop(par, 0, n,
      reduction(accum, ident,     // identity value
                op),              // reduction operation
      [&](int i, auto& accum) {
          accum = op(accum, f(B[i]));
      });

OpenMP equivalent:
  MyType accum;
  constexpr MyType ident{…};
  MyType op(MyType, MyType);

  #pragma omp declare reduction(rop : MyType : omp_out = op(omp_out, omp_in)) initializer(omp_priv = ident)
  #pragma omp parallel for reduction(rop : accum)
  for (int i = 0; i < n; ++i) {
      accum = op(accum, f(B[i]));
  }

(A concrete instantiation of this pattern is sketched below.)
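A minimal sketch of such a concrete instantiation, assuming the P0075 reduction(variable, identity, operation) form shown above; the choice of a product reduction over double and the names times, B, f, and n are illustrative only:

  // A concrete MyType: double, with multiplication as the reduction
  // operation and 1.0 as its identity value.
  double times(double a, double b) { return a * b; }

  double product = 1.0;
  for_loop(par, 0, n,
      reduction(product, 1.0,     // identity value for multiplication
                times),           // reduction operation
      [&](int i, double& accum) {
          accum = times(accum, f(B[i]));   // fold into the local (race-free) partial product
      });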

10 Overview of vector execution policies (P0076)
[Diagram: relationship among the policies seq, par, unseq, par_vec, and vec]
- unsequenced_execution_policy (unseq)
  - relaxed sequencing
  - applicable to STL algorithms
- vector_execution_policy (vec)
  - necessary conditions for classic vector loop execution
  - applicable to for_loop and for_loop_strided
  - allows vectorization of loops with certain dependence patterns
- Both policies use a single OS thread, letting applications avoid disturbing existing threading
(A brief usage sketch follows.)
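The sketch below shows where each policy applies; it assumes the unseq and vec policy objects and the for_each / for_loop interfaces used elsewhere in this deck, with illustrative container and array names:

  // unseq: relaxed sequencing, applicable to STL-style algorithms.
  for_each(unseq, v.begin(), v.end(),
           [](double& x) { x *= 2.0; });

  // vec: classic vector-loop sequencing, applicable to the index-based loops.
  for_loop(vec, 0, n, [&](int i) {
      A[i] = 2.0 * A[i];
  });

Both calls execute on a single OS thread, per the last bullet above.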

11 Vector architectures: “Long vector” machines: Cray (CM1 & CM2), CDC (Star-100)
Step A(i+1) can begin at or before the end of step A(i).
[Diagram: steps A(0)..A(3), B(0)..B(3), C(0)..C(3), D(0)..D(3) overlapping in time]

12 Vector architectures: SIMD: x86 (AVX & SSE), ARM (NEON), Power (AltiVec)
Steps A(i) and A(i+1) execute concurrently in fixed-width registers.
[Diagram: steps A(0)..A(3), B(0)..B(3), C(0)..C(3), D(0)..D(3) packed into vector lanes]

13 Vector architectures: Software pipelining
The compiler orders instructions to maximize use of the CPU pipeline and minimize latency.
[Diagram: steps B(0)..B(3), C(0)..C(3), D(0)..D(3) overlapped across pipeline stages]

14 Wavefront Application (Sequencing for vec)
- The for_loop template applies a function to a sequence of arguments.
- All of the preceding vector architectures execute instructions in a predictable wavefront: no earlier application may fall behind a later application.
- This enables exploitation of “forward dependencies” (illustrated in the sketch below).
- It makes vector_execution_policy safe to use on any loop that can be auto-vectorized.
- The rules as phrased in P0076R0 are complete but complex; a simplification is being investigated.
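A minimal sketch of such a forward dependence, assuming the for_loop interface and the vec policy described in this deck; the array A and the bound n are illustrative:

  // Each iteration reads A[i+1] before writing A[i].
  // Under vec's wavefront sequencing, no later iteration may run ahead of
  // an earlier one, so A[i+1] still holds its original value when
  // iteration i reads it, and the loop can be vectorized safely.
  for_loop(vec, 0, n - 1, [&](int i) {
      A[i] = A[i + 1] + 1;
  });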

15 Wavefront for “Long vector” machines
[Diagram: timeline showing steps A(0)..A(3), B(0)..B(3), C(0)..C(3), D(0)..D(3) advancing as a wavefront over time]

16 Wavefront for SIMD: x86 (AVX & SSE), ARM (NEON), Power (AltiVec)
[Diagram: timeline showing steps B(0)..B(3), C(0)..C(3), D(0)..D(3) advancing as a wavefront over time]

17 Wavefront for Software pipelining
[Diagram: timeline showing steps A(3), B(0)..B(3), C(0)..C(3), D(0)..D(3) advancing as a wavefront over time]

18 vec Covers Gap Between seq and unseq

unseq: loops that work with unseq semantics
  for_loop(unseq, 1, N, [&](int i) {
      V[i] = U[i]*A;
      U[i] = V[i]+B;
  });

vec: loops that work with vector semantics
  for_loop(vec, 1, N, [&](int i) {
      V[i] = U[i+1]*A;
      U[i] = V[i-1]+B;
  });

seq: loops requiring sequential execution
  for_loop(seq, 1, N, [&](int i) {
      V[i] = U[i-1]*A;
      U[i] = V[i+1]+B;
  });

Without vec, the middle loop would either have to run as seq, with the programmer hoping that the auto-vectorizer kicks in, or be fissioned into two loops and pay the bandwidth overhead (a sketch of the fissioned form follows).
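A sketch of that fissioned form, under the same assumptions as the loops above. Each pass is independent across iterations, so both can run with unseq, but U and V are traversed twice, which is the bandwidth cost referred to:

  // Pass 1: compute every V[i] from the original (not yet updated) U values.
  for_loop(unseq, 1, N, [&](int i) { V[i] = U[i+1]*A; });

  // Pass 2: update U using the freshly computed V values.
  for_loop(unseq, 1, N, [&](int i) { U[i] = V[i-1]+B; });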

19 vec_off
Invokes its argument, but sequenced as if the entire invocation were one big instruction.

  extern int* p;
  for_loop(vec, 0, n, [&](int i) {
      y[i] += y[i+1];
      if (y[i] < 0) {
          vec_off([&]{ *p++ = i; });   // the store-and-increment is treated as one indivisible step
      }
  });

20 Vendor Extension via Subclassing

Proposed:
  struct my_policy : vector_execution_policy {
      static const int  safelen             = 8;
      static const bool vectorize_remainder = true;
  };

  for_loop(my_policy(), 0, 1912, [&](int i) {
      Z[i+8] = Z[i]*A;
  });

OpenMP equivalent (without vectorize_remainder):
  #pragma omp simd safelen(8)
  for (int i = 0; i < 1912; ++i) {
      Z[i+8] = Z[i]*A;
  }

The compiler can find these compile-time values knowing just the type of the policy; no interprocedural analysis is required. The vectorize_remainder member is an extension not available in OpenMP. The scheme is extensible in the sense that vendors could specify more members that their compilers would recognize; other compilers would simply ignore the extra members.

21 Possible Future Directions
Algorithms with certain dependence patterns do not prevent vectorization of enclosing algorithms and, depending on the target architecture, may themselves be vectorized (which may or may not be profitable). These vector algorithms are not part of the current proposal; they are future work, consistent with the current proposal.

  // Histogram
  a[b[i]]++;

  // Compress / expand
  if (cond(i)) { a[i] = b[i] * c[j++]; }

22 Why vec Only For for_loop?
- The semantics of vec execution are only well-defined for loops.
- We are not yet sure how to specify them for algorithms; this is a possible area for future work.
- It is not clear that vec has a useful meaning for STL algorithms.
- Nonetheless, it is extremely valuable for for_loop and for_loop_strided.

23 Summary
[Diagram: relationship among the policies par, unseq, par_vec, and vec]
- unseq_execution_policy (unseq): relaxed sequencing; applicable to STL algorithms
- vec_execution_policy (vec): necessary conditions for classic vector loop execution; applicable to for_loop and for_loop_strided
- Both policies use a single OS thread, letting applications avoid disturbing existing threading

24 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

25

26 Alternatives (from N4238)
- Lock-step model
  - Not consistent with seq fallback
- Explicit ordering-point model
  - Warts grew: explicit temporaries and helper functions proliferated
  - Seemed to increase the difficulty of vector programming
- Why mess with decades of success?

27 Examples of the Complications

With wavefront semantics:
  A[i] = 2*A[i+1];
With the explicit ordering-point model:
  auto tmp = A[i + 1];
  parallel::wavefront_ordering_pt();
  A[i] = 2*tmp;
OR
  A[i] = 2*parallel::wavefront_rvalue(A[i + 1]);

With wavefront semantics:
  A[B[i]] = expr;
With the explicit ordering-point model:
  auto tmp = expr;
  auto& ref = A[B[i]];
  parallel::wavefront_off([&]{ ref = tmp; });
OR
  parallel::wavefront_assign(A[B[i]]) = expr;

