Auto-Vectorization of Interleaved Data for SIMD

Presentation transcript:

Auto-Vectorization of Interleaved Data for SIMD
Dorit Nuzman, Ira Rosen, Ayal Zaks
IBM Haifa Research Lab – HiPEAC member, Israel
{dorit, ira, zaks}@il.ibm.com
PLDI 2006

Speaker notes: Used this slide as an opportunity to mention that this is a collaboration between different companies/vendors. Not IBM confidential, but not GPL either. Those who stayed over from yesterday are welcome to stay.

Main Message

- Most SIMD targets support access to packed data in memory ("SIMpD"), but there are important applications which access non-consecutive data.
- We show how a classic compiler loop-based auto-SIMDizing optimization was augmented to support accesses to strided, interleaved data.
- This can serve as a first step to combine traditional loop-based vectorization with (if-converted) basic-block vectorization ("SLP").

Speaker notes: The first two points constitute the problem; emphasize that we are not trying to devise new targets (as we and others did, e.g. eLite's SIMdD), nor to change the application (e.g. data-layout transformations). We focus on optimizing a range of existing applications on a range of existing targets; hence the next point. Point 3 constitutes the solution, for which we used the existing, multi-targeted GCC compiler. Point 4 constitutes an extension / future work.

SIMD: Single Instruction Multiple Data

[Figure: four scalar operations OP(a), OP(b), OP(c), OP(d) are replaced by one vector operation VOP(a, b, c, d) executing on vector register VR1. The data in memory – a b c d e f g h i j k l m n o p – is consecutive ("packed"), so a, b, c, d occupy one vector-register-sized interval.]

Speaker notes: The context of this work is the DSP domain. One of the characteristics of applications in the DSP domain is the abundant parallelism present in the computations they perform: often the same instruction is executed many times, each time on different data. Modern DSP architectures often have special hardware that allows executing the same instruction simultaneously on multiple data elements. Usually the data must be packed in advance into vector registers; the vector instruction then takes such a register as operand and performs the operation on its (4) data elements (SIMD → SIMpD). The process of transforming groups of scalar instructions into vector ones is called vectorization.
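For intuition, here is a minimal C sketch (ours, not from the slides) of what vectorization does, using GCC's generic vector extensions; the type name v4si and the function names are illustrative assumptions:

    #include <string.h>

    /* Four 32-bit ints packed into one 16-byte vector register. */
    typedef int v4si __attribute__((vector_size(16)));

    void add_scalar(int *a, const int *b, const int *c, int n) {
        for (int i = 0; i < n; i++)            /* OP on one element at a time */
            a[i] = b[i] + c[i];
    }

    void add_vector(int *a, const int *b, const int *c, int n) {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {      /* VOP on 4 elements at once */
            v4si vb, vc, va;
            memcpy(&vb, b + i, sizeof vb);     /* consecutive ("packed") load */
            memcpy(&vc, c + i, sizeof vc);
            va = vb + vc;                      /* one SIMD add */
            memcpy(a + i, &va, sizeof va);
        }
        for (; i < n; i++)                     /* scalar epilogue */
            a[i] = b[i] + c[i];
    }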

SIMD: Single Instruction Multiple Data

[Figure, animation frame of the previous slide: the consecutive elements a, b, c, d are highlighted in memory as they are packed into vector register VR1 for VOP(a, b, c, d).]

Vectorizing for a SIMpD Architecture

[Figure, animation frame: the operations OP(a), OP(f), OP(k), OP(p) act on non-consecutive elements of the memory stream a b c d e f g h i j k l m n o p; these elements must first be gathered into vector register VR5 before VOP(a, f, k, p) can execute.]

SIMD: Single Instruction Multiple Data

[Figure: the strided elements a, f, k, p travel through a reorder buffer between memory and the vector registers.]

    mask ← …
    loop: (VR1,…,VR4) ← vload (mem)
          VR5 ← pack (VR1,…,VR4), mask
          VOP (VR5)

That is, consecutive vector loads fill VR1…VR4 with a contiguous interval of memory, and a mask-driven pack operation gathers the desired non-consecutive elements a, f, k, p into VR5 for the vector operation.

Application accessing non-consecutive data – Viterbi decoder (before)

[Figure: add-compare-select butterfly with accesses at stride 1, stride 2, and stride 4; each branch computes + / –, updates the path index by << 1 or << 1|1, and a max-with-index pair (max, sel) records which path each maximum came from.]

Speaker notes: This is a butterfly performing max-with-index, recording where each max came from. Applications may unroll such code by hand, e.g. if the programmer knows the loop bound is even, or give different entries different names (fields of a struct).

Application accessing non-consecutive data – Viterbi decoder (after)

[Figure: the same butterfly, shown after vectorization of the strided (stride-1, stride-2, stride-4) accesses.]
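A minimal C sketch (ours; the array names and the exact metric update are illustrative assumptions) of the kind of butterfly loop these slides depict, mixing strided accesses with the +, –, << 1, << 1|1, max and sel operations:

    /* Hypothetical Viterbi add-compare-select butterfly: old path
     * metrics are read with stride 2; survivors and their selector
     * bits are written out. Real decoders differ in detail, but the
     * strided, interleaved access pattern is what matters here. */
    void acs(const int *old_m, const int *branch, int *new_m, int *sel, int n)
    {
        for (int i = 0; i < n; i++) {
            int m0 = old_m[2*i]     + branch[i];               /* + path */
            int m1 = old_m[2*i + 1] - branch[i];               /* - path */
            new_m[i] = (m0 >= m1) ? m0 : m1;                   /* max    */
            sel[i]   = (m0 >= m1) ? (i << 1) : ((i << 1) | 1); /* sel    */
        }
    }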

Application accessing non-consecutive data – Audio downmix (before)

[Figure: four interleaved input channels are read with stride 4, each sample is halved (>> 1), pairs of channels are added (+), and the resulting stereo stream is written with stride 2.]

Application accessing non-consecutive data – Audio downmix (after)

[Figure: the same downmix, shown after vectorization of the stride-4 loads and stride-2 stores.]
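A minimal C sketch of the loop shape these slides depict, assuming a 4-channel interleaved input downmixed to interleaved stereo (the channel pairing is our assumption):

    /* 4-channel interleaved input (stride 4) downmixed to
     * interleaved stereo output (stride 2). */
    void downmix(const short *in, short *out, int n)
    {
        for (int i = 0; i < n; i++) {
            out[2*i]     = (short)((in[4*i]     >> 1) + (in[4*i + 2] >> 1)); /* left  */
            out[2*i + 1] = (short)((in[4*i + 1] >> 1) + (in[4*i + 3] >> 1)); /* right */
        }
    }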

Basic unpacking and packing operations for strided access

- Use two pairs of inverse operations widely supported on SIMD platforms:
  - extract_even, extract_odd
  - interleave_high, interleave_low
- Use them recursively to support strided accesses with power-of-2 strides
- Support several data types
- This is in contrast to relying on the most general "permute" operation
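A sketch of these primitives in C using GCC's vector extensions (__builtin_shuffle is GCC's generic shuffle builtin; the helper names are ours, not the paper's). Gathering one stride-4 stream out of 16 consecutive elements takes log2(4) = 2 rounds of extract_even, illustrating the recursive use for power-of-2 strides:

    typedef int v4si __attribute__((vector_size(16)));

    /* Even elements of a, then even elements of b.
     * Shuffle indices 0..3 select from a, 4..7 from b. */
    static v4si extract_even(v4si a, v4si b) {
        return __builtin_shuffle(a, b, (v4si){0, 2, 4, 6});
    }

    static v4si extract_odd(v4si a, v4si b) {
        return __builtin_shuffle(a, b, (v4si){1, 3, 5, 7});
    }

    /* Gather the stride-4 stream x[0], x[4], x[8], x[12] from four
     * consecutive vector loads v0..v3 covering x[0..15]. */
    static v4si gather_stride4(v4si v0, v4si v1, v4si v2, v4si v3) {
        v4si e01 = extract_even(v0, v1);   /* x[0], x[2],  x[4],  x[6]  */
        v4si e23 = extract_even(v2, v3);   /* x[8], x[10], x[12], x[14] */
        return extract_even(e01, e23);     /* x[0], x[4],  x[8],  x[12] */
    }

interleave_high/interleave_low are the inverse pair, used on the store side (see step 3/3 below).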

Classic loop-based auto-vectorization

    vect_analyze_loop (loop)
    {
      if (!1_analyze_counted_single_bb_loop (loop))    FAIL
      if (!2_determine_VF (loop))                      FAIL
      if (!3_analyze_memory_access_patterns (loop))    FAIL
      if (!4_analyze_scalar_dependence_cycles (loop))  FAIL
      if (!5_analyze_data_dependence_distances (loop)) FAIL
      if (!6_analyze_consecutive_data_accesses (loop)) FAIL
      if (!7_analyze_data_alignment (loop))            FAIL
      if (!8_analyze_vops_exist_forall_ops (loop))     FAIL
      SUCCEED
    }

    vect_transform_loop (loop)
    {
      FOR_ALL_STMTS_IN_LOOP (loop, stmt)
        replace_OP_by_VOP (stmt);
      decrease_loop_bound_by_factor_VF (loop);
    }

Speaker notes: Transform statement by statement, top-down. When transforming a statement, code may be added in the prolog/epilog (e.g. for reductions), or for handling misalignment.
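For intuition, an illustrative pair of loops (ours): the first passes all eight checks; the second fails at 6_analyze_consecutive_data_accesses because of its stride-2 load – exactly the restriction the following slides lift.

    void accepted(int *a, const int *b, int n) {   /* unit stride: vectorized */
        for (int i = 0; i < n; i++)
            a[i] = b[i] + 1;
    }

    void rejected(int *a, const int *b, int n) {   /* stride 2: step 6 FAILs */
        for (int i = 0; i < n; i++)
            a[i] = b[2*i] + 1;
    }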

Vectorizing non-unit-stride accesses

- One VOP accessing data with stride d requires loading d·VF elements
- Several otherwise-unrelated VOPs can share these loaded elements:
  - if they all share the same stride d
  - if they all start close to each other
  - up to d VOPs; if there are fewer, there are "gaps"
- Recognize this spatial-reuse potential to eliminate redundant load and extract operations
- Better to make the decision earlier rather than later: without such elimination,
  - vectorizing the loop may not be beneficial (for loads)
  - vectorizing the loop may be prohibited (for stores)

Speaker notes: If the stride is greater than VF, one could load only VF·VF elements, but we always load consecutive intervals to facilitate reuse. Having the same data type throughout can simplify things.
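An illustrative stride-2 case (ours): the two statements below are otherwise unrelated, but they share the same d = 2 loads, so one pair of vector loads plus one extract_even/extract_odd pair feeds both – nothing is loaded twice and there are no gaps.

    /* De-interleave a complex stream: both statements read the same
     * stride-2 data, so their vector loads can be shared. */
    void split(const float *c, float *re, float *im, int n) {
        for (int i = 0; i < n; i++) {
            re[i] = c[2*i];        /* even elements of the shared loads */
            im[i] = c[2*i + 1];    /* odd elements of the same loads    */
        }
    }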

Augmenting the vectorizer: step 1/3 – build spatial groups

5_analyze_data_dependence_distances already traversed all pairs of loads/stores to analyze their dependence distance:

    if (cross_iteration_dependence_distance <= (VF-1)*stride)
      if (read,write) or (write,read) or (write,write)
        ok = dep_resolve();
    endif

Augment this traversal to look for spatial reuse between pairs of independent loads and stores, building spatial groups:

    if ok and (intra_iteration_address_distance < stride*u)
      if (read,read) or (write,write)
        ok = analyze_and_build_spatial_groups();

Augmenting the vectorizer: step 2/3 – check spatial groups

6_analyze_consecutive_data_accesses already traversed each individual load/store to analyze its access pattern. Augment this traversal by:

- allowing non-consecutive accesses
- building singleton groups for strided, ungrouped loads/stores
- checking for gaps and for the profitability of spatial groups

Augmenting the vectorizer: step 3/3 – transformation

vect_transform_stmt generates vector code per scalar OP. Augment this by considering:

- If OP is a load/store in the first position of a spatial group:
  - generate d loads/stores
  - handle their alignment according to the starting address
  - generate d·log d extracts/interleaves
- If OP belongs to a spatial group, connect it to the appropriate extract/interleave according to its position
- Unused extracts/interleaves are discarded by subsequent dead-code elimination (DCE)
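The store-side counterpart of the earlier extract sketch (again GCC vector extensions; helper names ours): a stride-2 store group with VF = 4 needs d·log2 d = 2 interleaves plus d = 2 vector stores.

    #include <string.h>

    typedef int v4si __attribute__((vector_size(16)));

    /* interleave_low:  a0 b0 a1 b1;  interleave_high: a2 b2 a3 b3 */
    static v4si interleave_low(v4si a, v4si b) {
        return __builtin_shuffle(a, b, (v4si){0, 4, 1, 5});
    }
    static v4si interleave_high(v4si a, v4si b) {
        return __builtin_shuffle(a, b, (v4si){2, 6, 3, 7});
    }

    /* Store re[0..3], im[0..3] as the interleaved stream
     * re0 im0 re1 im1 ... into out[0..7] with two vector stores. */
    static void store_stride2(int *out, v4si re, v4si im) {
        v4si lo = interleave_low(re, im);
        v4si hi = interleave_high(re, im);
        memcpy(out,     &lo, sizeof lo);
        memcpy(out + 4, &hi, sizeof hi);
    }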

Performance – qualitative: VF/(1 + log d)

Vectorized code has d loads/stores and d·log2 d extracts/interleaves, while scalar code has d·VF loads/stores. The improvement factor in the number of load/store/extract/interleave operations is therefore VF/(1 + log2 d):

    d \ VF     4      8      16
      2        2      4      8
      4        1.3    2.6    5.3
     16        0.8    1.6    3.2
     32        0.6    1.2    2.4
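As a worked instance of the formula: for VF = 16 and d = 4, the scalar loop issues d·VF = 64 loads/stores, while the vector loop issues d = 4 vector loads/stores plus d·log2 d = 8 extracts, giving

    d·VF / (d + d·log2 d) = VF / (1 + log2 d) = 16 / (1 + 2) ≈ 5.3,

matching the table entry.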

Performance – empirically (on PowerPC 970 with AltiVec)

- A stride of 2 always provides speedups
- Strides of 8 and 16 suffer from increased code size, which turns off loop unrolling
- A stride of 32 suffers from high register pressure (d + 1 live vector registers)
- If non-permute operations exist, there are speedups for all strides when VF ≥ 8

Performance – stride of 8 with gaps

- The positions of the gaps affect the number of extracts (interleaves) needed
- Improvement is observed even for a single strided access (VF = 16, with arithmetic operations)

Performance – kernels

- Four groups: VF = 4, 8, 16, and 16-with-gaps
- Each kernel's name is prefixed by its stride
- A slowdown appears when doing only memory operations at VF = 4, d = 8

Future direction – towards loop-aware SLP

- When building spatial groups, we consider distinct operations accessing adjacent/close addresses; this is the first step of building SLP chains
- SLP looks for VF fully interleaved accesses, without gaps, and may require earlier loop unrolling
- The next step is to consider the operations that use a spatial group of loads: if they are isomorphic, try to postpone the extracts (see the sketch below)
- This is analogous to handling alignment using the zero-shift, lazy-shift, and eager-shift policies
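A tiny illustration (ours) of why postponing extracts pays off when the users of a spatial group are isomorphic: both statements below perform the same operation, so a compiler could add the still-interleaved vectors directly and never unpack them.

    /* Both even and odd elements get the same "+ 1": extract_even /
     * extract_odd followed by re-interleaving would cancel out, so a
     * vector add on the raw interleaved loads suffices. */
    void scale2(const int *x, int *a, int n) {
        for (int i = 0; i < n; i++) {
            a[2*i]     = x[2*i]     + 1;   /* isomorphic with ... */
            a[2*i + 1] = x[2*i + 1] + 1;   /* ... this statement  */
        }
    }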

Conclusions

- Existing SIMD targets supporting SIMpD can provide improved performance for important power-of-2-strided applications – don't be afraid of d > 2
- Existing compiler loop-based auto-vectorization can be augmented efficiently to handle such strided accesses
- This can serve as a first step combining traditional loop-based vectorization with (if-converted) basic-block vectorization ("SLP")
- This area of work is fertile; consider the details (d, gaps, positions, VF, non-memory ops) for it not to be futile!

Questions?