NCCS Brown Bag Series. Vectorization: Efficient SIMD parallelism on NCCS systems. Craig Pelissier* and Kareem Sorathia

Presentation transcript:

NCCS Brown Bag Series

Vectorization: Efficient SIMD parallelism on NCCS systems. Craig Pelissier* and Kareem Sorathia, SSSO code. January 28, 2014

Motivation

Compilers guarantee that the results of a program are the same as if the program were executed sequentially. In reality, programs are generally not executed strictly sequentially: compilers try to exploit the parallelism inherent in them, and programmers sometimes need to help the compiler out. Programs need to be written in a way that exposes the parallel mechanisms available on modern CPUs. Exploiting this parallelism increases throughput and makes more efficient use of modern CPUs.

Vector instructions (SIMD)

Single Instruction, Multiple Data (SIMD) is a commonly occurring situation. It assumes: (1) the operations are independent (no dependencies between iterations); (2) the operations are identical (the same instruction each time); (3) there are multiple data elements (many iterations). Q: How do we exploit this parallelism? A: Modern CPUs provide vector registers.
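The slide's code examples are not in the transcript; as an illustrative sketch (the function name `saxpy` is our own, not from the slides), this is the textbook SIMD situation, written here in C:

```c
#include <stddef.h>

/* A textbook SIMD candidate: the same operation (multiply-add) is applied
 * independently to every element, so the compiler can map groups of
 * iterations onto vector registers. */
void saxpy(float a, const float *x, const float *y, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a * x[i] + y[i];  /* independent, identical, many data */
}
```

All three assumptions hold: no iteration reads another's result, every iteration runs the same instructions, and there are `n` data elements to process.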

Vector processors

Intel CPUs offer vector registers: in essence, large registers ("register banks") that hold several data elements at once. The vector units are fully pipelined, executing with 4-cycle latency and single-cycle throughput.

NCCS systems overview

SSE = Streaming SIMD Extensions (128-bit registers, 1999). AVX = Advanced Vector eXtensions (256-bit registers, 2011).

Guidelines for vectorizable code by example

Use simple for loops. The loop trip count must be known when the loop is entered and must not change inside the loop. No conditional exits from the loop.
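The slide's ✗ examples are not in the transcript; as a sketch (function names are our own), the contrast looks like this:

```c
#include <stddef.h>

/* Vectorizable: a simple counted loop whose trip count is fixed at entry. */
float sum_all(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* NOT vectorizable: the early return makes the trip count data-dependent,
 * so the loop cannot be chopped into fixed-width vector chunks. */
int find_first_negative(const float *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0.0f)
            return (int)i;  /* conditional exit */
    return -1;
}
```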

Guidelines for vectorizable code by example

Loops with dependencies can't vectorize. Dependencies can appear in a single line, and in larger loops they can be more subtle.
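The slide's ✗/✓ code is missing from the transcript; a sketch of a single-line loop-carried dependency versus a dependency-free loop (names are our own):

```c
#include <stddef.h>

/* Loop-carried dependency in a single line: iteration i reads the value
 * written by iteration i-1, so iterations cannot run side by side. */
void prefix_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + a[i];  /* read-after-write across iterations */
}

/* No dependency: each iteration touches only its own element, so the
 * loop vectorizes. */
void add_arrays(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```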

Guidelines for vectorizable code by example

No control logic inside the loop unless it can be masked. Simple binary logic is allowed because it can be masked; more complicated branching is not.
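As a sketch of maskable logic (the function name is our own, not from the slides): a two-way select can be vectorized because the compiler evaluates both arms for a whole vector and blends the results under a mask.

```c
#include <stddef.h>

/* Simple binary logic: a two-way select per element. The compiler can
 * compute the condition for a whole vector at once and use the resulting
 * mask to blend the two outcomes, so this loop still vectorizes. */
void clamp_negatives(float *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (a[i] < 0.0f) ? 0.0f : a[i];  /* maskable */
}
```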

Guidelines for vectorizable code by example

No function calls inside the loop, with some exceptions: intrinsic math functions (e.g. trigonometric, exponential, logarithmic, and round/floor/ceil) are allowed, and functions that can be inlined may vectorize.

Directing the compiler
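The slide's directive examples did not survive in the transcript. As an illustrative sketch (function name and the GCC spelling of the pragma are our own choices), this is how an "ignore vector dependencies" hint is used; in Intel Fortran the equivalent is `!DEC$ IVDEP` on the line before the loop:

```c
#include <stddef.h>

/* The directive is a promise to the compiler: the programmer asserts that
 * a and b do not alias and that offset creates no backward dependency,
 * which the compiler cannot prove on its own. The loop is then free to
 * vectorize. (Intel C uses `#pragma ivdep`; GCC spells it as below.) */
void shift_add(float *a, const float *b, size_t n, size_t offset)
{
#pragma GCC ivdep
    for (size_t i = 0; i < n; i++)
        a[i + offset] = a[i + offset] + b[i];
}
```

The directive changes only what the compiler assumes, not what the code computes; if the promise is false, the vectorized loop silently produces wrong answers.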

Some more exotic directives

Memory layout

Choose the data layout carefully. A Structure of Arrays (SoA) is more efficient for SIMD parallelization; an Array of Structures (AoS) is better for data encapsulation. Use efficient memory access: favor unit stride, minimize indirect addressing (pointers to pointers), and align your data (next slide).
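A minimal sketch of the two layouts (type and function names are our own): in the AoS version the `x` fields are three floats apart in memory, while in the SoA version they are contiguous, giving the unit-stride loads that vector hardware wants.

```c
#include <stddef.h>

/* Array of Structures: good encapsulation, but x values are strided. */
typedef struct { float x, y, z; } PointAoS;

/* Structure of Arrays: each field is contiguous, unit-stride access. */
typedef struct { float *x, *y, *z; } PointsSoA;

float sum_x_soa(const PointsSoA *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];   /* contiguous: unit stride */
    return s;
}

float sum_x_aos(const PointAoS *p, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;    /* stride of 3 floats */
    return s;
}
```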

Memory layout

Memory is aligned on physical n-byte boundaries: 16 bytes for SSE (Westmere), 32 bytes for AVX (Sandy Bridge), and 64 bytes on the Xeon Phi. Loads and stores run faster on aligned data, and the largest penalty is paid on the Xeon Phi architecture. Directives: !DEC$ VECTOR {ALIGNED | UNALIGNED}. Caution: if data is not aligned when accessed with the aligned directive, incorrect results or exceptions (segmentation faults) can occur. In Fortran, -align array[n]byte requests alignment of all arrays. A peel loop is used to achieve alignment, so that the main loop has aligned accesses and a trip count that is a multiple of the vector length. With a large main loop, the alignment, peel, and remainder costs are amortized; with a small main loop, the alignment and remainder loops become significant. The first element accessed in the vectorized loop MUST be aligned if the aligned directive is used.
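In C the same effect can be had at allocation time; a sketch (function name is our own) using C11 `aligned_alloc`, which requires the size to be a multiple of the alignment:

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate a float buffer on a 32-byte (AVX) boundary. This is the C
 * analogue of compiling Fortran with -align array32byte. */
float *alloc_avx_floats(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) & ~(size_t)31;  /* round up to 32 */
    return aligned_alloc(32, rounded);
}
```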

Alignment examples

Not all iterations in "j" are aligned! By adding padding we can achieve alignment, at the cost of extra computation and a larger memory footprint. Is it worth it?
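The slide's code is not in the transcript; a sketch of the padding idea (names and the AVX vector length are our own assumptions): pad each row of a 2D array up to a multiple of the vector length, so that if row 0 is aligned, every row j starts aligned too.

```c
#include <stddef.h>

#define VLEN 8  /* floats per 32-byte AVX register (assumed here) */

/* Row stride after padding: the smallest multiple of VLEN that holds
 * ncols elements. The extra (stride - ncols) elements are wasted space
 * and, if processed, wasted computation -- the cost of alignment. */
size_t padded_stride(size_t ncols)
{
    return (ncols + VLEN - 1) / VLEN * VLEN;
}
```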

Alignment examples

Choosing domains that are a multiple of the alignment boundary can sometimes help with alignment, e.g. in finite differencing.

Vectorization analysis

Q: How do I know if loops are vectorized? A: Use -vec-report[0-6].

Q: How do I prioritize my vectorization efforts? A: A simple option is to use the loop profiler provided by Intel to find the most time-consuming loops:
1. Compile with the proper flags: -profile-functions -profile-loops=all -profile-loops-report=2.
2. Running the executable will create an XML file.
3. Run loopprofileviewer.sh to start a GUI.
4. Open the XML file with the GUI.

Q: How do I know how much vectorization is improving things? A: Compiling with the flag -no-vec will turn off all vectorization. You can also use more sophisticated software such as VTune Amplifier (see the earlier brown bag).

Loop profiler output

Regression testing

A vectorized loop may give slightly different results than the original serial one, and variations in memory alignment from one run to the next may cause answers to drift between runs. Using "-fp-model precise" restores consistency but may disable vectorization; in particular, loops containing transcendental functions will then not be vectorized. Warning: reproducibility and optimization are sometimes at odds. Vectorization of loops will alter bitwise reproducibility.
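The root cause is that floating-point addition is not associative, and a vectorized reduction sums in chunks rather than strictly left to right. A minimal demonstration (function names are our own):

```c
/* Floating-point addition is not associative: regrouping the same three
 * terms yields a different last bit. A vectorized reduction regroups the
 * sum, so it can legitimately differ from the serial result. */
double left_to_right(void) { return (0.1 + 0.2) + 0.3; }
double regrouped(void)     { return 0.1 + (0.2 + 0.3); }
```

Both are valid IEEE-754 evaluations of 0.1 + 0.2 + 0.3, yet they produce different doubles, which is exactly the kind of drift a regression test must tolerate (or suppress with -fp-model precise).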

Thank You! Questions?